[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3821:
--------------------------------

    Description: 
The general bug is a case where a phrase with no slop is found,
but if you add slop it's not.

I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
case; Jenkins just hasn't had enough time to chew on it.

Running ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough 
to make it fail on trunk or 3.x.


  was:
In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped 
working in dismax:
  "The Beatles as musicians : Revolver through the Anthology"
  "Color-blindness [print/digital]; its dangers and its detection"
Both of these queries contain a repeated word and have many terms.  It's not the 
number of terms or the colon surrounded by spaces, because the following phrase 
search works in Solr 3.5 (and Solr 1.4):
    "International encyclopedia of revolution and protest : 1500 to the present"

With Robert Muir's help, we have narrowed the problem down to slop  (proximity 
in lucene QueryParser, query slop in dismax).   I have included debugQuery 
details for  the Beatles search;  I confirmed the same behavior with the 
color-blindness search.


Solr 3.5:   it fails when (query) slop setting isn't 0.
----
lucene QueryParser with proximity set to 1 (or anything > 0) :  no match
  URL: q=all_search:"The Beatles as musicians : Revolver through the 
Anthology"~1
  final query:  all_search:"the beatl as musician revolv through the antholog"~1

lucene QueryParser with proximity set to 0:    result!
  URL:   q=all_search:"The Beatles as musicians : Revolver through the 
Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"

  6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through 
the antholog" in 1064395), product of:
     <snip>
      48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 
musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
     <snip>

dismax QueryParser with qs=1:  no match
      ps=0
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=1&ps=0
  final query:   +(all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the 
antholog")~0.01
      ps=1
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=1&ps=1
  final query:   +(all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01

dismax QueryParser with qs=0:    result!
     ps=0
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=0&ps=0
  final query:  +(all_search:"the beatl as musician revolv through the 
antholog")~0.01 (all_search:"the beatl as musician revolv through the 
antholog")~0.01
      ps=1
  URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=0&ps=1
  final query:  +(all_search:"the beatl as musician revolv through the 
antholog")~0.01 (all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01

  8.564867 = (MATCH) sum of:
    4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through 
the antholog" in 1064395), product of:
        <snip>
        48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 
musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
        <snip>


Solr 1.4:    it works regardless of slop settings
----
lucene QueryParser with any proximity value:    result!
      ~0
  URL:   q=all_search:"The Beatles as musicians : Revolver through the 
Anthology"
  final query:  all_search:"the beatl as musician revolv through the antholog"
      ~1
  URL: q=all_search:"The Beatles as musicians : Revolver through the 
Anthology"~1
  final query:  all_search:"the beatl as musician revolv through the antholog"~1

  5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the 
antholog" in 3469163), product of:
     <snip>
    48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 
revolv=822 through=88522 the=3549637 antholog=11246)
     <snip>

dismax QueryParser with any qs:    result!
      qs=0, ps=0
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=0&ps=0
   final query: +(all_search:"the beatl as musician revolv through the 
antholog")~0.01 (all_search:"the beatl as musician revolv through the 
antholog")~0.01
      qs=0, ps=1
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=0&ps=1
   final query: +(all_search:"the beatl as musician revolv through the 
antholog")~0.01 (all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01
      qs=1, ps=0
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=1&ps=0
   final query: +(all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the 
antholog")~0.01
      qs=1, ps=1
   URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver 
through the Anthology"&qs=1&ps=1
   final query: +(all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the 
antholog"~1)~0.01

  7.4490223 = (MATCH) sum of:
  3.7245111 = weight(all_search:"the beatl as musician revolv through the 
antholog"~1 in 3469163), product of:
        <snip>
      48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 
musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
        <snip>
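
The four qs/ps combinations compared in the dismax runs above can be generated 
with a short script. This is only a sketch: the base URL is a hypothetical 
local Solr endpoint, the function name dismax_urls is made up for 
illustration, and the request handler is assumed to be configured for dismax 
already. Only the parameters shown in the URLs above (qf, pf, q, qs, ps) are 
used.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the real host, port, and handler path.
BASE_URL = "http://localhost:8983/solr/select"

# The phrase query from the report, including its surrounding double quotes.
PHRASE = '"The Beatles as musicians : Revolver through the Anthology"'


def dismax_urls(base_url=BASE_URL, phrase=PHRASE, field="all_search"):
    """Return a dict mapping (qs, ps) to the corresponding request URL."""
    urls = {}
    for qs, ps in product((0, 1), repeat=2):
        # Parameter order follows the URLs quoted in the report.
        params = {"qf": field, "pf": field, "q": phrase, "qs": qs, "ps": ps}
        urls[(qs, ps)] = base_url + "?" + urlencode(params)
    return urls


if __name__ == "__main__":
    for (qs, ps), url in sorted(dismax_urls().items()):
        print("qs=%d ps=%d  %s" % (qs, ps, url))
```

Running the same four URLs against a Solr 1.4 and a Solr 3.5 instance should 
make the difference visible: per the results above, only the qs=1 requests 
lose the document in 3.5.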


More information:

schema.xml:
  <field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.ICUFoldingFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldtype>

solr 1.4:
    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="schema.UnicodeNormalizationFilterFactory"
          version="icu4j" composed="false" remove_diacritics="true"
          remove_modifiers="true" fold="true" />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldtype>


The analysis page shows the same results for Solr 3.5 and 1.4:


Solr 3.5:

position     1      2      3      4         5       6        7      8
term text    the    beatl  as     musician  revolv  through  the    antholog
keyword      false  false  false  false     false   false    false  false
startOffset  0      4      12     15        27      36       44     48
endOffset    3      11     14     24        35      43       47     57
type         word   word   word   word      word    word     word   word

Solr 1.4:

term position     1      2      3      4         5       6        7      8
term text         the    beatl  as     musician  revolv  through  the    antholog
term type         word   word   word   word      word    word     word   word
source start,end  0,3    4,11   12,14  15,24     27,35   36,43    44,47  48,57


For debug purposes, we can consider the Solr document as:

<doc>
  <str name="all_search">The Beatles as musicians : Revolver through the 
Anthology</str>
</doc>
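
As a rough illustration of the expected behaviour (this is not Lucene's 
SloppyPhraseScorer), the Python sketch below brute-forces phrase matching over 
the analyzed tokens of this document. It assumes a simplified slop model: each 
query term is assigned a distinct document position holding that term, and the 
phrase matches when the spread of (document position - query position) across 
all terms is at most the slop. The name phrase_matches is hypothetical. Under 
this model an exact match always implies a sloppy match, which is the property 
the bug violates.

```python
from itertools import product


def phrase_matches(doc_tokens, query_tokens, slop=0):
    """Brute-force phrase match under a simplified slop model (an assumption)."""
    # Map each term to the list of positions where it occurs in the document.
    positions = {}
    for pos, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(pos)

    # Candidate document positions for each query term, in query order.
    candidates = [positions.get(tok, []) for tok in query_tokens]
    if any(not c for c in candidates):
        return False  # some query term never occurs in the document

    for combo in product(*candidates):
        # Repeated query terms (here: "the") must use distinct positions.
        if len(set(combo)) != len(combo):
            continue
        # Shift of each term relative to its place in the query phrase.
        shifts = [doc_pos - q_pos for q_pos, doc_pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


# Analyzed tokens of the document and of the query, as shown by the
# analysis page (positions are consecutive).
DOC_TOKENS = "the beatl as musician revolv through the antholog".split()
QUERY_TOKENS = list(DOC_TOKENS)

exact = phrase_matches(DOC_TOKENS, QUERY_TOKENS, slop=0)
sloppy = phrase_matches(DOC_TOKENS, QUERY_TOKENS, slop=1)
```

In this sketch both exact and sloppy come out true, which is the behaviour 
seen in Solr 1.4; in Solr 3.5 the slop=1 query fails to match the same 
document.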

I can't attach the full SolrDoc because all_search is indexed but not stored, 
and I use SolrJ to write to the index from Java objects ... plus our objects 
have a zillion fields (I work in a library with very rich metadata and very 
exacting Solr fields).  I have attached the Solr 3.5 schema and solrconfig, but 
they are big and ugly for the same reasons.

For more details, see the erroneously titled email thread "result present in 
Solr 1.4 but missing in Solr 3.5, dismax only"  started on 2012-02-22 on 
solr-u...@lucene.apache.org.

- Naomi







        Summary: SloppyPhraseScorer sometimes misses documents that 
ExactPhraseScorer finds.  (was: search slop problem introduced somewhere 
between Solr 1.4 and Solr 3.5)

Moving to comments...
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case; Jenkins just hasn't had enough time to chew on it.
> Running ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough 
> to make it fail on trunk or 3.x.


        
