Digging into the code, I see this:

[code]
public SpanWeight(SpanQuery query, IndexSearcher searcher)
    throws IOException {
    this.similarity = searcher.getSimilarity();
    this.query = query;

    termContexts = new HashMap<>();
    TreeSet<Term> terms = new TreeSet<>();
    query.extractTerms(terms);
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermStatistics termStats[] = new TermStatistics[terms.size()];
    int i = 0;
    for (Term term : terms) {
      TermContext state = TermContext.build(context, term);
      termStats[i] = searcher.termStatistics(term, state);
      termContexts.put(term, state);
      i++;
    }
    final String field = query.getField();
    if (field != null) {
      stats = similarity.computeWeight(query.getBoost(),
                                       searcher.collectionStatistics(query.getField()),
                                       termStats);
    }
  }
[/code]

So as the query we get the structure above, from which all terms are extracted
without preserving the query structure? Could someone shed light on the logic
behind this weight calculation?
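To illustrate what I mean by "without preserving the query structure", here is a toy model (these are NOT the real Lucene classes, just minimal stand-ins I wrote for illustration) of how the recursive extractTerms call flattens a nested spanNear into a plain sorted set of terms, discarding slop, order and nesting:

```java
import java.util.List;
import java.util.TreeSet;

// Toy stand-in for SpanQuery: only supports term extraction.
interface SpanQuery {
    void extractTerms(TreeSet<String> terms);
}

// Leaf query: contributes its single term.
class SpanTermQuery implements SpanQuery {
    final String term;
    SpanTermQuery(String term) { this.term = term; }
    public void extractTerms(TreeSet<String> terms) { terms.add(term); }
}

// Composite query: recursion visits the clauses, but slop, order and
// nesting never make it into the extracted term set.
class SpanNearQuery implements SpanQuery {
    final List<SpanQuery> clauses;
    final int slop;
    final boolean inOrder;
    SpanNearQuery(List<SpanQuery> clauses, int slop, boolean inOrder) {
        this.clauses = clauses; this.slop = slop; this.inOrder = inOrder;
    }
    public void extractTerms(TreeSet<String> terms) {
        for (SpanQuery c : clauses) c.extractTerms(terms);
    }
}

public class ExtractTermsDemo {
    // Mirrors what the SpanWeight constructor does with its TreeSet.
    public static TreeSet<String> flatten(SpanQuery q) {
        TreeSet<String> terms = new TreeSet<>();
        q.extractTerms(terms);
        return terms;
    }

    public static void main(String[] args) {
        // spanNear([the, spanNear([e, commerce], 0, true)], 300, false)
        SpanQuery inner = new SpanNearQuery(
            List.of(new SpanTermQuery("e"), new SpanTermQuery("commerce")), 0, true);
        SpanQuery outer = new SpanNearQuery(
            List.of(new SpanTermQuery("the"), inner), 300, false);
        System.out.println(flatten(outer)); // [commerce, e, the]
    }
}
```

So the term statistics fed into computeWeight come from the flat set {commerce, e, the}; as far as I can tell, the proximity constraints only matter later, during span matching, not in this weight computation.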

On Mon, Jun 15, 2015 at 10:23 AM, Dmitry Kan <solrexp...@gmail.com> wrote:

> To clarify additionally: we use StandardTokenizer & StandardFilter in
> front of the WDF. Already after ST's transformations, e-tail gets split
> into two consecutive tokens.
>
> On Mon, Jun 15, 2015 at 10:08 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
>
>> Thanks, Erick. The analysis page shows the positions are increasing =>
>> there are no "glued" words at the same position.
>>
>> On Sun, Jun 14, 2015 at 6:10 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> My guess is that you have WordDelimiterFilterFactory in your
>>> analysis chain with parameters that break up E-Tail into both "e" and
>>> "tail" _and_ put them in the same position. This assumes that the
>>> result fragment you pasted is incomplete and "commerce" is in it:
>>>
>>> From <em>E</em>-Tail <em>commerce</em>
>>>
>>> or some such. Try the admin/analysis screen with the "verbose" box
>>> checked; the position of each token after analysis will show whether
>>> my guess is accurate.
>>>
>>> Best,
>>> Erick
>>>
>>> On Sun, Jun 14, 2015 at 4:34 AM, Dmitry Kan <solrexp...@gmail.com>
>>> wrote:
>>> > Hi guys,
>>> >
>>> > We observe a strange bug in Solr 4.10.2, whereby a sloppy query
>>> > hits words it should not:
>>> >
>>> > <lst name="debug"><str name="rawquerystring">the "e commerce"</str><str
>>> > name="querystring">the "e commerce"</str><str
>>> > name="parsedquery">SpanNearQuery(spanNear([Contents:the,
>>> > spanNear([Contents:eä, Contents:commerceä], 0, true)], 300,
>>> > false))</str><str name="parsedquery_toString">spanNear([Contents:the,
>>> > spanNear([Contents:eä, Contents:commerceä], 0, true)], 300,
>>> > false)</str>
>>> >
>>> >
>>> > This query produces words as hits, like:
>>> >
>>> > From <em>E</em>-Tail
>>> >
>>> > In the inner spanNear query, we expect that e and commerce will occur
>>> > within 0 slop, in that order.
>>> >
>>> > Can somebody shed light into what is going on?
>>> >
>>> > --
>>> > Dmitry Kan
>>> > Luke Toolbox: http://github.com/DmitryKey/luke
>>> > Blog: http://dmitrykan.blogspot.com
>>> > Twitter: http://twitter.com/dmitrykan
>>> > SemanticAnalyzer: www.semanticanalyzer.info
>>>
>>
>>
>>
>
>


