Re: bug in search with sloppy queries

2015-06-15 Thread Dmitry Kan
Digging into the code, I see this:

[code]
public SpanWeight(SpanQuery query, IndexSearcher searcher)
    throws IOException {
  this.similarity = searcher.getSimilarity();
  this.query = query;

  termContexts = new HashMap<>();
  // All terms of the (possibly nested) span query end up in one flat, sorted set.
  TreeSet<Term> terms = new TreeSet<>();
  query.extractTerms(terms);
  final IndexReaderContext context = searcher.getTopReaderContext();
  final TermStatistics termStats[] = new TermStatistics[terms.size()];
  int i = 0;
  for (Term term : terms) {
    TermContext state = TermContext.build(context, term);
    termStats[i] = searcher.termStatistics(term, state);
    termContexts.put(term, state);
    i++;
  }
  final String field = query.getField();
  if (field != null) {
    stats = similarity.computeWeight(query.getBoost(),
                                     searcher.collectionStatistics(query.getField()),
                                     termStats);
  }
}
[/code]

The query passed in is the nested spanNear structure from my original
message, and all of its terms are extracted into one flat set without
keeping the query structure? Could someone shed light on the logic
behind this weight calculation?
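
To make the question concrete, here is a minimal sketch of what "extracting terms without keeping the structure" means. It is a toy model, not Lucene code: the class FlattenTermsSketch, its Node type and the near()/term() helpers are hypothetical names invented for illustration. As far as the quoted constructor shows, the flat term set only feeds the statistics handed to the Similarity; the slop and order constraints presumably stay on the query object itself and are applied elsewhere, when positions are matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a nested span query. These are NOT Lucene's classes. */
final class FlattenTermsSketch {

    /** A node is either a single term (leaf) or a "near" over child nodes. */
    static final class Node {
        final String term;        // non-null only for leaves
        final List<Node> clauses; // children of a near node, empty for leaves
        final int slop;
        final boolean inOrder;

        private Node(String term, List<Node> clauses, int slop, boolean inOrder) {
            this.term = term;
            this.clauses = clauses;
            this.slop = slop;
            this.inOrder = inOrder;
        }

        static Node term(String t) {
            return new Node(t, new ArrayList<>(), 0, true);
        }

        static Node near(int slop, boolean inOrder, Node... clauses) {
            return new Node(null, Arrays.asList(clauses), slop, inOrder);
        }

        /** Collects every leaf term into one sorted set; slop and order are ignored. */
        void extractTerms(TreeSet<String> out) {
            if (term != null) {
                out.add(term);
                return;
            }
            for (Node clause : clauses) {
                clause.extractTerms(out);
            }
        }
    }

    public static void main(String[] args) {
        // Same shape as the parsed query: spanNear([the, spanNear([e, commerce], 0, true)], 300, false)
        Node query = Node.near(300, false,
                Node.term("the"),
                Node.near(0, true, Node.term("e"), Node.term("commerce")));

        TreeSet<String> terms = new TreeSet<>();
        query.extractTerms(terms);

        System.out.println(terms);                     // prints [commerce, e, the]
        System.out.println(query.clauses.get(1).slop); // prints 0: the tree still carries the slop
    }
}
```

In this model the flattening loses nothing that matching needs, because the tree is still available afterwards; whether the unexpected hits come from the matching side instead is exactly what remains to be checked.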

On Mon, Jun 15, 2015 at 10:23 AM, Dmitry Kan  wrote:

> To clarify additionally: we use StandardTokenizer & StandardFilter in
> front of the WDF. After StandardTokenizer's transformations alone, e-tail
> is already split into two consecutive tokens.


-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: bug in search with sloppy queries

2015-06-15 Thread Dmitry Kan
To clarify additionally: we use StandardTokenizer & StandardFilter in front
of the WDF. After StandardTokenizer's transformations alone, e-tail is
already split into two consecutive tokens.

On Mon, Jun 15, 2015 at 10:08 AM, Dmitry Kan  wrote:

> Thanks, Erick. The analysis page shows that the positions keep increasing,
> so no words are "glued" onto the same position.


Re: bug in search with sloppy queries

2015-06-15 Thread Dmitry Kan
Thanks, Erick. The analysis page shows that the positions keep increasing,
so no words are "glued" onto the same position.

On Sun, Jun 14, 2015 at 6:10 PM, Erick Erickson 
wrote:

> My guess is that you have WordDelimiterFilterFactory in your analysis
> chain with parameters that break up E-Tail into both "e" and "tail" _and_
> put them in the same position. This assumes that the result fragment
> you pasted is incomplete and "commerce" is in it:
>
> From E-Tail commerce
>
> or some such. Try the admin/analysis screen with the "verbose" box checked
> and you'll see the position of each token after analysis to see if my guess
> is accurate.
>
> Best,
> Erick


Re: bug in search with sloppy queries

2015-06-14 Thread Erick Erickson
My guess is that you have WordDelimiterFilterFactory in your analysis
chain with parameters that break up E-Tail into both "e" and "tail" _and_
put them in the same position. This assumes that the result fragment
you pasted is incomplete and "commerce" is in it:

From E-Tail commerce

or some such. Try the admin/analysis screen with the "verbose" box checked
and you'll see the position of each token after analysis to see if my guess
is accurate.
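
The hypothesis above can be illustrated with a toy position model. This is a sketch under stated assumptions, not Lucene or Solr code: the class PositionOverlapSketch, the Token type and the adjacentInOrder helper are hypothetical names, and the token positions are made up to show the two cases. It treats an ordered, zero-slop near of two single terms as "the second term sits exactly one position after the first".

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of analyzed tokens with positions. These are NOT Lucene's classes. */
final class PositionOverlapSketch {

    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** True if some occurrence of 'second' sits exactly one position after 'first'. */
    static boolean adjacentInOrder(List<Token> tokens, String first, String second) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            for (Token b : tokens) {
                if (b.term.equals(second) && b.position == a.position + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothesized case: "e" and "tail" stacked on the same position.
        List<Token> stacked = Arrays.asList(
                new Token("from", 0), new Token("e", 1),
                new Token("tail", 1), new Token("commerce", 2));

        // Case with strictly increasing positions.
        List<Token> sequential = Arrays.asList(
                new Token("from", 0), new Token("e", 1),
                new Token("tail", 2), new Token("commerce", 3));

        System.out.println(adjacentInOrder(stacked, "e", "commerce"));    // prints true
        System.out.println(adjacentInOrder(sequential, "e", "commerce")); // prints false
    }
}
```

If the analysis screen shows the second layout, with every token on its own position, this particular explanation does not apply and the cause has to be elsewhere.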

Best,
Erick

On Sun, Jun 14, 2015 at 4:34 AM, Dmitry Kan  wrote:
> Hi guys,
>
> We observe a strange bug in Solr 4.10.2, whereby a sloppy query hits
> words it should not:
>
> <str name="rawquerystring">the "e commerce"</str>
> <str name="querystring">the "e commerce"</str>
> <str name="parsedquery">SpanNearQuery(spanNear([Contents:the,
> spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false))</str>
> <str name="parsedquery_toString">spanNear([Contents:the,
> spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false)</str>
>
>
> This query produces words as hits, like:
>
> From E-Tail
>
> In the inner spanNear query we expect that e and commerce will occur within
> 0 slop in that order.
>
> Can somebody shed light on what is going on?