Re: bug in search with sloppy queries

2015-06-15 Thread Dmitry Kan
Digging into the code, I see this:

[code]
public SpanWeight(SpanQuery query, IndexSearcher searcher)
    throws IOException {
  this.similarity = searcher.getSimilarity();
  this.query = query;

  termContexts = new HashMap<>();
  TreeSet<Term> terms = new TreeSet<>();
  query.extractTerms(terms);
  final IndexReaderContext context = searcher.getTopReaderContext();
  final TermStatistics termStats[] = new TermStatistics[terms.size()];
  int i = 0;
  for (Term term : terms) {
    TermContext state = TermContext.build(context, term);
    termStats[i] = searcher.termStatistics(term, state);
    termContexts.put(term, state);
    i++;
  }
  final String field = query.getField();
  if (field != null) {
    stats = similarity.computeWeight(query.getBoost(),
                                     searcher.collectionStatistics(query.getField()),
                                     termStats);
  }
}
[/code]

The query passed in here is the nested spanNear structure from my original
message, and all of its terms are extracted without keeping the query
structure. Is that right? Could someone shed light on the logic behind this
weight calculation?
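
One possible reading of this constructor, offered as an assumption rather than a confirmed answer: the flattened term set seems to be used only to gather per-term statistics for scoring (the similarity.computeWeight(...) call), while the nesting, slop and order of the query would be applied separately when spans are matched. The toy Java sketch below uses hypothetical stand-in classes (ToySpanQuery, ToySpanTerm, ToySpanNear), not Lucene's real ones, to show that extracting terms from a nested query yields a flat, sorted set while the query tree itself is left untouched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins for span queries; these are NOT Lucene's real classes.
interface ToySpanQuery {
    // Adds every term reachable from this query node to the given set.
    void extractTerms(Set<String> terms);
}

final class ToySpanTerm implements ToySpanQuery {
    final String term;

    ToySpanTerm(String term) {
        this.term = term;
    }

    @Override
    public void extractTerms(Set<String> terms) {
        terms.add(term);
    }
}

final class ToySpanNear implements ToySpanQuery {
    final List<ToySpanQuery> clauses;
    final int slop;        // kept on the node, ignored by extractTerms
    final boolean inOrder; // kept on the node, ignored by extractTerms

    ToySpanNear(int slop, boolean inOrder, ToySpanQuery... clauses) {
        this.clauses = Arrays.asList(clauses);
        this.slop = slop;
        this.inOrder = inOrder;
    }

    @Override
    public void extractTerms(Set<String> terms) {
        // Recurse into the children; the nesting is not recorded in the set.
        for (ToySpanQuery clause : clauses) {
            clause.extractTerms(terms);
        }
    }
}

final class TermExtractionSketch {
    // Mirrors the shape of the parsed query in this thread, with plain terms.
    static ToySpanQuery exampleQuery() {
        ToySpanQuery inner =
            new ToySpanNear(0, true, new ToySpanTerm("e"), new ToySpanTerm("commerce"));
        return new ToySpanNear(300, false, new ToySpanTerm("the"), inner);
    }

    // Flattens the query into a sorted term set, as the constructor's TreeSet does.
    static Set<String> flatten(ToySpanQuery query) {
        Set<String> terms = new TreeSet<>();
        query.extractTerms(terms);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(flatten(exampleQuery())); // prints [commerce, e, the]
    }
}
```

If that reading holds, the flat set would not by itself explain an unexpected match, since the slop and order stored on each node are still available to whatever performs the matching.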

 On Sun, Jun 14, 2015 at 4:34 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
  Hi guys,

  We are seeing a strange bug in Solr 4.10.2, whereby a sloppy query hits
  words it should not:

  <lst name="debug">
  <str name="rawquerystring">the e commerce</str>
  <str name="querystring">the e commerce</str>
  <str name="parsedquery">SpanNearQuery(spanNear([Contents:the,
  spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false))</str>
  <str name="parsedquery_toString">spanNear([Contents:the,
  spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false)</str>

  This query produces hits on words like:

  From <em>E</em>-Tail

  In the inner spanNear query we expect that "e" and "commerce" will occur
  within 0 slop, in that order.

  Can somebody shed light on what is going on?
 
-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: bug in search with sloppy queries

2015-06-15 Thread Dmitry Kan
Thanks, Erick. The Analysis page shows that the positions keep growing, so
there are no words glued together at the same position.



Re: bug in search with sloppy queries

2015-06-15 Thread Dmitry Kan
One more clarification: we use StandardTokenizer and StandardFilter in front
of the WordDelimiterFilter (WDF). Already after StandardTokenizer's
transformations, e-tail is split into two consecutive tokens.



Re: bug in search with sloppy queries

2015-06-14 Thread Erick Erickson
My guess is that you have WordDelimiterFilterFactory in your
analysis chain with parameters that break up E-Tail into both "e" and "tail"
_and_ put them in the same position. This assumes that the result fragment
you pasted is incomplete and "commerce" is in it:

From <em>E</em>-Tail <em>commerce</em>

or some such. Try the admin/analysis screen with the verbose box checked
and you'll see the position of each token after analysis, which will show
whether my guess is accurate.

Best,
Erick
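
The guess above can be modeled with a small toy sketch in Java. It is not Lucene code: the Token class, the assignPositions helper and the simplified orderedNear check are all hypothetical names introduced for illustration, and the position increments are assumed values. It compares two cases: "e" and "tail" stacked at the same position (increment 0 for "tail"), and the two tokens at consecutive positions.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionSketch {
    // A term together with its position, as an analysis screen would list it.
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Turns per-token position increments into absolute positions.
    // An increment of 0 stacks a token on the previous token's position.
    static List<Token> assignPositions(String[] terms, int[] increments) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            tokens.add(new Token(terms[i], position));
        }
        return tokens;
    }

    // Simplified ordered "near" check for two single terms: true if some
    // occurrence of `second` follows some occurrence of `first` with at most
    // `slop` positions in between.
    static boolean orderedNear(List<Token> tokens, String first, String second, int slop) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            for (Token b : tokens) {
                if (!b.term.equals(second)) {
                    continue;
                }
                int gap = b.position - a.position - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // Assumed case 1: "e" and "tail" share a position, "commerce" follows.
    static List<Token> stacked() {
        return assignPositions(new String[] {"e", "tail", "commerce"}, new int[] {1, 0, 1});
    }

    // Assumed case 2: every token advances the position by one.
    static List<Token> consecutive() {
        return assignPositions(new String[] {"e", "tail", "commerce"}, new int[] {1, 1, 1});
    }

    public static void main(String[] args) {
        System.out.println("stacked: " + orderedNear(stacked(), "e", "commerce", 0));         // stacked: true
        System.out.println("consecutive: " + orderedNear(consecutive(), "e", "commerce", 0)); // consecutive: false
    }
}
```

In this model, a 0-slop ordered match of "e" followed by "commerce" on "E-Tail commerce" is only possible when "tail" does not occupy a position of its own, which is why the positions reported by the analysis screen matter.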
