Re: bug in search with sloppy queries
Digging into the code, I see this:

[code]
public SpanWeight(SpanQuery query, IndexSearcher searcher) throws IOException {
  this.similarity = searcher.getSimilarity();
  this.query = query;

  termContexts = new HashMap<>();
  // Collect every term of the (possibly nested) span query into one sorted set.
  TreeSet<Term> terms = new TreeSet<>();
  query.extractTerms(terms);
  final IndexReaderContext context = searcher.getTopReaderContext();
  final TermStatistics termStats[] = new TermStatistics[terms.size()];
  int i = 0;
  // Gather per-term statistics and remember each term's context.
  for (Term term : terms) {
    TermContext state = TermContext.build(context, term);
    termStats[i] = searcher.termStatistics(term, state);
    termContexts.put(term, state);
    i++;
  }
  final String field = query.getField();
  if (field != null) {
    stats = similarity.computeWeight(query.getBoost(),
                                     searcher.collectionStatistics(query.getField()),
                                     termStats);
  }
}
[/code]

The query passed in is the nested spanNear structure from my first message, and it looks as if all terms are extracted from it without keeping the query structure. Is that right? Could someone shed light on the logic behind this weight calculation?

On Mon, Jun 15, 2015 at 10:23 AM, Dmitry Kan wrote:
> To clarify additionally: we use StandardTokenizer & StandardFilter in
> front of the WDF. Already following ST's transformations e-tail gets split
> into two consecutive tokens

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
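A toy illustration of the question about extractTerms, offered as a sketch rather than as Lucene's actual code: the ToyQuery, ToyTerm and ToyNear types below are hypothetical stand-ins, not Lucene classes. It shows how collecting the leaves of a nested near-query into a sorted set drops the slop, order and nesting. As far as I can tell from the 4.x sources, that flat set only feeds the scoring statistics, while position matching still goes through the query's getSpans; treat that reading as an assumption to verify against the version in use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these types are hypothetical and are not Lucene APIs.
final class ExtractTermsSketch {

    interface ToyQuery {
        // Adds every leaf term below this node to the given set.
        void extractTerms(Set<String> out);
    }

    static final class ToyTerm implements ToyQuery {
        final String text;

        ToyTerm(String text) {
            this.text = text;
        }

        @Override
        public void extractTerms(Set<String> out) {
            out.add(text);
        }
    }

    static final class ToyNear implements ToyQuery {
        final int slop;
        final boolean inOrder;
        final List<ToyQuery> clauses;

        ToyNear(int slop, boolean inOrder, ToyQuery... clauses) {
            this.slop = slop;
            this.inOrder = inOrder;
            this.clauses = Arrays.asList(clauses);
        }

        @Override
        public void extractTerms(Set<String> out) {
            // Slop and order play no role here; only the leaves are collected.
            for (ToyQuery clause : clauses) {
                clause.extractTerms(out);
            }
        }
    }

    // Same shape as the parsed query in the thread: near(the, near(e, commerce, 0, ordered), 300, unordered).
    static ToyQuery threadQuery() {
        return new ToyNear(300, false,
            new ToyTerm("the"),
            new ToyNear(0, true, new ToyTerm("e"), new ToyTerm("commerce")));
    }

    static Set<String> flatTerms(ToyQuery query) {
        Set<String> terms = new TreeSet<>();
        query.extractTerms(terms);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(flatTerms(threadQuery())); // prints [commerce, e, the]
    }
}
```

The sorted set keeps no trace of which terms belonged to the inner zero-slop clause, which is consistent with it being used only for statistics and not for deciding whether a document matches.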
Re: bug in search with sloppy queries
To clarify further: we use StandardTokenizer & StandardFilter in front of the WordDelimiterFilter (WDF). E-tail is already split into two consecutive tokens by the StandardTokenizer (ST), before the WDF runs.

On Mon, Jun 15, 2015 at 10:08 AM, Dmitry Kan wrote:
> Thanks, Erick. Analysis page shows the positions are growing => there are
> no "glued" words on the same position.

-- 
Dmitry Kan
Re: bug in search with sloppy queries
Thanks, Erick. The analysis page shows that the positions keep increasing, so there are no "glued" words at the same position.

On Sun, Jun 14, 2015 at 6:10 PM, Erick Erickson wrote:
> My guess is that you have WordDelimiterFilterFactory in your
> analysis chain with parameters that break up E-Tail to both "e" and "tail"
> _and_ put them in the same position. This assumes that the result fragment
> you pasted is incomplete and "commerce" is in it:
>
> From E-Tail commerce
>
> or some such. Try the admin/analysis screen with the "verbose" box checked
> and you'll see the position of each token after analysis to see if my guess
> is accurate.
>
> Best,
> Erick

-- 
Dmitry Kan
Re: bug in search with sloppy queries
My guess is that you have WordDelimiterFilterFactory in your analysis chain with parameters that break up E-Tail into both "e" and "tail" _and_ put them in the same position. This assumes that the result fragment you pasted is incomplete and "commerce" is in it:

From E-Tail commerce

or some such. Try the admin/analysis screen with the "verbose" box checked and you'll see the position of each token after analysis, which will show whether my guess is accurate.

Best,
Erick

On Sun, Jun 14, 2015 at 4:34 AM, Dmitry Kan wrote:
> Hi guys,
>
> We observe a strange bug in Solr 4.10.2, whereby a sloppy query hits
> words it should not:
>
> <str name="rawquerystring">the "e commerce"</str>
> <str name="querystring">the "e commerce"</str>
> <str name="parsedquery">SpanNearQuery(spanNear([Contents:the, spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false))</str>
> <str name="parsedquery_toString">spanNear([Contents:the, spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false)</str>
>
> This query produces words as hits, like:
>
> From E-Tail
>
> In the inner spanNear query we expect that e and commerce will occur within
> 0 slop in that order.
>
> Can somebody shed light on what is going on?
>
> --
> Dmitry Kan
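The same-position hypothesis above can be illustrated with a toy model. This is a minimal sketch, not Lucene's span matching: the Token class and the orderedAdjacent helper are hypothetical, and the token positions are made up for the example text "From E-Tail commerce". The assumption is that an ordered near-match with slop 0 over two single terms succeeds when the second term sits exactly one position after the first.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: these classes are hypothetical and are not Lucene or Solr APIs.
final class PositionSketch {

    // A term together with the position the analysis chain assigned to it.
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Simplified reading of an ordered near-match with slop 0 over two single terms:
    // it succeeds when `second` sits exactly one position after `first`.
    static boolean orderedAdjacent(List<Token> tokens, String first, String second) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            for (Token b : tokens) {
                if (b.term.equals(second) && b.position == a.position + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    // "From E-Tail commerce" with "e" and "tail" stacked on the same position.
    static List<Token> stacked() {
        return Arrays.asList(
            new Token("from", 0), new Token("e", 1),
            new Token("tail", 1), new Token("commerce", 2));
    }

    // The same text with strictly increasing positions.
    static List<Token> consecutive() {
        return Arrays.asList(
            new Token("from", 0), new Token("e", 1),
            new Token("tail", 2), new Token("commerce", 3));
    }

    public static void main(String[] args) {
        System.out.println("stacked:     " + orderedAdjacent(stacked(), "e", "commerce"));
        System.out.println("consecutive: " + orderedAdjacent(consecutive(), "e", "commerce"));
    }
}
```

In this model only the stacked layout lets "e" and "commerce" count as adjacent, which is why checking the token positions on the analysis screen is the quickest way to confirm or rule out the guess.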