Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 9/16/2015 5:42 AM, Alessandro Benedetti wrote:
> Any update on this ?

I found two workarounds, and went with the second one -- removing the PatternReplaceFilterFactory from fieldType definitions that also include WDF. They are both documented in the issue:

https://issues.apache.org/jira/browse/LUCENE-6689

I still think that there's a bug that needs fixing, but I'm not desperate any more.

Thanks,
Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
Any update on this ?

Cheers

2015-08-21 0:22 GMT+01:00 Shawn Heisey:
> On 7/8/2015 6:13 PM, Yonik Seeley wrote:
> > On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey wrote:
> >> After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
> >> up at position 2.
> > Yikes, that's definitely wrong.
>
> I have filed LUCENE-6689 for this problem. I'd like to write a unit
> test that demonstrates the problem, but Lucene internals are a mystery
> to me. I have a concise and repeatable manual test (using Solr)
> outlined in this comment:
>
> https://issues.apache.org/jira/browse/LUCENE-6689?focusedCommentId=14705543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14705543
>
> Is there an existing Lucene test class that I could use as a basis for a
> test? I will look into tests for analysis components and try to build
> it on my own, but any help is appreciated.
>
> Thanks,
> Shawn

--
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/8/2015 6:13 PM, Yonik Seeley wrote:
> On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey wrote:
>> After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
>> up at position 2.
> Yikes, that's definitely wrong.

I have filed LUCENE-6689 for this problem. I'd like to write a unit test that demonstrates the problem, but Lucene internals are a mystery to me. I have a concise and repeatable manual test (using Solr) outlined in this comment:

https://issues.apache.org/jira/browse/LUCENE-6689?focusedCommentId=14705543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14705543

Is there an existing Lucene test class that I could use as a basis for a test? I will look into tests for analysis components and try to build it on my own, but any help is appreciated.

Thanks,
Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/14/2015 11:42 AM, Shawn Heisey wrote:
> So the problem might be with the rulefile, or with some strange
> combination of these analysis components. I did not build this
> rulefile myself. It was built by another, either Robert Muir or Steve
> Rowe if I remember right, when SOLR-4123 was underway. The normal
> settings for ICUTokenizer eliminate most of the things that WDF uses
> for making tokens, which is why I'm using this custom rulefile.

I found the place where I got that rulefile (named Latin-break-only-on-whitespace.rbbi). It's in the Lucene ICU source, in this directory:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation

The rbbi file that I'm using was slightly different from the one in the branch_5x source, so I copied the source file over. It didn't change the behavior.

I'm using the ICU tokenizer with a custom rule file because I want tokenization on boundaries between different character sets (Chinese, Japanese, Cyrillic, etc.), but I want to handle internal punctuation with WordDelimiterFilter.

Thanks,
Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/14/2015 10:46 AM, Alessandro Benedetti wrote:
> Furthermore I was checking with Solr 5.1 to find the WDFilter factory
> actually to work in a proper way.
> Is it possible to know what was the conclusion for this issue ?
> Is there an issue in the WordDelimiter token filter in the current Solr
> version? Has it been fixed ?
> Any update ?

It appears that the problem is not with WDF alone ... something about the combination of filters that I have chosen is causing this, but only with certain kinds of input.

If I set up a minimal fieldType with the keyword tokenizer, then I cannot get the problem to reproduce:

I tried with inputs of "aaa-bbb ccc" and "aaa-bbb: ccc" and everything worked as expected.

I then tried some other analysis combinations trying to find the minimal problem fieldType, and I finally hit on the one that causes a problem. It's a combination of the ICUTokenizer with a custom rulefile, a pattern replace filter that eats leading and trailing punctuation, and the WDF. That must be combined with input text that includes trailing punctuation:

"aaa-bbb: ccc"

If the rulefile is not specified, then the problem doesn't occur, because the trailing punctuation is missing by the time it makes it to the PRF. If the PRF isn't there, then the problem doesn't occur.

So the problem might be with the rulefile, or with some strange combination of these analysis components. I did not build this rulefile myself. It was built by another, either Robert Muir or Steve Rowe if I remember right, when SOLR-4123 was underway. The normal settings for ICUTokenizer eliminate most of the things that WDF uses for making tokens, which is why I'm using this custom rulefile.

https://issues.apache.org/jira/browse/SOLR-4123

Any advice would be appreciated. I can make the .rbbi file available.

Thanks,
Shawn
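[Editor's illustration] The token flow described in the message above can be roughly modelled in a few lines of Python. This is only a sketch under stated assumptions: a plain whitespace split stands in for the ICUTokenizer with the custom rulefile, and the regular expression is a hypothetical stand-in for the pattern replace filter, whose actual pattern is not shown in this thread. It shows what reaches WDF, not the WDF behavior itself.

```python
import re

# Hypothetical pattern: strips leading and trailing non-word characters.
# The real PatternReplaceFilterFactory pattern is not shown in the thread.
EDGE_PUNCT = re.compile(r"^\W+|\W+$")


def whitespace_tokens(text):
    """Stand-in for a tokenizer that only breaks on whitespace."""
    return text.split()


def strip_edge_punctuation(token):
    """Stand-in for the pattern replace filter described in the message."""
    return EDGE_PUNCT.sub("", token)


text = "aaa-bbb: ccc"
tokens = whitespace_tokens(text)
cleaned = [strip_edge_punctuation(t) for t in tokens]

print(tokens)   # tokens as they would reach the pattern replace step
print(cleaned)  # tokens as they would reach WDF
```

In this model the trailing colon survives tokenization and is only removed by the pattern replace step, which is the condition the message identifies as necessary for the problem to appear.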
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
Furthermore I was checking with Solr 5.1 to find the WDFilter factory actually to work in a proper way.
Is it possible to know what was the conclusion for this issue ?
Is there an issue in the WordDelimiter token filter in the current Solr version? Has it been fixed ?
Any update ?

Cheers

2015-07-14 17:16 GMT+01:00 Alessandro Benedetti:
> Just found this interesting article of Mike, that actually explains the
> sausagization problem, which actually is related to the strange positions
> in some case.
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> Cheers
>
> 2015-07-09 1:13 GMT+01:00 Yonik Seeley:
>
>> On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey wrote:
>> > After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
>> > up at position 2.
>>
>> Yikes, that's definitely wrong.
>>
>> -Yonik
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
Just found this interesting article of Mike, that actually explains the sausagization problem, which actually is related to the strange positions in some case.

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Cheers

2015-07-09 1:13 GMT+01:00 Yonik Seeley:
> On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey wrote:
> > After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
> > up at position 2.
>
> Yikes, that's definitely wrong.
>
> -Yonik
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey wrote:
> After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
> up at position 2.

Yikes, that's definitely wrong.

-Yonik
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/8/2015 4:01 PM, Jack Krupansky wrote:
> In Lucene 4.8, LUCENE-5111: Fix WordDelimiterFilter offsets
>
> https://issues.apache.org/jira/browse/LUCENE-5111
>
> Make sure the documents are queried and indexed with the same Lucene match
> version.

Since I have updated the luceneMatchVersion on the 4.9.1 version to LUCENE_47, I am now reindexing it, to see if that helps. I discovered that I had some information backwards in my previous messages -- it is *index* time analysis that differs. Query time analysis is the same across versions.

The reindex may very well fix this problem, but luceneMatchVersion is a band-aid, and I think there is a bug to be fixed. I have no doubt that LUCENE-5111 fixed a real issue, but I think it also caused some new problems.

When faced with text like "aaa-bbb", the original term (created by preserveOriginal) ends up at relative position 1. Prior to this fix, the next terms will be "aaa" at position 1 and "bbb" at position 2. The "aaabbb" term created by the catenation option also ends up at position 2. This arrangement makes perfect sense to me.

After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end up at position 2. I can't see how it is logical to end up with these positions. It breaks phrase queries on my index because the query-time analysis puts these two terms at position 1 and 2.

The WDF options I chose seemed logical to me when I made them (about four years ago), but I admit that I don't remember the exact motivation behind those choices. You can find the entire fieldType definition in a previous message on this thread. The two analysis chains are the same except for WDF options. Should I use different options?

Index-time options:

Query-time options:

Thanks,
Shawn
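[Editor's illustration] The effect on phrase queries described in the message above can be sketched with a simplified model of exact phrase matching: the query terms must be found at consecutive positions in the indexed field. The helper `phrase_matches` is a hypothetical function written for this illustration, not Lucene's implementation; the positions are the ones given in the message for the input "aaa-bbb".

```python
def phrase_matches(index_positions, query_terms):
    """Return True if query_terms occur at consecutive positions.

    index_positions maps each term to the set of positions it occupies.
    This is a simplified model of an exact (no slop) phrase query.
    """
    first, *rest = query_terms
    for start in index_positions.get(first, ()):
        if all(start + offset in index_positions.get(term, ())
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


# Index-time positions for "aaa-bbb" as described in the message.
before_fix = {"aaa-bbb": {1}, "aaa": {1}, "bbb": {2}, "aaabbb": {2}}
after_fix = {"aaa-bbb": {1}, "aaa": {2}, "bbb": {2}, "aaabbb": {2}}

# Query-time analysis puts "aaa" at position 1 and "bbb" at position 2,
# so the phrase needs the two terms at consecutive positions.
query = ["aaa", "bbb"]

print(phrase_matches(before_fix, query))  # True
print(phrase_matches(after_fix, query))   # False
```

With the pre-fix positions the two terms are adjacent and the phrase matches; with both terms stacked at position 2 there is no pair of consecutive positions, which is consistent with the zero-hit phrase queries reported in this thread.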
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
In Lucene 4.8, LUCENE-5111: Fix WordDelimiterFilter offsets

https://issues.apache.org/jira/browse/LUCENE-5111

Make sure the documents are queried and indexed with the same Lucene match version.

-- Jack Krupansky

On Wed, Jul 8, 2015 at 5:19 PM, Shawn Heisey wrote:
> On 7/8/2015 2:19 PM, Shawn Heisey wrote:
> > It appears that changing luceneMatchVersion from LUCENE_4_9 to LUCENE_47
> > has fixed this problem ... so I think somebody must have "fixed" WDF to
> > its current behavior, but put in a version check for the old behavior.
>
> The luceneMatchVersion change has fixed this specific issue with WDF,
> but these searches on 4.9.1 are still returning zero hits, and I don't
> yet know why.
>
> Thanks,
> Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
Yes Shawn, I was raising the fact that I see strange values in the positions as well.
You said you fixed going back with an old version ?
This should not be ok, I mean, I assume the latest version should be the best…
Any idea or clarification guys ?

2015-07-08 21:10 GMT+01:00 Shawn Heisey:
> On 7/8/2015 9:26 AM, Alessandro Benedetti wrote:
> > Taking a look into the documentation I see this inconsistent orderings in
> > my opinion :
>
> Alessandro, thank you for your reply. I couldn't really tell what you
> were saying. I *think* you were agreeing with me that the current
> behavior seems like a problem, but I'm not really sure.
>
> At this point I think I should probably file a bug in Jira ... anyone
> have any thoughts on that?
>
> Thanks,
> Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/8/2015 2:19 PM, Shawn Heisey wrote:
> It appears that changing luceneMatchVersion from LUCENE_4_9 to LUCENE_47
> has fixed this problem ... so I think somebody must have "fixed" WDF to
> its current behavior, but put in a version check for the old behavior.

The luceneMatchVersion change has fixed this specific issue with WDF, but these searches on 4.9.1 are still returning zero hits, and I don't yet know why.

Thanks,
Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/8/2015 2:10 PM, Shawn Heisey wrote:
> At this point I think I should probably file a bug in Jira ... anyone
> have any thoughts on that?

It appears that changing luceneMatchVersion from LUCENE_4_9 to LUCENE_47 has fixed this problem ... so I think somebody must have "fixed" WDF to its current behavior, but put in a version check for the old behavior.

I think that WDF's position output with a current luceneMatchVersion is wrong, but I'd like the input of someone who's a little more familiar with the code and what SHOULD happen.

Thanks,
Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/8/2015 9:26 AM, Alessandro Benedetti wrote:
> Taking a look into the documentation I see this inconsistent orderings in
> my opinion :

Alessandro, thank you for your reply. I couldn't really tell what you were saying. I *think* you were agreeing with me that the current behavior seems like a problem, but I'm not really sure.

At this point I think I should probably file a bug in Jira ... anyone have any thoughts on that?

Thanks,
Shawn
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
Taking a look into the documentation I see this inconsistent orderings in my opinion :

*Example:* Concatenate word parts and number parts, but not word and number parts that occur in the same token.
*In:* "hot-spot 100+42 XL40"
*Tokenizer to Filter:* "hot-spot"(1), "100+42"(2), "XL40"(3)
*Out:* "hot"(1), "spot"(2), "hotspot"(2) *(1?)*, "100"(3), "42"(4), "10042"(4) *(2?)*, "XL"(5) *(3?)*, "40"(6) *(4?)*

*Example:* Concatenate all. Word and/or number parts are joined together.
*In:* "XL-4000/ES"
*Tokenizer to Filter:* "XL-4000/ES"(1)
*Out:* "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3) *(1?)*

It is not clear to me why a token generated by a catenation should not occupy the same position as the original one.

In your example, I am a little bit surprised by the first results as well:

"RRR-COLECCION: COLECCIÓN: Gracita Morales foobar

Here are the final positions and terms that 4.7.2 yields for this on query analysis:

1 rrr-coleccion
1 rrr
2 coleccion
2 rrrcoleccion *(1) ?*
3 coleccion
4 gracita
5 morales
6 foobar

It is not so clear whether the tokens must simply inherit their position from the "parent" token, or whether they must be arranged based on the final list of tokens.

2015-07-08 16:03 GMT+01:00 Shawn Heisey:
> On 7/8/2015 8:44 AM, Shawn Heisey wrote:
> > This is what 4.9.1 does with it:
> >
> > 1 rrr-coleccion
> > 2 rrr
> > 2 coleccion
> > 2 rrrcoleccion
> > 3 coleccion
> > 4 gracita
> > 5 morales
> > 6 foobar
>
> Followup: This is what Solr 5.2.1 does for query analysis, which also
> seems wrong, and doesn't match the phrase query:
>
> 1 rrr-coleccion
> 2 coleccion
> 2 rrr
> 2 rrrcoleccion
> 3 coleccion
> 4 gracita
> 5 morales
> 6 bleh
>
> The index analysis on 5.2.1 is the same as the other two versions.
>
> Thanks,
> Shawn
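[Editor's illustration] To make the difference between the 4.7.2 and 4.9.1 listings in the message above easier to see, the two outputs can be compared as sets of (position, term) pairs. This is a small sketch; `parse` and `position_diff` are hypothetical helpers written for this illustration, and the listings are copied from the message.

```python
def parse(listing):
    """Parse a flat 'position term position term ...' listing into tuples."""
    items = listing.split()
    return [(int(items[i]), items[i + 1]) for i in range(0, len(items), 2)]


def position_diff(old, new):
    """Return (position, term) pairs found only in old and only in new."""
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    return only_old, only_new


v472 = parse("1 rrr-coleccion 1 rrr 2 coleccion 2 rrrcoleccion "
             "3 coleccion 4 gracita 5 morales 6 foobar")
v491 = parse("1 rrr-coleccion 2 rrr 2 coleccion 2 rrrcoleccion "
             "3 coleccion 4 gracita 5 morales 6 foobar")

only_472, only_491 = position_diff(v472, v491)
print(only_472)  # [(1, 'rrr')]
print(only_491)  # [(2, 'rrr')]
```

The only entry that differs between the two listings is the first word part, "rrr", which moves from position 1 to position 2.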
Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1
On 7/8/2015 8:44 AM, Shawn Heisey wrote:
> This is what 4.9.1 does with it:
>
> 1 rrr-coleccion
> 2 rrr
> 2 coleccion
> 2 rrrcoleccion
> 3 coleccion
> 4 gracita
> 5 morales
> 6 foobar

Followup: This is what Solr 5.2.1 does for query analysis, which also seems wrong, and doesn't match the phrase query:

1 rrr-coleccion
2 coleccion
2 rrr
2 rrrcoleccion
3 coleccion
4 gracita
5 morales
6 bleh

The index analysis on 5.2.1 is the same as the other two versions.

Thanks,
Shawn