Don't use a stemmer with edge ngrams. Edge ngrams are a tool for matching the surface word. Stemmers are a tool for matching the root. Those are logically incompatible transforms.
wunder

On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:

> Karol has uncovered a bug introduced by LUCENE-4810
> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in Lucene/Solr
> 4.3.0.
>
> The problem is an interaction between the Morfologik stemmer, which can
> produce multiple stems per input term, all but the first having a position
> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for
> input terms that are at least as long as the minimum configured length, and
> passes through unchanged the position increment for the first ngram output
> for any given input term.
>
> So what happens in Karol's case is that "T." has the period stripped by
> StandardTokenizer, then is stemmed by Morfologik to produce terms "to", "tom"
> and "tona". The first term "to" has a position increment of 1, but is not
> output by EdgeNGramTokenFilter, because its length is below the configured
> minimum of 3. The second term "tom" is given a position increment of 0 by
> Morfologik, and meets EdgeNGramTokenFilter's minimum length, so gets output,
> and since it's the first output term for the input term "tom", the input
> position increment is left as-is in the output term: 0. That's how the first
> output term gets a position increment of 0.
>
> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
> EdgeNGramTokenFilter indiscriminately set all output terms' position
> increments to 1, so that explains why this behavior didn't occur with
> previously released versions.
>
> I think the fix is a check in EdgeNGramTokenFilter when outputting the first
> term, that the position increment is greater than 0, and if it's not, then it
> should be set to 1.
>
> Does anybody know if this could also be an issue for other filters?
>
> I'll work on a patch for EdgeNGramTokenFilter.
>
> Steve
>
> On Apr 21, 2013, at 9:21 AM, Karol Sikora <[email protected]>
> wrote:
>
>> hi,
>>
>> I extracted a minimal failing example; the solr configs (schema,
>> solrconfig.xml) and data are in the attached archive.
>> I am trying to import a simple document:
>> [
>>   {
>>     "publisher": [
>>       "T. Gl\u00fccksberg"
>>     ],
>>     "uid": "1000881"
>>   },
>>   {
>>     "publisher": [
>>       "Ala a kota"
>>     ],
>>     "uid": "1000894"
>>   }
>> ]
>> The first fails on the copyfield destination publisher_hl with an exception
>> (trace: https://gist.github.com/anonymous/5429558), the second is added
>> without any problems.
>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>
>> If anyone tries to reproduce this behaviour, remember to copy the libs
>> related to the morfologik and icu filters.
>>
>> This extracted example works fine with solr 4.0 - 4.2.1.
>>
>> Regards,
>> Karol
>>
>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>> hey karol,
>>>
>>> can you reproduce this behaviour in a small test-case (curl command or
>>> something like this) that we can reproduce?
>>>
>>> @solr guys any idea what this could be?
>>>
>>> simon
>>>
>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a problem with solr 4.3 RC2 on my testing data for a search
>>>> application which I'm developing.
>>>> A lot of imported records fail with the exception
>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>> (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
>>>> successfully, so I think something is broken in the new release.
>>>> I'll try to examine tomorrow what is broken.
>>>>
>>>> Regards,
>>>> Karol
>>>>
>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>
>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>
>>>>>> Here is the RC:
>>>>>>
>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>
>>>>>> happy voting...
>>>>>>
>>>>>> here is my +1
>>>>>
>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>
>>>>> +1 !
>>>>>
>>>>> Andi..
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>
>>>> --
>>>> Karol Sikora
>>>> +48 781 493 788
>>>>
>>>> Laboratorium EE
>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>
>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>
>> --
>>
>> Karol Sikora
>> IT Project Manager, CBN - Interfejs 2.0
>> +48 781 493 788
>>
>> Laboratorium EE
>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>
>> www.laboratorium.ee | www.laboratorium.ee/facebook

--
Walter Underwood
[email protected]
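
The interaction Steve describes above can be sketched in plain Python, without Lucene. This is only an illustration under stated assumptions, not the actual filter code or the eventual patch: tokens are modelled as (term, position increment) pairs, the three stems for "T" are hard-coded from the thread, max_gram=4 is an invented value (the thread only gives the minimum of 3), and stacking the later grams of a term at increment 0 is an assumption. The names edge_ngrams, check_first_increment and fix_first_increment are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, int]  # (term text, position increment)

# Stems reported in the thread for the input "T" (period already stripped).
# Only the first stem carries a position increment of 1.
STEMS_FOR_T: List[Token] = [("to", 1), ("tom", 0), ("tona", 0)]


def edge_ngrams(tokens: Iterable[Token], min_gram: int, max_gram: int,
                fix_first_increment: bool = False) -> Iterator[Token]:
    """Toy model of the post-LUCENE-4810 behaviour described by Steve."""
    emitted_any = False
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            # Too short: the term is dropped, and its increment goes with it.
            continue
        for size in range(min_gram, min(max_gram, len(term)) + 1):
            # The first gram of a term keeps the input increment unchanged.
            # Later grams are stacked at the same position (assumption).
            inc = pos_inc if size == min_gram else 0
            # Sketch of the proposed check: never let the very first
            # output term leave the filter with an increment of 0.
            if fix_first_increment and not emitted_any and inc == 0:
                inc = 1
            emitted_any = True
            yield term[:size], inc


def check_first_increment(tokens: Iterable[Token]) -> List[Token]:
    """Stand-in for the indexing-time check that produced Karol's exception."""
    result = list(tokens)
    if result and result[0][1] <= 0:
        raise ValueError(
            "first position increment must be > 0 (got %d)" % result[0][1])
    return result


if __name__ == "__main__":
    print(list(edge_ngrams(STEMS_FOR_T, 3, 4)))
    print(list(edge_ngrams(STEMS_FOR_T, 3, 4, fix_first_increment=True)))
```

Without the flag, "to" is dropped and "tom" becomes the first output term with its increment of 0 intact, which is the condition the exception complains about. With the flag set, only that first output term is bumped to 1 and everything after it is left alone.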
