I've reopened LUCENE-4810 and attached a patch with a test and fix for this problem. - Steve
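For anyone following along, here is a minimal plain-Java sketch of the interaction analyzed in the quoted thread below. It is not the Lucene source and not the attached patch. It uses no Lucene classes, and every name in it (EdgeNGramPosIncSketch, Token, edgeNGrams, clampFirst) is hypothetical. It assumes that only the first ngram of each input term inherits the input position increment and that later ngrams of the same term get 0. It also assumes a maximum gram size of 4, which the thread does not state. The clampFirst flag is one possible reading of the check proposed in the thread: force the first emitted token to a position increment of at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation only; all names are hypothetical, not Lucene API.
class EdgeNGramPosIncSketch {

    /** A token reduced to the two attributes that matter here. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "/" + posInc;
        }
    }

    /**
     * Emits front edge ngrams of length minGram..maxGram for every input term
     * that is at least minGram long. The first ngram of a term inherits the
     * input position increment; later ngrams of that term get 0 (assumption).
     * With clampFirst set, the first token emitted for the whole stream is
     * forced to a position increment of at least 1.
     */
    static List<Token> edgeNGrams(List<Token> in, int minGram, int maxGram,
                                  boolean clampFirst) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (t.term.length() < minGram) {
                continue; // too short: dropped together with its increment
            }
            int limit = Math.min(maxGram, t.term.length());
            for (int n = minGram; n <= limit; n++) {
                int posInc = (n == minGram) ? t.posInc : 0;
                if (clampFirst && out.isEmpty() && posInc <= 0) {
                    posInc = 1; // the proposed check, as read here
                }
                out.add(new Token(t.term.substring(0, n), posInc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stems reported in the thread for "T": "to" (1), "tom" (0), "tona" (0).
        List<Token> stems = Arrays.asList(
                new Token("to", 1), new Token("tom", 0), new Token("tona", 0));
        System.out.println("pass-through: " + edgeNGrams(stems, 3, 4, false));
        System.out.println("clamped:      " + edgeNGrams(stems, 3, 4, true));
    }
}
```

In the pass-through run, the first emitted token is "tom" with a position increment of 0, which is the condition the reported IllegalArgumentException complains about. In the clamped run it is emitted with 1. Clamping discards the increment of the dropped "to" instead of carrying it forward, so a real fix may prefer to accumulate skipped increments.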
On Apr 22, 2013, at 1:09 AM, Steve Rowe <[email protected]> wrote:

> Actually, Walter, I misspoke: Morfologik is a lemmatizer; it produces surface
> forms.  Not really so incompatible, I think.
>
> Regardless of the choice to use this particular sequence of filters,
> EdgeNGramTokenFilter shouldn't produce a bad stream.
>
> Steve
>
> On Apr 21, 2013, at 8:34 PM, Walter Underwood <[email protected]> wrote:
>
>> Don't use a stemmer with edge ngrams.
>>
>> Edge ngrams are a tool for matching the surface word. Stemmers are a tool
>> for matching the root. Those are logically incompatible transforms.
>>
>> wunder
>>
>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
>>
>>> Karol has uncovered a bug introduced by LUCENE-4810
>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in
>>> Lucene/Solr 4.3.0.
>>>
>>> The problem is an interaction between the Morfologik stemmer, which can
>>> produce multiple stems per input term, all but the first having a position
>>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for
>>> input terms that are at least as long as the minimum configured length, and
>>> passes through unchanged the position increment for the first ngram output
>>> for any given input term.
>>>
>>> So what happens in Karol's case is that "T." has the period stripped by
>>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to",
>>> "tom" and "tona".  The first term "to" has a position increment of 1, but
>>> is not output by EdgeNGramTokenFilter, because its length is below the
>>> configured minimum of 3.  The second term "tom" is given a position
>>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum
>>> length, so gets output, and since it's the first output term for the input
>>> term "tom", the input position increment is left as-is in the output term:
>>> 0.  That's how the first output term gets a position increment of 0.
>>>
>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
>>> EdgeNGramTokenFilter indiscriminately set all output terms' position
>>> increments to 1, so that explains why this behavior didn't occur with
>>> previously released versions.
>>>
>>> I think the fix is a check in EdgeNGramTokenFilter when outputting the
>>> first term, that the position increment is greater than 0, and if it's not,
>>> then it should be set to 1.
>>>
>>> Does anybody know if this could also be an issue for other filters?
>>>
>>> I'll work on a patch for EdgeNGramTokenFilter.
>>>
>>> Steve
>>>
>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <[email protected]>
>>> wrote:
>>>
>>>> hi,
>>>>
>>>> I extracted a minimal failing example; the solr configs (schema,
>>>> solrconfig.xml) and data are in the attached archive.
>>>> I am trying to import these simple documents:
>>>> [
>>>>   {
>>>>     "publisher": [
>>>>       "T. Gl\u00fccksberg"
>>>>     ],
>>>>     "uid": "1000881"
>>>>   },
>>>>   {
>>>>     "publisher": [
>>>>       "Ala a kota"
>>>>     ],
>>>>     "uid": "1000894"
>>>>   }
>>>> ]
>>>> The first fails on the copyfield destination publisher_hl with an
>>>> exception (trace: https://gist.github.com/anonymous/5429558); the second
>>>> is added without any problems.
>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>>>
>>>> Anyone trying to reproduce this behaviour should remember to copy the
>>>> libs related to the morfologik and icu filters.
>>>>
>>>> This extracted example works fine with solr 4.0 - 4.2.1.
>>>>
>>>> Regards,
>>>> Karol
>>>>
>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>>>> hey karol,
>>>>>
>>>>> can you reproduce this behaviour in a small test-case (curl command or
>>>>> something like this) that we can reproduce?
>>>>>
>>>>> @solr guys any idea what this could be?
>>>>>
>>>>> simon
>>>>>
>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a problem with solr 4.3 RC2 on my testing data for the search
>>>>>> application which I'm developing.
>>>>>> Many imported records fail with the exception
>>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>>>> (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
>>>>>> successfully, so I think that something is broken in the new release.
>>>>>> I'll try to examine tomorrow what is broken.
>>>>>>
>>>>>> Regards,
>>>>>> Karol
>>>>>>
>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>>>
>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>>>
>>>>>>>> Here is the RC:
>>>>>>>>
>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>>>
>>>>>>>> happy voting...
>>>>>>>>
>>>>>>>> here is my +1
>>>>>>>
>>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>>>
>>>>>>> +1 !
>>>>>>>
>>>>>>> Andi..
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>>>>> --
>>>>>> Karol Sikora
>>>>>> +48 781 493 788
>>>>>>
>>>>>> Laboratorium EE
>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>>
>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>
>>>> --
>>>>
>>>> Karol Sikora
>>>> IT Project Manager, CBN - Interfejs 2.0
>>>> +48 781 493 788
>>>>
>>>> Laboratorium EE
>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>
>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>
>> --
>> Walter Underwood
>> [email protected]
