I think we can add this to 4.3. I can roll another RC for that.

simon

On Mon, Apr 22, 2013 at 3:11 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> Is this a fix to 4.3 (RC3?) or for a 4.3.1?
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Steve Rowe
> Sent: Monday, April 22, 2013 2:07 AM
> To: dev@lucene.apache.org
> Subject: Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"
>
> I've reopened LUCENE-4810 and attached a patch with a test and fix for
> this problem. - Steve
>
> On Apr 22, 2013, at 1:09 AM, Steve Rowe <sar...@gmail.com> wrote:
>
>> Actually, Walter, I misspoke: Morfologik is a lemmatizer: it produces
>> surface forms. Not really so incompatible, I think.
>>
>> Regardless of the choice to use this particular sequence of filters,
>> EdgeNGramTokenFilter shouldn't produce a bad stream.
>>
>> Steve
>>
>> On Apr 21, 2013, at 8:34 PM, Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>>> Don't use a stemmer with edge ngrams.
>>>
>>> Edge ngrams are a tool for matching the surface word. Stemmers are a
>>> tool for matching the root. Those are logically incompatible
>>> transforms.
>>>
>>> wunder
>>>
>>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
>>>
>>>> Karol has uncovered a bug introduced by LUCENE-4810
>>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in
>>>> Lucene/Solr 4.3.0.
>>>>
>>>> The problem is an interaction between two filters. The Morfologik
>>>> stemmer can produce multiple stems per input term, all but the first
>>>> having a position increment of zero. EdgeNGramTokenFilter only
>>>> outputs ngrams for input terms that are at least as long as the
>>>> minimum configured length, and passes through unchanged the position
>>>> increment for the first ngram output for any given input term.
>>>>
>>>> So what happens in Karol's case is that "T." has the period stripped
>>>> by StandardTokenizer, then is stemmed by Morfologik to produce terms
>>>> "to", "tom" and "tona". The first term "to" has a position increment
>>>> of 1, but is not output by EdgeNGramTokenFilter, because its length
>>>> is below the configured minimum of 3. The second term "tom" is given
>>>> a position increment of 0 by Morfologik, and meets
>>>> EdgeNGramTokenFilter's minimum length, so gets output, and since
>>>> it's the first output term for the input term "tom", the input
>>>> position increment is left as-is in the output term: 0. That's how
>>>> the first output term gets a position increment of 0.
>>>>
>>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
>>>> EdgeNGramTokenFilter indiscriminately set all output terms' position
>>>> increments to 1, so that explains why this behavior didn't occur
>>>> with previously released versions.
>>>>
>>>> I think the fix is a check in EdgeNGramTokenFilter when outputting
>>>> the first term, that the position increment is greater than 0, and
>>>> if it's not, then it should be set to 1.
>>>>
>>>> Does anybody know if this could also be an issue for other filters?
>>>>
>>>> I'll work on a patch for EdgeNGramTokenFilter.
>>>>
>>>> Steve
>>>>
>>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora
>>>> <karol.sik...@laboratorium.ee> wrote:
>>>>
>>>>> hi,
>>>>>
>>>>> I extracted a minimal failing example; the Solr configs (schema,
>>>>> solrconfig.xml) and data are in the attached archive.
>>>>> I try to import a simple document:
>>>>> [
>>>>>   {
>>>>>     "publisher": [
>>>>>       "T. Gl\u00fccksberg"
>>>>>     ],
>>>>>     "uid": "1000881"
>>>>>   },
>>>>>   {
>>>>>     "publisher": [
>>>>>       "Ala a kota"
>>>>>     ],
>>>>>     "uid": "1000894"
>>>>>   }
>>>>> ]
>>>>> The first fails on the copyField destination publisher_hl with an
>>>>> exception (trace: https://gist.github.com/anonymous/5429558); the
>>>>> second is added without any problems.
>>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>>>>
>>>>> When someone tries to reproduce this behaviour, remember to copy
>>>>> the libs related to the morfologik and icu filters.
>>>>>
>>>>> This extracted example works fine with solr 4.0 - 4.2.1.
>>>>>
>>>>> Regards,
>>>>> Karol
>>>>>
>>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>>>>>
>>>>>> hey karol,
>>>>>>
>>>>>> can you reproduce this behaviour in a small test-case (curl
>>>>>> command or something like this) that we can reproduce?
>>>>>>
>>>>>> @solr guys any idea what this could be?
>>>>>>
>>>>>> simon
>>>>>>
>>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora
>>>>>> <karol.sik...@laboratorium.ee> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a problem with solr 4.3 RC2 on the test data for a search
>>>>>>> application which I'm developing.
>>>>>>> A lot of imported records fail with the exception
>>>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0 (got 0)".
>>>>>>> On versions from early 4.0 to 4.2.1 all documents were added
>>>>>>> successfully, so I think that something is broken in the new
>>>>>>> release.
>>>>>>> I'll try to examine tomorrow what is broken.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Karol
>>>>>>>
>>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>>>>
>>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>>>>
>>>>>>>>> Here is the RC:
>>>>>>>>>
>>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>>>>
>>>>>>>>> happy voting...
>>>>>>>>>
>>>>>>>>> here is my +1
>>>>>>>>
>>>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>>>>
>>>>>>>> +1 !
>>>>>>>>
>>>>>>>> Andi..
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>>>
>>>>>>> --
>>>>>>> Karol Sikora
>>>>>>> +48 781 493 788
>>>>>>>
>>>>>>> Laboratorium EE
>>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>>
>>>>> --
>>>>> Karol Sikora
>>>>> IT Manager, Project CBN - Interfejs 2.0
>>>>> +48 781 493 788
>>>>>
>>>>> Laboratorium EE
>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>
>>> --
>>> Walter Underwood
>>> wun...@wunderwood.org
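
The interaction Steve describes in the thread can be reproduced in a small standalone model. The sketch below is not Lucene code and not the patch attached to LUCENE-4810: `Token`, `edgeNGrams`, `checkFirstIncrement` and the `clampFirst` flag are hypothetical names made up for illustration, the maximum gram size of 4 is an arbitrary choice, and giving later n-grams of the same input term an increment of 0 is an assumption that the thread does not spell out. It only shows how dropping the too-short first stem "to" lets a zero increment reach the head of the stream, and how forcing the first emitted increment to 1 would avoid the exception Karol reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionIncrementSketch {

    /** A token reduced to the two attributes that matter here. */
    public static final class Token {
        public final String term;
        public final int posInc;

        public Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /**
     * Simplified model of an edge n-gram filter (hypothetical, not Lucene's).
     * Terms shorter than minGram are dropped. The first n-gram of each
     * surviving term inherits the input position increment; later n-grams
     * of the same term get 0 (an assumption of this sketch).
     * With clampFirst, the very first emitted token is forced to >= 1.
     */
    public static List<Token> edgeNGrams(List<Token> input, int minGram,
                                         int maxGram, boolean clampFirst) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            if (t.term.length() < minGram) {
                continue; // dropped, and its position increment is lost
            }
            int limit = Math.min(maxGram, t.term.length());
            for (int n = minGram; n <= limit; n++) {
                int inc = (n == minGram) ? t.posInc : 0;
                if (clampFirst && out.isEmpty() && inc < 1) {
                    inc = 1; // the check proposed in the thread
                }
                out.add(new Token(t.term.substring(0, n), inc));
            }
        }
        return out;
    }

    /** Mimics the indexing-time check whose message Karol quoted. */
    public static void checkFirstIncrement(List<Token> stream) {
        if (!stream.isEmpty() && stream.get(0).posInc <= 0) {
            throw new IllegalArgumentException(
                "first position increment must be > 0 (got "
                    + stream.get(0).posInc + ")");
        }
    }

    public static void main(String[] args) {
        // Stems of "T" as described in the thread: only the first has +1.
        List<Token> stems = Arrays.asList(
            new Token("to", 1), new Token("tom", 0), new Token("tona", 0));

        System.out.println("pass-through: " + edgeNGrams(stems, 3, 4, false));
        System.out.println("clamped:      " + edgeNGrams(stems, 3, 4, true));

        checkFirstIncrement(edgeNGrams(stems, 3, 4, true)); // passes
        try {
            checkFirstIncrement(edgeNGrams(stems, 3, 4, false));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected:     " + e.getMessage());
        }
    }
}
```

In this model the only difference between the two runs is the increment of the first emitted token "tom": 0 with plain pass-through, 1 with the clamp. Everything after it is unchanged, which matches the narrow scope of the fix proposed in the thread.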