If I were the RM, I would not respin for this edge-ngrams filter. We already have tests that would catch such bugs, but these tests are currently disabled (!) because the filter is basically rotting.
So I can't see how something can be important enough to respin a release candidate for, yet so unimportant that no one cares whether its unit tests are really working.

On Mon, Apr 22, 2013 at 9:17 AM, Simon Willnauer <simon.willna...@gmail.com> wrote:

> I think we can add this to 4.3; I can roll another RC for that.
>
> simon
>
> On Mon, Apr 22, 2013 at 3:11 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > Is this a fix to 4.3 (RC3?) or for a 4.3.1?
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Steve Rowe
> > Sent: Monday, April 22, 2013 2:07 AM
> > To: dev@lucene.apache.org
> > Subject: Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"
> >
> > I've reopened LUCENE-4810 and attached a patch with a test and fix for this
> > problem. - Steve
> >
> > On Apr 22, 2013, at 1:09 AM, Steve Rowe <sar...@gmail.com> wrote:
> >
> >> Actually, Walter, I misspoke: Morfologik is a lemmatizer: it produces
> >> surface forms. Not really so incompatible, I think.
> >>
> >> Regardless of the choice to use this particular sequence of filters,
> >> EdgeNGramTokenFilter shouldn't produce a bad stream.
> >>
> >> Steve
> >>
> >> On Apr 21, 2013, at 8:34 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >>
> >>> Don't use a stemmer with edge ngrams.
> >>>
> >>> Edge ngrams are a tool for matching the surface word. Stemmers are a tool
> >>> for matching the root. Those are logically incompatible transforms.
> >>>
> >>> wunder
> >>>
> >>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
> >>>
> >>>> Karol has uncovered a bug introduced by LUCENE-4810
> >>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in
> >>>> Lucene/Solr 4.3.0.
> >>>>
> >>>> The problem is an interaction between the Morfologik stemmer, which can
> >>>> produce multiple stems per input term, all but the first having a position
> >>>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for
> >>>> input terms that are at least as long as the minimum configured length, and
> >>>> passes through unchanged the position increment for the first ngram output
> >>>> for any given input term.
> >>>>
> >>>> So what happens in Karol's case is that "T." has the period stripped by
> >>>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to",
> >>>> "tom" and "tona". The first term "to" has a position increment of 1, but is
> >>>> not output by EdgeNGramTokenFilter, because its length is below the
> >>>> configured minimum of 3. The second term "tom" is given a position
> >>>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum
> >>>> length, so gets output, and since it's the first output term for the input
> >>>> term "tom", the input position increment is left as-is in the output term:
> >>>> 0. That's how the first output term gets a position increment of 0.
> >>>>
> >>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
> >>>> EdgeNGramTokenFilter indiscriminately set all output terms' position
> >>>> increments to 1, so that explains why this behavior didn't occur with
> >>>> previously released versions.
> >>>>
> >>>> I think the fix is a check in EdgeNGramTokenFilter when outputting the
> >>>> first term, that the position increment is greater than 0, and if it's not,
> >>>> then it should be set to 1.
> >>>>
> >>>> Does anybody know if this could also be an issue for other filters?
> >>>>
> >>>> I'll work on a patch for EdgeNGramTokenFilter.
> >>>>
> >>>> Steve
> >>>>
> >>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <karol.sik...@laboratorium.ee> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I extracted a minimal failing example; the Solr configs (schema,
> >>>>> solrconfig.xml) and data are in the attached archive.
> >>>>> I am trying to import a simple document:
> >>>>> [
> >>>>>   {
> >>>>>     "publisher": [
> >>>>>       "T. Gl\u00fccksberg"
> >>>>>     ],
> >>>>>     "uid": "1000881"
> >>>>>   },
> >>>>>   {
> >>>>>     "publisher": [
> >>>>>       "Ala a kota"
> >>>>>     ],
> >>>>>     "uid": "1000894"
> >>>>>   }
> >>>>> ]
> >>>>> The first fails on the copyField destination publisher_hl with an exception
> >>>>> (trace: https://gist.github.com/anonymous/5429558); the second is added
> >>>>> without any problems.
> >>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
> >>>>>
> >>>>> Anyone trying to reproduce this behaviour should remember to copy the
> >>>>> libs for the morfologik and icu filters.
> >>>>>
> >>>>> This extracted example works fine with Solr 4.0 - 4.2.1.
> >>>>>
> >>>>> Regards,
> >>>>> Karol
> >>>>>
> >>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
> >>>>>>
> >>>>>> hey karol,
> >>>>>>
> >>>>>> can you reproduce this behaviour in a small test-case (curl command or
> >>>>>> something like this) that we can reproduce?
> >>>>>>
> >>>>>> @solr guys any idea what this could be?
> >>>>>>
> >>>>>> simon
> >>>>>>
> >>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora
> >>>>>> <karol.sik...@laboratorium.ee> wrote:
> >>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I have a problem with Solr 4.3 RC2 on the test data for a search
> >>>>>>> application which I'm developing.
> >>>>>>> Importing a lot of records fails with the exception
> >>>>>>> "java.lang.IllegalArgumentException: first position increment must be
> >>>>>>> > 0 (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
> >>>>>>> successfully, so I'm thinking that something is broken in the new
> >>>>>>> release.
> >>>>>>> I'll try to examine tomorrow what is broken.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Karol
> >>>>>>>
> >>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
> >>>>>>>
> >>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
> >>>>>>>>
> >>>>>>>>> Here is the RC:
> >>>>>>>>>
> >>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
> >>>>>>>>>
> >>>>>>>>> happy voting...
> >>>>>>>>>
> >>>>>>>>> here is my +1
> >>>>>>>>
> >>>>>>>> PyLucene 4.3 builds and passes its tests.
> >>>>>>>>
> >>>>>>>> +1 !
> >>>>>>>>
> >>>>>>>> Andi..
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>>>>>
> >>>>>>> --
> >>>>>>> Karol Sikora
> >>>>>>> +48 781 493 788
> >>>>>>>
> >>>>>>> Laboratorium EE
> >>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
> >>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
> >>>>>
> >>>>> --
> >>>>> Karol Sikora
> >>>>> IT Manager, CBN Project - Interfejs 2.0
> >>>>> +48 781 493 788
> >>>>>
> >>>>> Laboratorium EE
> >>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
> >>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
> >>>
> >>> --
> >>> Walter Underwood
> >>> wun...@wunderwood.org
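
For reference, the interaction Steve describes in the quoted thread can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: `Tok`, `EdgeNGramSketch`, the `guardFirst` flag and the maximum gram size of 4 are made up for the example, and the choice to give later grams of the same input term an increment of 0 is an assumption too. It is not the actual EdgeNGramTokenFilter code, nor the patch attached to LUCENE-4810.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Lucene token: a term plus its position increment.
final class Tok {
    final String term;
    final int posInc;

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class EdgeNGramSketch {

    // Simulates the behaviour described in the thread: input terms shorter than
    // minGram are dropped, and the first gram of each surviving term inherits
    // the input position increment. Assumption: later grams of the same term
    // get increment 0. With guardFirst set, the first token emitted for the
    // whole stream is forced to an increment of at least 1 (the proposed check).
    static List<Tok> edgeNGrams(List<Tok> input, int minGram, int maxGram, boolean guardFirst) {
        List<Tok> out = new ArrayList<>();
        for (Tok in : input) {
            if (in.term.length() < minGram) {
                continue; // too short: no output, and its increment is lost
            }
            int limit = Math.min(maxGram, in.term.length());
            for (int n = minGram; n <= limit; n++) {
                int posInc = (n == minGram) ? in.posInc : 0;
                if (guardFirst && out.isEmpty() && posInc == 0) {
                    posInc = 1; // never start a stream with increment 0
                }
                out.add(new Tok(in.term.substring(0, n), posInc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stems for "T" as reported in the thread: only "to" carries increment 1.
        List<Tok> stems = Arrays.asList(new Tok("to", 1), new Tok("tom", 0), new Tok("tona", 0));
        for (boolean guard : new boolean[] {false, true}) {
            Tok first = edgeNGrams(stems, 3, 4, guard).get(0);
            System.out.println("guardFirst=" + guard + " first=" + first.term + " posInc=" + first.posInc);
        }
    }
}
```

Without the guard, the first emitted token is "tom" with increment 0, which is the condition the indexer rejects with "first position increment must be > 0 (got 0)". With the guard, the same token is emitted with increment 1.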