Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"

Simon Willnauer Mon, 22 Apr 2013 06:17:56 -0700

I think we can add this to 4.3; I can roll another RC for that.

simon


On Mon, Apr 22, 2013 at 3:11 PM, Jack Krupansky <[email protected]> wrote:
> Is this a fix to 4.3 (RC3?) or for a 4.3.1?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Steve Rowe
> Sent: Monday, April 22, 2013 2:07 AM
>
> To: [email protected]
> Subject: Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"
>
> I've reopened LUCENE-4810 and attached a patch with a test and fix for this
> problem. - Steve
>
> On Apr 22, 2013, at 1:09 AM, Steve Rowe <[email protected]> wrote:
>
>> Actually, Walter, I misspoke: Morfologik is a lemmatizer: it produces
>> surface forms.  Not really so incompatible, I think.
>>
>> Regardless of the choice to use this particular sequence of filters,
>> EdgeNGramTokenFilter shouldn't produce a bad stream.
>>
>> Steve
>>
>> On Apr 21, 2013, at 8:34 PM, Walter Underwood <[email protected]>
>> wrote:
>>
>>> Don't use a stemmer with edge ngrams.
>>>
>>> Edge ngrams are a tool for matching the surface word. Stemmers are a tool
>>> for matching the root. Those are logically incompatible transforms.
>>>
>>> wunder
>>>
>>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
>>>
>>>> Karol has uncovered a bug introduced by LUCENE-4810
>>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in
>>>> Lucene/Solr 4.3.0.
>>>>
>>>> The problem is an interaction between the Morfologik stemmer, which can
>>>> produce multiple stems per input term, all but the first having a position
>>>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for
>>>> input terms that are at least as long as the minimum configured length, and
>>>> passes through unchanged the position increment for the first ngram output
>>>> for any given input term.
>>>>
>>>> So what happens in Karol's case is that "T." has the period stripped by
>>>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to",
>>>> "tom" and "tona".  The first term "to" has a position increment of 1, but
>>>> is not output by EdgeNGramTokenFilter, because its length is below the
>>>> configured minimum of 3.  The second term "tom" is given a position
>>>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum
>>>> length, so gets output, and since it's the first output term for the input
>>>> term "tom", the input position increment is left as-is in the output term:
>>>> 0.  That's how the first output term gets a position increment of 0.
>>>>
>>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
>>>> EdgeNGramTokenFilter indiscriminately set all output terms' position
>>>> increments to 1, so that explains why this behavior didn't occur with
>>>> previously released versions.
>>>>
>>>> I think the fix is a check in EdgeNGramTokenFilter, when outputting the
>>>> first term, that the position increment is greater than 0; if it's not,
>>>> it should be set to 1.
>>>>
>>>> Does anybody know if this could also be an issue for other filters?
>>>>
>>>> I'll work on a patch for EdgeNGramTokenFilter.
>>>>
>>>> Steve
>>>>
>>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <[email protected]>
>>>> wrote:
>>>>
>>>>> hi,
>>>>>
>>>>> I extracted a minimal failing example; the Solr configs (schema,
>>>>> solrconfig.xml) and data are in the attached archive.
>>>>> I try to import these simple documents:
>>>>> [
>>>>>   {
>>>>>       "publisher": [
>>>>>           "T. Gl\u00fccksberg"
>>>>>       ],
>>>>>       "uid": "1000881"
>>>>>   },
>>>>>   {
>>>>>       "publisher": [
>>>>>           "Ala a kota"
>>>>>       ],
>>>>>       "uid": "1000894"
>>>>>   }
>>>>> ]
>>>>> The first fails on the copyField destination publisher_hl with an
>>>>> exception (trace: https://gist.github.com/anonymous/5429558); the second
>>>>> is added without any problems.
>>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>>>>
>>>>> If you try to reproduce this behaviour, remember to copy the libs
>>>>> needed by the morfologik and icu filters.
>>>>>
>>>>> This extracted example works fine with solr 4.0 - 4.2.1.
>>>>>
>>>>> Regards,
>>>>> Karol
>>>>>
>>>>>
>>>>>
>>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>>>>>
>>>>>> hey karol,
>>>>>>
>>>>>> can you reproduce this behaviour in a small test-case (curl command or
>>>>>> something like this) that we can run?
>>>>>>
>>>>>> @solr guys any idea what this could be?
>>>>>>
>>>>>> simon
>>>>>>
>>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a problem with Solr 4.3 RC2 on the test data for a search
>>>>>>> application I'm developing.
>>>>>>> Many records fail to import with the exception
>>>>>>> "java.lang.IllegalArgumentException: first position increment must be
>>>>>>> > 0 (got 0)". On versions from early 4.0 to 4.2.1 all documents were
>>>>>>> added successfully, so I think something is broken in the new
>>>>>>> release.
>>>>>>> I'll try to examine tomorrow what is broken.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Karol
>>>>>>>
>>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Here is the RC:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> happy voting...
>>>>>>>>>
>>>>>>>>> here is my +1
>>>>>>>>>
>>>>>>>>
>>>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>>>>
>>>>>>>> +1 !
>>>>>>>>
>>>>>>>> Andi..
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail:
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> For additional commands, e-mail:
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> Karol Sikora
>>>>>>> +48 781 493 788
>>>>>>>
>>>>>>> Laboratorium EE
>>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>>>
>>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail:
>>>>>>> [email protected]
>>>>>>>
>>>>>>> For additional commands, e-mail:
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail:
>>>>>> [email protected]
>>>>>>
>>>>>> For additional commands, e-mail:
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Karol Sikora
>>>>> IT Project Manager, CBN - Interfejs 2.0
>>>>> +48 781 493 788
>>>>>
>>>>> Laboratorium EE
>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>
>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>
>>> --
>>> Walter Underwood
>>> [email protected]
>>>
>>>
>>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
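
For readers following the thread, the interaction Steve describes in the quoted message can be sketched as a toy model. This is not Lucene code and does not use Lucene's API: tokens are plain (term, position increment) pairs, and the names edge_ngrams, fix_first and check_first_increment are hypothetical. It assumes that n-grams after the first one for an input term get an increment of 0, uses a maximum gram size of 4 (the thread only states the minimum of 3), and reads the proposed fix as applying to the first token emitted for the stream.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def edge_ngrams(tokens: Iterable[Token], min_gram: int, max_gram: int,
                fix_first: bool = False) -> List[Token]:
    """Toy front edge n-gram filter over (term, increment) pairs."""
    out: List[Token] = []
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            # Too short: the term is dropped, and its increment is lost with it.
            continue
        for n in range(min_gram, min(max_gram, len(term)) + 1):
            # First n-gram keeps the input increment; later ones get 0 (assumption).
            inc = pos_inc if n == min_gram else 0
            if fix_first and not out and inc <= 0:
                # Proposed fix: the first emitted token must have increment > 0.
                inc = 1
            out.append((term[:n], inc))
    return out


def check_first_increment(tokens: List[Token]) -> None:
    """Mimics the check behind the exception Karol reported."""
    if tokens and tokens[0][1] <= 0:
        raise ValueError(
            "first position increment must be > 0 (got %d)" % tokens[0][1])


# Stems produced for "T." as described in the thread.
stems = [("to", 1), ("tom", 0), ("tona", 0)]

print(edge_ngrams(stems, 3, 4))
# [('tom', 0), ('ton', 0), ('tona', 0)]
print(edge_ngrams(stems, 3, 4, fix_first=True))
# [('tom', 1), ('ton', 0), ('tona', 0)]
```

Passing the first result to check_first_increment raises the "first position increment must be > 0 (got 0)" error, while the second passes, which matches the behaviour difference reported between 4.2.1 and 4.3 RC2.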

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
