Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"

Steve Rowe Sun, 21 Apr 2013 23:08:00 -0700

I've reopened LUCENE-4810 and attached a patch with a test and fix for this 
problem. - Steve
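
For anyone following the thread, here is a minimal sketch of the interaction 
described in the quoted analysis below, written in plain Java with no Lucene 
dependency. It is not the attached patch. The Token class, the edgeNGrams 
method and the applyFix flag are hypothetical names made up for illustration, 
the maximum gram size of 4 is an assumption (only the minimum of 3 comes from 
the report), and giving later n-grams of the same term an increment of 0 is 
also an assumption about the post-LUCENE-4810 behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the token stream; none of these names are Lucene APIs.
class PositionIncrementSketch {

    /** A term plus its position increment (hypothetical stand-in for Lucene attributes). */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /**
     * Emits leading-edge n-grams of length minGram..maxGram for each input token.
     * Tokens shorter than minGram are dropped. The first n-gram of a token inherits
     * the token's position increment; later n-grams of the same token get 0 (assumed).
     * With applyFix set, the first token emitted overall is forced to an increment of 1
     * if it would otherwise be 0.
     */
    static List<Token> edgeNGrams(List<Token> input, int minGram, int maxGram, boolean applyFix) {
        List<Token> output = new ArrayList<>();
        for (Token in : input) {
            if (in.term.length() < minGram) {
                continue; // too short: nothing is emitted and its increment is lost
            }
            int longest = Math.min(maxGram, in.term.length());
            for (int len = minGram; len <= longest; len++) {
                int posInc = (len == minGram) ? in.posInc : 0;
                if (applyFix && output.isEmpty() && posInc == 0) {
                    posInc = 1; // proposed check: the first output term must advance the position
                }
                output.add(new Token(in.term.substring(0, len), posInc));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Stems reported in the thread for "T": the first has increment 1, the alternatives 0.
        List<Token> stems = Arrays.asList(
                new Token("to", 1), new Token("tom", 0), new Token("tona", 0));

        System.out.println("without check: " + edgeNGrams(stems, 3, 4, false));
        System.out.println("with check:    " + edgeNGrams(stems, 3, 4, true));
    }
}
```

Without the check, the first emitted token in this model carries an increment 
of 0, which is the condition the indexer rejects. With the check it carries 1, 
and the remaining tokens are unchanged.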


On Apr 22, 2013, at 1:09 AM, Steve Rowe <[email protected]> wrote:

> Actually, Walter, I misspoke: Morfologik is a lemmatizer, so it produces surface 
> forms.  Not really so incompatible, I think.
> 
> Regardless of the choice to use this particular sequence of filters, 
> EdgeNGramTokenFilter shouldn't produce a bad stream.
> 
> Steve
> 
> On Apr 21, 2013, at 8:34 PM, Walter Underwood <[email protected]> wrote:
> 
>> Don't use a stemmer with edge ngrams.
>> 
>> Edge ngrams are a tool for matching the surface word. Stemmers are a tool 
>> for matching the root. Those are logically incompatible transforms. 
>> 
>> wunder
>> 
>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
>> 
>>> Karol has uncovered a bug introduced by LUCENE-4810 
>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in 
>>> Lucene/Solr 4.3.0.
>>> 
>>> The problem is an interaction between the Morfologik stemmer, which can 
>>> produce multiple stems per input term, all but the first having a position 
>>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for 
>>> input terms that are at least as long as the minimum configured length, and 
>>> passes through unchanged the position increment for the first ngram output 
>>> for any given input term.
>>> 
>>> So what happens in Karol's case is that "T." has the period stripped by 
>>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to", 
>>> "tom" and "tona".  The first term "to" has a position increment of 1, but 
>>> is not output by EdgeNGramTokenFilter, because its length is below the 
>>> configured minimum of 3.  The second term "tom" is given a position 
>>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum 
>>> length, so gets output, and since it's the first output term for the input 
>>> term "tom", the input position increment is left as-is in the output term: 
>>> 0.  That's how the first output term gets a position increment of 0.
>>> 
>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0, 
>>> EdgeNGramTokenFilter indiscriminately set all output terms' position 
>>> increments to 1, so that explains why this behavior didn't occur with 
>>> previously released versions.
>>> 
>>> I think the fix is a check in EdgeNGramTokenFilter when outputting the 
>>> first term, that the position increment is greater than 0, and if it's not, 
>>> then it should be set to 1.
>>> 
>>> Does anybody know if this could also be an issue for other filters?
>>> 
>>> I'll work on a patch for EdgeNGramTokenFilter.
>>> 
>>> Steve
>>> 
>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <[email protected]> 
>>> wrote:
>>> 
>>>> hi,
>>>> 
>>>> I extracted a minimal failing example; the Solr configs (schema, 
>>>> solrconfig.xml) and data are in the attached archive.
>>>> I am trying to import two simple documents:
>>>> [
>>>>   {
>>>>       "publisher": [
>>>>           "T. Gl\u00fccksberg"
>>>>       ],  
>>>>       "uid": "1000881" 
>>>>   }, 
>>>>   {
>>>>       "publisher": [
>>>>           "Ala a kota"
>>>>       ],
>>>>       "uid": "1000894"
>>>>   }
>>>> ]
>>>> The first fails on the copyField destination publisher_hl with an exception 
>>>> (trace: https://gist.github.com/anonymous/5429558); the second is added 
>>>> without any problems.
>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>>> 
>>>> Anyone trying to reproduce this behaviour should remember to copy the libs 
>>>> required by the Morfologik and ICU filters.
>>>> 
>>>> This extracted example works fine with solr 4.0 - 4.2.1.
>>>> 
>>>> Regards,
>>>> Karol
>>>> 
>>>> 
>>>> 
>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>>>> hey karol,
>>>>> 
>>>>> can you isolate this behaviour in a small test case (a curl command or
>>>>> something like that) that we can reproduce?
>>>>> 
>>>>> @solr guys any idea what this could be?
>>>>> 
>>>>> simon
>>>>> 
>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <[email protected]> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I have a problem with Solr 4.3 RC2 on the test data for a search
>>>>>> application which I'm developing.
>>>>>> Many imported records fail with the exception
>>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>>>> (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
>>>>>> successfully, so I think something is broken in the new release.
>>>>>> I'll try to examine tomorrow what is broken.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Karol
>>>>>> 
>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> Here is the RC:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>>> 
>>>>>>>> 
>>>>>>>> happy voting...
>>>>>>>> 
>>>>>>>> here is my +1
>>>>>>>> 
>>>>>>> 
>>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>>> 
>>>>>>> +1 !
>>>>>>> 
>>>>>>> Andi..
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: 
>>>>>>> [email protected]
>>>>>>> 
>>>>>>> For additional commands, e-mail: 
>>>>>>> [email protected]
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> --
>>>>>> Karol Sikora
>>>>>> +48 781 493 788
>>>>>> 
>>>>>> Laboratorium EE
>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>> 
>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> Karol Sikora
>>>> IT Manager, CBN - Interfejs 2.0 Project
>>>> +48 781 493 788
>>>> 
>>>> Laboratorium EE
>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>> 
>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>> 
>>> 
>> 
>> --
>> Walter Underwood
>> [email protected]
>> 
>> 
>> 
> 


