Don't use a stemmer with edge ngrams.

Edge ngrams are a tool for matching the surface word. Stemmers are a tool for 
matching the root. Those are logically incompatible transforms. 
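
To illustrate the mismatch, here is a minimal sketch in plain Python (not Lucene code). The helper names edge_ngrams and toy_stem are hypothetical, and toy_stem is only a crude suffix stripper standing in for a real stemmer. It suggests how a prefix the user types against the surface word may no longer be indexed once the term has been stemmed first.

```python
def edge_ngrams(term, min_gram, max_gram):
    """Return the leading-edge ngrams of term with lengths min_gram..max_gram.

    Terms shorter than min_gram produce no ngrams at all.
    """
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


def toy_stem(term):
    """Hypothetical stand-in for a stemmer: strips a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


surface = "running"
# Edge ngrams of the surface word cover what a user types: run, runn, runni
surface_grams = edge_ngrams(surface, 3, 5)
# Edge ngrams of the stem stop at the (artificial) root "runn"
stemmed_grams = edge_ngrams(toy_stem(surface), 3, 5)
```

With these assumed settings, the typed prefix "runni" is among the surface ngrams but not among the ngrams of the stem.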

wunder

On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:

> Karol has uncovered a bug introduced by LUCENE-4810 
> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in Lucene/Solr 
> 4.3.0.
> 
> The problem is an interaction between the Morfologik stemmer, which can 
> produce multiple stems per input term, all but the first having a position 
> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for 
> input terms that are at least as long as the minimum configured length, and 
> passes through unchanged the position increment for the first ngram output 
> for any given input term.
> 
> So what happens in Karol's case is that "T." has the period stripped by 
> StandardTokenizer, then is stemmed by Morfologik to produce terms "to", "tom" 
> and "tona".  The first term "to" has a position increment of 1, but is not 
> output by EdgeNGramTokenFilter, because its length is below the configured 
> minimum of 3.  The second term "tom" is given a position increment of 0 by 
> Morfologik, and meets EdgeNGramTokenFilter's minimum length, so gets output, 
> and since it's the first output term for the input term "tom", the input 
> position increment is left as-is in the output term: 0.  That's how the first 
> output term gets a position increment of 0.
> 
> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0, 
> EdgeNGramTokenFilter indiscriminately set all output terms' position 
> increments to 1, so that explains why this behavior didn't occur with 
> previously released versions.
> 
> I think the fix is a check in EdgeNGramTokenFilter when outputting the first 
> term, that the position increment is greater than 0, and if it's not, then it 
> should be set to 1.
> 
> Does anybody know if this could also be an issue for other filters?
> 
> I'll work on a patch for EdgeNGramTokenFilter.
> 
> Steve
> 
> On Apr 21, 2013, at 9:21 AM, Karol Sikora <[email protected]> 
> wrote:
> 
>> hi,
>> 
>> I extracted a minimal failing example; the Solr configs (schema, 
>> solrconfig.xml) and data are in the attached archive.
>> I am trying to import these simple documents:
>> [
>>    {
>>        "publisher": [
>>            "T. Gl\u00fccksberg"
>>        ],  
>>        "uid": "1000881" 
>>    }, 
>>    {
>>        "publisher": [
>>            "Ala a kota"
>>        ],
>>        "uid": "1000894"
>>    }
>> ]
>> The first fails on the copyField destination publisher_hl with an exception 
>> (trace: https://gist.github.com/anonymous/5429558); the second is added 
>> without any problems.
>> schema.xml is here: https://gist.github.com/anonymous/5429562
>> 
>> Anyone trying to reproduce this behaviour should remember to copy the libs 
>> required by the morfologik and icu filters.
>> 
>> This extracted example works fine with solr 4.0 - 4.2.1.
>> 
>> Regards,
>> Karol
>> 
>> 
>> 
>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>> hey karol,
>>> 
>>> can you reproduce this behaviour in a small test-case (curl command or
>>> something like this) that we can run ourselves?
>>> 
>>> @solr guys any idea what this could be?
>>> 
>>> simon
>>> 
>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <[email protected]> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I have a problem with Solr 4.3 RC2 on my test data for a search
>>>> application which I'm developing.
>>>> A lot of imported records fail with the exception
>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>> (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
>>>> successfully, so I think that something is broken in the new release.
>>>> I'll try to examine tomorrow what is broken.
>>>> 
>>>> 
>>>> Regards,
>>>> Karol
>>>> 
>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>> 
>>>> 
>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>> 
>>>>> 
>>>>>> Here is the RC:
>>>>>> 
>>>>>> 
>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>> 
>>>>>> 
>>>>>> happy voting...
>>>>>> 
>>>>>> here is my +1
>>>>>> 
>>>>> 
>>>>> PyLucene 4.3 builds and passes its tests.
>>>>> 
>>>>> +1 !
>>>>> 
>>>>> Andi..
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: 
>>>>> [email protected]
>>>>> 
>>>>> For additional commands, e-mail: 
>>>>> [email protected]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> --
>>>> Karol Sikora
>>>> +48 781 493 788
>>>> 
>>>> Laboratorium EE
>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>> 
>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> -- 
>> 
>> Karol Sikora
>> IT Manager, CBN Project - Interfejs 2.0
>> +48 781 493 788
>> 
>> Laboratorium EE
>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>> 
>> www.laboratorium.ee | www.laboratorium.ee/facebook
> 
> 
> 
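
The interaction Steve describes in the quoted message can be modelled with a small toy in plain Python. This is a sketch under stated assumptions, not the Lucene implementation: edge_ngram_filter and check_first_increment are hypothetical names, the maximum gram size of 3 is assumed (only the minimum of 3 is given in the thread), later ngrams of the same input term are assumed to get increment 0, and the fix_first flag is only one reading of the check Steve proposes.

```python
def edge_ngram_filter(tokens, min_gram, max_gram, fix_first=False):
    """Toy model of the filter behaviour described in the thread.

    tokens is an iterable of (term, position_increment) pairs.
    Terms shorter than min_gram are dropped together with their increment.
    The first ngram of each surviving term inherits the input increment;
    later ngrams of the same term get 0 (an assumption of this sketch).
    With fix_first=True, a zero increment on the very first emitted token
    is bumped to 1, as in the proposed check.
    """
    emitted_any = False
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            continue  # too short: no ngrams, increment is lost
        last = min(max_gram, len(term))
        for i, n in enumerate(range(min_gram, last + 1)):
            inc = pos_inc if i == 0 else 0
            if fix_first and not emitted_any and inc == 0:
                inc = 1
            emitted_any = True
            yield term[:n], inc


def check_first_increment(tokens):
    """Stand-in for the indexing-time check that raised in Karol's report."""
    tokens = list(tokens)
    if tokens and tokens[0][1] <= 0:
        raise ValueError(
            "first position increment must be > 0 (got %d)" % tokens[0][1]
        )
    return tokens


# Stems reported for "T" in the thread: "to" (increment 1), "tom" and "tona" (0)
stems = [("to", 1), ("tom", 0), ("tona", 0)]

broken = list(edge_ngram_filter(stems, 3, 3))
fixed = list(edge_ngram_filter(stems, 3, 3, fix_first=True))
```

In this model "to" is dropped, so "tom" becomes the first emitted token and carries increment 0, which check_first_increment rejects; with fix_first=True the same token is emitted with increment 1 and the check passes.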

--
Walter Underwood
[email protected]


