Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"

Jack Krupansky Mon, 22 Apr 2013 06:12:38 -0700

Is this a fix to 4.3 (RC3?) or for a 4.3.1?

-- Jack Krupansky

-----Original Message-----
From: Steve Rowe
Sent: Monday, April 22, 2013 2:07 AM
To: [email protected]
Subject: Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"

I've reopened LUCENE-4810 and attached a patch with a test and fix for this problem. - Steve


On Apr 22, 2013, at 1:09 AM, Steve Rowe <[email protected]> wrote:

Actually, Walter, I misspoke: Morfologik is a lemmatizer: it produces surface forms. Not really so incompatible, I think.
Regardless of the choice to use this particular sequence of filters, EdgeNGramTokenFilter shouldn't produce a bad stream.
Steve
On Apr 21, 2013, at 8:34 PM, Walter Underwood <[email protected]> wrote:
Don't use a stemmer with edge ngrams.
Edge ngrams are a tool for matching the surface word. Stemmers are a tool for matching the root. Those are logically incompatible transforms.
wunder

On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
Karol has uncovered a bug introduced by LUCENE-4810 <https://issues.apache.org/jira/browse/LUCENE-4810>, included in Lucene/Solr 4.3.0.
The problem is an interaction between the Morfologik stemmer, which can produce multiple stems per input term, all but the first having a position increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for input terms that are at least as long as the minimum configured length, and passes through unchanged the position increment for the first ngram output for any given input term.
So what happens in Karol's case is that "T." has the period stripped by StandardTokenizer, then is stemmed by Morfologik to produce terms "to", "tom" and "tona". The first term "to" has a position increment of 1, but is not output by EdgeNGramTokenFilter, because its length is below the configured minimum of 3. The second term "tom" is given a position increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum length, so gets output, and since it's the first output term for the input term "tom", the input position increment is left as-is in the output term: 0. That's how the first output term gets a position increment of 0.
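To make the walkthrough concrete, here is a minimal plain-Java sketch of the interaction. It does not use Lucene: the class EdgeNGramBugSketch, the Token type, and the maxGram value of 4 are hypothetical names and values chosen for illustration, and placing the later ngrams of a term at increment 0 is an assumption about the 4.3.0 behavior rather than something stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EdgeNGramBugSketch {

    /** A token reduced to its term text and position increment (hypothetical type). */
    public static final class Token {
        public final String term;
        public final int posInc;

        public Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /** Simplified model of the edge-ngram behavior described in the message above. */
    public static List<Token> edgeNGrams(List<Token> input, int minGram, int maxGram) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            if (t.term.length() < minGram) {
                // Too short: nothing is emitted, so this term's increment is lost.
                continue;
            }
            int longest = Math.min(maxGram, t.term.length());
            for (int n = minGram; n <= longest; n++) {
                // The first ngram keeps the input increment unchanged;
                // later ngrams are assumed to share that position (increment 0).
                int posInc = (n == minGram) ? t.posInc : 0;
                out.add(new Token(t.term.substring(0, n), posInc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stems reported for "T": only the first carries an increment of 1.
        List<Token> stems = Arrays.asList(
                new Token("to", 1), new Token("tom", 0), new Token("tona", 0));
        List<Token> out = edgeNGrams(stems, 3, 4);
        System.out.println(out); // prints [tom(+0), ton(+0), tona(+0)]
        System.out.println("first position increment = " + out.get(0).posInc); // 0
    }
}
```

Because "to" is dropped, the stream starts with a token whose increment is 0, which is the condition the indexing exception complains about.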
Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0, EdgeNGramTokenFilter indiscriminately set all output terms' position increments to 1, so that explains why this behavior didn't occur with previously released versions.
I think the fix is a check in EdgeNGramTokenFilter, when outputting the first term, that the position increment is greater than 0; if it's not, then it should be set to 1.
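One possible reading of that check, sketched in plain Java without any Lucene classes. This is not the patch attached to LUCENE-4810: the class EdgeNGramFixSketch, the Token type, and the maxGram value of 4 are hypothetical, "first term" is interpreted here as the first token emitted for the whole stream, and later ngrams of a term are assumed to sit at increment 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EdgeNGramFixSketch {

    /** A token reduced to its term text and position increment (hypothetical type). */
    public static final class Token {
        public final String term;
        public final int posInc;

        public Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /** Simplified edge-ngram model with a guard on the first emitted token. */
    public static List<Token> edgeNGrams(List<Token> input, int minGram, int maxGram) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            if (t.term.length() < minGram) {
                continue; // too short to produce any ngram
            }
            int longest = Math.min(maxGram, t.term.length());
            for (int n = minGram; n <= longest; n++) {
                int posInc = (n == minGram) ? t.posInc : 0;
                // Proposed guard: the very first output token must advance
                // the position, so a non-positive increment is replaced by 1.
                if (out.isEmpty() && posInc <= 0) {
                    posInc = 1;
                }
                out.add(new Token(t.term.substring(0, n), posInc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> stems = Arrays.asList(
                new Token("to", 1), new Token("tom", 0), new Token("tona", 0));
        System.out.println(edgeNGrams(stems, 3, 4)); // prints [tom(+1), ton(+0), tona(+0)]
    }
}
```

With the guard in place the stream for "T" starts at increment 1, while streams whose first emitted token already has a positive increment pass through unchanged.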
Does anybody know if this could also be an issue for other filters?

I'll work on a patch for EdgeNGramTokenFilter.

Steve
On Apr 21, 2013, at 9:21 AM, Karol Sikora <[email protected]> wrote:
Hi,
I extracted a minimal failing example; the Solr configs (schema, solrconfig.xml) and data are in the attached archive.
I try to import a simple document:
[
  {
      "publisher": [
          "T. Gl\u00fccksberg"
      ],
      "uid": "1000881"
  },
  {
      "publisher": [
          "Ala a kota"
      ],
      "uid": "1000894"
  }
]
The first fails on the copyField destination publisher_hl with an exception (trace: https://gist.github.com/anonymous/5429558); the second is added without any problems.
schema.xml is here: https://gist.github.com/anonymous/5429562
Anyone trying to reproduce this behaviour should remember to copy the libs related to the morfologik and icu filters.
This extracted example works fine with Solr 4.0 - 4.2.1.

Regards,
Karol



On 21.04.2013 09:03, Simon Willnauer wrote:
hey karol,

can you reproduce this behaviour in a small test-case (curl command or
something like this) that we can reproduce?

@solr guys any idea what this could be?

simon

On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <[email protected]> wrote:
Hi all,

I have a problem with Solr 4.3 RC2 on the test data for a search
application which I'm developing.
Importing a lot of records fails with the exception
"java.lang.IllegalArgumentException: first position increment must be > 0
(got 0)". On versions from early 4.0 to 4.2.1 all documents were added
successfully, so I think that something is broken in the new release.
I'll try to examine tomorrow what is broken.


Regards,
Karol

On 20.04.2013 21:07, Andi Vajda wrote:
On Sat, 20 Apr 2013, Simon Willnauer wrote:
Here is the RC:


http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054


happy voting...

here is my +1
PyLucene 4.3 builds and passes its tests.

+1 !

Andi..

---------------------------------------------------------------------
To unsubscribe, e-mail:
[email protected]

For additional commands, e-mail:
[email protected]
--
Karol Sikora
+48 781 493 788

Laboratorium EE
ul. Mokotowska 46A/23 | 00-543 Warszawa |

www.laboratorium.ee | www.laboratorium.ee/facebook




--

Karol Sikora
IT Manager, CBN Project - Interfejs 2.0
+48 781 493 788

Laboratorium EE
ul. Mokotowska 46A/23 | 00-543 Warszawa |

www.laboratorium.ee | www.laboratorium.ee/facebook
--
Walter Underwood
[email protected]



