Steve, thanks for investigating and fixing this problem. The patch you attached to the issue fixes it for me.
So here is my little (and probably meaningless ;) ) vote: +1 :)

Walter, as Steve says, Morfologik is a lemmatizer. It isn't really incompatible, and it satisfies one of our client's requirements: highlighting not only fully matched phrases but also the matched parts of them.


On 22.04.2013 02:21, Steve Rowe wrote:
Karol has uncovered a bug introduced by LUCENE-4810 
<https://issues.apache.org/jira/browse/LUCENE-4810>, included in Lucene/Solr 
4.3.0.

The problem is an interaction between the Morfologik stemmer, which can produce 
multiple stems per input term, all but the first having a position increment of 
zero, and EdgeNGramTokenFilter, which only outputs ngrams for input terms that 
are at least as long as the minimum configured length, and passes through 
unchanged the position increment for the first ngram output for any given input 
term.

So what happens in Karol's case is that "T." has the period stripped by 
StandardTokenizer, then is stemmed by Morfologik to produce terms "to", "tom" 
and "tona".  The first term "to" has a position increment of 1, but is not 
output by EdgeNGramTokenFilter, because its length is below the configured 
minimum of 3.  The second term "tom" is given a position increment of 0 by 
Morfologik, and meets EdgeNGramTokenFilter's minimum length, so gets output, 
and since it's the first output term for the input term "tom", the input 
position increment is left as-is in the output term: 0.  That's how the first 
output term gets a position increment of 0.

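The interaction described above can be sketched in plain Java, with no Lucene 
dependency. This is only a simplified model, not the real filter: the Tok 
class, the edgeNGrams method and the maximum gram size of 10 are hypothetical, 
and giving every later ngram of a term an increment of 0 is an assumption of 
the model. Only the minimum length of 3 and the three stems come from the 
description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosIncDemo {
    // Hypothetical stand-in for a token: its text and its position increment.
    public static final class Tok {
        public final String term;
        public final int posInc;
        public Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    // Simplified model of the behaviour described above: terms shorter than
    // minGram are dropped, and the first ngram of each surviving term keeps
    // the input term's position increment unchanged.
    public static List<Tok> edgeNGrams(List<Tok> in, int minGram, int maxGram) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (t.term.length() < minGram) {
                continue; // "to" is dropped here, and its increment of 1 is lost
            }
            int max = Math.min(maxGram, t.term.length());
            for (int n = minGram; n <= max; n++) {
                // Assumption: later ngrams of the same term stay at the same position.
                int inc = (n == minGram) ? t.posInc : 0;
                out.add(new Tok(t.term.substring(0, n), inc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stems as described for "T": first has increment 1, the rest 0.
        List<Tok> stems = Arrays.asList(
            new Tok("to", 1), new Tok("tom", 0), new Tok("tona", 0));
        // Prints [tom(+0), ton(+0), tona(+0)]: the first output increment is 0.
        System.out.println(edgeNGrams(stems, 3, 10));
    }
}
```
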
Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0, 
EdgeNGramTokenFilter indiscriminately set all output terms' position increments 
to 1, so that explains why this behavior didn't occur with previously released 
versions.

I think the fix is a check in EdgeNGramTokenFilter when outputting the first 
term, that the position increment is greater than 0, and if it's not, then it 
should be set to 1.
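
A minimal sketch of that check, again as a plain-Java model rather than the 
actual patch. The names Tok and edgeNGramsFixed and the maximum gram size of 
10 are hypothetical, and reading "the first term" as the first term the filter 
outputs for the whole stream is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FirstPosIncFix {
    // Hypothetical stand-in for a token: its text and its position increment.
    public static final class Tok {
        public final String term;
        public final int posInc;
        public Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    // Simplified edge-ngram model with the proposed check added.
    public static List<Tok> edgeNGramsFixed(List<Tok> in, int minGram, int maxGram) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (t.term.length() < minGram) {
                continue; // too short, dropped as before
            }
            int max = Math.min(maxGram, t.term.length());
            for (int n = minGram; n <= max; n++) {
                int inc = (n == minGram) ? t.posInc : 0;
                // Proposed check (as read here): the very first output term
                // must have an increment greater than 0, otherwise use 1.
                if (out.isEmpty() && inc <= 0) {
                    inc = 1;
                }
                out.add(new Tok(t.term.substring(0, n), inc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> stems = Arrays.asList(
            new Tok("to", 1), new Tok("tom", 0), new Tok("tona", 0));
        // Prints [tom(+1), ton(+0), tona(+0)]
        System.out.println(edgeNGramsFixed(stems, 3, 10));
    }
}
```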

Does anybody know if this could also be an issue for other filters?

I'll work on a patch for EdgeNGramTokenFilter.

Steve

On Apr 21, 2013, at 9:21 AM, Karol Sikora <karol.sik...@laboratorium.ee> wrote:

hi,

I extracted a minimal failing example; the Solr configs (schema, solrconfig.xml) 
and data are in the attached archive.
I am trying to import these simple documents:
[
     {
         "publisher": [
             "T. Gl\u00fccksberg"
         ],
         "uid": "1000881"
     },
     {
         "publisher": [
             "Ala a kota"
         ],
         "uid": "1000894"
     }
]
The first fails on the copyField destination publisher_hl with an exception 
(trace: https://gist.github.com/anonymous/5429558); the second is added without 
any problems.
schema.xml is here: https://gist.github.com/anonymous/5429562

Anyone trying to reproduce this behaviour should remember to copy the libs 
related to the morfologik and icu filters.

This extracted example works fine with solr 4.0 - 4.2.1.

Regards,
Karol



On 21.04.2013 09:03, Simon Willnauer wrote:
hey karol,

can you reproduce this behaviour in a small test-case (curl command or
something like this) that we can reproduce?

@solr guys any idea what this could be?

simon

On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora <karol.sik...@laboratorium.ee> wrote:

Hi all,

I have a problem with Solr 4.3 RC2 on the test data for a search
application which I'm developing.
Many of the imported records fail with the exception
"java.lang.IllegalArgumentException: first position increment must be > 0
(got 0)". On versions from early 4.0 to 4.2.1 all documents were added
successfully, so I think that something is broken in the new release.
I'll try to examine tomorrow what is broken.


Regards,
Karol

On 20.04.2013 21:07, Andi Vajda wrote:


On Sat, 20 Apr 2013, Simon Willnauer wrote:


Here is the RC:


http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054


happy voting...

here is my +1

PyLucene 4.3 builds and passes its tests.

+1 !

Andi..

---------------------------------------------------------------------
To unsubscribe, e-mail:
dev-unsubscr...@lucene.apache.org

For additional commands, e-mail:
dev-h...@lucene.apache.org




--
  Karol Sikora
+48 781 493 788

Laboratorium EE
ul. Mokotowska 46A/23 | 00-543 Warszawa |

www.laboratorium.ee | www.laboratorium.ee/facebook








--
Karol Sikora
IT Project Manager, CBN - Interfejs 2.0
+48 781 493 788

Laboratorium EE
ul. Mokotowska 46A/23 | 00-543 Warszawa |

www.laboratorium.ee | www.laboratorium.ee/facebook




