Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-09-16 Thread Shawn Heisey
On 9/16/2015 5:42 AM, Alessandro Benedetti wrote:
> Any update on this?

I found two workarounds, and went with the second one -- removing the
PatternReplaceFilterFactory from fieldType definitions that also include
WDF.  They are both documented in the issue:

https://issues.apache.org/jira/browse/LUCENE-6689

I still think that there's a bug that needs fixing, but I'm not
desperate any more.

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-09-16 Thread Alessandro Benedetti
Any update on this?

Cheers

2015-08-21 0:22 GMT+01:00 Shawn Heisey :

> On 7/8/2015 6:13 PM, Yonik Seeley wrote:
> > On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey wrote:
> >> After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
> >> up at position 2.
> > Yikes, that's definitely wrong.
>
> I have filed LUCENE-6689 for this problem.  I'd like to write a unit
> test that demonstrates the problem, but Lucene internals are a mystery
> to me.  I have a concise and repeatable manual test (using Solr)
> outlined in this comment:
>
>
> https://issues.apache.org/jira/browse/LUCENE-6689?focusedCommentId=14705543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14705543
>
> Is there an existing Lucene test class that I could use as a basis for a
> test?  I will look into tests for analysis components and try to build
> it on my own, but any help is appreciated.
>
> Thanks,
> Shawn
>
>


-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-08-20 Thread Shawn Heisey
On 7/8/2015 6:13 PM, Yonik Seeley wrote:
> On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey  wrote:
>> After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
>> up at position 2.
> Yikes, that's definitely wrong.

I have filed LUCENE-6689 for this problem.  I'd like to write a unit
test that demonstrates the problem, but Lucene internals are a mystery
to me.  I have a concise and repeatable manual test (using Solr)
outlined in this comment:

https://issues.apache.org/jira/browse/LUCENE-6689?focusedCommentId=14705543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14705543

Is there an existing Lucene test class that I could use as a basis for a
test?  I will look into tests for analysis components and try to build
it on my own, but any help is appreciated.

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-14 Thread Shawn Heisey
On 7/14/2015 11:42 AM, Shawn Heisey wrote:
> So the problem might be with the rulefile, or with some strange
> combination of these analysis components. I did not build this
> rulefile myself. It was built by someone else, either Robert Muir or Steve
> Rowe if I remember right, when SOLR-4123 was underway. The normal
> settings for ICUTokenizer eliminate most of the things that WDF uses
> for making tokens, which is why I'm using this custom rulefile.  

I found the place where I got that rulefile (named
Latin-break-only-on-whitespace.rbbi).  It's in the Lucene ICU source, in
this directory:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation

The rbbi file that I'm using was slightly different than the one in the
branch_5x source, so I copied the source file over.  It didn't change
the behavior.

I'm using the ICU tokenizer with a custom rule file because I want
tokenization on boundaries between different character sets (Chinese,
Japanese, Cyrillic, etc.), but I want to handle internal punctuation with
WordDelimiterFilter.

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-14 Thread Shawn Heisey
On 7/14/2015 10:46 AM, Alessandro Benedetti wrote:
> Furthermore, I was checking Solr 5.1 to see whether the WDF factory
> actually works properly.
> Is it possible to know what the conclusion was for this issue?
> Is there an issue in the WordDelimiter token filter in the current Solr
> version? Has it been fixed?
> Any update?

It appears that the problem is not with WDF alone ... something about
the combination of filters that I have chosen is causing this, but only
with certain kinds of input.

If I set up a minimal fieldType with the keyword tokenizer, then I
cannot get the problem to reproduce:




  


I tried with inputs of "aaa-bbb ccc" and "aaa-bbb: ccc" and everything
worked as expected.

I then tried some other analysis combinations trying to find the minimal
problem fieldType, and I finally hit on the one that causes a problem. 
It's a combination of the ICUTokenizer with a custom rulefile, a pattern
replace filter that eats leading and trailing punctuation, and the WDF. 
That must be combined with input text that includes trailing
punctuation: "aaa-bbb: ccc"


  



  


If the rulefile is not specified, then the problem doesn't occur,
because the trailing punctuation is missing by the time it makes it to
the PRF.  If the PRF isn't there, then the problem doesn't occur.
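
To make the stages concrete, here is a rough Python sketch (not Solr code) of
what each component sees for that input.  It assumes the rulefile behaves like
a plain whitespace split for Latin text.  The edge-punctuation regex and the
function names are hypothetical stand-ins; the real fieldType uses its own
pattern.  The sketch only illustrates the data flow and does not reproduce
the position problem.

```python
import re


def tokenize_on_whitespace(text):
    # Assumption: stands in for ICUTokenizer with the whitespace-only rulefile.
    return text.split()


def strip_edge_punctuation(token):
    # Hypothetical pattern: drop non-word characters at either end of the token.
    return re.sub(r"^\W+|\W+$", "", token)


def split_on_delimiters(token):
    # Simplified stand-in for WDF splitting on intra-word delimiters (ASCII only).
    return [part for part in re.split(r"[^0-9A-Za-z]+", token) if part]


def analyze(text):
    # Returns one (raw token, token after PRF, parts after WDF) tuple per token.
    stages = []
    for raw in tokenize_on_whitespace(text):
        cleaned = strip_edge_punctuation(raw)
        stages.append((raw, cleaned, split_on_delimiters(cleaned)))
    return stages


for row in analyze("aaa-bbb: ccc"):
    print(row)
```

The first tuple shows the trailing colon reaching the pattern replace step and
being removed there before the delimiter split runs.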

So the problem might be with the rulefile, or with some strange
combination of these analysis components.  I did not build this rulefile
myself.  It was built by someone else, either Robert Muir or Steve Rowe if I
remember right, when SOLR-4123 was underway.  The normal settings for
ICUTokenizer eliminate most of the things that WDF uses for making
tokens, which is why I'm using this custom rulefile.

https://issues.apache.org/jira/browse/SOLR-4123

Any advice would be appreciated.  I can make the .rbbi file available.

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-14 Thread Alessandro Benedetti
Furthermore, I was checking Solr 5.1 to see whether the WDF factory
actually works properly.
Is it possible to know what the conclusion was for this issue?
Is there an issue in the WordDelimiter token filter in the current Solr
version? Has it been fixed?
Any update?

Cheers

2015-07-14 17:16 GMT+01:00 Alessandro Benedetti:

> I just found this interesting article by Mike McCandless that explains the
> sausagization problem, which is related to the strange positions seen
> in some cases.
>
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> Cheers
>
> 2015-07-09 1:13 GMT+01:00 Yonik Seeley :
>
>> On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey  wrote:
>> > After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
>> > up at position 2.
>>
>> Yikes, that's definitely wrong.
>>
>> -Yonik
>>
>
>
>





Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-14 Thread Alessandro Benedetti
I just found this interesting article by Mike McCandless that explains the
sausagization problem, which is related to the strange positions seen
in some cases.

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Cheers

2015-07-09 1:13 GMT+01:00 Yonik Seeley :

> On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey  wrote:
> > After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
> > up at position 2.
>
> Yikes, that's definitely wrong.
>
> -Yonik
>





Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Yonik Seeley
On Wed, Jul 8, 2015 at 6:50 PM, Shawn Heisey  wrote:
> After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
> up at position 2.

Yikes, that's definitely wrong.

-Yonik


Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Shawn Heisey
On 7/8/2015 4:01 PM, Jack Krupansky wrote:
> In Lucene 4.8, LUCENE-5111: Fix WordDelimiterFilter offsets
>
> https://issues.apache.org/jira/browse/LUCENE-5111
>
> Make sure the documents are queried and indexed with the same Lucene match
> version.

Since I have updated the luceneMatchVersion on the 4.9.1 version to
LUCENE_47, I am now reindexing it, to see if that helps.

I discovered that I had some information backwards in my previous
messages -- it is *index* time analysis that differs.  Query time
analysis is the same across versions.  The reindex may very well fix
this problem, but luceneMatchVersion is a band-aid, and I think there is
a bug to be fixed.

I have no doubt that LUCENE-5111 fixed a real issue, but I think it also
caused some new problems.

When faced with text like "aaa-bbb", the original term (created by
preserveOriginal) ends up at relative position 1.  Prior to this fix,
the next terms were "aaa" at position 1 and "bbb" at position 2.  The
"aaabbb" term created by the catenation option also ended up at position
2.  This arrangement makes perfect sense to me.

After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
up at position 2.  I can't see how it is logical to end up with these
positions.  It breaks phrase queries on my index because the query-time
analysis puts these two terms at position 1 and 2.
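
A toy illustration of why the phrase query stops matching, using only the
positions described above.  The function phrase_matches is a hypothetical
helper, not Lucene's phrase scorer; it just checks whether the query terms
line up in the index at the same relative positions.

```python
def phrase_matches(index_positions, query_terms):
    """index_positions maps each term to the set of positions it holds.
    query_terms is a list of (term, position) pairs from query analysis."""
    first_term, first_pos = query_terms[0]
    for start in index_positions.get(first_term, ()):
        shift = start - first_pos
        # Every query term must appear at its own position plus the shift.
        if all(pos + shift in index_positions.get(term, ())
               for term, pos in query_terms):
            return True
    return False


# Index-time positions for "aaa-bbb" before the change.
old_index = {"aaa-bbb": {1}, "aaa": {1}, "bbb": {2}, "aaabbb": {2}}
# Index-time positions with luceneMatchVersion at 4.9.
new_index = {"aaa-bbb": {1}, "aaa": {2}, "bbb": {2}, "aaabbb": {2}}
# Query-time analysis puts the two parts at positions 1 and 2.
query = [("aaa", 1), ("bbb", 2)]

print(phrase_matches(old_index, query))  # True
print(phrase_matches(new_index, query))  # False
```

With both parts at position 2 in the index, there is no position for "bbb"
that sits one step after "aaa", so the phrase cannot match.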

The WDF options I chose seemed logical to me when I made them (about
four years ago), but I admit that I don't remember the exact motivation
behind those choices.  You can find the entire fieldType definition in a
previous message on this thread.  The two analysis chains are the same
except for WDF options.  Should I use different options?

Index-time options:

|

Query-time options:
|||


Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Jack Krupansky
In Lucene 4.8, LUCENE-5111: Fix WordDelimiterFilter offsets

https://issues.apache.org/jira/browse/LUCENE-5111

Make sure the documents are queried and indexed with the same Lucene match
version.


-- Jack Krupansky

On Wed, Jul 8, 2015 at 5:19 PM, Shawn Heisey  wrote:

> On 7/8/2015 2:19 PM, Shawn Heisey wrote:
> > It appears that changing luceneMatchVersion from LUCENE_4_9 to LUCENE_47
> > has fixed this problem ... so I think somebody must have "fixed" WDF to
> > its current behavior, but put in a version check for the old behavior.
>
> The luceneMatchVersion change has fixed this specific issue with WDF,
> but these searches on 4.9.1 are still returning zero hits, and I don't
> yet know why.
>
> Thanks,
> Shawn
>
>


Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Alessandro Benedetti
Yes Shawn, I was pointing out that I see strange values in the
positions as well.
You said you fixed it by going back to an older version?
That should not be necessary; I would assume the latest version is the
best.
Any ideas or clarification, guys?

2015-07-08 21:10 GMT+01:00 Shawn Heisey :

> On 7/8/2015 9:26 AM, Alessandro Benedetti wrote:
> > Looking at the documentation, I see position orderings that seem
> > inconsistent to me:
>
> Alessandro, thank you for your reply.  I couldn't really tell what you
> were saying.  I *think* you were agreeing with me that the current
> behavior seems like a problem, but I'm not really sure.
>
> At this point I think I should probably file a bug in Jira ... anyone
> have any thoughts on that?
>
> Thanks,
> Shawn
>
>




Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Shawn Heisey
On 7/8/2015 2:19 PM, Shawn Heisey wrote:
> It appears that changing luceneMatchVersion from LUCENE_4_9 to LUCENE_47
> has fixed this problem ... so I think somebody must have "fixed" WDF to
> its current behavior, but put in a version check for the old behavior.

The luceneMatchVersion change has fixed this specific issue with WDF,
but these searches on 4.9.1 are still returning zero hits, and I don't
yet know why.

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Shawn Heisey
On 7/8/2015 2:10 PM, Shawn Heisey wrote:
> At this point I think I should probably file a bug in Jira ... anyone
> have any thoughts on that?

It appears that changing luceneMatchVersion from LUCENE_4_9 to LUCENE_47
has fixed this problem ... so I think somebody must have "fixed" WDF to
its current behavior, but put in a version check for the old behavior.

I think that WDF's position output with a current luceneMatchVersion is
wrong, but I'd like the input of someone who's a little more familiar
with the code and what SHOULD happen.

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Shawn Heisey
On 7/8/2015 9:26 AM, Alessandro Benedetti wrote:
> Looking at the documentation, I see position orderings that seem
> inconsistent to me:

Alessandro, thank you for your reply.  I couldn't really tell what you
were saying.  I *think* you were agreeing with me that the current
behavior seems like a problem, but I'm not really sure.

At this point I think I should probably file a bug in Jira ... anyone
have any thoughts on that?

Thanks,
Shawn



Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Alessandro Benedetti
Looking at the documentation, I see position orderings that seem
inconsistent to me:

*Example:*

Concatenate word parts and number parts, but not word and number parts that
occur in the same token.

  
  


*In:* "hot-spot 100+42 XL40"

*Tokenizer to Filter:* "hot-spot"(1), "100+42"(2), "XL40"(3)

*Out:* "hot"(1), "spot"(2), "hotspot"(2) *(1?)*, "100"(3), "42"(4),
"10042"(4) *(2?)*, "XL"(5)*(3?)*, "40"(6)*(4?)*

*Example:*

Concatenate all. Word and/or number parts are joined together.

  
  


*In:* "XL-4000/ES"

*Tokenizer to Filter:* "XL-4000/ES"(1)

*Out:* "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)*(1?)*


It is not clear to me why a token generated by catenation should not occupy
the same position as the original one.
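
As a rough model, the documented positions above are what you get if each
split part advances the position by one and the catenated token is emitted at
the position of the last part.  The Python sketch below follows that rule.  It
is an assumption that reproduces the numbers in the two examples, not the
actual WordDelimiterFilter code, and it is simplified: ASCII only, no
case-change splitting, and catenation only when all parts of a token are of
the same type (or catenate_all is set).

```python
import re


def split_parts(token):
    # Split on delimiters and on letter/digit transitions (ASCII only).
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def wdf_positions(tokens, catenate_words=False, catenate_numbers=False,
                  catenate_all=False):
    out = []
    pos = 0
    for token in tokens:
        parts = split_parts(token)
        for part in parts:
            pos += 1  # each part takes the next position
            out.append((part, pos))
        if len(parts) > 1:
            all_words = all(p.isalpha() for p in parts)
            all_numbers = all(p.isdigit() for p in parts)
            if (catenate_all
                    or (catenate_words and all_words)
                    or (catenate_numbers and all_numbers)):
                # The catenated token shares the position of the last part.
                out.append(("".join(parts), pos))
    return out


print(wdf_positions(["hot-spot", "100+42", "XL40"],
                    catenate_words=True, catenate_numbers=True))
print(wdf_positions(["XL-4000/ES"], catenate_all=True))
```

Under this rule the catenated token lands on the last part rather than on the
position of the original token, which is the ordering questioned here.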


In your example, I am a little bit surprised by the first results as well:

"RRR-COLECCION: COLECCIÓN: Gracita Morales foobar

Here are the final positions and terms that 4.7.2 yields for this on
query analysis:

1 rrr-coleccion
1 rrr
2 coleccion
2 rrrcoleccion *(1) ?*
3 coleccion
4 gracita
5 morales
6 foobar


It is not so clear whether the tokens should simply inherit their position
from the "parent" token, or whether they should be arranged based on the
final list of tokens.

2015-07-08 16:03 GMT+01:00 Shawn Heisey :

> On 7/8/2015 8:44 AM, Shawn Heisey wrote:
> > This is what 4.9.1 does with it:
> >
> > 1 rrr-coleccion
> > 2 rrr
> > 2 coleccion
> > 2 rrrcoleccion
> > 3 coleccion
> > 4 gracita
> > 5 morales
> > 6 foobar
>
> Followup:  This is what Solr 5.2.1 does for query analysis, which also
> seems wrong, and doesn't match the phrase query:
>
> 1 rrr-coleccion
> 2 coleccion
> 2 rrr
> 2 rrrcoleccion
> 3 coleccion
> 4 gracita
> 5 morales
> 6 bleh
>
> The index analysis on 5.2.1 is the same as the other two versions.
>
> Thanks,
> Shawn
>
>




Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

2015-07-08 Thread Shawn Heisey
On 7/8/2015 8:44 AM, Shawn Heisey wrote:
> This is what 4.9.1 does with it:
>
> 1 rrr-coleccion
> 2 rrr
> 2 coleccion
> 2 rrrcoleccion
> 3 coleccion
> 4 gracita
> 5 morales
> 6 foobar

Followup:  This is what Solr 5.2.1 does for query analysis, which also
seems wrong, and doesn't match the phrase query:

1 rrr-coleccion
2 coleccion
2 rrr
2 rrrcoleccion
3 coleccion
4 gracita
5 morales
6 bleh

The index analysis on 5.2.1 is the same as the other two versions.

Thanks,
Shawn