I spotted an unexpected behavior when using the StandardAnalyzer.

This analyzer uses the StandardTokenizer, whose javadoc states:


Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split. 

 

But looking at my index with Luke, I saw that my product reference
AB-CD-1234 is split into 3 tokens (AB, CD and 1234), while I expected the
tokenizer to keep it whole.


So it looks like the StandardTokenizer does not work as it should.


Am I right?


I had a deeper look and found the jflex source used to generate the
StandardTokenizerImpl:
https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex


Here is how "product numbers" are defined (P being the punctuation
characters "_", "-", "/", "." and ","):


// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)


I am not a jflex expert, but it looks like the pattern
{ALPHANUM} ({P} {ALPHANUM} {P} {HAS_DIGIT}) is missing?

The same goes for every other pattern in which two digit-bearing segments,
or two purely alphabetic segments, follow each other across a punctuation
character.
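
To check my reading of the rule outside Lucene, here is a rough sketch that rebuilds the six NUM alternatives as a plain java.util.regex pattern and tests whether a whole string matches. The class name NumRuleCheck and the ASCII-only definitions of ALPHANUM, HAS_DIGIT and P are my own simplifying assumptions, not the real jflex macros, and the sketch only tests whole-string matches; it does not reproduce how the generated scanner picks tokens.

```java
import java.util.regex.Pattern;

public class NumRuleCheck {
    // ASCII-only approximations of the jflex macros (assumption, not the real definitions).
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    static final String P = "[_\\-/.,]";

    // The six alternatives of NUM, in the same order as in the grammar quoted above.
    static final Pattern NUM = Pattern.compile(
            ALPHANUM + P + HAS_DIGIT
            + "|" + HAS_DIGIT + P + ALPHANUM
            + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
            + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
            + "|" + ALPHANUM + P + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
            + "|" + HAS_DIGIT + P + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+");

    // True only if the entire string is covered by one NUM alternative.
    static boolean matchesNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AB-1234", "AB-12-CD", "AB-CD-1234"}) {
            System.out.println(s + " -> " + matchesNum(s));
        }
    }
}
```

Under these assumptions, AB-1234 and AB-12-CD match as a whole, while AB-CD-1234 does not, because AB and CD are two adjacent segments without a digit and no alternative allows that.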


-- 
View this message in context: 
http://www.nabble.com/StandardTokenizer-issue---tp22471475p22471475.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
