The UAX29URLEmailAnalyzer in Lucene 4.4 is not working as I expected.
Is this a bug in the analyzer, or is it working as designed?
With UAX29URLEmailAnalyzer, the following strings are tokenized as shown:
input=bwl-esl2.gbr.hp.com
output=[bwl-esl2.gbr.hp.com]
input=esl2.gbr
output=[esl2.gb][r]
input=bwl-esl2
output=[bwl][esl2]
input=bwl.esl2.gbr.hp.com
output=[bwl.esl2.gbr.hp.com]
The first two seem wrong to me. For the first, it seems as though the
analyzer thinks there is an @ instead of the - in bwl-esl2.gbr.hp.com
(i.e. bwl@esl2.gbr.hp.com), in which case keeping it as a single token
would make sense. The second one is even more difficult to understand.
A word does not get split when the characters on both sides of a period
are both letters or both digits. But in this case there is a digit on
the left of the period and a letter on the right. And splitting off the
letter r as its own token is even more puzzling.
By contrast, the StandardAnalyzer works as I expect:
input=bwl-esl2.gbr.hp.com
output=[bwl][esl2][gbr.hp.com]
input=bwl-esl2
output=[bwl][esl2]
input=bwl.esl2.gbr.hp.com
output=[bwl.esl2][gbr.hp.com]
input=esl2.gbr
output=[esl2][gbr]
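
A minimal sketch for reproducing the comparison above, assuming Lucene 4.4
(lucene-core and lucene-analyzers-common) is on the classpath. The class
name TokenDump and the field name "f" are placeholders chosen for
illustration, and this is not necessarily the code that produced the
output quoted in this message.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class: prints the tokens each analyzer emits.
public class TokenDump {

    // Runs the analyzer over the text and collects the emitted terms.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<String>();
        // "f" is an arbitrary field name; these analyzers ignore it.
        TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
        try {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                       // required before incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        } finally {
            ts.close();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String[] inputs = {
            "bwl-esl2.gbr.hp.com", "esl2.gbr", "bwl-esl2", "bwl.esl2.gbr.hp.com"
        };
        Analyzer[] analyzers = {
            new UAX29URLEmailAnalyzer(Version.LUCENE_44),
            new StandardAnalyzer(Version.LUCENE_44)
        };
        for (Analyzer analyzer : analyzers) {
            System.out.println(analyzer.getClass().getSimpleName());
            for (String input : inputs) {
                StringBuilder sb = new StringBuilder();
                for (String token : tokens(analyzer, input)) {
                    sb.append('[').append(token).append(']');
                }
                System.out.println("input=" + input);
                System.out.println("output=" + sb);
            }
        }
    }
}
```

Both analyzers lowercase their tokens, so the inputs are kept in lower case
to make the two outputs directly comparable.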
Any insights would be appreciated.
--
Regards
Milind