[ https://issues.apache.org/jira/browse/LUCENENET-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852582#action_12852582 ]
Digy commented on LUCENENET-354:
--------------------------------

Hi Matt,

I compared Lucene.Net 2.9.2 and Lucene.Java 2.9.2; they both output the same tokens for your input. So it is not a bug. StandardAnalyzer works this way.

Even if it were a bug, changing StandardAnalyzer would cause compatibility problems between Lucene.Net and Lucene.Java versions.

So, if it is not suitable for your needs, you may want to use a different analyzer or write a custom analyzer that works the way you want.

DIGY

> The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENENET-354
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-354
>             Project: Lucene.Net
>          Issue Type: Bug
>         Environment: Lucene.Net 2.9.1
>            Reporter: Matt Dufrasne
>
> The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string.
> I think there is a bug in the tokenizer for Lucene 2.9.1 and it was probably there before. When indexing "BB_HHH_FFFF5_SSSS", when there is a number, the following tokens are returned:
> "bb hhh_ffff5_ssss"
> After some testing, I've found that this is because of the number. If I input "BB_HHH_FFFF_SSSS", I get
> "bb hhh ffff ssss"
> At this point, I'm leaning towards a tokenizer bug unless the presence of the number is supposed to have this behavior but I fail to see why.
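
For illustration only, the splitting rule the reporter expected (break on every underscore, whether or not digits are present) could be sketched in plain Java as below. This is a minimal sketch with no Lucene or Lucene.Net types: the class name `SimpleTokenSketch` and the method `tokenize` are hypothetical, and a real custom analyzer would have to wrap equivalent logic in the library's own tokenizer classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Lucene.Net.
// It only demonstrates the splitting rule, not the analyzer API.
final class SimpleTokenSketch {

    // Treats every character that is not a letter or digit (for example '_')
    // as a separator and lowercases the characters that are kept.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Separator reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Emit the trailing token, if any.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this rule, `SimpleTokenSketch.tokenize("BB_HHH_FFFF5_SSSS")` yields the four tokens `bb`, `hhh`, `ffff5`, `ssss`, so the digit no longer affects where the string is split.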