> Give me an example of a string and how you'd like it to be tokenized.
> But first, give the AnalyzerUtils (from my java.net article) a try and
> get a feel for what different analyzers do.
>
> Keep in mind that it can be tricky (see the AnalysisParalysis page on
> the wiki and my java.net article on QueryParser) to make sense out of a
> combination of QueryParser and an Analyzer - so its best to work with
> them independently to get what you want and then put things together.

I already used Luke. This is what I found (it even makes sense to me :))):

The string

    dash-123-01

was tokenized by the 1.2 StandardAnalyzer as

    dash 123 01

and is tokenized (1.4RC4) by every analyzer other than RussianAnalyzer,
SimpleAnalyzer and StopAnalyzer (which just got "dash" and omitted all
numbers) as

    dash-123-01

On the other hand,

    dash-my-string

is tokenized as

    dash my string

by all of them except WhitespaceAnalyzer, of course.

I guess this is what happens: numerical components turn the meaning of
the preceding dash into a minus. With that, the dash becomes part of the
token with the digits in it and is no longer a separator. This even holds
for mixed terms like 123a-01. So -1andAnyOtherCharacters-evenWithDashes
is a non-separable numerical expression for Lucene.

I also checked with Luke on the string

    dash\-123\-01

and got

    dash 123 01

with GermanAnalyzer and StandardAnalyzer, and just

    dash

with all the others, except for WhitespaceAnalyzer, of course. This makes
me think that an escaped dash is never a minus, somehow.

Daniel
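
To make the guess about dashes and digits concrete, here is a toy sketch in plain Java. It is only a model of the hypothesis, not Lucene's tokenizer, and it uses no Lucene API: the class name DashTokenModel and both methods are made up for illustration. The assumed rule is that a dash stays inside the token when the segment following it contains a digit, and acts as a separator otherwise. It does not model escaped dashes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Lucene code and does not call any Lucene API.
// It encodes the assumed rule "a dash followed by a digit-bearing segment
// is a minus, not a separator".
class DashTokenModel {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Splits on '-', but keeps the dash inside the current token when the
    // segment that follows it contains a digit. Empty segments (from
    // leading or doubled dashes) are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String segment : text.split("-")) {
            if (segment.isEmpty()) {
                continue;
            }
            if (current.length() > 0 && hasDigit(segment)) {
                // Assumed "minus" reading: glue the segment onto the token.
                current.append('-').append(segment);
            } else {
                // Separator reading: finish the previous token, start a new one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                }
                current.setLength(0);
                current.append(segment);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("dash-123-01"));    // prints [dash-123-01]
        System.out.println(tokenize("dash-my-string")); // prints [dash, my, string]
        System.out.println(tokenize("123a-01"));        // prints [123a-01]
    }
}
```

The three inputs in main reproduce what Luke showed for the 1.4RC4 StandardAnalyzer on unescaped strings. Whether the real grammar follows exactly this rule is an open question; checking the tokenizer grammar itself would settle it.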