On Tuesday, December 16, 2003, at 05:46 AM, Iain Young wrote:
Treating them as two separate words when quoted is indicative of your
analyzer not being sufficient for your domain.  What Analyzer are you
using?  Do you have knowledge of what it is tokenizing text into?

I have created a custom analyzer (CobolAnalyzer) which contains some custom
stop words for the language, but it's using the StandardTokenizer and
StandardFilters. I'll have a look and see if I can see what it's actually
tokenizing the text into...

Look at my article at java.net and try out the AnalyzerDemo code using some sample text and your custom analyzer:


http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

One of the things I plan to do with an enhanced Lucene demo to ship with Lucene's binary distributions is integrate in this type of "analyzing the analyzer" feature. It is the root of a lot of questions about Lucene. You can really only search for what you index, and you only index what the Analyzer creates, so understanding it is key to a lot.

And yes, if you are using StandardTokenizer, you are probably not tokenizing COBOL quite like you expect. Is there a COBOL parser you could tap into that could give you the tokens you want?

Erik


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to