Re: Why does the StandardTokenizer split hyphenated words?

Erik Hatcher Wed, 15 Dec 2004 12:46:21 -0800


On Dec 15, 2004, at 3:14 PM, Mike Snare wrote:

[...]


In addition, why do we assume that a-1 is a "typical product name" but
a-b isn't?

I am in no way second-guessing or suggesting a change; it just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is oft the case, that this is just one of those things one has to
accept.

It is one of those things we have to accept... or in this case write our own analyzer. Choosing an Analyzer is a very application-specific decision. StandardAnalyzer is a general-purpose one, but it is quite insufficient in many cases, and the same goes for QueryParser. We're lucky to have these kitchen-sink pieces in Lucene to get us going quickly, but digging deeper we often need custom solutions.

I'm working on indexing the e-book of Lucene in Action. I'll blog up the details of this in the near future as case-study material, but here's the short version...

I got the PDF file and ran pdftotext on it. Many words are split across lines with a hyphen. Often these pieces should be combined with the hyphen removed. Sometimes, though, the pieces should stay separate words. The scenario is different from yours, because I want the hyphens gone, though sometimes they are a separator and sometimes they should simply be removed. It depends. I wrote a custom analyzer with several custom filters in the pipeline... dashes are originally kept in the stream, and a later filter combines two tokens, looks the result up in an exception list, and either combines them or leaves them separate. StandardAnalyzer would have wreaked havoc.
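
To make the join-or-split step concrete, here is a rough sketch of the idea. It is not the actual filter and it deliberately avoids Lucene's TokenStream API, working on a plain list of strings instead. The class name HyphenJoiner, the method join, and the keepSplit set are all made up for illustration. It also assumes the exception list holds the joined forms that must stay split, which is only one way to read "exception list".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: merges tokens that were split
// across a line break with a trailing hyphen, unless the merged form is
// listed in an exception set.
final class HyphenJoiner {

    // tokens:    tokens in document order; a trailing "-" marks a line-break hyphen
    // keepSplit: assumed exception list of lowercase joined forms that must
    //            remain two separate tokens
    static List<String> join(List<String> tokens, Set<String> keepSplit) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            if (current.length() > 1 && current.endsWith("-") && hasNext) {
                String head = current.substring(0, current.length() - 1);
                String tail = tokens.get(i + 1);
                String joined = head + tail;
                if (keepSplit.contains(joined.toLowerCase(Locale.ROOT))) {
                    // The hyphen was a real separator: emit both pieces, hyphen dropped.
                    out.add(head);
                    out.add(tail);
                } else {
                    // The hyphen was a line-break artifact: emit one merged token.
                    out.add(joined);
                }
                i += 2; // both tokens consumed
            } else {
                out.add(current);
                i++;
            }
        }
        return out;
    }
}
```

With keepSplit containing "fulltext", the tokens "ana-", "lyzer", "full-", "text" would come out as "analyzer", "full", "text". A real implementation would sit in the analyzer chain as a TokenFilter and would also need to deal with offsets and position increments, which this sketch ignores.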

The results of my work will soon be available to all to poke at, but for now a screenshot is all I have public:

        http://www.lucenebook.com

Erik


