On Dec 25, 2004, at 11:05 AM, Jim wrote:

I've seen some discussion on this and the answer seems to be "write your own". Hasn't someone already done that by now who would share? I really have to be able to include numeric and alphanumeric strings in my searches. I don't understand analyzers well enough to roll my own.

This is more involved than just keeping numbers around... or at least there are more steps to consider. Do you want the alpha characters lower-cased? That is the typical behavior, so that searches are case-insensitive. What about punctuation characters? Generally these get tossed, though there are cases where that is not desired either (see the variant sketch after the example below).


The good news is that writing the Tokenizer and TokenFilter pieces of an analyzer is generally relatively easy, and there are a number of built-in Lucene pieces you can leverage. I whipped up a quick AlphanumericAnalyzer for you demonstrating CharTokenizer, which treats alphanumeric characters as part of tokens and any other character as a separator that gets thrown away. At the same time, it lowercases. The output of the main() method is shown below as well.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class AlphanumericAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CharTokenizer(reader) {
      protected char normalize(char c) {
        return Character.toLowerCase(c);  // lower-case every token character
      }

      protected boolean isTokenChar(char c) {
        // letters and digits are token characters; everything else splits tokens
        return Character.isLetter(c) || Character.isDigit(c);
      }
    };
  }

  public static void main(String[] args) throws IOException {
    TokenStream ts = new AlphanumericAnalyzer().tokenStream(
        "field", new StringReader("December 26, 2004"));

    String month = ts.next().termText();
    String day = ts.next().termText();
    String year = ts.next().termText();

    System.out.println(month + " " + day + " " + year);
  }
}


Output: december 26 2004
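
Tweaking those earlier decisions is usually a one-line change. As an illustration (the PartNumberAnalyzer name and the hyphen rule are my own, not anything built into Lucene), here is a variant that keeps hyphens, so a part number like "AB-1234" survives as a single token instead of being split in two:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class PartNumberAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CharTokenizer(reader) {
      protected char normalize(char c) {
        return Character.toLowerCase(c);  // still lower-case for case-insensitive searches
      }

      protected boolean isTokenChar(char c) {
        // letters, digits, and hyphens all stay inside a token
        return Character.isLetter(c) || Character.isDigit(c) || c == '-';
      }
    };
  }
}

Analyzing "Part AB-1234" with it yields the tokens "part" and "ab-1234".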

Calling .tokenStream and .next().termText() is not something your production code would need to do - but it's what happens under the covers in Lucene. If you are going to write a custom analyzer, you *should* write unit tests that "analyze" the analyzer using these lower-level methods.
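
Here's a minimal sketch of what such a test could look like (the test class name and the sample string are mine, in the JUnit 3 style of the day):

import java.io.IOException;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;

public class AlphanumericAnalyzerTest extends TestCase {
  public void testPunctuationAndDigits() throws IOException {
    TokenStream ts = new AlphanumericAnalyzer().tokenStream(
        "field", new StringReader("Dec. 26th, 2004"));

    // punctuation is discarded, digits survive, and everything is lower-cased
    assertEquals("dec", ts.next().termText());
    assertEquals("26th", ts.next().termText());
    assertEquals("2004", ts.next().termText());

    // next() returns null once the stream is exhausted
    assertNull(ts.next());
  }
}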

Lucene in Action covers the analysis topic deeply, but simply, and I spent a great deal of time toying with different customizations to analyzers in order to write about them. The sample code distribution includes utility methods and unit-test helpers to illustrate, test, and debug the analysis process. In retrospect, this very example I cobbled together to reply to this e-mail would have been a great one to add as well.
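
And since production code doesn't call those low-level methods itself, here is a quick sketch of where the analyzer actually plugs in - at indexing time via IndexWriter and at query time via QueryParser. This assumes the Lucene 1.4 API; the field name, RAMDirectory, and Field.Text are just choices to keep it self-contained:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class AlphanumericDemo {
  public static void main(String[] args) throws IOException, ParseException {
    RAMDirectory directory = new RAMDirectory();

    // the analyzer is invoked for us at index time...
    IndexWriter writer = new IndexWriter(directory, new AlphanumericAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("contents", "December 26, 2004"));
    writer.addDocument(doc);
    writer.close();

    // ...and again at query-parse time, so the terms line up
    Query query = QueryParser.parse("2004", "contents", new AlphanumericAnalyzer());
    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " hit(s)");  // should print "1 hit(s)"
    searcher.close();
  }
}

Because the same analyzer runs on both sides, the query term "2004" matches the indexed token "2004" even though the original text contained punctuation and mixed case.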

        Erik

