Re: Creating additional tokens from input in a token filter

Paul Taylor Thu, 03 Nov 2011 04:35:35 -0700

On 02/11/2011 20:48, Paul Taylor wrote:

On 02/11/2011 17:15, Uwe Schindler wrote:
Hi Paul,
There is WordDelimiterFilter which does exactly what you want. In 3.x it's
unfortunately only shipped in Solr JAR file, but in 4.0 it's in the
analyzers-common module.
Okay, so I found it and it looks very interesting, but it is really overly complex for what I want to do and doesn't handle my specific case. Could anyone possibly give a code example of how I create two tokens from one? Assume I already know how to split it (I can't work that bit out).

I took another look at WordDelimiterFilter and managed to get it to work. Sweet, thanks very much.

In case it is of interest to others, and because I had to hack WordDelimiterFilter a little bit, this is my solution.

1. I changed my existing tokenizer to convert control/punctuation chars to a '-' rather than dropping them

if (type == ALPHANUMANDPUNCTUATION) { // replace non-alphanumerics
    int upto = 0;
    for (int i = 0; i < bufferLength; i++) {
        char c = buffer[i];
        if (!Character.isLetterOrDigit(c)) {
            // Replace control/punctuation chars with '-' to help word delimiter
            buffer[upto++] = '-';
        } else {
            // Normal char
            buffer[upto++] = c;
        }
    }
}
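For anyone who wants to try step 1 outside a tokenizer, here is a minimal standalone sketch of the same replacement. The class name PunctuationToDash and the helper normalize are made up for illustration and are not part of Lucene or of my tokenizer; only Character.isLetterOrDigit is taken from the snippet above.

```java
public class PunctuationToDash {

    /**
     * Replaces every char that is not a letter or digit with '-', in place.
     * Returns the number of chars written (equal to bufferLength here,
     * because nothing is dropped any more).
     */
    static int replaceNonAlphanumerics(char[] buffer, int bufferLength) {
        int upto = 0;
        for (int i = 0; i < bufferLength; i++) {
            char c = buffer[i];
            buffer[upto++] = Character.isLetterOrDigit(c) ? c : '-';
        }
        return upto;
    }

    /** Hypothetical convenience wrapper for trying the loop on a String. */
    static String normalize(String term) {
        char[] buffer = term.toCharArray();
        int length = replaceNonAlphanumerics(buffer, buffer.length);
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        System.out.println(normalize("AC/DC"));   // prints AC-DC
        System.out.println(normalize("R.E.M."));  // prints R-E-M-
    }
}
```

The point of keeping a '-' rather than dropping the char is that the word boundary survives in the term text, so a later filter still has something to split on.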

2. I took a copy of WordDelimiterFilter and WordDelimiterIterator and modified it slightly so that it only does anything for tokens whose type attribute equals ALPHANUMANDPUNCTUATION (I couldn't see any constructor that would let me set this)


public boolean incrementToken() throws IOException {
    while (true) {
      if (!hasSavedState) {
        // process a new input word
        if (!input.incrementToken()) {
          return false;
        }

        // Use WordDelimiter just on these tokens; pass every other type through unchanged.
        // Compare with equals() rather than != so the check does not rely on string identity.
        if (!MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION].equals(typeAttribute.type())) {
            return true;
        }
        ...................
}

3. I added my WordDelimiterFilter and set it to generateWordParts only

streams.filteredTokenStream = new WordDelimiterFilter(streams.filteredTokenStream,
                                          WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
                                          1,  // generateWordParts
                                          0,
                                          0,
                                          0,
                                          0,
                                          0,
                                          0,
                                          0,
                                          0,
                                          null);
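To show the combined effect of steps 2 and 3 without needing the Lucene/Solr jars, here is a rough plain-Java simulation. It is only a sketch under simplifying assumptions: the Token class, the type label strings and the process method are invented for illustration, and splitting on '-' stands in for what word-part generation does to these tokens. The real filter also tracks offsets and position increments and treats purely numeric parts differently, none of which is modelled here.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeGatedSplitter {

    // Hypothetical type label standing in for the tokenizer's token type string.
    static final String ALPHANUMANDPUNCTUATION = "<ALPHANUMANDPUNCTUATION>";

    /** Minimal stand-in for a token: term text plus type (not a Lucene class). */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Tokens of any other type pass through untouched; tokens of the
     * gated type are split on '-' and each non-empty part is emitted.
     */
    static List<String> process(List<Token> input) {
        List<String> out = new ArrayList<>();
        for (Token token : input) {
            if (!ALPHANUMANDPUNCTUATION.equals(token.type)) {
                out.add(token.text);   // not our type: leave as-is
                continue;
            }
            for (String part : token.text.split("-")) {
                if (!part.isEmpty()) { // skip empties from leading or doubled '-'
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("AC-DC", ALPHANUMANDPUNCTUATION));
        tokens.add(new Token("live", "<ALPHANUM>"));
        System.out.println(process(tokens)); // prints [AC, DC, live]
    }
}
```

Gating on the token type is what keeps ordinary tokens from being touched by the splitting logic at all.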

Cheers Paul
