I was trying to apply both
org.apache.solr.analysis.WordDelimiterFilter and
org.apache.lucene.analysis.ngram.NGramTokenFilter.
Can I achieve this with Lucene's TokenStream?
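Chaining itself should be possible, since a TokenFilter is a TokenStream that wraps another TokenStream (the decorator pattern). Below is a minimal sketch of that chaining idea in plain Java, with made-up class names and simplified behavior; it is NOT the real Lucene/Solr API (the actual WordDelimiterFilter and NGramTokenFilter constructors take more options):

```java
import java.util.*;

// Illustrative stand-in for Lucene's TokenStream (not the real interface).
interface TokenStream {
    String next(); // returns null when exhausted
}

// Whitespace tokenizer feeding the chain.
class ListTokenizer implements TokenStream {
    private final Iterator<String> it;
    ListTokenizer(String text) { it = Arrays.asList(text.split("\\s+")).iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// Splits tokens on '-' the way WordDelimiterFilter splits on intra-word delimiters.
class DelimiterFilter implements TokenStream {
    private final TokenStream input;
    private final Deque<String> pending = new ArrayDeque<String>();
    DelimiterFilter(TokenStream input) { this.input = input; }
    public String next() {
        while (pending.isEmpty()) {
            String t = input.next();
            if (t == null) return null;
            pending.addAll(Arrays.asList(t.split("-")));
        }
        return pending.poll();
    }
}

// Emits character n-grams of each token, like NGramTokenFilter.
class NGramFilter implements TokenStream {
    private final TokenStream input;
    private final int n;
    private final Deque<String> pending = new ArrayDeque<String>();
    NGramFilter(TokenStream input, int n) { this.input = input; this.n = n; }
    public String next() {
        while (pending.isEmpty()) {
            String t = input.next();
            if (t == null) return null;
            for (int i = 0; i + n <= t.length(); i++) pending.add(t.substring(i, i + n));
        }
        return pending.poll();
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        // Filters nest: ngram(delimiter(tokenizer)), just like the real chain.
        TokenStream ts = new NGramFilter(new DelimiterFilter(new ListTokenizer("wi-fi card")), 2);
        List<String> out = new ArrayList<String>();
        for (String t = ts.next(); t != null; t = ts.next()) out.add(t);
        System.out.println(out);  // [wi, fi, ca, ar, rd]
    }
}
```

The point of the sketch is only the wrapping: each filter pulls from the stream it wraps, so composing them is a matter of constructor nesting inside your Analyzer.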
While thinking about TokenFilters, I came to the idea that
the TokenStream should have a structured representation.
It is much like what we do with an XML SAX reader: XML is a
serialized character stream, but it also has a structure.
As an example of what happens in the TokenFilter process,
suppose we have one RAW term.
---------------------- [RAW]
term
----------------------
and it will be tokenized to
---------------------- [TOKENIZED1]
termB1
termA < > termC
termB2
----------------------
and the next token filter may tokenize it to
---------------------- [TOKENIZED2]
termB1-1 - termB1-2
termA < > termC
termB2
----------------------
and the next token filter may tokenize it to
---------------------- [TOKENIZED3]
termB1-1-1
< > - termB1-2
termB1-1-2
termA < > termC
termB2
----------------------
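For reference, today's flat TokenStream can encode the TOKENIZED1-style stacking only through position increments: a token emitted with positionIncrement 0 occupies the same position as the previous token (that is how stacked terms like termB1/termB2 are indexed). A minimal sketch, in plain Java rather than the Lucene API, of how absolute positions are reconstructed from those increments:

```java
import java.util.*;

public class PositionDemo {
    // Reconstruct absolute positions from (term, positionIncrement) pairs,
    // the way an indexer accumulates Token.getPositionIncrement().
    static Map<Integer, List<String>> byPosition(String[][] tokens) {
        Map<Integer, List<String>> map = new TreeMap<Integer, List<String>>();
        int pos = -1;
        for (String[] t : tokens) {
            pos += Integer.parseInt(t[1]);            // increment 0 = same position
            if (!map.containsKey(pos)) map.put(pos, new ArrayList<String>());
            map.get(pos).add(t[0]);
        }
        return map;
    }

    public static void main(String[] args) {
        // The TOKENIZED1 stream: termB2 stacks on termB1 via increment 0.
        String[][] tokens = {
            {"termA", "1"}, {"termB1", "1"}, {"termB2", "0"}, {"termC", "1"}
        };
        System.out.println(byPosition(tokens));
        // {0=[termA], 1=[termB1, termB2], 2=[termC]}
    }
}
```

This flat encoding can say "these terms share a position", but it has no way to group termB1-1 and termB1-2 back together, which is exactly the TOKENIZED2/TOKENIZED3 problem below.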
Then, what should we do when indexing and querying with this TokenStream?
I read the code and see that the current Lucene implementation can handle
a TOKENIZED1 query (in org.apache.lucene.queryParser.QueryParser#getFieldQuery),
but it can't handle TOKENIZED2 or TOKENIZED3. ... Is this right?
One solution may be using Token.type to describe the structure, like this:
<token type="and">
<token type="word" value="termA"/>
<token type="or">
<token type="and">
<token type="or">
<token type="word" value="termB1-1-1"/>
<token type="word" value="termB1-1-2"/>
</token>
<token type="word" value="termB1-2"/>
</token>
<token type="word" value="termB2"/>
</token>
<token type="word" value="termC"/>
</token>
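To illustrate what that first solution would buy us, here is a toy tree of typed token nodes with a matching evaluator; it mirrors the XML above but is entirely hypothetical code, not anything in Lucene:

```java
import java.util.*;

// Hypothetical structured token node; type is "and", "or", or "word".
class TokenNode {
    final String type, value;
    final List<TokenNode> children = new ArrayList<TokenNode>();
    TokenNode(String type, String value) { this.type = type; this.value = value; }
    TokenNode add(TokenNode c) { children.add(c); return this; }

    // Does a document (modeled as a set of indexed terms) satisfy this tree?
    boolean matches(Set<String> doc) {
        if (type.equals("word")) return doc.contains(value);
        boolean isAnd = type.equals("and");
        for (TokenNode c : children) {
            if (isAnd && !c.matches(doc)) return false;   // "and": all children
            if (!isAnd && c.matches(doc)) return true;    // "or": any child
        }
        return isAnd;
    }
}

public class TreeDemo {
    public static void main(String[] args) {
        // The structure from the XML above:
        // termA AND ((termB1-1-? variants) OR termB2) AND termC, simplified to TOKENIZED2.
        TokenNode q = new TokenNode("and", null)
            .add(new TokenNode("word", "termA"))
            .add(new TokenNode("or", null)
                .add(new TokenNode("and", null)
                    .add(new TokenNode("word", "termB1-1"))
                    .add(new TokenNode("word", "termB1-2")))
                .add(new TokenNode("word", "termB2")))
            .add(new TokenNode("word", "termC"));

        Set<String> doc1 = new HashSet<String>(Arrays.asList("termA", "termB2", "termC"));
        Set<String> doc2 = new HashSet<String>(Arrays.asList("termA", "termB1-1", "termC"));
        System.out.println(q.matches(doc1));  // true  (termB2 satisfies the "or")
        System.out.println(q.matches(doc2));  // false (termB1-2 is missing)
    }
}
```

With such a tree, a query builder could walk the nodes and emit nested BooleanQuery clauses instead of the flat token list that getFieldQuery sees today.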
Another solution may be adding an internal flag table to Token.flag or
TokenStream to describe the TokenStream structure.
Does anybody have suggestions?
---- Hiroaki Kawai