I was trying to apply both
org.apache.solr.analysis.WordDelimiterFilter and
org.apache.lucene.analysis.ngram.NGramTokenFilter.
Can I achieve this with Lucene's TokenStream?
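Chaining itself should be possible, since a TokenFilter is a TokenStream that wraps another TokenStream (the decorator pattern). Below is a minimal sketch of that chaining idea in plain Java, with made-up class names and simplified behavior; it is NOT the real Lucene/Solr API (the actual WordDelimiterFilter and NGramTokenFilter constructors take more options):

```java
import java.util.*;

// Illustrative stand-in for Lucene's TokenStream (not the real interface).
interface TokenStream {
    String next(); // returns null when exhausted
}

// Whitespace tokenizer feeding the chain.
class ListTokenizer implements TokenStream {
    private final Iterator<String> it;
    ListTokenizer(String text) { it = Arrays.asList(text.split("\\s+")).iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// Splits tokens on '-' the way WordDelimiterFilter splits on intra-word delimiters.
class DelimiterFilter implements TokenStream {
    private final TokenStream input;
    private final Deque<String> pending = new ArrayDeque<String>();
    DelimiterFilter(TokenStream input) { this.input = input; }
    public String next() {
        while (pending.isEmpty()) {
            String t = input.next();
            if (t == null) return null;
            pending.addAll(Arrays.asList(t.split("-")));
        }
        return pending.poll();
    }
}

// Emits character n-grams of each token, like NGramTokenFilter.
class NGramFilter implements TokenStream {
    private final TokenStream input;
    private final int n;
    private final Deque<String> pending = new ArrayDeque<String>();
    NGramFilter(TokenStream input, int n) { this.input = input; this.n = n; }
    public String next() {
        while (pending.isEmpty()) {
            String t = input.next();
            if (t == null) return null;
            for (int i = 0; i + n <= t.length(); i++) pending.add(t.substring(i, i + n));
        }
        return pending.poll();
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        // Filters nest: ngram(delimiter(tokenizer)), just like the real chain.
        TokenStream ts = new NGramFilter(new DelimiterFilter(new ListTokenizer("wi-fi card")), 2);
        List<String> out = new ArrayList<String>();
        for (String t = ts.next(); t != null; t = ts.next()) out.add(t);
        System.out.println(out);  // [wi, fi, ca, ar, rd]
    }
}
```

The point of the sketch is only the wrapping: each filter pulls from the stream it wraps, so composing them is a matter of constructor nesting inside your Analyzer.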
While thinking about TokenFilters, I came to the idea that
the TokenStream should have a structured representation.
It is much like what we do with an XML SAX reader: XML is a
serialized character stream, but it also has a structure.
As an example of what happens in the TokenFilter process,
suppose we have one RAW term.
---------------------- [RAW]
term
----------------------
and it will be tokenized to
---------------------- [TOKENIZED1]
termB1
termA < > termC
termB2
----------------------
and the next token filter may tokenize it to
---------------------- [TOKENIZED2]
termB1-1 - termB1-2
termA < > termC
termB2
----------------------
and the next token filter may tokenize it to
---------------------- [TOKENIZED3]
termB1-1-1
< > - termB1-2
termB1-1-2
termA < > termC
termB2
----------------------
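For reference, today's flat TokenStream can encode the TOKENIZED1-style stacking only through position increments: a token emitted with positionIncrement 0 occupies the same position as the previous token (that is how stacked terms like termB1/termB2 are indexed). A minimal sketch, in plain Java rather than the Lucene API, of how absolute positions are reconstructed from those increments:

```java
import java.util.*;

public class PositionDemo {
    // Reconstruct absolute positions from (term, positionIncrement) pairs,
    // the way an indexer accumulates Token.getPositionIncrement().
    static Map<Integer, List<String>> byPosition(String[][] tokens) {
        Map<Integer, List<String>> map = new TreeMap<Integer, List<String>>();
        int pos = -1;
        for (String[] t : tokens) {
            pos += Integer.parseInt(t[1]);            // increment 0 = same position
            if (!map.containsKey(pos)) map.put(pos, new ArrayList<String>());
            map.get(pos).add(t[0]);
        }
        return map;
    }

    public static void main(String[] args) {
        // The TOKENIZED1 stream: termB2 stacks on termB1 via increment 0.
        String[][] tokens = {
            {"termA", "1"}, {"termB1", "1"}, {"termB2", "0"}, {"termC", "1"}
        };
        System.out.println(byPosition(tokens));
        // {0=[termA], 1=[termB1, termB2], 2=[termC]}
    }
}
```

This flat encoding can say "these terms share a position", but it has no way to group termB1-1 and termB1-2 back together, which is exactly the TOKENIZED2/TOKENIZED3 problem below.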
Then, what should we do when indexing and querying with this TokenStream?
I read the code and see that the current Lucene implementation can handle
a TOKENIZED1 query (in org.apache.lucene.queryParser.QueryParser#getFieldQuery),
but it can't handle TOKENIZED2 or TOKENIZED3. ... Is this right?
One solution may be using Token.type to describe the structure, like this:
<token type="and">
<token type="word" value="termA"/>
<token type="or">
<token type="and">
<token type="or">
<token type="word" value="termB1-1-1"/>
<token type="word" value="termB1-1-2"/>
</token>
<token type="word" value="termB1-2"/>
</token>
<token type="word" value="termB2"/>
</token>
<token type="word" value="termC"/>
</token>
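To illustrate what that first solution would buy us, here is a toy tree of typed token nodes with a matching evaluator; it mirrors the XML above but is entirely hypothetical code, not anything in Lucene:

```java
import java.util.*;

// Hypothetical structured token node; type is "and", "or", or "word".
class TokenNode {
    final String type, value;
    final List<TokenNode> children = new ArrayList<TokenNode>();
    TokenNode(String type, String value) { this.type = type; this.value = value; }
    TokenNode add(TokenNode c) { children.add(c); return this; }

    // Does a document (modeled as a set of indexed terms) satisfy this tree?
    boolean matches(Set<String> doc) {
        if (type.equals("word")) return doc.contains(value);
        boolean isAnd = type.equals("and");
        for (TokenNode c : children) {
            if (isAnd && !c.matches(doc)) return false;   // "and": all children
            if (!isAnd && c.matches(doc)) return true;    // "or": any child
        }
        return isAnd;
    }
}

public class TreeDemo {
    public static void main(String[] args) {
        // The structure from the XML above:
        // termA AND ((termB1-1-? variants) OR termB2) AND termC, simplified to TOKENIZED2.
        TokenNode q = new TokenNode("and", null)
            .add(new TokenNode("word", "termA"))
            .add(new TokenNode("or", null)
                .add(new TokenNode("and", null)
                    .add(new TokenNode("word", "termB1-1"))
                    .add(new TokenNode("word", "termB1-2")))
                .add(new TokenNode("word", "termB2")))
            .add(new TokenNode("word", "termC"));

        Set<String> doc1 = new HashSet<String>(Arrays.asList("termA", "termB2", "termC"));
        Set<String> doc2 = new HashSet<String>(Arrays.asList("termA", "termB1-1", "termC"));
        System.out.println(q.matches(doc1));  // true  (termB2 satisfies the "or")
        System.out.println(q.matches(doc2));  // false (termB1-2 is missing)
    }
}
```

With such a tree, a query builder could walk the nodes and emit nested BooleanQuery clauses instead of the flat token list that getFieldQuery sees today.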
Another solution may be adding an internal flag table to Token.flag or
TokenStream to describe the TokenStream structure.
Does anybody have suggestions?
---- Hiroaki Kawai