OK, I wasn't clear enough. I have documents in which I'm looking for 3 consecutive elements: <string1> <#1> <#2> (where <string1> comes from a predefined list of words).
I want to disregard the documents without this sequence and build a reverse index for those that contain these markers. It looks to me like parsing won't do the job, since my documents are unstructured and do not follow a specific grammar. The triplets are to be found in the body of the document (if there are any).

At the moment I have the impression that I need two passes over the document stream:

1. Pass 1: extract the triplets (with a TokenFilter ???). If there are no triplets, disregard the document.
2. Pass 2: index the document stream, with keywords for the triplets found and text (StandardAnalyzer) for the actual body.

An additional issue: I'm finding that the numbers might be in Roman notation. Any idea how to check whether a token could be a Roman number or is just another random string?

-Ray-

On Tue, Sep 2, 2008 at 7:43 PM, Chris Hostetter <[EMAIL PROTECTED]> wrote:

> I may be misunderstanding your question, but i wouldn't attempt to tackle
> this with a TokenFilter unless you want both the "tag" and the numbers to
> appear in the same field. i think what you want to do is first parse
> whatever file format you are dealing with, then build Documents based on
> the individual Fields.
>
> a TokenFilter comes into play when you are Analyzing individual Field
> values.
>
> but since i have very little understanding of your problem, and what you
> are trying to achieve, i may be way off base.
>
> : <tag> <#1> <#2>
> :
> : <tag> is a fixed list of words
> : <#x> are small numbers <100
> :
> : My idea is to simply build a TokenFilter that will look for those... do I
> : have it right ?
> :
> : Some side questions:
> : what if I want to index <tag> <#1> <#2> as keywords ?
> : what if I also want to give full text search on the select documents ?
>
>
> -Hoss
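
To make pass 1 and the Roman-number check concrete, here is a rough sketch of what I have in mind. It is plain Java (java.util.regex only, no Lucene classes), and every name in it (TripletScanner, smallNumber, findTriplets) is made up for illustration. It assumes the numbers are 0..99 in Arabic digits or 1..99 in strict Roman notation, and that the tag list is given in lower case. Note that ordinary words such as "I" or "mix" are also syntactically valid Roman numerals, so the check only seems meaningful when the token directly follows a known tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for pass 1: finds <tag> <#1> <#2> triplets in raw text.
final class TripletScanner {
    // Strict Roman numeral syntax (1..3999); rejects forms like "IIII" or "VX".
    private static final Pattern ROMAN = Pattern.compile(
        "M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})");

    private static int digit(char c) {
        switch (c) {
            case 'I': return 1;
            case 'V': return 5;
            case 'X': return 10;
            case 'L': return 50;
            case 'C': return 100;
            case 'D': return 500;
            case 'M': return 1000;
            default:  return 0;
        }
    }

    // Value of an already validated, upper-case Roman numeral.
    static int romanValue(String roman) {
        int total = 0;
        for (int i = 0; i < roman.length(); i++) {
            int cur = digit(roman.charAt(i));
            int next = i + 1 < roman.length() ? digit(roman.charAt(i + 1)) : 0;
            total += cur < next ? -cur : cur; // subtractive pairs such as IV, XC
        }
        return total;
    }

    // Returns the token's value if it is a small number (< 100), else -1.
    static int smallNumber(String token) {
        if (token.matches("\\d{1,2}")) {
            return Integer.parseInt(token);
        }
        String up = token.toUpperCase(Locale.ROOT);
        if (up.isEmpty() || !ROMAN.matcher(up).matches()) {
            return -1;
        }
        int value = romanValue(up);
        return value < 100 ? value : -1;
    }

    // Slides a window of three tokens over the body and collects normalized
    // "tag n1 n2" strings. An empty result means "disregard the document".
    static List<String> findTriplets(String body, Set<String> lowerCaseTags) {
        String[] tokens = body.split("[^\\p{L}\\p{N}]+");
        List<String> found = new ArrayList<>();
        for (int i = 0; i + 2 < tokens.length; i++) {
            String tag = tokens[i].toLowerCase(Locale.ROOT);
            int first = smallNumber(tokens[i + 1]);
            int second = smallNumber(tokens[i + 2]);
            if (lowerCaseTags.contains(tag) && first >= 0 && second >= 0) {
                found.add(tag + " " + first + " " + second);
            }
        }
        return found;
    }
}
```

The strings returned by findTriplets could then be added in pass 2 as untokenized keyword values, alongside a separate analyzed field for the body text; that wiring is left out here because it depends on the Lucene version in use.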