Well, that much is explained in "Lucene in Action": if you want to search files, you have to build a file parser, and there is a good example given there. So the extraction itself is not really my problem.
But I thought I could go through the token stream only once, whereas I now have to go through it twice: 1. for detecting my triplets, 2. for indexing the text. (A single-pass sketch follows below the quoted message.)

-Raymond-

On Tue, Sep 9, 2008 at 12:27 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:

> : I think I'm getting you. But the files I'm going to parse have many
> : formats: PDF, HTML, Word.
> : they don't have a particular structure, memos if you will. But the ones
> : I'm interested in will have the triplets I described
>
> Ahhhh... see this is something I completely didn't realize. "Lucene" as
> a library really doesn't provide any sort of mechanism for doing text
> extraction from unknown file formats ... With some small exceptions (like
> the HTMLStripTokenizer in Solr) the TokenStream concept is much more about
> finding "Tokens" from a stream of plain text -- not about finding "Text"
> in arbitrary (possibly binary) files.
>
> You'll probably want to check out the Tika subproject...
> http://incubator.apache.org/tika/
> ...or some of the various "How do I index _____ documents?" FAQs...
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> -Hoss
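On the single-pass question: one way to avoid reading the text twice is a pass-through TokenFilter that observes each token for triplet detection while handing it on to the indexer unchanged. Below is a minimal sketch against the Lucene 2.x-era TokenStream API; TripletSpottingFilter and TripletCollector are hypothetical names introduced for illustration, not part of Lucene.

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Pass-through filter: inspects each token for triplet detection
     *  while letting it flow on to the indexer, so the underlying
     *  stream is consumed only once. */
    public class TripletSpottingFilter extends TokenFilter {

        /** Hypothetical callback that accumulates candidate triplets. */
        public interface TripletCollector {
            void observe(String term);
        }

        private final TripletCollector collector;

        public TripletSpottingFilter(TokenStream input, TripletCollector collector) {
            super(input);
            this.collector = collector;
        }

        public Token next() throws IOException {
            Token token = input.next();              // pull each token once
            if (token != null) {
                collector.observe(token.termText()); // side effect: triplet detection
            }
            return token;                            // unchanged token goes to the index
        }
    }

Wrapped into a custom Analyzer, the triplet detection then happens as a side effect of IndexWriter.addDocument(), with no second pass over the text.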
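And for the extraction step Hoss points to: here is a minimal sketch using Tika's AutoDetectParser and BodyContentHandler as they appear in later Tika releases (the incubator-era API may differ), turning a PDF/HTML/Word file into the plain text a TokenStream expects.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                // detects the file format and extracts its text content
                parser.parse(in, handler, metadata);
            }
            System.out.println(handler.toString()); // plain text, ready for analysis
        }
    }

The extracted string can then be fed to an Analyzer (e.g. through the filter sketched above) and indexed in a single pass.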