Hi Raymond,

Check out SinkTokenizer/TeeTokenFilter:
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/TeeTokenFilter.html>
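
The basic pattern looks something like this (an untested sketch against
the 2.3 API; StandardTokenizer and the field names "text" and "triplets"
are just placeholders for your own chain):

    import java.util.ArrayList;
    import org.apache.lucene.analysis.SinkTokenizer;
    import org.apache.lucene.analysis.TeeTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Every token the tee passes through is also copied into the sink,
    // so one pass over the text feeds two consumers.
    SinkTokenizer sink = new SinkTokenizer(new ArrayList());
    TokenStream source =
        new TeeTokenFilter(new StandardTokenizer(reader), sink);

    Document doc = new Document();
    // Add the tee'd field first: consuming it is what fills the sink.
    doc.add(new Field("text", source));
    // The sink then replays the captured tokens for a second field
    // (e.g. run your triplet detection over them here).
    doc.add(new Field("triplets", sink));

So you only tokenize once and still get two streams.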
Look at the unit tests for usage hints:
<http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/analysis/TeeSinkTokenTest.java?revision=687357&view=markup>

Steve

On 09/09/2008 at 9:11 AM, Raymond Balmès wrote:
> Well, that is well explained in "Lucene in Action": if you want to
> search files you have to build a file parser, and there is a good
> example given. So that's not really my problem.
>
> But I thought I could go through the token stream only once, whereas
> I have to go through it twice: 1. to detect my triplets, 2. to index
> the text.
>
> -Raymond-
>
> On Tue, Sep 9, 2008 at 12:27 AM, Chris Hostetter
> <[EMAIL PROTECTED]> wrote:
> >
> > > I think I'm getting you. But the files I'm going to parse have
> > > many formats: PDF, HTML, Word. They don't have a particular
> > > structure -- memos, if you will. But the ones I'm interested in
> > > will have the triplets I described.
> >
> > Ahhhh... see, this is something I completely didn't realize.
> > "Lucene" as a library really doesn't provide any sort of mechanism
> > for doing text extraction from unknown file formats ... With some
> > small exceptions (like the HTMLStripTokenizer in Solr) the
> > TokenStream concept is much more about finding "Tokens" in a stream
> > of plain text -- not about finding "Text" in arbitrary (possibly
> > binary) files.
> >
> > You'll probably want to check out the Tika subproject...
> > http://incubator.apache.org/tika/
> > ...or some of the various "How do I index _____ documents?" FAQs...
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> > -Hoss
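
P.S. With Tika, the extraction step Hoss mentions boils down to a few
lines. A rough, untested sketch (assuming a current Tika release on the
classpath; "inputStream" is whatever stream you open on the file):

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // AutoDetectParser sniffs the file type (PDF, HTML, Word, ...) and
    // delegates to the matching parser; BodyContentHandler collects
    // the extracted plain text.
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    parser.parse(inputStream, handler, new Metadata());
    String plainText = handler.toString();

You can then hand plainText to the tee/sink chain above.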