Hi Raymond,

Check out SinkTokenizer/TeeTokenFilter:
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/TeeTokenFilter.html>
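
The basic pattern looks something like this (an untested sketch against
the 2.3 API; StandardTokenizer and the field names "text" and "triplets"
are just placeholders for your own chain):

    import java.util.ArrayList;
    import org.apache.lucene.analysis.SinkTokenizer;
    import org.apache.lucene.analysis.TeeTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Every token the tee passes through is also copied into the sink,
    // so one pass over the text feeds two consumers.
    SinkTokenizer sink = new SinkTokenizer(new ArrayList());
    TokenStream source =
        new TeeTokenFilter(new StandardTokenizer(reader), sink);

    Document doc = new Document();
    // Add the tee'd field first: consuming it is what fills the sink.
    doc.add(new Field("text", source));
    // The sink then replays the captured tokens for a second field
    // (e.g. run your triplet detection over them here).
    doc.add(new Field("triplets", sink));

So you only tokenize once and still get two streams.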
Look at the unit tests for usage hints:
<http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/analysis/TeeSinkTokenTest.java?revision=687357&view=markup>

Steve

On 09/09/2008 at 9:11 AM, Raymond Balmès wrote:
> Well, that is well explained in "Lucene in Action": if you want to
> search files you have to build a file parser, and there is a good
> example given. So that's not really my problem.
>
> But I thought I could go through the token stream only once, whereas
> I have to go through it twice: 1. to detect my triplets, 2. to index
> the text.
>
> -Raymond-
>
> On Tue, Sep 9, 2008 at 12:27 AM, Chris Hostetter
> <[EMAIL PROTECTED]> wrote:
> >
> > > I think I'm getting you. But the files I'm going to parse have
> > > many formats: PDF, HTML, Word. They don't have a particular
> > > structure -- memos, if you will. But the ones I'm interested in
> > > will have the triplets I described.
> >
> > Ahhhh... see, this is something I completely didn't realize.
> > "Lucene" as a library really doesn't provide any sort of mechanism
> > for doing text extraction from unknown file formats ... With some
> > small exceptions (like the HTMLStripTokenizer in Solr) the
> > TokenStream concept is much more about finding "Tokens" in a stream
> > of plain text -- not about finding "Text" in arbitrary (possibly
> > binary) files.
> >
> > You'll probably want to check out the Tika subproject...
> > http://incubator.apache.org/tika/
> > ...or some of the various "How do I index _____ documents?" FAQs...
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> > -Hoss
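
P.S. With Tika, the extraction step Hoss mentions boils down to a few
lines. A rough, untested sketch (assuming a current Tika release on the
classpath; "inputStream" is whatever stream you open on the file):

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // AutoDetectParser sniffs the file type (PDF, HTML, Word, ...) and
    // delegates to the matching parser; BodyContentHandler collects
    // the extracted plain text.
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    parser.parse(inputStream, handler, new Metadata());
    String plainText = handler.toString();

You can then hand plainText to the tee/sink chain above.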