Well, that much is explained in "Lucene in Action": if you want to search files, you have to build a file parser, and there is a good example given there. So the extraction itself is not really my problem.
But I thought I could go through the token stream only once, whereas I now have to go through it twice: 1. for detecting my triplets, 2. for indexing the text. (A single-pass sketch follows below the quoted message.)

-Raymond-

On Tue, Sep 9, 2008 at 12:27 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:

> : I think I'm getting you. But the files I'm going to parse have many
> : formats: PDF, HTML, Word.
> : they don't have a particular structure, memos if you will. But the ones
> : I'm interested in will have the triplets I described
>
> Ahhhh... see this is something I completely didn't realize. "Lucene" as
> a library really doesn't provide any sort of mechanism for doing text
> extraction from unknown file formats ... With some small exceptions (like
> the HTMLStripTokenizer in Solr) the TokenStream concept is much more about
> finding "Tokens" from a stream of plain text -- not about finding "Text"
> in arbitrary (possibly binary) files.
>
> You'll probably want to check out the Tika subproject...
> http://incubator.apache.org/tika/
> ...or some of the various "How do I index _____ documents?" FAQs...
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> -Hoss
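On the single-pass question: one way to avoid reading the text twice is a pass-through TokenFilter that observes each token for triplet detection while handing it on to the indexer unchanged. Below is a minimal sketch against the Lucene 2.x-era TokenStream API; TripletSpottingFilter and TripletCollector are hypothetical names introduced for illustration, not part of Lucene.

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Pass-through filter: inspects each token for triplet detection
     *  while letting it flow on to the indexer, so the underlying
     *  stream is consumed only once. */
    public class TripletSpottingFilter extends TokenFilter {

        /** Hypothetical callback that accumulates candidate triplets. */
        public interface TripletCollector {
            void observe(String term);
        }

        private final TripletCollector collector;

        public TripletSpottingFilter(TokenStream input, TripletCollector collector) {
            super(input);
            this.collector = collector;
        }

        public Token next() throws IOException {
            Token token = input.next();              // pull each token once
            if (token != null) {
                collector.observe(token.termText()); // side effect: triplet detection
            }
            return token;                            // unchanged token goes to the index
        }
    }

Wrapped into a custom Analyzer, the triplet detection then happens as a side effect of IndexWriter.addDocument(), with no second pass over the text.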
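And for the extraction step Hoss points to: here is a minimal sketch using Tika's AutoDetectParser and BodyContentHandler as they appear in later Tika releases (the incubator-era API may differ), turning a PDF/HTML/Word file into the plain text a TokenStream expects.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                // detects the file format and extracts its text content
                parser.parse(in, handler, metadata);
            }
            System.out.println(handler.toString()); // plain text, ready for analysis
        }
    }

The extracted string can then be fed to an Analyzer (e.g. through the filter sketched above) and indexed in a single pass.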