Re: storing the contents of a document in the lucene index

Erick Erickson Wed, 23 Jul 2008 06:54:06 -0700

OK, I'm finally catching on. You have to change the demo code to
get the contents into something besides an input stream, so you
can use one of the alternate forms of the Field constructor. For
instance, you could read it all into a string and use the form:


// contents is a String holding the entire text of the file
doc.add(new Field("content", contents,
               Field.Store.YES, Field.Index.TOKENIZED));
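
For the read-it-all-into-a-string route, here is a minimal sketch of one way
to get the file contents. FileContents and readAll are hypothetical names,
not part of Lucene or its demo code, and the sketch assumes Java 7 or later
and that you know the encoding of your files:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; not part of Lucene or the demo code.
class FileContents {

    // Reads the whole file into memory and decodes it with the given
    // charset, so the text does not depend on the platform default encoding.
    static String readAll(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }
}
```

You would then call something like
FileContents.readAll(f.toPath(), StandardCharsets.UTF_8) and pass the result
to the Field constructor above. Naming the charset explicitly may also help
with the encoding issues you mentioned, though that depends on how your
files were written.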


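Or, you can do something like this, which gives the same results as the
above when searching:

BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(f)));
String line;
while ((line = in.readLine()) != null) {
     doc.add(new Field("content", line, Field.Store.YES,
                    Field.Index.TOKENIZED));
}
in.close();

You can add to the same field as often as you want. For searching, the
content of calls 2 to N is simply appended to the same field. The stored
values are kept separately, though, so use doc.getValues("content") rather
than doc.get("content") to get all of them back.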


Best
Erick


On Wed, Jul 23, 2008 at 3:42 AM, starz10de <[EMAIL PROTECTED]> wrote:

>
> Hi Erick,
>
> I don't remove stop words, as I index parallel corpora that are used
> for learning translations between pairs of languages, so every word is
> important. I even developed my own analyzer for Arabic, which just removes
> punctuation and special symbols and returns only Arabic text.
>
> I guess that in FileDocument.java the whole text is already stored:
>
> doc.add(Field.Text("contents", IN));
>
> where IN is
>
> IN = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
>
> If this is not the case, could you please tell me how to store the whole
> text inside the index?
>
> I am new to Lucene and I don't know how to use this "Field.Store.YES" to
> store the whole text.
>
>
>
> Best regards
> Farag
>
>
>
> starz10de wrote:
> >
> > Could anyone please tell me how to print the content of a document
> > after reading the index?
> > For example, if I want to print the index terms, I do:
> >
> > IndexReader ir = IndexReader.open(index);
> > TermEnum termEnum = ir.terms();
> > while (termEnum.next()) {
> >     TermDocs dok = ir.termDocs();
> >     dok.seek(termEnum);
> >     while (dok.next()) {
> >         System.out.println(termEnum.term().text().trim());
> >     }
> > }
> >
> > I can print the text files before indexing them, but because of encoding
> > issues I would like to print them from the index.
> > As far as I know, the content of the document (the whole text) is also
> > stored in the index; my question is how to print this content.
> >
> > So in the end I will print the path of the current document, the index
> > terms, and the content of the document.
> >
> >
> > thanks in advance
> >
>
