Re: Document term vectors in Lucene 4

Ian Lea Thu, 17 Jan 2013 02:07:33 -0800

When I run your code, as is except for using RAMDirectory and setting
up an IndexWriter using StandardAnalyzer


        RAMDirectory dir = new RAMDirectory();
        Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
        IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, anl);
        IndexWriter iw = new IndexWriter(dir, iwcfg);
        ...
        iw.addDocument(doc);
        iw.close();

it prints

doc 0 had 1 terms.

If I change the text to, e.g., "this is foobar gibberish", it says there
are 2 terms.  So it looks OK to me. "this" and "is" are presumably in the
default list of stop words.
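
To make the stop word point concrete, here is a rough plain-Java sketch
of the filtering step. It is not Lucene code and does not reproduce
StandardAnalyzer's real tokenization; the class name, the method name
and the regex split are made up for illustration, and the word list is
my assumption of what the default English stop set contains. It only
shows why "this is foobar" leaves one term and the longer text leaves
two.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java illustration only; this is NOT Lucene's StandardAnalyzer.
class StopWordTermCount {
    // Assumption: approximates the default English stop word list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
        "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    // Lowercase the text, split on anything that is not a letter or digit,
    // drop stop words, and return the distinct terms that remain (sorted).
    static Set<String> distinctTerms(String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String text : new String[] {"this is foobar", "this is foobar gibberish"}) {
            Set<String> terms = distinctTerms(text);
            System.out.println("\"" + text + "\" -> " + terms.size() + " terms " + terms);
        }
    }
}
```

A term vector holds the distinct terms that survive analysis, which is
why the sketch collects into a Set rather than counting tokens.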

Not relevant, but why are you using SlowCompositeReaderWrapper rather than just
IndexReader rdr = DirectoryReader.open(dir)?  I get the same results either way.


--
Ian.


On Thu, Jan 17, 2013 at 5:52 AM, Jon Stewart
<[email protected]> wrote:
> Hello,
>
> I cannot extract document term vectors from an index, and have not
> turned up much in some determined googling. In short, when I call
> IndexReader.getTermVector(docID, field) or
> IndexReader.getTermVectors(docID) and then navigate down to the Terms
> for the specified field, I get a null result.
>
> // Indexing:
>   String bodyText = "this is foobar";
>   final FieldType BodyOptions = new FieldType();
>   BodyOptions.setIndexed(true);
>   BodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>   BodyOptions.setStored(true);
>   BodyOptions.setStoreTermVectors(true);
>   BodyOptions.setTokenized(true);
>   Document doc = new Document();
>   doc.add(new Field("body", bodyText, BodyOptions));
>
> When I examine docs in Luke, I can see the term vectors.
>
> // Retrieving (at a later time)
>   DirectoryReader dirRdr = DirectoryReader.open(FSDirectory.open(new File(path)));
>   SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);
>   for (int i = 0; i < rdr.maxDoc(); ++i) {
>     int numTerms = 0;
>     Terms terms = rdr.getTermVector(i, "body");
>     if (terms != null) {
>       TermsEnum term = terms.iterator(null);
>       while (term.next() != null) {
>         ++numTerms;
>       }
>       System.out.println("doc " + i + " had " + numTerms + " terms");
>     }
>     else {
>       System.err.println("null term vector on doc " + i);
>     }
>   }
>
> On every doc, the Terms object I get back from getTermVector(i, "body")
> is null.
>
>
> Jon
> --
> Jon Stewart, Principal
> (646) 719-0317 | [email protected] | Arlington, VA
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

