Re: Document term vectors in Lucene 4

2013-01-18 Thread Jon Stewart
Thanks! I still can't see what was wrong with my original code--must have been a dumb typo somewhere--but starting over from that example now works on indices generated from my real indexing code. I will try to blog about it next week so there is some sample code up on the web for anyone else searc

Re: Document term vectors in Lucene 4

2013-01-18 Thread Ian Lea
To get stats from the whole index I think you need to come at this from a different direction. See the 4.0 migration guide for some details. With a variation on your code and 2 docs

doc1: foobar qux quote
doc2: foobar qux qux quorum

this code snippet

Fields fields = MultiFields.getFiel
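For anyone following along without a Lucene index to hand, the difference between per-document counts and whole-index counts on the two example docs can be sketched in plain Java. This is not Lucene API code: the class and method names (TermCountSketch, perDocCounts, corpusCounts) are made up for illustration, and tokenization is assumed to be a plain whitespace split rather than an Analyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermCountSketch {

    // Count how often each whitespace-separated token occurs in one document.
    static Map<String, Integer> perDocCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // an empty document yields no tokens
            }
            counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    // Sum the per-document counts over every document in the corpus.
    static Map<String, Integer> corpusCounts(List<String> docs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String doc : docs) {
            perDocCounts(doc).forEach((term, n) -> totals.merge(term, n, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // The two example docs from the message above.
        List<String> docs = Arrays.asList("foobar qux quote", "foobar qux qux quorum");
        for (String doc : docs) {
            System.out.println("per-doc: " + perDocCounts(doc));
        }
        System.out.println("corpus:  " + corpusCounts(docs));
    }
}
```

On these two docs, qux occurs once in doc1 and twice in doc2, so its whole-index total is 3.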

Re: Document term vectors in Lucene 4

2013-01-17 Thread Jon Stewart
D'oh. Thanks! Does TermsEnum.totalTermFreq() return the per-doc frequencies? It looks like it empirically, but the documentation refers to corpus usage, not document.field usage.

Jon

On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea wrote:
> typo time. You need doc2.add(...) not 2 doc.add(...) stat

Re: Document term vectors in Lucene 4

2013-01-17 Thread Ian Lea
typo time. You need doc2.add(...) not 2 doc.add(...) statements.

--
Ian.

On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart wrote:
> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
>> Which statistics in particular (which methods)?
>
> I'd like to know the frequency of each term in each docume

Re: Document term vectors in Lucene 4

2013-01-17 Thread Jon Stewart
On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
> Which statistics in particular (which methods)?

I'd like to know the frequency of each term in each document. Those term counts for the most frequent terms in the corpus will make it into the document vectors for clustering. Looking at Terms
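As a rough illustration of the goal described here (per-document counts, restricted to the most frequent corpus terms, used as document vectors), a minimal plain-Java sketch follows. It does not use Lucene at all: DocVectorSketch, countTerms, topTerms, and vector are hypothetical names, the whitespace tokenization is an assumption, and ties between equally frequent terms are broken alphabetically purely to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DocVectorSketch {

    // Per-document term counts from a whitespace split (stand-in for an Analyzer).
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    // The n most frequent terms across all docs; ties broken alphabetically.
    static List<String> topTerms(List<String> docs, int n) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String doc : docs) {
            countTerms(doc).forEach((term, c) -> totals.merge(term, c, Integer::sum));
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> vocab = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            vocab.add(entries.get(i).getKey());
        }
        return vocab;
    }

    // One vector component per vocabulary term: that term's count in the doc.
    static int[] vector(String doc, List<String> vocab) {
        Map<String, Integer> counts = countTerms(doc);
        int[] v = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            v[i] = counts.getOrDefault(vocab.get(i), 0);
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("foobar qux quote", "foobar qux qux quorum");
        List<String> vocab = topTerms(docs, 2);
        System.out.println("vocabulary: " + vocab);
        for (String doc : docs) {
            System.out.println(Arrays.toString(vector(doc, vocab)));
        }
    }
}
```

With a real index, the per-document counts would come from the stored term vectors and the corpus totals from the index-wide statistics; the sketch only shows how the two fit together.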

Re: Document term vectors in Lucene 4

2013-01-17 Thread Robert Muir
Which statistics in particular (which methods)?

On Thu, Jan 17, 2013 at 5:10 AM, Jon Stewart wrote:
> Thanks very much for your reply, Ian.
>
> I am using SlowCompositeReaderWrapper because I am also retrieving the
> term frequency statistics for the corpus (at the end of the day, I am
> doing so

Re: Document term vectors in Lucene 4

2013-01-17 Thread Jon Stewart
Thanks very much for your reply, Ian. I am using SlowCompositeReaderWrapper because I am also retrieving the term frequency statistics for the corpus (at the end of the day, I am doing some machine learning/document clustering). Despite its name and warning documentation not to use it, SlowComposi

Re: Document term vectors in Lucene 4

2013-01-17 Thread Ian Lea
When I run your code, as is except for using RAMDirectory and setting up an IndexWriter using StandardAnalyzer

RAMDirectory dir = new RAMDirectory();
Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, a

Document term vectors in Lucene 4

2013-01-16 Thread Jon Stewart
Hello, I cannot extract document term vectors from an index, and have not turned up much in some determined googling. In short, when I call IndexReader.getTermVector(docID, field) or IndexReader.getTermVectors(docID) and then navigate down to the Terms for the specified field, I get a null result.