Yes, thanks Grant. I realize that if I needed the term frequencies for all the documents I could use TermEnum, but my use case may need term frequencies for only selected documents. The worst case would be term frequencies for n-1 documents, where n is the total number of documents in the index.

-Amit


On Aug 9, 2006, at 2:34 PM, Grant Ingersoll wrote:

Hi Amit,

If you want all the freqs of all the terms (or even just some of the terms) in all documents, you don't need to use Term Vectors; take a look at TermEnum and TermDocs.

If you want them for specific documents, then you do need Term Vectors. You may save some CPU ticks by keeping only positions (and not offsets, assuming you don't need them), but I haven't confirmed this. I just figure it is slightly less data to read and write, so it should be faster.
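To illustrate the distinction, here is a minimal plain-Java sketch. It does not use Lucene at all: the class `TermFreqLayouts` and its methods are hypothetical names made up for this example, and the maps only model the two access patterns (term-major, as when walking terms and the documents they occur in, versus document-major, as with a per-document term vector). It makes no claim about how Lucene stores either structure on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermFreqLayouts {

    // Term-major model: term -> (docId -> frequency).
    // Convenient when frequencies are wanted across all documents.
    static Map<String, Map<Integer, Integer>> buildPostings(List<String[]> docs) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId)) {
                postings.computeIfAbsent(term, t -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    // Document-major model: docId -> (term -> frequency).
    // Convenient when frequencies are wanted for selected documents only.
    static List<Map<String, Integer>> buildTermVectors(List<String[]> docs) {
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (String[] doc : docs) {
            Map<String, Integer> freqs = new TreeMap<>();
            for (String term : doc) {
                freqs.merge(term, 1, Integer::sum);
            }
            vectors.add(freqs);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Toy, already-tokenized documents (made-up data).
        List<String[]> docs = Arrays.asList(
                new String[] {"lucene", "index", "lucene"},
                new String[] {"term", "vector"},
                new String[] {"index", "term", "term"});

        // All documents: a single pass over every term and its documents.
        buildPostings(docs).forEach((term, byDoc) ->
                System.out.println(term + " -> " + byDoc));

        // Selected documents: a direct lookup per document.
        List<Map<String, Integer>> vectors = buildTermVectors(docs);
        for (int docId : new int[] {0, 2}) {
            System.out.println("doc " + docId + " -> " + vectors.get(docId));
        }
    }
}
```

In this model, fetching frequencies for a handful of documents from the document-major layout touches only those documents, while assembling one document's frequencies from the term-major layout means visiting every term.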

No, you do not need to store your docs to have a term vector. Stored fields and term vectors are written to different parts of the index.

-Grant

On Aug 9, 2006, at 3:13 PM, Amit Kumar wrote:

Hi Lucene Users,

I am using Lucene indices to get term frequencies, and I wanted to check with you about the time it takes to retrieve them. Please suggest how I can improve the code or index, or tell me if this is expected. It takes 8 to 9 seconds to retrieve the term freq values of all 1030 documents, with an index size of ~530MB.

Another question: do I need Field.Store.YES to get the term freq vector?

Index Details:
-------------------
Size: 532 MB
1032 documents, each with between 600 and 100,000 terms
The field is indexed with Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS


Term Freq Retrieval Time Values:
-------------------------------------

The time ranges from 8 to 9 seconds.

    long s = System.currentTimeMillis();
    TermFreqVector termFreqVector;
    for (int i = 0; i < 1030; i++) {
      // Skip documents that are marked as deleted.
      if (!reader.isDeleted(i)) {
        termFreqVector = reader.getTermFreqVector(i, field);
      }
    }
    long l = System.currentTimeMillis();
    // Elapsed time in milliseconds is l - s.
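
A note on the measurement: a single timed pass mixes cold-cache reads and JVM warm-up with the steady-state cost, so timing several passes can help show where the 8 to 9 seconds goes. Below is a minimal sketch of such a harness. The class `RetrievalTimer`, the method `timePasses`, and the stand-in workload are hypothetical names for illustration; in the loop above, the per-document call would be `reader.getTermFreqVector(doc, field)`.

```java
import java.util.function.IntConsumer;

public class RetrievalTimer {

    // Runs `passes` full passes over document ids 0..docCount-1 and
    // returns the elapsed wall-clock time of each pass in milliseconds.
    static long[] timePasses(int docCount, int passes, IntConsumer fetch) {
        long[] elapsedMs = new long[passes];
        for (int p = 0; p < passes; p++) {
            long start = System.nanoTime();
            for (int doc = 0; doc < docCount; doc++) {
                fetch.accept(doc);
            }
            elapsedMs[p] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Stand-in workload (an assumption for this sketch); replace the
        // lambda body with the real per-document retrieval call.
        long[] sink = new long[1];
        long[] times = timePasses(1030, 5, doc -> sink[0] += doc);
        for (int p = 0; p < times.length; p++) {
            System.out.println("pass " + p + ": " + times[p] + " ms");
        }
    }
}
```

If the first pass is much slower than the later ones, disk reads are probably a large part of the cost; if all passes take about the same time, the cost is more likely in decoding the vectors themselves.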


Hardware and Memory Settings:
-------------------------------------------
-Xmx2048m -XX:PermSize=16m -XX:MaxPermSize=128m

Dual 1800 MHz Opteron on 32-bit Linux 2.6.15.2; Lucene 2.0.0.




How can I get better results? Can I?



Many thanks for your help.
-Amit





---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------





------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/








