Hey,
On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam <[email protected]> wrote:
> Hi,
>
> I am experimenting with the Lucene trunk (aka 4.0), especially with the new
> IndexDocValues feature. I am trying to store some query-independent
> statistics such as PageRank, etc. One stat that I am trying to store is the
> sum of all the term frequencies in a document. This can be seen as the
> document length. Is there a way to pre-compute this sum while performing the
> indexing?
Lucene already computes the length of the document in its
FieldInvertState, which is passed to the similarity, i.e. look at
Similarity#computeNorm. Currently the norm value is a single byte,
but I am working on exposing this via DocValues so you can store
custom data in your similarity.
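
To make the "sum of all term frequencies is the document length" point
concrete, here is a minimal sketch in plain Java. It deliberately uses no
Lucene classes: DocLengthSketch, termFrequencies and documentLength are
made-up names, and the whitespace tokenizer is only an assumption for
illustration, not what a Lucene analyzer does.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: plain Java, no Lucene classes. All names are hypothetical.
class DocLengthSketch {

    // Count how often each term occurs in one field of one document.
    // Assumes whitespace tokenization and lowercasing, purely for illustration.
    static Map<String, Integer> termFrequencies(String fieldText) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : fieldText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            freqs.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return freqs;
    }

    // Document length = sum of all term frequencies in the field.
    static long documentLength(Map<String, Integer> freqs) {
        long sum = 0;
        for (int f : freqs.values()) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs =
            termFrequencies("the quick fox jumps over the lazy fox");
        System.out.println(freqs);
        System.out.println(documentLength(freqs));
    }
}
```

Summing the frequencies gives the same number as counting the tokens, so a
single per-field counter kept while the field is inverted is enough; no
second pass over the terms is needed.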
simon
>
> Thank you,
> h.
>
>
>
>> TermVectors are still available in Lucene trunk (aka 4.0); we just changed
>> their implementation to fit the general Lucene Terms/Fields/… APIs.
>> TermVectors (if enabled for the document during indexing) are simply
>> something like a small per-document index written to disk like a stored
>> field (they have nothing to do with DocValues, which you mentioned).
>> Theoretically, you can execute a query against this small “TermVectors index”
>> and get exactly one hit if the query matches the document, or no hit
>> otherwise. This is used e.g. for highlighting if TVs are enabled. To support
>> this “TV as a small index” model, the old API was removed, and the new
>> TermVectors API returns the same Terms/TermsEnum/DocsEnum APIs as
>> IndexReader does for a complete index, except that all structures contain
>> just one document (ID=0) and the corresponding term frequencies/doc
>> frequencies.
>>
>> For example code showing how to use it, review the Lucene test cases. Some
>> examples:
>>
>> Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>> assertNotNull(result);
>> assertEquals(3, result.getUniqueTermCount());
>> TermsEnum termsEnum = result.iterator(null);
>> while (termsEnum.next() != null) {
>>   String term = termsEnum.term().utf8ToString();
>>   int freq = (int) termsEnum.totalTermFreq();
>>   assertTrue(freq > 0);
>> }
>>
>> Fields results = reader.getTermVectors(docId);
>> assertTrue(results != null);
>> assertEquals("We do not have 3 term freq vectors", 3,
>>     results.getUniqueFieldCount());
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: [email protected]
>>