Would some sort of caching strategy work? How big is your overall collection?
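For instance, if the same documents keep showing up across hit lists, an
LRU cache of the computed top-terms lists, keyed on doc id, might save a
lot of recomputation. A minimal sketch; TopTermsCache, CACHE_SIZE, and
the String[] value type are just placeholders for whatever structure you
keep per document:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: a LinkedHashMap in access order evicts the
// least recently used entry once the cache grows past CACHE_SIZE.
public class TopTermsCache extends LinkedHashMap<Integer, String[]> {

    private static final int CACHE_SIZE = 10000; // tune to your heap

    public TopTermsCache() {
        super(16, 0.75f, true); // accessOrder = true gives LRU semantics
    }

    protected boolean removeEldestEntry(Map.Entry<Integer, String[]> eldest) {
        return size() > CACHE_SIZE;
    }
}

One caveat: doc ids aren't stable across segment merges, so you'd want
to throw the cache away whenever you reopen your IndexReader.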

Also, lately there have been a few threads on TV (term vector) performance. I don't recall anyone having actively profiled or examined it for improvements, so perhaps that would be helpful.

Another thought: could you have a stored field that contains the top X terms for a given document with their freqs, and then just do a merge based on your hit results? Part of the problem w/ TVs is that not only do you have to load them, but you then have to iterate through them to sort them by frequency. I could see it being beneficial to have alternate strategies for loading them, say into a map of terms -> frequencies, or terms -> TVInfo (freqs, offsets, positions), or parallel arrays sorted by frequency, something like that.
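To make the stored field idea concrete, here's a rough index-time
sketch. The ".top" field-name suffix, TOP_X, and countTerms() are just
placeholders -- countTerms() stands in for counting tokens with whatever
analyzer you index with:

import java.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Precompute the top X terms for a field and store them, already
// sorted by frequency, in a stored-only companion field. At search
// time you read this field back per hit and merge -- no term vectors
// to load or sort.
public class TopTermsField {

    private static final int TOP_X = 35;

    public static void add(Document doc, String fieldName, String text) {
        Map<String, Integer> freqs = countTerms(text);

        // Sort terms by descending frequency.
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(freqs.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue().intValue() - a.getValue().intValue();
            }
        });

        // Keep only the top X as "term:freq" pairs.
        StringBuffer buf = new StringBuffer();
        int n = Math.min(TOP_X, entries.size());
        for (int i = 0; i < n; i++) {
            Map.Entry<String, Integer> e = entries.get(i);
            buf.append(e.getKey()).append(':').append(e.getValue()).append(' ');
        }

        // Stored but not indexed: a pure payload read back at query time.
        doc.add(new Field(fieldName + ".top", buf.toString(),
                          Field.Store.YES, Field.Index.NO));
    }

    // Placeholder tokenizer/counter; use your real analyzer here.
    private static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (StringTokenizer t = new StringTokenizer(text.toLowerCase());
             t.hasMoreTokens();) {
            String term = t.nextToken();
            Integer old = freqs.get(term);
            freqs.put(term, Integer.valueOf(old == null ? 1 : old.intValue() + 1));
        }
        return freqs;
    }
}

At search time you'd read the stored field back for each hit (e.g. via
IndexReader.document(docId)) and merge the already-sorted lists.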

It _might_ be possible to do this in a HitCollector- or FieldSelector-style way. That way, perhaps, you could build the TV structure you want as it is read from disk. Do you have any interest in digging down into the Lucene code to help on such an idea?
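Purely hypothetical, but the callback could look something like this --
none of these types exist in the current API, all the names are
invented, and the point is just that the reader would hand you terms one
at a time as it reads them off disk:

// Invented FieldSelector/HitCollector-style callback for term vectors.
public interface TermVectorCallback {

    // Called once per field before any terms, so an implementation
    // can presize its structures.
    void setExpectations(String field, int numTerms,
                         boolean storingOffsets, boolean storingPositions);

    // Called once per term as it is read from disk; offsets and
    // positions would be null when they were not stored.
    void map(String term, int frequency, int[] offsets, int[] positions);
}

For your case, an implementation could keep only the top 35 terms in a
bounded priority queue inside map(), so the full vector is never
materialized or sorted afterward.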

-Grant



On Apr 10, 2007, at 9:38 PM, Sengly Heng wrote:

Once again, thank you for your help.


>> We don't really know what your problem is. Explaining that rather
>> than the solution you have thought of might render a couple of
>> alternate solutions. Perhaps something could be precalculated and
>> stored in the documents. Perhaps feature selection (reduction) of the
>> terms might do the trick for you. And so on.
>
> I have a corpus of documents indexed with different fields. Each
> indexed document has an average of 30 fields, and each field has
> about 100 terms.
>
> Normally, a search returns fewer than 100 documents. For each of the
> 30 fields, I have to calculate the top 35 keywords across all the
> hit documents, as well as the top 30 most popular keywords (the
> keywords that are distributed across many documents - something like
> docFreq or IDF).

Right, but /why/ do you need these values? Do you present them as
they are, or do you use them for some secondary calculation? Then
what is the result of this secondary calculation?


Yes, I just want those values as they are. No secondary calculation is
performed.

Please let me know if you still have more questions.

I'll re-ask one of the questions from my previous reply:

>> How slow is it, and how fast did you expect it to be?


We expect to get those sorted values as fast as possible. Currently, for
100 documents, the process takes about 1 to 1.5 minutes. I believe this
is because of the loop.

Can you limit the evaluation to the top n documents?


Yes, we limit it to only the top 100 documents.

Thank you.

Regards,

Sengly

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



