11 apr 2007 kl. 04.21 skrev Grant Ingersoll:

Would some sort of caching strategy work? How big is your overall collection?

Also, lately there have been a few threads on TV (term vector) performance. I don't recall anyone having actively profiled or examined it for improvements, so perhaps that would be helpful.

Another thought: could you have a stored field that contains the top X terms for a given document with their freqs and then just do a merge based on your hit results? Part of the problem w/ TVs is that not only do you have to load them, but then you have to iterate through them to sort them by frequency. I could see that it might be beneficial to have alternate strategies for loading them, say into a map of terms -> frequencies or terms to TVInfo (freqs, offsets, positions) or parallel arrays sorted by frequency or something like that.

I personally don't like the idea of putting such information in a stored field. It would require parsing or so. I'd go deeper.

Are there any dependencies to the natural order of a term vector? Highlightning? What about allowing alternative order such as frequency rather than string value? Multiple term vectors per documents? Perhaps a new file would be the best way. I can think of many uses of a consumer configurable term vector. Actually, I think I'll look in to this some day.

Sengly, do you use the term vectors for anything else? I'd look in to hacking the order of the values in the term vector. Could be problematic if you want to use the default behaviour in the future though.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to