Can this be done as an offline task?
On Nov 15, 2006, at 1:07 PM, Phil Rosen wrote:
Thanks for your help!
Here is an example: I have 100 items, each with a set of potentially
unique attributes. Attributes could be color, size, length, density,
etc. So an example document could be:
Id: 1
ItemType: foo
Blob-field: all sorts of text handled normally
Outer-Color: Red
Size: Large
Temperature: hot
Etc:
Etc:
I would like to get the sum of frequency counts for each term in the
fields I specify across the search results. I can just iterate through
the documents and use getTermFreqVector() for each desired field on
each document, then sum that; but this seems slow to me.
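
To make the summing step above concrete, here is a minimal sketch in
plain Java. It does not call Lucene at all: FieldVector is a
hypothetical stand-in for whatever getTermFreqVector() returns for one
field of one hit, assumed here to expose parallel arrays of terms and
frequencies. The class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FieldTermCounts {

    /**
     * Hypothetical stand-in for one field's term vector on one document:
     * terms[i] occurs freqs[i] times in that field. Not a Lucene class.
     */
    static final class FieldVector {
        final String field;
        final String[] terms;
        final int[] freqs;

        FieldVector(String field, String[] terms, int[] freqs) {
            if (terms.length != freqs.length) {
                throw new IllegalArgumentException("terms and freqs must be the same length");
            }
            this.field = field;
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    /**
     * Sums term frequencies per field across all supplied vectors.
     * Result shape: field name -> (term -> total frequency).
     */
    static Map<String, Map<String, Integer>> sumByField(List<FieldVector> vectors) {
        Map<String, Map<String, Integer>> totals = new HashMap<>();
        for (FieldVector v : vectors) {
            // Counts are kept per field, so "red" in one attribute never
            // mixes with "red" in another.
            Map<String, Integer> perField =
                    totals.computeIfAbsent(v.field, k -> new HashMap<>());
            for (int i = 0; i < v.terms.length; i++) {
                perField.merge(v.terms[i], v.freqs[i], Integer::sum);
            }
        }
        return totals;
    }
}
```

The aggregation itself is one pass over the vectors, so if this is
slow the cost is more likely in loading a term vector per field per
hit than in the summing. Keeping the counts keyed by field is what
allows attributes with overlapping values to be binned separately.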
-----Original Message-----
From: Michael D. Curtin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 15, 2006 11:35 AM
To: java-user@lucene.apache.org
Subject: Re: term vectors
Phil Rosen wrote:
I am building an application that requires I index a set of documents
on the scale of hundreds of thousands.
A document can have a varying number of attribute fields with an
unknown set of potential values. I realize that just indexing a blob of
fields would be much faster; however, I need to bin the search results
based on common attributes. As different types of attributes could
potentially have overlapping values, a single blob for all attributes
won't work.
My question is this: is there a way to get term frequencies for a set
of documents or hits, without using getTermFreqVector() on each
document and each attribute field? As I could have hundreds of results,
each with dozens of attribute fields, looping getTermFreqVector() would
be very slow. If there isn't something inherent to Lucene, has anyone
seen an extension that could accomplish this?
Could you give an example of what you're starting with, what a search
looks like, and what you want out? It sounds almost like you're looking
for a custom statistical analysis of hits, which I doubt Lucene is
going to have for you, out of the box ...
--MDC
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org