the list of terms contained in a document

Chantal Ackermann Mon, 26 Nov 2001 01:08:58 -0800


dear all,


we have a linguistics project running here and we want to use Lucene for the
information retrieval. rather than just searching for specific terms, we want
to build frequency lists and detect co-occurrences of terms.

what we need is something like the following functionality (I will sketch
what I think the resulting API could look like):

1. IndexSearcher.search(query) (already implemented)
2. Hits.length() (already implemented)
3. for (...) Hits.doc(i).getTerms() or Hits.doc(i).getTerms(Field) (required)
(4. and for each returned term its frequency, but that is the same as above,
or could it be retrieved together with the term list?)
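
A rough sketch of what steps 3 and 4 would make possible, assuming each hit could hand back a term -> frequency map. Everything here is hypothetical and plain Java, not Lucene API: the class name `HitsTermStats`, both method names, and the `|` pair separator are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are Lucene API.
// Each map in hitTerms stands for what getTerms() might return for one hit.
class HitsTermStats {
    // Sums the per-document term frequencies into one overall frequency list.
    static Map<String, Integer> frequencyList(List<Map<String, Integer>> hitTerms) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> doc : hitTerms) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Counts, for each unordered pair of distinct terms, the number of
    // documents in which both terms occur. Keys look like "termA|termB".
    static Map<String, Integer> cooccurrences(List<Map<String, Integer>> hitTerms) {
        Map<String, Integer> pairs = new TreeMap<>();
        for (Map<String, Integer> doc : hitTerms) {
            // Sort so that each pair always gets the same key.
            String[] terms = doc.keySet().stream().sorted().toArray(String[]::new);
            for (int i = 0; i < terms.length; i++) {
                for (int j = i + 1; j < terms.length; j++) {
                    pairs.merge(terms[i] + "|" + terms[j], 1, Integer::sum);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> hits = List.of(
            Map.of("lucene", 2, "index", 1),
            Map.of("lucene", 1, "term", 3));
        System.out.println(frequencyList(hits));
        System.out.println(cooccurrences(hits));
    }
}
```

The pair loop is quadratic in the number of distinct terms per document, so for long documents one would probably restrict it to a window or to selected terms.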

This means that if I get a Hits object back, I want to get, for each of its
documents, the terms and their frequencies. sure, I could look each document
up and parse it again. but if the first query produces, say, 20,000 hits, I
would have to reparse these 20,000 documents, although this parsing has
already been done when the index was created. so I wanted to ask whether there
is a way, within the existing classes (or at least with some use of them plus
some new ones), to retrieve this information: which terms a single document
is assigned to.
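
One way to picture the "no reparsing" idea is a small side structure filled while documents are indexed. The sketch below is plain Java under stated assumptions, not how Lucene stores anything: `ForwardIndexSketch`, `addDocument` and `getTerms` are hypothetical names, and a lowercase whitespace split stands in for the real analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Lucene API: keep each document's term counts
// at indexing time so they can be looked up later without reparsing.
class ForwardIndexSketch {
    private final Map<Integer, Map<String, Integer>> termsByDoc = new HashMap<>();

    // Called once per document while indexing. The naive lowercase
    // whitespace split is an assumption standing in for a real analyzer.
    void addDocument(int docId, String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        termsByDoc.put(docId, freq);
    }

    // Lookup only: returns the stored term -> frequency map,
    // or an empty map if the document id is unknown.
    Map<String, Integer> getTerms(int docId) {
        return Collections.unmodifiableMap(
            termsByDoc.getOrDefault(docId, Collections.emptyMap()));
    }

    public static void main(String[] args) {
        ForwardIndexSketch index = new ForwardIndexSketch();
        index.addDocument(0, "the cat saw the dog");
        System.out.println(index.getTerms(0));
    }
}
```

The trade-off is extra storage roughly proportional to the number of distinct terms per document, in exchange for never touching the original text again after indexing.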

thanx a lot for any help or hint
sincerely,
Chantal

