You can count me in on this :)

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Right now, Lucene does not have good support for what you're doing. Lucene
> as it stands is designed to support basic search, not other statistical text
> processing. However, there are two features that I would like to add to
> Lucene that would help you.
>
> 1. Seekable TermDocs.
>
> This would let you efficiently skip forward in a TermDocs to a particular
> document number. This would enable some search optimizations. This
> requires no API changes, as the TermDocs.skipTo() method already exists.
>
> 2. Stored Document Vectors
>
> These would enable one to determine the set of terms in a document. This
> would be useful for, e.g., document clustering.
>
> This would add two methods to IndexReader:
>   public TermFreqVector getTermFreqVector(int docNumber);
>   public Term getTerm(int termNumber);
> The TermFreqVector class would be defined something like:
>   public class TermFreqVector {
>     public int[] getTermNumbers();
>     public int[] getTermFrequencies();
>   }
> The term number array would be sorted. The frequency of the term numbered
> getTermNumbers()[i] is getTermFrequencies()[i].
>
> Another class that would be useful is something like:
>   public class TermWeightVector {
>     public int[] getTermNumbers();
>     public float[] getTermWeights();
>
>     public void add(TermWeightVector other);
>     public float distance(TermWeightVector other);
>   }
>
> Both of these are long-term changes, so it may be a while before they are
> completed. That said, I would like to implement them, when I have time!
>
> Doug
>
> > -----Original Message-----
> > From: Nestel, Frank [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, October 10, 2001 12:23 AM
> > To: '[EMAIL PROTECTED]'
> > Subject: Token retrieval question
> >
> > Hi,
> >
> > I've been reading the API and I couldn't figure out a
> > nice and fast way to solve the following problem:
> >
> > I'd like to enumerate the tokens of a document (or
> > document field). Do the internal data structures
> > of Lucene allow this kind of traversal, which is (as
> > I understand) of course orthogonal to the access Lucene
> > is optimized for?
> >
> > More concretely, I have something like 20-50 tokens/words and one
> > document, and I'd like to ask the document if (and how often)
> > it contains those particular tokens. The idea was to augment
> > search results with (kind of, I know) automatic
> > query-dependent keywords.
> >
> > The only way I see right now is to create 20-50 TermEnums
> > and walk through them until I end up in my document or
> > nowhere? Which is probably not feasible for a search result
> > page with (say) 20 hits in a larger index.
> >
> > Any (more elegant) chance I missed?
> >
> > Thank you,
> > Frank
> >
> > --
> > Dr. Frank Sven Nestel
> > Principal Software Engineer
> >
> > COI GmbH Erlanger Straße 62, D-91074 Herzogenaurach
> > Phone +49 (0) 9132 82 4611
> > http://www.coi.de, mailto:[EMAIL PROTECTED]
> > COI - Solutions for Documents
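
To make the quoted proposal a bit more concrete, here is a minimal Java sketch of how the two vector classes might behave. Only the getter names and the add()/distance() signatures come from Doug's message. Everything else is an assumption for illustration: the constructors, the frequencyOf() helper (a binary search over the sorted term numbers, which is roughly the "does this document contain these 20-50 tokens, and how often" lookup Frank asked about), summing weights term by term in add(), and Euclidean distance in distance(). None of this exists in Lucene as discussed here, and how a token would be mapped to its term number is left open.

```java
import java.util.Arrays;

/** Hypothetical sketch; only the two getters are taken from the proposal. */
final class TermFreqVector {
    private final int[] termNumbers;      // sorted ascending, as the proposal states
    private final int[] termFrequencies;  // termFrequencies[i] belongs to termNumbers[i]

    // Assumed constructor: expects termNumbers already sorted, arrays of equal length.
    TermFreqVector(int[] termNumbers, int[] termFrequencies) {
        if (termNumbers.length != termFrequencies.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.termNumbers = termNumbers.clone();
        this.termFrequencies = termFrequencies.clone();
    }

    public int[] getTermNumbers() { return termNumbers.clone(); }
    public int[] getTermFrequencies() { return termFrequencies.clone(); }

    /** Assumed helper, not in the proposal: frequency of one term, 0 if absent. */
    public int frequencyOf(int termNumber) {
        int i = Arrays.binarySearch(termNumbers, termNumber);
        return i >= 0 ? termFrequencies[i] : 0;
    }
}

/** Hypothetical sketch; the semantics of add() and distance() are assumptions. */
final class TermWeightVector {
    private int[] termNumbers;    // sorted ascending
    private float[] termWeights;  // termWeights[i] belongs to termNumbers[i]

    // Assumed constructor: expects termNumbers already sorted, arrays of equal length.
    TermWeightVector(int[] termNumbers, float[] termWeights) {
        if (termNumbers.length != termWeights.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.termNumbers = termNumbers.clone();
        this.termWeights = termWeights.clone();
    }

    public int[] getTermNumbers() { return termNumbers.clone(); }
    public float[] getTermWeights() { return termWeights.clone(); }

    // Merges two sorted sparse vectors, computing this + sign * other per term.
    private TermWeightVector combine(TermWeightVector other, float sign) {
        int[] nums = new int[termNumbers.length + other.termNumbers.length];
        float[] weights = new float[nums.length];
        int i = 0, j = 0, n = 0;
        while (i < termNumbers.length || j < other.termNumbers.length) {
            // long sentinels so an exhausted side never equals a real term number
            long a = i < termNumbers.length ? termNumbers[i] : Long.MAX_VALUE;
            long b = j < other.termNumbers.length ? other.termNumbers[j] : Long.MAX_VALUE;
            long t = Math.min(a, b);
            float w = 0f;
            if (a == t) w += termWeights[i++];
            if (b == t) w += sign * other.termWeights[j++];
            nums[n] = (int) t;
            weights[n++] = w;
        }
        return new TermWeightVector(Arrays.copyOf(nums, n), Arrays.copyOf(weights, n));
    }

    /** Adds the other vector's weights into this one, term by term (assumed meaning). */
    public void add(TermWeightVector other) {
        TermWeightVector sum = combine(other, 1f);
        termNumbers = sum.termNumbers;
        termWeights = sum.termWeights;
    }

    /** Euclidean distance; the proposal does not say which metric is intended. */
    public float distance(TermWeightVector other) {
        double sumOfSquares = 0;
        for (float w : combine(other, -1f).termWeights) {
            sumOfSquares += (double) w * w;
        }
        return (float) Math.sqrt(sumOfSquares);
    }
}

final class TermVectorDemo {
    public static void main(String[] args) {
        TermFreqVector doc = new TermFreqVector(new int[] {2, 5, 9}, new int[] {1, 3, 2});
        System.out.println(doc.frequencyOf(5));  // term 5 is present
        System.out.println(doc.frequencyOf(4));  // term 4 is absent

        TermWeightVector a = new TermWeightVector(new int[] {1, 3}, new float[] {3f, 1f});
        TermWeightVector b = new TermWeightVector(new int[] {3, 7}, new float[] {1f, 4f});
        System.out.println(a.distance(b));
    }
}
```

With vectors like these stored per document, checking 20-50 query terms against one hit would cost one binary search per term instead of one TermEnum walk per term, which is why I'd be glad to help with this.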