Hello,

One of the commercial search platforms I work with has the concept of
'document vectors', which are 1-gram and 2-gram phrases and their
associated tf/idf weights on a 0-1 scale, e.g. ["banana pie", 0.99]
means "banana pie" is very relevant for this document.

During the ingest/indexing process you can configure the engine to
store the top N vectors (those with the highest weights) from a
document into a field that is indexed along with the original content
and is returned in a result set.  This is great for reporting and
other statistical analysis, and even some basic result clustering at
query time.
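
To make the idea concrete, here is a rough Python sketch of what I mean
by "top N document vectors". It is not the commercial engine's
algorithm (I don't know its internals); the whitespace tokenizer, the
smoothed idf formula, the scaling by each document's maximum weight to
get a 0-1 range, and the name top_document_vectors are all my own
assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-gram phrases (space-joined) found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrases(text):
    """Lowercase, split on whitespace, and emit all 1-grams and 2-grams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


def top_document_vectors(docs, top_n=3):
    """For each document, return its top_n [phrase, weight] pairs.

    Weight is tf * idf, divided by the document's largest weight so the
    strongest phrase scores 1.0. That scaling is an assumption, not the
    commercial platform's documented behaviour.
    """
    counts = [Counter(phrases(d)) for d in docs]

    # Document frequency: in how many documents each phrase occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    result = []
    for c in counts:
        # Smoothed idf, always positive, so rare phrases score higher.
        raw = {
            p: tf * (math.log((1 + n_docs) / (1 + df[p])) + 1)
            for p, tf in c.items()
        }
        peak = max(raw.values(), default=0.0)
        scaled = {p: (w / peak if peak else 0.0) for p, w in raw.items()}
        # Highest weight first; ties broken alphabetically.
        ranked = sorted(scaled.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append([[p, round(w, 2)] for p, w in ranked[:top_n]])
    return result
```

The list returned for each document is what would be stored in the
extra field at index time and handed back with each hit.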

I've been looking at the Solr TermVectorComponent
(http://wiki.apache.org/solr/TermVectorComponent) and it seems to have
something similar to this, but it looks to me like this is a component
that is processed at query time (?) and is limited to 1-gram terms.
Also, the scoring is a little different: tf and df come back as
separate integer values rather than as a single 0-1 weight.
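
For reference, this is roughly the kind of request I have been testing
with, sketched in Python. The base URL and field name are placeholders,
and I am assuming a handler registered at /tvrh as in the example
solrconfig.xml, with termVectors="true" set on the field in the schema.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, query, field):
    """Build a TermVectorComponent request URL (the request is not sent).

    base_url and field are placeholders; the /tvrh handler name is taken
    from the example solrconfig.xml and may differ in other setups.
    """
    params = {
        "q": query,
        "tv": "true",      # switch the component on
        "tv.fl": field,    # field to read term vectors from
        "tv.tf": "true",   # return term frequency per term
        "tv.df": "true",   # return document frequency per term
    }
    return base_url + "/tvrh?" + urlencode(params)


if __name__ == "__main__":
    print(term_vector_url("http://localhost:8983/solr", "id:1", "content"))
```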

Does anyone know if Solr/Lucene has anything like the commercial
platform's feature I described above?

Thanks, appreciate any responses.

Michael Hughes
Lightcrest LLC
