Re: Indexing Term Frequency Vectors

Sharon W Tam Tue, 02 Apr 2013 07:10:29 -0700

Thanks for your help, Adrien. Unfortunately, my term frequencies will be
partial counts, so they won't be integers. Finding a common denominator
and scaling the rest of the frequencies accordingly would change the
relative lengths of the documents, and that in turn would affect the
Lucene scoring, because document length is taken into account there.
Are there any other ideas?
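
To make the concern concrete, here is a minimal sketch of the scaling
step in plain Java. It does not use any Lucene API; the class and method
names (ScaledFrequencies, toRepeatCounts, totalTokens) and the sample
values are made up for illustration. It only shows that multiplying
fractional frequencies by a common factor also multiplies the number of
tokens the indexer would see for that document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScaledFrequencies {

    // Hypothetical helper: multiply each fractional frequency by a common
    // scale factor and round to the nearest integer repeat count.
    public static Map<String, Integer> toRepeatCounts(Map<String, Double> freqs, int scale) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : freqs.entrySet()) {
            counts.put(e.getKey(), (int) Math.round(e.getValue() * scale));
        }
        return counts;
    }

    // Total number of tokens that would be emitted if every term were
    // repeated according to its integer count.
    public static int totalTokens(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up partial counts for one document (they sum to 1.75).
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("lucene", 1.5);
        doc.put("index", 0.25);

        // A scale of 4 makes both values integral: 6 and 1.
        Map<String, Integer> counts = toRepeatCounts(doc, 4);
        System.out.println(counts + " tokens=" + totalTokens(counts));
    }
}
```

With a scale factor of 4, the document that originally summed to 1.75
becomes 7 tokens long, which is the length change I am worried about.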



On Thu, Mar 28, 2013 at 9:06 PM, Adrien Grand <[email protected]> wrote:

> Hi,
>
> On Thu, Mar 28, 2013 at 8:25 PM, Sharon Tam <[email protected]> wrote:
> > I believe that when Lucene indexes documents, it generates counts for a
> > term by counting how many times the term appears in a particular
> > document. Instead of having Lucene do the counting, I want to do my own
> > counting and feed a term-frequency vector representation of a document
> > directly into the indexer, which will take my counts and proceed to do
> > the other processing, such as generating inverse document frequency.
> > My term frequencies may not all be integers. Is there a way to do this?
>
> You could provide the indexer with arbitrary frequencies by creating a
> handcrafted TokenStream that repeats terms ${termFreq} times, but
> unfortunately, frequencies need to be strictly positive (> 0)
> integers.
>
> --
> Adrien
