Thanks for your help, Adrien. Unfortunately, my term frequencies will be partial counts, so they won't be integers. Finding a common denominator and scaling the rest of the frequencies accordingly would change the relative lengths of the documents, and since Lucene takes document length into account when scoring, that would affect the scores. Are there any other ideas?
On Thu, Mar 28, 2013 at 9:06 PM, Adrien Grand <jpou...@gmail.com> wrote:

> Hi,
>
> On Thu, Mar 28, 2013 at 8:25 PM, Sharon Tam <sharon...@gmail.com> wrote:
> > I believe that when Lucene indexes documents, it generates counts for a
> > term by counting how many times the term appears in a particular document.
> > Instead of having Lucene do the counting, I want to do my own counting and
> > feed a term-frequency vector representation of a document directly into the
> > indexer which will take my counts and proceed to do the other processing
> > such as generating inverse document frequency. My term-frequencies may not
> > all be integers. Is there a way to do this?
>
> You could provide the indexer with arbitrary frequencies by creating a
> handcrafted TokenStream that repeats terms ${termFreq} times, but
> unfortunately, frequencies need to be strictly positive (> 0)
> integers.
>
> --
> Adrien
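
For anyone following the thread, here is a minimal sketch of the two ideas discussed above: scaling partial counts to integers, and repeating each term ${termFreq} times. It is plain Java and deliberately does not use any Lucene classes; the class name RepeatedTermSketch and the methods scale and expand are hypothetical names made up for illustration. In a real indexer the repeated terms would have to be emitted by a custom TokenStream, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: no Lucene API is used here.
public class RepeatedTermSketch {

    // Multiply each partial count by a common factor and round to an integer.
    // Rejects results that are not strictly positive, since the thread notes
    // that frequencies must be integers greater than 0.
    static Map<String, Integer> scale(Map<String, Double> partialCounts, int factor) {
        Map<String, Integer> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : partialCounts.entrySet()) {
            int freq = Math.toIntExact(Math.round(e.getValue() * factor));
            if (freq <= 0) {
                throw new IllegalArgumentException(
                    "Scaled frequency for '" + e.getKey() + "' is not positive: " + freq);
            }
            scaled.put(e.getKey(), freq);
        }
        return scaled;
    }

    // Repeat each term termFreq times, in the map's iteration order.
    // This is the token sequence a handcrafted token stream would emit.
    static List<String> expand(Map<String, Integer> termFreqs) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                tokens.add(e.getKey());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Double> partial = new LinkedHashMap<>();
        partial.put("lucene", 0.5);
        partial.put("index", 1.5);

        // Total weight is 2.0, but after scaling by 2 the document has 4 tokens.
        List<String> tokens = expand(scale(partial, 2));
        System.out.println(tokens + " (length " + tokens.size() + ")");
    }
}
```

The token count of a scaled document is its original total weight times the factor, so the factor chosen for each document directly determines the length that the scoring sees. That is the side effect raised in the reply above.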