Re: bigram problem

parnab kumar Wed, 02 Jul 2014 06:36:06 -0700

TF is straightforward: you can count the number of occurrences in the
document by simple string matching. For IDF you need the total number of
documents in the collection and the number of documents containing the
bigram. reader.maxDoc() gives you the total number of documents in the
collection (it also counts deleted documents; reader.numDocs() excludes
them). To count the documents containing the bigram, use a phrase query
with the slop factor set to 0. The number of hits the IndexSearcher
returns for that phrase query is the number of documents containing the
bigram. I hope this is fine.
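
To make the arithmetic concrete, here is a minimal sketch in plain Java,
without any Lucene calls. The class and method names are hypothetical, the
whitespace tokenization is a simplification of what an analyzer does, and
idf = ln(N / df) is just one common variant, not necessarily the formula
Lucene's similarity uses. The two counts would come from reader.maxDoc()
and the phrase query hit count described above.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Lucene.
final class BigramTfIdf {

    // TF: count how often the two words occur as adjacent tokens in the text.
    static int termFrequency(String text, String bigram) {
        String[] parts = bigram.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected exactly two words: " + bigram);
        }
        // Assumption: lowercasing and splitting on whitespace is good enough here.
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        int count = 0;
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(parts[0]) && tokens[i + 1].equals(parts[1])) {
                count++;
            }
        }
        return count;
    }

    // IDF: ln(totalDocs / docsWithBigram); returns 0 when no document matches.
    static double idf(int totalDocs, int docsWithBigram) {
        if (docsWithBigram <= 0) {
            return 0.0;
        }
        return Math.log((double) totalDocs / docsWithBigram);
    }

    // tf-idf as the plain product of the two values above.
    static double tfIdf(String text, String bigram, int totalDocs, int docsWithBigram) {
        return termFrequency(text, bigram) * idf(totalDocs, docsWithBigram);
    }

    public static void main(String[] args) {
        String doc = "New York is not New York City";
        // 100 and 10 are made-up counts standing in for the index statistics.
        System.out.println("tf     = " + termFrequency(doc, "new york"));
        System.out.println("idf    = " + idf(100, 10));
        System.out.println("tf-idf = " + tfIdf(doc, "new york", 100, 10));
    }
}
```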

Alternatively, index the bigrams themselves (n = 2 in your case), for
example with NGramTokenizer for character bigrams or ShingleFilter for
word bigrams. In that case each bigram can be treated as a normal Lucene
term.
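
As a rough illustration of what "each bigram is a term" means, the sketch
below builds word bigrams in plain Java and counts, per bigram, how many
documents contain it. It does not use Lucene's analyzers; the class and
method names are hypothetical, and whitespace splitting stands in for a
real tokenizer. In an actual index the analysis chain would emit these
terms and the index would store their document frequencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Lucene.
final class BigramTerms {

    // Turn a text into its sequence of adjacent word pairs.
    static List<String> bigrams(String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            out.add(tokens[i] + " " + tokens[i + 1]);
        }
        return out;
    }

    // Document frequency: each document counts at most once per bigram.
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(bigrams(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "new york city", "new york new york", "old york");
        System.out.println(bigrams(docs.get(0)));
        System.out.println(documentFrequencies(docs));
    }
}
```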

Thanks,
Parnab

On Wed, Jul 2, 2014 at 8:45 AM, Manjula Wijewickrema <[email protected]>
wrote:

> Hi,
>
> Could you please explain how to determine the tf-idf score for bigrams? My
> program is able to index and search bigrams correctly, but it does not
> calculate the tf-idf for bigrams. If someone can, please help me resolve
> this.
>
> Regards,
> Manjula.
>
