Re: Getting most occurring words in lucene

Michael McCandless Sun, 22 Feb 2015 05:26:05 -0800

Use TermsEnum.totalTermFreq(), which is the total number of
occurrences of the term, not TermsEnum.docFreq(), which is the number
of documents that contain at least one occurrence of the term.
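
To make the distinction concrete, here is a minimal sketch in plain Java
that does not use Lucene at all. The class TermCountSketch and its
tokenize, docFreq and totalTermFreq helpers are hypothetical names invented
for this illustration, and the tokenizer is only a crude stand-in for a
real analyzer. It shows why a term that occurs three times in a single
document has a document frequency of 1 but a total term frequency of 3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: plain Java, not the Lucene API.
class TermCountSketch {

    // Lowercase the text and split on non-letters (crude analyzer stand-in).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of documents that contain the term at least once.
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (tokenize(doc).contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Total number of occurrences of the term across all documents.
    static long totalTermFreq(List<String> docs, String term) {
        long count = 0;
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "It behooves every man to freedom freedom freedom remember that the work of the");
        // prints: docFreq=1 totalTermFreq=3
        System.out.println("docFreq=" + docFreq(docs, "freedom")
            + " totalTermFreq=" + totalTermFreq(docs, "freedom"));
    }
}
```

As far as I know, TermsEnum belongs to the 4.x API; with the 3.x-style
TermEnum code quoted below, the equivalent would likely be to sum
TermDocs.freq() over the documents returned by reader.termDocs(term).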


Mike McCandless

http://blog.mikemccandless.com


On Sun, Feb 22, 2015 at 6:47 AM, Maisnam Ns <[email protected]> wrote:
> Hi,
>
> I am trying to get the most frequently occurring words by building an
> in-memory index with Lucene using the code below, but I am not getting the
> desired results. The text contains 'freedom' three times, but the output
> reports only 1. Where am I making a mistake? Is there a way to fix this?
> Please help.
>
> RAMDirectory idx = new RAMDirectory(); // create RAM directory
> IndexWriter writer = new IndexWriter(idx,
>     new StandardAnalyzer(Version.LUCENE_30),
>     IndexWriter.MaxFieldLength.LIMITED); // create the index
>
> writer.addDocument(createDocument("key1",
>     "It behooves every man to freedom freedom freedom remember that "
>     + "the work of the ")); // add text to document
>
> try {
>     computeTopTermQuery(idx); // compute the top terms
> } catch (Exception e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
> }
>
> The computeTopTermQuery method is taken from Sujit Pal's blog:
> http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html
>
>   private static Query computeTopTermQuery(Directory ramdir)
>       throws Exception {
>         final Map<String,Integer> frequencyMap =
>           new HashMap<String,Integer>();
>         List<String> termlist = new ArrayList<String>();
>         IndexReader reader = IndexReader.open(ramdir);
>         TermEnum terms = reader.terms();
>         while (terms.next()) {
>           Term term = terms.term();
>           String termText = term.text();
>           int frequency = reader.docFreq(term);
>           frequencyMap.put(termText, frequency);
>           termlist.add(termText);
>         }
>         reader.close();
>         // sort the term map by frequency descending
>         Collections.sort(termlist, new ReverseComparator<String>(
>           new ByValueComparator<String,Integer>(frequencyMap)));
>         // retrieve the top terms based on topTermCutoff
>         List<String> topTerms = new ArrayList<String>();
>         float topFreq = -1.0F;
>         for (String term : termlist) {
>           if (topFreq < 0.0F) {
>             // first term, capture the value
>             topFreq = (float) frequencyMap.get(term);
>             topTerms.add(term);
>           } else {
>             // not the first term, compute the ratio and discard if below
>             // topTermCutoff score
>             float ratio =
>               (float) ((float) frequencyMap.get(term) / topFreq);
>             if (ratio >= topTermCutoff) {
>               topTerms.add(term);
>             } else {
>               break;
>             }
>           }
>         }
>         StringBuilder termBuf = new StringBuilder();
>         BooleanQuery q = new BooleanQuery();
>         for (String topTerm : topTerms) {
>           termBuf.append(topTerm).
>             append("(").
>             append(frequencyMap.get(topTerm)).
>             append(");");
>           q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
>         }
>         System.out.println(">>> top terms: " + termBuf.toString());
>         System.out.println(">>> query: " + q.toString());
>         return q;
>       }
>
>
> But surprisingly I am getting freedom as (1) and not (3), where 3 is the
> number of occurrences of freedom.
>
> top terms:
> accomplished(1);altogether(1);behooves(1);critic(1);does(1);end(1);
> every(1);freedom(1);importance(1);man(1);progress(1);remember(1);
> secondary(1);things(1);who(1);work(1);
>
> Thanks
