Use TermsEnum.totalTermFreq(), which is the total number of occurrences of the term, not TermsEnum.docFreq(), which is the number of documents that contain at least one occurrence of the term. Your index holds a single document, so docFreq() is 1 for every term, including "freedom".
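
To make the difference between the two statistics concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: the class name TermStatsSketch, its helper methods, and the simple lowercase, split-on-non-word-characters tokenization are assumptions made for illustration only, and they merely approximate what an analyzer would do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: mimics the two statistics on an
// in-memory list of document texts.
final class TermStatsSketch {

    // Number of documents that contain the term at least once
    // (the quantity docFreq reports).
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (occurrences(doc, term) > 0) {
                count++;
            }
        }
        return count;
    }

    // Total number of occurrences of the term across all documents
    // (the quantity totalTermFreq reports).
    static long totalTermFreq(List<String> docs, String term) {
        long total = 0;
        for (String doc : docs) {
            total += occurrences(doc, term);
        }
        return total;
    }

    // Counts occurrences of the term in one document, using a crude
    // tokenizer: lowercase, then split on runs of non-word characters.
    static int occurrences(String doc, String term) {
        int n = 0;
        for (String token : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "It behooves every man to freedom freedom freedom remember that the work of the");
        System.out.println("docFreq = " + docFreq(docs, "freedom"));             // prints "docFreq = 1"
        System.out.println("totalTermFreq = " + totalTermFreq(docs, "freedom")); // prints "totalTermFreq = 3"
    }
}
```

With the single document from the question, the document count is 1 and the occurrence count is 3, which matches the (1) you are seeing and the (3) you expected.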

Mike McCandless

http://blog.mikemccandless.com

On Sun, Feb 22, 2015 at 6:47 AM, Maisnam Ns <maisnam...@gmail.com> wrote:
> Hi,
>
> I am trying to get the top occurring words by building a memory index using
> Lucene using the code below but I am not getting the desired results. The
> text contains 'freedom' three times but it gives only 1. Where am I
> committing a mistake? Is there a way out? Please help.
>
> RAMDirectory idx = new RAMDirectory(); // create ram directory
> // create the index
> IndexWriter writer =
>   new IndexWriter(idx, new StandardAnalyzer(Version.LUCENE_30),
>     IndexWriter.MaxFieldLength.LIMITED);
>
> // add text to document
> writer.addDocument(createDocument("key1",
>   "It behooves every man to freedom freedom freedom remember that the work of the "));
>
> try {
>   computeTopTermQuery(idx); // compute the top term
> } catch (Exception e) {
>   // TODO Auto-generated catch block
>   e.printStackTrace();
> }
>
> The computeTopTermQuery is from this link
> http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html by
> Sujit Pal's blog.
>
> private static Query computeTopTermQuery(Directory ramdir) throws Exception {
>   final Map<String,Integer> frequencyMap =
>     new HashMap<String,Integer>();
>   List<String> termlist = new ArrayList<String>();
>   IndexReader reader = IndexReader.open(ramdir);
>   TermEnum terms = reader.terms();
>   while (terms.next()) {
>     Term term = terms.term();
>     String termText = term.text();
>     int frequency = reader.docFreq(term);
>     frequencyMap.put(termText, frequency);
>     termlist.add(termText);
>   }
>   reader.close();
>   // sort the term map by frequency descending
>   Collections.sort(termlist, new ReverseComparator<String>(
>     new ByValueComparator<String,Integer>(frequencyMap)));
>   // retrieve the top terms based on topTermCutoff
>   List<String> topTerms = new ArrayList<String>();
>   float topFreq = -1.0F;
>   for (String term : termlist) {
>     if (topFreq < 0.0F) {
>       // first term, capture the value
>       topFreq = (float) frequencyMap.get(term);
>       topTerms.add(term);
>     } else {
>       // not the first term, compute the ratio and discard if below
>       // topTermCutoff score
>       float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
>       if (ratio >= topTermCutoff) {
>         topTerms.add(term);
>       } else {
>         break;
>       }
>     }
>   }
>   StringBuilder termBuf = new StringBuilder();
>   BooleanQuery q = new BooleanQuery();
>   for (String topTerm : topTerms) {
>     termBuf.append(topTerm).
>       append("(").
>       append(frequencyMap.get(topTerm)).
>       append(");");
>     q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
>   }
>   System.out.println(">>> top terms: " + termBuf.toString());
>   System.out.println(">>> query: " + q.toString());
>   return q;
> }
>
>
> But surprisingly I am getting freedom as (1) and not (3), where 3 is the
> occurrences of freedom.
>
> top terms:
> accomplished(1);altogether(1);behooves(1);critic(1);does(1);end(1);
> every(1);freedom(1);importance(1);man(1);progress(1);remember(1);
> secondary(1);things(1);who(1);work(1);
>
> Thanks