I got the patch before JIRA was down, and just saw another thing:

+  private double countInClassC(String c) throws IOException {
+    TopDocs topDocs = indexSearcher.search(new TermQuery(new Term(classFieldName, c)), Integer.MAX_VALUE);
+    int res = 0;
+    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+      Fields termVectors = indexSearcher.getIndexReader().getTermVectors(scoreDoc.doc);
+      if (termVectors != null) {
+        res += termVectors.terms(textFieldName).size();
+      } else {
+        // TODO : warn about not existing term vectors for field 'textFieldName'
+      }
+    }
+    return res;
+  }

For this part, I am unsure which statistic you are aiming for. Currently it
seems to take all documents that have term c in field classFieldName and sum
the number of unique terms each of those docs has in field textFieldName.

If this is really what you want and you need 100% exact numbers, then, just
like the other computation, I would not do a search with a PQ of
Integer.MAX_VALUE, but instead just iterate over a DocsEnum for that term.

But if a good approximation is OK, I would do this, which is instant and
needs no term vectors:

  Terms terms = MultiFields.getTerms(reader, textFieldName);
  long numPostings = terms.getSumDocFreq(); // number of term/doc pairs
  double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount(); // avg # of unique terms per doc
  int docsWithC = reader.docFreq(new Term(classFieldName, c)); // # docs with c
  return avgNumberOfUniqueTerms * docsWithC; // avg # of unique terms per doc * # docs with c

On Fri, Aug 10, 2012 at 8:36 AM, Tommaso Teofili (JIRA) <j...@apache.org> wrote:
>
>      [ https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Tommaso Teofili updated SOLR-3700:
> ----------------------------------
>
>     Attachment: SOLR-3700_2.patch
>
> new patch incorporating Robert's suggestions (plus added a couple more TODOs)
>
>> Create a Classification component
>> ---------------------------------
>>
>>                 Key: SOLR-3700
>>                 URL: https://issues.apache.org/jira/browse/SOLR-3700
>>             Project: Solr
>>          Issue Type: New Feature
>>            Reporter: Tommaso Teofili
>>            Priority: Minor
>>         Attachments: SOLR-3700.patch, SOLR-3700_2.patch
>>
>>
>> Lucene/Solr can host huge sets of documents containing lots of information
>> in fields so that these can be used as training examples (w/ features) in
>> order to very quickly create classifiers algorithms to use on new documents
>> and / or to provide an additional service.
>> So the idea is to create a contrib module (called 'classification') to host
>> a ClassificationComponent that will use already seen data (the indexed
>> documents / fields) to classify new documents / text fragments.
>> The first version will contain a (simplistic) Lucene based Naive Bayes
>> classifier but more implementations should be added in the future.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

--
lucidimagination.com
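
P.S. To make the exact-versus-approximate trade-off concrete, here is a toy
in-memory sketch in plain Java (no Lucene involved). Everything in it is made
up for illustration: the class UniqueTermEstimate, the nested Doc type, and
the three sample documents are assumptions, and the three counters only mimic
what getSumDocFreq(), getDocCount() and docFreq() would report for a real
index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy in-memory model of the two statistics; hypothetical, not Lucene code. */
final class UniqueTermEstimate {

    /** One "document": a class label plus the unique terms of its text field. */
    static final class Doc {
        final String label;
        final Set<String> textTerms;

        Doc(String label, String... textTerms) {
            this.label = label;
            this.textTerms = new HashSet<>(Arrays.asList(textTerms));
        }
    }

    /** Exact statistic: sum of unique text terms over the docs labelled c. */
    static long exact(List<Doc> docs, String c) {
        long sum = 0;
        for (Doc d : docs) {
            if (d.label.equals(c)) {
                sum += d.textTerms.size();
            }
        }
        return sum;
    }

    /** Approximation: (term/doc pairs / docs having the field) * docs labelled c. */
    static double approximate(List<Doc> docs, String c) {
        long numPostings = 0; // stands in for getSumDocFreq()
        int docCount = 0;     // stands in for getDocCount()
        int docsWithC = 0;    // stands in for docFreq of the class term
        for (Doc d : docs) {
            numPostings += d.textTerms.size();
            if (!d.textTerms.isEmpty()) {
                docCount++;
            }
            if (d.label.equals(c)) {
                docsWithC++;
            }
        }
        if (docCount == 0) {
            return 0.0; // no doc has the text field, nothing to average
        }
        double avgNumberOfUniqueTerms = numPostings / (double) docCount;
        return avgNumberOfUniqueTerms * docsWithC;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("sports", "ball", "goal", "team"),
            new Doc("sports", "ball", "race"),
            new Doc("politics", "vote", "law", "party", "debate", "budget"));
        System.out.println("exact       = " + exact(docs, "sports"));
        System.out.println("approximate = " + approximate(docs, "sports"));
    }
}
```

For these three toy docs the exact sum for "sports" is 5, while the estimate
is 10 / 3 * 2, roughly 6.67. The estimate drifts whenever docs of class c are
shorter or longer than the index-wide average, so it is only a fit when that
kind of error is acceptable.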