[ https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445830#comment-13445830 ]
Robert Muir commented on LUCENE-4345:
-------------------------------------

docsWithClassSize should ideally be terms.getDocCount() for the field as well, rather than maxDoc.

docCount() should not do a search; instead I think it should just return IR.docFreq(term)?

One more piece: if classCount is just a Map<UniqueValues,DocFreq>, it would be a lot better to just compute this with a TermsEnum, just iterating over the terms for the field. It seems the "value" part is not used, so for now it could be just a HashSet as well? This would remove the stored fields loop (replacing it with a TermsEnum loop), but I think we can probably remove the loop entirely too as a second step.

I don't like that assignClass has a loop over all possible terms in the field, re-tokenizing the doc for each one! It seems we don't need this classCount map at all, nor the priors map? Instead we would just tokenize each doc a single time, and compute the prior of the terms we find on the fly (it seems to just be IDF anyway, really). And we wouldn't need any maps for that.

> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in
> fields so that these can be used as training examples (w/ features) in order
> to very quickly create classifiers algorithms to use on new documents and /
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a
> ClassificationComponent that will use already seen data (the indexed
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes
> classifier but more implementations should be added in the future.
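
The suggestion in the comment above (derive class priors from index statistics instead of running searches or keeping classCount and priors maps) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch's code: `PriorSketch`, `FieldStats`, `fromDocs`, `prior` and `bestClass` are hypothetical names, and `FieldStats` is a plain-Java stand-in for the statistics that the comment refers to as terms.getDocCount(), IR.docFreq(term) and a TermsEnum over the class field. No Lucene types are used, so the sketch runs on its own.

```java
import java.util.List;
import java.util.TreeMap;

final class PriorSketch {

    /** Hypothetical stand-in for per-field index statistics; not a Lucene type. */
    interface FieldStats {
        /** Documents containing the term (the role IR.docFreq(term) plays in the comment). */
        int docFreq(String term);

        /** Documents with a value in the field (the role of terms.getDocCount(), not maxDoc). */
        int docCount();

        /** Unique terms of the field in sorted order (the role of walking a TermsEnum). */
        Iterable<String> terms();
    }

    /** Builds toy statistics from one class value per document; null means "no class". */
    static FieldStats fromDocs(List<String> classPerDoc) {
        final TreeMap<String, Integer> freq = new TreeMap<>();
        int withClass = 0;
        for (String c : classPerDoc) {
            if (c == null) {
                continue; // unlabeled documents must not inflate the denominator
            }
            freq.merge(c, 1, Integer::sum);
            withClass++;
        }
        final int docCount = withClass;
        return new FieldStats() {
            @Override public int docFreq(String term) { return freq.getOrDefault(term, 0); }
            @Override public int docCount() { return docCount; }
            @Override public Iterable<String> terms() { return freq.keySet(); }
        };
    }

    /** Prior of a class, computed on the fly from two statistics; no search and no map. */
    static double prior(FieldStats stats, String classLabel) {
        int docCount = stats.docCount();
        return docCount == 0 ? 0.0 : (double) stats.docFreq(classLabel) / docCount;
    }

    /** Walks the field's terms once and returns the class with the highest prior, or null. */
    static String bestClass(FieldStats stats) {
        String best = null;
        double bestPrior = -1.0;
        for (String c : stats.terms()) {
            double p = prior(stats, c);
            if (p > bestPrior) {
                bestPrior = p;
                best = c;
            }
        }
        return best;
    }
}
```

With four documents of which one has no class value, the denominator is 3 rather than 4, which is the difference between a per-field document count and maxDoc that the comment points out.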