[jira] [Commented] (LUCENE-4345) Create a Classification module

Robert Muir (JIRA) Fri, 31 Aug 2012 03:44:12 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445830#comment-13445830
 ]


Robert Muir commented on LUCENE-4345:
-------------------------------------

docsWithClassSize should ideally be terms.getDocCount() for the field as well
rather than maxDoc.

docCount() should not do a search; instead, I think it should just
return IR.docFreq(term)?
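
To make the difference between these statistics concrete, here is a rough
sketch in plain Java. It is not Lucene code: ClassFieldStats and its methods
are hypothetical stand-ins that model one field's postings as an in-memory
map, so that maxDoc, a per-field document count (the role of
terms.getDocCount()) and a per-term document frequency (the role of
IR.docFreq(term)) can be compared side by side.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of one field's postings. All names here are hypothetical;
// a real index keeps these numbers as stored statistics.
class ClassFieldStats {
    // term (class value) -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final int maxDoc;

    ClassFieldStats(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    public void add(int docId, String classValue) {
        postings.computeIfAbsent(classValue, k -> new TreeSet<>()).add(docId);
    }

    // Counts every document, including those with no class value at all.
    public int maxDoc() {
        return maxDoc;
    }

    // Documents with at least one term in this field. The toy computes it
    // by a union; an index would simply read the stored number.
    public int docCount() {
        Set<Integer> docs = new HashSet<>();
        for (Set<Integer> ids : postings.values()) {
            docs.addAll(ids);
        }
        return docs.size();
    }

    // Documents containing this exact term: a lookup, not a search.
    public int docFreq(String classValue) {
        Set<Integer> ids = postings.get(classValue);
        return ids == null ? 0 : ids.size();
    }

    // Five documents, of which only three carry a class value.
    public static ClassFieldStats example() {
        ClassFieldStats stats = new ClassFieldStats(5);
        stats.add(0, "sports");
        stats.add(1, "sports");
        stats.add(2, "politics");
        return stats;
    }
}
```

In the example, two of the five documents have no class value, so using
maxDoc as the denominator would overstate the number of labelled documents.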

One more piece: if classCount is just a Map<UniqueValues,DocFreq>,
it would be a lot better to just compute this with a TermsEnum,
just iterating over the terms for the field.

It seems the "value" part is not used, so for now it could
just be a HashSet as well?

This would remove the stored fields loop (replacing it with a termsenum
loop), but I think we can probably remove the loop entirely too as
a second step.
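
As a rough illustration of that first step, the sketch below contrasts the
two ways of collecting the classes. It uses plain Java collections rather
than Lucene: the list stands in for a stored-fields loop over every document,
and the sorted map stands in for a terms dictionary that already knows each
term's document frequency. All class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper comparing a per-document scan with a per-term scan.
class ClassEnumeration {

    // Stored-fields style: visit every document and tally its class value.
    // Work grows with the number of documents.
    public static Map<String, Integer> countByScanningDocs(List<String> storedClassPerDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String cls : storedClassPerDoc) {
            if (cls != null) { // documents without a class are skipped
                counts.merge(cls, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Terms-dictionary style: one step per distinct term, whose document
    // frequency is already known. Work grows with the number of classes.
    public static Map<String, Integer> countByIteratingTerms(SortedMap<String, Integer> termToDocFreq) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termToDocFreq.entrySet()) {
            counts.put(e.getKey(), e.getValue());
        }
        return counts;
    }

    // If the counts are unused, the set of class values is enough.
    public static Set<String> classes(SortedMap<String, Integer> termToDocFreq) {
        return new HashSet<>(termToDocFreq.keySet());
    }
}
```

Both methods produce the same counts for consistent input; the difference is
only in how much of the index has to be touched to get them.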

I don't like that assignClass has a loop over all possible terms in the
field, re-tokenizing the doc for each one! 

It seems we don't need this classCount map at all, nor the priors map?

Instead we would just tokenize each doc a single time, and compute the
prior of the terms we find on the fly (it seems to just be IDF anyway
really).

And we wouldn't need any maps for that.
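
A minimal sketch of the "tokenize once, look up statistics on the fly" idea,
in plain Java rather than Lucene. Everything in it is an assumption made for
illustration: the whitespace tokenizer stands in for an Analyzer, the
docFreq function stands in for a lookup against the index, and the IDF-style
weight is a placeholder rather than the scoring used in the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.ToIntFunction;

// Hypothetical sketch: one tokenization pass, no precomputed maps.
class SinglePassScoring {
    // Counts tokenizer invocations so the single pass can be checked.
    public static int tokenizeCalls = 0;

    // Stand-in for an Analyzer: lowercases and splits on whitespace.
    public static List<String> tokenize(String text) {
        tokenizeCalls++;
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // IDF-style weight derived from two statistics; placeholder formula.
    public static double idf(int docFreq, int docCount) {
        return Math.log((double) docCount / (docFreq + 1)) + 1.0;
    }

    // Tokenizes the text once and sums the weight of each token, asking
    // for its document frequency only when the token is encountered.
    public static double score(String text, ToIntFunction<String> docFreq, int docCount) {
        double sum = 0.0;
        for (String token : tokenize(text)) {
            sum += idf(docFreq.applyAsInt(token), docCount);
        }
        return sum;
    }
}
```

The point of the sketch is only the control flow: the document is analyzed
a single time, and nothing is accumulated per class or per term beforehand.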

                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in
> fields so that these can be used as training examples (w/ features) in order
> to very quickly create classifier algorithms to use on new documents and/or
> to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.
