Re: [jira] [Updated] (SOLR-3700) Create a Classification component

Tommaso Teofili Fri, 10 Aug 2012 06:52:46 -0700

Thanks Robert,

2012/8/10 Robert Muir <[email protected]>


> I got the patch before JIRA was down, and just saw another thing:
>
> +  private double countInClassC(String c) throws IOException {
> +    TopDocs topDocs = indexSearcher.search(new TermQuery(new Term(classFieldName, c)), Integer.MAX_VALUE);
> +    int res = 0;
> +    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
> +      Fields termVectors = indexSearcher.getIndexReader().getTermVectors(scoreDoc.doc);
> +      if (termVectors != null) {
> +        res += termVectors.terms(textFieldName).size();
> +      } else {
> +        // TODO : warn about not existing term vectors for field 'textFieldName'
> +      }
> +    }
> +    return res;
> +  }
>
> For this part, I am unsure which statistic you are aiming for:
>
> It seems that it currently takes all documents that have term c in
> field classFieldName, and sums the number of unique terms each doc has
> in field textFieldName?
>

Yes, it is.


>
> If this is really what you want and you need 100% exact numbers, then,
> just like the other computation, I would not do a search with a PQ
> (priority queue) of Integer.MAX_VALUE, but instead just iterate over a
> DocsEnum for that term.
>

I noticed that just after I submitted the patch, but then JIRA was down
again :-)


>
> But if a good approximation is ok, I would do this, which is instant
> and needs no term vectors:
>
>     Terms terms = MultiFields.getTerms(reader, textFieldName);
>     long numPostings = terms.getSumDocFreq(); // number of term/doc pairs
>     // avg # of unique terms per doc
>     double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount();
>     // avg # of unique terms per doc * # docs with c
>     return avgNumberOfUniqueTerms * reader.docFreq(new Term(classFieldName, c));
>
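
To make the difference between the two statistics concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the `UniqueTermStats` class, the `Doc` type and the in-memory list are hypothetical stand-ins for the index, and the variable names only mirror the Lucene statistics mentioned above (`getSumDocFreq()`, `getDocCount()`, `docFreq`). It assumes each document carries exactly one class label.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy in-memory model of the two statistics; NOT the Lucene API. */
public class UniqueTermStats {

  /** Hypothetical document: one class label plus the unique terms of its text field. */
  public static final class Doc {
    final String label;
    final Set<String> uniqueTerms;

    public Doc(String label, String... tokens) {
      this.label = label;
      // A set drops repeated tokens, so size() is the number of unique terms.
      this.uniqueTerms = new HashSet<>(Arrays.asList(tokens));
    }
  }

  /** Exact: sum the number of unique text terms over every doc labelled c. */
  public static double exactCount(List<Doc> docs, String c) {
    long sum = 0;
    for (Doc d : docs) {
      if (d.label.equals(c)) {
        sum += d.uniqueTerms.size();
      }
    }
    return sum;
  }

  /** Approximate: (avg # of unique terms per doc) * (# docs labelled c). */
  public static double approximateCount(List<Doc> docs, String c) {
    long numPostings = 0; // term/doc pairs, analogous to getSumDocFreq()
    int docCount = 0;     // docs with at least one term, analogous to getDocCount()
    int docsWithC = 0;    // analogous to docFreq of the class term
    for (Doc d : docs) {
      numPostings += d.uniqueTerms.size();
      if (!d.uniqueTerms.isEmpty()) {
        docCount++;
      }
      if (d.label.equals(c)) {
        docsWithC++;
      }
    }
    if (docCount == 0) {
      return 0.0; // no text at all, nothing to average
    }
    double avgNumberOfUniqueTerms = numPostings / (double) docCount;
    return avgNumberOfUniqueTerms * docsWithC;
  }

  public static void main(String[] args) {
    List<Doc> docs = Arrays.asList(
        new Doc("sports", "goal", "match", "goal", "team"),   // 3 unique terms
        new Doc("sports", "team", "coach"),                    // 2 unique terms
        new Doc("tech", "cpu", "gpu", "ram", "disk", "bus"));  // 5 unique terms
    System.out.println(exactCount(docs, "sports"));
    System.out.println(approximateCount(docs, "sports"));
  }
}
```

On this toy data the exact sum for "sports" is 5, while the approximation gives 20/3 (about 6.67), because the longer "tech" document pulls the global average up. The approximation is only as good as the assumption that documents of class c have a typical number of unique terms; in exchange it needs only index-level statistics rather than a pass over every matching document.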

This may be good (and much more performant). I'll give it a try, thanks :-)
The NB classifier there is very simplistic and could be much improved (or
at least given parameters / options).
Apart from that, a kNN / MoreLikeThis-based classifier should be fairly easy
to add.
Regards,

Tommaso


>
> On Fri, Aug 10, 2012 at 8:36 AM, Tommaso Teofili (JIRA) <[email protected]>
> wrote:
> >
> >      [
> https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
> >
> > Tommaso Teofili updated SOLR-3700:
> > ----------------------------------
> >
> >     Attachment: SOLR-3700_2.patch
> >
> > new patch incorporating Robert's suggestions (plus added a couple more TODOs)
> >
> >> Create a Classification component
> >> ---------------------------------
> >>
> >>                 Key: SOLR-3700
> >>                 URL: https://issues.apache.org/jira/browse/SOLR-3700
> >>             Project: Solr
> >>          Issue Type: New Feature
> >>            Reporter: Tommaso Teofili
> >>            Priority: Minor
> >>         Attachments: SOLR-3700.patch, SOLR-3700_2.patch
> >>
> >>
> >> Lucene/Solr can host huge sets of documents containing lots of information in fields, so these can be used as training examples (w/ features) to very quickly create classifier algorithms to use on new documents and / or to provide an additional service.
> >> So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent that will use already seen data (the indexed documents / fields) to classify new documents / text fragments.
> >> The first version will contain a (simplistic) Lucene-based Naive Bayes classifier, but more implementations should be added in the future.
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> > For more information on JIRA, see:
> http://www.atlassian.com/software/jira
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>
>
>
> --
> lucidimagination.com
>
