[jira] [Commented] (SOLR-3700) Create a Classification component
[ https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13444827#comment-13444827 ]

Tommaso Teofili commented on SOLR-3700:
---------------------------------------

The suggested snippet for calculating the frequency of terms (from the 'content' field) in docs with a certain class is OK, except that the terms should be extracted from the text field instead of the class field, and the docFreq should be counted on the class field:

{code}
Terms terms = MultiFields.getTerms(atomicReader, textFieldName);
long numPostings = terms.getSumDocFreq(); // number of term/doc pairs
double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount(); // avg # of unique terms per doc
int docsWithC = atomicReader.docFreq(classFieldName, new BytesRef(c));
return avgNumberOfUniqueTerms * docsWithC; // avg # of unique terms in text field per doc * # docs with c
{code}

Compared with the previous (slow) ranked search, which gave an output of 92, this gives an estimated output of ~98.6, which seems reasonable.

Create a Classification component
---------------------------------

Key: SOLR-3700
URL: https://issues.apache.org/jira/browse/SOLR-3700
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
Attachments: SOLR-3700_2.patch, SOLR-3700.patch

Lucene/Solr can host huge sets of documents containing lots of information in fields, so these can be used as training examples (w/ features) in order to very quickly create classifier algorithms to use on new documents and / or to provide an additional service.
So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent that will use already seen data (the indexed documents / fields) to classify new documents / text fragments.
The first version will contain a (simplistic) Lucene-based Naive Bayes classifier, but more implementations should be added in the future.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
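
The estimate in the comment above (average number of unique terms per document, multiplied by the number of documents carrying class c) can be checked on a toy corpus. The following is only a sketch under stated assumptions: it uses plain in-memory collections instead of a Lucene index, and the names ClassTermCountEstimate, Doc, sampleDocs, exactCount and estimatedCount are hypothetical, not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy, in-memory illustration of the estimate; no Lucene index is involved. */
class ClassTermCountEstimate {

    /** Hypothetical document model: a class label plus the unique whitespace-separated terms. */
    static final class Doc {
        final String label;
        final Set<String> uniqueTerms;

        Doc(String label, String text) {
            this.label = label;
            this.uniqueTerms = new HashSet<>(Arrays.asList(text.split("\\s+")));
        }
    }

    /** Made-up sample data, used only for illustration. */
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("sports", "the match ended in a draw"),
            new Doc("sports", "the team won the cup"),
            new Doc("politics", "the vote was postponed again"),
            new Doc("politics", "parliament passed the new budget bill"));
    }

    /** Exact value: sum of unique terms over the docs labelled c (the slow way). */
    static long exactCount(List<Doc> docs, String c) {
        long total = 0;
        for (Doc d : docs) {
            if (d.label.equals(c)) {
                total += d.uniqueTerms.size();
            }
        }
        return total;
    }

    /** Estimate: (term/doc pairs over all docs / number of docs) * number of docs labelled c. */
    static double estimatedCount(List<Doc> docs, String c) {
        if (docs.isEmpty()) {
            return 0.0; // avoid 0/0 on an empty corpus
        }
        long numPostings = 0; // plays the role of terms.getSumDocFreq()
        long docsWithC = 0;   // plays the role of docFreq(classFieldName, c)
        for (Doc d : docs) {
            numPostings += d.uniqueTerms.size();
            if (d.label.equals(c)) {
                docsWithC++;
            }
        }
        // docs.size() stands in for terms.getDocCount(); here every doc has text
        double avgNumberOfUniqueTerms = numPostings / (double) docs.size();
        return avgNumberOfUniqueTerms * docsWithC;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        System.out.println("exact:    " + exactCount(docs, "sports"));
        System.out.println("estimate: " + estimatedCount(docs, "sports"));
    }
}
```

On this toy corpus the exact count for "sports" is 10 and the estimate is 10.5. The two values differ because the average is taken over all documents, not only those in class c, which is the same kind of gap as 92 versus ~98.6 reported in the comment.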
[jira] [Commented] (SOLR-3700) Create a Classification component
[ https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13444837#comment-13444837 ]

Shai Erera commented on SOLR-3700:
----------------------------------

Is there any reason not to develop it as a Lucene module? I haven't looked at the patch, but if it's not Solr-specific and doesn't depend on the Solr API, perhaps we can make this issue a LUCENE- one? I see no reason why such a module should be available to Solr users only, unless you plan to depend on the Solr API, in which case I will not slow down your development by insisting it become a Lucene module.
[jira] [Commented] (SOLR-3700) Create a Classification component
[ https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13444841#comment-13444841 ]

Chris Male commented on SOLR-3700:
----------------------------------

bq. Is there any reason not to develop it as a Lucene module? I haven't looked at the patch, but if it's not Solr-specific, or depends on Solr API, perhaps we can make this issue a LUCENE- one?

+1
[jira] [Commented] (SOLR-3700) Create a Classification component
[ https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13444843#comment-13444843 ]

Tommaso Teofili commented on SOLR-3700:
---------------------------------------

The patch is Solr-specific only because it uses a Solr RequestHandler to expose the Classifier interface. Also, the idea is that more classifier implementations (e.g. based on MLT) may be plugged in, so I thought Solr was a good place for it. However, it's OK for me to put this into a Lucene module and then add only the needed Solr-specific bindings to use it in Solr.
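
The split proposed in the comment above (a classifier contract in a core module, plus a thin server-side binding) might look roughly like the sketch below. Everything here is an assumption made for illustration: TextClassifier, KeywordClassifier and ClassificationRequestAdapter are hypothetical names, the keyword lookup is a stand-in for a real index-backed classifier, and a plain Map replaces Solr's request and response types, so no Lucene or Solr API is shown.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the module split; none of these names come from the patch. */
class ClassifierModuleSketch {

    /** Core-module contract: no dependency on any search-server API. */
    interface TextClassifier {
        String assignClass(String text);
    }

    /** Stand-in implementation; a real one would consult the index (e.g. Naive Bayes). */
    static final class KeywordClassifier implements TextClassifier {
        private final Map<String, String> keywordToClass = new HashMap<>();

        KeywordClassifier add(String keyword, String clazz) {
            keywordToClass.put(keyword, clazz);
            return this;
        }

        @Override
        public String assignClass(String text) {
            // Return the class of the first token that is a known keyword.
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                String clazz = keywordToClass.get(token);
                if (clazz != null) {
                    return clazz;
                }
            }
            return "unknown";
        }
    }

    /** Server-side binding: turns request parameters into a classifier call. */
    static final class ClassificationRequestAdapter {
        private final TextClassifier classifier;

        ClassificationRequestAdapter(TextClassifier classifier) {
            this.classifier = classifier;
        }

        Map<String, String> handle(Map<String, String> params) {
            Map<String, String> response = new HashMap<>();
            response.put("class", classifier.assignClass(params.getOrDefault("text", "")));
            return response;
        }
    }

    public static void main(String[] args) {
        TextClassifier classifier = new KeywordClassifier()
            .add("budget", "politics")
            .add("cup", "sports");
        Map<String, String> params = new HashMap<>();
        params.put("text", "The budget vote");
        System.out.println(new ClassificationRequestAdapter(classifier).handle(params));
    }
}
```

With this shape, only the adapter would need to know about the hosting server, and other classifier implementations could be plugged in behind the same interface.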
[jira] [Commented] (SOLR-3700) Create a Classification component
[ https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13444854#comment-13444854 ]

Shai Erera commented on SOLR-3700:
----------------------------------

bq. however it's ok for me to put this into a Lucene module and then add only the needed Solr specific bindings to use it in Solr

if it doesn't complicate matters for you, then it will be great if you can do that !