[jira] [Commented] (SOLR-3700) Create a Classification component

2012-08-30 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444827#comment-13444827
 ] 

Tommaso Teofili commented on SOLR-3700:
---

The suggested snippet for calculating the frequency of terms (from the 
'content' field) in docs with a certain class is OK, except that the terms 
should be extracted from the text field instead of the class field, and the 
docFreq should be counted on the class field:
{code}
Terms terms = MultiFields.getTerms(atomicReader, textFieldName);
// number of term/doc pairs
long numPostings = terms.getSumDocFreq();
// avg # of unique terms per doc
double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount();
int docsWithC = atomicReader.docFreq(classFieldName, new BytesRef(c));
// avg # of unique terms in text field per doc * # docs with c
return avgNumberOfUniqueTerms * docsWithC;
{code}
Compared with the previous (slow) ranked search, which gave an output of 92, 
this gives an estimated output of ~98.6, which seems reasonable.
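
To make the arithmetic behind the estimate concrete, here is a minimal sketch 
on a toy in-memory corpus, with no Lucene involved. The class 
TermCountEstimate, its methods and the sample documents are all hypothetical 
and exist only for illustration; the sum over all docs stands in for 
getSumDocFreq(), the doc count for getDocCount(), and the per-class doc count 
for docFreq() on the class field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of the estimate; all names and data here are hypothetical.
class TermCountEstimate {

    /** A document with a class label and whitespace-separated text. */
    static final class Doc {
        final String label;
        final String text;
        Doc(String label, String text) { this.label = label; this.text = text; }
    }

    /** Number of distinct terms in one (non-empty) text. */
    static int uniqueTerms(String text) {
        Set<String> terms = new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
        return terms.size();
    }

    /** Exact value: term/doc pairs summed over the docs labeled c (the slow way). */
    static long exact(List<Doc> docs, String c) {
        long sum = 0;
        for (Doc d : docs) {
            if (d.label.equals(c)) sum += uniqueTerms(d.text);
        }
        return sum;
    }

    /** Estimate: avg # of unique terms per doc * # docs labeled c. */
    static double estimate(List<Doc> docs, String c) {
        if (docs.isEmpty()) return 0.0;
        long numPostings = 0; // term/doc pairs over the whole corpus
        int docsWithC = 0;    // docs carrying class c
        for (Doc d : docs) {
            numPostings += uniqueTerms(d.text);
            if (d.label.equals(c)) docsWithC++;
        }
        double avgNumberOfUniqueTerms = numPostings / (double) docs.size();
        return avgNumberOfUniqueTerms * docsWithC;
    }

    /** Made-up sample corpus: two "sports" docs and two "tech" docs. */
    static List<Doc> sample() {
        return Arrays.asList(
            new Doc("sports", "the match was a great match"),
            new Doc("sports", "great goal in the final"),
            new Doc("tech", "new search index released"),
            new Doc("tech", "the index is fast"));
    }

    public static void main(String[] args) {
        List<Doc> docs = sample();
        System.out.println("exact:    " + exact(docs, "sports"));
        System.out.println("estimate: " + estimate(docs, "sports"));
    }
}
```

On this toy corpus the exact count for "sports" is 10 and the estimate is 9.0: 
the estimate only needs corpus-wide statistics plus one docFreq lookup, at the 
cost of assuming docs of every class have a similar number of unique terms.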

 Create a Classification component
 -

 Key: SOLR-3700
 URL: https://issues.apache.org/jira/browse/SOLR-3700
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
 Attachments: SOLR-3700_2.patch, SOLR-3700.patch


 Lucene/Solr can host huge sets of documents containing lots of information in 
 fields, so these can be used as training examples (w/ features) in order to 
 very quickly create classifier algorithms to use on new documents and / or to 
 provide an additional service.
 So the idea is to create a contrib module (called 'classification') to host a 
 ClassificationComponent that will use already seen data (the indexed 
 documents / fields) to classify new documents / text fragments.
 The first version will contain a (simplistic) Lucene-based Naive Bayes 
 classifier, but more implementations should be added in the future.
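
As a rough illustration of what classifying new text from already seen data 
can look like, here is a minimal multinomial Naive Bayes sketch with add-one 
smoothing. It keeps its counts in in-memory maps instead of reading a Lucene 
index, and the class ToyNaiveBayes and its methods are hypothetical; it is not 
the classifier from the attached patches.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory Naive Bayes; not the implementation from the patch.
class ToyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases and splits on whitespace; empty text yields no tokens. */
    static String[] tokenize(String text) {
        String t = text.toLowerCase().trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /** Records one training example (an "already seen" document). */
    void train(String label, String text) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
            termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String term : tokenize(text)) {
            counts.merge(term, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    /** Returns the class with the highest log-score, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log prior: share of training docs carrying this class
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int total = tokensPerClass.getOrDefault(label, 0);
            // add-one smoothing; the extra +1 reserves a slot for unseen terms
            double denominator = total + vocabulary.size() + 1;
            for (String term : tokenize(text)) {
                int tf = counts.getOrDefault(term, 0);
                score += Math.log((tf + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

After training on a few labeled texts, classify() returns the label whose 
prior and smoothed term likelihoods give the highest score for the new text.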

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3700) Create a Classification component

2012-08-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444837#comment-13444837
 ] 

Shai Erera commented on SOLR-3700:
--

Is there any reason not to develop it as a Lucene module? I haven't looked at 
the patch, but if it's not Solr-specific and doesn't depend on the Solr API, 
perhaps we can make this issue a LUCENE- one?

I see no reason for such a module to be available to Solr users only, unless 
you plan to depend on the Solr API, in which case I will not slow down your 
development by insisting it becomes a Lucene module.




[jira] [Commented] (SOLR-3700) Create a Classification component

2012-08-30 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444841#comment-13444841
 ] 

Chris Male commented on SOLR-3700:
--

bq. Is there any reason not to develop it as a Lucene module? I haven't looked 
at the patch, but if it's not Solr-specific and doesn't depend on the Solr API, 
perhaps we can make this issue a LUCENE- one?

+1




[jira] [Commented] (SOLR-3700) Create a Classification component

2012-08-30 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444843#comment-13444843
 ] 

Tommaso Teofili commented on SOLR-3700:
---

The patch is Solr-specific only because it uses a Solr RequestHandler to expose 
the Classifier interface. The idea is also that more classifier implementations 
(e.g. based on MLT) may be plugged in, so I thought Solr was a good place for 
it; however it's ok for me to put this into a Lucene module and then add only 
the needed Solr specific bindings to use it in Solr.
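
A rough sketch of that split, using plain Java with no Lucene or Solr types: 
the core interface and its implementations carry no Solr dependency, and a thin 
binding adapts request parameters to them. The assignClass method shape, 
KeywordClassifier, ClassificationBinding and the "text" / "class" keys are all 
assumptions made for illustration, not taken from the patch or from the Solr 
API.

```java
import java.util.HashMap;
import java.util.Map;

/** Core side: would live in the Lucene-level module and know nothing of Solr. */
interface Classifier {
    String assignClass(String text);
}

/** Stand-in implementation so the sketch runs; a real one would use the index. */
class KeywordClassifier implements Classifier {
    @Override
    public String assignClass(String text) {
        return text.toLowerCase().contains("index") ? "tech" : "other";
    }
}

/** Binding side: a thin adapter playing the role of the Solr-specific layer. */
class ClassificationBinding {
    private final Classifier classifier;

    ClassificationBinding(Classifier classifier) {
        this.classifier = classifier;
    }

    /** Maps hypothetical request parameters to a hypothetical response map. */
    Map<String, String> handle(Map<String, String> params) {
        Map<String, String> response = new HashMap<>();
        String text = params.get("text");
        if (text == null) {
            response.put("error", "missing 'text' parameter");
            return response;
        }
        response.put("class", classifier.assignClass(text));
        return response;
    }
}
```

Because the binding only depends on the interface, other implementations (for 
example an MLT-based one) could be plugged in without touching the adapter.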




[jira] [Commented] (SOLR-3700) Create a Classification component

2012-08-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444854#comment-13444854
 ] 

Shai Erera commented on SOLR-3700:
--

bq. however it's ok for me to put this into a Lucene module and then add only 
the needed Solr specific bindings to use it in Solr

If it doesn't complicate matters for you, it would be great if you could do 
that!
