[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454729#comment-13454729
 ] 

Simon Willnauer commented on LUCENE-4345:
-----------------------------------------

hey tommaso, 

I just briefly skimmed through your latest patch and I have a bunch of comments:

* I agree with Robert that you should build a small inverted index instead of 
retokenizing. I'd use a BytesRefHash with a parallel array, as we do during 
indexing. If you have trouble with this, I am happy to update your patch and 
give you an example.
* I suggest moving the termsEnum.next() call into the while() condition, like 
while((next = termsEnum.next()) != null), for consistency (in assignClass).
* Can you use BytesRef for fieldNames to save the conversion every time?
* Instead of specifying the document as a String you should rather use 
IndexableField and in turn pull the tokenstream from 
IndexableField#tokenStream(Analyzer)
* I didn't see a reason why you use Double instead of double (primitive) as 
return values, I think the boxing is unnecessary
* in assignClass can't you reuse the BytesRef returned from the termsEnum for 
further processing instead of converting it to a string?
* in getWordFreqForClass you might want to use TotalHitCountCollector since you 
are only interested in the number of hits. That collector will not score or 
collect any documents at all and is way less complex than the default 
TopDocsCollector
* I have trouble understanding why the interface expects an atomic reader here. 
From my perspective you should handle the per-segment aspect internally and 
just use IndexReader in the interface instead.
* The interface you defined has some problems with respect to multi-threading 
IMO. The interface itself suggests that this class is stateful: you have to 
call methods in a certain order, and at the same time you need to make sure 
that it is not published for read access before training is done. I think it 
would be wise to pass in all needed objects as constructor arguments and make 
the references final so it can be shared across threads, and to add an 
interface that represents the trained model computed offline. In this case it 
doesn't really matter, but in the future it might make sense. We can also skip 
the model interface entirely and remove the training method until we have some 
impls that really need to be trained.
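
To make the last point concrete, here is a rough sketch of what I mean by
constructor arguments plus final references. It uses only the JDK, not Lucene,
and every name in it (ImmutableClassifier, termCountsPerClass, smoothing) is
made up for illustration and not taken from the patch. The scoring is a plain
smoothed Naive Bayes with a uniform prior, which is an assumption on my side.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is not the API from the patch.
final class ImmutableClassifier {
  // All state is final and set once in the constructor, so a fully
  // constructed instance can be shared across threads.
  private final Map<String, Map<String, Integer>> termCountsPerClass;
  private final double smoothing;

  ImmutableClassifier(Map<String, Map<String, Integer>> termCountsPerClass,
                      double smoothing) {
    // Defensive deep copy so later changes by the caller cannot leak in.
    Map<String, Map<String, Integer>> copy = new HashMap<>();
    for (Map.Entry<String, Map<String, Integer>> e : termCountsPerClass.entrySet()) {
      copy.put(e.getKey(), Collections.unmodifiableMap(new HashMap<>(e.getValue())));
    }
    this.termCountsPerClass = Collections.unmodifiableMap(copy);
    this.smoothing = smoothing;
  }

  // Reads only final state. Returns the class with the highest smoothed
  // log-likelihood for the tokens (uniform class prior assumed),
  // or null if no classes were supplied.
  String assignClass(String[] tokens) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Map<String, Integer>> e : termCountsPerClass.entrySet()) {
      Map<String, Integer> counts = e.getValue();
      long total = 0;
      for (int c : counts.values()) {
        total += c;
      }
      double score = 0.0;
      for (String t : tokens) {
        int c = counts.getOrDefault(t, 0);
        score += Math.log((c + smoothing) / (total + smoothing * (counts.size() + 1)));
      }
      if (score > bestScore) {
        bestScore = score;
        best = e.getKey();
      }
    }
    return best;
  }
}
```

There is no train() method and no required call order here: once the
constructor returns, the object never changes, so it can be handed to any
number of threads.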

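On the small inverted index from my first point: the layout I have in mind
maps each distinct term to a dense int id and keeps the per-term data in an
array indexed by that id. The sketch below is a JDK-only stand-in that uses a
HashMap of Strings where the real thing would use BytesRefHash over bytes, so
treat TinyTermIndex and its methods as hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Stand-in for the "hash plus parallel array" layout. Each distinct term
// gets a dense int id, and freqs[id] holds its count. Hypothetical names.
final class TinyTermIndex {
  private final Map<String, Integer> termIds = new HashMap<>();
  private int[] freqs = new int[8];

  // Registers one token occurrence, growing the parallel array on demand.
  void add(String term) {
    Integer existing = termIds.get(term);
    int id;
    if (existing == null) {
      id = termIds.size();            // next free dense id
      termIds.put(term, id);
      if (id == freqs.length) {
        freqs = Arrays.copyOf(freqs, freqs.length * 2);
      }
    } else {
      id = existing;
    }
    freqs[id]++;
  }

  // Frequency of a term, or 0 if it was never added.
  int freq(String term) {
    Integer id = termIds.get(term);
    return id == null ? 0 : freqs[id];
  }

  // Number of distinct terms seen so far.
  int size() {
    return termIds.size();
  }
}
```

The document is tokenized once, each token goes through add(), and later
lookups are a hash probe plus an array read instead of another analysis pass.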

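For the loop and the Double vs. double points, this is the shape I would aim
for. TermCursor is a made-up, JDK-only stand-in for a cursor whose next()
returns null when it is exhausted (the way termsEnum.next() is used in the
loop above); it is not the Lucene class, and averageTermLength is just a
placeholder computation.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an enum-style cursor whose next() returns
// null once the terms are exhausted.
final class TermCursor {
  private final Iterator<String> it;

  TermCursor(List<String> terms) {
    this.it = terms.iterator();
  }

  String next() {
    return it.hasNext() ? it.next() : null;
  }
}

final class LoopIdiom {
  // Returns a primitive double, so nothing is boxed per call.
  static double averageTermLength(TermCursor cursor) {
    long chars = 0;
    int count = 0;
    String next;
    // Advance and null-check in the loop condition itself.
    while ((next = cursor.next()) != null) {
      chars += next.length();
      count++;
    }
    return count == 0 ? 0.0 : (double) chars / count;
  }
}
```

Keeping the advance inside the while condition means there is exactly one
place where the cursor moves, and the primitive return type avoids allocating
a Double that callers would unbox right away.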
                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields so that these can be used as training examples (w/ features) in order 
> to very quickly create classifier algorithms to use on new documents and / 
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.
