[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454802#comment-13454802
 ] 

Tommaso Teofili commented on LUCENE-4345:
-----------------------------------------

bq. I agree with robert you should build a small inverted index instead of 
retokenizing. I'd use a BytesRefHash with a parallel array as we use during 
indexing, if you have trouble with this I am happy to update your patch and 
give you an example.

+1, it's not trouble, I just haven't had much time in the last few days to work on it; 
however, if you have time that would surely be nice (I've committed this on trunk at 
r1384219 so it should be easier to update the patch).
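
For reference, I guess that boils down to something like the following rough, untested 
sketch (field/variable names are placeholders, imports and the exact attribute plumbing 
are omitted):

{code:java}
// rough sketch: tokenize the unseen text once and keep per-term counts in a
// BytesRefHash plus a parallel int[] (as done during indexing), instead of
// re-tokenizing the text for every class
BytesRefHash termHash = new BytesRefHash();
int[] freqs = new int[16];
TokenStream tokenStream = analyzer.tokenStream(textFieldName, new StringReader(text));
TermToBytesRefAttribute termAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
  termAtt.fillBytesRef();
  int termId = termHash.add(termAtt.getBytesRef());
  if (termId < 0) {
    termId = -termId - 1; // BytesRefHash returns -(id + 1) for already seen terms
  }
  freqs = ArrayUtil.grow(freqs, termId + 1);
  freqs[termId]++;
}
tokenStream.end();
tokenStream.close();
{code}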

bq. I suggest to move the termsEnum.next() into the while() part like 
while((next = termsEnum.next()) != null) for consistency (in assignClass)

sure
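
i.e. it'd become:

{code:java}
BytesRef next;
while ((next = termsEnum.next()) != null) {
  // ... process next ...
}
{code}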

bq. Can you use BytesRef for fieldNames to save the conversion every time.

actually this depends on who is calling the Classifier: if, for example, it's Solr then 
it makes sense to keep Strings, but since this lives in Lucene it may make sense to use 
BytesRefs; however, from an API point of view I'm not sure which would be better.

bq. Instead of specifying the document as a String you should rather use 
IndexableField and in turn pull the tokenstream from 
IndexableField#tokenStream(Analyzer)

I think this is not always ok, as often the document to be classified is not supposed to 
be in the index immediately but _may_ get indexed right after the classification; 
however, we could provide that as an additional method.
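
For instance, something like this hypothetical extra overload (names and return type are 
just placeholders, not the committed API):

{code:java}
// hypothetical overload next to the existing String based assignClass, for
// documents that are already represented as an IndexableField
public ClassificationResult assignClass(IndexableField doc) throws IOException {
  TokenStream tokenStream = doc.tokenStream(analyzer);
  return assignClass(tokenStream); // hypothetical shared path also used by the String based method
}
{code}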

bq. I didn't see a reason why you use Double instead of double (primitive) as 
return values, I think the boxing is unnecessary

yes, I agree.

bq. in assignClass can't you reuse the BytesRef returned from the termsEnum for 
further processing instead of converting it to a string?

actually, after reading your comment above I realized converting to a String is not a 
good idea, so I'll change the methods (#calculateLikelihood and #calculatePrior) to use 
BytesRef rather than String.
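
Roughly like this (just a sketch; the exact signatures and the candidate class variable 
'c' are placeholders):

{code:java}
// keep the BytesRef returned by the TermsEnum and pass it straight through,
// instead of calling utf8ToString() inside the loop
BytesRef term;
while ((term = termsEnum.next()) != null) {
  double likelihood = calculateLikelihood(term, c); // 'c' being the candidate class
  // ... combine the likelihoods as before ...
}
{code}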

bq. in getWordFreqForClass you might want to use TotalHitCountCollector since 
you are only interested in the number of hits. That collector will not score or 
collect any documents at all and is way less complex than the default 
TopDocsCollector

very good point, thanks :)
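
I'll switch to something like this (sketch; the query/searcher variable names in 
getWordFreqForClass are just placeholders):

{code:java}
// only the hit count is needed, so nothing gets scored or collected
TotalHitCountCollector collector = new TotalHitCountCollector();
indexSearcher.search(query, collector);
int hits = collector.getTotalHits();
{code}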

bq. I have trouble to understand why the interface expects an atomic reader 
here. From my perspective you should handle per-segment aspect internally and 
instead just use IndexReader in the interface.

as a first implementation I thought it made sense to keep out the complexity of 
explicitly doing distributed probability calculations; also, AtomicReaders expose more 
internals that can be leveraged in a classification algorithm.

bq. The interface you defined has some problems with respect to Multi-Threading 
IMO. The interface itself suggests that this class is stateful and you have to 
call methods in a certain order and at the same you need to make sure that it 
is not published for read access before training is done. 

it'd raise an exception if #assignClass() is called before #train()
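
e.g. something along these lines (just a sketch; the field name and exception type may 
differ in the patch):

{code:java}
// fail fast in assignClass if train() hasn't been called yet
if (atomicReader == null) {
  throw new IllegalStateException("you must first call Classifier#train()");
}
{code}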

bq. I think it would be wise to pass in all needed objects as constructor 
arguments and make the references final so it can be shared across threads and 
add an interface that represents the trained model computed offline? In this 
case it doesn't really matter but in the future it might make sense. We can 
also skip the model interface entirely and remove the training method until we 
have some impls that really need to be trained.

I'm +1 for making the references final; I added the #train() method so that a Classifier 
could be trained multiple times. In this implementation that doesn't make much 
difference, but that may not be the case for other implementations.
Therefore we could (maybe should) mark this API _@experimental_ and just evolve it from 
the different implementations we have, so moving the parameters to the constructor may 
be a nice idea here.
On the other hand, removing the #train() method from the API would remove any reference 
to Lucene APIs from the Classifier interface, raising the question of whether that's too 
generic.
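
Just to make the constructor idea concrete, a hypothetical shape (not what's currently 
committed at r1384219):

{code:java}
public class SimpleNaiveBayesClassifier /* implements Classifier */ {

  // all references final, so an instance can safely be shared across threads
  private final AtomicReader atomicReader;
  private final Analyzer analyzer;
  private final String textFieldName;
  private final String classFieldName;

  public SimpleNaiveBayesClassifier(AtomicReader atomicReader, Analyzer analyzer,
                                    String textFieldName, String classFieldName) {
    this.atomicReader = atomicReader;
    this.analyzer = analyzer;
    this.textFieldName = textFieldName;
    this.classFieldName = classFieldName;
  }

  // assignClass(...) and the rest omitted; #train() could then be dropped,
  // or kept to allow re-training on a different reader
}
{code}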
                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields so that these can be used as training examples (w/ features) in order 
> to very quickly create classifier algorithms to use on new documents and / 
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
