[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-11-29 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506317#comment-13506317
 ] 

Commit Tag Bot commented on LUCENE-4345:


[trunk commit] Uwe Schindler
http://svn.apache.org/viewvc?view=revision&revision=1415074

LUCENE-4345: Fix forbidden APIs and make the test more predictable



> Create a Classification module
> --
>
> Key: LUCENE-4345
> URL: https://issues.apache.org/jira/browse/LUCENE-4345
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Minor
> Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields so that these can be used as training examples (w/ features) in order 
> to very quickly create classifier algorithms to use on new documents and / 
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-12-10 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527765#comment-13527765
 ] 

Commit Tag Bot commented on LUCENE-4345:


[trunk commit] Tommaso Teofili
http://svn.apache.org/viewvc?view=revision&revision=1419258

[LUCENE-4345] - improved DS performance by doing commits only once





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-12-21 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538691#comment-13538691
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

I think I'll resolve this and make further improvements / additions in 
separate, more fine-grained issues.

> Create a Classification module
> --
>
> Key: LUCENE-4345
> URL: https://issues.apache.org/jira/browse/LUCENE-4345
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields so that these can be used as training examples (w/ features) in order 
> to very quickly create classifier algorithms to use on new documents and / 
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-10-23 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482448#comment-13482448
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

I've just committed some slight improvements to testing and a basic MLT-based 
kNearestNeighbor classifier (with a bunch of TODOs). Comments are welcome :)




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-10-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483117#comment-13483117
 ] 

Michael McCandless commented on LUCENE-4345:


The builds have been failing because some methods are missing javadocs:
{noformat}
-documentation-lint:
 [echo] Checking for broken links...
 [exec]
 [exec] Crawl/parse...
 [exec]
 [exec] Verify...
 [echo] Checking for missing docs...
 [exec]
 [exec] build/docs/classification/org/apache/lucene/classification/ClassificationResult.html
 [exec]   missing Constructors: ClassificationResult(java.lang.String, double)
 [exec]   missing Methods: getAssignedClass()
 [exec]   missing Methods: getScore()
 [exec]
 [exec] build/docs/classification/org/apache/lucene/classification/KNearestNeighborClassifier.html
 [exec]   missing Constructors: KNearestNeighborClassifier(int)
 [exec]
 [exec] Missing javadocs were found!
{noformat}





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-10-24 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483234#comment-13483234
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

Thanks Michael, it should be fixed now.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-08-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445830#comment-13445830
 ] 

Robert Muir commented on LUCENE-4345:
-

docsWithClassSize should ideally be terms.getDocCount() for the field as well
rather than maxDoc.

docCount() should not do a search, instead I think it should just return 
IR.docFreq(term) ?

One more piece: if classCount is just a Map,
it would be a lot better to just compute this with a TermsEnum,
just iterating over the terms for the field.

It seems the "value" part is not used, so for now it could be
just a HashSet as well?

This would remove the stored fields loop (replacing it with a TermsEnum
loop), but I think we can probably remove the loop entirely too as
a second step.

I don't like that assignClass has a loop over all possible terms in the
field, re-tokenizing the doc for each one! 

It seems we don't need this classCount map at all, nor the priors map?

Instead we would just tokenize each doc a single time, and compute the prior of 
the terms we find on the fly (it seems to just be IDF anyway really).

And we wouldn't need any maps for that.
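
A minimal sketch of the "compute the prior on the fly" idea, assuming the two
index statistics mentioned above are available. The ClassFieldStats interface
is a hypothetical stand-in for them (something like IR.docFreq(term) and
terms.getDocCount()); it is not a Lucene type and not the code in the patch.

```java
public class PriorSketch {

    /** Hypothetical stand-in for the index statistics discussed above. */
    interface ClassFieldStats {
        /** Number of docs whose class field contains the given value. */
        int docFreq(String classValue);

        /** Number of docs that have any value in the class field. */
        int docCount();
    }

    /** Prior of a class, computed from statistics alone, with no map kept around. */
    static double prior(ClassFieldStats stats, String classValue) {
        int docCount = stats.docCount();
        if (docCount == 0) {
            return 0.0; // empty index: no evidence for any class
        }
        return (double) stats.docFreq(classValue) / docCount;
    }
}
```

Dividing by the per-field doc count rather than maxDoc keeps the prior correct
when some documents have no value in the class field.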





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-08-31 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445867#comment-13445867
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

bq. docsWithClassSize should ideally be terms.getDocCount() for the field as 
well rather than maxDoc.

yep, the early assumption here was that all the docs have a value for the class 
field but your suggestion is good.

bq. docCount() should not do a search, instead I think it should just return 
IR.docFreq(term) ?

correct

bq. it seems we dont need this classCount map at all, nor the priors map?

Yes and no: having the priors map slows the training phase (each time it needs 
to recompute the priors for all the classes), but speeds up the classification 
of unseen text (it's a cache in the end). With regard to classCount, I agree 
with you that it could easily be replaced (with TermsEnum).

bq. Instead we would just tokenize each doc a single time, and compute the 
prior of the terms we find on the fly (it seems to just be IDF anyway really).

You mean because the likelihood calculation tokenizes the same doc multiple 
times (|terms in the class field| times), right? That would surely be good; 
I'll work on improving that.

Thanks Robert :)





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-08-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445884#comment-13445884
 ] 

Robert Muir commented on LUCENE-4345:
-

{quote}
Yes and no: having the priors map slows the training phase (each time it needs 
to recompute the priors for all the classes), but speeds up the classification 
of unseen text (it's a cache in the end). With regard to classCount, I agree 
with you that it could easily be replaced (with TermsEnum).
{quote}

My concern here is that if the # of terms is large, it's a lot of RAM too. We 
can see though; I think tokenizing the doc so many times today is
actually the slowest part. But we can move to TermsEnum as a step, just an 
iteration :)

{quote}
You mean because the likelihood calculation tokenizes the same doc multiple 
times (|terms in the class field| times), right? That would surely be good; 
I'll work on improving that.
{quote}

Exactly. Basically I was thinking that in the short term, let's remove the 
extra loop, as an iteration.

Long term, I think we would not need the maps and could just call docFreq on 
the terms from the term dictionary on the fly here.
While this sounds like a lot of docFreq calls, I am not so sure. It seems the 
larger formula is looking for a max() here?

So we could consider performance-driven heuristics/approximations like 
MoreLikeThis does, based on things like local
term frequency within the document/term length, whatever, to save on docFreq() 
calls, if it makes sense (I have to look at the formula in more detail here).

In that case, instead of consuming the tokenStream as an array, it probably 
makes more sense to consume it into a Map
so we have a little 'inverted index' for the doc. The current code, given a 
word that appears many times in the document,
will do many computations when instead we could really just work across the 
unique terms within the document.
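
A minimal sketch of the per-document 'inverted index' described above, in
plain Java. It assumes the analyzed tokens are already available as a String
array (a stand-in for consuming a TokenStream); the class and method names are
hypothetical and not taken from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocTermCounts {

    /**
     * Collapses a token sequence into term -> frequency-within-document,
     * preserving the order in which terms first appear.
     */
    static Map<String, Integer> countTerms(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"lucene", "is", "fast", "lucene", "is", "open"};
        Map<String, Integer> counts = countTerms(tokens);
        // Any per-term work (for example a docFreq lookup) now runs once per
        // unique term instead of once per token occurrence.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " x" + e.getValue());
        }
    }
}
```

With six tokens but only four unique terms, the per-term loop does four
iterations; the local frequency is still available for weighting or for
MoreLikeThis-style cutoffs.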





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-02 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447052#comment-13447052
 ] 

Lance Norskog commented on LUCENE-4345:
---

Nice! I've found that filtering for nouns & verbs makes another NLP task 
(latent semantic indexing) work much better. This will benefit from 
parts-of-speech filtering.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-03 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447240#comment-13447240
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

bq. This will benefit from parts-of-speech filtering.

sure, and that can be done by passing the correct Analyzer to the 
Classifier#train() method.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-03 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447243#comment-13447243
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

bq. So we could consider performance-driven heuristics/approximations like 
MoreLikeThis does based on things like local term frequency within the 
document/term length, whatever to save on docFreq() calls, if it makes sense (i 
have to look at the formula in more detail here).

The generic formula is _C = argmax( P(doc|class) * P(class) )_ , I agree it 
makes sense to incrementally see if we can find good heuristics / 
approximations which lower the computational cost of this calculation.
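
A minimal sketch of that argmax in plain Java, for illustration only. It
assumes the usual naive independence assumption (P(doc|class) as a product of
per-term probabilities), works in log space to avoid underflow, and uses
add-one smoothing; those choices, the in-memory count maps and all names are
assumptions of this sketch, not the implementation in the patch.

```java
import java.util.Map;

public class NaiveBayesSketch {

    /** log P(class) + sum over tokens of log P(token | class), add-one smoothed. */
    static double logScore(String[] tokens, Map<String, Integer> termCountsInClass,
                           int docsInClass, int totalDocs, int vocabularySize) {
        int totalTermsInClass = 0;
        for (int c : termCountsInClass.values()) {
            totalTermsInClass += c;
        }
        double score = Math.log((double) docsInClass / totalDocs);
        for (String token : tokens) {
            int count = termCountsInClass.getOrDefault(token, 0);
            score += Math.log((count + 1.0) / (totalTermsInClass + vocabularySize));
        }
        return score;
    }

    /** Returns the class with the highest log score, or null if there are no classes. */
    static String assignClass(String[] tokens,
                              Map<String, Map<String, Integer>> termCountsPerClass,
                              Map<String, Integer> docsPerClass,
                              int vocabularySize) {
        int totalDocs = 0;
        for (int n : docsPerClass.values()) {
            totalDocs += n;
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : docsPerClass.entrySet()) {
            double s = logScore(tokens, termCountsPerClass.get(e.getKey()),
                                e.getValue(), totalDocs, vocabularySize);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

The cost is one score per class, each summing over the document's tokens,
which is where heuristics that skip low-value terms would save work.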

bq. the current code, given a word that appears many times in the document, 
will do many computations when instead we could really just work across the 
unique terms within the document.

another good point where we can improve, thanks :)

I managed to remove all the Maps from the code, I'll attach the patch shortly. 
I'll then work on removing the tokenizeDoc() loop.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-03 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447293#comment-13447293
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

bq. Nice! I've found that filtering for nouns & verbs makes another NLP task 
(latent semantic indexing) work much better. This will benefit from 
parts-of-speech filtering.

My former comment is only partially correct, as the Analyzer is currently used 
only on the unseen text rather than on the whole set of docs too. Using it (or 
other Analyzers) with the existing docs' text would make training slower, but 
it could be useful to improve accuracy. Maybe a subclass of the current one 
which is capable of doing that would be a nice addition.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-03 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447429#comment-13447429
 ] 

Lance Norskog commented on LUCENE-4345:
---

bq. would make training slower but it could be useful to improve accuracy
If you use index data which is already analyzed with the same analyzer as your 
test (unseen) documents, you can use a lot more documents as input. More is 
better. As the training data increases, signal drives out noise. Once you add 
the ability to store & load models, training speed becomes less important. 

Look at the Mahout project for ideas about text classifiers. The 
ConfusionMatrix class and the html page it prints are really handy for 
summarizing and probing the classifier's performance.
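
For readers unfamiliar with the idea, a minimal sketch of a confusion matrix in
plain Java follows. It is a hypothetical illustration written for this thread,
not Mahout's ConfusionMatrix class or its API: it only counts
(actual, predicted) pairs and derives overall accuracy from them.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConfusionMatrixSketch {

    // actual class -> (predicted class -> number of test docs)
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    /** Records one classified test document. */
    public void add(String actual, String predicted) {
        counts.computeIfAbsent(actual, k -> new TreeMap<>())
              .merge(predicted, 1, Integer::sum);
    }

    /** Number of docs of class 'actual' that were labelled 'predicted'. */
    public int count(String actual, String predicted) {
        Map<String, Integer> row = counts.get(actual);
        if (row == null) {
            return 0;
        }
        Integer n = row.get(predicted);
        return n == null ? 0 : n;
    }

    /** Fraction of docs on the diagonal (predicted == actual); 0.0 if empty. */
    public double accuracy() {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                total += cell.getValue();
                if (row.getKey().equals(cell.getKey())) {
                    correct += cell.getValue();
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```

The off-diagonal cells show which classes get mistaken for which, which is
what makes the matrix more useful than a single accuracy number.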





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-11 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452980#comment-13452980
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

Thanks Lance for your useful insights, I'll definitely have a look :) .

bq. If you use index data which is already analyzed with the same analyzer as 
your test (unseen) documents, you can use a lot more documents as input. More 
is better. As the training data increases, signal drives out noise.

I agree, we could leverage this for sure.

bq. Once you add the ability to store & load models, training speed becomes 
less important.

Regarding storing and loading models, the base intuition (at least my intuition 
:P) in the case of Lucene is that the index itself plays that role.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-11 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452987#comment-13452987
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

by the way, if no one objects I plan to commit this shortly so that we can 
improve things directly by patching the trunk.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453208#comment-13453208
 ] 

Robert Muir commented on LUCENE-4345:
-

Can we remove the ClassificationException? It only seems to box IOException... 
we can just throw IOException directly instead?




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-11 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453694#comment-13453694
 ] 

Lance Norskog commented on LUCENE-4345:
---

What is the scale that you expect this Bayesian classifier to handle? How many 
training documents does it need?




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-12 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453853#comment-13453853
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

bq. Can we remove the ClassificationException? It only seems to box 
IOException... we can just throw IOException directly instead?

sure, we can keep IOException for now

bq. What is the scale that you expect this Bayesian classifier to handle? How 
many training documents does it need?

I'm doing some benchmarking these days, so I should be able to say something 
about this shortly.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-12 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453899#comment-13453899
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

Side note: it seems a bit old, but I just realized something similar was done 
in LUCENE-1039; maybe both implementations could be added in the future.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454729#comment-13454729
 ] 

Simon Willnauer commented on LUCENE-4345:
-

hey tommaso, 

I just briefly skimmed through your latest patch and I have a bunch of comments:

* I agree with robert you should build a small inverted index instead of 
retokenizing. I'd use a BytesRefHash with a parallel array as we use during 
indexing, if you have trouble with this I am happy to update your patch and 
give you an example.
* I suggest moving the termsEnum.next() call into the while() condition, like 
while ((next = termsEnum.next()) != null), for consistency (in assignClass)
* Can you use BytesRef for fieldNames to save the conversion every time?
* Instead of specifying the document as a String you should rather use 
IndexableField and in turn pull the tokenstream from 
IndexableField#tokenStream(Analyzer)
* I didn't see a reason why you use Double instead of double (primitive) as 
return values, I think the boxing is unnecessary
* in assignClass can't you reuse the BytesRef returned from the termsEnum for 
further processing instead of converting it to a string?
* in getWordFreqForClass you might want to use TotalHitCountCollector since you 
are only interested in the number of hits. That collector will not score or 
collect any documents at all and is way less complex than the default 
TopDocsCollector
* I have trouble understanding why the interface expects an atomic reader here. 
From my perspective you should handle the per-segment aspect internally and 
instead just use IndexReader in the interface.
* The interface you defined has some problems with respect to Multi-Threading 
IMO. The interface itself suggests that this class is stateful and you have to 
call methods in a certain order, and at the same time you need to make sure 
that it is not published for read access before training is done. I think it 
would be wise to pass in all needed objects as constructor arguments and make 
the references final so it can be shared across threads, and add an interface 
that represents the trained model computed offline. In this case it doesn't 
really matter, but in the future it might make sense. We can also skip the 
model interface entirely and remove the training method until we have some 
impls that really need to be trained.
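
The last point (constructor arguments plus final references) could look 
roughly like the sketch below. It is plain Java with no Lucene types, and 
every name in it (TrainedModel, OverlapClassifier, the fallback class) is 
hypothetical and not taken from the patch; the term-overlap scoring is only a 
placeholder for a real classifier.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for "the trained model computed offline":
// built once, never modified afterwards.
final class TrainedModel {
    private final Map<String, Set<String>> termsPerClass;

    TrainedModel(Map<String, Set<String>> termsPerClass) {
        // Deep defensive copy, so later changes by the caller cannot leak in.
        Map<String, Set<String>> copy = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : termsPerClass.entrySet()) {
            copy.put(e.getKey(), Collections.unmodifiableSet(new HashSet<>(e.getValue())));
        }
        this.termsPerClass = Collections.unmodifiableMap(copy);
    }

    Map<String, Set<String>> termsPerClass() {
        return termsPerClass;
    }
}

// Hypothetical classifier: everything it needs arrives through the
// constructor and is held in final fields, so one instance can be shared
// across threads without any call-order requirements.
final class OverlapClassifier {
    private final TrainedModel model;
    private final String fallbackClass;

    OverlapClassifier(TrainedModel model, String fallbackClass) {
        this.model = model;
        this.fallbackClass = fallbackClass;
    }

    // Picks the class whose term set overlaps most with the text.
    // Only reads immutable state.
    String assignClass(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        String best = fallbackClass;
        int bestOverlap = 0;
        for (Map.Entry<String, Set<String>> e : model.termsPerClass().entrySet()) {
            int overlap = 0;
            for (String token : tokens) {
                if (e.getValue().contains(token)) {
                    overlap++;
                }
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With this shape there is no "train before classify" ordering to get wrong: an 
instance cannot exist without its model.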






[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-13 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454802#comment-13454802
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

bq. I agree with robert you should build a small inverted index instead of 
retokenizing. I'd use a BytesRefHash with a parallel array as we use during 
indexing, if you have trouble with this I am happy to update your patch and 
give you an example.

+1. It's not trouble so much as a lack of time in the last few days to work on 
it; however, if you have time, that would surely be nice (I've committed this 
to trunk at r1384219, so it should be easier to update it).
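
For reference, the "hash plus parallel array" idea can be sketched in plain 
Java as below. This is not BytesRefHash itself: a HashMap stands in for the 
hash, Strings stand in for BytesRef, and the class name TermCounts is 
hypothetical. The point is only that the text is tokenized once and each 
unique term gets an id that indexes into a parallel counts array.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical mini "inverted index" for a single document:
// term -> id in a hash, id -> frequency in a parallel int array.
final class TermCounts {
    private final Map<String, Integer> termIds = new HashMap<>();
    private int[] counts = new int[8]; // parallel array, indexed by term id

    // Records one occurrence of the term and returns its id.
    int add(String term) {
        Integer id = termIds.get(term);
        if (id == null) {
            id = termIds.size();
            termIds.put(term, id);
            if (id == counts.length) {
                // Grow the parallel array when the id space outgrows it.
                counts = Arrays.copyOf(counts, counts.length * 2);
            }
        }
        counts[id]++;
        return id;
    }

    int count(String term) {
        Integer id = termIds.get(term);
        return id == null ? 0 : counts[id];
    }

    int uniqueTerms() {
        return termIds.size();
    }

    // Tokenizes the text once (whitespace split, as a placeholder for an Analyzer).
    static TermCounts of(String text) {
        TermCounts tc = new TermCounts();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tc.add(token);
            }
        }
        return tc;
    }
}
```

After TermCounts.of(text), per-term frequencies are available by lookup, so 
the text does not need to be tokenized again for each class.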

bq. I suggest moving the termsEnum.next() call into the while() condition, like 
while ((next = termsEnum.next()) != null), for consistency (in assignClass)

sure

bq. Can you use BytesRef for fieldNames to save the conversion every time?

Actually this depends on who is calling the Classifier: if, for example, it's 
Solr, then it makes sense to keep Strings, but since this lives in Lucene it 
may make sense to use BytesRefs. However, from an API point of view I'm not 
sure which would be better.

bq. Instead of specifying the document as a String you should rather use 
IndexableField and in turn pull the tokenstream from 
IndexableField#tokenStream(Analyzer)

I think this is not always OK, as often the document to be classified is not 
supposed to be in the index immediately but _may_ get indexed right after the 
classification; however, we could provide that as an additional method.

bq. I didn't see a reason why you use Double instead of double (primitive) as 
return values, I think the boxing is unnecessary

yes, I agree.

bq. in assignClass can't you reuse the BytesRef returned from the termsEnum for 
further processing instead of converting it to a string?

Actually, after reading your comment above I realized converting to a String 
is not a good idea, so I'll change the methods (#calculateLikelihood and 
#calculatePrior) to use BytesRef rather than String.
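
As a reminder of what a prior and a likelihood compute in a Naive Bayes 
classifier, here is a minimal self-contained sketch. It is not the code from 
the patch: it counts terms in memory instead of querying an index, uses add-one 
smoothing as an assumed choice, and all names (TinyNaiveBayes, logPrior, 
logLikelihood) are hypothetical. It returns primitive doubles and works in log 
space to avoid underflow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory multinomial Naive Bayes, for illustration only.
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCountsPerClass = new HashMap<>();
    private final Map<String, Integer> totalTermsPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    void train(String className, String text) {
        docsPerClass.merge(className, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            termCountsPerClass.computeIfAbsent(className, k -> new HashMap<>());
        for (String term : tokenize(text)) {
            counts.merge(term, 1, Integer::sum);
            totalTermsPerClass.merge(className, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    // log P(class): share of training documents labeled with this class.
    double logPrior(String className) {
        return Math.log((double) docsPerClass.get(className) / totalDocs);
    }

    // log P(term | class) with add-one (Laplace) smoothing.
    double logLikelihood(String term, String className) {
        int count = termCountsPerClass.get(className).getOrDefault(term, 0);
        int total = totalTermsPerClass.getOrDefault(className, 0);
        return Math.log((count + 1.0) / (total + vocabulary.size()));
    }

    // Returns the class maximizing log prior + sum of log likelihoods.
    String assignClass(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String className : docsPerClass.keySet()) {
            double score = logPrior(className);
            for (String term : tokenize(text)) {
                score += logLikelihood(term, className);
            }
            if (score > bestScore) {
                bestScore = score;
                best = className;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }
}
```

In an index-backed version the counts would come from the index rather than 
from these maps, which is where working on term bytes instead of Strings 
would matter.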

bq. in getWordFreqForClass you might want to use TotalHitCountCollector since 
you are only interested in the number of hits. That collector will not score or 
collect any documents at all and is way less complex than the default 
TopDocsCollector

very good point, thanks :)

bq. I have trouble understanding why the interface expects an atomic reader 
here. From my perspective you should handle the per-segment aspect internally 
and instead just use IndexReader in the interface.

as a first implementation I thought it made sense to keep the complexity of 
explicitly doing distributed probabilities calculations out, also AtomicReaders 
expose more internals that can be leveraged in a classification algorithm.

bq. The interface you defined has some problems with respect to Multi-Threading 
IMO. The interface itself suggests that this class is stateful and you have to 
call methods in a certain order, and at the same time you need to make sure 
that it is not published for read access before training is done.

it'd raise an exception if #assignClass() is called before #train()

bq. I think it would be wise to pass in all needed objects as constructor 
arguments and make the references final so it can be shared across threads and 
add an interface that represents the trained model computed offline? In this 
case it doesn't really matter but in the future it might make sense. We can 
also skip the model interface entirely and remove the training method until we 
have some impls that really need to be trained.

I'm +1 for making the references final. I added the #train() method so that 
a Classifier could be trained multiple times. In this implementation that 
doesn't make much difference, but it may not be the case for other 
implementations.
Therefore we could (maybe should) mark this API _@experimental_ and just evolve 
it from the different implementations we have, so eventually moving parameters 
to the constructor may be a nice idea here.
On the other hand, removing the #train() method from the API would remove any 
reference to Lucene APIs from the Classifier interface, raising the question of 
whether that's too generic.


[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-14 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456053#comment-13456053
 ] 

Lance Norskog commented on LUCENE-4345:
---

I recently did some related research in text analysis and found that limiting 
terms to nouns and verbs gave a 10-15% increase in all variations of the test.

So, filtering terms based on Parts-of-Speech annotation will be very helpful. 
My OpenNLP patch includes a FilterPayloadsFilter, which keeps or strips out 
tokens based on a list of text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456415#comment-13456415
 ] 

Robert Muir commented on LUCENE-4345:
-

I don't think this should be using payloads to pull POS tags: payloads are 
for when you need something stored in the actual index (and should be limited 
to e.g. a single byte), and they are not type-safe but application-specific.

Instead, such taggers should expose a type-safe PartOfSpeechAttribute, as 
suggested in the o.a.l.analysis package javadocs. If they want to put POS into 
the index for e.g. payload-based queries, that's a separate concern: they 
should have a separate tokenfilter that encodes the POS attribute into the 
payload, so this is optional (as it has tradeoffs in the index). See 
TypeAsPayloadFilter etc. as an example of what I mean. But for this module we 
don't need anything in the index.

If we think it's useful for classifiers to limit the analysis to certain POS 
categories, then instead we should factor out a *minimal* POSAttribute 
sub-interface with something very generic like isNominal()/isVerbal() that can 
actually be implemented by different taggers with different tag sets across 
different languages.

Then things like kuromoji's POSAttribute, openNLP's POSAttribute, or even your 
custom home-grown one, or some commercial one, could extend this sub-interface 
and plug into it.

At least I think this is possible with our attributes API :)
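
A rough plain-Java sketch of the shape of such a minimal sub-interface 
follows. It deliberately does not use the Lucene Attribute API, and all names 
(CoarsePartOfSpeech, PennTag, UniversalStyleTag, PosFilters) are hypothetical; 
the two tag sets are just examples of how different taggers could map onto the 
same two methods.

```java
// Hypothetical minimal view of a POS tag, independent of any tag set.
interface CoarsePartOfSpeech {
    boolean isNominal();
    boolean isVerbal();
}

// Hypothetical adapter for Penn Treebank-style tags (NN, NNS, VB, VBD, ...).
final class PennTag implements CoarsePartOfSpeech {
    private final String tag;

    PennTag(String tag) {
        this.tag = tag;
    }

    @Override
    public boolean isNominal() {
        return tag.startsWith("NN");
    }

    @Override
    public boolean isVerbal() {
        return tag.startsWith("VB");
    }
}

// Hypothetical adapter for a tag set that spells categories out (NOUN, VERB, ...).
final class UniversalStyleTag implements CoarsePartOfSpeech {
    private final String tag;

    UniversalStyleTag(String tag) {
        this.tag = tag;
    }

    @Override
    public boolean isNominal() {
        return tag.equals("NOUN") || tag.equals("PROPN");
    }

    @Override
    public boolean isVerbal() {
        return tag.equals("VERB");
    }
}

// A classifier-side check only needs the generic interface, not the tag set.
final class PosFilters {
    private PosFilters() {}

    static boolean keepForClassification(CoarsePartOfSpeech pos) {
        return pos.isNominal() || pos.isVerbal();
    }
}
```

The classifier code would depend only on the two generic methods, so swapping 
taggers would not touch it.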





[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456418#comment-13456418
 ] 

Robert Muir commented on LUCENE-4345:
-

Another, simpler idea: you just handle this yourself in the Analyzer you pass 
to the thing.

This is currently how Kuromoji works: it has a POS-based stopfilter. These are 
trivial to write.
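
To illustrate how little logic such a stop filter needs, here is a plain-Java 
sketch that operates on a list of tagged tokens. It is not a Lucene TokenFilter 
and not Kuromoji's implementation; TaggedToken and PosStopFilter are 
hypothetical names, and the tags in any example usage are assumed to come from 
whatever tagger runs earlier in the chain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical token carrying its term text and a POS tag.
final class TaggedToken {
    final String term;
    final String pos;

    TaggedToken(String term, String pos) {
        this.term = term;
        this.pos = pos;
    }
}

// Hypothetical POS-based stop filter: drops every token whose tag is in the stop set.
final class PosStopFilter {
    private final Set<String> stopTags;

    PosStopFilter(Set<String> stopTags) {
        // Copy so the filter's behavior cannot change after construction.
        this.stopTags = Collections.unmodifiableSet(new HashSet<>(stopTags));
    }

    // Returns the terms of all tokens that survive the filter, in order.
    List<String> filter(List<TaggedToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (!stopTags.contains(token.pos)) {
                kept.add(token.term);
            }
        }
        return kept;
    }
}
```

Because the filtering happens in the analysis chain, the classifier itself 
would not need to know anything about POS tags.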




[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-16 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456734#comment-13456734
 ] 

Lance Norskog commented on LUCENE-4345:
---

bq. I don't think this should be using payloads to pull POS tags: the purpose 
of payloads is when you need something stored in the actual index (and should 
be limited to e.g. a single byte), it's not type-safe but application-specific.

Yes, some NLP applications want actual payloads. For entity resolution you can 
have a UI add little icons for person, place, etc. In the OpenNLP patch it just 
seemed silly to add another Attribute type.

bq. If we think it's useful for classifiers to limit the analysis to certain 
POS categories, then instead we should factor out a minimal POSAttribute 
sub-interface with something very generic like isNominal()/isVerbal() that can 
actually be implemented by different taggers with different tag sets across 
different languages.

There is a generic subset with mapping lists for most common tagsets for 
different languages. They map these tags down to 12 POS tags. Adding this 
mapper to the OpenNLP patch is on my large TODO list. They even have a mapping 
set for the Twitter Parts-of-Speech tagger.

bq. This is currently how Kuromoji works, it has a POS-based stopfilter. these 
are trivial to write.

I also added a filter to remove payloads. If you use a different Attribute for 
the analysis chain, then you need a 'change POSAttribute to PayloadAttribute' 
at the bottom of the analysis chain.
Yes, I added one also. Some of the Kuromoji Attributes should be pulled up into 
the generic set.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2013-02-16 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580103#comment-13580103
 ] 

Steve Rowe commented on LUCENE-4345:


Tommaso, is there any reason this can't be backported to branch_4x?

> Create a Classification module
> --
>
> Key: LUCENE-4345
> URL: https://issues.apache.org/jira/browse/LUCENE-4345
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields so that these can be used as training examples (w/ features) in order 
> to very quickly create classifiers algorithms to use on new documents and / 
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.




[jira] [Commented] (LUCENE-4345) Create a Classification module

2013-02-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580133#comment-13580133
 ] 

Tommaso Teofili commented on LUCENE-4345:
-

Hi Steve. While it was not the case when this was started, surely it can be 
backported now. I'm just not sure it can be safely merged back (w/ svn merge), 
so maybe I'll just create a patch for branch_4x in a separate issue from the 
trunk version and commit that.
