[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456734#comment-13456734
 ] 

Lance Norskog commented on LUCENE-4345:
---------------------------------------

bq. I don't think this should be using payloads to pull POS tags: the purpose 
of payloads is when you need something stored in the actual index (and should 
be limited to e.g. a single byte), it's not type-safe but application-specific.

Yes, some NLP applications do want actual payloads. For entity resolution, for 
example, a UI can add little icons for person, place, etc. In the OpenNLP patch 
it just seemed silly to add another Attribute type.

bq. If we think it's useful for classifiers to limit the analysis to certain 
POS categories, then instead we should factor out a minimal POSAttribute 
sub-interface with something very generic like isNominal()/isVerbal() that can 
actually be implemented by different taggers with different tag sets across 
different languages.

There is a generic subset, with mapping lists for the most common tagsets in 
different languages. These lists map the tags down to 12 POS tags. Adding this 
mapper to the OpenNLP patch is on my large TODO list. There is even a mapping 
set for the Twitter Parts-of-Speech tagger.
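
To make the mapping idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: the names CoarsePos and CoarsePosMapper are hypothetical, the 12 coarse tags are one plausible choice for such a generic set, and the Penn Treebank entries are deliberately partial. It is not the mapper from the OpenNLP patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical 12-tag coarse set (an assumption, not taken from any patch). */
enum CoarsePos { NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, PUNCT, X }

/** Hypothetical mapper from a few Penn Treebank tags down to the coarse set. */
final class CoarsePosMapper {
  private static final Map<String, CoarsePos> PENN = new HashMap<>();

  private static void put(CoarsePos coarse, String... fineTags) {
    for (String tag : fineTags) {
      PENN.put(tag, coarse);
    }
  }

  static {
    // Partial mapping list; a real one would cover the whole tagset.
    put(CoarsePos.NOUN, "NN", "NNS", "NNP", "NNPS");
    put(CoarsePos.VERB, "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD");
    put(CoarsePos.ADJ, "JJ", "JJR", "JJS");
    put(CoarsePos.ADV, "RB", "RBR", "RBS");
    put(CoarsePos.PRON, "PRP", "PRP$");
    put(CoarsePos.DET, "DT");
    put(CoarsePos.ADP, "IN");
    put(CoarsePos.NUM, "CD");
    put(CoarsePos.CONJ, "CC");
  }

  /** Tags missing from the mapping list fall back to X. */
  static CoarsePos toCoarse(String fineTag) {
    return PENN.getOrDefault(fineTag, CoarsePos.X);
  }

  /** Generic checks in the spirit of isNominal()/isVerbal(). */
  static boolean isNominal(String fineTag) {
    return toCoarse(fineTag) == CoarsePos.NOUN;
  }

  static boolean isVerbal(String fineTag) {
    return toCoarse(fineTag) == CoarsePos.VERB;
  }
}
```

A tagger for another language would only need its own mapping list; callers that check isNominal()/isVerbal() would stay unchanged.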

bq. This is currently how Kuromoji works: it has a POS-based stopfilter. These 
are trivial to write. I also added a filter to remove payloads. If you use a 
different Attribute for the analysis chain, then you need a 'change 
POSAttribute to PayloadAttribute' filter at the bottom of the analysis chain.

Yes, I added one also. Some of the Kuromoji Attributes should be pulled up into 
the generic set.
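
The two steps discussed here, a POS-based stop filter and a final 'POS to payload' conversion, can be sketched roughly as follows. This is plain Java that only models the idea: TaggedToken and PosChain are hypothetical names, not Lucene classes, and the one-byte encoding (position in a caller-supplied tag list plus one, with 0 meaning unknown) is an assumption chosen to respect the "single byte" limit mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a token carrying a POS tag and an optional payload. */
final class TaggedToken {
  final String term;
  final String pos;
  byte[] payload; // stays null until posToPayload() runs

  TaggedToken(String term, String pos) {
    this.term = term;
    this.pos = pos;
  }
}

/** Hypothetical helpers modelling the two filters at the list level. */
final class PosChain {
  /** POS-based stop filter: keeps only tokens whose tag is not in stopTags. */
  static List<TaggedToken> dropByPos(List<TaggedToken> tokens, Set<String> stopTags) {
    List<TaggedToken> kept = new ArrayList<>();
    for (TaggedToken token : tokens) {
      if (!stopTags.contains(token.pos)) {
        kept.add(token);
      }
    }
    return kept;
  }

  /**
   * Bottom-of-chain step: copies each POS tag into a one-byte payload.
   * The byte is the tag's position in tagOrder plus one; 0 marks an unknown tag.
   */
  static void posToPayload(List<TaggedToken> tokens, List<String> tagOrder) {
    if (tagOrder.size() > 127) {
      throw new IllegalArgumentException("too many tags to encode in one signed byte");
    }
    for (TaggedToken token : tokens) {
      int index = tagOrder.indexOf(token.pos); // -1 when the tag is not listed
      token.payload = new byte[] {(byte) (index + 1)};
    }
  }
}
```

In a real analysis chain each step would be a TokenFilter working on attributes rather than on lists, but the order is the same: filter on the POS information first, then convert it to a payload last.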
                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields, so these can be used as training examples (w/ features) to very 
> quickly create classifier algorithms to use on new documents and/or to 
> provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene-based Naive Bayes 
> classifier, but more implementations should be added in the future.
