[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

Lance Norskog (JIRA) Sun, 01 Jul 2012 23:11:04 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lance Norskog updated LUCENE-2899:
----------------------------------

    Attachment: LUCENE-2899.patch

This is nearly finished. The Tokenizer and TokenFilters have been moved into 
lucene/analysis/opennlp. They have no unit tests under lucene/ because 
supplying model data there is difficult. Instead, they are unit-tested through 
the factories in solr/contrib/opennlp.

The solr/example/opennlp directory is gone, as per request. Possible field 
types are documented in the solrconfig.xml in the unit test resources.

All jars are downloaded via ivy. The jwnl library is one revision newer than 
the version this code was compiled against. It is used only for collocation, 
which is not exposed in this release.

To build, test and commit, there is a bootstrap sequence. In the top-level 
directory:
{code}
  ant clean compile
{code}
This downloads the OpenNLP jars.
{code}
cd solr/contrib/opennlp/test-files/training
sh bin/training.sh
{code}
This creates low-quality model files in 
{{solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp}}. 
In the trunk/solr directory, run
{code} 
ant example test-contrib
{code}
You now have committable binary models. They are small and exist only to run 
the OpenNLP unit tests. The results they generate are objectively bogus, but 
the unit tests are written to match those results. If you want real models, you 
have to download them from SourceForge.
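
For convenience, the three steps above could be chained in one shell function. This is only a sketch, not part of the patch: the function name bootstrap_opennlp, the run helper and the DRY_RUN switch are made up for illustration, and it assumes the current directory is the top-level checkout (so that "trunk/solr" is ./solr). The paths are copied as written in the steps above.

```shell
#!/bin/sh
# Sketch of the bootstrap sequence; names below are hypothetical helpers,
# not something the patch provides.
set -e

run() {
  # Print the command, then execute it unless DRY_RUN is set to 1.
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

bootstrap_opennlp() {
  # Assumption: we start in the top-level directory of the checkout.
  top=$(pwd)

  # Step 1: compile; ivy downloads the OpenNLP jars along the way.
  run ant clean compile

  # Step 2: train the small, low-quality models used by the unit tests.
  run cd "$top/solr/contrib/opennlp/test-files/training"
  run sh bin/training.sh

  # Step 3: build the example and run the contrib tests from the solr directory.
  run cd "$top/solr"
  run ant example test-contrib

  # Return to where we started.
  run cd "$top"
}
```

Calling bootstrap_opennlp from the top-level directory runs the sequence and stops at the first failing step. Setting DRY_RUN=1 beforehand only prints the commands, which is a cheap way to check the paths before a long build.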
                
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp
