[
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lance Norskog updated LUCENE-2899:
----------------------------------
Attachment: LUCENE-2899.patch
This is about finished. The Tokenizer and TokenFilters have moved into
lucene/analysis/opennlp. They have no unit tests under lucene/ because it is
difficult to supply model data there. Instead, they are unit-tested through the
factories in solr/contrib/opennlp.
The solr/example/opennlp directory is gone, as per request. Possible field
types are documented in the solrconfig.xml in the unit test resources.
All jars are downloaded via Ivy. The jwnl library is one revision newer than
the version this patch was compiled against. It is only used in collocation,
which is not exposed in this release.
To build, test and commit, there is a bootstrap sequence. In the top-level
directory:
{code}
ant clean compile
{code}
This downloads the OpenNLP jars. Next:
{code}
cd solr/contrib/opennlp/test-files/training
sh bin/training.sh
{code}
This creates low-quality model files in
{{solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp}}.
Then, in the trunk/solr directory, run:
{code}
ant example test-contrib
{code}
You now have committable binary models. They are small, and exist only to run
the OpenNLP unit tests. The results they generate are objectively bogus, but
the unit tests are written to match those results. If you want real models, you
have to download them from SourceForge.
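For convenience, the bootstrap sequence above could be collected into one
script. This is only a sketch and is not part of the patch: the names
run_in, bootstrap_opennlp and DRY_RUN are made up for illustration, and the
directories are copied as written in the steps above. With DRY_RUN=1 it only
prints each step and the directory it would run in.

```shell
#!/bin/sh
# Sketch of the bootstrap sequence described in this comment.
# run_in, bootstrap_opennlp and DRY_RUN are hypothetical names, not part of the patch.

# Print a command and the directory it runs in; execute it unless DRY_RUN=1.
run_in() {
  dir=$1
  shift
  echo "[$dir] $*"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  # Subshell so the caller's working directory is left unchanged.
  ( cd "$dir" && "$@" )
}

# $1 is the top-level checkout directory (defaults to the current directory).
bootstrap_opennlp() {
  top=${1:-.}
  # 1. Compile; this is the step that downloads the OpenNLP jars.
  run_in "$top" ant clean compile || return 1
  # 2. Train the small, low-quality test models.
  run_in "$top/solr/contrib/opennlp/test-files/training" sh bin/training.sh || return 1
  # 3. Build the example and run the contrib unit tests against those models.
  run_in "$top/solr" ant example test-contrib || return 1
}
```

After sourcing the file, "DRY_RUN=1; bootstrap_opennlp ." lists the three steps
without running them, and "DRY_RUN=0; bootstrap_opennlp ." runs them, stopping
at the first failure.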
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch,
> opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice
> to have a submodule (under analysis) that exposed capabilities for it. Drew
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp