On 04/15/2013 02:31 AM, Richard Head Jr. wrote:
I have a bunch of sentences like the following:

Guacamole Dip: 5 Hass Avocados, Jalapeno Puree with Salt and BHT (preservative).

They are standalone, i.e., they are not contained within a larger 
paragraph/document structure.

I want to tag various words, creating the following:

Guacamole Dip: 5 Hass <START:term>Avocados<END>, <START:term>Jalapeno<END> Puree with 
<START:term>Salt<END> and <START:term>BHT<END> (preservative).

Looking through the mailing list for guidance, I came across this:

http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EE7E.2080608%40gmail.com%3E

This made me think that, before going through 100 or so documents and tagging 
the words to create training data, I should get some clarification on the 
following:

1. Is NER the right tool for this?
2. My training data is somewhat small (~100 sentences). Will this stymie my goal 
above?
3. Were the poor results the gentleman had with Italian addresses in part due to 
a bug mentioned here:
http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EF10.2020904%40gmail.com%3E
4. Is it possible to use a text file containing only terms, or a tab delimited 
file like the ones the Stanford NER uses?


Yes, the NER should be able to detect these terms, but you could also try using a dictionary.

Your training data is too small, especially if you train the maxent model with a cutoff of 5; the perceptron will work better. Label more data until you have a few thousand sentences.

The bug you mention was fixed in 1.5.3, but it only occurred in multi-type models. You need complete sentences to train the NER model; training on the terms alone does not work. And no, we do not support the Stanford format.
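
Since a bare term list cannot be used for training directly, one possible way to put it to work is to pre-annotate complete sentences with it and then correct the output by hand. The following is only a minimal sketch in plain Java, not part of OpenNLP: the class name TermTagger, the naive regex tokenizer, and the single-token, case-insensitive lookup are all assumptions made for illustration. It writes the tags with surrounding spaces, since the name finder training format expects whitespace-separated tokens as far as I know.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class: pre-annotates sentences from a term list.
public class TermTagger {

    // Naive tokenizer (assumption): a run of letters/digits, or one other non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Wraps every token found in the (lower-cased) term set in <START:term> ... <END>.
    // Only single-token terms are handled in this sketch.
    static String tag(String sentence, Set<String> lowerCaseTerms) {
        StringBuilder out = new StringBuilder();
        for (String token : tokenize(sentence)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (lowerCaseTerms.contains(token.toLowerCase(Locale.ROOT))) {
                out.append("<START:term> ").append(token).append(" <END>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms =
                new HashSet<>(Arrays.asList("avocados", "jalapeno", "salt", "bht"));
        String sentence = "Guacamole Dip: 5 Hass Avocados, Jalapeno Puree "
                + "with Salt and BHT (preservative).";
        // One annotated sentence per line, ready for manual review.
        System.out.println(tag(sentence, terms));
    }
}
```

For the example sentence this prints:

Guacamole Dip : 5 Hass <START:term> Avocados <END> , <START:term> Jalapeno <END> Puree with <START:term> Salt <END> and <START:term> BHT <END> ( preservative ) .

A lookup like this marks every occurrence of a listed word regardless of context, so the result still needs to be checked before it is used as training data.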

Jörn
