Hi Damiano,
Why are you so sure that your model will not work? A couple of things to
remember: 1. You need quite a bit of training data; two sentences do not
make a training set. 2. You probably need more than a window of words as your
features. However, you can see that word-2=“name" and word-1=“is” tend to
precede a name. Look into other potential features, get a larger dataset,
and your results may surprise you.
Daniel
On May 1, 2016, at 3:13 PM, Jeffrey Zemerick <[email protected]> wrote:
I'm sure the others on this list can give you a more complete answer, so I
will try not to lead you astray.
The WindowFeatureGenerator is only one of the available feature generators.
There are many classes that implement the AdaptiveFeatureGenerator
interface [1] and you can, of course, provide your own implementation of
that interface to support additional features. For example, the
SentenceFeatureGenerator [2] looks at the beginning and end of each
training sentence. So to answer your question, the length of the training
sentence should not matter - what matters is whether the combination of
configured feature generators can produce a model that accurately
describes the training text.
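To make the interface idea concrete, here is a rough standalone sketch of a feature generator in the spirit of SentenceFeatureGenerator. This is not OpenNLP code - the real interface method is AdaptiveFeatureGenerator's createFeatures, and the class and feature names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of what a feature generator does conceptually.
// The real OpenNLP contract is AdaptiveFeatureGenerator#createFeatures;
// the feature strings here are illustrative only.
public class SentencePositionFeatures {

    // Adds a feature when the token sits at the start or end of the
    // sentence, similar in spirit to SentenceFeatureGenerator.
    public static List<String> createFeatures(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        if (index == 0) {
            features.add("S=begin");
        }
        if (index == tokens.length - 1) {
            features.add("S=end");
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"My", "name", "is", "Barack"};
        System.out.println(createFeatures(tokens, 0)); // prints [S=begin]
        System.out.println(createFeatures(tokens, 3)); // prints [S=end]
        System.out.println(createFeatures(tokens, 1)); // prints []
    }
}
```

A custom generator like this is combined with the others, so each token's feature set is the union of what all configured generators emit.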
Jeff
[1]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
[2]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/SentenceFeatureGenerator.html
On Sun, May 1, 2016 at 12:02 PM, Damiano Porta <[email protected]>
wrote:
Hi Jeff!
Thank you so much for your fast reply.
I have a doubt. Let's suppose we use this feature with a window of:
2 tokens on the left + *ENTITY* + 2 tokens on the right
My doubt is: how can I train the model correctly?
If only the previous 2 tokens and the next 2 tokens matter, I should not
use long sentences to train the model. Right?
For example (person-model.train):
1. I am <START:person> Barack <END> and I am the president of USA
2. My name is <START:person> Barack <END> and my surname is Obama
...
Those are two trivial training samples, just to illustrate my doubt.
In this case I should have:
*I am Barack and I*
*name is Barack and my*
The other tokens (left and right) do not matter. So the sentences in my
training set could be very short, right? Basically, I would only need to
define all the "combinations" of the previous/next 2 tokens, right?
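For illustration, which tokens survive a ±2 window around the entity can be checked mechanically. A minimal standalone sketch (not OpenNLP code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: which tokens fall inside a +/-2 window around an entity token.
public class WindowDemo {

    // Returns the tokens from `left` before to `right` after position
    // `index`, including the token at `index`, clipped at the sentence
    // boundaries.
    public static List<String> window(String[] tokens, int index,
                                      int left, int right) {
        int start = Math.max(0, index - left);
        int end = Math.min(tokens.length, index + right + 1);
        return new ArrayList<>(Arrays.asList(tokens).subList(start, end));
    }

    public static void main(String[] args) {
        String[] sentence =
            "My name is Barack and my surname is Obama".split(" ");
        // "Barack" is at index 3; only 5 tokens survive the +/-2 window.
        System.out.println(window(sentence, 3, 2, 2));
        // prints [name, is, Barack, and, my]
    }
}
```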
Thank you!
Damiano
2016-05-01 16:07 GMT+02:00 Jeffrey Zemerick <[email protected]>:
I think you are looking for the WindowFeatureGenerator [1]. You can set
the
size of the window by specifying the number of previous tokens and number
of next tokens.
Jeff
[1]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
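To picture the effect of those two parameters, here is a rough standalone sketch of window features with separate previous/next sizes. The real WindowFeatureGenerator wraps another feature generator and prefixes its output by distance; the exact feature-string format below is made up for readability, not OpenNLP's actual output:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of window features with separate previous/next
// window sizes. Not OpenNLP code; feature names are invented.
public class WindowFeaturesSketch {

    public static List<String> createFeatures(String[] tokens, int index,
                                              int prevLength, int nextLength) {
        List<String> features = new ArrayList<>();
        // Previous tokens, prefixed by their distance to the left.
        for (int i = 1; i <= prevLength && index - i >= 0; i++) {
            features.add("p" + i + "=" + tokens[index - i]);
        }
        // The current token itself.
        features.add("w=" + tokens[index]);
        // Next tokens, prefixed by their distance to the right.
        for (int i = 1; i <= nextLength && index + i < tokens.length; i++) {
            features.add("n" + i + "=" + tokens[index + i]);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] sentence =
            "My name is Barack and my surname is Obama".split(" ");
        // Features for "Barack" with 2 previous and 2 next tokens.
        System.out.println(createFeatures(sentence, 3, 2, 2));
        // prints [p1=is, p2=name, w=Barack, n1=and, n2=my]
    }
}
```

Because the prefix encodes the distance, the model can learn, for example, that p2="name" and p1="is" are strong cues for a person name, as discussed later in the thread.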
On Sun, May 1, 2016 at 5:16 AM, Damiano Porta <[email protected]>
wrote:
Hello everybody,
How many surrounding tokens are taken into account to find an entity
using a maxent model?
Basically, a maxent model should detect an entity by looking at the
surrounding tokens, right?
I would like to understand:
1. Can I set the number of tokens on the left side?
2. Can I set the number of tokens on the right side too?
Thank you in advance for the clarification.
Best
Damiano