Hi all,
I'm looking into using Wikipedia as a source to train my own NameFinder.
The main idea is based on two assumptions:
1) Almost every Wikipedia article has a template which makes it easy to
classify it as a Person, a Place or some other kind of entity
2) Each Wikipedia article contains hyperlinks to other Wikipedia articles
Given these, it is possible to translate links into typed annotations to
train the Name Finder.
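To make the idea concrete, here is a rough sketch of the link-to-annotation translation I have in mind. The regex and the target-to-type lookup are placeholders, not a real Wikipedia parser; a real implementation would resolve the type from the article's template:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinkConverter {

    // Matches [[Target]] or [[Target|anchor text]]
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

    /**
     * Rewrites wiki links whose target is a known entity into the
     * <START:type> ... <END> format the NameFinder trainer expects.
     * Links to unknown targets are reduced to their anchor text.
     */
    static String toTrainingSample(String wikiText, Map<String, String> targetToType) {
        Matcher m = LINK.matcher(wikiText);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String target = m.group(1);
            String anchor = (m.group(2) != null) ? m.group(2) : target;
            String type = targetToType.get(target);
            String replacement = (type != null)
                ? "<START:" + type + "> " + anchor + " <END>"
                : anchor;
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        types.put("Rome", "location");
        types.put("Julius Caesar", "person");
        System.out.println(toTrainingSample(
            "[[Julius Caesar|Caesar]] marched on [[Rome]].", types));
        // <START:person> Caesar <END> marched on <START:location> Rome <END>.
    }
}
```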
I know that Olivier has already tried this approach, but I wanted to
work on my own implementation and I think this is the right place to
discuss it. There are some general questions and some more specific
ones regarding the Name Finder.
The general question regards the fact that Wikipedia isn't the "perfect"
training set, because not all the entities are linked / tagged. The good
thing is that as a dataset it is very large, which means a lot of tagged
examples and a lot of untagged ones. Do you think this is a serious problem?
What do you think about selecting as a training set a subset of pages with
high precision? I have some ideas about which strategy to implement:
* select only featured pages (which is somewhat of a guarantee that the
linking is done properly)
* select only pages regarding the Name Finder entity I'm trying to
train (e.g. only People pages for a People Name Finder)
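To make the second strategy concrete, the selection could be a simple filter over the parsed dump. The WikiPage fields here are hypothetical placeholders for whatever the dump parser actually provides:

```java
import java.util.ArrayList;
import java.util.List;

public class PageFilter {

    /** Minimal stand-in for a parsed Wikipedia page (hypothetical fields). */
    static class WikiPage {
        final String title;
        final boolean featured;     // e.g. derived from the featured-article template
        final String templateType;  // e.g. "person", "place", derived from the infobox

        WikiPage(String title, boolean featured, String templateType) {
            this.title = title;
            this.featured = featured;
            this.templateType = templateType;
        }
    }

    /** Keeps only featured pages of the entity type we are training for. */
    static List<WikiPage> select(List<WikiPage> pages, String entityType) {
        List<WikiPage> selected = new ArrayList<>();
        for (WikiPage p : pages) {
            if (p.featured && entityType.equals(p.templateType)) {
                selected.add(p);
            }
        }
        return selected;
    }
}
```

Both criteria could of course be applied independently; I wrote them combined here only for brevity.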
The specific questions regard the right tuning of the training parameters,
which I think is a frequent question. I hope this discussion may lead to
the creation of new material to improve the documentation; I warn you I
won't be brief. For this I'm starting from some hints given by Jörn:
On 19/01/2012 14:16, Jörn Kottmann wrote:
When I am doing training I always take our defaults as a baseline and
then modify the parameters to see how it changes the performance. When
you are working with a training set which grows over time I suggest to
once in a while start again from the defaults and verify that the
modifications are still giving an improvement.
A few hints:
- Using more iterations on the maxent model helps especially when your
data set is small,
e.g. try 300 to 500 instead of 100.
My dataset is huge, but I will also test adding more iterations.
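If I read the API correctly, raising the iterations should just be a matter of something like this (an untested sketch against the 1.5-era NameFinderME API; sampleStream stands for my ObjectStream<NameSample> over the Wikipedia-derived samples):

```java
TrainingParameters params = TrainingParameters.defaultParams();
// try 300 instead of the default 100 iterations, as suggested
params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(300));

TokenNameFinderModel model = NameFinderME.train(
    "en", "person", sampleStream, params,
    (AdaptiveFeatureGenerator) null,          // null = default feature generation
    Collections.<String, Object>emptyMap());  // no extra resources
```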
- Depending on the domain and language, feature generation should be
adapted; try to use
our XML feature generation (for this use the trunk version, there was a
severe bug in 1.5.2).
For feature generation, I admit I have no idea how to use it. I'm
using the CachedFeatureGenerator exactly as instantiated in the
documentation. Can you help me by explaining the generators?
new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2):
this one means that the two previous and the two next tokens are used as
features to train the model; the window size probably depends on the
language and shouldn't be too big, to avoid losing generalization.
new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2):
this one is similar to the former, but it uses the class of the token
instead of the token itself. Let's say I can do POS tagging on my
sentences, which is a classification of their tokens. I think this may
be an interesting property for detecting named entities (e.g. a Place is
often introduced by a token with the POS tag IN). How can I exploit this
idea?
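For example, I imagine a custom generator along these lines (only a sketch, untested; the tokenToPos lookup is a placeholder for a real POS tagger, and I'm not sure this is the intended extension point):

```java
// Sketch: emit the POS tag of the current token as a feature.
// FeatureGeneratorAdapter provides empty implementations of the
// adaptive-data callbacks, so only createFeatures needs overriding.
public class PosTagFeatureGenerator extends FeatureGeneratorAdapter {

    private final Map<String, String> tokenToPos; // hypothetical lookup

    public PosTagFeatureGenerator(Map<String, String> tokenToPos) {
        this.tokenToPos = tokenToPos;
    }

    public void createFeatures(List<String> features, String[] tokens,
            int index, String[] previousOutcomes) {
        String pos = tokenToPos.get(tokens[index]);
        if (pos != null) {
            features.add("pos=" + pos);
        }
    }
}
```

Would something like this, added to the CachedFeatureGenerator array, be the right way to plug POS information in?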
new OutcomePriorFeatureGenerator(),
new PreviousMapFeatureGenerator(),
new BigramNameFeatureGenerator():
these FeatureGenerators aren't very clear to me and I would like to
understand them in more depth. I can only tell that they aren't used by
default.
new SentenceFeatureGenerator(true, false):
used to keep or skip the first and the last word of a sentence as
features (depending on the boolean parameters given as input). What is
the rationale for keeping the first word and skipping the last one? How
can I decide on this setting? What are the possible customizations of
this FeatureGenerator?
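For reference, this is the aggregate I'm currently using, copied from the documentation:

```java
AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[] {
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false) // first word yes, last word no
    });
```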
- Try the perceptron; it usually has a higher recall. Train it with a
cutoff of 0.
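If I understand the API correctly, that would be (untested sketch):

```java
TrainingParameters params = TrainingParameters.defaultParams();
params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
```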
- Use our built-in evaluation to test how a model performs; it can
output performance numbers
and print out misclassified samples.
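I guess this refers to TokenNameFinderEvaluator, so something like the following (untested sketch; I'm assuming the error listener from the cmdline package is the right hook for printing misclassified samples, and testSamples stands for an ObjectStream<NameSample> over held-out data):

```java
TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(
    new NameFinderME(model),
    new NameEvaluationErrorListener()); // prints misclassified samples
evaluator.evaluate(testSamples);
System.out.println(evaluator.getFMeasure());
```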
- Look carefully at misclassified samples, maybe there are patterns
which do not really work
with your model.
- Add training data which contains cases which should work but do not.
Hope this helps,
Jörn
Thank you for these hints, I will try each one carefully.
Regards,
Riccardo