Hi all,
I'm looking into using Wikipedia as a source to train my own NameFinder.
The main idea is based on two assumptions:
1) Almost every Wikipedia article has a template which makes it easy to
classify it as a Person, a Place or some other kind of entity
2) Each Wikipedia article contains hyperlinks to other Wikipedia articles
Given these, it is possible to translate links into typed annotations to
train the Name Finder.
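To make the idea concrete, here is a rough sketch of the link-to-annotation translation I have in mind. The regex and the target-to-type lookup are placeholders, not a real Wikipedia parser; a real implementation would resolve the type from the article's template:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinkConverter {

    // Matches [[Target]] or [[Target|anchor text]]
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

    /**
     * Rewrites wiki links whose target is a known entity into the
     * <START:type> ... <END> format the NameFinder trainer expects.
     * Links to unknown targets are reduced to their anchor text.
     */
    static String toTrainingSample(String wikiText, Map<String, String> targetToType) {
        Matcher m = LINK.matcher(wikiText);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String target = m.group(1);
            String anchor = (m.group(2) != null) ? m.group(2) : target;
            String type = targetToType.get(target);
            String replacement = (type != null)
                ? "<START:" + type + "> " + anchor + " <END>"
                : anchor;
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        types.put("Rome", "location");
        types.put("Julius Caesar", "person");
        System.out.println(toTrainingSample(
            "[[Julius Caesar|Caesar]] marched on [[Rome]].", types));
        // <START:person> Caesar <END> marched on <START:location> Rome <END>.
    }
}
```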
I know that Olivier has already tried this approach, but I wanted to
work on my own implementation and I think this is the right place to
discuss it. There are some general questions and some more specific
ones regarding the Name Finder.
The general question regards the fact that Wikipedia isn't the "perfect"
training set, because not all the entities are linked / tagged. The good
thing is that as a dataset it is very large, which means a lot of tagged
examples and a lot of untagged ones. Do you think this is a serious problem?
What do you think about selecting as a training set a subset of pages with
high precision? I have some ideas about which strategy to implement:
* select only featured pages (which is somewhat of a guarantee that the
linking is done properly)
* select only pages regarding the Name Finder entity I'm trying to
train (e.g. only People pages for a People Name Finder)
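To make the second strategy concrete, the selection could be a simple filter over the parsed dump. The WikiPage fields here are hypothetical placeholders for whatever the dump parser actually provides:

```java
import java.util.ArrayList;
import java.util.List;

public class PageFilter {

    /** Minimal stand-in for a parsed Wikipedia page (hypothetical fields). */
    static class WikiPage {
        final String title;
        final boolean featured;     // e.g. derived from the featured-article template
        final String templateType;  // e.g. "person", "place", derived from the infobox

        WikiPage(String title, boolean featured, String templateType) {
            this.title = title;
            this.featured = featured;
            this.templateType = templateType;
        }
    }

    /** Keeps only featured pages of the entity type we are training for. */
    static List<WikiPage> select(List<WikiPage> pages, String entityType) {
        List<WikiPage> selected = new ArrayList<>();
        for (WikiPage p : pages) {
            if (p.featured && entityType.equals(p.templateType)) {
                selected.add(p);
            }
        }
        return selected;
    }
}
```

Both criteria could of course be applied independently; I wrote them combined here only for brevity.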
The specific questions regard the right tuning of the training parameters,
which I think is a frequent question. I hope this discussion may lead to
the creation of new material to improve the documentation; I warn you I
won't be brief. For this I'm starting from some hints given by Jörn:
On 19/01/2012 14:16, Jörn Kottmann wrote:
When I am doing training I always take our defaults as a baseline and
then modify the parameters to see how it changes the performance. When
you are working with a training set which grows over time I suggest to
once in a while start again from the defaults and verify that the
modifications are still giving an improvement.
A few hints:
- Using more iterations on the maxent model helps especially when your
data set is small,
e.g. try 300 to 500 instead of 100.
My dataset is huge, but I will also test adding more iterations.
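If I read the API correctly, raising the iterations should just be a matter of something like this (an untested sketch against the 1.5-era NameFinderME API; sampleStream stands for my ObjectStream<NameSample> over the Wikipedia-derived samples):

```java
TrainingParameters params = TrainingParameters.defaultParams();
// try 300 instead of the default 100 iterations, as suggested
params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(300));

TokenNameFinderModel model = NameFinderME.train(
    "en", "person", sampleStream, params,
    (AdaptiveFeatureGenerator) null,          // null = default feature generation
    Collections.<String, Object>emptyMap());  // no extra resources
```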
- Depending on the domain and language, feature generation should be
adapted; try to use
our XML feature generation (for this use the trunk version, there was a
severe bug in 1.5.2).
For feature generation, I admit I have no idea how to use it. I'm
using the CachedFeatureGenerator exactly as instantiated in the
documentation. Can you help me by explaining the generators?
new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2):
this one means that the two previous and the two next tokens are used as
features to train the model; the window size probably depends on the
language and shouldn't be too big, to avoid losing generalization.
new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2):
this one is similar to the former, but it uses the class of the token
instead of the token itself. Let's say I can do POS tagging on my
sentences, which is a classification of their tokens. I think this may
be an interesting property for detecting named entities (e.g. a Place is
often introduced by a token with the POS tag IN). How can I exploit this
idea?
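For example, I imagine a custom generator along these lines (only a sketch, untested; the tokenToPos lookup is a placeholder for a real POS tagger, and I'm not sure this is the intended extension point):

```java
// Sketch: emit the POS tag of the current token as a feature.
// FeatureGeneratorAdapter provides empty implementations of the
// adaptive-data callbacks, so only createFeatures needs overriding.
public class PosTagFeatureGenerator extends FeatureGeneratorAdapter {

    private final Map<String, String> tokenToPos; // hypothetical lookup

    public PosTagFeatureGenerator(Map<String, String> tokenToPos) {
        this.tokenToPos = tokenToPos;
    }

    public void createFeatures(List<String> features, String[] tokens,
            int index, String[] previousOutcomes) {
        String pos = tokenToPos.get(tokens[index]);
        if (pos != null) {
            features.add("pos=" + pos);
        }
    }
}
```

Would something like this, added to the CachedFeatureGenerator array, be the right way to plug POS information in?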
new OutcomePriorFeatureGenerator(),
new PreviousMapFeatureGenerator(),
new BigramNameFeatureGenerator():
these FeatureGenerators aren't very clear to me and I would like to
understand them in more depth. I can only tell that they aren't used by
default.
new SentenceFeatureGenerator(true, false):
used to keep or skip the first and the last word of a sentence as
features (depending on the boolean parameters given as input). What is
the rationale for keeping the first word and skipping the last one? How
can I decide on this setting? What are the possible customizations of
this FeatureGenerator?
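For reference, this is the aggregate I'm currently using, copied from the documentation:

```java
AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[] {
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false) // first word yes, last word no
    });
```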
- Try the perceptron; it usually has a higher recall. Train it with a
cutoff of 0.
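If I understand the API correctly, that would be (untested sketch):

```java
TrainingParameters params = TrainingParameters.defaultParams();
params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
```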
- Use our built-in evaluation to test how a model performs; it can
output performance numbers
and print out misclassified samples.
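I guess this refers to TokenNameFinderEvaluator, so something like the following (untested sketch; I'm assuming the error listener from the cmdline package is the right hook for printing misclassified samples, and testSamples stands for an ObjectStream<NameSample> over held-out data):

```java
TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(
    new NameFinderME(model),
    new NameEvaluationErrorListener()); // prints misclassified samples
evaluator.evaluate(testSamples);
System.out.println(evaluator.getFMeasure());
```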
- Look carefully at misclassified samples, maybe there are patterns
which do not really work
with your model.
- Add training data which contains cases which should work but do not.
Hope this helps,
Jörn
Thank you for these hints, I will try each one carefully.
Regards,
Riccardo