Re: OpenNLP NER for Polish

Tomasz Sobczak Tue, 20 Aug 2013 12:06:25 -0700

OK, I will pay attention to untagged persons in my corpus.

I handle different forms of a first name with regular expressions, e.g.
(Tomasz|Tomek), where the second form is the diminutive. I prepared these
expressions from the Wikipedia list of Polish names.
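
A minimal sketch of this kind of regex-based tagging, in Python. It assumes
whitespace-tokenized sentences and the <START:person> ... <END> markup used in
OpenNLP name finder training data. The NAME_FORMS table and the helper names
are hypothetical and only stand in for the expressions built from the
Wikipedia list.

```python
import re

# Hypothetical sample of name variants; the real table would be built
# from the Wikipedia list of Polish names mentioned above.
NAME_FORMS = {
    "Tomasz": ["Tomasz", "Tomek"],
    "Katarzyna": ["Katarzyna", "Kasia"],
}

def build_pattern(name_forms):
    """Build one alternation such as (Katarzyna|Tomasz|...) over all variants."""
    variants = sorted(
        {v for forms in name_forms.values() for v in forms},
        key=len,
        reverse=True,  # longer variants first
    )
    return re.compile(r"\b(" + "|".join(map(re.escape, variants)) + r")\b")

def tag_first_names(sentence, pattern):
    """Wrap every matched name in <START:person> ... <END> markers."""
    return pattern.sub(r"<START:person> \1 <END>", sentence)

PATTERN = build_pattern(NAME_FORMS)
print(tag_first_names("Tomek i Kasia czytają książkę .", PATTERN))
```

Because the pattern only lists base forms, an inflected form such as "Tomka"
is left untagged, which is the inflection problem discussed in this thread.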

I stem the articles in the corpus because of the inflection of persons'
first names and surnames. But I don't stem the test data; thanks for the remark.
I will try your suggestion not to use a stemmer, but the problem with
inflection can be serious. I need automatic person tagging, which is why
I use a stemmer and then a regular expression to find the entity.
In Polish, name inflection is mostly realized by adding a suffix, but not
always, and that is where the problems arise.
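
One possible way to tag inflected forms without stemming the text is to put
the case endings into the expression itself. The sketch below is only an
illustration under that assumption: the INFLECTIONS table is hypothetical and
hand-written for a single name, and it uses the <START:person> ... <END>
markup of OpenNLP name finder training data. "Tomek" shows the "not always"
case, since its "e" drops out in most forms (Tomka, Tomkowi, ...), so the
endings attach to "Tom" rather than to the nominative form.

```python
import re

# Hypothetical inflection table: stem -> endings it can take.
# "Tomasz" simply takes suffixes; "Tomek" changes its stem, so it is
# listed under the shorter stem "Tom" with full endings.
INFLECTIONS = {
    "Tomasz": ["", "a", "owi", "em", "u"],
    "Tom": ["ek", "ka", "kowi", "kiem", "ku"],
}

def build_inflected_pattern(inflections):
    """Build a pattern that matches each stem followed by one of its endings."""
    alternatives = []
    for stem, endings in inflections.items():
        ordered = sorted(endings, key=len, reverse=True)  # longest ending first
        alternatives.append(
            re.escape(stem) + "(?:" + "|".join(map(re.escape, ordered)) + ")"
        )
    # \b on both sides keeps longer words such as "Tomaszewski" from matching.
    return re.compile(r"\b(" + "|".join(alternatives) + r")\b")

def tag_inflected(sentence, pattern):
    """Wrap every matched surface form in <START:person> ... <END> markers."""
    return pattern.sub(r"<START:person> \1 <END>", sentence)

INFLECTED = build_inflected_pattern(INFLECTIONS)
print(tag_inflected("Rozmawiałem z Tomaszem o Tomku .", INFLECTED))
```

The text itself stays unstemmed here, so the tagged training data keeps the
original word forms; the cost is that the ending lists have to be written or
generated per name.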

How can tokenization help me with language inflection?

One last thing: what kind of valuable information do I lose after stemming?
Does it make a difference to a NER tool whether it gets the original word or
its base (stemmed) form?
If the explanation is too complicated, could you recommend some materials to
read about it?

Thanks,
Tomek

2013/8/20 Svetoslav Marinov <[email protected]>

> As Jörn wrote you should tag ALL person names in your corpus, not just the
> famous ones.
>
> Then, Polish is a highly inflected language. How do you deal with all the
> case forms of a person name? Do you have them in the list? If you don't,
> that's one of the problems as well. Why do you need to stem the articles?
> Is it to account for the inflections? But then you should do exactly the
> same with your test data. However, I would strongly advise you not to use
> the stemmer. You lose a lot of valuable information which can help
> distinguish whether a word is a name or not. Just tag the texts as they are
> (maybe with some proper tokenization and sentence splitting) - this should
> improve the results.
>
> Svetoslav
> ________________________________________
> From: Jörn Kottmann <[email protected]>
> Sent: 20 August 2013 09:56
> To: [email protected]
> Subject: Re: OpenNLP NER for Polish
>
> On 08/20/2013 09:47 AM, Tomasz Sobczak wrote:
> > Could you suggest me what have I missed or what can I do better in my input
> > text file to improve my entity recognition?
>
> It's hard to tell without seeing your training data, but I suspect your
> tagging is too inconsistent,
> e.g. many people's names are not tagged.
>
> Try to use a linguistic annotation tool to annotate at least a few
> hundred articles with all mentioned
> person names.
>
> Jörn
>
