OK, I will pay attention to untagged persons in my corpus. I handle different forms of a first name with regular expressions, e.g. (Tomasz|Tomek), where the second form is the diminutive. I prepared these expressions based on the Wikipedia list of Polish names.
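To make the regular-expression approach concrete, here is a minimal sketch in Python of how such an alternation could be built and used to wrap matches in the `<START:person> ... <END>` markers that OpenNLP name finder training data uses. The variant table, the function names, and the "first name followed by one capitalized word" surname heuristic are assumptions for illustration only, not my actual setup; the inflected forms listed are just a few examples.

```python
import re

# Hypothetical variant table: base form, diminutive, and a few inflected
# forms of each. A real table would be generated from a name list.
FIRST_NAME_FORMS = {
    "Tomasz": ["Tomasz", "Tomasza", "Tomaszowi", "Tomaszem",
               "Tomek", "Tomka", "Tomkowi", "Tomkiem"],
}

def build_person_pattern(forms_by_name):
    """Compile one alternation over all known first-name forms."""
    forms = sorted({f for fs in forms_by_name.values() for f in fs},
                   key=len, reverse=True)  # try longer forms first
    first = "|".join(re.escape(f) for f in forms)
    # Assumption: the surname is the single capitalized word that follows.
    surname = r"[A-ZĄĆĘŁŃÓŚŹŻ][a-ząćęłńóśźż]+"
    return re.compile(r"\b(?:%s)\s+%s\b" % (first, surname))

def tag_persons(sentence, pattern):
    """Wrap every match in OpenNLP-style name markers."""
    return pattern.sub(
        lambda m: "<START:person> " + m.group(0) + " <END>", sentence)

PATTERN = build_person_pattern(FIRST_NAME_FORMS)
print(tag_persons("Wczoraj spotkałem Tomka Sobczaka w Krakowie .", PATTERN))
```

Because the inflected forms are listed in the pattern itself, this sketch works on unstemmed text; the cost is that every case form has to be enumerated or generated.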
I stem the articles in the corpus because person names and surnames are inflected. But I don't stem the test data - thanks for the remark. I will try your suggestion not to use the stemmer, but the inflection problem can be serious. I need automatic person tagging; that's why I use the stemmer and then a regular expression to find the entity. In Polish, name inflection is mostly realized by adding a suffix, but not always, and that is where the problems arise.

How can tokenization help me with inflection? One last thing: what kind of valuable information do I lose after stemming? Does it make any difference to a NER tool whether it sees the original word or its stemmed base form? If the explanation is too complicated, could you recommend some materials to read about it?

Thanks,
Tomek

2013/8/20 Svetoslav Marinov <[email protected]>

> As Jörn wrote, you should tag ALL person names in your corpus, not just the
> famous ones.
>
> Then, Polish is a highly inflected language. How do you deal with all the
> case forms of a person name? Do you have them in the list? If you don't,
> that's one of the problems as well. Why do you need to stem the articles?
> Is it to account for the inflections? But then you should do exactly the
> same with your test data. However, I would strongly advise you not to use
> the stemmer. You lose a lot of valuable information which can help
> distinguish whether a word is a name or not. Just tag the texts as they are
> (maybe with some proper tokenization and sentence splitting) - this should
> improve the results.
>
> Svetoslav
> ________________________________________
> From: Jörn Kottmann <[email protected]>
> Sent: 20 August 2013 09:56
> To: [email protected]
> Subject: Re: OpenNLP NER for Polish
>
> On 08/20/2013 09:47 AM, Tomasz Sobczak wrote:
> > Could you suggest what I have missed or what I can do better in my input
> > text file to improve my entity recognition?
>
> It's hard to tell without seeing your training data, but I suspect your
> tagging is too inconsistent, e.g. many people's names are not tagged.
>
> Try to use a linguistic annotation tool to annotate at least a few
> hundred articles with all mentioned person names.
>
> Jörn
