Re: POSTaggerTrainer encoding

Yakov Keranchuk Wed, 30 Oct 2013 03:23:29 -0700

Hi Jörn,

But I annotated my data following the underscore convention, e.g. "run_action", where
"run" is the token and "action" is the tag. Or are there other requirements for tags?









On Wed, Oct 30, 2013 at 2:08 PM, Jörn Kottmann <[email protected]> wrote:

> On 10/30/2013 10:25 AM, Yakov Keranchuk wrote:
>
>> Hi all!
>>
>>
>> I ran into what I think is a small problem with POSTaggerTrainer.
>>
>> The training file contains Russian and English words, e.g. "бежать_action", in
>> UTF-8 encoding. During training (with or without the -encoding UTF-8 option)
>> I get the following:
>>
>> opennlp.tools.postag.WordTagSampleStream read
>> WARNING: Error during parsing, ignoring sentence: ъєяшы_action ....(the
>> rest of sentence)
>>
>> What could the problem be?
>>
>
> The training file format assumes that a token and its POS tag are always
> separated by an underscore.
> Since your data contains underscores, this no longer works; that's
> what the error message is trying
> to tell you ...
>
> One way to solve this is to somehow get rid of the underscores in your
> text data.
>
> We have an open Jira issue to make the character used to separate a
> token and a tag configurable;
> this would probably solve your problem.
>
> I don't think implementing this will be much work; a contribution would be
> very welcome.
>
> HTH,
> Jörn
>
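
For readers following the thread, here is a minimal Java sketch of the "token_tag" format discussed above. It is not OpenNLP's actual code: the class WordTagSketch and the method parseLine are hypothetical names made up for illustration, and splitting at the last underscore is an assumption about how such a line can reasonably be parsed. It also decodes the input explicitly as UTF-8 rather than relying on the platform default encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: parses one line of "token_tag" pairs.
class WordTagSketch {

    // Splits a sentence on whitespace, then splits each item at its LAST
    // underscore (an assumption), so "run_action" yields token "run", tag "action".
    static List<String[]> parseLine(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            int split = item.lastIndexOf('_');
            if (split == -1) {
                // No separator found: the item cannot be read as token_tag.
                throw new IllegalArgumentException("No '_' in token: " + item);
            }
            pairs.add(new String[] {item.substring(0, split), item.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // "\u0431\u0435\u0436\u0430\u0442\u044c" is the Russian word from the thread,
        // written with Unicode escapes so the source file encoding does not matter.
        byte[] raw = "\u0431\u0435\u0436\u0430\u0442\u044c_action run_action"
                .getBytes(StandardCharsets.UTF_8);
        // Decode explicitly as UTF-8 instead of using the platform default charset.
        String line = new String(raw, StandardCharsets.UTF_8);
        for (String[] pair : parseLine(line)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Whether the Cyrillic token prints correctly depends on the console encoding, which is separate from how the training file itself is decoded.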
