Hi Jörn, but I annotated my data according to the underscore convention: e.g. in "run_action", "run" is the data and "action" is the tag. Or are there other requirements for tags?

On Wed, Oct 30, 2013 at 2:08 PM, Jörn Kottmann <[email protected]> wrote:

> On 10/30/2013 10:25 AM, Yakov Keranchuk wrote:
>
>> Hi all!
>>
>> I encountered a small problem (as I think) with POSTaggerTrainer.
>>
>> The training file contains Russian and English words, e.g. "бежать_action",
>> in UTF-8 encoding. During training (with or without the -encoding UTF-8
>> option) I get the following:
>>
>> opennlp.tools.postag.WordTagSampleStream read
>> WARNING: Error during parsing, ignoring sentence: ъєяшы_action ....(the
>> rest of sentence)
>>
>> Where can the problem be?
>
> The training file format assumes that a token and a POS tag are always
> separated by an underscore. Since your data contains underscores, this
> does not work anymore; that's what the error message tries to tell you ...
>
> One way to solve this is to somehow get rid of the underscores in your
> text data.
>
> We have an open Jira issue to make the character used to separate a
> token and a tag configurable; this would probably solve your problem.
>
> I don't think implementing this will be much work; a contribution would be
> very welcome.
>
> HTH,
> Jörn
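
One way to check whether a training file matches the format described above is to scan it for tokens that do not have exactly one underscore between a word and a tag. The following is only a rough sketch in Python, not part of OpenNLP: the names find_suspect_tokens and check_file are hypothetical, and it assumes that tokens are separated by whitespace and that the file is read as UTF-8.

```python
def find_suspect_tokens(line, sep="_"):
    """Return the tokens in `line` that are not of the form word<sep>tag.

    Assumption: tokens are whitespace-separated, and a well-formed token
    contains exactly one separator with non-empty text on both sides.
    """
    suspects = []
    for token in line.split():
        # Split on the last separator; `found` is empty if there is none.
        word, found, tag = token.rpartition(sep)
        if not found or not word or not tag or sep in word:
            suspects.append(token)
    return suspects


def check_file(path, encoding="utf-8"):
    """Print every line number whose tokens look malformed (hypothetical helper)."""
    with open(path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            suspects = find_suspect_tokens(line)
            if suspects:
                print(f"line {number}: {suspects}")
```

Calling check_file on the training file lists the lines worth inspecting by hand. If it reports nothing, the underscores are probably not the cause, and the encoding used to read the file may be worth a closer look.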
