On 2018-11-06 17:24, mansur wrote:
Hello!

It was quite difficult to find the few cases where lines get merged
in a file containing tens of millions of lines overall.
It seems that this time the cause is not apertium's tagger, as it
was previously. These cases happened because there were some
non-printable characters that my other scripts removed...

It seems that Fran's recommendation worked, and apertium doesn't
merge lines with that line ending. I modified it a little for
convenience:

I add this marker to each line before passing the text to the tagger:

sed -r 's/$/ __@@__@@__ @\.@#\.#/'

And remove it this way after tagging is done:
sed -r 's/ *__@@__@@__.*$//'
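The two sed commands above can be wrapped into a pair of helpers so the round trip is easy to check on a small sample. This is only a sketch: the function names add_marker and strip_marker are hypothetical, `cat` stands in for the real tagger pipeline (whose output format will differ), and `-E` is used as the more portable spelling of `-r`.

```shell
#!/bin/sh
# Hypothetical helper names; the sed expressions are the ones quoted above.

# Append the end-of-line marker to every input line (before tagging).
add_marker() {
    sed -E 's/$/ __@@__@@__ @\.@#\.#/'
}

# Remove the marker, any spaces before it, and everything after it
# on the same line (after tagging).
strip_marker() {
    sed -E 's/ *__@@__@@__.*$//'
}

# Round trip on two sample lines; `cat` is only a placeholder for the tagger.
printf 'first line\nsecond line\n' | add_marker | cat | strip_marker
```

Comparing `wc -l` on the input and on the stripped output is a cheap way to confirm that no lines were merged along the way.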

What I want to ask is: doesn't this marker somehow affect the
tagger's marking of the other words in a sentence? Doesn't it
change the POS that the tagger guesses for the other words?


Yes, it does. It will put a sentence boundary after every word, meaning that you won't get reliable tagger output. As far as I know, Apertium has no way to treat a sentence as a sequence of lines. This is because of how the format handling works.

I think it would really be an excellent feature, though. Perhaps open a GitHub issue? I do think, however, that it would involve messing with quite a bit of the pipeline.

Fran


