On 2018-11-06 17:24, mansur wrote:
Hello!
It was quite difficult to find the few cases where line merging
happens in a file containing tens of millions of lines overall.
It seems that this time the cause is not apertium's tagger, as it
was previously. These cases happened because there were some
non-printable characters that my other scripts removed...
It seems that Fran's recommendation worked and apertium doesn't
merge lines that end with that marker. I modified it slightly for
convenience.
Before passing the text to the tagger, I append the marker to each line:
sed -r 's/$/ __@@__@@__ @\.@#\.#/'
And I remove it this way after tagging is done:
sed -r 's/ *__@@__@@__.*$//'
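Put together, the round trip can be sketched as below. This is only a minimal illustration: `tagger` is a hypothetical stand-in (here just `cat`) for whatever tagging pipeline is actually used, and the function names `add_marker` and `strip_marker` are made up for the example. It assumes GNU sed, since it uses `-r`. Comparing line counts before and after is a cheap way to check that no lines were merged.

```shell
#!/bin/sh
# Hypothetical stand-in for the real tagging pipeline; replace with the actual command.
tagger() { cat; }

# Append the end-of-line marker to every line before tagging.
add_marker() { sed -r 's/$/ __@@__@@__ @\.@#\.#/'; }

# Strip the marker (and anything after it on the line) once tagging is done.
strip_marker() { sed -r 's/ *__@@__@@__.*$//'; }

# Round trip on a two-line sample.
printf 'first line\nsecond line\n' | add_marker | tagger | strip_marker

# Sanity check: the number of lines must be the same before and after.
before=$(printf 'first line\nsecond line\n' | wc -l)
after=$(printf 'first line\nsecond line\n' | add_marker | tagger | strip_marker | wc -l)
[ "$before" -eq "$after" ] && echo "line count preserved"
```

With the `cat` stand-in this only shows that the marker is added and removed cleanly. It says nothing about how a real tagger treats the marker.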
What I want to ask you is: does this marker somehow affect how the
tagger tags the other words in a sentence? Does it change the POS
that the tagger would otherwise have guessed for those words?
Yes, it does. It will put a sentence boundary after every word, meaning
that you won't get reliable tagger output. As far as I know, Apertium
has no way to treat a sentence as a sequence of lines. This is because
of how the format handling works.
I think it would be an excellent feature, though. Perhaps open a GitHub
issue? I do think, however, that it would involve changing quite a bit
of the pipeline.
Fran
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff