Re: [Apertium-stuff] Stop merging lines

Francis Tyers Tue, 06 Nov 2018 08:49:33 -0800


On 2018-11-06 17:24, mansur wrote:

Hello!


It was quite difficult to find the few cases where line merging
happens in a file containing tens of millions of lines overall.
And it seems that in this case it is not because of apertium's tagger
like the previous time. These cases happened because there were some
non-printable characters that my other scripts removed...

It seems that Fran's recommendation worked out and apertium doesn't
merge lines with that line ending. I modified it a little bit for
convenience:

I add the marker before passing the text to the tagger:

sed -r 's/$/ __@@__@@__ @\.@#\.#/'

And remove it this way after tagging is done:
sed -r 's/ *__@@__@@__.*$//'
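
As a rough sketch of how the two commands above fit together, the round trip can be checked on a small sample. The `tagger` function here is only a hypothetical stand-in (it passes text through unchanged), and the helper names `add_marker` and `remove_marker` are made up for illustration; `sed -r` assumes GNU sed. This only shows that the line count and line content survive the round trip; it says nothing about how the marker affects tagging.

```shell
#!/bin/sh
# Hypothetical stand-in for the real tagger pipeline: passes text through as-is.
tagger() { cat; }

# Append the end-of-line marker to every line (command from the message above).
add_marker() { sed -r 's/$/ __@@__@@__ @\.@#\.#/'; }

# Strip the marker and everything after it on each line.
remove_marker() { sed -r 's/ *__@@__@@__.*$//'; }

# Two-line sample input (illustrative only).
input='first line
second line'

output=$(printf '%s\n' "$input" | add_marker | tagger | remove_marker)

printf '%s\n' "$output"
```

With a real tagger in place of `tagger`, comparing `wc -l` on the input and output files would be one way to confirm that no lines were merged.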

What I want to ask is: doesn't this marker somehow affect the
tagger's analysis of the other words in a sentence? Doesn't it
change the POS that the tagger guesses for other words?

Yes it does. It will put a sentence boundary after every word, meaning that you won't get reliable tagger output. Apertium as far as I know has no way to treat sentences as a sequence of lines. This is because of how the format handling works.

I think it would really be an excellent feature though. Perhaps a GitHub issue? I do however think it would involve messing with quite a bit of the pipeline.


Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
