Francis Tyers <fty...@prompsit.com> wrote:
> Yes it does. It will put a sentence boundary after every word, meaning
> that you won't get reliable tagger output. Apertium as far as I know
> has no way to treat sentences as a sequence of lines. This is because
> of how the format handling works.
>
> I think it would really be an excellent feature though. Perhaps a
> GitHub issue? I do however think it would involve messing with quite a
> bit of the pipeline.

However, we *should* treat NULs as hard separators: if we don't,
apertium-apy (and thus www.apertium.org) will risk sending output meant
for person1 to person2. (I have an inkling there might still be bugs in
apertium-transfer related to this.)

Anyway, if we at least handle NULs correctly in lt-proc and cg-proc, you
could turn linebreaks into NULs (first deleting any existing NULs in the
corpus) and tag with the -z option to lt-/cg-proc:

    cat corpus.txt \
      | tr -d '\0' \
      | tr '\n' '\0' \
      | apertium-deshtml -n \
      | lt-proc -z -w 'apertium-tat/tat.automorf.bin' \
      | cg-proc -z 'apertium-tat/tat.rlx.bin' \
      | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' \
      | tr '\0' '\n' \
      | apertium-rehtml-noent

… finally turning NULs back into newlines.

With apertium-nob, this doesn't seem to run slower than without -z, and
it doesn't merge lines in my test corpus.
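
If you want to check the "doesn't merge lines" part on your own corpus before involving the Apertium tools, the newline/NUL conversion can be tried in isolation. This is only a rough sketch: the function names lines_to_nul, nul_to_lines and count_records are made up for illustration and are not part of Apertium, and it only exercises the tr steps from the pipeline above, not lt-proc or cg-proc.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Apertium.

# Turn each input line into a NUL-terminated record, first dropping
# any NUL bytes that were already in the input.
lines_to_nul() {
    tr -d '\0' | tr '\n' '\0'
}

# Turn NUL-terminated records back into ordinary lines.
nul_to_lines() {
    tr '\0' '\n'
}

# Count the NUL-terminated records on stdin: keep only the NUL bytes,
# count them, and strip the padding some wc implementations add.
count_records() {
    tr -cd '\0' | wc -c | tr -d ' '
}
```

Comparing `wc -l < corpus.txt` with `lines_to_nul < corpus.txt | count_records` should give the same number if the corpus ends in a newline; running the same count on the output of the tagging stages (before the final tr) would then show whether any stage merged or split records.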
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff