Re: [Apertium-stuff] Stop merging lines

Francis Tyers Thu, 08 Nov 2018 04:47:32 -0800

On 2018-11-08 09:57, Kevin Brubeck Unhammer wrote:

mansur <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Some examples of Apertium's tagger messing with lines.

Original:
Китаплар да, кешеләр
дә кайтты.

Аңа ярдәм
итәргә кирәк.

Output lines where partial merging occurred:
^Китаплар да/Китап<n><pl><nom>+да<cnjcoo>$^,/,<cm>$ ^кешеләр
дә/кеше<n><pl><nom>+да<cnjcoo>$
^кайтты/кайт<v><tv><ifi><p3><sg>$^./.<sent>$

^Аңа/Ул<prn><dem><dat>$ ^ярдәм итәргә/ярдәм ит<v><tv><inf>$
^кирәк/кирәк<n><sg><nom>+и<cop><aor><p3><sg>$^./.<sent>$

It is very difficult to find such cases in the big corpus.
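
One rough way to locate such cases in a large corpus is to look for output
lines where a lexical unit opens with ^ but is not closed with $ on the same
line. This is only a sketch: the function name find_split_units is made up
for illustration, and it assumes the analysed text contains no escaped \^ or
\$ characters and no superblanks containing ^ or $.

```shell
# Hypothetical helper: print the number of every line on which more
# lexical units are opened (^) than closed ($), i.e. a unit that
# continues past the line break.
# Assumption: no escaped \^ or \$ in the stream.
find_split_units() {
  awk '{
    opens  = gsub(/\^/, "&")   # count ^ (replacing each with itself)
    closes = gsub(/\$/, "&")   # count $ the same way
    if (opens > closes) print NR
  }' "$@"
}
```

For example, `find_split_units tagged.txt` (tagged.txt being a hypothetical
file holding the tagger output) would list the lines where a unit starts but
does not end.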

Best!
Mansur

OK, so this isn't actually two lines getting merged into one (that's why
the wc -l is the same), but a multiword where the latter part is moved
before the linebreak so it can actually be part of the analysis, i.e.

кешеләр
дә

on two lines gets the analysis

^кешеләр дә/кеше<n><pl><nom>+да<cnjcoo>$


where the linebreak is output *after* the analysis.

Do you not want the multiword analysis here? In that case, putting some
noise like .@#@ at the end of lines should work, assuming you have no
multiwords with those characters (but when doing translation, the period
at least should get an analysis, since unanalysed noise can get moved
around (or deleted) by transfer rules).
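
A minimal sketch of the noise approach, assuming the marker is appended after
a space and that .@#@ never occurs in the corpus or in any multiword. The
name mark_line_ends is a hypothetical helper. How the marker appears in the
analysis, and so how to strip it afterwards, depends on the analyser and is
not shown here.

```shell
# Hypothetical helper: append the noise marker " .@#@" to every input
# line so that a multiword cannot be matched across the line break.
# Assumption: the marker string does not occur anywhere in the input.
mark_line_ends() {
  sed 's/$/ .@#@/' "$@"
}
```

The marked text would then be piped into the usual mode, for instance
`mark_line_ends /tmp/test | apertium -d . tat-mansur`.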

The NUL solution also works, but it seems the tools expect the NUL to
come after a superblank like [][\n], so

$ sed 's/proc /proc -z /g' modes/tat-mansur.mode>modes/tat-mansur-z.mode


$ cat /tmp/test                   \
  | tr -d '\0'                    \
  | apertium-deshtml -n           \
  | sed 's/\[$/[][/; s/^]/]\x00/' \
  | sh modes/tat-mansur-z.mode    \
  | tr -d '\0'                    \
  | apertium-rehtml-noent

^Китаплар да/Китап<n><pl><nom>+да<cnjcoo>$^,/,<cm>$^кешеләр/кеше<n><pl><nom>$

 ^дә/да<cnjcoo>$ ^кайтты/кайт<v><tv><ifi><p3><sg>$^./.<sent>$

^Аңа/Ул<prn><dem><dat>$ ^ярдәм/ярдәм<n><sg><nom>$
 ^итәргә/ит<v><tv><inf>$
^кирәк/кирәк<n><sg><nom>+и<cop><aor><p3><sg>$^./.<sent>$


Maybe it'd make sense to have that as an option to apertium-destxt or
similar? So "apertium -f lines -d . tat-mansur" would add the -z's and
run with NUL's on each line, making the tools treat each line
separately, as if you'd just typed 'echo "$line"|apertium -d . tat-mansur'
for every line.
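
Until such an option exists, the per-line behaviour described above can be
approximated with a plain loop. This is a sketch only: per_line is a
hypothetical helper name, and it starts the whole pipeline once per input
line, so it will be slow on a big corpus.

```shell
# Hypothetical helper: run the given command once for each input line,
# feeding it exactly that line on stdin.
# The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
per_line() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line" | "$@"
  done
}
```

Usage would be along the lines of `per_line apertium -d . tat-mansur < /tmp/test`.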


That would be a good feature, but wouldn't get past the issue of the
tagger/cg. E.g. if we do that then the tagger can't take into account
context.

Another feature would be a compile option or mode for the Turkic
transducers which gets rid of multitokens.

Fran

