On 2018-11-08 09:57, Kevin Brubeck Unhammer wrote:
mansur <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Some examples of Apertium's tagger messing with lines.

Original:
Китаплар да, кешеләр
дә кайтты.

Аңа ярдәм
итәргә кирәк.

Output lines where partial merging occurred:
^Китаплар да/Китап<n><pl><nom>+да<cnjcoo>$^,/,<cm>$ ^кешеләр
дә/кеше<n><pl><nom>+да<cnjcoo>$
^кайтты/кайт<v><tv><ifi><p3><sg>$^./.<sent>$

^Аңа/Ул<prn><dem><dat>$ ^ярдәм итәргә/ярдәм ит<v><tv><inf>$
^кирәк/кирәк<n><sg><nom>+и<cop><aor><p3><sg>$^./.<sent>$

It is very difficult to find such cases in a big corpus.

Best!
Mansur

OK, so this isn't actually two lines getting merged into one (that's why
the wc -l count is the same), but a multiword whose latter part is moved
before the linebreak so that it can be part of the analysis, i.e.

кешеләр
дә

on two lines gets the analysis

^кешеләр дә/кеше<n><pl><nom>+да<cnjcoo>$


where the linebreak is output *after* the analysis.
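
To find such lines in a big corpus, one rough approach is to recover the
surface forms from the analysed output and compare them with the input line
by line; a line where a word was pulled across the linebreak will differ.
A minimal sketch, assuming both files have the same number of lines, the
text contains no tabs, and there are no escaped ^, / or $ characters or
superblanks. The names surface and moved_lines are hypothetical helpers,
not Apertium tools:

```shell
# Recover surface forms: drop everything from the first "/" of each
# lexical unit up to its closing "$", then drop the opening "^".
surface() {
  sed 's|/[^$]*\$||g; s|\^||g'
}

# Print the numbers of the lines whose surface text differs from the input.
# $1 = original text file, $2 = analysed output file
moved_lines() {
  surface < "$2" | paste "$1" - | awk -F '\t' '
    {
      a = $1; b = $2
      # ignore leading and trailing spaces left over from blanks
      gsub(/^ +| +$/, "", a)
      gsub(/^ +| +$/, "", b)
      if (a != b) print NR
    }'
}
```

Usage would be something like "moved_lines /tmp/test /tmp/test.out", where
/tmp/test.out is a placeholder for wherever the analysed output was saved.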

Do you not want the multiword analysis here? In that case, putting some
noise like .@#@ at the end of each line should work, assuming you have no
multiwords containing those characters. When doing translation, though, the
period at least should get an analysis, since unanalysed noise can get
moved around (or deleted) by transfer rules.
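
As a rough sketch of the noise-marker idea, the marker can be appended with
sed before the text goes into the pipeline. Skipping empty lines and
attaching the marker without a preceding space are assumptions here, and
add_marks is a hypothetical helper name:

```shell
# Append the end-of-line noise marker .@#@ to every non-empty line,
# so that a multiword cannot match across the linebreak.
add_marks() {
  sed '/./ s/$/.@#@/'
}

# Intended use (not run here):
#   add_marks < /tmp/test | apertium -d . tat-mansur
```

How the marker itself comes out of the analyser depends on the dictionary,
so removing it again from the output is left out of this sketch.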

The NUL solution also works, but it seems the tools expect the NUL to
come after a superblank like [][\n], so

$ sed 's/proc /proc -z /g' modes/tat-mansur.mode >modes/tat-mansur-z.mode

$ cat /tmp/test                   \
  | tr -d '\0'                    \
  | apertium-deshtml -n           \
  | sed 's/\[$/[][/; s/^]/]\x00/' \
  | sh modes/tat-mansur-z.mode    \
  | tr -d '\0'                    \
  | apertium-rehtml-noent
^Китаплар да/Китап<n><pl><nom>+да<cnjcoo>$^,/,<cm>$ ^кешеләр/кеше<n><pl><nom>$
 ^дә/да<cnjcoo>$ ^кайтты/кайт<v><tv><ifi><p3><sg>$^./.<sent>$

^Аңа/Ул<prn><dem><dat>$ ^ярдәм/ярдәм<n><sg><nom>$
 ^итәргә/ит<v><tv><inf>$
^кирәк/кирәк<n><sg><nom>+и<cop><aor><p3><sg>$^./.<sent>$
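
To see what the sed step in that pipeline does on its own, here is a small
sketch run on hand-made input. It assumes, judging from the sed expression
itself, that the deformatter output wraps each linebreak in a superblank, so
that a line ends with "[" and the next one starts with "]". It also relies on
GNU sed understanding \x00. The name mark_lines is a hypothetical helper:

```shell
# Same substitution as in the pipeline above: turn the "[" that ends a
# line into "[][" and put a NUL right after the "]" that starts the next.
mark_lines() {
  sed 's/\[$/[][/; s/^]/]\x00/'
}

# cat -v prints each NUL byte as ^@ so the result is visible.
printf 'one[\n]two[\n]' | mark_lines | cat -v
```

Each linebreak superblank then has an empty superblank before it and a NUL
after it, which is the [][\n] plus NUL shape mentioned above.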


Maybe it'd make sense to have that as an option to apertium-destxt or
similar? Then "apertium -f lines -d . tat-mansur" would add the -z's and
put a NUL after each line, making the tools treat each line
separately, as if you'd just typed 'echo "$line"|apertium -d . tat-mansur'
for every line.
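
Until something like that exists, the per-line behaviour can be approximated
with a plain shell loop. This is only a sketch: per_line is a hypothetical
helper name, and it starts a fresh pipeline for every line, so it will be
slow on a big corpus:

```shell
# Run the given command once per input line, feeding it just that line.
# The "|| [ -n "$line" ]" part keeps a final line without a newline.
per_line() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line" | "$@"
  done
}

# Intended use (not run here):
#   per_line apertium -d . tat-mansur < /tmp/test
```

The command is passed as arguments so the loop can be tried out with any
filter, e.g. "per_line tr a-z A-Z", before pointing it at apertium.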

That would be a good feature, but it wouldn't get around the issue with the
tagger/CG: if each line is processed separately, the tagger can't take
context from neighbouring lines into account.

Another feature would be a compile option or mode for the Turkic transducers
that gets rid of multitokens.

Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
