I'm looking at po/*.po and have found, well, lots of stuff that worries me.

Firstly, there are 582 cases where the msgid includes a % directive (such
as %s) and the msgstr doesn't, or vice versa.

(Before running the following checks, I folded all multi-line msgids and
msgstrs into single quoted strings.
I'm using GNU tools (grep et al); apologies to anyone who wants to
reproduce my checks without those tools.)
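
For anyone wanting to reproduce the folding step, here is a minimal sketch of one way it could be done. It assumes every continuation line starts with a double quote and that nothing (no trailing space or CR) follows a closing quote. `po_fold` is a hypothetical helper name, and this is not necessarily how the counts in this message were produced.

```shell
# Sketch: join multi-line msgid/msgstr strings onto a single line.
# po_fold is a hypothetical name; it reads the named files, or stdin.
po_fold() {
    LC_ALL=C awk '
        /^"/ && have {
            # Continuation line: drop the closing quote of the buffered
            # line and the opening quote of this one, then join them.
            buf = substr(buf, 1, length(buf) - 1) substr($0, 2)
            next
        }
        { if (have) print buf; buf = $0; have = 1 }
        END { if (have) print buf }
    ' "$@"
}
```

It could be used as, for example, `po_fold po/ca.po | grep --text -A1 '^msgid.*%'`. Running awk under `LC_ALL=C` makes it treat the input as bytes, which sidesteps encoding trouble in files that are not UTF-8.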

$ grep --text -A1 '^msgid[^%]*$' po/*.po* |
> grep --text -c 'msgstr.*%'
493
$ grep --text -A1 '^msgid.*%' po/*.po* |
> grep --text -v 'msgstr ""$' |
> grep --text -c 'msgstr[^%]*$'
89

Change « -c » to « -B1 » to see the affected translations.
I needed to add «--text» to avoid warnings about non-text files
(because po/af.po is Latin-1 rather than UTF-8 or plain ASCII).
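
The two pipelines above can also be combined into a single pass that prints each suspect pair. What follows is only a sketch, assuming the input has already been folded as described; `po_percent_mismatch` is a hypothetical name. It skips the header entry and untranslated entries, ignores plural forms (`msgstr[0]` and so on), and, like the greps, treats any % as a directive (including a literal %%), so its counts need not match the figures above.

```shell
# Sketch: list entries where only one of msgid/msgstr contains a "%".
# Assumes each msgid and msgstr is already on a single line.
po_percent_mismatch() {
    LC_ALL=C awk '
        /^msgid "/  { id = $0; next }
        /^msgstr "/ {
            # Skip the header entry (empty msgid) and untranslated entries.
            if (id != "msgid \"\"" && $0 != "msgstr \"\"" &&
                (id ~ /%/) != ($0 ~ /%/))
                printf "%s\n%s\n\n", id, $0
        }
    ' "$@"
}
```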

Most cases where the msgstr includes a spurious “%s” are easily fixed -
just cut it out.
Cases where the msgstr is missing “%s” will need a bit more work.

This then led me down a bit of a rabbit hole, noticing other stuff that
seems strange. (My background is in comparative linguistics rather than
fluency in particular languages.)

Most translation sets appear to be reasonable, but the translation sets for
some languages are, ahem, disappointing.

Some things were obvious to me even as someone who doesn't speak the
language in question:

   - Multiple msgids have the same msgstr (not counting where the same
   msgstr is shared between languages):

$ grep --text -H '^msgstr....' po/*.po* | sort | uniq -c | grep -cwv '^ *1'
561

Or by language:

$ grep --text -H '^msgstr....' po/*.po* | sort | uniq -c |
> grep --text -wv '^ *1' | cut -c9- | cut -d: -f1 | uniq -c | sort -r
     56 po/af.po
     41 po/lt.po
     40 po/gl.po
     39 po/fi.po
     38 po/da.po
     35 po/sl.po
     34 po/sk.po
     32 po/vi.po
     25 po/ga.po
     24 po/hu.po
     21 po/el.po
     20 po/id.po
     19 po/ru.po
     18 po/eo.po
     16 po/ja.po
     15 po/nb.po
     15 po/ca.po
     14 po/zh_TW.po
     13 po/tr.po
      5 po/zh_CN.po
      5 po/sr.po
      4 po/pl.po
      3 po/uk.po
      3 po/ko.po
      3 po/bg.po
      2 po/sv.po
      2 po/ro.po
      2 po/nl.po
      2 po/ka.po
      2 po/it.po
      2 po/hr.po
      2 po/et.po
      2 po/es.po
      2 po/de.po
      2 po/cs.po
      1 po/pt_BR.po
      1 po/pt.po
      1 po/fr.po
High duplication counts are probably due to copy-and-paste errors rather
than translation errors.
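
To see which msgids share a msgstr, rather than just counting them, something like the following sketch might help. It assumes folded input and is meant to be run on one language file at a time, so that a msgstr shared between languages is not reported. `po_shared_msgstr` is a hypothetical name.

```shell
# Sketch: print each msgstr that is used by more than one msgid,
# followed by the indented msgids that share it.
po_shared_msgstr() {
    LC_ALL=C awk '
        /^msgid "/  { id = $0; next }
        /^msgstr "/ {
            if ($0 == "msgstr \"\"") next       # untranslated
            n[$0]++
            ids[$0] = ids[$0] "    " id "\n"
        }
        END {
            # Group order is unspecified; msgids keep their file order.
            for (s in n)
                if (n[s] > 1)
                    printf "%s\n%s", s, ids[s]
        }
    ' "$@"
}
```

Legitimate duplicates will show up too, so the output is a list of candidates for review rather than a list of errors.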

   - Some translations are unexpectedly short (less than half the words or
   syllables of the English text)


   - Back-tick quotes are not properly translated:

$ grep --text -A1 ^'msgid.*`' po/*.po* | grep --text -v 'msgstr ""$' |
 grep --text -c 'msgstr.*`'
1062

   - Sometimes back-tick quotes are used when they're not present in the
   msgid:

$ grep --text -A1 ^'msgid[^`]*$' po/*.po* | grep --text -v 'msgstr ""$' |
 grep --text -c 'msgstr.*`'
68
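
No command is given above for the "unexpectedly short" check. One rough way to approximate it is sketched below, assuming folded input and counting whitespace-separated words only (not syllables). `po_short_msgstr` is a hypothetical name. Languages written without spaces between words, such as Chinese or Japanese, will be flagged almost everywhere, so the output is only meaningful for space-delimited scripts.

```shell
# Sketch: flag translations with fewer than half the words of the msgid.
po_short_msgstr() {
    LC_ALL=C awk '
        # Count whitespace-separated words inside the quoted string.
        function words(s,    parts) {
            sub(/^[a-z]+ "/, "", s)     # drop the leading keyword and quote
            sub(/"$/, "", s)            # drop the closing quote
            gsub(/\\n/, " ", s)         # treat an escaped newline as a space
            return split(s, parts)
        }
        /^msgid "/  { id = $0; next }
        /^msgstr "/ {
            if ($0 == "msgstr \"\"") next       # untranslated
            wi = words(id); ws = words($0)
            if (2 * ws < wi)
                printf "%d/%d words\n%s\n%s\n\n", ws, wi, id, $0
        }
    ' "$@"
}
```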

Clearly *some* of these will be valid for the languages concerned, but the
large number of anomalies in some languages is concerning. Doing spot
checks using Google Translate to perform reverse translations confirms that
a large proportion of these are indeed mistakes. Two examples:

In po/ca.po (Catalan) one msgstr

   - "%s: el límit de temps no és vàlid"

is given as the translation for 2 msgids

   - "%s: invalid job specification"
   - "%s: invalid timeout specification"

when the reverse translation appears to be:

   - time limit is invalid

(I spotted this one just based on "tempus" being Latin for "time".)

In po/af.po (Afrikaans), two semantically interchangeable msgstrs:

   - "Pypfout.\n"
   - "pypfout: %s"

are given as the translations for 9 msgids:

   - "%s: expression error\n"
   - "%s: missing separator"
   - "Bus error"
   - "pipe error"
   - "programming error"
   - "read error"
   - "redirection error: cannot duplicate fd"
   - "script file read error"
   - "write error"

when the reverse translation appears to be:

   - "pipe failure"

Is it worthwhile for a non-speaker to attempt to resolve any of these, or
should they be referred back to the translators?
Are machine-assisted translations acceptable as fill-ins until humans can
look them over? Or should faulty translations simply be deleted?

Some of the anomalies I've found suggest misinterpretation of technical
English. For example “script file read error” is translated in ways that
suggest that the script itself is invalid, rather than that the file
containing it is unreadable, apparently because "script" was taken as the
verb:

   - po/ca.po-msgstr "error d'escriptura: %s" ("typing error" in Catalan)
   - po/da.po-msgstr "skrivefejl: %s" ("typo" in Danish)
   - po/eo.po-msgstr "Eraro ĉe skribo: %s" ("typing error" in Esperanto)
   - po/fi.po-msgstr "kirjoitusvirhe: %s" ("typo" in Finnish)
   - po/gl.po-msgstr "erro de escritura: %s" ("typo" in Galician)
   - po/nb.po-msgstr "skrivefeil: %s" ("typing error" in Norwegian Bokmål)


   - po/ga.po-msgstr "earráid scríofa: %s" ("writing error" in Irish Gaelic)
   - po/hu.po-msgstr "írási hiba: %s" ("writing error" in Hungarian)
   - po/id.po-msgstr "gagal menulis: %s" ("failed to write" in Indonesian)
   - po/lt.po-msgstr "rašymo klaida: %s" ("write error" in Lithuanian)
   - po/ru.po-msgstr "ошибка записи: %s" ("write error" or "recording
   error" in Russian)
   - po/sk.po-msgstr "chyba zapisovania" ("write error" in Slovakian)
   - po/sl.po-msgstr "napaka med pisanjem" ("write error" in Slovenian)
   - po/tr.po-msgstr "yazma hatası: %s" ("write error" in Turkish)
   - po/vi.po-msgstr "lỗi ghi: %s" ("write error" in Vietnamese)
   - po/zh_TW.po-msgstr "寫入時發生錯誤:%s" ("error while writing" in
   Traditional Chinese as used in Taiwan)

I'm relying on automated translations for these, so I've likely missed some
nuances, but to have consistent misinterpretation among many languages
suggests that the subtlety of the English version is also a problem, as
most of the "correct" translations come back with something like "error
while reading script file" or "error while reading file of script".

Should I/we rewrite some of the English messages to make them easier for
translators?

Lastly, some error messages indicate internal faults, and include function
names or other code symbols.
Am I right in assuming that they should be left untranslated? For example,
po/sr.po has

   - msgid "shell_getc: shell_input_line_size (%zu) exceeds SIZE_MAX (%zu):
   line truncated"
   msgstr "shell_getc: величина_реда_улаза_шкољке (%zu) је премашила
   НАЈВЕЋУ_ВЕЛИЧИНУ (%zu): ред је скраћен"

when maybe it should have

   - msgid "shell_getc: shell_input_line_size (%zu) exceeds SIZE_MAX (%zu):
   line truncated"
   msgstr "shell_getc: shell_input_line_size (%zu) је премашила SIZE_MAX
   (%zu): ред је скраћен"

Should these be adjusted to keep code symbols untranslated? With or without
translations? Perhaps

   - msgstr "shell_getc: величина реда улаза шкољке (shell_input_line_size=%zu)
   је премашила највећу величину (SIZE_MAX=%zu): ред је скраћен"

or

   - msgstr "shell_getc: shell_input_line_size (величина реда улаза шкољке=%zu)
   је премашила SIZE_MAX (највећу величину=%zu): ред је скраћен"
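
Whichever convention is preferred, translated code symbols could be found mechanically. The sketch below assumes folded input and uses a deliberately crude heuristic: any ASCII identifier in the msgid that contains an underscore is treated as a code symbol and is expected to appear verbatim in the msgstr. `po_lost_symbols` is a hypothetical name, and an escape such as \n directly before an identifier will confuse it.

```shell
# Sketch: report underscore-containing identifiers that appear in a msgid
# but not verbatim in its msgstr.
po_lost_symbols() {
    LC_ALL=C awk '
        /^msgid "/  { id = $0; next }
        /^msgstr "/ {
            if ($0 == "msgstr \"\"") next       # untranslated
            rest = id
            # Walk through every identifier-like token in the msgid.
            while (match(rest, /[A-Za-z][A-Za-z0-9]*_[A-Za-z0-9_]+/)) {
                sym = substr(rest, RSTART, RLENGTH)
                rest = substr(rest, RSTART + RLENGTH)
                if (index($0, sym) == 0)
                    printf "%s: missing in %s\n", sym, $0
            }
        }
    ' "$@"
}
```

Run against a folded copy of po/sr.po, the first form quoted above would presumably be reported for shell_input_line_size and SIZE_MAX, while the forms that keep both symbols would pass.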

-Martin

PS: a counterexample (identical translations that are legitimate) is in
builtins/caller.def, which has these two lines:

   - Return the context of the current subroutine call.
   - N_("Returns the context of the current subroutine call.\n\

which differ only by “return” vs “returns”, and thus result in the same
translation in many languages. Perhaps we need some guidance on whether
help docs should be written in imperative form or in descriptive form?
