Re: Grammatical forms in translatable texts

Akim Demaille Sun, 19 Apr 2020 01:32:21 -0700

Hi Frank,

Cc'd to Bruno, the gettext maintainer.  If he has time, I'm
sure he'll have good suggestions!


> On 18 Apr 2020, at 17:09, Frank Heckenbach <[email protected]> wrote:
> 
> Hi,
> 
> when I played a bit with the new i18n features in Bison 3.5.90,
> I noticed some grammatical issues with the generated texts. I think
> this belongs on a Bison list, since it affects not only the
> translations themselves, but also the translatable texts Bison
> emits.
> 
> E.g.:
> 
>  [en] msgid "syntax error, unexpected %s"
>  [de] msgstr "Syntaxfehler, unerwartetes %s"
>  [fr] msgstr "erreur de syntaxe, %s inattendu"
> 
> I know (for de) and think (for fr) that "unerwartetes"/"inattendu"
> needs to take different forms depending on the gender of %s.

You're right about French.  You can also add Spanish and Italian.
And plenty of others I guess :)


> Another case:
> 
>  [en] msgid "syntax error, unexpected %s, expecting %s or %s"
>  [de] msgstr "Syntaxfehler, unerwartetes %s, hatte %s oder %s erwartet"
>  [fr] msgstr "erreur de syntaxe, %s inattendu, attendait %s ou %s"
> 
> The way it's worded (it could be worded differently) is so that the
> first %s must be in the nominative (subject) case, but the other
> ones in the accusative (object) case, in de, I think also in fr (and
> theoretically even in en, but there's no difference except for a few
> words like "I"/"me" which are unlikely to occur in token names).

I don't think there would be a problem in French, except for
the article, as you mention below.


> Luckily in de, these forms are often the same -- but not always,
> e.g. "Buchstabe"(nom.) ("letter") / "Buchstaben"(acc.). They're
> often different for adjectives which might easily be part of token
> names, and for articles -- which brings us to another point:
> 
> The first %s requires no article in de just like in en; the other
> ones, strictly speaking, do require an article (though in a short
> message like this, it might be barely acceptable to omit them, in de
> somewhat less so than in en).
> 
> Seeing that de is rather closely related to en, compared to most
> other languages, other languages might have even more grammatical
> issues.

I agree.  But I don't see how we can solve the problem of the
general pattern here.  See below for what I have in mind for a
completely different approach in the future.

> As a complex example, using these token names:
> 
>  "Cyrillic letter" -> "kyrillischer Buchstabe"
>  "Latin letter" -> "lateinischer Buchstabe"
>  "Greek letter" -> "griechischer Buchstabe"
> 
> a correctly translated message in de would look like this:
> 
>  "Syntaxfehler, unerwarteter(nom.masc.) kyrillischer(nom.masc.)
>  Buchstabe(nom.), hatte einen(article/acc.masc.)
>  lateinischen(acc.masc.) Buchstaben(acc.) oder [einen](same
>  article, optional) griechischen(acc.masc.) Buchstaben(acc.)
>  erwartet"
> 
> Of course, you might consider this nitpicking. I bring it up because
> with the current wording of the translatable texts, it's basically
> impossible to produce grammatically correct translations in all
> cases.

Agreed.

> Also, as currently worded, the token names themselves would need to
> be translated differently in different contexts, so Bison users
> would have to be aware of that. I've done something similar in
> another program of mine where I needed two forms (only two,
> luckily), and defined a "|" in the translations to separate them
> (with no "|" meaning the same form for both), e.g.:
> 
>  msgid "Cyrillic letter"
>  msgstr "kyrillischer Buchstabe|kyrillischen Buchstaben"
> 
> Of course, this requires (and would require in Bison) the caller of
> "_" to parse this, or pass a parameter to "_", so "_" could parse it
> (more flexible, and those who don't care about it could just ignore
> that parameter).
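
A minimal sketch of the two-form convention quoted above, assuming the
translated string has the shape "nominative|accusative" and that a
missing "|" means one form serves for both.  The helper name
select_form and the numbering of the forms are hypothetical; nothing
like this exists in Bison or gettext.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy form number FORM (0 = first form,
   anything else = second form) of TRANSLATED into BUF, which holds
   SIZE bytes.  TRANSLATED looks like "first|second"; without a '|'
   the whole string is used for both forms.  Output that does not fit
   is truncated bytewise, which may split a multibyte character.  */
static const char *
select_form (const char *translated, int form, char *buf, size_t size)
{
  assert (translated && buf && size > 0);
  const char *bar = strchr (translated, '|');
  const char *start = translated;
  size_t len = strlen (translated);
  if (bar)
    {
      if (form == 0)
        len = (size_t) (bar - translated);
      else
        {
          start = bar + 1;
          len = strlen (start);
        }
    }
  if (len >= size)
    len = size - 1;
  memcpy (buf, start, len);
  buf[len] = '\0';
  return buf;
}
```

With the msgstr from the example, form 0 yields "kyrillischer
Buchstabe" and form 1 yields "kyrillischen Buchstaben".  A copy is
needed here because the first form sits in the middle of the string.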

I don't know exactly what problem you are addressing here,
but I believe contexts would help
(https://www.gnu.org/software/gettext/manual/html_node/Contexts.html).

In the case of Bison, it would require that we support
pgettext, something like

%token EOF p_("token", "end of input")

but that's heavy.
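
For readers unfamiliar with contexts, here is a rough sketch of what a
pgettext-style lookup does: the context and the msgid are glued
together with the EOT character (\004) to form the catalogue key, and
the bare msgid is returned when that key has no translation.  The
names toy_lookup and pgettext_sketch, the one-entry catalogue and its
French string are all made up for illustration; this is not the code
from gettext.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Separator between context and msgid in the lookup key.  */
#define CONTEXT_GLUE "\004"

/* Hypothetical stand-in for the real catalogue lookup: one entry.
   Returns KEY itself (same pointer) when there is no translation,
   which is also how gettext() signals a miss.  */
static const char *
toy_lookup (const char *key)
{
  if (strcmp (key, "token" CONTEXT_GLUE "end of input") == 0)
    return "fin de fichier";
  return key;
}

/* Look up MSGID in context CTXT; fall back to MSGID when missing.  */
static const char *
pgettext_sketch (const char *ctxt, const char *msgid)
{
  assert (ctxt && msgid);
  char key[256];
  int n = snprintf (key, sizeof key, "%s" CONTEXT_GLUE "%s", ctxt, msgid);
  if (n < 0 || (size_t) n >= sizeof key)
    return msgid;               /* Key too long: give up on translating.  */
  const char *res = toy_lookup (key);
  /* Never return KEY: it points into a local buffer.  */
  return res == key ? msgid : res;
}
```

Under these assumptions pgettext_sketch ("token", "end of input")
gives the catalogue entry, while pgettext_sketch ("token", "string")
falls back to "string".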

There's a feature in Ruby's FastGettext that I like: s_, or
sgettext, which includes the context in the msgid itself
(https://github.com/grosser/fast_gettext#s_-or-sgetext-translation-with-namespace).

Of course s_("foo|bar") returns "bar" if "foo|bar" is not in the
catalogue.


GNU Gettext does not feature this, and it can't be tricked
into accepting it by using \004 in the msgid:

> diff --git a/src/parse-gram.y b/src/parse-gram.y
> index d09f49a7..db06e45e 100644
> --- a/src/parse-gram.y
> +++ b/src/parse-gram.y
> @@ -33,6 +33,9 @@
>  {
>    #include "system.h"
>  
> +#undef _
> +#define _(Msgid) (pgettext_expr ("", Msgid))
> +  
>    #include <c-ctype.h>
>    #include <errno.h>
>    #include <intprops.h>
> @@ -141,8 +144,8 @@
>  }
>  
>  %token
> -  STRING              _("string")
> -  TSTRING             _("translatable string")
> +  STRING              _("token\004string")
> +  TSTRING             _("token\004translatable string")
>  
>    PERCENT_TOKEN       "%token"
>    PERCENT_NTERM       "%nterm"

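gives (msgmerge ran in a French locale; the errors translate to
"context separator <EOT> inside a string" and "2 fatal errors found")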

> cd bison/po && msgmerge  --lang=bg bg.po bison.pot -o bg.new.po
> bison.pot:581: séparateur de contexte <EOT> à l'intérieur d'une chaîne
> bison.pot:585: séparateur de contexte <EOT> à l'intérieur d'une chaîne
> msgmerge: 2 erreurs fatales trouvées
> msgmerge for bg.po failed!


Otherwise I would suggest

%token EOF _("token|end of input")

and simply have _ implement something like FastGettext's
sgettext.  With C strings, you don't even need to edit the string,
so there's no problem with memory management.
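
A minimal sketch of such a function, assuming "|" separates the
context from the message.  toy_gettext is a made-up one-entry
catalogue standing in for gettext(), and sgettext_sketch is a
hypothetical name.  The fallback simply returns a pointer into the
original msgid, so nothing is copied or allocated.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for gettext(): a one-entry catalogue.  Like
   gettext(), it returns MSGID itself when there is no translation.  */
static const char *
toy_gettext (const char *msgid)
{
  if (strcmp (msgid, "token|end of input") == 0)
    return "fin de fichier";
  return msgid;
}

/* Translate MSGID, which may start with "context|".  When there is
   no translation, return the part after the first '|', or MSGID
   unchanged if it has no '|'.  */
static const char *
sgettext_sketch (const char *msgid)
{
  assert (msgid);
  const char *res = toy_gettext (msgid);
  if (res == msgid)
    {
      /* Untranslated: skip the context prefix, in place.  */
      const char *bar = strchr (msgid, '|');
      if (bar)
        res = bar + 1;
    }
  return res;
}
```

Splitting at the first "|" rather than the last is a choice made for
this sketch, so that a token whose name itself contains "|" keeps its
full name after the context is stripped.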



> Another option would be rather roundabout wordings to make sure the
> token names always occur in the same case and without article, but
> these would generally be less readable (and I'm not sure if even
> possible in every language), something like:
> 
>  "syntax error, the token \"%s\" was unexpected, expected one of
>  the following tokens: %s, ..."

Well, I have grown up in a world of rather terse error messages, so I am
probably biased here.  Again, if there is consensus for something
different, I'll subscribe to it.

Bear in mind though that the translators have already provided
translations for these messages.  Changing them could have some
unexpected impact on projects that benefit from the current translations.



As a final note, here's what I have in mind for the forthcoming
releases wrt error messages.

In Bison 3.7, we shall merge Vincent Imbimbo's implementation of
counterexample generation for conflicts
(https://github.com/akimd/bison/pull/15.  The PR seems stuck, but
we are actually still discussing offline).  To provide this feature,
he had to add sort of a parser emulator in Bison itself (being
able to "run parses" in bison itself).

If I am not mistaken, a lot of the things he needed are those
used by François Pottier in Menhir to implement a full catalogue
of all the possible parser configurations (or "states" if you
wish) and map each one to a specific hand-written error message
(http://gallium.inria.fr/~fpottier/publis/fpottier-reachability-cc2016.pdf).

Sure, writing an error message for each situation will take a lot
of work, but for those who are ready to pay the price, I cannot
imagine any better approach.  One will truly be able to customize
error messages as well as in hand-written parsers.

That's what I'd like to have in Bison 3.8.

Cheers!
