Re: [Freeipa-devel] i18n infrastructure improvements

John Dennis Fri, 11 Jan 2013 08:23:34 -0800

On 01/11/2013 10:04 AM, Petr Viktorin wrote:

Hello list,
This discussion was started in private; I'll continue it here.


On 01/10/2013 05:41 PM, John Dennis wrote:

On 01/10/2013 04:27 AM, Petr Viktorin wrote:

On 01/09/2013 03:55 PM, John Dennis wrote:

And I could work on improving the i18n/translations infrastructure,
starting by writing up an RFE+design.

Could you elaborate as to what you perceive as the current problems and
what this work would address.

Here are my notes:

- Use fake translations for tests


We already do (but perhaps not sufficiently).


I mean use it in *all* tests, to ensure all the right things are
translated and weird characters are handled well.
See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html

Ah yes, I like the idea of a test domain for strings, this is a good idea. Not only would it exercise our i18n code more but it could insulate the tests from string changes (the test would look for a canonical string in the test domain).
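
A minimal sketch of how a test domain of this kind might work, assuming every "translation" is simply the msgid wrapped in non-ASCII markers. The names fake_translate and is_fake_translated and the marker characters are hypothetical and are not taken from the existing IPA test code.

```python
# Hypothetical markers; any non-ASCII pair would do. Non-ASCII characters
# also exercise the encoding paths the tests are meant to cover.
PREFIX = "\u00bb"  # right-pointing double angle quotation mark
SUFFIX = "\u00ab"  # left-pointing double angle quotation mark


def fake_translate(msgid):
    """Return a pseudo-translation of msgid for the test domain."""
    return PREFIX + msgid + SUFFIX


def is_fake_translated(text):
    """Return True if text looks like it went through fake_translate()."""
    return (
        len(text) >= len(PREFIX) + len(SUFFIX)
        and text.startswith(PREFIX)
        and text.endswith(SUFFIX)
    )
```

A test could then assert is_fake_translated(output) on user-visible strings: a string that never went through the translation machinery shows up unwrapped, and the test does not break when the English wording changes.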

- Split up huge strings so the entire text doesn't have to be
retranslated each time something changes/is added


Good idea. But one question I have is should we be optimizing for our
programmers' time or the translators' time? The Transifex tool should make
available to translators similar existing translations (in fact it
might, I seem to recall some functionality in this area). Wouldn't it be
better to address this issue in Transifex where all projects would benefit?

Also the exact same functionality is needed to support release versions.
The strings between releases are often close but not identical. The
Transifex tool should make available a close match from a previous
version to the translator working on a new version (or vice versa). See
your issue below concerning versions.

IMHO this is a Transifex issue which needs to be solved there, not
something we should be investing precious IPA programmers' time on. Plus
if it's solved in Transifex it's a *huge* win for *everyone*, not just IPA.


Huh? Splitting the strings provides additional information
(paragraph/context boundaries) that Transifex can't get otherwise. From
what I hear it's a pretty standard technique when working with gettext.

I'm not sure how splitting text into smaller units gives more context but I can see the argument for each msgid being a logical paragraph. We don't have too many multi-paragraph strings now so it shouldn't be too involved.


For typos, gettext has the "fuzzy" functionality that we explicitly turn
off. I think we're on our own here.

Be very afraid of turning on fuzzy matching. Before we moved to TX we used the entire GNU tool chain. I discovered a number of our PO files were horribly corrupted. With a lot of work I traced this down to fuzzy matches. If memory serves me right here is what happened.

When a msgstr was absent a fuzzy match was performed and inserted as a candidate msgstr. Somehow the fuzzy candidates got accepted as actual msgstrs. I'm not sure if we ever figured out how this happened. The two most likely explanations were 1) a known bug in TX that stripped the fuzzy flag off the msgstr or 2) a translator who blindly accepted all "TX suggestions". (A suggestion in TX comes from a fuzzy match).

But the real problem is the fuzzy matching is horribly bad. Most of the fuzzy suggestions (primarily on short strings) were wildly incorrect.

I had to go back to a number of PO files and manually locate all fuzzy suggestions that had been promoted to legitimate msgstrs. A tedious process I hope to never repeat.

BTW, if memory serves me correctly the fuzzy suggestions got into the PO files in the first place because we were running the full GNU tool chain (sorry, off the top of my head I don't recall exactly which component inserts the fuzzy suggestion), but I think we've since turned that off, for a very good reason.

- Keep a history/repo of the translations, since Transifex only stores
the latest version


We already do keep a history, it's in git.


It's not updated often enough. If I mess something up before a release
and Transifex gets wiped, or if a rogue translator deletes some
translations, the work is gone.


Yes, updating more frequently is an excellent goal.

- Update the source strings on Transifex more often (ideally as soon as
patches are pushed)


Yes, great idea, this would be really useful and is necessary.

- Break Git dependencies: make it possible to generate the POT in an
unpacked tarball


Are you talking about the fact our scripts invoke git to determine what
files to process? If so then yes, this would be a good dependency to get
rid of. However it does mean we somehow have to maintain a manifest list
of some sort somewhere.


A directory listing is fine IMO. We use it for more critical things,
like loading plugins, without any trouble.
Also, when run in a Git repo the Makefile can compare the file list with
what Git says and warn accordingly.
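
A rough sketch of that comparison, assuming it is acceptable to shell out to `git ls-files` when a checkout is present and to skip the check silently otherwise. The function names are hypothetical; nothing here is taken from the existing Makefile or scripts.

```python
import os
import subprocess


def list_tree(root):
    """Return relative paths of all files under root, skipping .git."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def list_git(root):
    """Return the paths Git tracks under root, or None if Git is unusable."""
    try:
        result = subprocess.run(
            ["git", "ls-files"],
            cwd=root,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Not a Git checkout (e.g. an unpacked tarball) or git not installed.
        return None
    return set(result.stdout.splitlines())


def compare(found, tracked):
    """Return (files on disk but untracked, files tracked but missing)."""
    return sorted(found - tracked), sorted(tracked - found)
```

In an unpacked tarball list_git() returns None and only the directory listing is used; in a Git checkout the two lists from compare() can be printed as warnings.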

How do you know which files in a directory should be scanned for translations? Perhaps it doesn't hurt to scan every file, we never tried scanning inappropriate files so I don't know what the consequence would be.

A little history: Originally there was a manifest of every file to be scanned. Simo made some fixes to the i18n infrastructure and didn't like the manifest idea (it has an obvious downside, it has to be updated when source files are added/deleted/moved). Simo used git to get a list of source files. But that mechanism depended on identifying a source file via its filename extension. Our scripts don't have an extension so those were hardcoded just like the original manifest. It didn't eliminate the maintenance problem, it just made it smaller. To my mind that was the worst of both worlds, it introduced a git dependency but didn't solve the manual manifest problem.

One could do something else, if a file doesn't have an extension you can read the beginning of the file and look for a shebang (#!) interpreter line, that would identify the file as a script and hence a translation candidate.
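
The shebang check could be sketched roughly as follows. The helper name looks_like_script is hypothetical, and the sketch only tests for the two bytes "#!" without checking which interpreter is named.

```python
import os


def looks_like_script(path):
    """Return True if path has no filename extension and starts with '#!'."""
    if os.path.splitext(path)[1]:
        # Files with an extension are classified by that extension instead.
        return False
    try:
        with open(path, "rb") as f:
            return f.read(2) == b"#!"
    except OSError:
        # Unreadable or missing files are not translation candidates.
        return False
```

Reading only the first two bytes in binary mode keeps the check cheap and avoids decoding errors on binary files.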

But perhaps your idea of scan everything and throw away (or ignore) anything which won't scan correctly because it's not valid input is the best. As I said I don't think we ever explored that, but it might also be because in some instances we have to tell the scanner what type of file the input is (a chicken-n-egg problem). But it's an interesting idea and we should see how it works in practice.

- Figure out how to best share messages across versions (2.x vs. 3.x) so
they only have to be translated once


There is a crying need for this, but isn't this a Transifex issue? Why
would we be solving this in IPA? What about SSSD and every other project,
they all have identical issues. As far as I can tell Transifex has never
addressed this issue sufficiently (see above) and the onus is on them to
do so.


I don't think waiting for Transifex will solve the problem.


Then what is your suggestion?

Pull every msgid from every version, put it into one massive unified POT file and then split the resulting unified PO files back into version-specific PO files?

Well I suppose we wouldn't have to split the PO files, they could just contain translations that are never referenced, it would make them larger but wouldn't hurt anything.

Of course merging the strings from every version into one unified POT would play havoc with the msgid references (where is the string located) unless the filename was modified to include its git branch.
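
To make the branch-qualified reference idea concrete, here is a small sketch. It assumes each version's catalog has already been parsed into a mapping of msgid to a list of "file:line" references; the function name and the branch names in the usage note are hypothetical.

```python
def merge_catalogs(catalogs):
    """Merge per-branch catalogs into one, prefixing references by branch.

    catalogs maps a branch name to a dict of msgid -> list of "file:line"
    references. The result maps each msgid to all of its references, each
    rewritten as "branch/file:line" so its origin stays unambiguous.
    """
    merged = {}
    for branch in sorted(catalogs):
        for msgid, refs in catalogs[branch].items():
            merged.setdefault(msgid, []).extend(
                f"{branch}/{ref}" for ref in refs
            )
    return merged
```

For example, a msgid present in both a "v2" and a "v3" catalog ends up as one entry carrying two references, one per branch, so it would only need to be translated once.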


Just thinking off the top of my head.

- Clean up checked-in PO files even more, for nicer diffs


A nice feature, but I'm wondering to what extent we're currently suffering
because of this. It's rare that we have to compare PO files. Plus diff
is not well suited for comparing PO's because PO files with equivalent
data can be formatted differently. That's why I wrote some tools to read
PO files, normalize the contents and then do a comparison. Anyway my top
level question is is this something we really need at this point?


You're right that files have to be normalized to diff well. That's
actually the point here :)
Anyway I'm just thinking of sorting the PO alphabetically - an extra
option to msgattrib should do it.

- Automate & document the process so any dev can do it


Excellent goal, we're not too far from it now, but of all the things on
the list this is the most important.



--
John Dennis <jden...@redhat.com>

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel
