Hi all,

just hooking this up somewhere in this thread.

I've been watching this thread with interest, as I see that discussion
coming at me at Mozilla (hi Danilo :-))

My thoughts are somewhat assumption-heavy, so bear with me when I'm wrong.

I think that a good deal of the web translation tools actually offer
multiple values for each translatable string. I'm not sure about
Launchpad in particular, but it may not matter.

Here's my picture:

After a piece of software has gone through a web tool, at least
through one with a low barrier of entry, you end up with something
that I call a translation cloud. In this picture, I'm mostly ignoring
the pattern of changes and simply looking at the result. Each particle
in this cloud has various metadata, such as author, date of creation,
and possibly an "imported from upstream" flag.

The task, it seems to me, is to extract a localization from this
translation cloud. I use "localization" here, in contrast to
"translation", to mean something that satisfies certain software
engineering principles, like consistency, correctness, passing tests,
etc.

That seems to be a data mining problem. Democracy on individual entries
might be something to bootstrap with, but I would hope that there is
way more structure hidden in that data. For example, you could take
the N entries with the most participation and pick a winning
localization for each. Then you take the set of authors that got M% of
those strings right, and finally all the strings on which, say, K% of
those authors agree. That sounds complicated, but it really isn't once
you drop the tunable numbers: for the 10 most debated strings, pick a
winning localization. Then pick all localized strings that are the
same across all the authors that had all winners right, and you get a
possibly good data set.
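
To make that a bit more concrete, here is a rough Python sketch of the
bootstrapping idea. Everything in it is an assumption of mine for
illustration: the cloud is a flat list of (string id, author, text)
tuples with at most one suggestion per author and string, the function
names are made up, and the "winners" for the most debated strings are
picked by a human reviewer, not by the code.

```python
from collections import Counter, defaultdict

def most_debated(cloud, n):
    """Return the n string ids with the most distinct authors."""
    authors = defaultdict(set)
    for string_id, author, _text in cloud:
        authors[string_id].add(author)
    # Most participation first, ties broken by id to stay deterministic.
    return sorted(authors, key=lambda s: (-len(authors[s]), s))[:n]

def trusted_authors(cloud, winners, m=1.0):
    """Authors whose suggestions match at least fraction m of the winners."""
    hits = defaultdict(set)
    for string_id, author, text in cloud:
        if winners.get(string_id) == text:
            hits[author].add(string_id)
    return {a for a, ids in hits.items() if len(ids) >= m * len(winners)}

def extract(cloud, trusted, k=1.0):
    """Keep strings where at least fraction k of the trusted votes agree."""
    votes = defaultdict(Counter)
    for string_id, author, text in cloud:
        if author in trusted:
            votes[string_id][text] += 1
    result = {}
    for string_id, counter in votes.items():
        text, count = counter.most_common(1)[0]
        if count >= k * sum(counter.values()):
            result[string_id] = text
    return result

# Hypothetical toy cloud: (string id, author, suggested text).
cloud = [
    ("menu.file", "ann", "Datei"),
    ("menu.file", "bob", "Datei"),
    ("menu.file", "eve", "Akte"),
    ("menu.save", "ann", "Speichern"),
    ("menu.save", "bob", "Speichern"),
    ("menu.save", "eve", "Sichern"),
    ("menu.help", "ann", "Hilfe"),
    ("menu.help", "bob", "Hilfe"),
    ("menu.edit", "ann", "Bearbeiten"),
    ("menu.edit", "bob", "Editieren"),
]

debated = most_debated(cloud, 2)
# The winners for the debated strings come from a human reviewer.
winners = {"menu.file": "Datei", "menu.save": "Speichern"}
trusted = trusted_authors(cloud, winners)
localization = extract(cloud, trusted)
```

With m and k both at 1.0 this is the "no numbers to tune" variant: only
authors that had all winners right are trusted, and a string only makes
it into the result if all of them agree on it. In the toy data,
"menu.edit" drops out because the two trusted authors disagree.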

In the context of this discussion, one valuable piece of metadata
would be "imported from upstream", granting a significant trust value
to those 'particles' in the translation cloud. That keeps random
changes out without ruling out improvements.
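
A minimal sketch of how such a trust value could work, again in Python
and again with assumptions of mine: each particle for one string is a
(text, from_upstream) pair, an upstream particle counts as much as
three ordinary suggestions (the 3.0 is an arbitrary number to tune),
and the confidence is simply the winner's share of the total weight.

```python
from collections import defaultdict

UPSTREAM_WEIGHT = 3.0  # assumed value, purely illustrative

def weighted_winner(particles, upstream_weight=UPSTREAM_WEIGHT):
    """Pick the text with the highest total weight for one string.

    particles is an iterable of (text, from_upstream) pairs.
    Returns (winning_text, confidence), where confidence is the
    winner's share of the total weight, or (None, 0.0) if empty.
    """
    scores = defaultdict(float)
    for text, from_upstream in particles:
        scores[text] += upstream_weight if from_upstream else 1.0
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

# A single random change does not beat the upstream translation...
lone_change = [("Speichern", True), ("Sichern", False)]
# ...but an improvement backed by enough people still can.
broad_support = [("Speichern", True)] + [("Sichern", False)] * 4
```

Here weighted_winner(lone_change) keeps the upstream text, while
weighted_winner(broad_support) lets the alternative win, which is the
"keeps random changes out without ruling out improvements" part.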

A different approach would be to not look at the end particles, but
rather to look at changes. Each 'edit' would be a branch, and it
might be interesting to look at what the version control system
authors know about their algebras and merging to come up with valuable
output from low-barrier systems. The fact that we're really dealing
with a huge amount of not so structured branches makes me favour the
data mining idea, but then again, what do I know about the algebras
that the distributed version control system folks have? And who knows
whether I'd understand what they're saying if they talked about it.

Anyway, I think it's worthwhile to focus on how to gain output out of
low-barrier systems and create measures of confidence for translated
strings from them.

I guess most of the folks that are actually hacking on the tools side
will be at FOSDEM, so if this makes sense to you, you might want to
spend a beer or two chatting about this. Sadly, I won't be able to
join. I'll try next time.

Axel

PS: If I CCed someone not reading the original thread, sorry.
http://mail.gnome.org/archives/gnome-i18n/2008-January/thread.html#00227
has the context.
_______________________________________________
gnome-i18n mailing list
gnome-i18n@gnome.org
http://mail.gnome.org/mailman/listinfo/gnome-i18n
