Hi all, just hooking into this thread somewhere.
I've been watching this thread with interest, as I see that discussion coming at me at Mozilla (hi Danilo :-)). My thoughts are somewhat assumption-heavy, so bear with me where I'm wrong. I think that a good deal of the web translation tools actually offer multiple values for each translatable string; I'm not sure about Launchpad in particular, but it may not matter.

Here's my picture: after a piece of software has gone through a web tool, at least one with a low barrier of entry, you end up with something that I call a translation cloud. In this picture, I mostly drop the change pattern and simply look at the result. Each particle in this cloud has various metadata: author, date of creation, possibly "imported from upstream".

The task, it seems to me, is to extract a localization from this translation cloud. I use "localization" here in contrast to "translation" to mean something that satisfies certain software engineering principles: consistency, correctness, tests passed, etc. That seems to be a data mining thing. Democracy on individual entries might be something to bootstrap with, but I would hope that there is way more structure hidden in that data. For example: take the set of the N entries with the most participation and pick a winning localization for each; then take the set of authors that got M% of those strings right; and finally take all the strings where, say, K% of those authors agree. Sounds complicated, but really isn't, once you drop in the numbers to tune: for the 10 most debated strings, pick a winning localization; then pick all localized strings on which the authors that had all winners right agree, and you get a possibly good data set.

In the context of this discussion, one valuable form of metadata would be "imported from upstream", and to grant a significant trust value to those particles in the translation cloud. That keeps random changes out without ruling out improvements.
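To make the bootstrapping idea above concrete, here's a minimal sketch in Python. Everything in it is illustrative, not part of any existing tool: the `suggestions` structure (string id mapped to a list of `(author, translation)` pairs), the function name `extract_localization`, and the tunable thresholds `n`, `m`, `k` are all my assumptions about how such a cloud might be represented.

```python
from collections import Counter

def extract_localization(suggestions, n=10, m=0.8, k=0.9):
    """Bootstrap a trusted localization from a 'translation cloud'.

    suggestions: dict mapping string id -> list of (author, translation).
    n, m, k are the tuning knobs from the sketch above (hypothetical values).
    """
    # 1. Take the n most debated strings (those with the most suggestions).
    debated = sorted(suggestions, key=lambda s: len(suggestions[s]),
                     reverse=True)[:n]

    # 2. Pick a winning translation for each by simple majority vote.
    winners = {}
    for s in debated:
        votes = Counter(t for _, t in suggestions[s])
        winners[s] = votes.most_common(1)[0][0]

    # 3. Trust authors who matched at least a fraction m of the winners
    #    among the debated strings they actually suggested for.
    scores = {}
    for s in debated:
        for author, t in suggestions[s]:
            hit, total = scores.get(author, (0, 0))
            scores[author] = (hit + (t == winners[s]), total + 1)
    trusted = {a for a, (hit, total) in scores.items() if hit / total >= m}

    # 4. Accept a translation wherever at least a fraction k of the
    #    trusted authors who suggested anything for that string agree.
    accepted = {}
    for s, pairs in suggestions.items():
        votes = Counter(t for a, t in pairs if a in trusted)
        if votes:
            best, count = votes.most_common(1)[0]
            if count / sum(votes.values()) >= k:
                accepted[s] = best
    return accepted
```

The "imported from upstream" metadata would slot into step 3 as a trust bonus, so upstream particles survive the vote unless the trusted authors clearly improve on them.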
A different approach would be to not look at the end particles, but rather at the changes. Each edit would be a branch, and it might be interesting to ask the version control system authors what they know about their algebras and merging, to come up with valuable output from low-barrier systems. The fact that we're really dealing with a huge number of not-so-structured branches makes me favour the data mining idea, but then again, what do I know about the algebras that the distributed version control folks have, and whether I'd even understand what they were saying if they talked about it.

Anyway, I think it's worthwhile to focus on how to gain output from low-barrier systems and to create measures of confidence for the translated strings that come out of them. I guess most of the folks that are actually hacking on the tools side will be at FOSDEM, so if this makes sense to you, you might want to spoil a beer or two chatting about this. Sadly, I won't be able to join; I'll try next time.

Axel

PS: If I CCed someone not reading the original thread, sorry. http://mail.gnome.org/archives/gnome-i18n/2008-January/thread.html#00227 has the context.

_______________________________________________
gnome-i18n mailing list
gnome-i18n@gnome.org
http://mail.gnome.org/mailman/listinfo/gnome-i18n