Note that some language pairs could easily be 100% correct.

On Wed, May 3, 2017 at 1:06 PM, David Cuenca Tudela <dacu...@gmail.com> wrote:
> Perhaps it would be a good idea to compare the translated text to the
> text that the user wants to save.
>
> If they are more than 95% the same, that means that the user didn't
> take the effort to correct the text.
>
> Cheers,
> Micru
>
> On Wed, May 3, 2017 at 10:31 AM, Wojciech Pędzich <wpedz...@gmail.com>
> wrote:
>
> > It does depend a lot on the engagement level of the human behind the
> > keyboard. When I deal with machine-translated text, I simply wonder
> > whether the person behind the keyboard made the effort to actually
> > read the piece.
> >
> > Now whether this would work if limited to namespaces outside "main" -
> > I do not want to demonise the issue, but if the person submitting the
> > text for machine translation does not read it, what will stop them
> > from a quick ctrl+c / ctrl+v? Just asking.
> >
> > Wojciech
> >
> > On 2017-05-03 at 09:33, Yaroslav Blanter wrote:
> >
> >> Creating machine translations only in the draft space (or in the
> >> user space in the projects which do not have draft) could help.
> >>
> >> Cheers
> >> Yaroslav
> >>
> >> On Tue, May 2, 2017 at 10:16 PM, Pharos <pharosofalexand...@gmail.com>
> >> wrote:
> >>
> >>> I think it all depends on the level of engagement of the human
> >>> translator.
> >>>
> >>> When the tool is used in the right way, it is a fantastic tool.
> >>>
> >>> Maybe we can find better methods to nudge people toward taking
> >>> their time and really doing work on their translations.
> >>>
> >>> Thanks,
> >>> Pharos
> >>>
> >>> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal
> >>> <bodhisattwa.rg...@gmail.com> wrote:
> >>>
> >>>> Content translation with Yandex is also a problem in Bengali
> >>>> Wikipedia. Some users have developed a tendency to create
> >>>> meaningless machine-translated articles with this extension to
> >>>> increase their edit count and article count. This has increased
> >>>> the workload of admins, who have to find and delete those
> >>>> articles.
> >>>>
> >>>> Yandex is not ready for many languages and it is better to shut it
> >>>> off. We don't need it in Bengali.
> >>>>
> >>>> Regards
> >>>>
> >>>> On May 3, 2017 12:17 AM, "John Erling Blad" <jeb...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Actually this _is_ about turning ContentTranslation off, that is
> >>>>> what several users in the community want. They block people using
> >>>>> the extension and delete the translated articles. Use of
> >>>>> ContentTranslation has become a rather contentious case.
> >>>>>
> >>>>> Yandex as a general translation engine to be able to read some
> >>>>> alien language is quite good, but as an engine to produce written
> >>>>> text it is not very good at all. In fact it often creates quite
> >>>>> horrible Norwegian, even for closely related languages. One quite
> >>>>> common problem is reordering of words into meaningless
> >>>>> constructs, another problem is reordering lexical gender in weird
> >>>>> ways. The English article "a" is often translated as "en" in a
> >>>>> prepositional phrase, and then the gender is added to the
> >>>>> following phrase. That gives a translation of "Oppland is a
> >>>>> county in…" into something like "Oppland er en fylket i…" This
> >>>>> should be "Oppland er et fylke i…".
> >>>>>
> >>>>> (I just checked and it seems like Yandex messes up a lot less now
> >>>>> than previously, but it is still pretty bad.)
> >>>>>
> >>>>> Apertium works because the languages are closely related, Yandex
> >>>>> does not work because it is used between very different
> >>>>> languages. People try to use Yandex and get disappointed, and
> >>>>> falsely conclude that all language translations are equally
> >>>>> weird. They are not, but Yandex translations are weird.
> >>>>>
> >>>>> The numerical threshold does not work. The reason is simple, the
> >>>>> number of fixes depends on language constructs that fail, and
> >>>>> that is simply not a constant for small text fragments. Perhaps
> >>>>> we could flag specific language constructs that are known to give
> >>>>> a high percentage of failures, and require the translator to
> >>>>> check those sentences. One such language construct is
> >>>>> discrepancies between the article and the gender of the following
> >>>>> term in a prepositional phrase. If they are not similar, then the
> >>>>> sentence must be checked. It is not always wrong to write "en
> >>>>> jenta" in Norwegian, but it is likely to be wrong.
> >>>>>
> >>>>> A language model could be a statistical model for the language
> >>>>> itself, not for the translation into that language. We don't want
> >>>>> a perfect language model, but a sufficient language model to mark
> >>>>> weird constructs. A very simple solution could simply be to mark
> >>>>> tri-grams that do not already exist in the text base for the
> >>>>> destination as possible errors. It is not necessary to do a live
> >>>>> check, but at least do it before the page can be saved.
> >>>>>
> >>>>> Note the difference in what Yandex does and what we want to
> >>>>> achieve; Yandex translates a text between two different
> >>>>> languages, without any clear reason why. It is not too important
> >>>>> if there are weird constructs in the text, as long as it is
> >>>>> usable in "some" context. We translate a text for the purpose of
> >>>>> republishing it. The text should be usable and easily readable in
> >>>>> that language.
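To make the tri-gram idea quoted above a bit more concrete, here is a rough sketch of what such a check might look like. It is only an illustration: the function names, the tokenizer, and the two-sentence toy corpus are all assumptions made up for this example, and a real check would need a large text base for the destination language (for instance built from a Wikipedia dump).

```python
import re


def tokenize(text):
    # Lowercase and split into word tokens; \w is Unicode-aware in
    # Python 3, so words such as "Bokmål" stay in one piece.
    return re.findall(r"\w+", text.lower())


def trigrams(tokens):
    # All runs of three consecutive tokens, in order.
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_trigram_set(corpus_sentences):
    # Hypothetical "text base": every tri-gram seen in the corpus.
    known = set()
    for sentence in corpus_sentences:
        known.update(trigrams(tokenize(sentence)))
    return known


def unseen_trigrams(text, known):
    # Tri-grams in the candidate text that never occur in the text base.
    # These are only *possible* errors, to be shown to the translator.
    return [t for t in trigrams(tokenize(text)) if t not in known]


# Toy corpus, invented for this example only.
corpus = [
    "Oppland er et fylke i Norge",
    "Hedmark er et fylke i Norge",
]
known = build_trigram_set(corpus)

flagged = unseen_trigrams("Oppland er en fylket i Norge", known)
for tri in flagged:
    print("check:", " ".join(tri))
```

With a corpus this small almost everything is flagged, which is the main practical caveat: the text base has to be large enough that unseen tri-grams are genuinely unusual. The check would run before saving rather than live, as suggested above.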
> >>>>>
> >>>>> On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni
> >>>>> <amir.ahar...@mail.huji.ac.il> wrote:
> >>>>>
> >>>>>> 2017-05-02 18:20 GMT+03:00 John Erling Blad <jeb...@gmail.com>:
> >>>>>>
> >>>>>>> Brute force solution; turn the ContentTranslation off. Really
> >>>>>>> stupid solution.
> >>>>>>
> >>>>>> ... Then I guess you don't mind that I'm changing the thread
> >>>>>> name :)
> >>>>>>
> >>>>>>> The next solution; turn the Yandex engine off. That would solve
> >>>>>>> a part of the problem. Kind of lousy solution though.
> >>>>>>>
> >>>>>>> What about adding a language model that warns when the language
> >>>>>>> constructs get too weird? It is like a "test" for the
> >>>>>>> translation. The CT is used for creating a translation, but the
> >>>>>>> language model is used for verifying if the translation is good
> >>>>>>> enough. If it does not validate against the language model it
> >>>>>>> should simply not be published to the main name space. It will
> >>>>>>> still be possible to create a draft, but then the user is
> >>>>>>> completely aware that the translation isn't good enough.
> >>>>>>>
> >>>>>>> Such a language model should be available as a test for any
> >>>>>>> article, as it can be used as a quality measure for the
> >>>>>>> article. It is really a quantity measure for the
> >>>>>>> well-spokenness of the article, but that isn't quite so
> >>>>>>> intuitive.
> >>>>>>
> >>>>>> So, I'll allow myself to guess that you are talking about one
> >>>>>> particular language, probably Norwegian.
> >>>>>>
> >>>>>> Several technical facts:
> >>>>>>
> >>>>>> 1. In the past there were several cases in which translators
> >>>>>> into different languages reported common translation mistakes to
> >>>>>> me. I passed them on to Yandex developers, with whom I
> >>>>>> communicate quite regularly. They acknowledged receiving all of
> >>>>>> them. I am aware of at least one such common mistake that was
> >>>>>> fixed; possibly there were more. If you can give me a list of
> >>>>>> such mistakes for Norwegian, I'll be very happy to pass them on.
> >>>>>> I absolutely cannot promise that they will be fixed upstream,
> >>>>>> but it's possible.
> >>>>>>
> >>>>>> 2. In Norwegian, Apertium is used for translating between the
> >>>>>> two varieties of Norwegian itself (Bokmål and Nynorsk), and from
> >>>>>> other Scandinavian languages. That's probably why it works so
> >>>>>> well: they are similar in grammar, vocabulary, and narrative
> >>>>>> style (I'll pass it on to Apertium developers; I'm sure they'll
> >>>>>> be happy to hear it). Unfortunately, machine translation from
> >>>>>> English is not available in Apertium. Apertium works best with
> >>>>>> very similar languages, and English has two characteristics,
> >>>>>> which are unfortunate when combined: it is both the most popular
> >>>>>> source for translation into almost all other languages
> >>>>>> (including Norwegian), and it is not _very_ similar to any other
> >>>>>> languages (except maybe Scots). Machine translation from English
> >>>>>> into Norwegian is only possible with Yandex at the moment. More
> >>>>>> engines may be added in the future, but at the moment that's all
> >>>>>> we have. That's why disabling Yandex completely would indeed be
> >>>>>> a lousy solution: A lot of people say that without machine
> >>>>>> translation integration Content Translation is useless. Not all
> >>>>>> users think like that, but many do.
> >>>>>>
> >>>>>> 3. We can define a numerical threshold of acceptable percentage
> >>>>>> of machine translation post-editing. Currently it's 75%. It's a
> >>>>>> tad embarrassing that it's hard-coded at the moment, but it can
> >>>>>> very easily be made into a variable per language. If the
> >>>>>> translator tries to publish a page in which less than that is
> >>>>>> modified, a warning will be shown.
> >>>>>>
> >>>>>> 4. I'm not sure what you mean by "language model". If it's any
> >>>>>> kind of a linguistic engine, then it's definitely not within the
> >>>>>> resources that the Language team itself can currently dedicate.
> >>>>>> However, if somebody who knows Norwegian and some programming
> >>>>>> will write a script that analyzes common bad constructs in a
> >>>>>> Wikipedia dump, this will be very useful. This would basically
> >>>>>> be an upgraded version of suggestion #1 above. (In my spare time
> >>>>>> as a volunteer I'm doing something comparable for Hebrew,
> >>>>>> although not for translation, but for improving how MediaWiki
> >>>>>> link trails work.)
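For what it's worth, the numerical threshold discussed in this thread could be approximated along the following lines. This is a sketch under stated assumptions, not how ContentTranslation actually computes its figure: it uses Python's standard difflib to estimate how much of the machine output survives unchanged at word level, and it reads the threshold as "warn when more than this share is unmodified", which is only one possible reading of the 75% mentioned above. The function names and the default value are hypothetical.

```python
from difflib import SequenceMatcher


def unmodified_ratio(machine_text, edited_text):
    # Word-level similarity in [0, 1]; 1.0 means the translator changed
    # nothing. autojunk is disabled so that frequent words in long
    # texts are not silently ignored by the matcher.
    a = machine_text.split()
    b = edited_text.split()
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def needs_warning(machine_text, edited_text, max_unmodified=0.75):
    # max_unmodified is an assumed, per-language tunable value.
    return unmodified_ratio(machine_text, edited_text) > max_unmodified


machine = "Oppland er en fylket i Norge"
edited = "Oppland er et fylke i Norge"

print(round(unmodified_ratio(machine, edited), 3))
print(needs_warning(machine, machine))   # untouched machine output
print(needs_warning(machine, edited))    # two of six words corrected
```

Making max_unmodified a per-language setting would also leave room for language pairs where the machine output is nearly always correct, and for short fragments where a fixed percentage says little.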
> --
> Etiamsi omnes, ego non
_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>