Perhaps it would be a good idea to compare the translated text to the text that the user wants to save.
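[Editorial sketch: a comparison like the one suggested here could be done with a simple character-level similarity ratio. The helper names and the 0.95 cutoff below are illustrative assumptions, not an existing ContentTranslation API.]

```python
# Hypothetical sketch: flag a translation the user barely post-edited,
# by comparing the raw machine output with the text being saved.
# Uses Python's standard difflib; 0.95 mirrors the suggestion in this
# mail and is an assumption, not a real ContentTranslation setting.
from difflib import SequenceMatcher

def unedited_ratio(machine_text: str, user_text: str) -> float:
    """How similar the user's text is to the raw machine output (0..1)."""
    return SequenceMatcher(None, machine_text, user_text).ratio()

def looks_unedited(machine_text: str, user_text: str,
                   cutoff: float = 0.95) -> bool:
    """True if the user changed so little that the text looks unedited."""
    return unedited_ratio(machine_text, user_text) >= cutoff

mt = "Oppland er en fylket i Norge."     # raw machine output
fixed = "Oppland er et fylke i Norge."   # human-corrected version
print(looks_unedited(mt, mt))      # identical text -> True
print(looks_unedited(mt, fixed))   # small real corrections lower the ratio
```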
If they are more than 95% the same, that would mean that the user didn't
make the effort to correct the text.

Cheers,
Micru

On Wed, May 3, 2017 at 10:31 AM, Wojciech Pędzich <wpedz...@gmail.com> wrote:
> It does depend a lot on the engagement level of the human behind the
> keyboard. When I deal with machine-translated text, I simply wonder
> whether the person behind the keyboard took the effort to actually read
> the piece.
>
> Now, whether this would work if limited to namespaces outside "main" -
> I do not want to demonise the issue, but if the person submitting the
> text for machine translation does not read it, what will stop them from
> a quick ctrl+c / ctrl+v? Just asking.
>
> Wojciech
>
> On 2017-05-03 at 09:33, Yaroslav Blanter wrote:
>
>> Creating machine translations only in the draft space (or in the user
>> space in the projects which do not have a draft space) could help.
>>
>> Cheers
>> Yaroslav
>>
>> On Tue, May 2, 2017 at 10:16 PM, Pharos
>> <pharosofalexand...@gmail.com> wrote:
>>
>>> I think it all depends on the level of engagement of the human
>>> translator.
>>>
>>> When the tool is used in the right way, it is a fantastic tool.
>>>
>>> Maybe we can find better methods to nudge people toward taking their
>>> time and really working on their translations.
>>>
>>> Thanks,
>>> Pharos
>>>
>>> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal
>>> <bodhisattwa.rg...@gmail.com> wrote:
>>>
>>>> Content translation with Yandex is also a problem on the Bengali
>>>> Wikipedia. Some users have developed a tendency to create
>>>> meaningless machine-translated articles with this extension to
>>>> increase their edit and article counts. This has increased the
>>>> workload of admins, who have to find and delete those articles.
>>>>
>>>> Yandex is not ready for many languages, and it is better to shut it
>>>> off. We don't need it in Bengali.
>>>>
>>>> Regards
>>>>
>>>> On May 3, 2017 12:17 AM, "John Erling Blad" <jeb...@gmail.com> wrote:
>>>>
>>>>> Actually, this _is_ about turning ContentTranslation off; that is
>>>>> what several users in the community want. They block people using
>>>>> the extension and delete the translated articles. Use of
>>>>> ContentTranslation has become a rather contentious case.
>>>>>
>>>>> Yandex as a general translation engine, used to be able to read
>>>>> some alien language, is quite good, but as an engine to produce
>>>>> written text it is not very good at all. In fact it often creates
>>>>> quite horrible Norwegian, even for closely related languages. One
>>>>> quite common problem is reordering of words into meaningless
>>>>> constructs; another is handling lexical gender in weird ways. The
>>>>> English article "a" is often translated as "en" in a prepositional
>>>>> phrase, and then the gender is added to the following word. That
>>>>> turns a translation of "Oppland is a county in…" into something
>>>>> like "Oppland er en fylket i…", when it should be "Oppland er et
>>>>> fylke i…".
>>>>>
>>>>> (I just checked, and it seems like Yandex messes up a lot less now
>>>>> than previously, but it is still pretty bad.)
>>>>>
>>>>> Apertium works because the languages are closely related; Yandex
>>>>> does not work because it is used between very different languages.
>>>>> People try Yandex, get disappointed, and falsely conclude that all
>>>>> machine translations are equally weird. They are not, but Yandex
>>>>> translations are.
>>>>>
>>>>> The numerical threshold does not work. The reason is simple: the
>>>>> number of fixes depends on which language constructs fail, and that
>>>>> is simply not a constant for small text fragments.
>>>>> Perhaps we could flag specific language constructs that are known
>>>>> to give a high percentage of failures, and require the translator
>>>>> to check those sentences. One such construct is a mismatch between
>>>>> the preposition and the gender of the following word in a
>>>>> prepositional phrase: if they do not agree, the sentence must be
>>>>> checked. It is not always wrong to write "en jenta" in Norwegian,
>>>>> but it is likely to be wrong.
>>>>>
>>>>> A language model could be a statistical model of the language
>>>>> itself, not of the translation into that language. We don't want a
>>>>> perfect language model, just one sufficient to mark weird
>>>>> constructs. A very simple solution could be to mark trigrams that
>>>>> do not already exist in the text base for the destination language
>>>>> as possible errors. It is not necessary to do a live check, but it
>>>>> should at least be done before the page can be saved.
>>>>>
>>>>> Note the difference between what Yandex does and what we want to
>>>>> achieve: Yandex translates a text between two different languages,
>>>>> without any clear purpose in mind. It is not too important whether
>>>>> there are weird constructs in the text, as long as it is usable in
>>>>> "some" context. We translate a text for the purpose of republishing
>>>>> it. The text should be usable and easily readable in that language.
>>>>>
>>>>> On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni
>>>>> <amir.ahar...@mail.huji.ac.il> wrote:
>>>>>
>>>>>> 2017-05-02 18:20 GMT+03:00 John Erling Blad <jeb...@gmail.com>:
>>>>>>
>>>>>>> Brute force solution: turn ContentTranslation off. Really stupid
>>>>>>> solution.
>>>>>>
>>>>>> ...
>>>>>> Then I guess you don't mind that I'm changing the thread name :)
>>>>>>
>>>>>>> The next solution: turn the Yandex engine off. That would solve
>>>>>>> a part of the problem. Kind of a lousy solution, though.
>>>>>>>
>>>>>>> What about adding a language model that warns when the language
>>>>>>> constructs get too weird? It is like a "test" for the
>>>>>>> translation. CT is used for creating a translation, but the
>>>>>>> language model is used for verifying whether the translation is
>>>>>>> good enough. If it does not validate against the language model,
>>>>>>> it should simply not be published to the main namespace. It will
>>>>>>> still be possible to create a draft, but then the user is
>>>>>>> completely aware that the translation isn't good enough.
>>>>>>>
>>>>>>> Such a language model should be available as a test for any
>>>>>>> article, as it can be used as a quality measure for the article.
>>>>>>> It is really a quantitative measure of the well-spokenness of
>>>>>>> the article, but that isn't quite so intuitive.
>>>>>>
>>>>>> So, I'll allow myself to guess that you are talking about one
>>>>>> particular language, probably Norwegian.
>>>>>>
>>>>>> Several technical facts:
>>>>>>
>>>>>> 1. In the past there were several cases in which translators to
>>>>>> different languages reported common translation mistakes to me. I
>>>>>> passed them on to Yandex developers, with whom I communicate quite
>>>>>> regularly. They acknowledged receiving all of them. I am aware of
>>>>>> at least one such common mistake that was fixed; possibly there
>>>>>> were more.
>>>>>> If you can give me a list of such mistakes for Norwegian, I'll be
>>>>>> very happy to pass them on. I absolutely cannot promise that they
>>>>>> will be fixed upstream, but it's possible.
>>>>>>
>>>>>> 2. In Norwegian, Apertium is used for translating between the two
>>>>>> varieties of Norwegian itself (Bokmål and Nynorsk), and from other
>>>>>> Scandinavian languages. That's probably why it works so well: they
>>>>>> are similar in grammar, vocabulary, and narrative style. (I'll
>>>>>> pass that on to Apertium developers; I'm sure they'll be happy to
>>>>>> hear it.) Unfortunately, machine translation from English is not
>>>>>> available in Apertium. Apertium works best with very similar
>>>>>> languages, and English has two characteristics which are
>>>>>> unfortunate when combined: it is both the most popular source for
>>>>>> translation into almost all other languages (including Norwegian),
>>>>>> and it is not _very_ similar to any other language (except maybe
>>>>>> Scots). Machine translation from English into Norwegian is only
>>>>>> possible with Yandex at the moment. More engines may be added in
>>>>>> the future, but for now that's all we have. That's why disabling
>>>>>> Yandex completely would indeed be a lousy solution: a lot of
>>>>>> people say that without machine translation integration Content
>>>>>> Translation is useless. Not all users think like that, but many
>>>>>> do.
>>>>>>
>>>>>> 3. We can define a numerical threshold for the acceptable
>>>>>> percentage of machine translation post-editing. Currently it's
>>>>>> 75%.
>>>>>> It's a tad embarrassing that it's hard-coded at the moment, but
>>>>>> it can very easily be made into a per-language variable. If the
>>>>>> translator tries to publish a page in which less than that is
>>>>>> modified, a warning is shown.
>>>>>>
>>>>>> 4. I'm not sure what you mean by "language model". If it's any
>>>>>> kind of linguistic engine, then it's definitely not within the
>>>>>> resources that the Language team itself can currently dedicate.
>>>>>> However, if somebody who knows Norwegian and some programming
>>>>>> writes a script that analyzes common bad constructs in a Wikipedia
>>>>>> dump, that will be very useful. This would basically be an
>>>>>> upgraded version of suggestion #1 above. (In my spare time, as a
>>>>>> volunteer, I'm doing something comparable for Hebrew, although not
>>>>>> for translation but for improving how MediaWiki link trails work.)
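[Editorial sketch: the trigram check John proposes above, which is close to the dump-analysis script Amir describes in point 4, could look roughly like this. The corpus sentences, function names, and whitespace tokenization are toy assumptions, not real dump data or an existing tool.]

```python
# Hypothetical sketch: build word-trigram counts from existing
# target-language text (in practice, a Wikipedia dump) and flag
# trigrams in a new translation that the corpus has never seen.
from collections import Counter

def trigrams(text):
    """All consecutive word triples in a sentence, lowercased."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def build_model(corpus_sentences):
    """Count every trigram observed in the corpus."""
    model = Counter()
    for sentence in corpus_sentences:
        model.update(trigrams(sentence))
    return model

def suspicious_trigrams(model, sentence):
    """Trigrams never seen in the corpus: candidates for manual review."""
    return [t for t in trigrams(sentence) if model[t] == 0]

# Toy stand-in for a dump of well-formed Norwegian text.
corpus = ["oppland er et fylke i norge", "akershus er et fylke i norge"]
model = build_model(corpus)

print(suspicious_trigrams(model, "oppland er et fylke i norge"))   # -> []
print(suspicious_trigrams(model, "oppland er en fylket i norge"))  # flags the bad gender
```

As the thread notes, this need not run live; it would be enough to run it once before the page is saved, and a real version would need a much larger corpus to keep false positives down.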
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wikimedia-l mailing list, guidelines at:
>>>>>> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
>>>>>> https://meta.wikimedia.org/wiki/Wikimedia-l
>>>>>> New messages to: Wikimedia-l@lists.wikimedia.org
>>>>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
--
Etiamsi omnes, ego non