Re: [Wikimedia-l] machine translation

John Erling Blad Wed, 03 May 2017 04:20:49 -0700

Note that for some language pairs the machine translation could easily be
100% correct, so an unchanged text does not prove the user made no effort.
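
A minimal sketch of the comparison proposed in the quoted message below, assuming a word-level similarity ratio is an acceptable measure. The function names and the 0.95 limit are illustrative only, not anything ContentTranslation actually implements. It also shows the catch: a correct machine translation saved as-is is indistinguishable from an unread one.

```python
from difflib import SequenceMatcher


def similarity(machine_text: str, saved_text: str) -> float:
    """Word-level similarity in [0.0, 1.0] between MT output and saved text."""
    matcher = SequenceMatcher(None, machine_text.split(), saved_text.split())
    return matcher.ratio()


def looks_unedited(machine_text: str, saved_text: str, limit: float = 0.95) -> bool:
    """True when more than `limit` of the machine translation is unchanged."""
    return similarity(machine_text, saved_text) > limit


mt = "Oppland er en fylket i Norge"
print(looks_unedited(mt, mt))                             # True: saved unchanged
print(looks_unedited(mt, "Oppland er et fylke i Norge"))  # False: two words fixed
```

For a language pair where the engine output is already correct, the first case would be flagged even though nothing needed fixing.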

On Wed, May 3, 2017 at 1:06 PM, David Cuenca Tudela <dacu...@gmail.com>
wrote:


> Perhaps it would be a good idea to compare the translated text to the text
> that the user wants to save.
>
> If they are more than 95% the same, that means that the user didn't take
> the effort to correct the text.
>
> Cheers,
> Micru
>
> On Wed, May 3, 2017 at 10:31 AM, Wojciech Pędzich <wpedz...@gmail.com>
> wrote:
>
> > It does depend a lot on the engagement level of the human behind the
> > keyboard. When I deal with machine-translated text, I simply wonder
> > whether the person behind the keyboard made the effort to actually
> > read the piece.
> >
> > Now whether this would work if limited to namespaces outside "main": I
> > do not want to demonise the issue, but if the person submitting the
> > text for machine translation does not read it, what will stop them
> > from a quick ctrl+c / ctrl+v? Just asking.
> >
> > Wojciech
> >
> > On 2017-05-03 at 09:33, Yaroslav Blanter wrote:
> >
> >> Creating machine translations only in the draft space (or in the user
> >> space in the projects which do not have a draft space) could help.
> >>
> >> Cheers
> >> Yaroslav
> >>
> >> On Tue, May 2, 2017 at 10:16 PM, Pharos <pharosofalexand...@gmail.com>
> >> wrote:
> >>
> >>> I think it all depends on the level of engagement of the human
> >>> translator.
> >>>
> >>> When the tool is used in the right way, it is a fantastic tool.
> >>>
> >>> Maybe we can find better methods to nudge people toward taking their
> >>> time and really doing work on their translations.
> >>>
> >>> Thanks,
> >>> Pharos
> >>>
> >>> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <
> >>> bodhisattwa.rg...@gmail.com> wrote:
> >>>
> >>>> Content translation with Yandex is also a problem in Bengali
> >>>> Wikipedia. Some users have developed a tendency to create
> >>>> meaningless machine-translated articles with this extension to
> >>>> increase their edit count and article count. This has increased the
> >>>> workload of admins, who have to find and delete those articles.
> >>>>
> >>>> Yandex is not ready for many languages and it is better to shut it
> >>>> off. We don't need it in Bengali.
> >>>>
> >>>> Regards
> >>>>
> >>>> On May 3, 2017 12:17 AM, "John Erling Blad" <jeb...@gmail.com> wrote:
> >>>>
> >>>>> Actually this _is_ about turning ContentTranslation off, that is
> >>>>> what several users in the community want. They block people using
> >>>>> the extension and delete the translated articles. Use of
> >>>>> ContentTranslation has become a rather contentious case.
> >>>>>
> >>>>> Yandex is quite good as a general translation engine for reading
> >>>>> text in an unfamiliar language, but as an engine for producing
> >>>>> written text it is not very good at all. In fact it often creates
> >>>>> quite horrible Norwegian, even for closely related languages. One
> >>>>> quite common problem is reordering of words into meaningless
> >>>>> constructs; another is handling lexical gender in weird ways. The
> >>>>> English article "a" is often translated as "en" in a noun phrase,
> >>>>> and then the gender is added to the following noun. That turns
> >>>>> "Oppland is a county in…" into something like "Oppland er en
> >>>>> fylket i…". This should be "Oppland er et fylke i…".
> >>>>>
> >>>>> (I just checked and it seems like Yandex messes up a lot less now
> >>>>> than previously, but it is still pretty bad.)
> >>>>>
> >>>>> Apertium works because the languages are closely related; Yandex
> >>>>> does not work because it is used between very different
> >>>>> languages. People try to use Yandex, get disappointed, and falsely
> >>>>> conclude that all language translations are equally weird. They
> >>>>> are not, but Yandex translations are weird.
> >>>>>
> >>>>> The numerical threshold does not work. The reason is simple: the
> >>>>> number of fixes depends on which language constructs fail, and
> >>>>> that is simply not constant for small text fragments. Perhaps we
> >>>>> could flag specific language constructs that are known to give a
> >>>>> high percentage of failures, and require the translator to check
> >>>>> those sentences. One such construct is a discrepancy between the
> >>>>> article and the gender of the following noun in a noun phrase. If
> >>>>> they do not agree, then the sentence must be checked. It is not
> >>>>> always wrong to write "en jenta" in Norwegian, but it is likely to
> >>>>> be wrong.
> >>>>>
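
The agreement check described above could be sketched roughly as follows. This is only an illustration under loud assumptions: the tiny lexicon, the function name and the tokenisation are hypothetical, and a real check would need a proper morphological lexicon for the target language. Flagged pairs are candidates for review, not proven errors.

```python
# Toy lexicon (hypothetical, for illustration only):
# noun form -> (gender, is_definite_form)
NOUNS = {
    "fylke": ("n", False),
    "fylket": ("n", True),
    "jente": ("f", False),
    "jenta": ("f", True),
    "bil": ("m", False),
}

# Genders each indefinite article is accepted with in this sketch
# (Bokmål also allows "en" before feminine nouns).
ARTICLES = {"en": {"m", "f"}, "ei": {"f"}, "et": {"n"}}


def suspicious_pairs(sentence: str) -> list[tuple[str, str]]:
    """Return (article, noun) pairs that should be checked by the translator."""
    words = [w.strip('.,;:"…').lower() for w in sentence.split()]
    flagged = []
    for article, noun in zip(words, words[1:]):
        if article in ARTICLES and noun in NOUNS:
            gender, definite = NOUNS[noun]
            # Flag a gender mismatch, or a definite form after an indefinite article.
            if definite or gender not in ARTICLES[article]:
                flagged.append((article, noun))
    return flagged


print(suspicious_pairs("Oppland er en fylket i Norge."))  # [('en', 'fylket')]
print(suspicious_pairs("Oppland er et fylke i Norge."))   # []
```

The point of the design is that the number of flags depends on which constructs occur, not on how much of the text was edited.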
> >>>>> A language model could be a statistical model for the language
> >>>>> itself, not for the translation into that language. We don't want
> >>>>> a perfect language model, just one sufficient to mark weird
> >>>>> constructs. A very simple solution could be to mark trigrams that
> >>>>> do not already exist in the text base for the destination
> >>>>> language as possible errors. It is not necessary to do a live
> >>>>> check, but it should at least be done before the page can be
> >>>>> saved.
> >>>>>
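
The trigram idea above can be sketched in a few lines. This assumes whitespace tokenisation and a text base small enough to hold as an in-memory set; the function names and the two-sentence corpus are hypothetical stand-ins for a real dump of the destination wiki.

```python
from typing import Iterable


def trigrams(text: str) -> list[tuple[str, ...]]:
    """Split text on whitespace and return all consecutive word triples."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def build_known_trigrams(corpus: Iterable[str]) -> set[tuple[str, ...]]:
    """Collect every trigram seen in the existing text base."""
    known: set[tuple[str, ...]] = set()
    for sentence in corpus:
        known.update(trigrams(sentence))
    return known


def unseen_trigrams(text: str, known: set[tuple[str, ...]]) -> list[tuple[str, ...]]:
    """Return trigrams of `text` that never occur in the text base."""
    return [t for t in trigrams(text) if t not in known]


# Hypothetical stand-in for the destination-language text base.
corpus = ["Oppland er et fylke i Norge", "Hedmark er et fylke i Norge"]
known = build_known_trigrams(corpus)

print(unseen_trigrams("Oppland er et fylke i Norge", known))   # []
print(unseen_trigrams("Oppland er en fylket i Norge", known))  # four unseen trigrams
```

With a small text base many correct trigrams would also be unseen, so the output is better treated as a list of places to look than as a verdict.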
> >>>>> Note the difference between what Yandex does and what we want to
> >>>>> achieve: Yandex translates a text between two different languages
> >>>>> without any clear reason why. It is not too important if there
> >>>>> are weird constructs in the text, as long as it is usable in
> >>>>> "some" context. We translate a text for the purpose of
> >>>>> republishing it. The text should be usable and easily readable in
> >>>>> that language.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni <
> >>>>> amir.ahar...@mail.huji.ac.il> wrote:
> >>>>>
> >>>>>> 2017-05-02 18:20 GMT+03:00 John Erling Blad <jeb...@gmail.com>:
> >>>>>>
> >>>>>>> Brute force solution; turn the ContentTranslation off. Really
> >>>>>>> stupid solution.
> >>>>>>
> >>>>>> ... Then I guess you don't mind that I'm changing the thread name :)
> >>>>>>
> >>>>>>> The next solution; turn the Yandex engine off. That would solve a
> >>>>>>> part of the problem. Kind of lousy solution though.
> >>>>>>>
> >>>>>>> What about adding a language model that warns when the language
> >>>>>>> constructs get too weird? It is like a "test" for the
> >>>>>>> translation. The CT is used for creating a translation, but the
> >>>>>>> language model is used for verifying if the translation is good
> >>>>>>> enough. If it does not validate against the language model it
> >>>>>>> should simply not be published to the main name space. It will
> >>>>>>> still be possible to create a draft, but then the user is
> >>>>>>> completely aware that the translation isn't good enough.
> >>>>>>>
> >>>>>>> Such a language model should be available as a test for any
> >>>>>>> article, as it can be used as a quality measure for the article.
> >>>>>>> It is really a quantitative measure of the well-spokenness of the
> >>>>>>> article, but that isn't quite so intuitive.
> >>>>>>>
> >>>>>> So, I'll allow myself to guess that you are talking about one
> >>>>>> particular language, probably Norwegian.
> >>>>>>
> >>>>>> Several technical facts:
> >>>>>>
> >>>>>> 1. In the past there were several cases in which translators to
> >>>>>> different languages reported common translation mistakes to me. I
> >>>>>> passed them on to Yandex developers, with whom I communicate quite
> >>>>>> regularly. They acknowledged receiving all of them. I am aware of
> >>>>>> at least one such common mistake that was fixed; possibly there
> >>>>>> were more. If you can give me a list of such mistakes for
> >>>>>> Norwegian, I'll be very happy to pass them on. I absolutely cannot
> >>>>>> promise that they will be fixed upstream, but it's possible.
> >>>>>>
> >>>>>> 2. In Norwegian, Apertium is used for translating between the two
> >>>>>> varieties of Norwegian itself (Bokmål and Nynorsk), and from other
> >>>>>> Scandinavian languages. That's probably why it works so well: they
> >>>>>> are similar in grammar, vocabulary, and narrative style (I'll pass
> >>>>>> it on to Apertium developers; I'm sure they'll be happy to hear
> >>>>>> it). Unfortunately, machine translation from English is not
> >>>>>> available in Apertium. Apertium works best with very similar
> >>>>>> languages, and English has two characteristics which are
> >>>>>> unfortunate when combined: it is both the most popular source for
> >>>>>> translation into almost all other languages (including Norwegian),
> >>>>>> and it is not _very_ similar to any other language (except maybe
> >>>>>> Scots). Machine translation from English into Norwegian is only
> >>>>>> possible with Yandex at the moment. More engines may be added in
> >>>>>> the future, but at the moment that's all we have. That's why
> >>>>>> disabling Yandex completely would indeed be a lousy solution: a
> >>>>>> lot of people say that without machine translation integration
> >>>>>> Content Translation is useless. Not all users think like that, but
> >>>>>> many do.
> >>>>>
> >>>>>> 3. We can define a numerical threshold of acceptable percentage of
> >>>>>> machine translation post-editing. Currently it's 75%. It's a tad
> >>>>>> embarrassing that it's hard-coded at the moment, but it can very
> >>>>>> easily be made into a variable per language. If the translator
> >>>>>> tries to publish a page in which less than that is modified, a
> >>>>>> warning will be shown.
> >>>>>>
> >>>>>> 4. I'm not sure what you mean by "language model". If it's any
> >>>>>> kind of a linguistic engine, then it's definitely not within the
> >>>>>> resources that the Language team itself can currently dedicate.
> >>>>>> However, if somebody who knows Norwegian and some programming
> >>>>>> writes a script that analyzes common bad constructs in a Wikipedia
> >>>>>> dump, this will be very useful. This would basically be an
> >>>>>> upgraded version of suggestion #1 above. (In my spare time as a
> >>>>>> volunteer I'm doing something comparable for Hebrew, although not
> >>>>>> for translation, but for improving how MediaWiki link trails
> >>>>>> work.)
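
Point 3 above, the threshold as a per-language variable, might look something like this. It is a sketch only: the names, the override table and the "nb" value are hypothetical, and it reads the threshold literally as the minimum share of the machine translation that must be modified, which may not match how ContentTranslation defines it.

```python
# Default taken from the 75% figure mentioned above.
DEFAULT_THRESHOLD = 0.75

# Hypothetical per-language overrides; the value for "nb" is made up.
PER_LANGUAGE_THRESHOLD = {"nb": 0.85}


def threshold_for(language_code: str) -> float:
    """Return the threshold for a target language, falling back to the default."""
    return PER_LANGUAGE_THRESHOLD.get(language_code, DEFAULT_THRESHOLD)


def should_warn(modified_fraction: float, language_code: str) -> bool:
    """Warn when less than the threshold share of the MT output was modified."""
    if not 0.0 <= modified_fraction <= 1.0:
        raise ValueError("modified_fraction must be between 0 and 1")
    return modified_fraction < threshold_for(language_code)


print(should_warn(0.80, "bn"))  # False: above the default 0.75
print(should_warn(0.80, "nb"))  # True: below the hypothetical 0.85 override
```

A lookup table like this keeps the default in one place while letting communities with weaker engine support ask for a stricter value.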
> >>>>>> _______________________________________________
> >>>>>> Wikimedia-l mailing list, guidelines at:
> >>>>>> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> >>>>>> https://meta.wikimedia.org/wiki/Wikimedia-l
> >>>>>> New messages to: Wikimedia-l@lists.wikimedia.org
> >>>>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> >>>>>> <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>
>
>
>
> --
> Etiamsi omnes, ego non
_______________________________________________
Wikimedia-l mailing list, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
<mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>
