Re: [Wikidata] Machine translation efforts for underserved languages

Info WorldUniversity Mon, 18 Jun 2018 12:44:22 -0700

Hi Olya, Lucie, and Wikidatans,

Very interesting projects. And thanks for publishing, Lucie - very helpful!


With regard to Swahili, Arabic (both African languages!) and Esperanto, and
leveraging Google Translate / GNMT, I've been looking at this Google GNMT
GIF image -
https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s1600/image01.gif
- and wondering how the triplets of Wikidata's Linked Open Data, as a
structured Knowledge Base (KB), would stream through this in multiple
smaller languages?

I couldn't deduce from this paper - https://arxiv.org/pdf/1803.07116.pdf -
here, for example ...

2.1 Encoding the Triples

The encoder part of the model is a feed-forward architecture that encodes
the set of input triples into a fixed dimensionality vector, which is
subsequently used to initialise the decoder. Given a set of un-ordered
triples F_E = {f_1, f_2, ..., f_R : f_j = (s_j, p_j, o_j)}, where s_j, p_j
and o_j are the one-hot vector representations of the respective subject,
property and object of the j-th triple, we compute an embedding h_{f_j} for
the j-th triple by forward propagating as follows:

    h_{f_j} = q(W_h [W_in s_j ; W_in p_j ; W_in o_j])          (1)
    h_{F_E} = W_F [h_{f_1} ; ... ; h_{f_{R-1}} ; h_{f_R}]      (2)

where h_{f_j} is the embedding vector of each triple f_j, and h_{F_E} is a
fixed-length vector representation for all the input triples F_E. q is a
non-linear activation function, and [... ; ...] represents vector
concatenation. W_in, W_h and W_F are trainable weight matrices. Unlike
(Chisholm et al., 2017), our encoder is agnostic with respect to the order
of input triples. As a result, the order of a particular triple f_j in the
triples set does not change its significance towards the computation of
the vector representation of the whole triples set, h_{F_E}.

... whether this would address streaming triplets through GNMT?

Would it? Since Swahili, Arabic and Esperanto are all active languages in
https://translate.google.com/ , no further coding on the GNMT side would be
necessary. (I'm also curious how WUaS could best grow small languages that
are not yet among Wikipedia/Wikidata's 287-301 languages or GNMT's ~100+
languages.)
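
To make the section 2.1 excerpt quoted above a bit more concrete for myself,
here is a minimal sketch of equations (1) and (2) in plain Python. It is only
my reading of the two formulas, not the authors' code: the dimensions, the
random weights, the choice of tanh for q, and the integer ids standing in for
subjects, properties and objects are all assumptions for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at `index`."""
    v = [0.0] * size
    v[index] = 1.0
    return v


def random_matrix(rows, cols, rng):
    """Stand-in for a trainable weight matrix (random, untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode_triples(triples, W_in, W_h, W_F, q=math.tanh):
    """Encode a list of (s, p, o) id triples into one fixed-length vector."""
    vocab = len(W_in[0])
    h_all = []
    for s, p, o in triples:
        # [W_in s_j ; W_in p_j ; W_in o_j]: embed each item, then concatenate.
        concat = []
        for idx in (s, p, o):
            concat.extend(matvec(W_in, one_hot(idx, vocab)))
        # Equation (1): h_fj = q(W_h [...]); q = tanh is an assumption here.
        h_all.extend(q(z) for z in matvec(W_h, concat))
    # Equation (2): h_FE = W_F [h_f1 ; ... ; h_fR]
    return matvec(W_F, h_all)


# Hypothetical sizes: 6 ids in the vocabulary, R = 2 input triples.
rng = random.Random(0)
VOCAB, EMB, HID, OUT, R = 6, 4, 5, 3, 2
W_in = random_matrix(EMB, VOCAB, rng)
W_h = random_matrix(HID, 3 * EMB, rng)
W_F = random_matrix(OUT, R * HID, rng)

triples = [(0, 1, 2), (3, 4, 5)]  # made-up (subject, property, object) ids
h_FE = encode_triples(triples, W_in, W_h, W_F)
print(len(h_FE))
```

Since s_j, p_j and o_j are one-hot, W_in s_j just picks out one column of
W_in, so in practice this step would be an embedding lookup. Note also that
h_FE has a fixed length only because R is fixed in this sketch.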

How could your Wikidata / Wikibabel work interface with Google GNMT more
fully with time, building on your great Wikidata coding/papers?

Cheers,
Scott

https://en.wikipedia.org/wiki/User:Scott_WUaS



On Mon, Jun 18, 2018 at 5:17 AM, Gerard Meijssen <gerard.meijs...@gmail.com>
wrote:

> Hoi,
> On average there is little or no support for subjects that have to do with
> Africa. When I check the articles for politicians, for instance, I find that
> even current presidents, let alone ministers, are missing in African
> Wikipedias. So it is wonderful that there have been projects that deal with
> gaps, but what if there is hardly anything?
>
> What this approach brings us is at least information: basic information in
> lists and infoboxes, and maybe an additional line of text.
>
> What we apparently have not done is learn from the Cebuano experience. The
> biggest issue was not the quality of the new information, it was the
> integration with Wikidata. Everything was new and it did not link with what
> we already knew. What we bring in this way is integrated information, and as
> long as the data is not saved as an article, the quality provided improves as
> Wikidata gains better intel.
>
> If anything, the experience of the Welsh Wikipedia brings us more than
> GapFinder or the Tiger editathon, because it is more in line with this
> approach.
> Thanks,
>      GerardM
>
> On 18 June 2018 at 13:19, Amir E. Aharoni <amir.ahar...@mail.huji.ac.il>
> wrote:
>
>>
>> 2018-06-18 2:12 GMT+03:00 Olya Irzak <oir...@gmail.com>:
>>
>>> Dear Wikidata community,
>>>
>>> We're working on a project called Wikibabel to machine-translate parts
>>> of Wikipedia into underserved languages, starting with Swahili.
>>>
>>> In hopes that some of our ideas can be helpful to machine translation
>>> projects, we wrote a blogpost about how we prioritized which pages to
>>> translate, and what categories need a human in the loop:
>>> https://medium.com/@oirzak/wikibabel-equalizing-information-access-on-a-budget-4038f750e90e
>>>
>>> Rumor has it that the Wikidata community has thought deeply about
>>> information access. We'd love your feedback on our work. Please let us know
>>> about past / ongoing machine translation related projects so we can learn
>>> from & collaborate with them.
>>>
>>
>> I'm not sure how deeply the Wikidata community has thought about it.
>>
>> One project that does something related to what you're doing is GapFinder
>> ( https://www.mediawiki.org/wiki/GapFinder ). As far as I know, the
>> GapFinder frontend is not actively developed, but the recommendation API
>> behind it is actively maintained and developed. You should ask the
>> Research team for more info (see
>> https://www.mediawiki.org/wiki/Wikimedia_Research ).
>>
>> Project Tiger is also doing something similar:
>> https://meta.wikimedia.org/wiki/Project_Tiger_Editathon_2018
>>
>> As a general comment, displaying machine-translated text in a way that
>> makes it appear to have been written by humans is misleading and damaging.
>> I don't know any Swahili, but in the languages that I can read (Russian,
>> Hebrew, Catalan, Spanish, French, German), machine translation is at best
>> a good aid for a human writing a translation, and it's never good enough
>> for actually reading. I also don't understand why you invest credits into
>> pre-machine-translating articles that people can machine-translate for
>> free, but maybe I'm missing something about how your project works.
>>
>> _______________________________________________
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata


-- 
- Scott MacLeod - Founder & President
- https://twitter.com/WorldUnivAndSch
- World University and School
- http://worlduniversityandschool.org
- http://scottmacleod.com

- CC World University and School - like CC Wikipedia with best STEM-centric
CC OpenCourseWare - is incorporated as a nonprofit university and school in
California, and is a U.S. 501(c)(3) tax-exempt educational organization.
