Re: [INDOLOGY] Google Translate for Sanskrit

Oliver Hellwig via INDOLOGY Fri, 13 May 2022 03:34:53 -0700

Most probably they have built their MT system on top of so-called deep
contextualized embeddings such as BERT
(https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b)
or RoBERTa (https://huggingface.co/docs/transformers/model_doc/roberta).
We have analyzed such multilingual embeddings for a Sanskrit project,
and it turned out that the Sanskrit data were mainly taken from a dump
of the Sanskrit Wikipedia, which explains the preference for the modern
version. Very useful for MT, less so for a close reading of Vedic texts.


Best, Oliver

On 13/05/2022 11:51, Antonia Ruppel via INDOLOGY wrote:

I think it's also worth asking what the programmers who made this meant
when they said 'Sanskrit'. The classical language, or the modern spoken
version taught and stratified by organisations such as Samskrita
Bharati? I tried a few simple sentences (I went into town, I saw the
man, Where is the cat? etc) and found that
- the past tense is expressed by means of the ta- and tavant-
participles (the default is the masculine participle, by the way, even
when you try things like 'I, Sītā, went into town'), as favoured e.g. by
modern spoken Sanskrit (not only by it, of course)
- 'Where is the cat?' resulted in the word order बिडालः कुत्र अस्ति,
favoured by modern Sanskrit (and mirrored by e.g. Hindi)
- my, her etc. in sentences like 'she sees her sisters' are usually
expressed, e.g. by means of sva- or through the actual genitive pronoun,
unlike the Classical Sanskrit tendency of only expressing this when
omission causes confusion
- with at least some expressions we get the noun in accusative + karoti
expression (e.g. smitam karoti rather than smayati), which, I think, also
becomes more prevalent as time passes
- external sandhi is not applied, again following the prevalent modern
spoken convention

Entering 'I have seen him' (rather than 'I saw him') gives me मया तं
दृष्टम्, which I don't quite understand because I'd have expected 'him' to
be the subject and thus nominative. (The same results with other
transitive verbs.)

When you create a translation program, you need to decide what the
'right' translation of something is. With literary languages, like
Sanskrit, whose features usually include variety of expression, that is
difficult. So it seems natural that the programmers would use the
standards of the modern spoken language, for whose creation those
decisions were at some point made.

That Google Translate now includes Sanskrit is a fascinating social
phenomenon. I'm looking forward to seeing how they are going to develop
it, and hope someone might at some point talk about their methodology in
creating this function. (Let's find out and invite them to a conference?
It would surely make for a fascinating talk!)

All best,
      Antonia

On Fri, 13 May 2022 at 10:01, Satyanad Kichenassamy
<[email protected]> wrote:


    Dear All,

    Here are a few further experiments that illustrate other issues:

    Input: सत्यमेव जयते
    Output: Truth always triumphs

    Input: Truth always triumphs
    Output: सत्यं सदा विजयते

    Input: सत्यं सदा विजयते
    Output: Truth always triumphs

    Input: C'est la réalité qui triomphe
    Output: Reality wins

    Input: C'est la réalité qui triomphe.
    Output: It is reality that triumphs.

    (The only difference between the last two inputs is the final period.)

    Input: Reality prevails.
    Output: La réalité l'emporte.

    Input: Reality alone prevails.
    Output: Seule la réalité prévaut.

    Input: Seule la réalité prévaut.
    Output: Only reality prevails.
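
The round-trip pattern in these experiments can be written down as a small check. What follows is only a sketch under stated assumptions: it replays some of the input/output pairs reported in this message from a lookup table and does not call Google Translate at all. The names OBSERVED, round_trip and is_stable are hypothetical, and the table ignores which language pair was selected for each query.

```python
# Input/output pairs copied from the experiments reported in this message.
# Assumption: this table stands in for a real translation service.
OBSERVED = {
    "सत्यमेव जयते": "Truth always triumphs",
    "Truth always triumphs": "सत्यं सदा विजयते",
    "सत्यं सदा विजयते": "Truth always triumphs",
    "Reality alone prevails.": "Seule la réalité prévaut.",
    "Seule la réalité prévaut.": "Only reality prevails.",
}


def round_trip(text, table=OBSERVED):
    """Return (forward, back): the translation and its re-translation.

    Either value is None when the table holds no observation for it.
    """
    forward = table.get(text)
    back = table.get(forward) if forward is not None else None
    return forward, back


def is_stable(text, table=OBSERVED):
    """True when translating forth and back reproduces the input exactly."""
    _, back = round_trip(text, table)
    return back == text


if __name__ == "__main__":
    for sentence in ("सत्यमेव जयते", "सत्यं सदा विजयते", "Reality alone prevails."):
        print(sentence, "->", round_trip(sentence), "stable:", is_stable(sentence))
```

With these pairs, सत्यं सदा विजयते survives the round trip, while सत्यमेव जयते and "Reality alone prevails." come back in a different wording. Exact string comparison is deliberately strict; as the two French inputs show, even a final period can change the output.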

    And, for fun, Prop. 12.21 from Brahmagupta's Brāhmasphuṭasiddhānta.

    Input: स्थूलफलं त्रिचतुर्भुजबाहुप्रतिबाहुयोगदलघातः।
    भुजयोगार्धचतुष्टयभुजोनघातात्पदं सूक्ष्मम् ॥

    Output: The gross fruit is the three-four-arm arm-counter-arm
    combination team attack.
    The subtle step is from the impact of the four and a half arms of
    the Yoga of the arms.

    A correct translation is as follows (the four lines correspond to
    the four parts of this Arya verse):
    A crude value [indeed] of the area of a triquadrilateral
        Is the product of the half-sums of opposite sides;
    Of a group consisting of four half-sums of the sides, from which
        The sides have been subtracted [in turn], the root of the product is the refined [value].

    NB: There are quite a few technical terms here; taking some of them
    in their ordinary meaning leads to gibberish. "Pada" here is the
    square root (because the foot of a tree is its root). Yoga is here
    the sum. "Dala" is the half (literally, "broken (in half)"). A
    triquadrilateral is the figure obtained from a trilateral by adding
    a fourth vertex on its circumcircle. Tricaturbhuja is a neologism
    introduced by Brahmagupta that we translated by a neologism because
    there is no corresponding notion in English.
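
Read with these technical meanings, the verse states two computable rules. The sketch below is one way to write them out, assuming the four sides a, b, c, d are given in order around the figure (so that a/c and b/d are the opposite pairs). The function names are hypothetical and the code is an illustration of the translation above, not a reconstruction of Brahmagupta's own procedure.

```python
import math


def crude_area(a, b, c, d):
    """Crude value: product of the half-sums of opposite sides.

    Sides are taken in order around the figure, so a/c and b/d are opposite.
    """
    return ((a + c) / 2) * ((b + d) / 2)


def refined_area(a, b, c, d):
    """Refined value: root of the product of (half-sum of all sides - side).

    math.sqrt raises ValueError if the side lengths make the product negative.
    """
    s = (a + b + c + d) / 2  # half-sum of the four sides
    return math.sqrt((s - a) * (s - b) * (s - c) * (s - d))


if __name__ == "__main__":
    print(crude_area(3, 4, 3, 4), refined_area(3, 4, 3, 4))
    print(crude_area(2, 3, 4, 5), refined_area(2, 3, 4, 5))
```

For a 3 by 4 rectangle both rules give 12. For sides 2, 3, 4, 5 the crude rule gives 12, while the refined rule gives the square root of 120, a little under 11, which is where the two values part ways.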

    Thus, Google Translate seems adequate at the स्थूल level, but may miss
    the सूक्ष्म.

    Reverting to general issues from an Indological or mathematical (or
    computer science) viewpoint, I would suggest offhand the following
    for discussion:

    (i) is the algorithm public or not? (Probably not, but who knows?)

    (ii) is there a public algorithm with comparable performance?

    (iii) what is the knowledge base (or training set in the sense of
    neural networks) of known algorithms?

    (iv) a possibly related issue is that there does not seem to be any
    equivalent for Indian languages of Chinese databases such as
    ctext.org, for instance, that include many tools
    in addition to searching. For Sanskrit and Tamil, we are grateful to
    have what you can find on
    https://indology.info/external-resources/
    including
    https://www.projectmadurai.org/
    http://gretil.sub.uni-goettingen.de/gretil.html
    https://titus.uni-frankfurt.de/indexf.htm

    etc.

    For Sanskrit morphology and, to some extent, parsing, the situation
    is much better: https://sanskrit.inria.fr/DICO/
    But such tools do not seem to have been integrated into other
    databases (so that, for instance, hovering the mouse over a word
    would suggest its grammatical nature, or suggest meanings -- such
    things exist in Chinese). This may require the text input into the
    database to integrate a modicum of grammatical analysis and
    therefore, what amounts to an implicit commentary. This may
    nonetheless be appropriate for research journals that could provide
    enriched versions of papers. Automated translation always requires
    some form of semantic input anyway, except for the crudest examples.
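
The kind of enriched text described here can be pictured as a token-level annotation layer. The following is a minimal sketch, assuming a hand-written lexicon for the three-word sentence बिडालः कुत्र अस्ति quoted elsewhere in this thread. The names Analysis, LEXICON, annotate and tooltip are hypothetical; nothing here uses or imitates the interface of the Inria tools or of any existing database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One stored grammatical analysis for a surface form."""
    lemma: str
    category: str
    features: str
    gloss: str


# Hand-written entries (an assumption, not the output of any real analyser).
LEXICON = {
    "बिडालः": Analysis("बिडाल", "noun", "masculine, nominative singular", "cat"),
    "कुत्र": Analysis("कुत्र", "indeclinable", "interrogative adverb", "where"),
    "अस्ति": Analysis("अस्", "verb", "present, 3rd person singular", "is"),
}


def annotate(sentence, lexicon=LEXICON):
    """Pair every whitespace-separated token with its analysis, or None."""
    return [(token, lexicon.get(token)) for token in sentence.split()]


def tooltip(token, lexicon=LEXICON):
    """Text a reader could be shown when hovering over `token`."""
    analysis = lexicon.get(token)
    if analysis is None:
        return f"{token}: no analysis stored"
    return (f"{token}: {analysis.lemma} ({analysis.category}; "
            f"{analysis.features}) '{analysis.gloss}'")


if __name__ == "__main__":
    for token, _ in annotate("बिडालः कुत्र अस्ति"):
        print(tooltip(token))
```

Splitting on whitespace only works here because the sentence is written without external sandhi. For texts with sandhi applied, a segmentation step would have to come first, and choosing one segmentation over another is exactly the implicit commentary mentioned above.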

    Best regards,

          Satyanad Kichenassamy

    On Thu, 12 May 2022 16:48:48 -0400
    Elliot Stern via INDOLOGY <[email protected]> wrote:

     > Aleksandar’s comment is spot on:
     >
     >
     >
     > Elliot M. Stern
     > 552 South 48th Street
     > Philadelphia, PA 19143-2029
     > [email protected]
     > 267-240-8418
     >
     > > On May 12, 2022, at 1:45 PM, Uskokov, Aleksandar via INDOLOGY <[email protected]> wrote:
     > >
     > > It will be a while before it becomes a philosopher --
     > >
     > > Aleksandar Uskokov
     > > Lector in Sanskrit
     > > South Asian Studies Council, Yale University
     > > 203-432-1972 | [email protected]
     > >
     > > Office Hours Sign-up: https://calendly.com/aleksandar-uskokov
     > > From: INDOLOGY <[email protected]> on behalf of Madhav Deshpande via INDOLOGY <[email protected]>
     > > Sent: Thursday, May 12, 2022 1:31 PM
     > > To: Dominik Wujastyk <[email protected]>
     > > Cc: Indology <[email protected]>
     > > Subject: Re: [INDOLOGY] Google Translate for Sanskrit
     > >
     > > This is Google Translator for the first verse of Meghadūta:
     > >
     > > "Someone is neglected by the teacher of separation from his lover:
     > > Shapenastangmitamahima varshabhogyaena bhartu:
     > > The yaksha bathed Janaka's daughter in the holy waters
     > > I lived in the hermitages of Ramagiri among the lush shady trees."
     > >
     > > GT could not figure out the long compounds, and "guru" got translated as "teacher." The syntax of the verse is also missed.
     > >
     > > Madhav M. Deshpande
     > > Professor Emeritus, Sanskrit and Linguistics
     > > University of Michigan, Ann Arbor, Michigan, USA
     > > Senior Fellow, Oxford Center for Hindu Studies
     > > Adjunct Professor, National Institute of Advanced Studies, Bangalore, India
     > >
     > > [Residence: Campbell, California, USA]
     > >
     > >
     > > On Thu, May 12, 2022 at 10:17 AM Madhav Deshpande <[email protected]> wrote:
     > > <image.png>
     > >
     > >
     > > On Thu, May 12, 2022 at 10:16 AM Dominik Wujastyk via INDOLOGY <[email protected]> wrote:
     > > It's quite remarkable:
     > > <image.png>
     > >
     > >
     > > _______________________________________________
     > > INDOLOGY mailing list
     > > [email protected]
     > > https://list.indology.info/mailman/listinfo/indology
     > > <Screenshot 2022-05-12 134349.png>


    --
    **********************************************
    Satyanad KICHENASSAMY
    Professor of Mathematics
    Laboratoire de Mathématiques de Reims  (CNRS, UMR9008)
    Université de Reims Champagne-Ardenne
    F-51687 Reims Cedex 2
    France
    Web: https://www.normalesup.org/~kichenassamy
    **********************************************



