Re: [Apertium-stuff] GSoC

Bernard Chardonneau Sat, 28 Mar 2020 14:53:00 -0700

My point of view now.

About French<->English translations.


There are a lot of candidates for working on Apertium during GSoC, and only a
few of them (I don't know what proportion) are accepted. So, for the reason
Hèctor gave, the English-French pair may not be accepted for GSoC.

But for me, any language pair not available as free software is interesting
for the Apertium project, so it would be good to develop the English-French
pair in the future, even outside GSoC. Several years ago Systran (a French
company) offered its translator for around 50 € for the Windows version, but
around 5000 $ for the GNU/Linux / UNIX version. That is a shame, and it gave me
a good reason to join the Apertium project.

Now, for the pairs including Esperanto.

When these pairs were developed, each pair included all the data files needed
to compile it. So these pairs include at least:
- one morphological dictionary for each language,
- one file for disambiguating the analysis for each language,
- a bilingual dictionary,
- transfer files,
- generally one post-generation file for each language (not really needed when
  the target language is Esperanto, but there is one of these files in the
  eo-en pair).
  
For the nature and content of these files, see the wiki pages. Here is a list
of pages, generally available in both English and French:
http://wiki.apertium.org/wiki/Traductions_en_fran%C3%A7ais
For installing Apertium, there have been a lot of changes, and the English
pages are the most up to date.

A presentation in French about how Apertium works is also available here:
http://imagesn.free.fr/apertium/pres-atelier-2014.odp  (for the slides)
https://rmll.ubicast.tv/videos/developper-des-paires-de-langues-pour-la-traduction-automatique-avec-apertium/
(for the sound).
(As I have only used my personal computer with a video projector 3 times in my
life (always for Apertium), I was not used to plugging in the projector cable
first and only then starting or rebooting my computer.)
  
Now, the morphological dictionaries, disambiguation files and post-generation
files are kept in the language branch, and any (new) language pair using a
given language accesses the same files for it.

So a new language pair includes only the files specific to that pair, which
are:
- the bilingual dictionary,
- the transfer files.

For the French-Esperanto pair, there are 2 entries in the Apertium repository:
apertium-eo-fr
apertium-epo-fra

apertium-eo-fr is the original version, which only translates French into
Esperanto. I used to update its dictionaries when adding words, but I
eventually stopped doing so.

apertium-epo-fra has almost the same files, but the latest dictionary updates
were made only in this pair. The transfer files for translating French into
Esperanto are exactly the same (or possibly only their layout was changed).
But this pair also includes transfer files for translating Esperanto into
French. This direction is not finished, was never released and must be
improved.

So, for this pair, here is what could be done:

In apertium-fra/apertium-fra.fra.metadix

When a paradigm analyses a word with gender mf or number sp, add RL
lines so that generation with gender m or f, or number sg or pl, is also
accepted.
  
2 examples of paradigms to change :

In apertium-fra.fra.metadix we find :

<pardef n="académique__adj">
  <e>       <p><l>s</l>         <r><s n="adj"/><s n="mf"/><s n="pl"/></r></p></e>
  <e>       <p><l></l>          <r><s n="adj"/><s n="mf"/><s n="sg"/></r></p></e>
</pardef>

<pardef n="mois__n">
  <e>       <p><l></l>          <r><s n="n"/><s n="m"/><s n="sp"/></r></p></e>
</pardef>

<pardef n="fois__n">
  <e>       <p><l></l>          <r><s n="n"/><s n="f"/><s n="sp"/></r></p></e>
</pardef>

A better form, present (and necessary) in apertium-epo-fra (and
apertium-fra-por), is:

    <pardef n="académique__adj">
      <e r="LR"><p><l></l>          <r><s n="adj"/><s n="mf"/><s n="sg"/></r></p></e>
      <e r="LR"><p><l>s</l>         <r><s n="adj"/><s n="mf"/><s n="pl"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="adj"/><s n="m"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l>s</l>         <r><s n="adj"/><s n="m"/><s n="pl"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="adj"/><s n="f"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l>s</l>         <r><s n="adj"/><s n="f"/><s n="pl"/></r></p></e>
    </pardef>

    <pardef n="fois__n">
      <e><p><l></l>                 <r><s n="n"/><s n="f"/><s n="sp"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="n"/><s n="f"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="n"/><s n="f"/><s n="pl"/></r></p></e>
    </pardef>

    <pardef n="mois__n">
      <e><p><l></l>                 <r><s n="n"/><s n="m"/><s n="sp"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="n"/><s n="m"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="n"/><s n="m"/><s n="pl"/></r></p></e>
    </pardef>

For the first paradigm, the simpler syntax used in the fra-por pair is even a
little better:

    <pardef n="académique__adj">
      <e><p><l></l>                 <r><s n="adj"/><s n="mf"/><s n="sg"/></r></p></e>
      <e><p><l>s</l>                <r><s n="adj"/><s n="mf"/><s n="pl"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="adj"/><s n="m"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l>s</l>         <r><s n="adj"/><s n="m"/><s n="pl"/></r></p></e>
      <e r="RL"><p><l></l>          <r><s n="adj"/><s n="f"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l>s</l>         <r><s n="adj"/><s n="f"/><s n="pl"/></r></p></e>
    </pardef>
    
So, this kind of change will have to be made everywhere in
apertium-fra.fra.metadix where a word is analysed as mf (don't know if
masculine or feminine) or sp (don't know if singular or plural).

That is already done everywhere in apertium-epo-fra.epo.dix (as in
apertium-fra-por.fra.metadix), so you will just have to carry these
changes over.
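
A minimal sketch of how this change could be automated, assuming the paradigm
entries have the <e><p><l>...</l><r>...</r></p></e> shape shown above. The
function name add_rl_entries and the EXPANSIONS table are hypothetical helpers
written for this illustration, not part of Apertium. The sketch keeps the
existing entries and appends generation-only (r="RL") copies in which mf
becomes m and f, and sp becomes sg and pl.

```python
import copy
import xml.etree.ElementTree as ET

# Ambiguous tag -> concrete tags that generation should also accept.
EXPANSIONS = {"mf": ("m", "f"), "sp": ("sg", "pl")}


def add_rl_entries(pardef):
    """Append r="RL" copies of the entries whose right side has mf or sp."""
    for entry in list(pardef.findall("e")):
        if entry.get("r") == "RL":
            continue  # generation-only entries are left alone
        variants = [entry]
        for ambiguous, concrete in EXPANSIONS.items():
            expanded = []
            for variant in variants:
                tags = variant.findall("./p/r/s")
                if any(s.get("n") == ambiguous for s in tags):
                    # One copy per concrete value of the ambiguous tag.
                    for value in concrete:
                        clone = copy.deepcopy(variant)
                        for s in clone.findall("./p/r/s"):
                            if s.get("n") == ambiguous:
                                s.set("n", value)
                        expanded.append(clone)
                else:
                    expanded.append(variant)
            variants = expanded
        for variant in variants:
            if variant is not entry:  # only the newly created copies
                variant.set("r", "RL")
                pardef.append(variant)
    return pardef


if __name__ == "__main__":
    source = ('<pardef n="mois__n"><e><p><l></l><r><s n="n"/><s n="m"/>'
              '<s n="sp"/></r></p></e></pardef>')
    print(ET.tostring(add_rl_entries(ET.fromstring(source)), encoding="unicode"))
```

The sketch is not idempotent: running it on a paradigm that already has the RL
lines would add them a second time, so the result should be checked by hand.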

Add the words present in apertium-epo-fra.fra.dix that are not yet
in apertium-fra.fra.metadix.
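
A rough sketch of how the missing words could be listed, assuming each word
entry carries an lm attribute and a <par n="..."/> reference, as is usual in
Apertium monolingual dictionaries. The function names are hypothetical.
Because paradigm names may differ between the two files, the sketch compares
only the lemma and the part-of-speech suffix of the paradigm name (the part
after "__").

```python
import xml.etree.ElementTree as ET


def entry_keys(root):
    """Return the set of (lemma, part-of-speech suffix) pairs in a dictionary."""
    keys = set()
    for entry in root.iter("e"):
        lemma = entry.get("lm")
        if lemma is None:
            continue  # entries inside paradigm definitions have no lemma
        par = entry.find(".//par")
        suffix = par.get("n", "").rsplit("__", 1)[-1] if par is not None else ""
        keys.add((lemma, suffix))
    return keys


def missing_entries(source_root, target_root):
    """List the entries of source_root that are absent from target_root."""
    return sorted(entry_keys(source_root) - entry_keys(target_root))


if __name__ == "__main__":
    # The file names follow the message; the paths are an assumption.
    source = ET.parse("apertium-epo-fra.fra.dix").getroot()
    target = ET.parse("apertium-fra.fra.metadix").getroot()
    for lemma, suffix in missing_entries(source, target):
        print(lemma, suffix)
```

The list only says which words are missing; each one still has to be added by
hand with a paradigm that exists in the target file.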

Note: for at least one paradigm there is a difference between the two files.
Masculine nouns that take an "s" in the plural use the paradigm livre__n in
apertium-fra.fra.metadix but accessoire__n in apertium-epo-fra.fra.dix.

I prefer accessoire__n: that way, for the two most common noun paradigms,
the reference word of the paradigm appears very early in the alphabetically
sorted list of words.

So, let's replace livre__n with accessoire__n everywhere.
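
A small sketch of such a renaming, assuming paradigms are defined with
<pardef n="..."> and referenced with <par n="...">. The function name
rename_paradigm is hypothetical. Note that ElementTree discards XML comments
and the original layout by default, so on the real dictionary a careful
textual search and replace may be preferable; the sketch mainly shows which
elements are concerned.

```python
import xml.etree.ElementTree as ET


def rename_paradigm(root, old, new):
    """Rename a paradigm definition and all references to it.

    Returns the number of elements changed.
    """
    if any(p.get("n") == new for p in root.iter("pardef")):
        raise ValueError(f"paradigm {new!r} is already defined")
    renamed = 0
    for element in root.iter():
        if element.tag in ("pardef", "par") and element.get("n") == old:
            element.set("n", new)
            renamed += 1
    return renamed


if __name__ == "__main__":
    tree = ET.parse("apertium-fra.fra.metadix")  # path is an assumption
    count = rename_paradigm(tree.getroot(), "livre__n", "accessoire__n")
    print(count, "elements renamed")
```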

I don't know if there are other paradigms doing the same thing under different
names in the two files, but if you find any, take as the reference word the
first of these names in alphabetical order.

That way, the most frequently used paradigms will be the ones that appear
early in the full alphabetically sorted list of words. That should help when
choosing a paradigm, generally without having to read the content of a large
number of them.

Now for the language epo.

I found a horrible file of more than 200 000 lines of paradigms, and no
words using them! Completely useless. Only the comments in the sdef section
could be useful.

So, this file will have to be rebuilt from apertium-epo-fra.epo.dix and
apertium-eo-en.eo.dix.xml (+ possibly other files of that kind) to get
all the Esperanto words used in these pairs. The paradigms used seem to work
the same way in both pairs.

After that, you will have to test whether the transfer rules still work, and
correct them.

As he said, for the eo-en pair, ask Jacob Nordfalk.

For the fra => epo translation direction, ask Hèctor Alòs.

For the epo => fra translation direction, ask me.

For this translation direction, step 0 (apertium-epo-fra.epo-fra.t0x)
adds "unu" to nouns without the determiner "la". That makes it possible to use
the same transfer rules for nouns with the determiner "le", "la", "les" (or
sometimes "l'" after post-generation) and for nouns with the determiner "un",
"une", "des".

After that, only one stage of transfer is used. At present, there are no rules
for adverbs or for pronouns in the accusative. Adding them would greatly
reduce the number of # marks in a translation.

I also wrote a lot of transfer rules for sentences like
<det>? <adj>* <n> "de" <det>? <adj>* <n> estas <adj>
<det>? <adj>* <n> "de" <det>? <adj>* <n> <verb>

Example :
la kato de la najbarino estas blanka
la malgranda katino de la najbaro estas blanka
la kanino de la dika najbarino ne estas nigra
katoj de la dika granda najbaro estas blankaj
..

With the possibility of having 0, 1 or 2 adjectives for each noun, that
makes plenty of similar transfer rules, even if in that case step 0
divides their number by 4.
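
To make the arithmetic concrete, here is a small counting sketch. It only
enumerates the shapes covered by the two patterns above and is an assumption
about how the count arises, not a description of the actual rule files; the
function names are hypothetical. Without step 0, each of the two noun groups
may or may not start with a determiner; after step 0, every noun group has
one.

```python
from itertools import product


def noun_group_shapes(optional_det):
    """Shapes of <det>? <adj>* <n> with 0, 1 or 2 adjectives."""
    dets = [(), ("det",)] if optional_det else [("det",)]
    adjs = [("adj",) * count for count in range(3)]
    return [det + adj + ("n",) for det, adj in product(dets, adjs)]


def sentence_shapes(optional_det):
    """Shapes of <group> "de" <group> followed by 'estas <adj>' or '<verb>'."""
    groups = noun_group_shapes(optional_det)
    endings = [("estas", "adj"), ("verb",)]
    return [first + ("de",) + second + ending
            for first, second, ending in product(groups, groups, endings)]


if __name__ == "__main__":
    without_step0 = len(sentence_shapes(optional_det=True))   # 6 * 6 * 2
    with_step0 = len(sentence_shapes(optional_det=False))     # 3 * 3 * 2
    print(without_step0, with_step0, without_step0 // with_step0)
```

Under these assumptions there are 72 shapes without step 0 and 18 with it,
which is where the factor of 4 comes from: one factor of 2 for each of the two
noun groups.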

A good change would be to rewrite the transfer rules for this kind of
sentence using a 3-stage transfer. That makes it possible to process shorter
lists of words and to send the gender and number (or other information) of
one group to another.


    
> Date: Wed, 25 Mar 2020 13:48:17 +0300
> From: Hèctor Alòs i Font <hectora...@gmail.com>
> To: "[apertium-stuff]" <apertium-stuff@lists.sourceforge.net>, 
>  Bernard Chardonneau <bechapert...@free.fr>,
>  Jacob Nordfalk <jacob.nordf...@gmail.com>
> Reply-To: apertium-stuff@lists.sourceforge.net
> Subject: Re: [Apertium-stuff] GSoC
>
> Saluton, Andrew!
> Mi ĝojas legi pri propono rilata al esperanto. Mi daŭrigas angle, por ke al
> ĉiuj estu kompreneble.
>
> It probably doesn't make any sense to work on the English-French pair in
> Apertium, since these are two of the languages with the most resources in
> the world (linguistic and non-linguistic). As a result, there are quite a
> lot of good translators between them, although most of them commercial.
>
> Esperanto is also included in Google Translator, but I think the
> Esperanto-French translation can be done at a similar level in Apertium.
> Moreover, the translation from Esperanto to French could be used to test
> the new apertium-recursive module.
>
> In fact, the current versions of Apertium's four Esperanto pairs (English ⇆
> Esperanto, French → Esperanto, Spanish → Esperanto and Catalan →
> Esperanto)
> were released ten years ago. They all use the old all-in-one-repository
> structure. Porting these four pairs into the new structure which shares
> language resources (using apertium-eng, apertium-fra, apertium-spa and
> apertium-cat) would result in a big improvement, because a lot of work has
> been done in Apertium on these languages in the last ten years. But porting
> is not automatic, since there are differences between the monodixes of the
> current pairs and the ones in the given four repositories. There are even
> differences between the Esperanto-monodixes in these four pairs.
>
> Another question is that @Bernard Chardonneau <bechapert...@free.fr> has
> been working in his own branch of apertium-fra-esp. So, it'd be interesting
> to read what he thinks on a GSoC that would include this pair. Maybe he
> could find time to mentor the project (and maybe @Jacob Nordfalk
> <jacob.nordf...@gmail.com>, who created the Esperanto-English pair, too).
>
> So, in short, in my opinion:
>
> - You should evaluate the quality of the current Google translation,
> especially for the French-Esperanto pair, in which Google translates in two
> steps (this
> <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>
> is what I did in a similar case last year)
>
> - A part of the project could be a kind of elementary "make a language pair
> state-of the art
> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Make_a_language_pair_state-of-the-art>".
> At a minimum this would include the French-Esperanto pair
>
> - Maybe half of the project could be developing a translator from Esperanto
> into French
>
> Hèctor
>
> Missatge de Andrew Briand <atb8...@comcast.net> del dia dc., 25 de març
> 2020 a les 11:46:
>
> > Hello,
> >
> >
> >
> > I am an undergrad interested in adopting an Apertium language pair for
> > Google Summer of Code 2020. I am most interested in French<->English,
> > Esperanto->English, and Esperanto<->French. What might a project for those
> > language pairs look like?
> >
> >
> >
> > Thank you,
> >
> >
> >
> > Andrew Briand
> >
> >
> >
> >
> > _______________________________________________
> > Apertium-stuff mailing list
> > Apertium-stuff@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/apertium-stuff
> >
>
--------------------------------
Bernard Chardonneau (France)
Phone : [33] 9 72 36 32 90
GSM phone : [33] 7 69 46 16 31

An alternative Apertium translation website :
http://apertiumtrad.tuxfamily.org

Multilingual websites for my free software:
http://libremail.free.fr and http://libremail.tuxfamily.org
http://cyloop.tuxfamily.org (mainly translated with Apertium)

My general website (in French only)
http://bech.free.fr


