Re: [Apertium-stuff] My GSoC 2017 proposal

2017-04-03 Thread Mikel L. Forcada

Some late critical feedback for Marc (sorry for the delay):

This is a very important project for Apertium in my opinion. While 
Catalan is not an under-resourced language, it may soon be the official 
language of a medium-sized country, and having an open-source 
alternative is clearly desirable for Catalan.


You say: "Apertium now provides an English-Catalan language pair good 
for assimilation". I am not ready to accept this statement in the 
motivation of a proposal. Some of my students, who use apertium-en-ca in 
projects, would probably disagree or make a more cautious statement. 
But, in any case, it would be better to say that it can be greatly 
improved for dissemination purposes (we are excluding interactive 
translation prediction here, where it may even be more useful).


One important problem with en→ca is that rules are very hard to modify. 
The distribution of rules between .t1x and .t2x is not consistent. I would 
advocate a deep study of how the rules are written now (producing 
documentation) and a complete overhaul of the rule base (ensuring no 
regression), before even trying to actually improve them by "adding 
transfer rules" as your proposal suggests.


I see very little discussion of actual structural transfer and CG rule 
problems.


The coverage predictions are adequate if one assumes a Zipfian 
distribution of word frequencies when estimating naïve coverage.


I used this formula on Wolfram Alpha, starting with 35000 entries and a 
coverage of 85.9%:


https://www.wolframalpha.com/input/?i=(0.859%2Fsum(1%2Fk,1,35000))*sum(1%2Fk,1,38000)

You probably did too, as the results are almost identical.
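
In Python, the same extrapolation is just a ratio of harmonic sums (a
quick sketch; the entry counts and the 85.9% figure are the ones from
the query above):

    # Under a Zipf distribution, naive coverage grows like the harmonic
    # number H(n) of the number of dictionary entries n.
    def harmonic(n):
        return sum(1.0 / k for k in range(1, n + 1))

    current_coverage = 0.859
    projected = current_coverage * harmonic(38000) / harmonic(35000)
    print(round(projected, 3))  # about 0.865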

As the distribution is surely not exactly Zipfian (meaning that some 
words more frequent than many words currently in the dictionary may 
still be missing), the actual coverage figures will probably be better.


But, on the other hand, I don't see any justification for the WER 
reduction forecast, unless the "unknown words" component of the current 
WER is clearly quantified. These look like ballpark figures. A WER of 
0.199 would be outstanding for eng→cat.


Why is eng→cat (expected to be) so different from en→ca as regards WER? 
Any justification for such a big drop in WER? What will bring this about?


It would be nice to have a study of how the main commercial rule-based 
system (Lucy) currently performs, to put the figures in perspective. Most 
of the vocabulary in apertium-es-ca actually comes from Lucy (the 
language data are the property of the Generalitat de Catalunya, which 
decided to release them through Apertium in 2007).


However, I find it quite hard to see a GSoC student actually doubling 
the vocabulary of a language pair. You plan to add 35000 words (3000 
stems a week in your proposal). You will work 30 h a week during 12 
weeks. That is 360 h. This means that, if you only added words, you would 
be adding about 100 words per hour. This is of course possible if you 
have free/open-source sources of new vocabulary of that size that can be 
automatically (and legitimately) converted to Apertium format. You have 
to provide compelling evidence that this will actually happen.
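
For reference, the arithmetic:

    hours = 30 * 12       # 30 h/week for 12 weeks = 360 h
    stems = 35000
    print(stems / hours)  # about 97 stems per hour, with no time left for anything else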


By the way, there is no mention of where in your schedule you will work 
the additional hours needed to make up for the exam period.


I hope there is time to improve your proposal along these lines.

Cheers

Mikel


On 30/03/17 at 11:11, Marc Riera Irigoyen wrote:

Hello everyone,

I have been working on my proposal for this year's GSoC and I have 
published a first version of it on the wiki. You can find it here: 
http://wiki.apertium.org/wiki/User:Marcriera/proposal


It would be great to get some feedback about it. The workplan is not 
final, as I am working on the coding challenge and it will be based on 
the results.


Thank you!

Marc Riera

--
*Marc Riera Irigoyen*
Freelance Translator EN/JA>CA/ES
(+34) 652 492 008



--
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776



Re: [Apertium-stuff] My GSoC 2017 proposal

2017-04-02 Thread Xavi Ivars
2017-04-02 9:22 GMT+02:00 Francis Tyers :

>
> Have you done an evaluation of the English tagger? I got the impression
> that it was much worse. I think it might be hard to improve over
> the 96% of the Catalan tagger:
>
> http://wiki.apertium.org/wiki/Comparison_of_part-of-speech_tagging_systems


I agree with Fran that improving the current Catalan tagger may be a bit
hard. However, the apertium-cat tagger performs a bit worse than the "best"
Catalan tagger we had before (apertium-es-ca.ca), mainly due to the addition
of a lot of verbal forms. So retraining it (or maybe adding a few CG rules)
could help not only eng-cat, but other language pairs as well.



> Do you think an MT system with 91% coverage will be useful for
> post-editing ?
>

I know Wikipedia editors are currently using the fra-cat language pair,
which doesn't perform that well either (Hèctor has been doing some
improvements to it).


> What kind of coverage does Google have ? (Try taking a large text and
> seeing
> how many words are untranslated)
>
> Have you thought about using a guesser for "translating" proper names?
>
>
That would make a huge difference, even more so if coverage and WER are
computed against Wikipedia, where there are a lot of proper names.


> [...]

> Are you going to be calculating the coverage over Wikipedia ?
> If so, 20 million tokens is going to take a long time to translate.
>
>
+1. The proposal should make clear what will be used to validate the
improvements implemented.



> I think that it might be better if you focussed just on eng->cat rather
> than cat->eng.
>
>
+1 on focusing on eng->cat, as it makes more sense for
dissemination. However, having something a bit "better" than the current
cat-eng for assimilation purposes would also be beneficial.

-- 
< Xavi Ivars >
< http://xavi.ivars.me >


Re: [Apertium-stuff] My GSoC 2017 proposal

2017-04-02 Thread Francis Tyers
On 2017-04-01 19:59, Marc Riera Irigoyen wrote:
> Thank you very much for your feedback; it has helped me improve my
> proposal.

No problem, I ask some more questions below :)

> I agree with you that the estimates were low, so I have made them higher
> and calculated an estimated coverage goal based on them. I have also
> added more details about the planned tasks to answer your questions (I
> just had not thought enough about the whole process of adding stems
> and rules).
> 
> I think that the English tagger is good enough at present and I
> have not considered training it, but I am aware that the Catalan
> tagger fails more frequently, so I have included some time in my work
> plan to train it (if necessary).


Have you done an evaluation of the English tagger? I got the impression
that it was much worse. I think it might be hard to improve over
the 96% of the Catalan tagger:

http://wiki.apertium.org/wiki/Comparison_of_part-of-speech_tagging_systems

Do you think an MT system with 91% coverage will be useful for
post-editing ?

What kind of coverage does Google have ? (Try taking a large text and 
seeing
how many words are untranslated)
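
On the Apertium side, a rough way to measure this over a large text
(just a sketch: it assumes the eng-cat pair is installed locally,
corpus.txt is a placeholder for a plain-text corpus, and unknown words
come out marked with a leading "*", which is the default; whitespace
tokenisation makes the figure approximate):

    # Rough naive-coverage estimate: translate the corpus and count the
    # tokens that come back marked as unknown ("*word").
    import subprocess

    def naive_coverage(corpus_path, pair="eng-cat"):
        with open(corpus_path, encoding="utf-8") as f:
            text = f.read()
        out = subprocess.run(["apertium", pair], input=text,
                             capture_output=True, text=True).stdout
        tokens = out.split()
        unknown = sum(1 for t in tokens if t.startswith("*"))
        return 1.0 - unknown / len(tokens) if tokens else 0.0

    print(naive_coverage("corpus.txt"))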

Have you thought about using a guesser for "translating" proper names?

What are the principal transfer problems with eng->cat as you see them
now ?

Perhaps give a criticism of this piece of text ?

What other existing resources can you use ? Are there any dictionaries
you can use ? Are you going to be calculating the coverage over
Wikipedia ?
If so, 20 million tokens is going to take a long time to translate.

I think that it might be better if you focussed just on eng->cat rather 
than
cat->eng.


Fran



Re: [Apertium-stuff] My GSoC 2017 proposal

2017-04-01 Thread Marc Riera Irigoyen
Thank you very much for your feedback; it has helped me improve my proposal.

I agree with you that the estimates were low, so I have made them higher and
calculated an estimated coverage goal based on them. I have also added more
details about the planned tasks to answer your questions (I just had not
thought enough about the whole process of adding stems and rules).

I think that the English tagger is good enough at present and I have
not considered training it, but I am aware that the Catalan tagger fails
more frequently, so I have included some time in my work plan to train it
(if necessary). New stems will be added by frequency, and the priority of
new rules will be decided based on error frequency when testing the
corpora. CG may not be necessary at all, but if it turns out I need it, I
will add CG rules the same way.

Regards,

Marc

2017-03-30 15:58 GMT+02:00 Joonas Kylmälä :

> On 3/30/17, Francis Tyers  wrote:
> > There are parallel corpora for English and Catalan, are you planning to
> > learn lexical selection rules ?
>
> Oh wow, I thought this wasn't implemented yet. I found the related
> wiki article:  parallel_and_non-parallel_corpora>.
> Thanks Fran!
>
> -Joonas
>



-- 

*Marc Riera Irigoyen*
Freelance Translator EN/JA>CA/ES

(+34) 652 492 008


Re: [Apertium-stuff] My GSoC 2017 proposal

2017-03-30 Thread Joonas Kylmälä
On 3/30/17, Francis Tyers  wrote:
> There are parallel corpora for English and Catalan, are you planning to
> learn lexical selection rules ?

Oh wow, I thought this wasn't implemented yet. I found the related
wiki article: 
.
Thanks Fran!

-Joonas



Re: [Apertium-stuff] My GSoC 2017 proposal

2017-03-30 Thread Francis Tyers
On 2017-03-30 12:11, Marc Riera Irigoyen wrote:
> Hello everyone,
> 
> I have been working on my proposal for this year's GSoC and I have
> published a first version of it on the wiki. You can find it here:
> http://wiki.apertium.org/wiki/User:Marcriera/proposal
> 
> It would be great to get some feedback about it. The workplan is not
> final, as I am working on the coding challenge and it will be based on
> the results.
> 
> Thank you!

Hey there! I think that your estimates are quite low:

  New stems in bidix (~400 stems a week)
  Additional transfer rules, lexical selection rules and CG (~8-10 rules a week)

Someone working on this task should be looking at more like 1000s of
entries a day... you should be adding them semi-automatically based on
other resources and scripts. I just timed myself translating 100 words 
from the middle of the frequency list from Spanish to English and I did 
it in 3 minutes 20 seconds. I recommend you time yourself doing some of 
the work to come up with more realistic estimates.

There are parallel corpora for English and Catalan, are you planning to 
learn lexical selection rules ?

How are you planning to choose the transfer rules to write ?

What do you think is a good approach for finding which CG rules to
write ? Do you plan to train the perceptron tagger on a large tagged
corpus of English ? If so, how do you plan to do it ?

Hope these questions/suggestions help! :)

Regards,

Fran



[Apertium-stuff] My GSoC 2017 proposal

2017-03-30 Thread Marc Riera Irigoyen
Hello everyone,

I have been working on my proposal for this year's GSoC and I have
published a first version of it on the wiki. You can find it here:
http://wiki.apertium.org/wiki/User:Marcriera/proposal

It would be great to get some feedback about it. The workplan is not final,
as I am working on the coding challenge and it will be based on the results.

Thank you!

Marc Riera

-- 

*Marc Riera Irigoyen*
Freelance Translator EN/JA>CA/ES

(+34) 652 492 008