Thank you very much for your suggestions. I have made some changes just in
time, before sending the proposal.

I completely agree about the importance of the English-Catalan language
pair and the need for improvements. While I first thought about focusing on
new stems (that is the reason behind the big numbers), I now think that
this could be an opportunity to rework the "invisible" operations that
Apertium does. I have noticed the lack of organization in the transfer rule
files and I have not been able to work more on my coding challenge mostly
due to this (adding any new rule starts breaking others). For this reason I
have changed my mind and I will focus on reworking the current rules
instead of just adding thousands of words. I like the idea of writing
documentation about rules; it will surely help future development.

The WER figures included unknown words, but I agree that the previous
progression was simply impossible to achieve. I have now modified both the
WER and the coverage figures to reflect the final plan. Other related
things such as the Wikipedia coverage calculation are now better explained. I
know that ~89% coverage may be too low for this kind of proposal, but if
the language pair gets better organization, rules and documentation, that
should compensate for it in the end.

Regards,

Marc


2017-04-03 10:57 GMT+02:00 Mikel L. Forcada <m...@dlsi.ua.es>:

> Late critical feedback for Marc (sorry for being so late):
>
> This is a very important project for Apertium in my opinion. While Catalan
> is not an under-resourced language, it may soon be the official language of
> a medium-sized country, and having an open-source alternative is clearly a
> desirable situation for Catalan.
>
> You say: "Apertium now provides an English-Catalan language pair good for
> assimilation". I am not ready to accept this statement in the motivation of
> a proposal. Some of my students, who use apertium-en-ca in projects, would
> probably disagree or make a more cautious statement. But, in any case, it
> would be better to say that it can be greatly improved for dissemination
> purposes (we are excluding interactive translation prediction here, where
> it may even be more useful).
>
> One important problem with en→ca is that rules are very hard to modify.
> The distribution of rules in .t1x and .t2x is not consistent. I would
> advocate for a deep study of how rules are made now (producing
> documentation) and a complete overhaul of the rule base (ensuring no
> regression), before even trying to actually improve them by "adding
> transfer rules" as your proposal suggests.
>
> I see very little discussion of actual structural transfer and CG rule
> problems.
>
> The coverage predictions are adequate if one assumes a Zipfian
> distribution of naïve coverage.
>
> I used this formula on Wolfram Alpha, starting with 35000 entries and a
> coverage of 85.9%:
>
> https://www.wolframalpha.com/input/?i=(0.859%2Fsum(1%2Fk,1,35000))*sum(1%2Fk,1,38000)
>
> You probably did too, as the results are almost identical.
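The estimate above can be reproduced with a short Python sketch. It assumes the Wolfram Alpha query is read as the current coverage (0.859) scaled by the ratio of two harmonic sums, one over 35000 entries and one over 38000 entries. The function names `harmonic` and `zipf_coverage` are illustrative only and are not part of any Apertium tool.

```python
def harmonic(n: int) -> float:
    """Return the n-th harmonic number, sum of 1/k for k = 1..n."""
    return sum(1.0 / k for k in range(1, n + 1))


def zipf_coverage(current_coverage: float,
                  current_entries: int,
                  new_entries: int) -> float:
    """Scale a known coverage to a larger dictionary size.

    Assumption: word frequencies are Zipfian (proportional to 1/rank) and
    the dictionary always holds the most frequent words, so coverage grows
    in proportion to the harmonic number of the entry count.
    """
    return current_coverage / harmonic(current_entries) * harmonic(new_entries)


# Figures taken from the message: 35000 entries at 85.9% coverage,
# extrapolated to 38000 entries.
estimate = zipf_coverage(0.859, 35000, 38000)
print(f"Estimated coverage with 38000 entries: {estimate:.4f}")
```

Under these assumptions the extrapolated coverage comes out at roughly 0.865, which shows how slowly naïve coverage grows once the most frequent words are already in the dictionary.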
>
> As the distribution is surely not Zipfian (meaning that some words that
> are more probable than many words currently in the dictionary may still be
> missing), the coverage figures will probably be better.
>
> But, on the other hand, I don't see any justification for the WER
> reduction forecast, unless the "unknown words" component of the current WER
> is clearly quantified. Looks like ballpark figures. A WER of 0.199 would be
> outstanding for eng→cat.
>
> Why is eng→cat (expected to be) so different from en→ca as regards WER?
> Any justification for such a big drop in WER? What will bring this about?
>
> It would be nice to have a study on how the main commercial rule-based
> system (Lucy) currently does, to put the figures in perspective. Most of
> the vocabulary in apertium-es-ca actually comes from Lucy (the language
> data are property of the Generalitat de Catalunya, who decided to release
> it through Apertium in 2007).
>
> I find it, however, quite hard to see a GSoC student actually doubling
> the vocabulary of a language pair. You plan to add 35000 words (3000 stems
> a week in your proposal). You will work 30 h a week for 12 weeks. That
> is 360 h. This means that if you only added words, you would be adding 100
> words per hour. This is of course possible if you have free/open-source
> sources of new vocabulary of that size that can be automatically (and
> legitimately) converted to Apertium format. You have to provide compelling
> evidence that this will actually happen.
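The workload arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the message (30 h a week, 12 weeks, 35000 words); the variable names are illustrative.

```python
# Figures quoted in the message.
hours_per_week = 30
weeks = 12
words_to_add = 35000

# Total working time over the programme.
total_hours = hours_per_week * weeks

# Rate needed if every working hour went into adding words.
words_per_hour = words_to_add / total_hours

print(f"Total hours: {total_hours}")
print(f"Required rate: {words_per_hour:.1f} words per hour")
```

The exact rate is a little under the rounded figure of 100 words per hour given in the message, which does not change the conclusion: the pace is only realistic with automatic conversion of existing vocabulary sources.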
>
> By the way, there is no mention of where you will work additional hours to
> make up for the exam period.
>
> I hope there is time to improve your proposal along these lines.
>
> Cheers
>
> Mikel
>
>
> On 30/03/17 at 11:11, Marc Riera Irigoyen wrote:
>
> Hello everyone,
>
> I have been working on my proposal for this year's GSoC and I have
> published a first version of it on the wiki. You can find it here:
> http://wiki.apertium.org/wiki/User:Marcriera/proposal
>
> It would be great to get some feedback about it. The workplan is not
> final, as I am working on the coding challenge and it will be based on the
> results.
>
> Thank you!
>
> Marc Riera
>
> --
>
> *Marc Riera Irigoyen*
> Freelance Translator EN/JA>CA/ES
>
> (+34) 652 492 008
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
>
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
>
> --
> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
> Departament de Llenguatges i Sistemes Informàtics
> Universitat d'Alacant
> E-03690 Sant Vicent del Raspeig
> Spain
> Office: +34 96 590 9776
>
>
>
>


-- 

*Marc Riera Irigoyen*
Freelance Translator EN/JA>CA/ES

(+34) 652 492 008
