Hi Priyank,

I'll try to look at your coding challenge later, although I'm not sure I'll
be able to read anything :)

With regard to mass-matching words between the two languages, I would
strongly advise against it in the first phase of the project, and I would
use it only very carefully at the end. For good translation quality it is
especially important to be very careful with very frequent words
(particularly closed categories, such as prepositions, conjunctions and
pronouns, but not only those).

I have used word matching (not with fuzzy logic, but with precise rules
for regular changes), but only:

1) after having the 10,000 most frequent words in the bilingual
dictionary;
2) loading a few words at a time (no more than 200, and preferably fewer
than 100);
3) loading words with precise characteristics (of the same grammatical
category, with the same ending or with the same prefix);
4) eyeballing each pair before entering it in the dictionary;
5) testing the results of loading this set of words against a test
corpus. (For each direction, I have a corpus of 10,000 or 20,000 randomly
chosen sentences, automatically translated. When I make changes, whether
in the bilingual dictionary, in morphological disambiguation rules or in
structural transfer rules, I re-translate the corpus and compare it with
the previous translation. The aim is to check that the changes do not
have any unforeseen side effects. When they do, which is not uncommon,
the translation, the rule, etc., must be refined.)
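To make point 3 concrete, the kind of rule-based matching I mean can be
sketched roughly as below. The rule table and the Spanish/Portuguese word
lists are invented illustrations, not my actual data or code:

```python
# Sketch of rule-based word matching: pair candidate cognates from two
# word lists only when they share a part-of-speech tag and differ by a
# known regular correspondence (here, a suffix substitution).
# The rules and word lists are invented Spanish/Portuguese examples.

# Regular changes: source-language ending -> target-language ending
RULES = [("ción", "ção"), ("dad", "dade")]

def match_candidates(src_words, tgt_words):
    """Yield (src, tgt) pairs that follow one of the regular-change rules.

    Each word list holds (form, POS) tuples; a pair is proposed only if
    the POS tags agree. Every proposed pair still needs eyeball checking
    before it is entered in the dictionary.
    """
    tgt_index = set(tgt_words)
    for form, pos in src_words:
        for src_end, tgt_end in RULES:
            if form.endswith(src_end):
                candidate = form[: -len(src_end)] + tgt_end
                if (candidate, pos) in tgt_index:
                    yield form, candidate

pairs = list(match_candidates(
    [("nación", "n"), ("verdad", "n"), ("libro", "n")],
    [("nação", "n"), ("verdade", "n"), ("livro", "n")],
))
# pairs == [("nación", "nação"), ("verdad", "verdade")]
```

Note that "libro"/"livro" is deliberately not matched: it follows no rule
in the table, which is the point of restricting matches to precise,
regular changes.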

But I know that not everyone works as manually as I do...
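For completeness, the corpus regression check in point 5 can be sketched
like this. The `apertium` pipeline invocation and the mode name
"hin-pan" are assumptions about the setup, not a prescription:

```python
# Sketch of the regression check: re-translate the test corpus after
# every change and list the sentences whose translation changed, so that
# unforeseen side effects stand out for manual review.
import subprocess

def translate_corpus(sentences, mode="hin-pan"):
    """Run sentences through an Apertium pipeline (assumed installed)."""
    result = subprocess.run(
        ["apertium", mode],
        input="\n".join(sentences),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def changed_translations(previous, current):
    """Return (index, old, new) for each sentence whose translation changed."""
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(previous, current))
        if old != new
    ]
```

The output of `changed_translations` is then inspected by hand: each
changed sentence is either an intended improvement or a regression to fix.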

Best,
Hèctor

Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 19 March
2020 at 0:29:

> Hi Hector, Francis,
> I've made progress on the coding challenge and wanted your *feedback* on
> it - https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
> (The bin files remained after a `make clean`, so I didn't remove them
> from the repo; let me know if this is incorrect.)
>
> > I've attempted to translate the file already added in the original
> repository
> <https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>
> .
> > Output file
> <https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>
> > Right now, I'm fixing the few missing, untranslated or incorrectly
> translated words and focusing more on translating a full article which can
> be compared against a benchmark (parallel text), using the techniques
> mentioned in the section on Building dictionaries
> <http://wiki.apertium.org/wiki/Building_dictionaries>. I'll mention
> the WER and coverage details in my proposal.
> > As Hector mentioned last time, I've been able to find some parallel
> texts and am asking others to free their resources. I was able to retrieve
> a good corpus available on request (owned by the tourism department of the
> state). Could someone *send me the terms for safely using a corpus*?
> > Given that both Hindi and Punjabi have phonemic orthographies, could we
> use *fuzzy string matching* (simple string mapping in this case) to
> translate proper nouns/borrowed words (at least single-word NEs)?
> > Finally, could you point me to some *resources about the way case
> markers and dependencies* are used in the Apertium model? This
> could be crucial for this language pair, because most of the POS tagging
> and chunking revolves around case markers and dependency relations.
>
> Thank you so much for the support. Have a great day!
>
> Warm regards,
> PM
>
> On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font <hectora...@gmail.com>
> wrote:
>
>> Hi Priyank,
>>
>> I calculated the coverage on the Wikipedia dumps I downloaded, which I
>> also used to get the frequency lists. I think this is fair, since these
>> corpora are enormous. But I calculated WER on the basis of other texts. I
>> calculated it only a few times, at fixed project milestones, since I needed
>> 2-3 hours for it (maybe because I work too slowly). Each time I took 3
>> pseudo-random "good" Wikipedia articles (the featured article of the day
>> and two more) and used just the introduction at the beginning. This adds
>> up to c. 1,000 words. Sometimes I took random front-page news from top
>> newspapers (typically sociopolitical). For the final calculation, I took
>> 4-5 short texts from both Wikipedia and newspapers (c. 1,500 words). This
>> reflects the type of language I aimed at: the idea has been to develop a
>> tool for a more or less under-resourced language, especially to help with
>> the creation of Wikipedia articles.
>>
>> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another
>> strategy for following the evolution of WER/PER (see
>> http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He took a
>> reference text for the whole project and automatically tested against it at
>> the end of every week. If you use this strategy, you have to be very
>> disciplined and not be influenced by the mistakes you see in these tests
>> (this means not adding certain words to the dictionaries, or morphological
>> disambiguation rules, lexical selection rules or transfer rules, just
>> because of errors detected during these weekly tests). I am not really a
>> good example of discipline at work, so I prefer the more manual, and more
>> time-consuming, method that I have described above.
>>
>> Currently, I'm preparing my own proposal, and I'm doing the same as you.
>> Like yours, my proposal includes a widely used language, which is released
>> in Apertium, and a (very) under-resourced language, unreleased in Apertium,
>> which needs a lot of work. I have a test text for both languages and I've
>> added the needed words to the dictionaries, so that most of the text is
>> translated. It is just a test, because there are still big errors due to
>> the lack of transfer rules (although I've copied some useful transfer rules
>> from another closely related language pair). I'm currently collecting
>> resources: dictionaries, texts in the under-resourced language and
>> bilingual texts (in my case, this is not so easy, because the
>> under-resourced language is really very under-resourced, there are several
>> competing orthographies, and there is great dialectal variety). I'm also
>> looking at which major transfer rules have to be included. In your case, I
>> suppose you'll use a 3-stage transfer, so you should plan what will have to
>> be done in each of stages 1, 2 and 3. This includes planning what
>> information the chunk headers created at stage 1 should carry. I guess the
>> Hindi-Urdu language pair can be a good starting point, but maybe something
>> else will need to be added to the headers, since Hindi and Urdu are
>> extremely close languages, and Punjabi, as far as I know, is not so close
>> to Hindi.
>>
>> Best,
>> Hèctor
>>
>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 March
>> 2020 at 2:44:
>>
>>> Hi Hector,
>>> Thank you so much for the reply. The proposals were really helpful. I've
>>> completed the coding challenge for a small set of 10 sentences (for now),
>>> which I believe Francis has added to the repo as a test set. I'll include
>>> the same in the proposal. For now, I'm working on building the
>>> dictionaries using the wiki dumps as mentioned in the documentation,
>>> adding the most frequent words systematically.
>>> Looking through your proposal, I noticed that you included metrics like
>>> WER and coverage to measure progress. I just wanted to confirm whether
>>> these are computed against the dumps one downloads for the respective
>>> languages (which seems to be the case, judging from the way you mention
>>> them in your own proposal), or whether there is some separate benchmark.
>>> This will be helpful, as I can then describe the current state of the
>>> dictionaries in a more statistical manner.
>>>
>>> Finally, is there something else I can do to make my proposal better? Or
>>> is it advisable to start working on my proposal or some other
>>> non-entry-level project?
>>>
>>> Thank you for sharing the proposals and the guidance once again.
>>> Have a great day!
>>>
>>> Warm regards,
>>> PM
>>>
>>> --
>>> Priyank Modi      ●  Undergrad Research Student
>>> IIIT-Hyderabad        ●  Language Technologies Research Center
>>> Mobile:  +91 83281 45692
>>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>>
>>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font <hectora...@gmail.com>
>>> wrote:
>>>
>>>> Hi Priyank,
>>>>
>>>> Hindi-Punjabi seems to me a very nice pair for Apertium. Closely
>>>> related pairs usually give not very satisfactory results with Google,
>>>> because most of the time there is an intermediate translation into
>>>> English. In any case, if you can give some data about the quality of the
>>>> Google translator (as I did in my 2019 GSoC application
>>>> <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>),
>>>> it may be useful, I think.
>>>>
>>>> In order to present an application for language-pair development, you
>>>> are required to pass the so-called "coding challenge"
>>>> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>.
>>>> Basically, this shows that you understand the basics of the architecture
>>>> and know how to add new words to the dictionaries.
>>>>
>>>> For the project itself, you'll need to add many words to the Punjabi
>>>> and Punjabi-Hindi dictionaries, as well as transfer rules and lexical
>>>> selection rules. If you intend to translate from Punjabi, you'll need to
>>>> work on morphological disambiguation, which needs at least a couple of
>>>> weeks of work. This is fundamental, since plenty of errors in
>>>> Indo-European languages (and, I guess, not only in them) come from bad
>>>> morphological disambiguation. Usually, closed categories are added to
>>>> the dictionaries first, and afterwards words are mostly added using
>>>> frequency lists. If there are free resources you can use, that would be
>>>> great, but it is absolutely necessary not to automatically copy from
>>>> copyrighted materials. For my own application this year, I'm asking
>>>> people to free their resources so that I am able to use them.
>>>>
>>>> You may be interested in previous applications for developing language
>>>> pairs, for instance this one
>>>> <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>,
>>>> in addition to mine last year.
>>>>
>>>> Best wishes,
>>>> Hèctor
>>>>
>>>>
>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 March
>>>> 2020 at 23:49:
>>>>
>>>>> Hi,
>>>>> I am trying to work towards developing the Hindi-Punjabi pair and
>>>>> need some guidance on how to go about it. I ran the test files and
>>>>> could see that the dictionary file for Punjabi needs work (even a lot of
>>>>> function words could not be found by the translator). Should I start
>>>>> with that? Are there some tests each stage needs to pass? Also, what
>>>>> sort of work is expected for a decent GSoC proposal? Of course, I'll be
>>>>> interested in developing this pair regardless, since even Google
>>>>> Translate doesn't seem to work well for it (for the test set
>>>>> specifically, the Apertium translator worked significantly better).
>>>>> Any help would be appreciated.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Warm regards,
>>>>> PM
>>>>>
>>>>> --
>>>>> Priyank Modi       ●  Undergrad Research Student
>>>>> IIIT-Hyderabad        ●  Language Technologies Research Center
>>>>> Mobile:  +91 83281 45692
>>>>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>>>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>>>>
>>>>> _______________________________________________
>>>>> Apertium-stuff mailing list
>>>>> Apertium-stuff@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>>
>>>
>>>
>>> --
>>> Priyank Modi       ●  Undergrad Research Student
>>> IIIT-Hyderabad        ●  Language Technologies Research Center
>>> Mobile:  +91 83281 45692
>>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>
>
> --
> Priyank Modi      ●  Undergrad Research Student
> IIIT-Hyderabad        ●  Language Technologies Research Center
> Mobile:  +91 83281 45692
> Website <https://priyankmodipm.github.io/>    ●    Linkedin
> <https://www.linkedin.com/in/priyank-modi-81584b175/>