Hi Hector, Francis,
I've made progress on the coding challenge and wanted your *feedback* on it:
- https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
*(The bin files remained after a `make clean`, so I didn't remove them from
the repo; let me know if this is incorrect.)*

I've attempted to translate the file already added in the original repository
<https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>.
Output file:
<https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>

Right now, I'm fixing the few missing, untranslated, or incorrectly translated
words and focusing on translating a full article that can be compared against
a benchmark (parallel text), using the techniques mentioned in the section on
Building dictionaries <http://wiki.apertium.org/wiki/Building_dictionaries>.
I'll include the WER and coverage details in my proposal.
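For concreteness, here's a minimal sketch of how I plan to compute the two
numbers: word-level WER against a reference translation, and coverage as the
share of output tokens that Apertium does not mark as unknown with a leading
'*'. The reference file name below is just a placeholder:

def wer(hyp, ref):
    """Word error rate: word-level edit distance / number of reference words."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def coverage(output):
    """Share of tokens not marked as unknown ('*word') in Apertium output."""
    tokens = output.split()
    unknown = sum(1 for t in tokens if t.startswith('*'))
    return 1 - unknown / max(len(tokens), 1)

if __name__ == '__main__':
    hyp = open('test_pan.txt', encoding='utf-8').read()       # Apertium output (in the repo)
    ref = open('reference_pan.txt', encoding='utf-8').read()  # placeholder reference file
    print('WER:      %.3f' % wer(hyp, ref))
    print('Coverage: %.3f' % coverage(hyp))

(Nothing fancy; I'd of course cross-check the numbers against the evaluation
tools mentioned on the wiki before quoting them in the proposal.)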
As Hector mentioned last time, I've been able to find some parallel texts and
am asking others to free their resources. I was able to retrieve a good corpus
that is available on request (owned by the tourism department of the state).
Could someone *send me the terms for safely using a corpus*?

Given that both Hindi and Punjabi have phonemic orthographies, could we use
*fuzzy string matching* (simple string mapping in this case) to translate
proper nouns/borrowed words (at least single-word NEs)?
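To make the idea concrete, here is a rough sketch of what I mean. It assumes
the fixed Unicode offset between the Devanagari and Gurmukhi blocks (which
lines up many, but not all, characters) plus a Punjabi word list to
fuzzy-match against; the lexicon and cutoff below are placeholders:

import difflib

OFFSET = 0x0A00 - 0x0900  # Devanagari block starts at U+0900, Gurmukhi at U+0A00

def deva_to_guru(text):
    """Naive transliteration: shift Devanagari code points into the Gurmukhi
    block; characters outside the Devanagari block pass through unchanged."""
    return ''.join(chr(ord(c) + OFFSET) if 0x0900 <= ord(c) <= 0x097F else c
                   for c in text)

def best_match(hindi_name, punjabi_lexicon, cutoff=0.8):
    """Map a Hindi proper noun to the closest entry in a Punjabi word list."""
    guess = deva_to_guru(hindi_name)
    hits = difflib.get_close_matches(guess, punjabi_lexicon, n=1, cutoff=cutoff)
    return hits[0] if hits else guess  # fall back to the raw transliteration

# e.g. best_match('दिल्ली', punjabi_wordlist)  # punjabi_wordlist is hypothetical

The fuzzy-matching step is there because the plain offset mapping alone won't
cover differences in character inventory or spelling conventions between the
two scripts.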
Finally, could you point me to some *resources about the way case markers and
dependencies* are used in the Apertium model? This could be crucial for this
language pair, because most of the POS tagging and chunking revolves around
case markers and dependency relations.

Thank you so much for the support. Have a great day!

Warm regards,
PM

On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font <hectora...@gmail.com>
wrote:

> Hi Priyank,
>
> I calculated the coverage on the Wikipedia dumps I got, which I also used
> for building the frequency lists. I think this is fair, since these corpora
> are enormous. But I calculated WER on the basis of other texts. I calculated
> it only a few times, at fixed project benchmarks, since I needed 2-3 hours
> for it (maybe because I work too slowly). Each time I took 3 pseudo-random
> "good" Wikipedia articles (the featured article of the day and two more) and
> used just the introduction at the beginning of each. This adds up to c.
> 1,000 words. Sometimes I took random front-page news from top newspapers
> (typically sociopolitical). For the final calculation, I took 4-5 short
> texts from both Wikipedia and newspapers (c. 1,500 words). This reflects the
> type of language I aimed at: the idea has been to develop a tool for a more
> or less under-resourced language, especially for helping the creation of
> Wikipedia articles.
>
> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another
> strategy for following the evolution of WER/PER (see
> http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He took a
> reference text for the whole project and automatically tested against it at
> the end of every week. If you use this strategy, you have to be very
> disciplined and not let yourself be influenced by the mistakes you see in
> these tests (that means not adding certain words to the dictionaries, or
> certain morphological disambiguation rules, lexical selection rules, or
> transfer rules, just because of errors detected during these weekly tests).
> I am not really a good example of discipline at work, so I prefer the more
> manual, and more time-consuming, method I have described above.
>
> Currently, I'm preparing my own proposal, and I'm doing the same as you.
> Like yours, my proposal pairs a widely used language, already released in
> Apertium, with a (very) under-resourced language, unreleased in Apertium,
> which needs a lot of work. I have got a test text for both languages and
> I've added the needed words to the dictionaries, so that most of the text is
> translated. It is just a test, because there are still big errors due to the
> lack of transfer rules (although I've copied some useful transfer rules from
> another closely related language pair). I'm currently collecting resources:
> dictionaries, texts in the under-resourced language and bilingual texts (in
> my case, it is not so easy, because the under-resourced language is really
> very under-resourced, there are several competing orthographies, and there
> is very wide dialect variation). I'm also looking at which major transfer
> rules have to be included. In your case, I suppose you'll use a 3-stage
> transfer, so you should plan what will have to be done in each of stages 1,
> 2 and 3. This includes planning which information the chunk headers created
> at stage 1 should carry. I guess the Hindi-Urdu language pair can be a good
> starting point, but maybe something else will need to be added to the
> headers, since Hindi and Urdu are extremely close languages, and Punjabi, as
> far as I know, is not so close to Hindi.
>
> Best,
> Hèctor
>
> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 March 2020
> at 2:44:
>
>> Hi Hector,
>> Thank you so much for the reply. The proposals were really helpful. I've
>> completed the coding challenge for a small set of 10 sentences (for now),
>> which I believe Francis has added to the repo as a test set. I'll include
>> the same in the proposal. For now, I'm working on building the dictionaries
>> using the wiki dumps as mentioned in the documentation, adding the most
>> frequent words systematically.
>> Looking through your proposal, I noticed that you included metrics like
>> WER and coverage to determine progress. I just wanted to confirm whether
>> these are computed against the dumps one downloads for the respective
>> languages (which seems to be the case, judging by how you mention it in
>> your own proposal), or whether there is a separate benchmark. This will be
>> helpful, as I can then describe the current state of the dictionaries in
>> more statistical terms.
>>
>> Finally, is there something else I can do to make my proposal better? Or
>> is it advisable to start working on my proposal or some other
>> non-entry-level project?
>>
>> Thank you for sharing the proposals and the guidance once again.
>> Have a great day!
>>
>> Warm regards,
>> PM
>>
>> --
>> Priyank Modi      ●  Undergrad Research Student
>> IIIT-Hyderabad        ●  Language Technologies Research Center
>> Mobile:  +91 83281 45692
>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>
>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font <hectora...@gmail.com>
>> wrote:
>>
>>> Hi Priyank,
>>>
>>> Hindi-Punjabi seems to me a very nice pair for Apertium. It is usual that
>>> closely related pairs give not very satisfactory results with Google,
>>> because most of the time there is an intermediate translation into
>>> English. In any case, if you can give some data about the quality of the
>>> Google translator (as I did in my 2019 GSoC application
>>> <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>),
>>> it may be useful, I think.
>>>
>>> In order to present an application for language-pair development, you are
>>> required to pass the so-called "coding challenge"
>>> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>.
>>> Basically, this will show that you understand the basics of the
>>> architecture and know how to add new words to the dictionaries.
>>>
>>> For the project itself, you'll need to add many words to the Punjabi and
>>> Punjabi-Hindi dictionaries, as well as transfer rules and lexical
>>> selection rules. If you intend to translate from Punjabi, you'll also need
>>> to work on morphological disambiguation, which needs at least a couple of
>>> weeks of work. This is basic, since plenty of errors in Indo-European
>>> languages (and, I guess, not only there) come from bad morphological
>>> disambiguation. Usually, closed categories are added to the dictionaries
>>> first, and afterwards words are mostly added using frequency lists. If
>>> there are free resources you may use, that would be great, but it is
>>> absolutely necessary not to automatically copy from copyrighted materials.
>>> For my own application this year, I'm asking people to free their
>>> resources in order to be able to use them.
>>>
>>> You may be interested in previous applications for developing language
>>> pairs, for instance this one
>>> <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>,
>>> in addition to mine from last year.
>>>
>>> Best wishes,
>>> Hèctor
>>>
>>>
>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 March 2020
>>> at 23:49:
>>>
>>>> Hi,
>>>> I am trying to work towards developing the Hindi-Punjabi pair and needed
>>>> some guidance on how to go about it. I ran the test files and noticed
>>>> that the dictionary file for Punjabi needs work (even a lot of function
>>>> words could not be found by the translator). Should I start with that?
>>>> Are there tests that each stage needs to pass? Also, what sort of work is
>>>> expected to make a decent GSoC proposal? Of course, I'll be interested in
>>>> developing this pair regardless, since even Google Translate doesn't seem
>>>> to work well for this pair (for the test set specifically, the Apertium
>>>> translator worked significantly better).
>>>> Any help would be appreciated.
>>>>
>>>> Thanks.
>>>>
>>>> Warm regards,
>>>> PM
>>>>
>>>> --
>>>> Priyank Modi       ●  Undergrad Research Student
>>>> IIIT-Hyderabad        ●  Language Technologies Research Center
>>>> Mobile:  +91 83281 45692
>>>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>>>
>>
>>
>> --
>> Priyank Modi       ●  Undergrad Research Student
>> IIIT-Hyderabad        ●  Language Technologies Research Center
>> Mobile:  +91 83281 45692
>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>> <https://www.linkedin.com/in/priyank-modi-81584b175/>


-- 
Priyank Modi      ●  Undergrad Research Student
IIIT-Hyderabad        ●  Language Technologies Research Center
Mobile:  +91 83281 45692
Website <https://priyankmodipm.github.io/>    ●    Linkedin
<https://www.linkedin.com/in/priyank-modi-81584b175/>
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
