Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-27 Thread Priyank Modi
Hi all,
I've completed the preliminary draft of my proposal and would really
appreciate your comments/suggestions on it:
http://wiki.apertium.org/wiki/Pmodi/GSOC_2020_proposal:_Hindi-Punjabi

Francis (first of all, sorry for cc'ing you personally), since you have been
managing the repo, could you review my coding challenge? (I believe you know
the script.)

Warm Regards,
Priyank Modi

Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-21 Thread Hèctor Alòs i Font
Hi Priyank,

Yes, I now see that the Hindi गलत__adj paradigm is like this, and the
Punjabi ਗਲਤ__adj seems to be a copy of it.
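
In the monodix, by the way, an entry just points at such a paradigm, so
every word attached to it inherits the paradigm's full set of analyses. A
sketch of what such an entry looks like (the entry itself is hypothetical;
only the paradigm name comes from the actual dictionary):

<!-- Hypothetical monodix entry: any adjective attached to the -->
<!-- ਗਲਤ__adj paradigm inherits all of its analyses. -->
<e lm="ਗਲਤ"><i>ਗਲਤ</i><par n="ਗਲਤ__adj"/></e>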

I can only say that we do it differently in the Romance languages I work
with. I can't say that the "Hindi method" is bad: it works for Hindi-Urdu,
doesn't it? It makes morphological disambiguation harder, but transfer is
probably easier.

I agree with you that, since apertium-urd-hin is released, apertium-hin
should be quite reliable, so you should concentrate on Punjabi.
Nevertheless, in my experience it is not unusual that a language package
with just one released pair needs some improvement too. This happens
especially in cases like Urdu-Hindi, where the paired language is an
extremely closely related one. For instance, if morphological disambiguation
is only superficially done, there won't be any problem for a translation
into Urdu, because almost all the time the same ambiguity will exist in Urdu
too. But when translating into a less closely related language, problems
arise, and more work on disambiguation has to be done.
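
To give an idea of what such disambiguation rules look like: in the
constraint grammar (.rlx) files Apertium uses for this, a first rule might
be something like the sketch below. This is purely illustrative; the tags
follow the usual Apertium conventions (adj, n, prn, vblex), but the rules
themselves are assumptions, not taken from either language package.

# Hypothetical rules for an adjective/pronoun-ambiguous form:
REMOVE (adj) IF (NOT 1 (n)) ;   # drop the adjective reading when no noun follows
SELECT (prn) IF (1 (vblex)) ;   # prefer the pronoun reading before a verb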

Best,
Hèctor

Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-20 Thread Priyank Modi
>
> By the way, it seems strange that you have 9 analyses for this adjective.
> Usually in these cases we put only the first analysis in the dictionary.
> The others, if really needed, can be added as .


Regarding this, I found a number of such anomalies in the Hindi monodix and
tried to resolve some of them by asking mentors on IRC. But since Urdu-Hindi
is a released pair (and hence the Hindi monodix should have been reviewed),
I have added similar rules in the Punjabi monodix as well. This will have to
be fixed in the final version. Following your suggestion, I'll keep a list
of (possible) errors I find in the current hin and hin-pan dictionaries and
report them in the proposal. This will also help me get quick feedback on
most of them, so that I can at least bring the Hindi monodix up to a
reviewed and correct state between proposal submission and the acceptance
period. :D

Does this look good?
Thanks.

Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-20 Thread Priyank Modi
Hi Hector,
Thank you so much for taking the time to look at my challenge in detail and
providing feedback. I understand this error and will work on removing all
'#' symbols in the final submission of my coding challenge. To start with,
the number of '#'s was at least 3-4 times what it is now. Quite a few still
exist because these words were already added to the bidix while the monodix
for Punjabi was almost empty when I started off (you can check the original
repo in the incubator).
Anyway, this has been really helpful and I'll make sure to improve on it.
Since you couldn't read the script, I should tell you that I'm able to
achieve close-to-human translation for most of these test sentences (as
said earlier, I'll be including an analysis in my proposal explaining the
translations in IPA, which I'll need your help in reviewing as well 😬).

I was able to find some dictionaries and parallel texts for both languages.
Is there anything else I can do right now? Could you help me with some
references on the use of case markers during translation as well? :)

Thank you again.

Warm regards,
Priyank


Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-20 Thread Hèctor Alòs i Font
Hi Priyank,

I've been looking at your coding challenge. I can't understand anything, but
I see the symbol # relatively often. That is annoying. See:
http://wiki.apertium.org/wiki/Apertium_stream_format#Special

This happens, for instance, when in the bidix the target word has a given
gender and/or case, but in the monodix it has a different one. The lemma is
recognized, but there is no information for generating the surface form as
it comes out of bidix + transfer.

Using apertium-viewer, I analysed this case:

सब
^सब/सब/सब/सब/सब/सब/सब/सब/सब/सब/सब/सब$

^सब/सब/सब$
^सब$
^सब/ਸਭ$
^default{^ਸਭ$}$
^ਸਭ$
#ਸਭ

As expected, the problem is that ^ਸਭ$ cannot be
generated.

Then I do:
apertium-pan$ echo "ਸਭ" | apertium -d . pan_Guru-disam
"<ਸਭ>"
"ਸਭ" adj mfn sp
"ਸਭ" adj m sg nom
"ਸਭ" adj m sg obl
"ਸਭ" adj m pl nom
"ਸਭ" adj m pl obl
"ਸਭ" adj f sg nom
"ਸਭ" adj f sg obl
"ਸਭ" adj f pl nom
"ਸਭ" adj f pl obl
"<.>"
"." sent

So that's the problem: the bidix says that ਸਭ is a pronoun, but the monodix
defines it as an adjective.
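
The fix is to make the two sides agree on the part of speech. A minimal
bidix entry of that kind might look like the sketch below (illustrative
only: whether both sides should end up as prn or as adj is exactly the
decision to be made; the XML just follows the usual .dix conventions):

<!-- Hypothetical bidix entry with the part of speech aligned on both -->
<!-- sides, so that the generator can produce the Punjabi surface form. -->
<e><p><l>सब<s n="prn"/></l><r>ਸਭ<s n="prn"/></r></p></e>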

By the way, it seems strange that you have 9 analyses for this adjective.
Usually in these cases we put only the first analysis in the dictionary.
The others, if really needed, can be added as .
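
That is, presumably, entries restricted to analysis only. A sketch of what
that might look like, with the r="LR" restriction meaning the form is
analysed but never generated (the lemma and the second paradigm name here
are placeholders, not from the actual dictionaries):

<!-- Hypothetical: one full entry used for generation, plus an -->
<!-- analysis-only variant (r="LR") for the remaining analyses. -->
<e lm="ਸਭ"><i>ਸਭ</i><par n="ਗਲਤ__adj"/></e>
<e lm="ਸਭ" r="LR"><i>ਸਭ</i><par n="..."/></e>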

Best,
Hèctor


Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-19 Thread Hèctor Alòs i Font
Hi Priyank,

I'll try to look at your coding challenge later, although I'm not sure I'll
be able to read anything :)

With regard to mass word-matching techniques between the two languages, I
would strongly advise against them in the first phase of the project, and
would use them very carefully at the end. To get good translation quality
it is especially important to be very careful with very frequent words
(particularly for closed categories, such as prepositions, conjunctions and
pronouns, but not only with them).

I have used word matching (without any fuzzy logic, only precise rules of
regular changes), but only:
1) after having the 10,000 most frequent words in the bilingual dictionary;
2) loading a few words at a time (no more than 200, and better fewer than
100);
3) loading words with precise shared characteristics (the same grammatical
category, the same ending or the same prefix);
4) visually checking each pair before entering it in the dictionary;
5) testing the result of loading this set of words against a test corpus.
(For each direction, I have a corpus of 10,000 or 20,000 randomly chosen
sentences, automatically translated. When I make changes, whether in the
bilingual dictionary, in morphological disambiguation rules or in structural
transfer rules, I re-translate the corpus and compare it with the previous
translation, roughly as sketched below. The aim is to check that the changes
do not have unforeseen side effects. If they do, which is not uncommon, the
translation, the rule, etc. must be refined.)
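
The check itself is mechanical; with made-up file names, it is roughly:

# Hypothetical regression check after a change to the pair:
apertium -d . hin-pan < corpus.hin.txt > corpus.pan.new
diff corpus.pan.old corpus.pan.new | less    # inspect the differences
mv corpus.pan.new corpus.pan.old             # accept the new baseline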

But I know that not everyone works as manually as I do...

Best,
Hèctor

Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-18 Thread Priyank Modi
Hi Hector, Francis;
I've made progress on the coding challenge and wanted your feedback on it:
https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
(The .bin files remained after a `make clean`, so I didn't remove them from
the repo; let me know if this is incorrect.)

> I've attempted to translate the file already added in the original
repository.
> Output file
> Right now, I'm fixing the few missing, untranslated or incorrectly
translated words, and focusing on translating a full article which can be
compared against a benchmark (parallel text), using the techniques mentioned
in the section on Building dictionaries. I'll be mentioning the WER and
coverage details in my proposal.
> As Hector mentioned last time, I've been able to find some parallel texts
and am asking others to free their resources. I was able to retrieve a good
corpus available on request (owned by the state's tourism department). Could
someone send me the terms for safely using a corpus?
> Given that both Hindi and Punjabi have phonemic orthography, could we use
fuzzy string matching (simple string mapping in this case) to translate
proper nouns/borrowed words (at least single-word NEs)?
> Finally, could you point me to some resources about the way case markers
and dependencies are used in the Apertium model? This could be crucial for
this language pair, because most of the POS tagging and chunking revolves
around the case markers and dependency relations.

Thank you so much for the support. Have a great day!

Warm regards,
PM

Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-11 Thread Hèctor Alòs i Font
Hi Priyank,

I calculated coverage on the Wikipedia dumps I downloaded, which I also used
for the frequency lists. I think this is fair, since these corpora are
enormous. But I calculated WER on the basis of other texts. I calculated it
only a few times, at fixed project benchmarks, since I needed 2-3 hours for
it (maybe because I work too slowly). Each time I took 3 pseudo-random
"good" Wikipedia articles (the featured article of the day and two more) and
just used the introduction at the beginning, which adds up to c. 1,000
words. Sometimes I took random front-page news from top newspapers
(typically sociopolitical). For the final calculation, I took 4-5 short
texts from both Wikipedia and newspapers (c. 1,500 words). This reflects the
type of language I was aiming at: the idea has been to develop a tool for a
more or less under-resourced language, especially to help the creation of
Wikipedia articles.
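
For concreteness, both measurements are mechanical. A rough sketch, with
made-up file names (unknown words carry a '*' mark in the Apertium stream,
and the apertium-eval-translator script compares an output against a
hand-made reference translation):

# Naive coverage: share of tokens the analyser knows.
apertium-destxt < wiki.pan.txt | lt-proc pan.automorf.bin > analysed.txt
total=$(grep -o '\^' analysed.txt | wc -l)      # one '^' per lexical unit
unknown=$(grep -o '/\*' analysed.txt | wc -l)   # unknown-word marks
echo "coverage: $(echo "scale=4; 1 - $unknown / $total" | bc)"

# WER of the pair's output against a reference translation:
apertium -d . hin-pan < test.hin.txt > test.pan.mt
apertium-eval-translator -test test.pan.mt -ref test.pan.ref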

@Marc Riera Irigoyen has used another strategy for following the evolution
of WER/PER (see http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan).
He picked a reference text for the whole project and automatically tested
against it at the end of every week. If you use this strategy, you have to
be very disciplined and not be influenced by the mistakes you see in these
tests (this means not adding certain words to the dictionaries, or
morphological disambiguation rules, lexical selection rules or transfer
rules, just because of errors detected during these weekly tests). I am not
really a good example of discipline at work, so I prefer the more manual,
and more time-consuming, method that I have described above.

Currently, I'm preparing my own proposal, and I'm doing the same as you.
Like yours, my proposal pairs a widely used language, which is released in
Apertium, with a (very) under-resourced language, unreleased in Apertium,
which needs a lot of work. I have got a test text for both languages and
I've added the needed words to the dictionaries, so that most of the text is
translated. It is just a test, because there are still big errors due to the
lack of transfer rules (although I've copied some useful transfer rules from
another closely related language pair). I'm currently collecting resources:
dictionaries, texts in the under-resourced language and bilingual texts (in
my case, this is not so easy, because the under-resourced language is really
very under-resourced, there are several competing orthographies, and there
is very wide dialectal variety). I'm also seeing which major transfer rules
have to be included. In your case, I suppose you'll use a 3-stage transfer,
so you should plan what will have to be done in each of stages 1, 2 and 3.
This includes planning what information the chunk headers created at stage 1
should carry. I guess the Hindi-Urdu language pair can be a good starting
point, but maybe something else will need to be added in the headers, since
Hindi and Urdu are extremely close languages, and Punjabi, as far as I know,
is not so close to Hindi.

Best,
Hèctor

Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-11 Thread Priyank Modi
Hi Hector,
Thank you so much for the reply. The proposals were really helpful. I've
completed the coding challenge for a small set of 10 sentences (for now),
which I believe Francis has added to the repo as a test set; I'll include it
in the proposal. For now, I'm working on building the dictionaries using the
wiki dumps as mentioned in the documentation, adding the most frequent words
systematically.
Looking through your proposal, I noticed that you included metrics like WER
and coverage to measure progress. I just wanted to confirm whether these are
computed against the dumps one downloads for the respective languages (which
seems to be the case, judging by the way you described it in your own
proposal), or whether there is some separate benchmark. This will be
helpful, as I can then describe the current state of the dictionaries in a
more statistical manner.

Finally, is there something else I can do to make my proposal better? Or is
it advisable to start working on my proposal or some other non-entry-level
project?

Thank you for sharing the proposals and the guidance once again.
Have a great day!

Warm regards,
PM

-- 
Priyank Modi ● Undergrad Research Student
IIIT-Hyderabad ● Language Technologies Research Center
Mobile: +91 83281 45692
Website ● LinkedIn


Re: [Apertium-stuff] Guidance for hin-pan language pair development

2020-03-06 Thread Hèctor Alòs i Font
Hi Priyank,

Hindi-Punjabi seems to me a very nice pair for Apertium. It is usual that
closely related pairs give not very satisfactory results with Google,
because most of the time there is an intermediate translation into English.
In any case, if you can give some data about the quality of the Google
translator (as I did in my 2019 GSoC application), it may be useful, I
think.

In order to present an application for language-pair development, it is
required to pass the so-called "coding challenge". Basically, this shows
that you understand the basics of the architecture and know how to add new
words to the dictionaries.

For the project itself, you'll need to add many words to the Punjabi and
Punjabi-Hindi dictionaries, plus transfer rules and lexical selection rules.
If you intend to translate from Punjabi, you'll need to work on
morphological disambiguation, which needs at least a couple of weeks of
work. This is basic, since plenty of errors in Indo-European languages (and,
I guess, not only in them) come from bad morphological disambiguation.
Usually, closed categories are added to the dictionaries first, and
afterwards words are mostly added using frequency lists. If there are free
resources you may use, this would be great, but it is absolutely necessary
not to automatically copy from copyrighted materials. For my own application
this year, I'm asking people to free their resources in order to be able to
use them.
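
The frequency lists themselves are cheap to produce; a typical recipe, with
made-up file names, is something like:

# Word frequency list over a raw corpus, most frequent words first:
tr -s '[:space:]' '\n' < pan.corpus.txt | sort | uniq -c | sort -nr > pan.freq.txt

The real work is going through the list top-down and adding the words
properly.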

You may be interested in previous applications for developing language
pairs, for instance this one, in addition to mine from last year.

Best wishes,
Hèctor


Message from Priyank Modi on Fri, 6 March 2020 at 23:49:

> Hi,
> I am trying to work towards developing the Hindi-Punjabi pair and needed
> some guidance on how to go about it. I ran the test files and noticed that
> the dictionary file for Punjabi needs work (even a lot of function words
> could not be found by the translator). Should I start with that? Are there
> some tests each stage needs to pass? Also, what sort of work is expected
> to make a decent GSoC proposal? Of course, I'll be interested in
> developing this pair regardless, since even Google Translate doesn't seem
> to work well for it (on the test set specifically, the Apertium translator
> worked significantly better).
> Any help would be appreciated.
>
> Thanks.
>
> Warm regards,
> PM
>
> --
> Priyank Modi ● Undergrad Research Student
> IIIT-Hyderabad ● Language Technologies Research Center
> Mobile: +91 83281 45692
> Website ● LinkedIn
> 
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff