Re: [Apertium-stuff] Working around monodix trimming

2020-03-23 Thread Tommi A Pirinen
On Sun, Mar 22, 2020 at 02:58:41PM +0530, Tanmai Khanna wrote:
> Hey Hèctor,
> You're right, the task of creating a SL Lemma + TL morph is not a trivial
> one, and as we discussed on the IRC recently, the task of eliminating
> trimming is an essential first step towards that goal.

Yes, and I like how the example in your initial email clearly shows what
we could be doing when we achieve the elimination of trimming. I also
agree that the actual guessing device is in itself a separate project,
and we can perhaps treat it as a stretch goal of sorts.

> So as for the task for this GSoC, it would probably be to first eliminate
> dictionary trimming, use the source analysis and output the source word
> surface form (as an unknown) instead of source lemma. This would give us
> the benefits of trimming without actually trimming. Then we can set the
> foundations for a morph guessing idea, which can evolve over time - and
> yes, initially it would be an optional module.


Yeah, it sounds good. For this project as well I would recommend taking a
test-driven development approach; it fits the use case well, since we
have a stable code base with a large user base who would not be happy
about any regressions.
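
As a rough illustration of what such a regression test could look like (the
eu-en pair name, the tests/eu-en.tsv file and its format are all invented for
the example; pytest and an installed pair are assumed):

import subprocess

import pytest


def load_cases(path="tests/eu-en.tsv"):
    # Each line of the (hypothetical) file holds "source<TAB>expected translation".
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]


@pytest.mark.parametrize("source,expected", load_cases())
def test_translation_unchanged(source, expected):
    # Run the installed pair through the standard apertium wrapper;
    # the pair name "eu-en" is only an example.
    result = subprocess.run(
        ["apertium", "eu-en"],
        input=source,
        capture_output=True,
        text=True,
        check=True,
    )
    assert result.stdout.strip() == expected

The idea is simply to freeze the current pipeline output as the expected
column and fail on any change while the trimming work goes in.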

I also find the morphology guessing an interesting task. For affixing
languages (and to a certain degree of morph variation) I have earlier
developed a finite-state algorithm for making affix guessers that could
be usable (in the HFST library as guessify / affix-guessify). It's a bit
of a prototype and has issues with efficiency and stability, but the FSA
algebra should be correct, provided the underlying FSA library's handling
of unknown alphabets works.


-- 
Doktor Tommi A Pirinen, Computational Linguist, Universität Hamburg,
Hamburger Zentrum für Sprachkorpora. CLARIN-D developer.
President of ACL SIGUR, the SIG for Uralic languages.
I tend to follow inline-posting style in desktop e-mail messages.


___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Working around monodix trimming

2020-03-22 Thread Tanmai Khanna
Hey Hèctor,
You're right, the task of creating a SL Lemma + TL morph is not a trivial
one, and as we discussed on the IRC recently, the task of eliminating
trimming is an essential first step towards that goal.

So as for the task for this GSoC, it would probably be to first eliminate
dictionary trimming, use the source analysis and output the source word
surface form (as an unknown) instead of source lemma. This would give us
the benefits of trimming without actually trimming. Then we can set the
foundations for a morph guessing idea, which can evolve over time - and
yes, initially it would be an optional module.
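
As a toy illustration of that behaviour (this is not Apertium code; the stream
notation is simplified and the tags and dictionary contents are invented):
keep the analysis for as long as it is useful, and only at bidix-lookup time
fall back to the surface form marked as unknown instead of leaking an @lemma.

def transfer_or_fallback(surface, lemma, tags, bidix):
    # Return a target-side lexical unit in (simplified) stream notation.
    if lemma in bidix:
        return f"^{bidix[lemma]}{tags}$"   # normal path: bidix gives a target lemma
    return f"*{surface}"                   # no entry: emit the surface form as unknown


bidix = {"izara": "sheet"}                 # "izeki" deliberately missing
print(transfer_or_fallback("izarak", "izara", "<n><pl>", bidix))       # ^sheet<n><pl>$
print(transfer_or_fallback("izeki", "izeki", "<vblex><past>", bidix))  # *izeki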

Thanks for your comments.

Tanmai

On Sun, Mar 22, 2020 at 2:38 PM Hèctor Alòs i Font 
wrote:

> I have some comments and questions from a simple Apertium user's point of
> view.
>
> In principle, I find the initial idea very useful: to go beyond trimming,
> maintaining its advantages, but keeping the source language information so
> that it is not lost for the transfer rules. Great!
>
> What I do not see so clearly is the (enormous) complication of inventing a
> word in the target language from more or less regular patterns in the
> target language (and maybe in the source too). If I understand correctly,
> it would be something like if we have, for example, "desaladoras" in
> Spanish, and we don't have this word in the bilingual dictionary. So the
> new module would try to produce something like "*desaladora's", or even
> "*desalators", "*dissalators" or "*dissaltators". And the same would be if
> the source and/or the target language are, say, Lingala, Tamil or Quechua.
> The more the target language has a complex morphology, the harder the task.
>
> This may be interesting, but the problem seems quite different from the
> first one and with a much higher degree of difficulty. I would separate the
> two issues. If we could ensure that we get a reliable tool for the first
> one, it would be very useful. If we also have a prototype for the second,
> which can be activated or not at the developer's discretion, it would be
> perfect.
>
> Hèctor
>
> Message from Mikel L. Forcada  on Sun, 22 March 2020 at 10:25:
>
>> For suffixing or prefixing languages, you could expand the morphological
>> dictionary and use an algorithm such as OSTIA (1) to learn morphological
>> analyses for word endings.
>>
>> Mikel
>>
>> (1) Oncina, J., García, P., Vidal, E., IEEE Trans. Pattern Anal. Mach.
>> Intell. 15(5): 448-458 (1993).
>>
>>
>>
>>
>>
>>
>> On 21 March 2020 at 21:12:16 CET, Tanmai Khanna <
>> khanna.tan...@gmail.com> wrote:
>>>
>>> Guessing the morphology would definitely require some creativity, but
>>> yes a guessing dictionary could be created. As mentioned, it would assign
>>> morphs to morphological analysis in the TL. The easiest (and the most
>>> naive) way to do this might be to take all the entries with that analysis
>>> and find a common substring. It will be more complex for morphemes that
>>> aren't prefix or suffixes or even process morphemes. However, to work
>>> towards a morph analyser that can assign morphs to analyses sounds like a
>>> good goal to work towards, and eliminating dictionary trimming is an
>>> essential step in that direction.
>>>
>>> Tanmai
>>>
>>> On Sat, Mar 21, 2020 at 9:48 PM Mikel L. Forcada  wrote:
>>>
 This looks interesting.

 Note that generating target language morphology may not always be
 possible, unless a "guessing" dictionary is created automatically from both
 the source and target dictionaries. A "guessing" dictionary would try to
 assign a morphological analysis to an unknown word by looking at the
 morphology of known words in the dictionary...

 This would be easy if one could, e.g. match suffixes to morphology in a
 suffixing language.

 Mikel


 On 21/3/20 at 15:37, Tanmai Khanna wrote:

 Hey guys,
 Dictionary trimming is the process of removing those words and their
 analyses from monolingual language models (FSTs compiled from
 monodixes) which don't have an entry in the bidix, to avoid a lot of
 untranslated lemmas (with an @ if debugging) in the output, which lead to
 issues with comprehension and post-editing the output.

 There is a GSoC project
 
 which aims to eliminate this trimming and propose a solution such that you
 don't lose the benefits of dictionary trimming as well. In this email I
 will list a summary of the discussion that has taken place up until now.

 By trimming the dictionary, you throw away valuable analyses of words
 in the source language, which, if preserved, can be used as context for
 lexical selection and analysis of the input. Also, several transfer
 rules don't match as the word is shown as unknown.

 Several solutions are possible for avoiding trimming, some of which
 have been di

Re: [Apertium-stuff] Working around monodix trimming

2020-03-22 Thread Hèctor Alòs i Font
I have some comments and questions from a simple Apertium user's point of
view.

In principle, I find the initial idea very useful: to go beyond trimming,
maintaining its advantages, but keeping the source language information so
that it is not lost for the transfer rules. Great!

What I do not see so clearly is the (enormous) complication of inventing a
word in the target language from more or less regular patterns in the
target language (and maybe in the source too). If I understand correctly,
it would be something like if we have, for example, "desaladoras" in
Spanish, and we don't have this word in the bilingual dictionary. So the
new module would try to produce something like "*desaladora's", or even
"*desalators", "*dissalators" or "*dissaltators". And the same would be if
the source and/or the target language are, say, Lingala, Tamil or Quechua.
The more the target language has a complex morphology, the harder the task.

This may be interesting, but the problem seems quite different from the
first one and with a much higher degree of difficulty. I would separate the
two issues. If we could ensure that we get a reliable tool for the first
one, it would be very useful. If we also have a prototype for the second,
which can be activated or not at the developer's discretion, it would be
perfect.

Hèctor

Message from Mikel L. Forcada  on Sun, 22 March 2020 at 10:25:

> For suffixing or prefixing languages, you could expand the morphological
> dictionary and use an algorithm such as OSTIA (1) to learn morphological
> analyses for word endings.
>
> Mikel
>
> (1) Oncina, J., García, P., Vidal, E., IEEE Trans. Pattern Anal. Mach.
> Intell. 15(5): 448-458 (1993).
>
>
>
>
>
>
> On 21 March 2020 at 21:12:16 CET, Tanmai Khanna  wrote:
>>
>> Guessing the morphology would definitely require some creativity, but yes
>> a guessing dictionary could be created. As mentioned, it would assign
>> morphs to morphological analysis in the TL. The easiest (and the most
>> naive) way to do this might be to take all the entries with that analysis
>> and find a common substring. It will be more complex for morphemes that
>> aren't prefix or suffixes or even process morphemes. However, to work
>> towards a morph analyser that can assign morphs to analyses sounds like a
>> good goal to work towards, and eliminating dictionary trimming is an
>> essential step in that direction.
>>
>> Tanmai
>>
>> On Sat, Mar 21, 2020 at 9:48 PM Mikel L. Forcada  wrote:
>>
>>> This looks interesting.
>>>
>>> Note that generating target language morphology may not always be
>>> possible, unless a "guessing" dictionary is created automatically from both
>>> the source and target dictionaries. A "guessing" dictionary would try to
>>> assign a morphological analysis to an unknown word by looking at the
>>> morphology of known words in the dictionary...
>>>
>>> This would be easy if one could, e.g. match suffixes to morphology in a
>>> suffixing language.
>>>
>>> Mikel
>>>
>>>
>>> On 21/3/20 at 15:37, Tanmai Khanna wrote:
>>>
>>> Hey guys,
>>> Dictionary trimming is the process of removing those words and their
>>> analyses from monolingual language models (FSTs compiled from
>>> monodixes) which don't have an entry in the bidix, to avoid a lot of
>>> untranslated lemmas (with an @ if debugging) in the output, which lead to
>>> issues with comprehension and post-editing the output.
>>>
>>> There is a GSoC project
>>> 
>>> which aims to eliminate this trimming and propose a solution such that you
>>> don't lose the benefits of dictionary trimming as well. In this email I
>>> will list a summary of the discussion that has taken place up until now.
>>>
>>> By trimming the dictionary, you throw away valuable analyses of words in
>>> the source language, which, if preserved, can be used as context for
>>> lexical selection and analysis of the input. Also, several transfer
>>> rules don't match as the word is shown as unknown.
>>>
>>> Several solutions are possible for avoiding trimming, some of which have
>>> been discussed by Unhammer here
>>> . These involve keeping
>>> the surface form of the source word, and the lemma+analysis as well - use
>>> the analysis till you need it in the pipe and then propagate the source
>>> form as an unknown word (like it would be done in trimming).
>>>
>>> Another interesting solution that was discussed was that instead of just
>>> propagating the source surface form, we can output [source-word lemma +
>>> target morphology], as is shown in this example by Mikel:
>>>
>>> Translating from Basque to English:
>>> "Andonik izarak izeki zuen" ('Andoni hung up the sheets') → 'Andoni
>>> *izeki-ed the sheets".
>>>
>>> This might help in comprehensibility of the output, and to some extent
>>> even the post-editability.
>>>
>>> If you have any significant pros, cons, or suggestions to add f

Re: [Apertium-stuff] Working around monodix trimming

2020-03-22 Thread Mikel L. Forcada
For suffixing or prefixing languages, you could expand the morphological 
dictionary and use an algorithm such as OSTIA (1) to learn morphological 
analyses for word endings.

Mikel

(1) Oncina, J., García, P., Vidal, E., IEEE Trans. Pattern Anal. Mach. Intell.
15(5): 448-458 (1993).
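
Not OSTIA itself, but a much simpler sketch of the kind of suffix-to-analysis
mapping such a learner would induce from an expanded (surface form, analysis)
list; the data and the 4-character suffix limit are invented for the example.

from collections import Counter, defaultdict

# Hypothetical expanded-dictionary data: (surface form, analysis) pairs.
expanded = [
    ("cantaba", "cantar<vblex><pii><p3><sg>"),
    ("saltaba", "saltar<vblex><pii><p3><sg>"),
    ("casas", "casa<n><f><pl>"),
    ("mesas", "mesa<n><f><pl>"),
]

MAX_SUFFIX = 4
table = defaultdict(Counter)
for form, analysis in expanded:
    tags = analysis[analysis.index("<"):]      # keep only the tag string
    for k in range(1, min(MAX_SUFFIX, len(form)) + 1):
        table[form[-k:]][tags] += 1            # count analyses per word ending

def guess(form):
    # Most frequent analysis seen for the longest known suffix of the form.
    for k in range(min(MAX_SUFFIX, len(form)), 0, -1):
        if form[-k:] in table:
            return table[form[-k:]].most_common(1)[0][0]
    return None

print(guess("hablaba"))   # -> <vblex><pii><p3><sg>, from the shared "aba" ending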






On 21 March 2020 at 21:12:16 CET, Tanmai Khanna  wrote:
> Guessing the morphology would definitely require some creativity, but yes
> a guessing dictionary could be created. As mentioned, it would assign
> morphs to morphological analysis in the TL. The easiest (and the most
> naive) way to do this might be to take all the entries with that analysis
> and find a common substring. It will be more complex for morphemes that
> aren't prefix or suffixes or even process morphemes. However, to work
> towards a morph analyser that can assign morphs to analyses sounds like a
> good goal to work towards, and eliminating dictionary trimming is an
> essential step in that direction.
>
> Tanmai
>
> On Sat, Mar 21, 2020 at 9:48 PM Mikel L. Forcada  wrote:
>
>> This looks interesting.
>>
>> Note that generating target language morphology may not always be
>> possible, unless a "guessing" dictionary is created automatically from
>> both the source and target dictionaries. A "guessing" dictionary would
>> try to assign a morphological analysis to an unknown word by looking at
>> the morphology of known words in the dictionary...
>>
>> This would be easy if one could, e.g. match suffixes to morphology in a
>> suffixing language.
>>
>> Mikel
>>
>> On 21/3/20 at 15:37, Tanmai Khanna wrote:
>>
>> Hey guys,
>> Dictionary trimming is the process of removing those words and their
>> analyses from monolingual language models (FSTs compiled from monodixes)
>> which don't have an entry in the bidix, to avoid a lot of untranslated
>> lemmas (with an @ if debugging) in the output, which lead to issues with
>> comprehension and post-editing the output.
>>
>> There is a GSoC project
>> which aims to eliminate this trimming and propose a solution such that
>> you don't lose the benefits of dictionary trimming as well. In this email
>> I will list a summary of the discussion that has taken place up until
>> now.
>>
>> By trimming the dictionary, you throw away valuable analyses of words in
>> the source language, which, if preserved, can be used as context for
>> lexical selection and analysis of the input. Also, several transfer rules
>> don't match as the word is shown as unknown.
>>
>> Several solutions are possible for avoiding trimming, some of which have
>> been discussed by Unhammer here. These involve keeping the surface form
>> of the source word, and the lemma+analysis as well - use the analysis
>> till you need it in the pipe and then propagate the source form as an
>> unknown word (like it would be done in trimming).
>>
>> Another interesting solution that was discussed was that instead of just
>> propagating the source surface form, we can output [source-word lemma +
>> target morphology], as is shown in this example by Mikel:
>>
>> Translating from Basque to English:
>> "Andonik izarak izeki zuen" ('Andoni hung up the sheets') → 'Andoni
>> *izeki-ed the sheets".
>>
>> This might help in comprehensibility of the output, and to some extent
>> even the post-editability.
>>
>> If you have any significant pros, cons, or suggestions to add for this
>> project, you're requested to reply to this thread so that if I work on
>> this project, I can do it fully informed.
>>
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> --
>> *Khanna, Tanmai*
>>
>> ___
>> Apertium-stuff mailing list
>> Apertium-stuff@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>
>> --
>> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
>> Departament de Llenguatges i Sistemes Informàtics
>> Universitat d'Alacant
>> E-03690 Sant Vicent del Raspeig
>> Spain
>> Office: +34 96 590 9776
>>
>> ___
>> Apertium-stuff mailing list
>> Apertium-stuff@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>
>
> --
> *Khanna, Tanmai*

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Working around monodix trimming

2020-03-21 Thread Tanmai Khanna
Guessing the morphology would definitely require some creativity, but yes a
guessing dictionary could be created. As mentioned, it would assign morphs
to morphological analysis in the TL. The easiest (and the most naive) way
to do this might be to take all the entries with that analysis and find a
common substring. It will be more complex for morphemes that aren't prefix
or suffixes or even process morphemes. However, to work towards a morph
analyser that can assign morphs to analyses sounds like a good goal to work
towards, and eliminating dictionary trimming is an essential step in that
direction.
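
A naive sketch of that idea, restricted to suffixes (entries and tag names are
invented; prefixes and process morphemes are not handled): group surface forms
by analysis and take the longest common suffix as the candidate morph.

from collections import defaultdict
import os

def common_suffix(words):
    # Longest suffix shared by all words (may be empty).
    return os.path.commonprefix([w[::-1] for w in words])[::-1]

# Invented (surface form, analysis tags) entries standing in for a monodix.
entries = [
    ("walked", "<vblex><past>"),
    ("hanged", "<vblex><past>"),
    ("jumped", "<vblex><past>"),
    ("sheets", "<n><pl>"),
    ("tables", "<n><pl>"),
]

by_tags = defaultdict(list)
for form, tags in entries:
    by_tags[tags].append(form)

morphs = {tags: common_suffix(forms) for tags, forms in by_tags.items()}
print(morphs)   # {'<vblex><past>': 'ed', '<n><pl>': 's'}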

Tanmai

On Sat, Mar 21, 2020 at 9:48 PM Mikel L. Forcada  wrote:

> This looks interesting.
>
> Note that generating target language morphology may not always be
> possible, unless a "guessing" dictionary is created automatically from both
> the source and target dictionaries. A "guessing" dictionary would try to
> assign a morphological analysis to an unknown word by looking at the
> morphology of known words in the dictionary...
>
> This would be easy if one could, e.g. match suffixes to morphology in a
> suffixing language.
>
> Mikel
>
>
> On 21/3/20 at 15:37, Tanmai Khanna wrote:
>
> Hey guys,
> Dictionary trimming is the process of removing those words and their
> analyses from monolingual language models (FSTs compiled from monodixes)
> which don't have an entry in the bidix, to avoid a lot of untranslated
> lemmas (with an @ if debugging) in the output, which lead to issues with
> comprehension and post-editing the output.
>
> There is a GSoC project
> 
> which aims to eliminate this trimming and propose a solution such that you
> don't lose the benefits of dictionary trimming as well. In this email I
> will list a summary of the discussion that has taken place up until now.
>
> By trimming the dictionary, you throw away valuable analyses of words in
> the source language, which, if preserved, can be used as context for
> lexical selection and analysis of the input. Also, several transfer rules
> don't match as the word is shown as unknown.
>
> Several solutions are possible for avoiding trimming, some of which have
> been discussed by Unhammer here
> . These involve keeping
> the surface form of the source word, and the lemma+analysis as well - use
> the analysis till you need it in the pipe and then propagate the source
> form as an unknown word (like it would be done in trimming).
>
> Another interesting solution that was discussed was that instead of just
> propagating the source surface form, we can output [source-word lemma +
> target morphology], as is shown in this example by Mikel:
>
> Translating from Basque to English:
> "Andonik izarak izeki zuen" ('Andoni hung up the sheets') → 'Andoni
> *izeki-ed the sheets".
>
> This might help in comprehensibility of the output, and to some extent
> even the post-editability.
>
> If you have any significant pros, cons, or suggestions to add for this
> project, you're requested to reply to this thread so that if I work on this
> project, I can do it fully informed.
>
> Thanks and Regards,
> Tanmai Khanna
>
> --
> *Khanna, Tanmai*
>
>
> ___
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
> --
> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
> Departament de Llenguatges i Sistemes Informàtics
> Universitat d'Alacant
> E-03690 Sant Vicent del Raspeig
> Spain
> Office: +34 96 590 9776
>
> ___
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>


-- 
*Khanna, Tanmai*
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Working around monodix trimming

2020-03-21 Thread Mikel L. Forcada

This looks interesting.

Note that generating target language morphology may not always be 
possible, unless a "guessing" dictionary is created automatically from 
both the source and target dictionaries. A "guessing" dictionary would 
try to assign a morphological analysis to an unknown word by looking at 
the morphology of known words in the dictionary...


This would be easy if one could, e.g. match suffixes to morphology in a 
suffixing language.
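
A toy version of that suffix-matching idea, with an invented mini-dictionary;
a real guesser would of course also have to guess the lemma and weigh
competing suffixes, none of which is handled here.

# Invented mini-dictionary of (surface form -> analysis).
monodix = {
    "walked": "walk<vblex><past>",
    "sheets": "sheet<n><pl>",
    "tables": "table<n><pl>",
}

def shared_suffix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def guess_analysis(unknown):
    # Copy the tag string of the known form sharing the longest suffix.
    best = max(monodix, key=lambda known: shared_suffix_len(unknown, known))
    tags = monodix[best][monodix[best].index("<"):]   # e.g. "<vblex><past>"
    return unknown + tags    # only the tags are guessed; the lemma is left as-is

print(guess_analysis("hanged"))   # -> hanged<vblex><past>, via the shared "ed"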


Mikel


On 21/3/20 at 15:37, Tanmai Khanna wrote:

Hey guys,
Dictionary trimming is the process of removing those words and their 
analyses from monolingual language models (FSTs compiled from 
monodixes) which don't have an entry in the bidix, to avoid a lot of 
untranslated lemmas (with an @ if debugging) in the output, which lead 
to issues with comprehension and post-editing the output.


There is a GSoC project 
 
which aims to eliminate this trimming and propose a solution such that 
you don't lose the benefits of dictionary trimming as well. In this 
email I will list a summary of the discussion that has taken place up 
until now.


By trimming the dictionary, you throw away valuable analyses of words 
in the source language, which, if preserved, can be used as context 
for lexical selection and analysis of the input. Also, several 
transfer rules don't match as the word is shown as unknown.


Several solutions are possible for avoiding trimming, some of which 
have been discussed by Unhammer here 
. These involve 
keeping the surface form of the source word, and the lemma+analysis as 
well - use the analysis till you need it in the pipe and then 
propagate the source form as an unknown word (like it would be done in 
trimming).


Another interesting solution that was discussed was that instead of 
just propagating the source surface form, we can output [source-word 
lemma + target morphology], as is shown in this example by Mikel:


Translating from Basque to English:
"Andonik izarak izeki zuen" ('Andoni hung up the sheets') → 'Andoni 
*izeki-ed the sheets".


This might help in comprehensibility of the output, and to some extent 
even the post-editability.
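
A toy sketch of how such an output could be assembled once a target morph has
been guessed for the analysis; the morph table, tags and dictionaries here are
invented for illustration and do not reflect any existing Apertium module.

# Invented table of target-language morphs per analysis, e.g. built from the
# target monodix.
TL_MORPHS = {"<vblex><past>": "-ed", "<n><pl>": "-s"}

def target_form(sl_lemma, tags, bidix):
    if sl_lemma in bidix:
        return bidix[sl_lemma]                 # normal translation path
    morph = TL_MORPHS.get(tags, "")
    return "*" + sl_lemma + morph              # e.g. "*izeki-ed"

bidix = {"izara": "sheet"}                     # "izeki" deliberately missing
print(target_form("izeki", "<vblex><past>", bidix))   # *izeki-ed
print(target_form("izara", "<n><pl>", bidix))          # sheet (generation adds the plural)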


If you have any significant pros, cons, or suggestions to add for this 
project, you're requested to reply to this thread so that if I work on 
this project, I can do it fully informed.


Thanks and Regards,
Tanmai Khanna

--
*Khanna, Tanmai*


___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


--
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776

___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Working around monodix trimming

2020-03-21 Thread Scoop Gracie
I think the last solution mentioned sounds best.

On Sat, Mar 21, 2020, 07:38 Tanmai Khanna  wrote:

> Hey guys,
> Dictionary trimming is the process of removing those words and their
> analyses from monolingual language models (FSTs compiled from monodixes)
> which don't have an entry in the bidix, to avoid a lot of untranslated
> lemmas (with an @ if debugging) in the output, which lead to issues with
> comprehension and post-editing the output.
>
> There is a GSoC project
> 
> which aims to eliminate this trimming and propose a solution such that you
> don't lose the benefits of dictionary trimming as well. In this email I
> will list a summary of the discussion that has taken place up until now.
>
> By trimming the dictionary, you throw away valuable analyses of words in
> the source language, which, if preserved, can be used as context for
> lexical selection and analysis of the input. Also, several transfer rules
> don't match as the word is shown as unknown.
>
> Several solutions are possible for avoiding trimming, some of which have
> been discussed by Unhammer here
> . These involve keeping
> the surface form of the source word, and the lemma+analysis as well - use
> the analysis till you need it in the pipe and then propagate the source
> form as an unknown word (like it would be done in trimming).
>
> Another interesting solution that was discussed was that instead of just
> propagating the source surface form, we can output [source-word lemma +
> target morphology], as is shown in this example by Mikel:
>
> Translating from Basque to English:
> "Andonik izarak izeki zuen" ('Andoni hung up the sheets') → 'Andoni
> *izeki-ed the sheets".
>
> This might help in comprehensibility of the output, and to some extent
> even the post-editability.
>
> If you have any significant pros, cons, or suggestions to add for this
> project, you're requested to reply to this thread so that if I work on this
> project, I can do it fully informed.
>
> Thanks and Regards,
> Tanmai Khanna
>
> --
> *Khanna, Tanmai*
> ___
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff