Re: [Apertium-stuff] About de-duplicating of dictionaries

Kevin Brubeck Unhammer Wed, 28 Mar 2012 02:10:23 -0700

Ilnar Salimzyan
<ilnar.salimz...@gmail.com> writes:

> On Wed, Mar 28, 2012 at 11:11 AM, Kevin Brubeck Unhammer
> <unham...@fsfe.org> wrote:
>> Ilnar Salimzyan
>> <ilnar.salimz...@gmail.com> writes:
>>
>>> This thread grew out of the discussion of my proposal draft [see
>>> "GSoC: Adopting a language pair: Tur-Tat / Kaz-Tat" from March 26].
>>>
>>> Having discussed the problem of monodixes/lexc-files copied in many
>>> pairs (and in more and more pairs) with Jonathan and seeing that
>>> people on IRC raise this question quite often (like "What lexc of
>>> Tatar should I choose for my new Tatar-X translator?"), I decided to
>>> start a new discussion here :)
>>>
>>> On Mon, Mar 26, 2012 at 2:37 PM, Kevin Brubeck Unhammer
>>> <unham...@fsfe.org> wrote:
>>>
>>>> It'd be nice to have some general method for deduplicating
>>>> dictionaries
>>>
>>> I think we all share the same view.
>>>
>>> Obviously, having single transducers for many related languages
>>> compatible with each other is great. It would facilitate creation of
>>> new translators.
>>> And I think that keeping them compatible on the tags/morphotactics
>>> level can and should be done.
>>>
>>>> … We use a trimming script in apertium-sme-nob; with this
>>>> method, you would have apertium-kaz and apertium-tat as just
>>>> "development dependencies". So you'd add stuff to apertium-kaz/kaz.lexc
>>>> and to your bidix, and then run a script from apertium-kaz-tat with the
>>>> path to apertium-kaz and it creates a file apertium-kaz-tat/kaz.lexc
>>>> (and you never change this file, although it's in SVN). Similarly for
>>>> tat.lexc.
>>>>
>>>> This works, as long as the trimming script is well configured, but
>>>> perhaps it'd be 'cleaner' to have apertium-kaz/apertium-tat as "make
>>>> dependencies" and do the trimming each time you type make (no need for
>>>> apertium-kaz-tat to have generated kaz.lexc/tat.lexc files in SVN).
>>>>
>>>> (The weak point in the chain is the trimming script though, which
>>>> expects the lexc files to be fairly easily parsable (they're not,
>>>> really). Ideally we would have ways of trimming both HFST and lttoolbox
>>>> dictionaries so that we never had to copy-paste anything between pairs,
>>>> but language pairs tend to have stuff in them that's rather specific to
>>>> that pair, not sure how that is best dealt with.)
>>>
>>> = Reasons why we have monodixes copied =
>>> 1. Historical (there weren't many pairs having common part initially,
>>> but Apertium keeps growing);
>>> 2. Because of the stuff specific to a given pair.
>>>
>>> = Some imaginable solutions =
>>> Just to sum up:
>>> 1. Transducers for language A and language B as "make-dependencies";
>>> 2. Mono-dictionaries in apertium-langA and apertium-langB as
>>> "development-dependencies" + some trimming / duplicating /
>>> keeping-up-to-date scripts.
>>>
>>> = Strengths and weaknesses of each solution =
>>> Strengths and weaknesses become clear when we 'do' need to add
>>> language-pair-specific stuff to mono-dictionaries.
>>>
>>> All examples that come to mind are for Russian-Tatar (i.e. not related
>>> languages), so for related languages this might not be relevant. Maybe
>>> they won't need any pair-specific-stuff in their mono-dictionaries at
>>> all, but this sounds too good to be true :)
>>>
>>> Consider the Russian word "заговорить" ("start to talk"). It is
>>> translated into Tatar with two words, just as into English. So in a
>>> Russian-Tatar / Russian-English pair we would need to add "start to
>>> talk" as a multiword.
>>>
>>> I am sure that similar cases, when a word of languageA is translated
>>> to languageB with a multiword, can be found for related languages too.
>>>
>>> == 1. Make-dependencies ==
>>> We can add such words to monodictionaries in apertium-langA,
>>> separating them into sublexicons or commenting them like "this stuff
>>> is needed for langA-langB pair".
>>> But this way the transducer will become noisier and noisier.
>>>
>>> == 2. Mono-dictionaries in apertium-langA and apertium-langB as
>>> "development-dependencies" + some trimming / duplicating /
>>> keeping-up-to-date scripts ==
>>> In this case monodictionaries in apertium-langX are considered to be
>>> something like "vanilla software". They are kept close to linguistic
>>> traditions of POS-tagging etc. And they serve as a base for building
>>> new pairs involving these languages.
>>>
>>> Modifying them for a given pair is like patching the vanilla software.
>>> A script could keep these modified versions in apertium-langX-langY
>>> up-to-date with mono-dictionaries in apertium-langX and
>>> apertium-langY.
>>>
>>> A challenge here is not to overwrite modifications while updating.
>>> Although the script used in sme-nob solves the problem of updating,
>>> as I understand it, it will overwrite any modifications made in
>>> apertium-sme-nob. And I am not sure whether preserving them is
>>> technically possible at all.
>>
>> We never modify the trimmed dictionary, we consider it a generated file.
>> All modifications go to the dictionary it was trimmed from.
>>
>> Although we don't, we _could_ actually have sme-nob-specific additions
>> to the sme dictionary. It shouldn't be much worse than concatenating
>> another .lexc file onto the trimmed sme.lexc. Note that this would only
>> be lexicon additions (like "start to talk", good example), not changes
>> to tagging etc.
>>
>> On the other hand, if you're already trimming, it shouldn't hurt to put
>> "start to talk" into the monolingual module (apertium-eng or whatever);
>> if apertium-tat-eng doesn't have
>>    …<r>start<g><b/>to<b/>talk</g><s n="vblex"/></r>…
>> in the bidix, it won't be in the trimmed apertium-tat-eng.eng.dix,
>> even if it is in the trimmed apertium-rus-eng.eng.dix.
>>
>
> That sounds good, my only concern was -- if this monolingual module
> will be used as a basis for building stand-alone applications --
> spellcheckers or whatever (there are big plans to have a transducer
> for most of the Turkic languages, for example), it will
> overgenerate a bit (that's what I meant by "noisiness"). But I don't
> think that anyone who intends to reuse them will be too lazy to cut
> stuff like "start to talk" out of it.
>
> So this seems to be rather a question of taste.


Well, that's a problem anyway with any dictionary from an Apertium
language pair. I'd say a monolingual module would be an improvement for
spelling purposes, since it would have better coverage.
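
To make the trimming idea from the quoted discussion concrete, here is a
minimal Python sketch. It is not the actual apertium-sme-nob script, and
it works on lttoolbox-style bidix XML rather than lexc. The function
names (side_lemma, bidix_lemmas, trim), the toy bidix fragments, the
paradigm names, and the representation of the monolingual dictionary as
a plain lemma-to-paradigm mapping are all assumptions made for
illustration.

```python
import xml.etree.ElementTree as ET

# Toy bidix fragments (assumed data; real bidixes are far larger).
RUS_ENG_BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>заговорить<s n="vblex"/></l><r>start<g><b/>to<b/>talk</g><s n="vblex"/></r></p></e>
<e><p><l>дом<s n="n"/></l><r>house<s n="n"/></r></p></e>
</section></dictionary>"""

TAT_ENG_BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>өй<s n="n"/></l><r>house<s n="n"/></r></p></e>
</section></dictionary>"""

# Hypothetical monolingual module, simplified to lemma -> paradigm name.
ENG_MONO = {
    "house": "house__n",
    "cat": "house__n",
    "start to talk": "start__vblex",
}


def side_lemma(side):
    """Return the lemma of an <l> or <r> element, reading <b/> as a space
    and ignoring <s> tag elements."""
    parts = []

    def walk(node):
        if node.tag == "b":
            parts.append(" ")
        if node.tag != "s" and node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(side)
    return "".join(parts)


def bidix_lemmas(bidix_xml, side):
    """Collect all lemmas on one side ("l" or "r") of a bidix string."""
    root = ET.fromstring(bidix_xml)
    return {side_lemma(el) for el in root.iter(side)}


def trim(mono, lemmas):
    """Keep only the monolingual entries whose lemma occurs in the bidix."""
    return {lemma: par for lemma, par in mono.items() if lemma in lemmas}


if __name__ == "__main__":
    for name, bidix in (("rus-eng", RUS_ENG_BIDIX), ("tat-eng", TAT_ENG_BIDIX)):
        trimmed = trim(ENG_MONO, bidix_lemmas(bidix, "r"))
        print(name, sorted(trimmed))
```

With these toy inputs, "start to talk" survives trimming only for the
pair whose bidix contains it, so the shared monolingual module can carry
the multiword without it leaking into the other pair.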


-Kevin


