Re: [Apertium-stuff] About de-duplicating of dictionaries

Kevin Brubeck Unhammer Wed, 28 Mar 2012 00:12:44 -0700

Ilnar Salimzyan
<ilnar.salimz...@gmail.com> writes:

> This thread grew out of the discussion of my proposal draft [see
> "GSoC: Adopting a language pair: Tur-Tat / Kaz-Tat" from March 26].
>
> Having discussed the problem of monodixes/lexc files being copied into
> many pairs (and into more and more of them) with Jonathan, and seeing
> that people on IRC raise this question quite often (like "What lexc of
> Tatar should I choose for my new Tatar-X translator?"), I decided to
> start a new discussion here :)
>
> On Mon, Mar 26, 2012 at 2:37 PM, Kevin Brubeck Unhammer
> <unham...@fsfe.org> wrote:
>
>> It'd be nice to have some general method for deduplicating
>> dictionaries
>
> I think we all share the same view.
>
> It is obvious that having single transducers for many related
> languages, compatible with each other, would be great. It would
> facilitate the creation of new translators.
> And I think that keeping them compatible at the tags/morphotactics
> level can and should be done.
>
>>>… We use a trimming script in apertium-sme-nob; with this
>> method, you would have apertium-kaz and apertium-tat as just
>> "development dependencies". So you'd add stuff to apertium-kaz/kaz.lexc
>> and to your bidix, and then run a script from apertium-kaz-tat with the
>> path to apertium-kaz and it creates a file apertium-kaz-tat/kaz.lexc
>> (and you never change this file, although it's in SVN). Similarly for
>> tat.lexc.
>>
>> This works, as long as the trimming script is well configured, but
>> perhaps it'd be 'cleaner' to have apertium-kaz/apertium-tat as "make
>> dependencies" and do the trimming each time you type make (no need for
>> apertium-kaz-tat to have generated kaz.lexc/tat.lexc files in SVN).
>>
>> (The weak point in the chain is the trimming script though, which
>> expects the lexc files to be fairly easily parsable (they're not,
>> really). Ideally we would have ways of trimming both HFST and lttoolbox
>> dictionaries so that we never had to copy-paste anything between pairs,
>> but language pairs tend to have stuff in them that's rather specific to
>> that pair, not sure how that is best dealt with.)
>
> = Reasons why we have monodixes copied =
> 1. Historical (there weren't many pairs having common part initially,
> but Apertium keeps growing);
> 2. Because of the stuff specific to a given pair.
>
> = Some imaginable solutions =
> Just to sum up:
> 1. Transducers for language A and Language B as "make-dependencies";
> 2. Mono-dictionaries in apertium-langA and apertium-langB as
> "development-dependencies" + some trimming / duplicating /
> keeping-up-to-date scripts.
>
> = Strengths and weaknesses of each solution =
> Strengths and weaknesses become clear when we 'do' need to add
> language-pair-specific stuff to mono-dictionaries.
>
> All the examples that come to mind are for Russian-Tatar (i.e. unrelated
> languages), so for related languages this might not be relevant. Maybe
> they won't need any pair-specific stuff in their mono-dictionaries at
> all, but this sounds too good to be true :)
>
> Consider the Russian word "заговорить" ("start to talk"). It is
> translated into Tatar with two words, much as it is into English. So
> in a Russian-Tatar / Russian-English pair we will need to add "start
> to talk" as a multiword.
>
> I am sure that similar cases, when a word of languageA is translated
> to languageB with a multiword, can be found for related languages too.
>
> == 1. Make-dependencies ==
> We can add such words to the monodictionaries in apertium-langA,
> separating them into sublexicons or commenting them like "this stuff
> is needed for the langA-langB pair".
> But this way the transducer will become noisier and noisier.
>
> == 2. Mono-dictionaries in apertium-langA and apertium-langB as
> "development-dependencies" + some trimming / duplicating /
> keeping-up-to-date scripts ==
> In this case the monodictionaries in apertium-langX are considered to
> be something like "vanilla software". They are kept close to the
> linguistic traditions of POS-tagging etc., and they serve as a base for
> building new pairs involving these languages.
>
> Modifying them for a given pair is like patching the vanilla software.
> A script could keep these modified versions in apertium-langX-langY
> up-to-date with the mono-dictionaries in apertium-langX and
> apertium-langY.
>
> A challenge here is not to overwrite modifications while updating.
> Although the script used in sme-nob solves the problem of updating, as
> I understand it, it will overwrite any modifications made in
> apertium-sme-nob. And I am not sure whether updating without
> overwriting can be done at all technically.


We never modify the trimmed dictionary; we consider it a generated file.
All modifications go to the dictionary it was trimmed from.

Although we don't, we _could_ actually have sme-nob-specific additions
to the sme dictionary. It shouldn't be much worse than concatenating
another .lexc file onto the trimmed sme.lexc. Note that this would only
be lexicon additions (like "start to talk", good example), not changes
to tagging etc.
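
A minimal sketch of that "trimmed file plus pair-specific additions"
idea, in Python. Everything here is hypothetical: the function name, the
header text and the toy lexc contents are made up for illustration, and
this is not how the actual sme-nob script works. The point is only that
the pair's file is reassembled from two inputs on every run, so nothing
is ever edited in the generated file itself.

```python
# Hypothetical helper: rebuild the pair's lexc from a trimmed base plus
# an optional file of pair-specific lexicon additions.
GENERATED_HEADER = (
    "! GENERATED FILE: do not edit; change the source lexc "
    "or the additions file instead\n"
)


def assemble_pair_lexc(trimmed: str, additions: str) -> str:
    """Return header + trimmed lexc + (optionally) pair-specific additions."""
    parts = [GENERATED_HEADER, trimmed.rstrip("\n"), "\n"]
    if additions.strip():
        # "!" starts a comment in lexc, so the marker line is harmless.
        parts += ["\n! pair-specific additions\n", additions.rstrip("\n"), "\n"]
    return "".join(parts)


# Toy inputs (placeholders, not real sme or eng entries).
trimmed_base = "LEXICON Verbs\ntalk V ;\n"
pair_additions = "start% to% talk V ;\n"

pair_lexc = assemble_pair_lexc(trimmed_base, pair_additions)
```

Whether the appended entries end up reachable from the right lexicon
depends on how the real lexc file is structured, which this sketch does
not address.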

On the other hand, if you're already trimming, it shouldn't hurt to put
"start to talk" into the monolingual module (apertium-eng or whatever);
if apertium-tat-eng doesn't have
    …<r>start<g><b/>to<b/>talk</g><s n="vblex"/></r>…
in the bidix, it won't be in the trimmed apertium-tat-eng.eng.dix,
even if it is in the trimmed apertium-rus-eng.eng.dix.
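
To make the trimming idea concrete, here is a rough Python sketch. It is
an illustration under loud assumptions, not the real trimming script:
the two tiny bidixes, the Tatar left-hand entry and the helper names
(lemma_of, bidix_lemmas, trim) are invented, and real trimming has to
deal with tags, paradigms and compiled transducers rather than a flat
list of lemmas. It only shows why the same monolingual "start to talk"
survives trimming for one pair and disappears for another.

```python
import xml.etree.ElementTree as ET

# Toy bidixes (hypothetical content, only the structure matters).
RUS_ENG_BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>заговорить<s n="vblex"/></l><r>start<g><b/>to<b/>talk</g><s n="vblex"/></r></p></e>
<e><p><l>говорить<s n="vblex"/></l><r>talk<s n="vblex"/></r></p></e>
</section></dictionary>"""

TAT_ENG_BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>сөйләш<s n="vblex"/></l><r>talk<s n="vblex"/></r></p></e>
</section></dictionary>"""

# Lemmas of the shared monolingual English module (toy list).
ENG_LEMMAS = ["talk", "start to talk", "walk"]


def lemma_of(node):
    """Flatten <l>/<r> to a lemma: <b/> becomes a space, <s> is skipped."""
    out = [node.text or ""]
    for child in node:
        if child.tag == "b":
            out.append(" ")
        elif child.tag == "g":
            out.append(lemma_of(child))
        # <s n="..."/> only carries tags, so it adds no lemma text.
        out.append(child.tail or "")
    return "".join(out)


def bidix_lemmas(bidix_xml, side="r"):
    """Collect the lemmas found on one side ("l" or "r") of a bidix."""
    root = ET.fromstring(bidix_xml)
    return {lemma_of(node) for node in root.iter(side)}


def trim(mono_lemmas, bidix_xml, side="r"):
    """Keep only the monolingual lemmas that the bidix can translate."""
    keep = bidix_lemmas(bidix_xml, side)
    return [lemma for lemma in mono_lemmas if lemma in keep]


rus_eng_trimmed = trim(ENG_LEMMAS, RUS_ENG_BIDIX)
tat_eng_trimmed = trim(ENG_LEMMAS, TAT_ENG_BIDIX)
```

With these toy inputs, rus_eng_trimmed keeps both "talk" and "start to
talk", while tat_eng_trimmed keeps only "talk", even though both were
trimmed from the same monolingual list.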



-Kevin

