Ilnar Salimzyan <ilnar.salimz...@gmail.com> writes:

> On Wed, Mar 28, 2012 at 11:11 AM, Kevin Brubeck Unhammer
> <unham...@fsfe.org> wrote:
>> Ilnar Salimzyan
>> <ilnar.salimz...@gmail.com> writes:
>>
>>> This thread grew out of the discussion of my proposal draft [see
>>> "GSoC: Adopting a language pair: Tur-Tat / Kaz-Tat" from March 26].
>>>
>>> Having discussed the problem of monodixes/lexc-files copied in many
>>> pairs (and in more and more pairs) with Jonathan and seeing that
>>> people at IRC come to this question quite often (Like "What lexc of
>>> Tatar should I choose for my new Tatar-X translator?"), I decided to
>>> start a new discussion here :)
>>>
>>> On Mon, Mar 26, 2012 at 2:37 PM, Kevin Brubeck Unhammer
>>> <unham...@fsfe.org> wrote:
>>>
>>>> It'd be nice to have some general method for deduplicating
>>>> dictionaries
>>>
>>> I think we all share the same view.
>>>
>>> Obvious that having single transducers for many related languages
>>> compatible with each other is great. It would facilitate creation of
>>> new translators.
>>> And I think that keeping them compatible on the tags/morphotactics
>>> level can and should be done.
>>>
>>>> … We use a trimming script in apertium-sme-nob; with this
>>>> method, you would have apertium-kaz and apertium-tat as just
>>>> "development dependencies". So you'd add stuff to apertium-kaz/kaz.lexc
>>>> and to your bidix, and then run a script from apertium-kaz-tat with the
>>>> path to apertium-kaz and it creates a file apertium-kaz-tat/kaz.lexc
>>>> (and you never change this file, although it's in SVN). Similarly for
>>>> tat.lexc.
>>>>
>>>> This works, as long as the trimming script is well configured, but
>>>> perhaps it'd be 'cleaner' to have apertium-kaz/apertium-tat as "make
>>>> dependencies" and do the trimming each time you type make (no need for
>>>> apertium-kaz-tat to have generated kaz.lexc/tat.lexc files in SVN).
>>>>
>>>> (The weak point in the chain is the trimming script though, which
>>>> expects the lexc files to be fairly easily parsable (they're not,
>>>> really). Ideally we would have ways of trimming both HFST and lttoolbox
>>>> dictionaries so that we never had to copy-paste anything between pairs,
>>>> but language pairs tend to have stuff in them that's rather specific to
>>>> that pair, not sure how that is best dealt with.)
>>>
>>> = Reasons why we have monodixes copied =
>>> 1. Historical (there weren't many pairs having common part initially,
>>> but Apertium keeps growing);
>>> 2. Because of the stuff specific to a given pair.
>>>
>>> = Some imaginable solutions =
>>> Just to sum up:
>>> 1. Transducers for language A and Language B as "make-dependencies";
>>> 2. Mono-dictionaries in apertium-langA and apertium-langB as
>>> "development-dependencies" + some trimming / duplicating /
>>> keeping-up-to-date scripts.
>>>
>>> = Strengths and weaknesses of each solution =
>>> Strengths and weaknesses become clear when we 'do' need to add
>>> language-pair-specific stuff to mono-dictionaries.
>>>
>>> All examples that come up in mind are for Russian-Tatar (=not related
>>> languages), so for related languages this might be not relevant. Maybe
>>> they won't need any pair-specific-stuff in their mono-dictionaries at
>>> all, but this sounds too good to be true :)
>>>
>>> Consider Russian word "заговорить" ("start to talk"). To Tatar it is
>>> translated with two words, just like to English. And in Russian-Tatar
>>> / Russian-English pair we will need to add "start to talk" as a
>>> multiword.
>>>
>>> I am sure that similar cases, when a word of languageA is translated
>>> to languageB with a multiword, can be found for related languages too.
>>>
>>> == 1. Make-dependencies ==
>>> We can add such words to monodictionaries in apertium-langA,
>>> separating them into sublexicons or commenting them like "this stuff
>>> is needed for langA-langB pair".
>>> But this way transducer will become noisier and noisier.
>>>
>>> == 2. Mono-dictionaries in apertium-langA and apertium-langB as
>>> "development-dependencies" + some trimming / duplicating /
>>> keeping-up-to-date scripts ==
>>> In this case monodictionaries in apertium-langX are considered to be
>>> something like "vanilla software". They are kept close to linguistical
>>> traditions of POS-tagging etc. And they serve as base for building new
>>> pairs involving this languages.
>>>
>>> Modifying them for a given pair is like patching the vanilla software.
>>> A script could keep this modified versions in apertium-langX-langY
>>> up-to-date with mono-dictionaries in apertium-langX and
>>> apertium-langY.
>>>
>>> A challenge here is not to overwrite modifications while updating.
>>> Although script used in sme-nob solves the problem of updating, as I
>>> understand, it will overwrite any modifications made in
>>> apertium-sme-nob. And I am not sure if this can be done at all
>>> technically.
>>
>> We never modify the trimmed dictionary, we consider it a generated file.
>> All modifications go to the dictionary it was trimmed from.
>>
>> Although we don't, we _could_ actually have sme-nob-specific additions
>> to the sme dictionary. It shouldn't be much worse than concatenating
>> another .lexc file onto the trimmed sme.lexc. Note that this would only
>> be lexicon additions (like "start to talk", good example), not changes
>> to tagging etc.
>>
>> On the other hand, if you're already trimming, it shouldn't hurt to put
>> "start to talk" into the monolingual module (apertium-eng or whatever);
>> if apertium-tat-eng doesn't have
>> …<r>start<g><b/>to<b/>talk</g><s n="vblex"/></r>…
>> in the bidix, it won't be in the trimmed apertium-tat-eng.eng.dix,
>> even if it is in the trimmed apertium-rus-eng.eng.dix.
>>
>
> That sounds good, my only concern was -- if this monolingual module
> will be used as a basis for building stand-alone applications --
> spellcheckers or whatever (there are big plans to have a transducer
> for the most of the Turkic languages for example), it will
> overgenerate a bit (that's what I meant by "noisiness"). But I don't
> think that anyone who has intentions to reuse them will be ever so
> lazy not to cut down the stuff like "start to talk" from it.
>
> So this seems to be rather a question of taste.

Well, that's a problem anyway with any dictionary from an Apertium
language pair. I'd say a monolingual module would be an improvement for
spelling purposes, since it would have better coverage.

-Kevin
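
To make the trimming idea from the thread concrete, here is a minimal
sketch in Python. It is not the apertium-sme-nob trimming script, and it
makes several simplifying assumptions: the monolingual dictionary is
reduced to a plain list of lemmas, matching ignores tags, and the two
bidix snippets are toy data made up for illustration. The helper names
flatten() and trim() are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Return the surface lemma of a bidix <l> or <r> element.

    <b/> becomes a space, <g> contributes its contents,
    and <s .../> tag elements are dropped.
    """
    if elem.tag == "s":
        return ""
    out = " " if elem.tag == "b" else (elem.text or "")
    for child in elem:
        out += flatten(child) + (child.tail or "")
    return out


def trim(mono_lemmas, bidix_xml, side="r"):
    """Keep only the monolingual lemmas found on one side of the bidix."""
    wanted = {flatten(e) for e in ET.fromstring(bidix_xml).iter(side)}
    return [lemma for lemma in mono_lemmas if lemma in wanted]


# Toy shared English lexicon (assumption: lemmas only, no paradigms).
ENG = ["talk", "start to talk"]

# Toy bidix fragments (assumption: illustrative entries, not real data).
RUS_ENG = (
    '<section>'
    '<e><p><l>говорить<s n="vblex"/></l><r>talk<s n="vblex"/></r></p></e>'
    '<e><p><l>заговорить<s n="vblex"/></l>'
    '<r>start<g><b/>to<b/>talk</g><s n="vblex"/></r></p></e>'
    '</section>'
)
TAT_ENG = (
    '<section>'
    '<e><p><l>сөйләш<s n="v"/></l><r>talk<s n="vblex"/></r></p></e>'
    '</section>'
)

print(trim(ENG, RUS_ENG))  # ['talk', 'start to talk']
print(trim(ENG, TAT_ENG))  # ['talk']
```

The multiword stays in the shared English lexicon, but it only survives
trimming for the pair whose bidix actually uses it. A real trimmer would
have to work on the compiled transducers or on lexc/dix sources and match
lemma plus tags, which is where the difficulty mentioned in the thread
lies.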