First of all: thanks for your responses!

On Mon, Mar 26, 2012 at 2:37 PM, Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
> Francis Tyers <fty...@prompsit.com> writes:
>
>> On Mon, 26 Mar 2012 at 07:08 +0400, Ilnar Salimzyan wrote:
>>> I am sorry this email is so long. Honestly, I shortened it
>>> several times. Consider it the first draft of the proposal.
>>
>> No problem :)
>>
>>> Dear Apertium mentors,
>>>
>>> my name is Ilnar Salimzyanov (‘selimcan’ on Sourceforge and IRC,
>>> ‘Ilnar.salimzyan’ on Apertium’s wiki, ‘Ilnar Salimzyan’ in many other
>>> places).
>>
>> Hi Ilnar! :D
>>
>>> My native language is Tatar, and I also speak Russian at a native
>>> level.
>>>
>>> = Reason I am writing =
>>>
>>> I would like to apply for Google Summer of Code and work on adopting
>>> the Turkish-Tatar / Kazakh-Tatar language pairs. I am writing here to
>>> discuss my plans and to get some feedback, which will make writing my
>>> proposal easier.
>>>
>>> = Who I am / Some background information =
>>> I am a first-year master’s student at Kazan Federal University,
>>> studying Applied Linguistics [1].
>>>
>>> I first learned about Apertium in 2009, while writing a small paper
>>> at the university comparing the available machine translation
>>> systems. Apertium fascinated me then: it was open source, growing
>>> rapidly, and a good potential starting point for Tatar and other
>>> Turkic languages (yes, I have thought about them too). I played
>>> around with an lttoolbox dictionary for Tatar (a bad idea, I know,
>>> but I didn’t know about FSTs then, and there weren’t any other Turkic
>>> languages involved). I even managed to model noun morphotactics with
>>> it! :)
>>
>> Well, lttoolbox also produces FSTs; the difference is that in HFST you
>> have separate morphotactics and morphophonology transducers, which are
>> then composed to form the final transducer, whereas in lttoolbox there
>> is only a single transducer.
>>
Yes, it is certainly not correct to restrict “finite-state transducers”
to the ones described in the FSMBook :) By “FST” I was rather referring
to transducer technologies that have “FST” in their names, without
considering that lttoolbox is also a transducer technology and can
therefore also be called an “FST”.

>>> Back in 2009 I translated part of the Official Documentation into
>>> Russian [2] (up to chapter 3.2.3; besides someone willing to finish
>>> it, the translation needs a good editor). Also in 2009 I translated
>>> the Apertium New Language Pair HOWTO into Russian.
>>>
>>> I was one of the participants of the Šupaškar Apertium Workshop, held
>>> in January this year, where Francis Tyers, Hector Alos-i-Font,
>>> Jonathan Washington and Trond Trosterud were instructors.
>
> Cool =D
>
>>> I was very fortunate to see Jonathan and Francis work on the
>>> Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and
>>> move it to nursery. It is very useful to have a transducer for my
>>> native language (and for the language closest to it) when learning
>>> the semantics and structure of lexc and twol files (which I wasn’t
>>> really familiar with, since using HFST with Apertium is a relatively
>>> new thing and is not mentioned in the Official Documentation), along
>>> with reading the famous FSMBook.
>>
>> :)
>>
>>> I have been involved in the work on the Tatar-Bashkir pair as, let’s
>>> say, a “language consultant” and “tester”. Together with another
>>> fellow from Ufa, I have been translating the top-5000 word list of
>>> the Russian National Corpus into Tatar and Bashkir. These
>>> translations were then added to the translator files. Also, I have
>>> been analyzing some errors in the translations, finding out where
>>> apertium-tt-ba performed not so well, describing it on the wiki [3,4]
>>> and committing from time to time to svn.
>>>
>>> = Resources =
>>> // I will list all relevant resources on the wiki before submitting
>>> the proposal //
>>>
>>> For both language pairs I will not have to start from absolute
>>> scratch. Transducers for all three languages — Turkish, Kazakh and
>>> Tatar — perform quite well, with 87%, 76% and 56% coverage,
>>> respectively [5]. Given that, I think the crucial way to get the most
>>> out of these separate transducers with the least work is to write
>>> bidix files, translating the words from each lexc file into Tatar.
>>>
>>> == Bilingual dictionaries ==
>>> === Kazakh ===
>>> All words in kazakh.lexc [6] are commented with English glosses
>>> (thanks to whoever did this!). Using a simple sed one-liner, I
>>> prepared bidix entries with the Kazakh words on the left side,
>>> putting the English glosses into comments again. In a few hours’
>>> work, I translated ~500 nouns (not proper nouns) and most of the
>>> adjectives into Tatar [7]. For Kazakh words which look very similar
>>> to Tatar ones and have the same meaning as their Tatar equivalents,
>>> this can be done very quickly. For the others I consulted
>>> Kazakh-Russian dictionaries too, but again, translating all the
>>> remaining words from kazakh.lexc will take no more than a few days of
>>> focused work.
>>> === Turkish ===
>>> Unfortunately, very few words in turkish.lexc have English glosses.
>>> But there is a Tatar-Turkish dictionary which was released under the
>>> GPL [8], and another Tatar-Turkish online dictionary [9], also under
>>> the GPL. The process will be similar for Turkish — take the stems
>>> from turkish.lexc, put them automatically into the bidix and
>>> translate them into Tatar, consulting where necessary the
>>> dictionaries mentioned above or a printed Turkish-Tatar dictionary.
>>>
>>> == Parallel corpora ==
>>> Some sentences for Turkish-Tatar are available in the Tatoeba
>>> project. Bible or Quran translations can serve as a source of
>>> parallel corpora.
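As an aside on the lexc-to-bidix step described above: the conversion is
mechanical enough to sketch. The entry shape assumed below
(`stem:stem ContClass ; ! "gloss"`) and the helper name are illustrative
guesses, not the actual layout of kazakh.lexc:

```python
import re

# Hypothetical lexc entry shape: stem:stem ContClass ; ! "gloss"
LEXC_LINE = re.compile(r'^(?P<stem>[^:\s]+):\S+\s+\S+\s*;\s*!\s*"?(?P<gloss>[^"]*)"?')

def lexc_to_bidix(lines, pos="n"):
    """Turn gloss-commented lexc entries into bidix <e> skeletons.

    The right (Tatar) side is left empty for manual translation;
    the English gloss is carried along as an XML comment."""
    entries = []
    for line in lines:
        m = LEXC_LINE.match(line.strip())
        if not m:
            continue  # LEXICON headers, pure comment lines, etc.
        stem, gloss = m.group("stem"), m.group("gloss").strip()
        entries.append(
            f'<e><p><l>{stem}<s n="{pos}"/></l>'
            f'<r><s n="{pos}"/></r></p></e> <!-- {gloss} -->'
        )
    return entries

sample = [
    'алма:алма N1 ; ! "apple"',
    '! a stray comment line, skipped',
    'тау:тау N1 ; ! "mountain"',
]
for entry in lexc_to_bidix(sample):
    print(entry)
```

The same thing is of course doable with sed alone; the point is just
that the gloss comments make the left side of the bidix nearly free.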
>>> Right away I will send an email to Jörg Tiedemann (the maintainer of
>>> the OPUS project) suggesting that he add the Quran translations from
>>> tanzil.net as a sub-pool to his corpus collection. These translations
>>> are very easy to align and put into a TMX file, since each aya (a
>>> verse — usually one sentence, sometimes more) is placed on a line of
>>> its own. One just has to align the text files line by line using
>>> something like uplug. Even if Mr. Tiedemann refuses, or doesn’t
>>> respond quickly, these translations will be useful for my purposes
>>> while working on transfer rules — whether TMX files get created or
>>> not.
>>
>> (rather than "refuses", better say "doesn't take me up on the offer"
>> -- refuses sounds a bit strong) :)
>
> Did you have a specific plan involving TMX files?
>

No, I don’t have any plans or tasks which require the corpus to be in
TMX format. I described why it is easy to convert them to TMX just to
increase the probability that Mr. Tiedemann takes me up on the offer ;)
I think these reflections should definitely not go into the final draft
of the proposal.

>>> == Frequency lists ==
>>> According to Francis, the stems in both the Kazakh and Turkish
>>> transducers were taken from frequency lists (obtained from the Kazakh
>>> RLFE corpus and the SETimes corpus, I guess), which is certainly
>>> good. As for Tatar: after I shared with corpus.tatfolk.ru (a project
>>> aiming to create a web-crawled corpus of Tatar, similar in
>>> functionality to the Wortschatz project) the preprocessed pages I had
>>> collected earlier, and these were concatenated with what
>>> corpus.tatfolk.ru already had, they provided me with a frequency list
>>> of Tatar wordforms [10].
>>>
>>> = Reasons for the dual application =
>>> My first intention was to apply for the Turkish-Tatar pair, as I have
>>> actually been learning Turkish intensively for half a year and can
>>> speak it quite well now.
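Since each translation keeps one aya per line, pairing two such files up
really is just zipping lines together. A minimal sketch of that
line-by-line TMX writer (the function name and the stripped-down
`<header>` are my own simplifications; a real TMX 1.4 header carries
more attributes, and uplug/OPUS will emit something richer):

```python
from xml.sax.saxutils import escape

def align_to_tmx(src_lines, trg_lines, src_lang="tr", trg_lang="tt"):
    """Pair up two line-aligned texts (one verse per line) as TMX
    translation units; lines are matched purely by position."""
    units = []
    for src, trg in zip(src_lines, trg_lines):
        units.append(
            "  <tu>\n"
            f'   <tuv xml:lang="{src_lang}"><seg>{escape(src.strip())}</seg></tuv>\n'
            f'   <tuv xml:lang="{trg_lang}"><seg>{escape(trg.strip())}</seg></tuv>\n'
            "  </tu>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<tmx version="1.4">\n'
        f' <header srclang="{src_lang}"/>\n'
        " <body>\n" + "\n".join(units) + "\n </body>\n</tmx>"
    )

print(align_to_tmx(
    ["Rahman ve Rahim olan Allah'ın adıyla."],
    ["Рәхимле һәм шәфкатьле Аллаһ исеме белән."],
))
```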
>>> But after learning about the Tatar-Turkish translator which was
>>> already available [11] and about plans to release it under the GPL
>>> (which could potentially mean students of that organization applying
>>> for the same pair), I decided to make a dual application.
>>
>> Well, we still haven't heard anything, after being quite explicit that
>> they should get in contact with us as soon as possible, so it isn't
>> certain that they will submit a valid application. I should probably
>> send a reminder to Diljara.
>>
>>> == Pros for the Turkish-Tatar pair ==
>>> + there are GPL’d bilingual dictionaries;
>>> + TRmorph has the largest coverage compared with the other Turkic
>>> languages in Apertium;
>>> + I speak Turkish better than I speak Kazakh [I can’t really say that
>>> I speak Kazakh, never tried :)]. This is relevant for Tatar → Turkish
>>> translation and for working on its transfer rules, I assume;
>>>
>>> == Cons for the Turkish-Tatar pair ==
>>> - although TRmorph has good coverage, it is not so far ahead of
>>> Kazmorph, which has four times fewer stems.
>>> This allows me to conclude that
>>> - the available Turkish transducer has to be thoroughly revised.
>>
>> Yes, the morphotactics are based on my conversion from TRmorph, and
>> given my lack of knowledge of Turkish, these (especially the verb
>> lexica) will need to be seriously revised.
>>
>>> == Pros for the Kazakh-Tatar pair ==
>>> + Kazakh is closer to Tatar, so I understand almost everything I read
>>> (and this without having studied it!);
>>> + fewer transfer rules are needed;
>>> + there is more chance of completing the task in three months;
>>
>> Another pro might be that you will be able to create the transducers
>> for Kazakh and Tatar in lockstep, meaning fewer problems when it comes
>> to testvoc.
>
> Kazakh-Tatar sounds like the best choice to me.

I think I should mark the Kazakh-Tatar pair as my preferred choice.

>>> = General plans for the pairs =
>>> 1.
>>> Translate the stems from the Turkish and Kazakh transducers into
>>> Tatar; create the bilingual dictionaries. Finish this as quickly as
>>> possible, ideally even before the coding period begins.
>>>
>>> 2. Make the transducers really compatible. I have already started
>>> doing this [see apertium-tat and apertium-kaz in branches], following
>>> the tag-choosing conventions described in
>>> [http://wiki.apertium.org/wiki/Turkic_languages]
>>> and the general structure of continuation classes and lexicons
>>> implemented in [branches tur kir].
>>>
>>> This task also includes remedying the known shortcomings of the
>>> transducer for Tatar mentioned above.
>>>
>>> 3. Expand the transducer for Tatar with stems from the bidix files
>>> that it doesn’t recognize yet.
>>>
>>> 4. Work on constraint grammars and transfer rules. Since Turkic
>>> languages usually share the same POS ambiguities, by using the same
>>> tags and the same morphotactic logic (with “syntactic” categories
>>> like ‘subst’, ‘attr’ etc.), I guess the translators will perform
>>> quite well even without many Constraint Grammar rules. So I will
>>> invest more time in transfer rules (especially for the Turkish-Tatar
>>> pair).
>>
>> Seems like a pretty sensible plan.
>>
>>> = Coding challenge =
>>> My intention is to put everything available together and turn it into
>>> two pairs which can be built and installed. They can then be moved to
>>> nursery, which will be a good move psychologically :) I’d like to see
>>> them in nursery before sending the proposal.
>>
>> Great! :D
>>
>>> apertium-tr-tt in incubator has already proved to build, but needs to
>>> be checked and expanded before moving to nursery.
>>>
>>> For the Kazakh-Tatar pair, I would like to try another solution — not
>>> to put everything into one folder (which means duplicating some of
>>> the files), but rather to make apertium-tat, apertium-kaz and
>>> apertium-kaz-tat in branches buildable and installable as separate
>>> “packages” (with apertium-kaz-tat relying on the two separate
>>> transducers, which would have to be installed first).
>>> This can be achieved by modifying the makefiles from apertium-tur-kir.
>>>
>>
>> This sounds nice, but there are many things to be resolved; I wouldn't
>> be too worried if you aren't able to do it. Remember, the problem
>> isn't getting it to build with separate dictionaries, but getting it
>> to pass testvoc with separate dictionaries (tt-ba still doesn't quite
>> pass due to differences in the verb lexica).

So, if I understand you correctly, to run testvoc on a pair you don't
have to have the pair installed; all the dictionaries just have to be
available and compiled? I didn't know that before, thanks.

> It'd be nice to have some general method for deduplicating
> dictionaries … We use a trimming script in apertium-sme-nob; with this
> method, you would have apertium-kaz and apertium-tat as just
> "development dependencies". So you'd add stuff to apertium-kaz/kaz.lexc
> and to your bidix, and then run a script from apertium-kaz-tat with the
> path to apertium-kaz, and it creates a file apertium-kaz-tat/kaz.lexc
> (and you never change this file, although it's in SVN). Similarly for
> tat.lexc.
>
> This works, as long as the trimming script is well configured, but
> perhaps it'd be 'cleaner' to have apertium-kaz/apertium-tat as "make
> dependencies" and do the trimming each time you type make (no need for
> apertium-kaz-tat to have generated kaz.lexc/tat.lexc files in SVN).
>
> (The weak point in the chain is the trimming script though, which
> expects the lexc files to be fairly easily parsable (they're not,
> really).
> Ideally we would have ways of trimming both HFST and lttoolbox
> dictionaries so that we never had to copy-paste anything between pairs,
> but language pairs tend to have stuff in them that's rather specific to
> that pair; not sure how that is best dealt with.)

I find these ideas very insightful. And you explained them so well with
“make dependencies” and “development dependencies”! For now, I’ll try to
make apertium-kaz and apertium-tat “make dependencies”. That seems
simpler to achieve, as the same approach is used for apertium-tur-kir
(or, rather, is planned to be used). But I find your approach better.
I’ll move this sub-discussion into another mail to apertium-stuff,
because it really goes beyond our topic here. So further ideas about
dictionary deduplication can be found in one of the following threads.

>>> I firmly believe that I can do all of that before sending the final
>>> draft of the proposal(s), which I plan to do no later than April 1.
>>>
>>> If that doesn’t sound realistic to you, please let me know.
>>>
>>> = The proposal itself =
>>> In what form will I have to submit the final version of the proposal
>>> — as a PDF created from TeX files — or should I just put it on the
>>> wiki and give you a link to it, or doesn’t it really matter?
>>
>> Best is on the wiki (a subpage of your user page, /Application or
>> something like that), and then the final version as a PDF.
>
> I think the one that goes to Google has to be plain text anyway.
>
>>> ==================================================
>>>
>>> I am sorry for all the mistakes in this email. Unlike with this one,
>>> I will try to find someone proficient in English to check my final
>>> draft for errors :)
>>
>> It won't be necessary, your English is fine ;) See you on IRC!
>>
>> Fran
>
> ------------------------------------------------------------------------------
> This SF email is sponsored by:
> Try Windows Azure free for 90 days Click Here
> http://p.sf.net/sfu/sfd2d-msazure
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff