Re: [Apertium-stuff] GSoC: Adopting a language pair: Tur-Tat / Kaz-Tat

Francis Tyers Mon, 26 Mar 2012 00:52:59 -0700

On Mon, 26 Mar 2012 at 07:08 +0400, Ilnar Salimzyan wrote:
> I am sorry that this email is so long. Honestly, I shortened it
> several times. Consider it to be the first draft of the proposal.


No problem :)

> Dear Apertium mentors,
> 
> my name is Ilnar Salimzyanov (‘selimcan’ on Sourceforge and IRC,
> ‘Ilnar.salimzyan’ on Apertium’s wiki, ‘Ilnar Salimzyan’ on many other
> places).

Hi Ilnar! :D

> 
> My native language is Tatar; I also speak Russian at a native level.
> 
> = Reason I am writing =
> 
> I would like to apply for Google Summer of Code and work on adopting
> Turkish-Tatar / Kazakh-Tatar language pair. I am writing here to
> discuss my plans, to get some feedback, which would facilitate writing
> my proposal.
> 
> = Who I am / Some background information =
> I am a first-year master’s student at the Kazan Federal University,
> studying Applied Linguistics [1].
> 
> I first got to know about Apertium in 2009, while writing a small
> paper at the university comparing available machine translation
> systems. Apertium fascinated me then by being open source, showing
> rapid growth and being a good potential starting point for Tatar and
> other Turkic languages (yes, I have thought about them too). I played
> around with an lttoolbox dictionary for Tatar (bad idea, I know, but I
> didn’t know about FSTs then and there weren’t any other Turkic
> languages involved). I even managed to model noun morphotactics using
> it! :)

Well, lttoolbox also produces FSTs. The difference is that in HFST you
have separate morphotactics and morphophonology transducers, which are
then composed to form the final transducer; in lttoolbox there is only a
single transducer.
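
As a rough illustration of that two-level split (a toy sketch in Python,
not how HFST or lttoolbox are implemented; the suffix table, the {A}
archiphoneme and the single harmony rule are simplified assumptions):

```python
# Toy illustration only: real transducers are finite-state machines,
# not Python functions. All names and rules here are hypothetical.

BACK_VOWELS = set("аоуы")
FRONT_VOWELS = set("әөүеиэ")
VOWELS = BACK_VOWELS | FRONT_VOWELS

# lexc-like step: lemma + tags -> stem, morpheme boundary, abstract suffix
SUFFIXES = {"<n><pl>": ">л{A}р", "<n><sg>": ""}


def morphotactics(analysis: str) -> str:
    """Map e.g. 'китап<n><pl>' to the intermediate form 'китап>л{A}р'."""
    for tags, suffix in SUFFIXES.items():
        if analysis.endswith(tags):
            return analysis[: -len(tags)] + suffix
    raise ValueError(f"no continuation for {analysis!r}")


def morphophonology(intermediate: str) -> str:
    """twol-like step: resolve {A} by vowel harmony, drop the boundary."""
    stem, _, suffix = intermediate.partition(">")
    last_vowel = next((c for c in reversed(stem) if c in VOWELS), None)
    harmonic = "а" if last_vowel in BACK_VOWELS else "ә"
    return stem + suffix.replace("{A}", harmonic)


def compose(first, second):
    """Feed the output of the first mapping into the second."""
    return lambda text: second(first(text))


generate = compose(morphotactics, morphophonology)

if __name__ == "__main__":
    for form in ("китап<n><pl>", "күл<n><pl>", "күл<n><sg>"):
        print(form, "->", generate(form))
```

In an lttoolbox dictionary the surface alternations would instead have
to be spelled out in the paradigms themselves, which is why it ends up
as a single transducer.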

> Back in 2009 I translated part of the Official Documentation into
> Russian [2] (till chapter 3.2.3; besides someone willing to finish it
> the translation needs a good editor). Also in 2009 I translated
> Apertium New language pair Howto into Russian.
> 
> I was one of the participants of the Šupaškar Apertium Workshop, held
> in January this year, where Francis Tyers, Hector Alos-i-Font,
> Jonathan Washington and Trond Trosterud were instructors.
> 
> I was very fortunate to see Jonathan and Francis work on Tatar-Bashkir
> pair as an example pair for the Šupaškar Workshop and move it to
> nursery. It is very useful to have a transducer for my native language
> (and a language closest to it) to learn the semantics and structure of
> lexc and twol files (which I wasn’t really familiar with, since using
> HFST with Apertium is a relatively new thing and is not mentioned in
> the Official Documentation), along with the reading of the famous
> FSMBook.

:)

> I have been involved in work on the Tatar-Bashkir pair as, let’s say,
> “language consultant” and “tester”. With another fellow from Ufa we
> have been translating the top-5000 wordlist of the Russian National
> Corpus into Tatar and Bashkir. These translations were then added to
> the translator files. Also, I have been analyzing some errors in the
> translations, finding out where Apertium-tt-ba performed not so well,
> describing it on the wiki [3,4] and committing from time to time to
> svn.
> 
> = Resources =
> // I will list all relevant resources on the wiki before submitting
> the proposal//
> 
> For both language pairs I will not have to start from absolute
> scratch. Transducers for all three languages (Turkish, Kazakh and
> Tatar) perform quite well, with 87%, 76% and 56% coverage
> respectively [5]. Given that, I thought that the way to benefit most
> from these separate transducers with the least work is to write bidix
> files, translating words from each lexc file into Tatar.
> 
> == Bilingual dictionaries ==
> ===Kazakh===
> All words in kazakh.lexc [6] were commented with English glosses
> (thanks to whoever did this!). Using a simple sed one-liner, I
> prepared bidix entries with Kazakh words as the left side, putting the
> English glosses into comments again. In a few hours’ work, I
> translated ~500 nouns (not proper nouns) and most of the adjectives
> into Tatar [7]. For Kazakh words which look very similar to Tatar ones
> and have the same meaning as these Tatar equivalents, this can be done
> very quickly. For the others I consulted Kazakh-Russian dictionaries
> too, but again, translating all remaining words from kazakh.lexc will
> take no more than a few days of focused work.
> ===Turkish===
> Unfortunately very few words in turkish.lexc have English glosses. But
> there is a Tatar-Turkish dictionary, which was released under GPL [8],
> and another Tatar-Turkish online dictionary [9], also under GPL. The
> process will be similar for Turkish too: take stems from
> turkish.lexc, put them automatically into the bidix and translate them
> into Tatar, consulting where necessary the dictionaries mentioned
> above or a Turkish-Tatar dictionary in print.
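
For reference, the stem-to-bidix step could look roughly like the
following Python sketch (instead of sed). The lexc line layout
(`lemma:stem CONT ; ! "gloss"`) and the continuation-class names N1 and
A1 are assumptions for illustration and may not match the actual files;
the right-hand side is left empty for the translator to fill in.

```python
import re
from xml.sax.saxutils import escape

# Assumed lexc entry layout:  lemma:stem CONT ; ! "English gloss"
LEXC_ENTRY = re.compile(
    r'^(?P<lemma>[^:!\s]+):\S+\s+(?P<cont>\S+)\s*;\s*'
    r'(?:!\s*"(?P<gloss>[^"]*)")?'
)

# Hypothetical mapping from continuation classes to bidix POS tags.
CONT_TO_TAG = {"N1": "n", "A1": "adj"}


def lexc_to_bidix(line):
    """Return a bidix <e> skeleton for one lexc line, or None to skip it."""
    match = LEXC_ENTRY.match(line.strip())
    if match is None or match["cont"] not in CONT_TO_TAG:
        return None
    tag = CONT_TO_TAG[match["cont"]]
    entry = (
        f'<e><p><l>{escape(match["lemma"])}<s n="{tag}"/></l>'
        f'<r><s n="{tag}"/></r></p></e>'
    )
    gloss = match["gloss"]
    if gloss:
        # "--" is not allowed inside an XML comment.
        entry += f' <!-- {gloss.replace("--", "-")} -->'
    return entry


if __name__ == "__main__":
    sample = ['LEXICON Nouns', 'алма:алма N1 ; ! "apple"']
    for line in sample:
        entry = lexc_to_bidix(line)
        if entry is not None:
            print(entry)
```

Lines that are not recognised (lexicon headers, unknown continuation
classes) are skipped rather than guessed at, so they can be handled by
hand later.
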
> 
> ==Parallel corpora==
> Some sentences for Turkish-Tatar are available at the Tatoeba project.
> Bible or Quran translations can also serve as a source of parallel
> corpora. Right away I will send an email to Jörg Tiedemann (the
> maintainer of the OPUS project) with the suggestion to add Quran
> translations from tanzil.net as a sub-pool to his corpora collection.
> These translations are very easy to align and put into a TMX file,
> since each aya (verse; usually one sentence, sometimes more) is placed
> on one line. One just has to align the text files line by line using
> something like uplug.
> Even if Mr. Tiedemann refuses, or doesn’t respond quickly, these
> translations will be useful for my purposes while working on transfer
> rules, whether or not TMX files are created.

(rather than "refuses", better say "doesn't take me up on the offer" --
refuses sounds a bit strong) :)
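
Since the verses are already one per line, the alignment itself is
nearly trivial. A minimal sketch in plain Python (rather than uplug),
assuming two files with the same number of lines; the header values and
the language codes "tr" and "tt" are placeholders:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(src_lines, trg_lines, src_lang="tr", trg_lang="tt"):
    """Pair line i of the source with line i of the target as one <tu>."""
    if len(src_lines) != len(trg_lines):
        raise ValueError("inputs are not line-aligned")
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "line-align-sketch",   # placeholder values
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, trg in zip(src_lines, trg_lines):
        src, trg = src.strip(), trg.strip()
        if not src or not trg:
            continue  # skip verses missing on either side
        unit = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (trg_lang, trg)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return tmx


if __name__ == "__main__":
    tmx = build_tmx(["Merhaba.", ""], ["Сәлам.", "(no source line)"])
    print(ET.tostring(tmx, encoding="unicode"))
```

Raising an error on a line-count mismatch is deliberate: a silent
off-by-one would misalign every following verse.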

> ==Frequency lists==
> According to Francis, stems in both Kazakh and Turkish transducers
> were taken from frequency lists (obtained from Kazakh RLFE corpora and
> SETimes corpora I guess), which is certainly good. As for Tatar,
> corpus.tatfolk.ru (a project aiming to create a web-crawled corpus of
> Tatar, similar in functionality to the Wortschatz project) provided me
> with a frequency list of Tatar wordforms [10], after I shared with
> them preprocessed pages I had collected earlier and these were
> concatenated with what corpus.tatfolk.ru had.
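
A frequency list like this also gives a quick way to estimate coverage
and to see which unanalysed forms matter most. A naive sketch, where
`is_known` is a stand-in for running the real analyser (an assumption;
in practice one would pipe the forms through the transducer and check
for the `*` unknown-word mark):

```python
def naive_coverage(freq, is_known):
    """Share of corpus tokens whose surface form gets an analysis."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    known = sum(count for form, count in freq.items() if is_known(form))
    return known / total


def most_frequent_unknown(freq, is_known, limit=10):
    """Unanalysed forms, most frequent first: candidates to add next."""
    unknown = [(form, n) for form, n in freq.items() if not is_known(form)]
    return sorted(unknown, key=lambda item: -item[1])[:limit]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    freq = {"китап": 6, "китаплар": 3, "xyz": 1}
    analysable = {"китап", "китаплар"}
    print(naive_coverage(freq, analysable.__contains__))
    print(most_frequent_unknown(freq, analysable.__contains__))
```
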
> 
> = Reasons for dual-application =
> My first intention was to apply for the Turkish-Tatar pair, as I
> have actually been learning Turkish intensively for half a year and
> can speak it quite well now. But after learning about a Tatar-Turkish
> translator which was already available [11] and about plans to release
> it under GPL (which could potentially mean students of this
> organization applying for the same pair), I decided to make a dual
> application.

Well, we still haven't heard anything, after being quite explicit that
they should get in contact with us as soon as possible, so it isn't
certain that they will submit a valid application. I should probably
send a reminder to Diljara.

> ==Pros for the Turkish-Tatar pair==
> + there are GPLed bilingual dictionaries;
> + Turmorph has the highest coverage compared to other Turkic languages
> in Apertium;
> + I speak Turkish better than I speak Kazakh [can’t really say that I
> speak Kazakh, never tried :)]. This is relevant for Tatar > Turkish
> translation and working on transfer rules for it, I assume;
> 
> ==Cons for the Turkish-Tatar pair==
> - although Turmorph has good coverage, it is not that far ahead of
> Kazmorph, which has four times fewer stems.
> This allows me to conclude that
> - the available Turkish transducer has to be thoroughly revised.

Yes, the morphotactics are based on my conversion from TRmorph, and
given my lack of knowledge of Turkish, these (especially the verb
lexica) will need to be seriously revised.

> ==Pros for the Kazakh-Tatar pair==
> + Kazakh is closer to Tatar, so I understand almost everything I read
> (and this without learning it!);
> + fewer transfer rules needed;
> + more chances to complete the task in three months;

Another pro might be that you will be able to create the transducers for
Kazakh and Tatar in lockstep, meaning fewer problems when it comes to
testvoc.

> = General plans for the pairs =
> 1. Translate stems from Turkish and Kazakh transducers into Tatar.
> Create bilingual dictionaries. Finish this as quickly as possible,
> preferably even before coding time begins.
> 
> 2. Make transducers really compatible. I already started doing this
> [see apertium-tat and apertium-kaz in branches], following
> tag-choosing conventions described in
> [http://wiki.apertium.org/wiki/Turkic_languages]
> and general structure of continuation classes and lexicons implemented
> in [branches tur kir].
> 
> This task also includes remedying the known shortcomings of the
> transducer for Tatar mentioned above.
> 
> 3. Expand transducer for Tatar with stems from bidix files, if it
> doesn’t recognize them.
> 
> 4. Work on constraint grammars and transfer rules. Since Turkic
> languages usually share the POS-ambiguities, by using the same tags
> and having the same logic of morphotactics (using “syntactic”
> categories like ‘subst’, ‘attr’ etc), I guess that translators would
> perform quite well even without many constraint grammar rules. So I
> will invest more time in transfer rules (esp. for the Turkish-Tatar
> pair).

Seems like a pretty sensible plan.

> = Coding challenge =
> My intention is to put everything available together and make two
> pairs out of it, which can be built and installed. They can then be
> moved to nursery, which will be a good move psychologically :) I’d
> like to see them in nursery before sending the proposal.

Great! :D

> Apertium-tr-tt in incubator has already been shown to build, but needs
> to be checked and expanded first before moving to nursery.
> 
> For the Kazakh-Tatar pair, I would like to try another solution: not
> to put everything into one folder (which means duplicating some
> files), but rather to make apertium-tat, apertium-kaz and
> apertium-kaz-tat in branches buildable and installable as separate
> “packages” (apertium-kaz-tat relying on two separate transducers,
> which should be installed first).
> This can be achieved by modifying the makefiles from apertium-tur-kir.
> 

This sounds nice, but there are many things to be resolved, so I wouldn't
be too worried if you aren't able to do it. Remember, the problem isn't
getting it to build with separate dictionaries, but getting it to pass
testvoc with separate dictionaries. (the tt-ba still doesn't quite pass
due to differences in the verb lexica).
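
To make the testvoc point concrete: every analysis from the source
transducer is pushed through the whole pipeline, and any output carrying
an `@` (lemma not found in the bidix) or `#` (target generator could not
produce the form) counts as a failure. A small sketch of sorting such
output, where the `analysis --> output` line layout is an assumption
about how the test script prints its results:

```python
from collections import Counter


def classify(line):
    """Label one testvoc line as 'ok', 'bidix' (@) or 'generation' (#)."""
    # Only the final output (after the last arrow) is inspected.
    output = line.rsplit("-->", 1)[-1]
    marks = {token[0] for token in output.split() if token[0] in "@#"}
    if "@" in marks:
        return "bidix"
    if "#" in marks:
        return "generation"
    return "ok"


def summarise(lines):
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines if line.strip())


if __name__ == "__main__":
    # Made-up example lines, not real output from any pair.
    sample = [
        "^алма<n><pl>$ --> алмалар",
        "^бар<v><iv><imp>$ --> #бар",
        "^кітап<n><sg>$ --> @кітап",
    ]
    print(summarise(sample))
```

With separate monolingual packages, every tag combination one
transducer can emit has to be accepted by the other, which is where the
`#` cases tend to come from.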

> I firmly believe that I can do all of that before sending the final
> draft of the proposal(s), which I plan to do no later than April 1.
> 
> If it doesn’t sound realistic to you, please let me know.
> 
> = Proposal itself =
> In what form will I have to submit the final version of the proposal —
> as a pdf created from TeX files — or should I just put it to the wiki
> and give you the link to it or doesn’t it really matter?

Best is on the Wiki (subpage of your user page /Application or something
like that) and then the final version as PDF.

> ==================================================
> 
> I am sorry for all the mistakes in this email. Unlike it, I will try
> to find someone with proficient English to check my final draft for
> errors :)

It won't be necessary, your English is fine ;) See you on IRC!

Fran



