[Apertium-stuff] GSoC: Adopting a language pair: Tur-Tat / Kaz-Tat

Ilnar Salimzyan Sun, 25 Mar 2012 20:08:40 -0700

I am sorry that this email is so long. Honestly, I shortened it
several times. Please consider it the first draft of the proposal.


Dear Apertium mentors,

my name is Ilnar Salimzyanov (‘selimcan’ on Sourceforge and IRC,
‘Ilnar.salimzyan’ on Apertium’s wiki, ‘Ilnar Salimzyan’ in many other
places).

My native language is Tatar; I also speak Russian at a native level.

= Reason I am writing =
I would like to apply for Google Summer of Code and work on adopting
the Turkish-Tatar / Kazakh-Tatar language pair. I am writing here to
discuss my plans and to get some feedback, which would help me write
my proposal.

= Who I am / Some background information =
I am a first-year master’s student at Kazan Federal University,
studying Applied Linguistics [1].

I first got to know about Apertium in 2009, while writing a small
paper at the university comparing the available machine translation
systems. Apertium fascinated me then because it was open source, was
growing rapidly and was a good potential starting point for Tatar and
other Turkic languages (yes, I have thought about them too). I played
around with an lttoolbox dictionary for Tatar (a bad idea, I know, but
I didn’t know about FSTs then and there weren’t any other Turkic
languages involved). I even managed to model noun morphotactics with
it! :)

Back in 2009 I translated part of the Official Documentation into
Russian [2] (up to chapter 3.2.3; besides someone willing to finish
it, the translation needs a good editor). Also in 2009 I translated
the Apertium New language pair Howto into Russian.

I was one of the participants of the Šupaškar Apertium Workshop, held
in January this year, where Francis Tyers, Hector Alos-i-Font,
Jonathan Washington and Trond Trosterud were instructors.

I was very fortunate to see Jonathan and Francis work on the
Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and
move it to nursery. It is very useful to have a transducer for my
native language (and for the language closest to it) when learning the
semantics and structure of lexc and twol files (which I wasn’t really
familiar with, since using HFST with Apertium is a relatively new
thing and is not mentioned in the Official Documentation), along with
reading the famous FSMBook.

I have been involved in the work on the Tatar-Bashkir pair as, let’s
say, a “language consultant” and “tester”. Together with another
fellow from Ufa, I have been translating the top-5000 wordlist of the
Russian National Corpus into Tatar and Bashkir. These translations
were then added to the translator files. I have also been analyzing
errors in the translations, finding out where Apertium-tt-ba did not
perform so well, describing this on the wiki [3,4] and committing to
svn from time to time.

= Resources =
// I will list all relevant resources on the wiki before submitting
the proposal//

For both language pairs I will not have to start from absolute
scratch. Transducers for all three languages (Turkish, Kazakh and
Tatar) perform quite well, with 87%, 76% and 56% coverage
respectively [5]. Given that, I think the most efficient way to
benefit from these separate transducers is to write bidix files,
translating the words from each lexc file into Tatar.
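
For readers unfamiliar with the coverage figures, here is a minimal
sketch of a naive coverage measure: the share of tokens that receive
at least one analysis. It assumes analyser output in the Apertium
stream format, where an unknown token's analysis starts with '*'. The
function name is hypothetical, escaped characters are ignored, and the
figures on the wiki may have been computed differently.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ are not handled.
LEXICAL_UNIT = re.compile(r'\^([^/^$]*)/([^$]*)\$')


def naive_coverage(stream):
    """Return the fraction of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    # Unknown words are marked with a leading '*' in the analysis part.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)
```

Feeding it the analyser output for a whole corpus gives a single
number between 0 and 1.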

== Bilingual dictionaries ==
===Kazakh===
All words in kazakh.lexc [6] were commented with English glosses
(thanks to whoever did this!). Using a simple sed one-liner, I
prepared bidix entries with the Kazakh words on the left side, again
putting the English glosses into comments. In a few hours of work, I
translated ~500 nouns (not proper nouns) and most of the adjectives
into Tatar [7]. For Kazakh words which look very similar to Tatar ones
and have the same meaning as their Tatar equivalents, this can be done
very quickly. For the others I consulted Kazakh-Russian dictionaries
too; even so, translating all the remaining words from kazakh.lexc
will take no more than a few days of focused work.
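
The sed one-liner itself is not reproduced here. Purely as an
illustration, a comparable conversion could look like the Python
sketch below. It assumes lexc entries of the shape
`stem:stem CONTCLASS ; ! "gloss"`; the continuation-class names in
POS_BY_CONT and the function name are hypothetical and not taken from
kazakh.lexc.

```python
import re
from xml.sax.saxutils import escape

# Assumed entry shape: stem:stem CONTCLASS ; ! "English gloss"
ENTRY = re.compile(
    r'^(?P<lemma>[^\s:!;]+):\S*\s+(?P<cont>\S+)\s*;'
    r'\s*(?:!\s*"?(?P<gloss>[^"]*)"?)?\s*$'
)

# Hypothetical mapping from continuation classes to bidix POS tags.
POS_BY_CONT = {"N1": "n", "A1": "adj"}


def lexc_to_bidix(lines):
    """Build bidix entry skeletons with an empty right (Tatar) side."""
    entries = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # not a stem entry (e.g. a LEXICON header)
        pos = POS_BY_CONT.get(match.group("cont"))
        if pos is None:
            continue  # continuation class not covered by this sketch
        tag = '<s n="%s"/>' % pos
        entry = '<e><p><l>%s%s</l><r>%s</r></p></e>' % (
            escape(match.group("lemma")), tag, tag)
        gloss = (match.group("gloss") or "").strip().replace("--", "-")
        if gloss:
            entry += ' <!-- %s -->' % gloss  # gloss kept as a comment
        entries.append(entry)
    return entries
```

The right side is left empty on purpose: filling in the Tatar lemma is
the manual translation step.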
===Turkish===
Unfortunately, very few words in turkish.lexc have English glosses.
But there is a Tatar-Turkish dictionary which was released under the
GPL [8], and another Tatar-Turkish online dictionary [9], also under
the GPL. The process will be similar for Turkish: take the stems from
turkish.lexc, put them into the bidix automatically and translate them
into Tatar, consulting the dictionaries mentioned above or a printed
Turkish-Tatar dictionary where necessary.

==Parallel corpora==
Some sentences for Turkish-Tatar are available at the Tatoeba project.
Bible or Quran translations can serve as a source of parallel corpora.
I will send an email right away to Jörg Tiedemann (the maintainer of
the OPUS project) suggesting that he add the Quran translations from
tanzil.net as a sub-corpus to his corpora collection. These
translations are very easy to align and put into a TMX file, since
each aya (verse; usually one sentence, sometimes more) is placed on
one line. One just has to align the text files line by line using
something like uplug.
Even if Mr. Tiedemann refuses or doesn’t respond quickly, these
translations will be useful for my purposes while working on transfer
rules, whether or not TMX files are created.
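
As an illustration of how simple the line-by-line alignment is, here
is a minimal Python sketch that pairs two one-verse-per-line files and
serialises the pairs as TMX. It does not use uplug; the function
names, the header values and the language codes in the usage note are
assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def align_lines(src_lines, trg_lines):
    """Pair verses by line number; both files must have equal length."""
    if len(src_lines) != len(trg_lines):
        raise ValueError("files differ in length, cannot align line by line")
    pairs = []
    for src, trg in zip(src_lines, trg_lines):
        src, trg = src.strip(), trg.strip()
        if src and trg:  # skip verses that are empty on either side
            pairs.append((src, trg))
    return pairs


def to_tmx(pairs, src_lang, trg_lang):
    """Serialise aligned pairs as a small TMX 1.4 document (a string)."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",  # placeholder value
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, trg in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (trg_lang, trg)):
            variant = ET.SubElement(unit, "tuv", {"xml:lang": lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

For example, `to_tmx(align_lines(turkish_lines, tatar_lines), "tr", "tt")`
would produce one translation unit per verse.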

==Frequency lists==
According to Francis, the stems in both the Kazakh and the Turkish
transducers were taken from frequency lists (obtained from Kazakh RLFE
corpora and SETimes corpora, I guess), which is certainly good. As for
Tatar, corpus.tatfolk.ru (a project aiming to create a web-crawled
corpus of Tatar, similar in functionality to the Wortschatz project)
provided me with a frequency list of Tatar wordforms [10]. This
happened after I shared with them the preprocessed pages I had
collected earlier, which were concatenated with what corpus.tatfolk.ru
already had.
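
For illustration, a wordform frequency list of this kind can be
derived from raw text with a few lines of Python. This is only a
sketch with a deliberately crude tokenizer (runs of word characters,
lowercased); it is not the procedure used by corpus.tatfolk.ru, and
the function name is made up for this example.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (wordform, count) pairs, most frequent first."""
    # Crude tokenization: every run of Unicode word characters is a token.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```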

= Reasons for dual-application =
My first intention was to apply for the Turkish-Tatar pair, as I have
been learning Turkish intensively for half a year and can speak it
quite well now. But after learning about a Tatar-Turkish translator
which was already available [11] and about plans to release it under
the GPL (which could potentially mean that students from that
organization apply for the same pair), I decided to make a dual
application.

==Pros for the Turkish-Tatar pair==
+ there are GPLed bilingual dictionaries;
+ Turmorph has the highest coverage compared to the other Turkic
languages in Apertium;
+ I speak Turkish better than I speak Kazakh [can’t really say that I
speak Kazakh, never tried :)]. This is relevant for Tatar > Turkish
translation and for working on its transfer rules, I assume;

==Cons for the Turkish-Tatar pair==
- although Turmorph has good coverage, it is not far ahead of
Kazmorph, which has four times fewer stems.
This leads me to the conclusion that
- the available Turkish transducer has to be thoroughly revised.

==Pros for the Kazakh-Tatar pair==
+ Kazakh is closer to Tatar, so I understand almost everything I read
(and this without learning it!);
+ fewer transfer rules needed;
+ a better chance of completing the task in three months;

= General plans for the pairs =
1. Translate the stems from the Turkish and Kazakh transducers into
Tatar. Create bilingual dictionaries. Finish this as quickly as
possible, ideally even before the coding period begins.

2. Make transducers really compatible. I already started doing this
[see apertium-tat and apertium-kaz in branches], following
tag-choosing conventions described in
[http://wiki.apertium.org/wiki/Turkic_languages]
and general structure of continuation classes and lexicons implemented
in [branches tur kir].
This task also includes remedying the known shortcomings of the Tatar
transducer mentioned above.

3. Expand the transducer for Tatar with stems from the bidix files, if
it doesn’t recognize them.

4. Work on constraint grammars and transfer rules. Since Turkic
languages usually share their POS ambiguities, I guess that, by using
the same tags and the same logic of morphotactics (using “syntactic”
categories like ‘subst’, ‘attr’ etc.), the translators would perform
quite well even without many constraint grammar rules. So I will
invest more time in transfer rules (especially for the Turkish-Tatar
pair).
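
Step 3 above amounts to comparing the Tatar side of a bidix with the
stems the Tatar transducer already has. A rough Python sketch of that
comparison follows. It assumes the Tatar lemmas sit on the right side
of the bidix entries and that the known stems are available as a plain
collection of strings. The function names are hypothetical, and
multi-word lemmas (with <b/> elements) are not handled.

```python
import xml.etree.ElementTree as ET


def right_side_lemmas(bidix_xml):
    """Collect the lemma text of every <r> element in a bidix string."""
    root = ET.fromstring(bidix_xml)
    lemmas = set()
    for right in root.iter("r"):
        # The lemma is the text that precedes the first <s .../> tag.
        lemma = (right.text or "").strip()
        if lemma:
            lemmas.add(lemma)
    return lemmas


def missing_stems(bidix_xml, known_stems):
    """Return bidix lemmas that are absent from known_stems, sorted."""
    return sorted(right_side_lemmas(bidix_xml) - set(known_stems))
```

The resulting list is what would have to be added to the Tatar lexc
file by hand, with the right continuation classes.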

= Coding challenge =
My intention is to put together everything that is available and make
two pairs out of it which can be built and installed. They can then be
moved to nursery, which will be a good move psychologically :) I’d
like to see them in nursery before sending the proposal.

Apertium-tr-tt in incubator has already been shown to build, but it
needs to be checked and expanded before moving to nursery.

For the Kazakh-Tatar pair, I would like to try another solution: not
to put everything into one folder (which means duplicating some
files), but rather to make apertium-tat, apertium-kaz and
apertium-kaz-tat in branches buildable and installable as separate
“packages” (apertium-kaz-tat relying on the two separate transducers,
which should be installed first).
This can be achieved by modifying the makefiles from apertium-tur-kir.

I firmly believe that I can do all of that before sending the final
draft of the proposal(s), which I plan to do no later than April 1.

If it doesn’t sound realistic to you, please let me know.

= Proposal itself =
In what form should I submit the final version of the proposal: as a
PDF created from TeX files, or should I just put it on the wiki and
give you the link? Or doesn’t it really matter?

==================================================
I am sorry for all the mistakes in this email. For the final draft, I
will try to find someone proficient in English to check it for
errors :)

I will be thankful for any feedback and suggestions!

Best,
Ilnar Salimzyan

= Notes/References =
1. A not-so-clear term, which caused many debates. What we study is a
mix of computational linguistics, lexicography and several other
courses.
2. See
/apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt
3. http://wiki.apertium.org/wiki/Tatar_and_Bashkir
4. http://wiki.apertium.org/wiki/Morphology_of_Tatar_language
5. See the ‘Turmorph’, ‘Kazmorph’ and ‘Tatmorph’ pages on the wiki
6. See branches/apertium-kaz
7. See branches/apertium-kaz-tat
8. See incubator/apertium-tr-tt
9. http://dictionary.suleyman.cc/
10. See branches/apertium-tat/words
11. D. Suleymanov, R. Guilmoulline, A. Guilmoulline (2008) "THE
TATAR-TURKISH MACHINE TRANSLATION BASED ON THE TWO-LEVEL MORPHOLOGICAL
ANALYZER".
