First of all: thanks for your responses!

On Mon, Mar 26, 2012 at 2:37 PM, Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
> Francis Tyers <fty...@prompsit.com> writes:
>
>> On Mon, 26 Mar 2012 at 07:08 +0400, Ilnar Salimzyan wrote:
>>> I am sorry this email is so long. Honestly, I shortened it
>>> several times. Consider it the first draft of the proposal.
>>
>> No problem :)
>>
>>> Dear Apertium mentors,
>>>
>>> my name is Ilnar Salimzyanov (‘selimcan’ on Sourceforge and IRC,
>>> ‘Ilnar.salimzyan’ on Apertium’s wiki, ‘Ilnar Salimzyan’ in many other
>>> places).
>>
>> Hi Ilnar! :D
>>
>>> My native language is Tatar, and I also speak Russian at a native
>>> level.
>>>
>>> = Reason I am writing =
>>>
>>> I would like to apply for Google Summer of Code and work on adopting
>>> the Turkish-Tatar / Kazakh-Tatar language pairs. I am writing here to
>>> discuss my plans and to get some feedback, which will make writing my
>>> proposal easier.
>>>
>>> = Who I am / Some background information =
>>> I am a first-year master’s student at Kazan Federal University,
>>> studying Applied Linguistics [1].
>>>
>>> I first learned about Apertium in 2009, while writing a small paper
>>> at the university comparing the available machine translation
>>> systems. Apertium fascinated me then: it was open source, growing
>>> rapidly, and a good potential starting point for Tatar and other
>>> Turkic languages (yes, I have thought about them too). I played
>>> around with an lttoolbox dictionary for Tatar (a bad idea, I know,
>>> but I didn’t know about FSTs then, and there weren’t any other Turkic
>>> languages involved). I even managed to model noun morphotactics with
>>> it! :)
>>
>> Well, lttoolbox also produces FSTs; the difference is that in HFST you
>> have separate morphotactics and morphophonology transducers, which are
>> then composed to form the final transducer, whereas in lttoolbox there
>> is only a single transducer.
>>
Yes, it is certainly not correct to restrict “finite-state transducers”
to the ones described in the FSMBook :) By “FST” I was rather referring
to transducer technologies that have “FST” in their names, without
considering that lttoolbox is also a transducer technology and can
therefore also be called an “FST”.

>>> Back in 2009 I translated part of the Official Documentation into
>>> Russian [2] (up to chapter 3.2.3; besides someone willing to finish
>>> it, the translation needs a good editor). Also in 2009 I translated
>>> the Apertium New Language Pair HOWTO into Russian.
>>>
>>> I was one of the participants of the Šupaškar Apertium Workshop, held
>>> in January this year, where Francis Tyers, Hector Alos-i-Font,
>>> Jonathan Washington and Trond Trosterud were instructors.
>
> Cool =D
>
>>> I was very fortunate to see Jonathan and Francis work on the
>>> Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and
>>> move it to nursery. It is very useful to have a transducer for my
>>> native language (and for the language closest to it) when learning
>>> the semantics and structure of lexc and twol files (which I wasn’t
>>> really familiar with, since using HFST with Apertium is a relatively
>>> new thing and is not mentioned in the Official Documentation), along
>>> with reading the famous FSMBook.
>>
>> :)
>>
>>> I have been involved in the work on the Tatar-Bashkir pair as, let’s
>>> say, a “language consultant” and “tester”. Together with another
>>> fellow from Ufa, I have been translating the top-5000 word list of
>>> the Russian National Corpus into Tatar and Bashkir. These
>>> translations were then added to the translator files. Also, I have
>>> been analyzing some errors in the translations, finding out where
>>> apertium-tt-ba performed not so well, describing it on the wiki [3,4]
>>> and committing from time to time to svn.
>>>
>>> = Resources =
>>> // I will list all relevant resources on the wiki before submitting
>>> the proposal //
>>>
>>> For both language pairs I will not have to start from absolute
>>> scratch. Transducers for all three languages — Turkish, Kazakh and
>>> Tatar — perform quite well, with 87%, 76% and 56% coverage,
>>> respectively [5]. Given that, I think the crucial way to get the most
>>> out of these separate transducers with the least work is to write
>>> bidix files, translating the words from each lexc file into Tatar.
>>>
>>> == Bilingual dictionaries ==
>>> === Kazakh ===
>>> All words in kazakh.lexc [6] are commented with English glosses
>>> (thanks to whoever did this!). Using a simple sed one-liner, I
>>> prepared bidix entries with the Kazakh words on the left side,
>>> putting the English glosses into comments again. In a few hours’
>>> work, I translated ~500 nouns (not proper nouns) and most of the
>>> adjectives into Tatar [7]. For Kazakh words which look very similar
>>> to Tatar ones and have the same meaning as their Tatar equivalents,
>>> this can be done very quickly. For the others I consulted
>>> Kazakh-Russian dictionaries too, but again, translating all the
>>> remaining words from kazakh.lexc will take no more than a few days of
>>> focused work.
>>> === Turkish ===
>>> Unfortunately, very few words in turkish.lexc have English glosses.
>>> But there is a Tatar-Turkish dictionary which was released under the
>>> GPL [8], and another Tatar-Turkish online dictionary [9], also under
>>> the GPL. The process will be similar for Turkish — take the stems
>>> from turkish.lexc, put them automatically into the bidix and
>>> translate them into Tatar, consulting where necessary the
>>> dictionaries mentioned above or a printed Turkish-Tatar dictionary.
>>>
>>> == Parallel corpora ==
>>> Some sentences for Turkish-Tatar are available in the Tatoeba
>>> project. Bible or Quran translations can serve as a source of
>>> parallel corpora.
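As an aside on the lexc-to-bidix step described above: the conversion is
mechanical enough to sketch. The entry shape assumed below
(`stem:stem ContClass ; ! "gloss"`) and the helper name are illustrative
guesses, not the actual layout of kazakh.lexc:

```python
import re

# Hypothetical lexc entry shape: stem:stem ContClass ; ! "gloss"
LEXC_LINE = re.compile(r'^(?P<stem>[^:\s]+):\S+\s+\S+\s*;\s*!\s*"?(?P<gloss>[^"]*)"?')

def lexc_to_bidix(lines, pos="n"):
    """Turn gloss-commented lexc entries into bidix <e> skeletons.

    The right (Tatar) side is left empty for manual translation;
    the English gloss is carried along as an XML comment."""
    entries = []
    for line in lines:
        m = LEXC_LINE.match(line.strip())
        if not m:
            continue  # LEXICON headers, pure comment lines, etc.
        stem, gloss = m.group("stem"), m.group("gloss").strip()
        entries.append(
            f'<e><p><l>{stem}<s n="{pos}"/></l>'
            f'<r><s n="{pos}"/></r></p></e> <!-- {gloss} -->'
        )
    return entries

sample = [
    'алма:алма N1 ; ! "apple"',
    '! a stray comment line, skipped',
    'тау:тау N1 ; ! "mountain"',
]
for entry in lexc_to_bidix(sample):
    print(entry)
```

The same thing is of course doable with sed alone; the point is just
that the gloss comments make the left side of the bidix nearly free.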
>>> Right away I will send an email to Jörg Tiedemann (the maintainer of
>>> the OPUS project) suggesting that he add the Quran translations from
>>> tanzil.net as a sub-pool to his corpus collection. These translations
>>> are very easy to align and put into a TMX file, since each aya (a
>>> verse — usually one sentence, sometimes more) is placed on a line of
>>> its own. One just has to align the text files line by line using
>>> something like uplug. Even if Mr. Tiedemann refuses, or doesn’t
>>> respond quickly, these translations will be useful for my purposes
>>> while working on transfer rules — whether TMX files get created or
>>> not.
>>
>> (rather than "refuses", better say "doesn't take me up on the offer"
>> -- refuses sounds a bit strong) :)
>
> Did you have a specific plan involving TMX files?
>

No, I don’t have any plans or tasks which require the corpus to be in
TMX format. I described why it is easy to convert them to TMX just to
increase the probability that Mr. Tiedemann takes me up on the offer ;)
I think these reflections should definitely not go into the final draft
of the proposal.

>>> == Frequency lists ==
>>> According to Francis, the stems in both the Kazakh and Turkish
>>> transducers were taken from frequency lists (obtained from the Kazakh
>>> RLFE corpus and the SETimes corpus, I guess), which is certainly
>>> good. As for Tatar: after I shared with corpus.tatfolk.ru (a project
>>> aiming to create a web-crawled corpus of Tatar, similar in
>>> functionality to the Wortschatz project) the preprocessed pages I had
>>> collected earlier, and these were concatenated with what
>>> corpus.tatfolk.ru already had, they provided me with a frequency list
>>> of Tatar wordforms [10].
>>>
>>> = Reasons for the dual application =
>>> My first intention was to apply for the Turkish-Tatar pair, as I have
>>> actually been learning Turkish intensively for half a year and can
>>> speak it quite well now.
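Since each translation keeps one aya per line, pairing two such files up
really is just zipping lines together. A minimal sketch of that
line-by-line TMX writer (the function name and the stripped-down
`<header>` are my own simplifications; a real TMX 1.4 header carries
more attributes, and uplug/OPUS will emit something richer):

```python
from xml.sax.saxutils import escape

def align_to_tmx(src_lines, trg_lines, src_lang="tr", trg_lang="tt"):
    """Pair up two line-aligned texts (one verse per line) as TMX
    translation units; lines are matched purely by position."""
    units = []
    for src, trg in zip(src_lines, trg_lines):
        units.append(
            "  <tu>\n"
            f'   <tuv xml:lang="{src_lang}"><seg>{escape(src.strip())}</seg></tuv>\n'
            f'   <tuv xml:lang="{trg_lang}"><seg>{escape(trg.strip())}</seg></tuv>\n'
            "  </tu>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<tmx version="1.4">\n'
        f' <header srclang="{src_lang}"/>\n'
        " <body>\n" + "\n".join(units) + "\n </body>\n</tmx>"
    )

print(align_to_tmx(
    ["Rahman ve Rahim olan Allah'ın adıyla."],
    ["Рәхимле һәм шәфкатьле Аллаһ исеме белән."],
))
```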
>>> But after learning about the Tatar-Turkish translator which was
>>> already available [11] and about plans to release it under the GPL
>>> (which could potentially mean students of that organization applying
>>> for the same pair), I decided to make a dual application.
>>
>> Well, we still haven't heard anything, after being quite explicit that
>> they should get in contact with us as soon as possible, so it isn't
>> certain that they will submit a valid application. I should probably
>> send a reminder to Diljara.
>>
>>> == Pros for the Turkish-Tatar pair ==
>>> + there are GPL’d bilingual dictionaries;
>>> + TRmorph has the largest coverage compared with the other Turkic
>>> languages in Apertium;
>>> + I speak Turkish better than I speak Kazakh [I can’t really say that
>>> I speak Kazakh, never tried :)]. This is relevant for Tatar → Turkish
>>> translation and for working on its transfer rules, I assume;
>>>
>>> == Cons for the Turkish-Tatar pair ==
>>> - although TRmorph has good coverage, it is not so far ahead of
>>> Kazmorph, which has four times fewer stems.
>>> This allows me to conclude that
>>> - the available Turkish transducer has to be thoroughly revised.
>>
>> Yes, the morphotactics are based on my conversion from TRmorph, and
>> given my lack of knowledge of Turkish, these (especially the verb
>> lexica) will need to be seriously revised.
>>
>>> == Pros for the Kazakh-Tatar pair ==
>>> + Kazakh is closer to Tatar, so I understand almost everything I read
>>> (and this without having studied it!);
>>> + fewer transfer rules are needed;
>>> + there is more chance of completing the task in three months;
>>
>> Another pro might be that you will be able to create the transducers
>> for Kazakh and Tatar in lockstep, meaning fewer problems when it comes
>> to testvoc.
>
> Kazakh-Tatar sounds like the best choice to me.

I think I should mark the Kazakh-Tatar pair as my preferred choice.

>>> = General plans for the pairs =
>>> 1.
>>> Translate the stems from the Turkish and Kazakh transducers into
>>> Tatar; create the bilingual dictionaries. Finish this as quickly as
>>> possible, ideally even before the coding period begins.
>>>
>>> 2. Make the transducers really compatible. I have already started
>>> doing this [see apertium-tat and apertium-kaz in branches], following
>>> the tag-choosing conventions described in
>>> [http://wiki.apertium.org/wiki/Turkic_languages]
>>> and the general structure of continuation classes and lexicons
>>> implemented in [branches tur kir].
>>>
>>> This task also includes remedying the known shortcomings of the
>>> transducer for Tatar mentioned above.
>>>
>>> 3. Expand the transducer for Tatar with stems from the bidix files
>>> that it doesn’t recognize yet.
>>>
>>> 4. Work on constraint grammars and transfer rules. Since Turkic
>>> languages usually share the same POS ambiguities, by using the same
>>> tags and the same morphotactic logic (with “syntactic” categories
>>> like ‘subst’, ‘attr’ etc.), I guess the translators will perform
>>> quite well even without many Constraint Grammar rules. So I will
>>> invest more time in transfer rules (especially for the Turkish-Tatar
>>> pair).
>>
>> Seems like a pretty sensible plan.
>>
>>> = Coding challenge =
>>> My intention is to put everything available together and turn it into
>>> two pairs which can be built and installed. They can then be moved to
>>> nursery, which will be a good move psychologically :) I’d like to see
>>> them in nursery before sending the proposal.
>>
>> Great! :D
>>
>>> apertium-tr-tt in incubator has already proved to build, but needs to
>>> be checked and expanded before moving to nursery.
>>>
>>> For the Kazakh-Tatar pair, I would like to try another solution — not
>>> to put everything into one folder (which means duplicating some of
>>> the files), but rather to make apertium-tat, apertium-kaz and
>>> apertium-kaz-tat in branches buildable and installable as separate
>>> “packages” (with apertium-kaz-tat relying on the two separate
>>> transducers, which would have to be installed first).
>>> This can be achieved by modifying the makefiles from apertium-tur-kir.
>>>
>>
>> This sounds nice, but there are many things to be resolved; I wouldn't
>> be too worried if you aren't able to do it. Remember, the problem
>> isn't getting it to build with separate dictionaries, but getting it
>> to pass testvoc with separate dictionaries (tt-ba still doesn't quite
>> pass due to differences in the verb lexica).

So, if I understand you correctly, to run testvoc on a pair you don't
have to have the pair installed; all the dictionaries just have to be
available and compiled? I didn't know that before, thanks.

> It'd be nice to have some general method for deduplicating
> dictionaries … We use a trimming script in apertium-sme-nob; with this
> method, you would have apertium-kaz and apertium-tat as just
> "development dependencies". So you'd add stuff to apertium-kaz/kaz.lexc
> and to your bidix, and then run a script from apertium-kaz-tat with the
> path to apertium-kaz, and it creates a file apertium-kaz-tat/kaz.lexc
> (and you never change this file, although it's in SVN). Similarly for
> tat.lexc.
>
> This works, as long as the trimming script is well configured, but
> perhaps it'd be 'cleaner' to have apertium-kaz/apertium-tat as "make
> dependencies" and do the trimming each time you type make (no need for
> apertium-kaz-tat to have generated kaz.lexc/tat.lexc files in SVN).
>
> (The weak point in the chain is the trimming script though, which
> expects the lexc files to be fairly easily parsable (they're not,
> really).
> Ideally we would have ways of trimming both HFST and lttoolbox
> dictionaries so that we never had to copy-paste anything between pairs,
> but language pairs tend to have stuff in them that's rather specific to
> that pair; not sure how that is best dealt with.)

I find these ideas very insightful. And you explained them so well with
“make dependencies” and “development dependencies”! For now, I’ll try to
make apertium-kaz and apertium-tat “make dependencies”. That seems
simpler to achieve, as the same approach is used for apertium-tur-kir
(or, rather, is planned to be used). But I find your approach better.
I’ll move this sub-discussion into another mail to apertium-stuff,
because it really goes beyond our topic here. So further ideas about
dictionary deduplication can be found in one of the following threads.

>>> I firmly believe that I can do all of that before sending the final
>>> draft of the proposal(s), which I plan to do no later than April 1.
>>>
>>> If that doesn’t sound realistic to you, please let me know.
>>>
>>> = The proposal itself =
>>> In what form will I have to submit the final version of the proposal
>>> — as a PDF created from TeX files — or should I just put it on the
>>> wiki and give you a link to it, or doesn’t it really matter?
>>
>> Best is on the wiki (a subpage of your user page, /Application or
>> something like that), and then the final version as a PDF.
>
> I think the one that goes to Google has to be plain text anyway.
>
>>> ==================================================
>>>
>>> I am sorry for all the mistakes in this email. Unlike with this one,
>>> I will try to find someone proficient in English to check my final
>>> draft for errors :)
>>
>> It won't be necessary, your English is fine ;) See you on IRC!
>>
>> Fran
>
> ------------------------------------------------------------------------------
> This SF email is sponsored by:
> Try Windows Azure free for 90 days Click Here
> http://p.sf.net/sfu/sfd2d-msazure
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff