For a couple of years I have been talking to various people inside WMF about the need to solve the conversion engine issue systematically. However, the responses I get are incomprehension (in the better cases) or silence.
== Why do we need conversion engines? ==

Unlike, for example, French, English, German and Russian, there are languages which have more than trivial internal differences. The differences may be:
* slightly different orthographies, so that a person who knows one orthography is not able to write in the other;
* slightly different language varieties (or "dialects"), so that a person who knows one variety is not able to write in the other;
* different scripts, so that a person who knows one script doesn't know the other [well];
* some combination of the above.

The options which we have are:
* Not to care about the differences. The best-known case is the English language projects, which allow writing in both major varieties. However, the difference between "kilometer" and "kilometre" is small and belongs to the common knowledge of every educated English speaker. The other cases known to me are the Persian language projects (Farsi and Dari are allowed) and the Serbian language projects (Ekavian and Iyekavian are allowed). The problem with this approach is that at least one group, usually the bigger one, doesn't know how to write in the other variety. Speakers of Farsi don't know how to write Dari, just as speakers of Ekavian don't know how to write Iyekavian. There are significant problems in maintaining and expanding articles written in the variety of the minority group: even with a lot of good will, a speaker of the majority group has to ask a speaker of the minority group to check the consistency of an article, *if* there are active speakers of the minority group at the project.
* To make different projects. This is the case with the Belarusian projects. (Parts of the Belarusian diaspora don't want to write in the "communist" orthography, while the educational system (including the educational system for the Belarusian minority in Poland) uses that orthography.)
I see that as the worst possible solution: instead of having one project for one language system, there are two projects, which means that the effort needed to build a good source of knowledge is doubled.
* To use a conversion engine. A few conversion engines have been implemented: Chinese, Serbian and Kazakh (I think that this is the full list, but I am not sure). This is the best possible solution *if* it works.

The smallest issue is in the Serbian case. All literate people in Serbia know how to write in both scripts, Cyrillic and Latin. The choice of script is a matter of preference and only rarely a matter of functional style (materials for children will usually be written in Cyrillic, while emails will usually be written in Latin; formal acts have to be written in Cyrillic).

Chinese is a little bit more complex because there are a large number of characters. However, AFAIK, the Simplified and Traditional scripts share a number of characters, and some of the others may be guessed from context.

But, again, the current implementation can handle only cases which fulfill the following two conditions: (1) they are more or less straightforward (more or less one character for one character) and (2) speakers are able to read and write (at least partially) the other script.

== Problems with the current conversion engine ==

* The current conversion engine is able to convert the text only for reading. When you switch to edit mode, you are able to see the text only in one script (the one in which the article is written). This is not a problem in the Serbian case, and it is a small-scale problem in the Chinese case. However, it would be a significant problem in cases like Azerbaijani: an Azerbaijani from Azerbaijan doesn't know the Perso-Arabic script, while only educated Azerbaijanis from Iran know the Latin script, and not very well (note that literacy in Iran is ~80%, which is quite low by Western standards; it means that one in five persons doesn't know how to read and write).
In other words: make a simple conversion engine, one to one, from Latin to Arabic script for English and try to read the converted text. If you don't want to bother with right-to-left text, try Devanagari.
* The current conversion engine converts *everything* into the output script. This means that text with mixed scripts will be converted into one script. This is useful in the Chinese case because contributors may write text in either script, while readers are able to read it in one of them. It is a redundant (and sometimes irritating) feature in the Serbian case because no one writes Serbian texts by mixing Cyrillic and Latin (except, of course, for scientific purposes). But it makes the engine useless in cases where only orthographies or language varieties need to be converted. For example, if Dari has a word whose form is X and whose meaning is A (written in Farsi as Y), and Farsi has a word whose form is X (written in Dari as Z) and whose meaning is B, the only option which the conversion engine gives is escape syntax like -{ Dari: X; Farsi: Y }-. Imagine now how the wiki code would look if, for example, the genitive case in Dari were written like the accusative case in Farsi: all syntactic objects would have to be escaped, which means that almost every sentence would have one escape from the regular rules.

== What do we need? ==

Actually, we don't need a lot to solve this problem. I have the solution for the most important part of the problem, the linguistic one. Even if I don't have enough time to deal with all cases, I am able to find students or professors of linguistics who are willing to work on those issues for free (they would have scientific papers after the work is done). We need "just" a PHP programmer who is willing to work on this problem. And for a couple of years I haven't found one (even though I know a lot of PHP programmers).

P.S. I am writing this because I've got an email asking for help in solving an orthography problem.
The only option which I am able to give them is to make a Python script which would make four articles from one at their project.

_______________________________________________
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
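
As a rough illustration of the limits described under "Problems with the current conversion engine", here is a minimal Python sketch of a character-for-character converter with -{ ... }- escapes. It is not the actual engine; the function names, the variant labels ("lat", "Dari", "Farsi") and the way the table is stored are assumptions made only for this example. The letter table is the standard Serbian Cyrillic to Latin correspondence.

```python
import re

# Serbian Cyrillic -> Latin, lower case; some letters map to digraphs.
_PAIRS = ("а a б b в v г g д d ђ đ е e ж ž з z и i ј j к k л l љ lj м m "
          "н n њ nj о o п p р r с s т t ћ ć у u ф f х h ц c ч č џ dž ш š").split()
CYR_TO_LAT = dict(zip(_PAIRS[0::2], _PAIRS[1::2]))
# Upper-case letters: digraphs become "Lj", "Nj", "Dž".
CYR_TO_LAT.update({c.upper(): l.capitalize() for c, l in list(CYR_TO_LAT.items())})

# An escaped span: -{ ... }-
ESCAPE = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def _letters(chunk, table):
    """Replace each character found in the table; keep everything else."""
    return "".join(table.get(ch, ch) for ch in chunk)


def _pick(body, variant):
    """Resolve one escape: 'name: text; name: text' or a plain escape."""
    options = {}
    for part in body.split(";"):
        if not part.strip():
            continue
        name, sep, value = part.partition(":")
        if not sep:
            return body  # plain escape: output the span unchanged
        options[name.strip()] = value.strip()
    return options.get(variant, body)


def convert(text, variant, table=None):
    """Convert everything outside escapes; resolve escapes for `variant`."""
    table = table or {}
    out, pos = [], 0
    for m in ESCAPE.finditer(text):
        out.append(_letters(text[pos:m.start()], table))
        out.append(_pick(m.group(1), variant))
        pos = m.end()
    out.append(_letters(text[pos:], table))
    return "".join(out)


print(convert("Београд и -{Њујорк}-", "lat", CYR_TO_LAT))  # Beograd i Њујорк
print(convert("-{ Dari: X; Farsi: Y }-", "Farsi"))          # Y
```

The script case needs only the letter table. The orthography or variety case has no table at all in this sketch, so every differing word has to be written as an escape by hand, which is the problem described above.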