As a follow-up to the Localisation Camp, we are doing a Payyans Workout this Sunday at the Red Hat office in Magarpatta City. We expect participants to know the basics of encoding. More at
http://wiki.smc.org.in/Localisation_Camp/Pune/Payyans_Workout

Please add your name there before Friday 3pm if you are coming. If you attended the Localisation Camp in Pune, please consider attending this too. If you are interested but did not attend the Localisation Camp, please come prepared with knowledge of Unicode and ASCII encoding; go through http://wiki.smc.org.in/Localisation_Camp/2_Pune_20,21_March_2010#Summary_and_Presentations first.

Some background:

To get started with the basics, we have to start from the number system. As you might already know, computers understand only binary data, i.e. zeros and ones. So how do we represent data in a way computers can understand? Using a sequence of ones and zeros we can represent any number. Now what about letters? Character encoding was introduced as a way of representing characters as numbers. In ASCII encoding, 7 bits are used to represent a character (there is an 8-bit variant as well). With 7 bits we can represent up to 2^7 (128) characters. That was sufficient for all the English/Latin characters plus special characters (including control characters). But there are many scripts around the world, and with 128 numbers we cannot represent all of them.

There were different attempts to solve this problem. For European languages, 8-bit ASCII was sufficient. For Indian languages, we started using the same numbers (0 to 127), but internally the computer still handled the text as English characters: we substituted Indian language shapes into the font and fooled the computer into thinking we were using English. This was good enough for displaying and printing Indian languages, but other important tasks like sorting and searching were impossible, because internally the text was still treated as English. The technique became widely popular, and even now many popular newspapers use this system. It is so closely tied to a font that the same font used for entering the data has to be available on every system where one wants to read it.

Then Unicode comes into the picture, with the promise of uniquely identifying every character in the world. The limit of 128 (or 256 with 8-bit ASCII) characters is gone, and it became possible to have separate code points/numbers for each of the Indian languages. There are different ways of representing these numbers, called encodings. The most popular is UTF-8, which uses a variable number of bytes per character; there is also UTF-16, which represents a character using 16-bit units. Unicode-encoded data can be read using any Unicode font, taking away the dependency on a particular font. The OpenType font specification has a provision for substituting a sequence of characters with another glyph (a glyph is the pictorial representation of a character); this takes care of conjuncts, i.e. ka halant ka (क ् क) is substituted with kka (क्क).

Even though Unicode is widely used on the internet, some applications popular for DTP still do not support it, and many people have not moved to Unicode. So there is a lot of data encoded in these ASCII font-specific formats which needs to be converted to Unicode if we want to make it readable without a specific font, searchable, sortable... Payyans is such a piece of software, written in Python, for converting ASCII font-specific data into Unicode. Padma is a Firefox plugin which does the same for many Indian languages.

Now, it may seem simple to map the ASCII data to its corresponding Unicode, but each font followed its own encoding, so for every ASCII font you need a separate mapping table. Moreover, script-specific reordering is required for a proper conversion, like moving the ikar from left to right (in ASCII the ikar comes before the conjunct, but in Unicode it comes after the conjunct). The small Python snippets below illustrate these ideas.
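If you want to play with the encoding basics before Sunday, a Python 3 shell is enough. A small illustration, using only the standard library (the printed values are in the comments):

# Characters are just numbers to the computer.
print(2 ** 7)                   # 128 -> what 7-bit ASCII can cover
print(ord('A'))                 # 65  -> the number behind 'A'
print(bin(ord('A')))            # 0b1000001 -> the bits actually stored

# UTF-8 uses a variable number of bytes per character:
print('A'.encode('utf-8'))      # b'A'            -> 1 byte
print('क'.encode('utf-8'))      # b'\xe0\xa4\x95' -> 3 bytes

# UTF-16 uses 16-bit units (byte order given explicitly, so no BOM):
print('क'.encode('utf-16-be'))  # b'\t\x15' -> 2 bytes, i.e. U+0915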
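To see what the OpenType substitution receives as input, you can inspect the code points behind the kka conjunct; the single glyph you see on screen is really three stored characters:

import unicodedata

# The conjunct kka is stored as three code points; the font's
# OpenType rules substitute one glyph for the sequence when rendering.
for ch in 'क्क':
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))
# U+0915 DEVANAGARI LETTER KA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0915 DEVANAGARI LETTER KA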
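And to make the mapping-plus-reordering idea concrete, here is a minimal sketch in the spirit of Payyans. Note this is not Payyans itself: the mapping table below is invented for this example, and a real converter uses the actual legacy font's layout and handles many more cases.

MATRA_I = '\u093f'   # ि DEVANAGARI VOWEL SIGN I (the "ikar")

# Hypothetical font-specific table: which Unicode character the legacy
# font actually drew for each ASCII byte. Values invented for this demo.
ASCII_TO_UNICODE = {
    'k': '\u0915',   # क  pretend the legacy font drew KA for byte 'k'
    'i': MATRA_I,    # ि  pretend it drew the i-matra for byte 'i'
}

def to_unicode(legacy_text):
    """Map bytes through the table, then fix the ikar ordering."""
    # Step 1: per-byte substitution via the font-specific table.
    chars = [ASCII_TO_UNICODE.get(c, c) for c in legacy_text]
    # Step 2: reordering. Legacy encodings store the ikar *before* the
    # consonant (its visual position); Unicode stores it *after*.
    # Simplified here: a full converter moves it past a whole conjunct.
    out, pending = [], None
    for c in chars:
        if c == MATRA_I:
            pending = c            # hold the matra until the consonant
        else:
            out.append(c)
            if pending:
                out.append(pending)
                pending = None
    if pending:                    # stray matra at the end of input
        out.append(pending)
    return ''.join(out)

print(to_unicode('ik'))   # legacy "i + ka" becomes Unicode "ka + i": कि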
For Devanagari the conversion requirement is more complex than for Malayalam, so we need to adapt Payyans to support Devanagari. Work has already started, and it needs handling of some specific cases.

About Santhosh Thottingal: he is one of the most active developers working on Indian language technologies. He is the main author and the inspiration behind the ambitious SILPA project (http://smc.org.in/silpa/), which aims to bring all Indian language applications under one roof. Some of these include:

* Guess Language
* Encoding Converter
* Approximate Search
* Sort
* Spellcheck
* Dictionary
* Transliterate
* Hyphenation
* Syllabalize
* N-Gram
* Random Quote
* Indic Soundex
* Character Details
* Katapayadi Numbers
* Webfonts

Currently most of these have support for Malayalam, and some of them support Tamil and Kannada. He is actively looking for more developers to join his team and add support for Marathi, Hindi and the rest of the Indian languages. He is also the maintainer of the Dhvani Indian language text-to-speech system, aspell spell-checking dictionaries, hyphenation support... and many more. He is in Pune on a short visit and it would be a great opportunity to learn from him.

Thanks
Praveen

--
പ്രവീണ് അരിമ്പ്രത്തൊടിയില്
You have to keep reminding your government that you don't get your rights from them; you give them permission to rule, only so long as they follow the rules: laws and constitution.
