As a follow-up to the Localisation Camp, we are doing a Payyans Workout
this Sunday at the Red Hat office in Magarpatta City. We expect
participants to know the basics of encoding. More at

http://wiki.smc.org.in/Localisation_Camp/Pune/Payyans_Workout

Please add your name before Friday 3pm if you are coming.

If you attended the Localisation Camp in Pune, please consider attending
this. If you are interested in attending but did not come for the
Localisation Camp, please come prepared with knowledge of Unicode and
ASCII encoding. Go through
http://wiki.smc.org.in/Localisation_Camp/2_Pune_20,21_March_2010#Summary_and_Presentations
first.

Some background:

To get started with the basics, we have to start with number systems.
As you might already know, computers understand only binary data, i.e.
zeros and ones. So how do we represent data in a way computers can
understand? Using a sequence of ones and zeros we can represent any
number. Now what about letters? Character encoding was introduced as a
way of representing characters as numbers. In the ASCII encoding, 7
bits are used to represent a character (there are 8-bit extensions as
well). Using 7 bits, we can represent up to 2^7 (128) characters. That
was sufficient to represent all the English/Latin letters and special
characters (including control characters). But there are so many
scripts around the world, and with 128 numbers we cannot represent all
of them.
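
The character-to-number mapping is easy to see in Python (the same
language Payyans is written in); ord() and chr() expose it directly:

```python
# Characters are just numbers to the computer. ord() gives the number
# for a character, chr() goes the other way, and format(..., '07b')
# shows the 7-bit binary pattern that ASCII actually stores.
for ch in "Hi!":
    code = ord(ch)              # character -> number
    bits = format(code, '07b')  # its 7-bit binary pattern
    print(ch, code, bits)

print(2 ** 7)  # 128 -- the number of distinct 7-bit values
```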

There were different attempts to solve this issue. For European
languages, 8-bit extended ASCII was sufficient. For Indian languages,
we started using the same numbers (from 0 to 127) to represent our
characters, but internally the computer still handled them as English
characters. We substituted Indian language characters in the font and
fooled the computer into thinking we were using English. This was good
enough for displaying Indian languages on screen and for printing,
though other important tasks like sorting and searching were
impossible, because internally the text was still understood as English
characters. This technique became widely popular, and even now many
popular newspapers use this system. Such a technique is so closely tied
to a font that it requires the same font used for entering the data to
be available on every system where one wants to read it.
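
To see why sorting breaks with this trick, here is a small Python
sketch. The byte-to-glyph mapping below is invented purely for
illustration and does not correspond to any real legacy font:

```python
# Hypothetical legacy font: suppose the glyph for क (ka) is drawn at
# the byte for 'k', and अ (a) at the byte for 'z'. (Invented mapping,
# for illustration only.)
legacy_to_glyph = {'k': 'क', 'z': 'अ'}

stored = ['z', 'k']                          # what the computer holds
displayed = [legacy_to_glyph[c] for c in stored]

# The computer sorts the Latin bytes, so क ('k') lands before अ ('z') --
# the wrong order for Devanagari.
print(sorted(stored))      # ['k', 'z']
# Real Unicode text sorts by Devanagari code points: अ (U+0905) < क (U+0915)
print(sorted(displayed))   # ['अ', 'क']
```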

Now Unicode comes into the picture, with a promise of uniquely
identifying every character in the world. The limit of 128 (or 256 with
8-bit ASCII) characters is taken away, and it became possible to have
separate code points/numbers for each of the Indian languages. There
are different ways of representing these numbers, and these are called
encoding methods. The most popular is UTF-8, which uses a variable
number of bytes to represent a character. There is also UTF-16, which
uses 16-bit units (with a pair of units for characters outside the
Basic Multilingual Plane). Unicode-encoded data can be read using any
Unicode font, taking away the dependency on a particular font. The
OpenType specification for fonts has an option for substituting a
sequence of characters with another glyph (a glyph is the pictorial
representation of a character). This takes care of conjuncts; e.g. ka
halant ka (क ् क) is substituted with kka (क्क).
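
These distinctions -- code point vs. encoded bytes vs. rendered glyph --
are easy to check in Python:

```python
ka = 'क'  # DEVANAGARI LETTER KA

print(hex(ord(ka)))            # 0x915 -- its unique Unicode code point
print(ka.encode('utf-8'))      # b'\xe0\xa4\x95' -- three bytes in UTF-8
print(ka.encode('utf-16-be'))  # b'\x09\x15' -- one 16-bit unit in UTF-16
print('A'.encode('utf-8'))     # b'A' -- ASCII characters stay one byte

# The conjunct kka is three code points (ka, halant, ka) in the data;
# the font's OpenType rules render them as the single glyph क्क.
print(len('क्क'), list('क्क'))
```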

Even though Unicode is used widely on the internet, some applications
popularly used for DTP still do not support it, and many people have
not moved to Unicode. So there is a lot of data encoded in these ASCII
formats which needs to be converted to Unicode if we want to make it
readable without a specific font, searchable, sortable ...

Payyans is one such piece of software, written in Python, for
converting ASCII font-specific data into Unicode. Padma is a Firefox
plugin which does the same for many Indian languages. Now, it may seem
simple to map the ASCII data to its corresponding Unicode, but each
font followed its own encoding, so for every ASCII font you need a
separate mapping table. Moreover, there is script-specific reordering
required for proper conversion, like moving the ikar from left to right
(in ASCII the ikar is added before the conjunct, but in Unicode the
ikar is added after the conjunct).
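
A minimal Python sketch of this kind of conversion, in the spirit of
Payyans but not its actual code -- the legacy mapping table here is
invented, and real fonts need much larger maps and more rules:

```python
# Invented legacy-font mapping (illustrative only, not any real font).
FONT_MAP = {
    'd': 'क',   # ka
    'f': '्',   # halant (virama)
    'i': 'ि',   # ikar (i-matra), typed BEFORE the conjunct in legacy fonts
}
PREBASE = {'ि'}  # matras that must move to AFTER the conjunct in Unicode

def legacy_to_unicode(text):
    chars = [FONT_MAP.get(c, c) for c in text]
    out, i = [], 0
    while i < len(chars):
        if chars[i] in PREBASE:
            matra = chars[i]
            i += 1
            # Copy the whole consonant cluster: consonant [halant consonant]*
            if i < len(chars):
                out.append(chars[i]); i += 1
                while i + 1 < len(chars) and chars[i] == '्':
                    out.extend(chars[i:i + 2]); i += 2
            out.append(matra)  # in Unicode, the matra follows the cluster
        else:
            out.append(chars[i]); i += 1
    return ''.join(out)

# Legacy order: ikar, ka              -> Unicode order: ka, ikar
print(legacy_to_unicode('id'))
# Legacy order: ikar, ka, halant, ka  -> Unicode order: ka, halant, ka, ikar
print(legacy_to_unicode('idfd'))
```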

For Devanagari, the conversion requirement is more complex than for
Malayalam, so we need to adapt Payyans to support Devanagari. Work has
already started, and it needs handling of some specific cases.

And about Santhosh Thottingal:

He is one of the most active developers working on Indian language
technologies. He is the main author of and inspiration behind the
ambitious Silpa project (http://smc.org.in/silpa/), which aims to bring
all Indian language applications under one roof. Some of these include:
   * Guess Language
   * Encoding Converter
   * Approximate Search
   * Sort
   * Spellcheck
   * Dictionary
   * Transliterate
   * Hyphenation
   * Syllabalize
   * N-Gram
   * Random Quote
   * Indic Soundex
   * Character Details
   * Katapayadi Numbers
   * Webfonts

Currently most of these projects have support for Malayalam. Some of
them have support for Tamil and Kannada. He is actively looking for
more developers to join his team to add support for Marathi, Hindi and
the rest of the Indian languages. He is also the maintainer of Dhvani,
an Indian language text-to-speech system, of aspell spell-checking
dictionaries, hyphenation support ... and many more. He is in Pune on a
short visit and it would be a great opportunity to learn from him.

Thanks
Praveen
-- 
പ്രവീണ്‍ അരിമ്പ്രത്തൊടിയില്‍
You have to keep reminding your government that you don't get your
rights from them; you give them permission to rule, only so long as
they follow the rules: laws and constitution.
