Thank you for this info. A lot of Hindi content is still being generated in non-Unicode fonts (much of the DTP software used in India still does not support Unicode).
>> The LDC *might* still have the encoding converters lying around somewhere.

These would be very useful if they can be made available. There is a need for easily converting legacy documents to Unicode. One application for which someone was recently looking for such converters was checking for plagiarism in student projects and theses.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:45 PM, Mike Maxwell <maxw...@umiacs.umd.edu> wrote:

> On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
>
>> Before Unicode, Devanagari fonts used the ASCII range (legacy fonts) -
>> however AFAIK there is no standardization in the mapping, though various
>> families of fonts had similar mappings.
>>
>> See http://hindi-fonts.com/tools for converters from different mappings
>> to Unicode.
>>
>> So, the ASCII to Unicode mapping for Devanagari will change based on the
>> font used.
>>
>
> Indeed! In 2003, DARPA held a "surprise language exercise", the goal of
> which was to produce (very basic) MT etc. tools for Hindi, in a month's
> time. I had been involved in the prep for it to ensure that there would be
> no roadblocks (at the time, I was working at the LDC). One of the things
> that Bill Poser and I verified was that there was a Unicode encoding for
> Hindi/Devanagari. There was, but that was the wrong question.
>
> The right question was whether any Hindi website used Unicode. The answer
> to that was that the BBC and Colgate did, but hardly anyone else. A few
> Indian government sites used ISCII, which wouldn't have been bad, but most
> places used proprietary encodings that went along with a proprietary font.
> Worse, these were not simple code-point-to-character encodings; it was as
> if the Latin letter 'l' had been encoded as 'l', but then 'd' had been
> encoded as 'c' + 'l', 'b' as 'l' + a sort of backwards 'c', 'p' as a
> lowered 'l' + the backwards 'c', etc. It was a mess, and for a while it was
> unclear whether the exercise would fail because most of the data we needed
> was in these weird proprietary encodings. (It eventually succeeded.)
>
> There are some notes here--
>
> http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
> --that Mark Liberman of the LDC made at the time concerning some of the
> issues. Most of it is long out of date (and the links are probably
> broken), and these proprietary encodings have thankfully been replaced by
> Unicode; but if you're dealing with documents from that era, you might
> still run into them. The LDC *might* still have the encoding converters
> lying around somewhere.
> --
> Mike Maxwell
> "My definition of an interesting universe is
> one that has the capacity to study itself."
> --Stephen Eastmond
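
To make the font-dependent, multi-glyph mapping described in the quoted message concrete, here is a minimal Python sketch of a table-driven converter. It is only an illustration of the idea, not any of the LDC or hindi-fonts.com converters: the font names (`toy-font-a`, `toy-font-b`), the `LEGACY_TABLES` dictionary, the `convert` function and every mapping in it are invented, and they follow the Latin-letter analogy from the message rather than a real Devanagari font. Here `)` stands in for the "backwards 'c'" and `j` for the "lowered 'l'".

```python
# Toy tables only: the font names and every mapping below are invented to
# mirror the Latin-letter analogy in the message, not taken from a real font.
LEGACY_TABLES = {
    "toy-font-a": {
        "cl": "d",  # 'c' + 'l' drawn together as 'd'
        "l)": "b",  # 'l' + backwards 'c' (written ')' here)
        "j)": "p",  # lowered 'l' (written 'j' here) + backwards 'c'
        "l": "l",
        "c": "c",
    },
    "toy-font-b": {
        "cl": "b",  # same codes, different letter in another font
        "l": "l",
        "c": "c",
    },
}


def convert(text: str, font: str) -> str:
    """Convert legacy-encoded text using greedy longest-match lookup."""
    table = LEGACY_TABLES[font]
    # Try longer glyph sequences first so "cl" wins over "c" followed by "l".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # Code not in this font's table: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(convert("cll)", "toy-font-a"))
print(convert("cl", "toy-font-b"))
```

The same input codes give different letters depending on which table is selected, which is why a converter has to know the source font. A converter for real Devanagari legacy fonts would probably also need reordering steps (for example, a vowel sign that is typed before the consonant but stored after it in Unicode), which this sketch leaves out.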
-------------------------------------------------- Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex