[unicode] Re: Malay (Latin) characters in Unicode?
At Fri, 23 Mar 2001 00:13:33 -0800, Rick McGowan <[EMAIL PROTECTED]> wrote:
> David Starner wrote:
>> I have a copy of Shellabear's Practical Malay Grammar that I'm preparing
>> to transcribe for Project Gutenberg. Unfortunately, he represents the
>> Malaysian alphabet in a Latin transliteration that includes ng as a
>> single ligatured form, and I don't know how to transcribe it in Unicode.
>
> Could you perhaps post or point to a picture of what it looks like? I
> suppose it's an "N" with a loopy tail of some type.

More like rg. A picture is attached. (Was attached. Rick probably has a copy, but it seems to have got lost between here and the Unicode mailing list.)

> The character you are looking for is probably U+014B in lowercase or
> U+014A in uppercase. I would be rather surprised if that's not what
> you're looking for.

It's not exactly what I was looking for. I may just use it and make a note that the glyph is probably not exactly right.

> BTW, a bit off topic here, but: I think it's high time that Project
> Gutenberg adopted some very clear character encoding guidelines now that
> they're expanding so widely. Or have they already adopted them and I've
> just missed the policy statement...? They're in for a real mess if they
> don't specify character encodings in a very controlled way.

At some points, they are already a real mess. You can dig through the Gutenberg archives and find various (unlabeled) encodings for the Latin-1 coverage. There's at least one Japanese document that just says "you need a Japanese OS to read this." 8-bit documents are usually labeled as 8-bit, without any indication of encoding. The Bulgarian files are clearly labeled Windows-1251, at least. On the other hand, the policy of doing everything possible in ASCII has saved Gutenberg some problems.
They're moving towards Unicode for any files that can't be released in a standard 8-bit encoding (and a few that can are released in both), and a number of new books are being released in both ASCII and Unicode editions. See ftp://metalab.unc.edu/pub/docs/books/gutenberg/GUTINDEX.02 and GUTINDEX.01 for recent examples. Most of the unmarked material is ASCII, but there are a number of files clearly marked as Unicode or "8-bit German".

--
David Starner - [EMAIL PROTECTED]
Free, encrypted, secure Web-based email at www.hushmail.com
[unicode] Re: removing compromises from unicode ("WCode")
[Hoping the shubnet doesn't get this one too . . .]

WTF-8 could potentially be as compact as, or more compact than, UTF-8 (for Greek, Arabic, ...), since much of the Latin-1 and Latin Extended-A blocks aren't needed in WCode. If you moved the other characters down to fill that space, you might win back what you lost to C1 compatibility.

I've considered writing up my own WCode (just for the heck of it) before. My big fix would be losing ASCII compatibility(!), which allows us to remove redundant and ill-defined controls and characters (ASCII apostrophe! CR-LF!). Move the basic set of controls (LS, PS, ZWJ, etc.) and the basic set of script-neutral punctuation and characters (. , : ; ? !; possibly the Indo-European (Arabic?) digits 0-9) into the bottom 128, followed by the combining characters and then the decomposed Latin and so on. Losing ASCII compatibility is much more radical than what you've proposed, though.

--
David Starner - [EMAIL PROTECTED]
Pointless (and temporarily down) webpage: http://dvdeug.dhis.org
Free, encrypted, secure Web-based email at www.hushmail.com
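The byte-count argument above can be illustrated with a small sketch (WCode itself was never specified, so this only shows the UTF-8 side of the comparison): UTF-8 spends two bytes on every Greek letter because Greek sits at U+0370-U+03FF, while any script relocated into the one-byte range would cost one byte per letter.

```python
# Compare UTF-8 byte costs for Greek text vs. its ASCII transliteration.
# Every Greek letter lies in U+0370..U+03FF, so UTF-8 needs 2 bytes each;
# an encoding that moved a small script below 0x80 would halve that.

greek = "καλημέρα"   # "good morning" - 8 Greek letters
latin = "kalimera"   # ASCII transliteration, also 8 characters

print(len(greek.encode("utf-8")))  # 16 bytes: 2 per letter
print(len(latin.encode("utf-8")))  # 8 bytes: 1 per letter
```

The same arithmetic applies to Arabic, Cyrillic, and Hebrew, which is why reclaiming the Latin-1/Latin Extended-A space matters for compactness.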
[unicode] Malay (Latin) characters in Unicode?
[Feed another to the shubnet . . .]

I have a copy of Shellabear's Practical Malay Grammar that I'm preparing to transcribe for Project Gutenberg. Unfortunately, he represents the Malaysian alphabet in a Latin transliteration that includes ng as a single ligatured form, and I don't know how to transcribe it in Unicode. Some ideas:

(1) Use a private use character. Not feasible, because it needs to be readable by the average person, not just someone who has the patience to set up their computer for this one file.

(2) Use a ZWJ between n and g. If I'm not mistaken, most current systems will show the ZWJ as a little black box, and very few systems any time soon will actually display the ng ligature. Still, a good Unicode system will elide the ZWJ, displaying the acceptable ng with the real information still in the file.

(3) Petition Unicode for a new character. Right. I'm going to argue for a character used in two books (that I know of) that bears an annoying similarity to ng (non-ligatured), through the flame wars, so that in the best of cases I wait a couple of years for it to be accepted.

(4) Resort to ASCII trickery to distinguish between ng (ligatured) and ng (non-ligatured). Marking the ligatured ng would be ugly; marking the unligatured ng would also be ugly, although a lot rarer - I don't know if Malay (in this transliteration) uses ng (non-ligatured) at all.

(5) Just use ng. A simple, pure-ASCII solution. I don't know if it's information-preserving, though.

Any suggestions?

--
David Starner - [EMAIL PROTECTED]
Gutenberg stuff - http://dvdeug.dhis.org/guten/ (down for the week)
Free, encrypted, secure Web-based email at www.hushmail.com
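For reference, the candidate transcriptions can be written out as Unicode strings (this only enumerates the code points involved; which option is right is exactly the question of the post):

```python
# Candidate transcriptions for the ligatured "ng" discussed above.
# U+200D is ZERO WIDTH JOINER; U+014B is LATIN SMALL LETTER ENG,
# the character suggested later in the thread.

ng_with_zwj = "n\u200dg"   # option (2): n + ZWJ + g; 3 code points
ng_as_eng = "\u014b"       # the suggested ŋ; a different glyph, noted as such
ng_plain = "ng"            # option (5): plain ASCII, loses the distinction

print(len(ng_with_zwj))    # 3
print(ng_as_eng)           # ŋ
```

A conforming renderer elides the ZWJ, so option (2) degrades to plain "ng" on screen while keeping the distinction in the file, which matches the reasoning in the post.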
Re: [OT] Close to latin
At Tue, 2 Jan 2001 09:43:18 -0800 (GMT-0800), Antoine Leca <[EMAIL PROTECTED]> wrote:
> - a living language, as opposed to a dead one, should evolve (this is
>   exactly the problem French is currently having, by the way); trying
>   to stick with a past reference is going exactly backwards; Esperanto
>   showed us that a fossilized language cannot aim at being a lingua franca

I don't see why Esperanto is a 'fossilized language'. KDE has been almost completely translated into it, showing that Esperanto can handle computer terminology. From what I've seen, Esperanto picks up new terminology whenever it's needed. It has evolved as the community needed. I don't think linguistic causes can be blamed for Esperanto's failure; the sociological causes are much more apparent.

--
David Starner - [EMAIL PROTECTED] ([EMAIL PROTECTED] off vacation)
Re: [langue-fr] L'anglais est-il une langue universelle ?
At Wed, 20 Dec 2000 13:08:52 -0800 (GMT-0800), Alain LaBonté <[EMAIL PROTECTED]> wrote:
> [Alain] I had no intent of asking anything, but since you provoke me, I
> found something with which I wholeheartedly agree:
>> International forums and discussion groups should welcome contributions
>> in all languages if their participants were really seeking the best and
>> most interesting contributions. [...] If people want the best from the
>> Internet, they have to invite back the best by first realizing that
>> original thoughts automatically entail the use of original modes of
>> expression.

So, on one paw, most people are incapable of learning another language, but on the other, forums should be in many languages, so people have to know a dozen languages to understand them. Hmm.

The use of a forum is limited by its participants' ability to understand the messages on it, including the language. A forum that mixes English, Russian, Spanish, French, Hebrew, Greek and Chinese in equal proportion will be of little use to many people; the signal-to-noise ratio will be at best 1/6 or 2/5 for most people. So seven different forums will appear, each with a signal-to-noise ratio approaching 1, and anyone wanting to communicate in multiple languages can subscribe to multiple forums.

--
David Starner - [EMAIL PROTECTED] ([EMAIL PROTECTED] off vacation)
Re: Unicode DIFF tool?
At Thu, 17 Aug 2000 10:50:18 -0800 (GMT-0800), Mikko Lahti <[EMAIL PROTECTED]> wrote:
> Are there any DIFF tools out there that do Unicode?

If you use UTF-8 with Unix line-ending semantics (LF - though CRLF (DOS/Windows) would probably work), Unix diff will work. Since it has no knowledge of Unicode, anything that involves character counts (like --side-by-side) won't work right, but UTF-8 is designed to preserve all the line-ending and ASCII semantics that diff depends on.

--
David Starner - [EMAIL PROTECTED]
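The reason this works can be demonstrated directly: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as LF (0x0A) can never occur inside the encoding of a non-ASCII character, and splitting the raw bytes on newlines is always safe.

```python
# Why plain diff works on UTF-8: bytes >= 0x80 make up every multi-byte
# sequence, so ASCII bytes (including LF) only ever encode themselves.

text = "naïve\nKaté\n日本語\n"
data = text.encode("utf-8")

# Every byte that is part of a multi-byte sequence is >= 0x80:
assert all(b >= 0x80 or chr(b) in text for b in data)

# Splitting the raw bytes on LF gives the same lines as splitting the text:
assert [l.decode("utf-8") for l in data.split(b"\n")] == text.split("\n")
print("UTF-8 preserves LF and ASCII semantics")
```

This is exactly the property diff relies on; what breaks is only column arithmetic, since diff counts bytes rather than characters.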
SCSU Error?
I was having some problems with a test of my SCSU decoder recently, and I discovered they were due to my decoder rejecting 10FFFF as a valid Unicode value (because it ends in FFFF). The fourth test pattern, in Section 9.4 of Technical Report 6 (SCSU), uses DBFF DFFF as a surrogate pair, which is 10FFFF. Is this wrong, or is there something I'm overlooking?

--
David Starner - normally [EMAIL PROTECTED]
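The arithmetic behind the question is the standard surrogate-pair decoding, which is easy to check:

```python
# Decode a UTF-16 surrogate pair into a code point. High surrogate DBFF
# plus low surrogate DFFF yields U+10FFFF - the last code point in
# Unicode, and a noncharacter (it ends in FFFF), which is why a decoder
# that rejects noncharacters trips over this test pattern.

def decode_surrogate_pair(high: int, low: int) -> int:
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```

So the test pattern does encode U+10FFFF; the open question in the post is whether a conformant SCSU decoder may reject it as a noncharacter.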