[unicode] Re: Malay (Latin) characters in Unicode?

2001-03-23 Thread dvdeug

At Fri, 23 Mar 2001 00:13:33 -0800, Rick McGowan <[EMAIL PROTECTED]> wrote:
>David Starner wrote:
>
>> I have a copy of Shellabear's Practical Malay Grammar that I'm preparing
>> to transcribe for Project Gutenberg. Unfortunately, he represents the
>> Malaysian alphabet in a Latin transliteration that includes ng as a
>> single ligatured form, and I don't know how to transcribe in Unicode.
>
>Could you perhaps post or point to a picture of what it looks like? I
>suppose it's an "N" with a loopy tail of some type.

More like rg. A picture is attached. (Was attached. Rick probably has a
copy, but it seems to have got lost between here and the Unicode mailing list.)

>The character you are looking for is probably U+014B in lowercase or
>U+014A in uppercase.  I would be rather surprised if that's not what
>you're looking for.

It's not exactly what I was looking for. I may just use it and make 
a note that the glyph is probably not exactly right.
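
For anyone who wants to check what those two code points are, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the output says nothing about whether the glyph matches the one in the book.

```python
import unicodedata

# U+014B and U+014A are the code points suggested in the quoted reply.
eng_lower = "\u014b"
eng_upper = "\u014a"

for ch in (eng_lower, eng_upper):
    # Print the code point, the character itself, and its Unicode name.
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")

# The two characters form an ordinary upper/lower case pair.
print(eng_lower.upper() == eng_upper)
```

A font that lacks these characters will still show a fallback glyph or a box, so a note about the glyph remains worthwhile.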

>BTW, a bit off topic here but: I think it's high time that Project
>Gutenberg adopted some very clear character encoding guidelines now that
>they're expanding so widely.  Or have they already adopted them and I've
>just missed the policy statement...?  They're in for a real mess if they
>don't specify character encodings in a very controlled way.

At some points, they are already a real mess. You can dig
through Gutenberg archives and find various (unlabeled)
encodings for the Latin-1 coverage. There's at least one
Japanese document that just says "you need a Japanese
OS to read this." 8-bit documents are usually labeled as
8-bit, without any indication of encoding. The Bulgarian files
are clearly labeled Windows-1251, at least.
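
To illustrate why an "8-bit" label alone is not enough, the sketch below decodes the same bytes under two encodings. The sample word is an assumption chosen for illustration and is not taken from any Gutenberg file.

```python
# The same 8-bit bytes read under two different encodings.
# The sample word is hypothetical, chosen only for illustration.
raw = "Български".encode("windows-1251")

as_cp1251 = raw.decode("windows-1251")  # the labeled encoding
as_latin1 = raw.decode("latin-1")       # a wrong guess at the encoding

print(len(raw), "bytes")
print("windows-1251:", as_cp1251)
print("latin-1:     ", as_latin1)
```

Both decodes succeed without error, which is the problem: nothing in the bytes themselves says which reading was intended.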

OTOH, the policy of doing everything possible in ASCII has
saved Gutenberg some problems. They're moving towards
Unicode for any files that can't be released in a standard 
8-bit encoding (and a few that can are double released), 
and a number of new books are being released in both 
ASCII and Unicode editions.

See
ftp://metalab.unc.edu/pub/docs/books/gutenberg/GUTINDEX.02
and GUTINDEX.01 for recent examples. Most of the unmarked
stuff is ASCII, but there's a number of clearly Unicode marked
and "8-bit German" marked files.

-- 
David Starner - [EMAIL PROTECTED]
Free, encrypted, secure Web-based email at www.hushmail.com


[unicode] Re: Malay (Latin) characters in Unicode?

2001-03-23 Thread dvdeug

At Fri, 23 Mar 2001 00:13:33 -0800, Rick McGowan <[EMAIL PROTECTED]> wrote:
>David Starner wrote:
>
>> I have a copy of Shellabear's Practical Malay Grammar that I'm preparing
>> to transcribe for Project Gutenberg. Unfortunately, he represents the
>> Malaysian alphabet in a Latin transliteration that includes ng as a
>> single ligatured form, and I don't know how to transcribe in Unicode.
>
>Could you perhaps post or point to a picture of what it looks like? I
>suppose it's an "N" with a loopy tail of some type.

More like rg. A picture is attached.

>The character you are looking for is probably U+014B in lowercase or
>U+014A in uppercase.  I would be rather surprised if that's not what
>you're looking for.

It's not exactly what I was looking for. I may just use it and make a
note that the glyph is probably not exactly right.

>BTW, a bit off topic here but: I think it's high time that Project
>Gutenberg adopted some very clear character encoding guidelines now that
>they're expanding so widely.  Or have they already adopted them and I've
>just missed the policy statement...?  They're in for a real mess if they
>don't specify character encodings in a very controlled way.

At some points, they are already a real mess. You can dig
through Gutenberg archives and find various (unlabeled)
encodings for the Latin-1 coverage. There's at least one
Japanese document that just says "you need a Japanese
OS to read this." 8-bit documents are usually labeled as
8-bit, without any indication of encoding.

OTOH, the policy of doing everything possible in ASCII has
saved Gutenberg some problems. They're moving towards
Unicode for any files that need it. The Bulgarian files are
clearly labeled windows-1251, which is at least a start.

See
ftp://metalab.unc.edu/pub/docs/books/gutenberg/GUTINDEX.02
and GUTINDEX.01 for recent examples. Most of the unmarked
stuff is ASCII, but there's a number of clearly Unicode marked
and "8-bit German" marked files.

-- 
David Starner - [EMAIL PROTECTED]
 R_T_malay_ng.png


[unicode] Re: removing compromises from unicode ("WCode")

2001-03-23 Thread dvdeug

[Hoping the shubnet doesn't get this one too . . .]

WTF-8 could potentially be as compact or more compact than UTF-8 (for
Greek, Arabic ...), since much of the Latin-1 and Latin Extended A blocks
aren't needed in WCode. If you moved the other characters down to
fill that space, you might win back what you lost to C1 compatibility.
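
For reference, this is the cost being compared against. The sketch measures bytes per character in standard UTF-8 only. WCode and its encoding are hypothetical and are not modeled here, and the sample strings and the helper name are assumptions for illustration.

```python
# Sample strings are hypothetical, chosen only to cover three scripts.
samples = {
    "Latin (ASCII)": "unicode",
    "Greek": "\u03b1\u03b2\u03b3\u03b4",   # alpha beta gamma delta
    "Arabic": "\u0639\u0631\u0628\u064a",  # four Arabic letters
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes used per code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


for label, text in samples.items():
    print(f"{label}: {utf8_bytes_per_char(text):.1f} bytes per character")
```

Greek and Arabic letters sit above U+007F, so UTF-8 spends two bytes on each. Any scheme that moved them into the one-byte range would halve that, at the price of displacing whatever lives there now.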

I've considered writing up my own WCode (just for the heck of it) before.
My big fix would be losing ASCII compatibility(!), which allows us to
remove redundant and ill-defined controls and characters (ASCII
apostrophe! CR-LF!). Move the basic set of controls (LS, PS, ZWJ, etc.)
and the basic set of script-neutral punctuation and characters
(.,:;?!; possibly the Indo-European (Arabic?) digits 0-9) into the
bottom 128, followed by the combining characters and then
the decomposed Latin and so on. Losing ASCII compatibility is
much more radical than you've proposed, though.
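
One reading of "decomposed Latin" is base letters followed by combining marks, with no precomposed accented letters at all. Existing Unicode already has a close analogue in normalization form NFD, so a rough sketch of the idea, under that assumption, looks like this:

```python
import unicodedata

precomposed = "\u00e9"  # e with acute accent, as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)

# The decomposed form is the base letter plus a combining mark.
print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in decomposed])

# Recomposing gives back the original single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)
```

A character set built this way would need only the plain letters and the combining characters, which is where the space saving in the proposal would come from.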

-- 
David Starner - [EMAIL PROTECTED]
Pointless (and temporarily down) webpage: http://dvdeug.dhis.org



[unicode] Malay (Latin) characters in Unicode?

2001-03-23 Thread dvdeug

[Feed another to the shubnet . . .]

I have a copy of Shellabear's Practical Malay Grammar that I'm preparing
to transcribe for Project Gutenberg. Unfortunately, he represents the
Malaysian alphabet in a Latin transliteration that includes ng as a
single ligatured form, and I don't know how to transcribe in Unicode.
Some ideas:

(1) Use a private use character. Not feasible, because it needs to be readable
by the average person, not just someone who has the patience to set up their
computer for this one file.

(2) Use a ZWJ between n and g. If I'm not mistaken, most current systems
will show the ZWJ as a little black box, and there are going to be very
few systems any time soon that would actually display the ng ligature.
Still, a good Unicode system will elide the ZWJ, displaying an acceptable
ng with the real information still in the file.

(3) Petition Unicode for a new character. Right. I'm going to argue
for a character used in two books (that I know of) that bears an
annoying similarity to the ng (non-ligatured) flame wars, and
in the best of cases I'd wait a couple of years for it to be accepted.

(4) Resort to ASCII trickery to distinguish between ng (ligatured) and
ng (non-ligatured). Marking the ng (ligatured) would be ugly; marking
the unligatured would also be ugly, although a lot rarer - I don't know
if Malay (in this transliteration) uses ng (non-ligatured).

(5) Just use ng. A simple, ASCII-only solution. I don't know if it's
information preserving though.
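
Options (2) and (5) are related, since removing the ZWJ from a file prepared under (2) gives the plain text of (5). A minimal sketch of that round trip follows. The helper names and the sample word are assumptions, and it treats every ng as ligatured, which the book may not.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def mark_ligatures(text: str, pair: str = "ng") -> str:
    """Insert a ZWJ inside every occurrence of `pair`.

    Assumes every occurrence is the ligatured form.
    """
    return text.replace(pair, pair[0] + ZWJ + pair[1:])


def strip_zwj(text: str) -> str:
    """Drop all ZWJ characters, leaving plain ASCII ng."""
    return text.replace(ZWJ, "")


word = "orang"  # hypothetical sample word
marked = mark_ligatures(word)

print([f"U+{ord(c):04X}" for c in marked])
print(strip_zwj(marked) == word)
```

Going from (2) to (5) is mechanical, but going back is not: once the ZWJ is stripped, nothing records which ng was ligatured.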

Any suggestions?

-- 
David Starner - [EMAIL PROTECTED]
Gutenberg stuff - http://dvdeug.dhis.org/guten/ (down for the week)



Re: [OT] Close to latin

2001-01-02 Thread dvdeug

At Tue, 2 Jan 2001 09:43:18 -0800 (GMT-0800), Antoine Leca <[EMAIL PROTECTED]> wrote:
>- a living language, as opposed to a dead one, should evolve (this is
>  exactly the problem French is currently having, by the way); trying
>  to stick with a past reference is going exactly backwards; Esperanto
>  showed us that a fossilized language cannot aim at being lingua franca

I don't see why Esperanto is a 'fossilized language'. KDE has been almost
completely translated into it, showing that Esperanto can handle computer
terminology. From what I've seen, Esperanto picks up new terminology whenever
it's needed.  It has evolved as needed by the community.

I don't think linguistic causes can be blamed for Esperanto's failure; the
sociological causes are much more apparent.

--
David Starner - [EMAIL PROTECTED] ([EMAIL PROTECTED] off vacation)




Re: [langue-fr] L'anglais est-il une langue universelle ?

2000-12-20 Thread dvdeug

At Wed, 20 Dec 2000 13:08:52 -0800 (GMT-0800), Alain LaBonté  <[EMAIL PROTECTED]> wrote:
>[Alain]  I had no intent of asking anything, but since you provoke me, I
>found something with which I wholeheartedly agree:
>>International forums and discussion groups should welcome contributions in all
>>languages if their participants were really seeking the best and most
>>interesting contributions. [...] If people want the best
>>from the Internet, they have to invite back the best by first realizing that
>>original thoughts automatically entail the use of original modes of
>>expression.

So, on one paw, most people are incapable of learning another language, but
on the other, forums should be in many languages, so people have to know
a dozen languages to understand them. Hmm.

The use of a forum is limited to its participants' ability to understand
the messages on that forum, including the language.  A forum that mixes
English, Russian, Spanish, French, Hebrew, Greek and Chinese in equal proportion
will be of little use to many people; the signal to noise ratio will be
about 1/6 (for someone who reads one of the seven languages) or 2/5 (for
someone who reads two) for most people. So 7 different forums will appear with
s/n ratios approaching 1, and anyone wanting to communicate in multiple
languages can subscribe to multiple forums.

-- 
David Starner - [EMAIL PROTECTED] ([EMAIL PROTECTED] off vacation)


Re: Unicode DIFF tool?

2000-08-17 Thread dvdeug

At Thu, 17 Aug 2000 10:50:18 -0800 (GMT-0800), Mikko Lahti <[EMAIL PROTECTED]> wrote:
>Are there any DIFF tools out there that do Unicode?

If you use UTF-8 with Unix line ending semantics (LF - though CRLF (DOS/Windows)
would probably work), Unix diff will work. Since it has no knowledge of
Unicode, anything that depends on character counts (like --side-by-side)
won't work right, but UTF-8 is designed to preserve all the line ending
and ASCII semantics diff depends on.
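
A small sketch of both halves of that claim. It does not run diff itself; it only shows that splitting on the LF byte is safe for UTF-8 and that byte counts and character counts diverge. The sample text is an assumption for illustration.

```python
# Hypothetical sample: accented Latin, Greek, and plain ASCII lines.
text = "na\u00efve caf\u00e9\n\u03b1\u03b2\u03b3\nplain ascii\n"
data = text.encode("utf-8")

# Splitting the raw bytes on LF gives the same lines as splitting the text,
# because every byte of a multi-byte UTF-8 sequence is 0x80 or higher and
# so can never be mistaken for LF (0x0A) or any other ASCII byte.
byte_lines = data.split(b"\n")
text_lines = text.split("\n")
same_lines = [b.decode("utf-8") for b in byte_lines] == text_lines
print("line boundaries agree:", same_lines)

# Byte counts and character counts differ on the non-ASCII lines, which is
# why column-based output from a byte-oriented tool comes out misaligned.
for raw_line, line in zip(byte_lines, text_lines):
    print(len(raw_line), "bytes,", len(line), "characters:", line)
```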

-- 
David Starner - [EMAIL PROTECTED]


SCSU Error?

2000-08-03 Thread dvdeug

I was having some problems with a test of my SCSU decoder recently, and
I discovered it was due to my decoder rejecting 10FFFF as a valid Unicode
value (because it ends in FFFF). The fourth test pattern in Section 9.4 of
Tech Report 6 (SCSU) uses DBFF DFFF as a surrogate pair, which is 10FFFF.
Is this wrong, or is there something I'm overlooking?
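
The arithmetic behind that surrogate pair can be checked with a short sketch. The function names are assumptions and this is not the decoder from the post. The noncharacter test follows the current Unicode definition, which may be stricter or looser than what a given decoder chooses to reject.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_noncharacter(cp: int) -> bool:
    """True for code points ending in FFFE or FFFF, and for U+FDD0..U+FDEF."""
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF


cp = combine_surrogates(0xDBFF, 0xDFFF)
print(f"U+{cp:X}", "noncharacter" if is_noncharacter(cp) else "ordinary")
```

DBFF DFFF is the highest possible pair, so it lands on the last code point of the last plane.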

-- 
David Starner - normally [EMAIL PROTECTED]