-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf
Of Lars Kristan
Subject: RE: Roundtripping Solved
However, requirements 1 and 2 are actually taken from Unicode standard, they
are not my requirements.
How's that? Well, they are my requirements also, but instead
This probably doesn't make any difference, Peter, but just so we're talking
the same language as each other, I had actually defined f() to return a
stream of Unicode characters, not a stream of UTF-16 code units, so I would
have written this as:
UTF-16(f(s8)) = UTF-16(utf_8_decode(s8))
which si
Arcane Jill wrote:
#for all possible octet sequences s:
#length of (UTF-8(f(s)) <= length of s,
No, that is not the requirement. It is:
bytelength(f(s)) <= 2*bytelength(s)
You haven't understood. By definition, s is an octet stream, and f(s) is a
Unicode character s
the business of the UTC.
Hope I haven't misunderstood things completely. That would be /so/
embarrassing!
Jill
-Original Message-
From: Peter Kirk [mailto:[EMAIL PROTECTED]
Sent: 16 December 2004 12:09
To: Lars Kristan
Cc: Arcane Jill; Unicode
Subject: Re: Roundtripping Solved
The on
-Original Message-
From: Lars Kristan [mailto:[EMAIL PROTECTED]
As for your solution, I didn't really analyze it. But it is escaping, isn't
it?
Yes
With a lot of overhead.
If you call string length "overhead", yes. This was to provide reasonable
assurance that an escape sequence won't be
[mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 16:28
To: Unicode Mailing List
Cc: Arcane Jill
Subject: Re: Roundtripping Solved
Of course, Jill's scheme uses non-private-use Unicode scalar values to
achieve what is essentially a private-use function, so this is still
non-conformant. (A simi
incorrect identification down
astronomically low.
Jill
-Original Message-
From: Peter Kirk [mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 12:54
To: Arcane Jill
Cc: Unicode
Subject: Re: Roundtripping Solved
But would it not work just as
well for Lars' purposes to use, instead of y
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy
Sent: 14 December 2004 22:47
To: Marcin 'Qrczak' Kowalczyk
Cc: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
"Ar
I followed (and understood) Lars' explanation as to why the NOT-
solution wouldn't work for him. Shame really - but here's another bash at a
solution, again without breaking the Unicode model. If I have understood
this correctly, these are Lars' requirements:
1) There exists a function, f()
I've been following this thread for a while, and I've pretty much got the
hang of the issues here. To summarize:
Unix filenames consist of an arbitrary sequence of octets, excluding 0x00
and 0x2F. How they are /displayed/ to any given user depends on that user's
locale setting. In this scenario
If I have understood this correctly, filenames are not "in" a locale, they
are absolute. Users, on the other hand, are "in" a locale, and users view
filenames. The same filename can "look" different to two different users. To
user A (whose locale is Latin-1), a filename might look valid; to user
I like that. Makes total sense. Thanks.
Jill
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 10 December 2004 17:38
To: Unicode
Subject: Re: When to validate?
As a result, your strings are likely to be some structures.
Then, it is pretty easy
have to do the validation somewhere
else - for example something like
t = tolower(trim(validate(s))).
where validate(s) does nothing but throw an exception if s is invalid.
Other people must have had to make decisions like this. What's the preferred
strategy?
Arcane Jill
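A minimal sketch of the "validate once, at the outer edge" arrangement described above. It assumes that the strings are UTF-8 byte sequences and that "invalid" means ill-formed UTF-8; the bodies of validate(), trim() and tolower() are illustrative stand-ins, not the functions from the original discussion.

```python
def validate(s: bytes) -> bytes:
    """Return s unchanged, or raise UnicodeDecodeError if s is not well-formed UTF-8."""
    s.decode("utf-8")  # strict decoding: raises on any ill-formed sequence
    return s


def trim(s: bytes) -> bytes:
    """Strip leading and trailing ASCII whitespace; assumes s was already validated."""
    return s.strip()


def tolower(s: bytes) -> bytes:
    """Lowercase the text; assumes s was already validated."""
    return s.decode("utf-8").lower().encode("utf-8")


# Validation happens exactly once, before any other processing.
t = tolower(trim(validate(b"  CAF\xc3\x89  ")))
print(t.decode("utf-8"))
```

With this arrangement trim() and tolower() never need their own error paths, at the cost of requiring every caller to remember the validate() step.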
- Original Message -
From: "Arcane Jill" <[EMAIL PROTECTED]>
To: "Unicode" <[EMAIL PROTECTED]>
Sent: Friday, December 10, 2004 7:17 AM
Subject: RE: US-ASCII (was: Re: Invalid UTF-8 sequences)
Yes, of course it was a joke. Rest assured, if I perceive any k
next time. :-)
Oh, and thanks for the interesting historical character set info.
Jill
-Original Message-
From: Doug Ewell [mailto:[EMAIL PROTECTED]
Sent: 09 December 2004 16:28
To: Unicode Mailing List
Cc: Arcane Jill
Subject: US-ASCII (was: Re: Invalid UTF-8 sequences)
I hope that's j
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 09 December 2004 11:29
To: Unicode Mailing List
Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Windows filesystems do know what encoding they use.
Err, not really. MS-DOS *need to k
is simply that
number expressed in binary. But now I'm getting /very/ silly - please don't
take any of this seriously.) :-)
The "UTF-24" thing seems a reasonably sensible question though. Is it just
that we don't like it because some processors have alignment restrictions
Oh for a chip with 21-bit wide registers!
:-)
Jill
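For illustration only: "UTF-24" is not a defined Unicode encoding form, so the packing below (three big-endian bytes per code point) and the names utf24_encode/utf24_decode are assumptions made for this sketch. It shows why 21 bits are enough and what a fixed three-byte layout would look like.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

# Number of bits needed to hold any code point.
BITS_NEEDED = MAX_CODE_POINT.bit_length()


def utf24_encode(text: str) -> bytes:
    """Hypothetical packing: each code point as three big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def utf24_decode(data: bytes) -> str:
    """Inverse of utf24_encode; rejects input that is not a multiple of 3 bytes."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big")) for i in range(0, len(data), 3)
    )


print(BITS_NEEDED)
print(utf24_encode("A").hex())
```

Every character occupies three bytes here, so indexing is constant-time, but three-byte units do not line up with the 2- or 4-byte alignment many processors prefer.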
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 02 December 2004 12:12
To: Unicode Mailing List
Subject: Re: Nicest UTF
There are other factors that might influence your choice.
For example,
OK, I was wrong about the ZX80 character set. Seems I was actually
thinking about the ZX Spectrum. Ahem. Its character set is listed
here:
http://www.madhippy.com/8-bit/sinclair/zxspecman/zxmanappa.html
Note the distinction between character 0x20 and character 0x80.
Arcane Jill
indistinguishable from space, but was NOT space.
Of course ZX80 characters did not, in general, have
properties, but line breaking algorithms looked for character
0x00, not character 0x80, and so graphic-space behaved like a
non-space, not like a space.
Arcane Jill
> -Original Mess
de is another matter
entirely, but it sounds good to me, so I'll raise it for discussion.
Philippe's idea does have precedent.
Arcane Jill
> -Original Message-
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 01, 2004 10:52 AM
> Subject: Re: Fixed
> -Original Message-
> From: Asmus Freytag [mailto:[EMAIL PROTECTED]
> Sent: Sunday, March 28, 2004 10:56 PM
> Subject: Re: [OT] proscribed words... (was:What is the principle?)
>
>
> being more used to the European practice of
> banning certain ideas.
Eh? Cou
Hi,
Ignoring all compatibility characters; ignoring everything that has gone
before; and considering only present and future characters (that is,
characters currently under consideration for inclusion in Unicode, and
characters which will be under consideration in the future), which of
the fol
UST be hateful or violent,
and also no reason why adherents of such a philosophy should not be able
to organise sufficiently to agree on standardizing the use of a
character. It seems to me that quips such as those below are
detrimental, and irrelevant to issues of character encoding.
Arcane
erical order on a first-come first-served basis.
Maybe someone could assuage my curiosity?
Arcane Jill
But if you lowercased that, surely you'd get .
How should that be rendered?
> -Original Message-
> From: Kent Karlsson [mailto:[EMAIL PROTECTED]
>
> A dotted capital J can already be encoded as .
> Hence, a separate precomposed such character will not be added.
>
> /kent k
>
> > Well, i
> Nope, sorry. Not American -- Minbari.
>
> For more info on the Minbari, please see:
> http://www.sadgeezer.com/babylon5/minbari.htm
>
> Best regards,
>
> James Kass
>
Good point. I was actually referring to the writers, not the character,
but you could certainly argue that the writers
-Original Message-
From: Hohberger, Clive [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 04, 2004 11:08 PM
To: Mike Ayers; 'John Burger'; [EMAIL PROTECTED]
Subject: RE: Phonology [was: interesting SIL-document]
Mike,
Actually
"be-f***king-hind" is a B
I would be very surprised if it did, since Java chars are still only
sixteen bits wide, and the new math alphanumerics are not in BMP.
Still, I'd be very happy to be proved wrong on this one.
Actually, I'd quite like to use these as variable names in other
languages too, like in C++ for examp
---Original Message-
> From: Marco Cimarosti [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 18, 2003 5:44 PM
> To: 'Arcane Jill'; [EMAIL PROTECTED]
> Subject: RE: [OT] Keyboards (was: American English translation of
> character names)
>
>
> Arcane J
From: Doug Ewell [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 18, 2003 4:28 PM
> To: Unicode Mailing List
> Cc: Arcane Jill
> Subject: Re: [OT] Keyboards (was: American English translation of
> character names)
>
>
> On U.S. keyboards, there is no letter key to the left
> -Original Message-
> From: Eric Scace [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 18, 2003 3:57 PM
> To: John Cowan; Arcane Jill
> Cc: [EMAIL PROTECTED]
> Subject: RE: American English translation of character names
>
>
> The logical "no
else from this part of the world care to
confirm this? Or perhaps explain why?).
Jill
> -Original Message-
> From: John Cowan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 18, 2003 2:31 PM
> To: Arcane Jill
> Cc: [EMAIL PROTECTED]
> Subject: Re: American
Thanks, that's interesting. It may well be the case that printers,
typesetters, etc., are the only people who actually need these
things to have names, so I guess their names should be respected. The
rest of us just seem to get by without them, somehow. For example,
U+00AC (NOT SIGN) is someth
> From: Christopher John Fynn [mailto:[EMAIL PROTECTED]]
> There is plenty of disagreement about what the "proper" name for many
> characters should be
Or, indeed, why the "proper" name for a character must be in English,
and spellable in ASCII, instead of, say, Japanese.
> From: Kenneth Whi
> Would it not make more sense to have not two, but three
different kinds of lowercase i: , and ?. (And similarly for uppercase). Of
course, then you might as well invent COMBINING SOFT DOT ABOVE so we
can use it elsewhere.
I should have mentioned that in this hypothetical scheme, the fol
Far be it from me to stir things up even further, but...
QUESTION - Is the rendering of {U+0069} {U+0302} (that's î) locale-dependent?
I may have got this totally wrong, but it occurs to me that in
non-Turkic fonts, U+0069 is "soft-dotted". That is, the dot disappears
in the presence of any C
There was talk recently on this list of mapping grapheme clusters to
the PUA (for application internal use only, obviously, not for export
to the real world). I actually did this recently, though my app ended
up in an incomplete state since I got bored and moved onto something
else. The algorit
> Do we have Unicode DNS yet?
Yup. You can put Chinese letters in domain names now. You do it like
this:
(1) Convert to NFC
(2) Encode in UTF-8
(3) Replace all reserved characters (space, %, etc.) with the three
character string "%hh" (where hh is hex for the substituted character)
(4) Now
Speaking as a Brit, I would like to know the answer to this one too.
What's the problem with answering online?
And if you're really not going to answer this online, you could
have just emailed Peter privately, instead of telling the whole list
that you're going to keep the answer secret from a
This occurred to me even before I read Philippe's email.
Since {U+0069} is not canonically equivalent to
{U+0131}{U+0307}, I don't see anything to stop me from registering the
domain name "un{U+0131}{U+0307}code.org", for example. It is in
NFC, after all.
Jill
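The normalization claim above can be checked with Python's standard unicodedata module. This is only a sketch of the normalization step; it says nothing about what an actual domain registry or IDNA processing would accept, and the helper name is_nfc is an assumption made here.

```python
import unicodedata

DOTTED_I = "\u0069"               # LATIN SMALL LETTER I
DOTLESS_PLUS_DOT = "\u0131\u0307"  # LATIN SMALL LETTER DOTLESS I + COMBINING DOT ABOVE


def is_nfc(s: str) -> bool:
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s


spoof = "un" + DOTLESS_PLUS_DOT + "code.org"

# The two-code-point sequence is already in NFC, and it does not
# normalize to plain U+0069 under NFC or NFD.
print(is_nfc(spoof))
print(unicodedata.normalize("NFC", DOTLESS_PLUS_DOT) == DOTTED_I)
print(unicodedata.normalize("NFD", DOTTED_I) == DOTLESS_PLUS_DOT)
print(spoof == "unicode.org")
```

Because neither sequence maps to the other, a comparison that relies on NFC alone treats the two names as distinct even though they can render identically.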
-Original Message-
Yes, I know - same as dotted a, b, c, d, e, f, g and so on are
distinct from dotless a, b, c, d, e, f, g and so on. I just meant that
U+0069 could have been considered dotless - with dotted i being
somewhere else. This wouldn't necessarily stop font designers for
Western markets from putting a
Not wishing to bring the conversation down too low-brow, ABBA
often spelt their name with the first B reversed.
Jill (in a silly mood --- and I sure am glad that this thread is marked
OT).
> -Original Message-
> From: Mark E. Shoulson [mailto:[EMAIL PROTECTED]]
> Sent: Monday, Decembe
I sometimes wonder whether or not it was a wise choice to regard "LATIN
SMALL LETTER I" and "LATIN SMALL LETTER DOTLESS I" as distinct. Too
late to change it now, of course, but (with the benefit of hindsight)
it occurs to me that if U+0069 had been regarded as dotless, all these
problems woul
And what, I find myself wondering, does "nearly infinite" mean? Could
you perhaps give us an example of a number which is both finite and
"nearly infinite" ? ;-)
Jill (just havin a larf)
-Original Message-
From: Philippe Verdy [mailto:[EMAIL PROTECTED]
Sent: Friday, December 12,
n text
would be a reasonable feature for a text editor to offer. (Even in XML
documents, it would only affect one character, if I've
understood this thread correctly).
Jill
> -Original Message-
> From: Marco Cimarosti [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, December 09, 2
Hmm. Now here's some C++ source code (syntax colored as Philippe
suggests, to imply that the text editor understands C++ at least well
enough to color it)
int n = wcslen(L"café");
(That's int n = wcslen(L"café"); for those without HTML email)
The L prefix on a string literal makes it a wide
Okay, I've read enough. I've got the message.
Microsoft's view = make the customer pay through the nose for everything
you can possibly get away with
Linux view = you can have whatever you want for free, but you have to be
techy enough to understand it in some detail and/or write it yourself
Appl
This should really be in a FAQ somewhere on the Unicode web site,
methinks. One thing - the fonts print spectacularly well, but don't seem
to display well on the screen (at least, not in Microsoft Word). Any
idea why that might be?
Jill
-Original Message-
From: Philippe Verdy [mail
Sigh. What it is to be constantly misunderstood.
In an earlier email on this thread, Peter Constable said "So, out of
the box, Windows XP does not support (e.g.) Sinhalese, or ship with
Sinhalese fonts. And so, if the next version of Windows does include
support for Sinhalese and perhaps even
Actually, a number of points have been made in the course of this
thread. Of course it is true that Apple's Last Resort font doesn't
display every character with an approximation of its shape, I
acknowledge that. I still think it's a lot better than nothing though.
But - to clarify my expectat
You misunderstand me. Whilst I have no objection to paying for ADDED
value, I'm talking about what comes built in, out of the box.
Consider the literary equivalent. Suppose I went to a library and
borrowed a book, took it home, and attempted to read it (the real world
equivalent of viewing a
TED]
> Sent: Tuesday, December 02, 2003 1:50 PM
> To: Arcane Jill
> Cc: [EMAIL PROTECTED]
> Subject: Re: Fonts on Web Pages
>
> Well, note that that technology works with Netscape 4.x and
> nothing else:
> no IE, no Mozilla/Netscape 6/Netscape 7, no Opera. Overall, I t
]
Sent: Tuesday, December 02, 2003 12:51 PM
To: Arcane Jill
Cc: [EMAIL PROTECTED]
Subject: Re: Fonts on Web Pages
Of course Adobe was designed to do just the
problem you defined,
e W3C or some other bunch.
Jill
-Original Message-
From: Raymond Mercier [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 02, 2003 11:29 AM
To: Arcane Jill
Cc: [EMAIL PROTECTED]
Subject: Re: Fonts on Web Pages
Surely Adobe Acrobat will solve both problems?
The recipient only needs to
Anyone know the current status on embedded fonts
in web pages?
I basically have two questions. (1) Assume the existence of a font to
which I legally own the copyright. For example, let's say I invented
it. Now, I design a web page which uses this font. Now, it's easy (but terribly
inconvenient
Damn right. I would like to know this too. In particular, I want all
the math characters working, and all the musical symbols working. Note
that many of these are not in the BMP. I want to be able to put these
characters on web pages, and know that they will be displayed correctly
on my own ch
Forgive my ignorance.
What is ICU?
(I like to know what something is before I download it).
Jill
> -Original Message-
> From: Markus Scherer [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 01, 2003 10:36 PM
> To: unicode
> Subject: no more precomposed characters for 1:1 conversion
>
>
>
argue
that the default case mappings should be the ones used everywhere.
Jill
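As a rough illustration of what the default (locale-independent) case mappings do with the "assa" / "aßa" pair from this thread, here is a sketch using Python's built-in string methods, which follow the Unicode default mappings. How any particular filesystem compares names is not shown here.

```python
import unicodedata

a = "assa"
b = "a\u00dfa"  # "aßa", with U+00DF LATIN SMALL LETTER SHARP S

# Normalization does not relate the two: sharp s has no decomposition.
same_after_nfc = unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
same_after_nfkc = unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# Simple lowercasing leaves them distinct as well.
same_after_lower = a.lower() == b.lower()

# Full uppercasing and full case folding both expand sharp s to two letters.
same_after_upper = a.upper() == b.upper()
same_after_casefold = a.casefold() == b.casefold()

print(same_after_nfc, same_after_nfkc, same_after_lower)
print(same_after_upper, same_after_casefold)
```

So whether the two names collide depends on which mapping a system applies: a simple one-to-one lowercase or uppercase table keeps them apart, while full case folding makes them equal.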
> -Original Message-
> From: Mark E. Shoulson [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 01, 2003 1:58 PM
> To: Arcane Jill
> Cc: [EMAIL PROTECTED]
> Subject: Re: MS Windows and Unicode 4.0 ?
>
>
> Shouldn't it permit "assa" and "aßa" to co-exist? It isn't like ß is
> canonically equivalent to ss
No probs, Doug. I was actually ill over the weekend, and I think I was
probably way too sensitive on Friday when it was coming on. I guess I
didn't really notice at the time and blamed everyone else for having a
go at me when I should have been blaming a bunch of nasty microbes for
making me fe
Indeed.
The current Windows OS still stores filenames as strings of sixteen-bit
wide words (not codepoints; not characters). It allows filenames "assa"
and "aßa" to coexist in the same folder, despite its claim to being
case-insensitive, and I have even managed to create filenames containing
un
Of course, one really important point is that Unicode text should
remain stateless. It would be foolish indeed if, starting from an
arbitrary point in the string, one had to parse backwards and forwards
to see if there were any invisible brackets. In the extreme, one would
have to scan the ent
You are getting personal and indulging in ad hominem. I consider this
out of order. Yes I have read TUS Section 2.2, and indeed the whole of
the rest of the book - and understood it, too, so you can stop wondering
that right now.
Unicode design principles do not change the fact that there are
" property) is the one which remains unanswered.
Thanks again.
Jill
> -Original Message-
> From: Jim Allan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, November 27, 2003 6:56 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Decimal digit property - What's it for?
>
>
>
ompletely out of context, then I'd feel a lot happier.
Of course I know what "decimal" means in everyday language. Do
you think I'm an idiot? Please stop treating me as one.
Jill
> -Original Message-
> From: Doug Ewell [mailto:[EMAIL PROTECTED]]
> Sent: Thursday
Hi,
It has been explained to me that the "decimal digit" property has the
following meaning: "Decimal numbers are those used in decimal-radix
number systems. In particular, the sequence of the ONE character
followed by the TWO character is interpreted as having the value of twelve".
What's th
MAIL PROTECTED]]
Sent: Thursday, November 27, 2003 1:01 PM
To: Arcane Jill
Cc: [EMAIL PROTECTED]
Subject: RE: numeric properties of Nl characters in the UCD
Arcane Jill writes:
> Gotcha. It's all starting to make sense now. Including the
> opposition to hex.
>
> Maybe one coul
ts in any radix;
"number integer" for integer types such as circled 2 which can't be
used positionally; "number fraction" for fractions, and "number other"
for everything else. Or maybe some other similar scheme. Is it too late
to change things now?
Jill
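For reference, the three existing numeric fields being discussed can be inspected with Python's standard unicodedata module. The sample characters and the helper name numeric_fields are chosen here for illustration; the values printed depend on the Unicode version bundled with the interpreter, though these particular characters have been stable for a long time.

```python
import unicodedata


def numeric_fields(ch: str):
    """Return (decimal, digit, numeric) for ch, with None where a field is empty."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


samples = {
    "DIGIT TWO": "\u0032",
    "CIRCLED DIGIT TWO": "\u2461",
    "VULGAR FRACTION ONE HALF": "\u00bd",
    "LATIN CAPITAL LETTER A": "\u0041",
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {numeric_fields(ch)}")
```

The pattern is cumulative: a character with a decimal value also has the same digit and numeric values, a circled digit has digit and numeric values but no decimal value, and a fraction has only a numeric value.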
--
...which brings me back to my question (which no-one's answered yet).
What do the properties "digit" versus "decimal digit" actually MEAN? Is
it possible for someone to give a PRECISE definition. I mean, it seems
pretty clear that "decimal digit" does NOT mean "radix ten digit"
(otherwise circl
In full agreement with Philippe here. But also, ever since I first
discovered Unicode, I have had the opinion that the descriptions in
what is now UCD.html are very confusingly worded.
For a start, the three types of numeric property are called "decimal
digit", "digit", and "numeric". Now, as
In the case of GIF versus JPG, which are usually regarded as "lossless"
versus "lossy", please note that there is no "original", in the
sense of a stream of bytes. Why not? Because an image is not a stream
of bytes. Period. What is being compressed here is a rectangular array
of pixels, and tha
That is almost precisely what I said. You repeated it perfectly. Thanks.
But actually, there is one small difference between what I said and
what you said. I merely observed that no characters have
different non-null values for the various number-related properties.
But you state (emphasis on
Actually, I don't understand why UnicodeData.txt has no less than three
different fields for numerical value anyway. I mean, it's not as though
there exists EVEN A SINGLE CODEPOINT for which two or more of these
fields exist and are defined differently from each other. One never
sees, for exam
I'm pretty sure it depends on whether you regard a text document as a
sequence of characters, or as a sequence of glyphs. (Er - I mean
"default grapheme clusters" of course). Regarded as a sequence of
characters, normalisation changes that sequence. But regarded as a
sequence of glyphs, normali
Is anyone able to answer this? I for one would really like to know.
Thanks
> -Original Message-
> From: Frank Yung-Fong Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 20, 2003 2:29 AM
> To: John Jenkins
> Cc: [EMAIL PROTECTED]
> Subject: Re: creating a test font w/ CJKV Extension
Actually, I'd also like to know how to create OTF fonts, not just TTF
fonts, as OTF seems to be the new big thing, and TTF's successor.
Jill