...apologies if I've posted this twice, mutt crashed on me as I was trying
to post it the first time.
On Thu, Aug 30, 2001 at 11:55:24PM -0500, [EMAIL PROTECTED] wrote:
> So, it comes down to a question of how we define "encode", and of the
> usage context that determines our definition. Marco was assuming a
> definition as it would be used internal to Unicode. Misha apparently
> was using a broader definition that is valid in other contexts, though
> not internally to Unicode.
>
> So, they were both right in relation to the assumptions they were
> making. The question, though, is what definition or context Viranga
> was assuming when the question was asked.
Hi All,
I started writing the context, but it soon turned into my work
history. This is my second attempt : ) Thanks for your patience.
And apologies for not replying to the thread sooner; I work in
Australia, which puts me slightly out of phase with most people.
Apologies, too, for my previously vague questions. Though I must
admit that, in hindsight, I'm glad the questions were open to
interpretation, as I have learned much from the thread : )
My (Viranga's) original question was:
> Is it ok for Unicode code points to be encoded/serialized using EUC?
> I'm not planning on doing this; just wondering what (?if any?)
> restrictions, there are on choice of transformation format.
Perhaps I can ask another question (with a slightly wider scope).
When I came across the weekly-euc-jp.xml document, I was rapt: an
XML document with Japanese tags. But when I looked at the underlying
hex, it clearly wasn't "encoded" using a UTF, which confused me,
as I was (?mistakenly?) under the impression that XML required Unicode.
I have read W3C's XML spec
(see http://www.w3.org/TR/2000/REC-xml-20001006#sec-well-formed)
"2.2 Characters
[Definition: A parsed entity contains text, a sequence of
characters, which may represent markup or character data.]
[Definition: A character is an atomic unit of text as specified
by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]).
Legal characters are tab, carriage return, line feed, and the
legal characters of Unicode and ISO/IEC 10646."
[rest of paragraph deleted]
"Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the
surrogate blocks, FFFE, and FFFF. */ "
However in the next paragraph...
"The mechanism for encoding character code points into bit
patterns may vary from entity to entity. All XML processors
must accept the UTF-8 and UTF-16 encodings of 10646; the
mechanisms for signaling which of the two is in use, or for
bringing other encodings into play, are discussed later, in
4.3.3 Character Encoding in Entities."
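Incidentally, the Char production quoted above is easy to check mechanically. A quick Python sketch (the function name is mine, not from the spec):

```python
def is_xml_char(cp: int) -> bool:
    """Check a code point against the XML 1.0 Char production:
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    i.e. any Unicode character excluding the surrogate
    blocks, FFFE, and FFFF."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

print(is_xml_char(0x65E5))   # a kanji: legal
print(is_xml_char(0xD800))   # a surrogate: not legal
print(is_xml_char(0xFFFE))   # not legal
```

Note the production is stated entirely in terms of Unicode/10646 code points; it says nothing about the bytes used to serialize them.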
If the character set is specified as ISO/IEC 10646, in what
circumstances would it be appropriate to use an "encoding" other
than UTF-8 or UTF-16?
Further questions are:
Could I, theoretically, invent my own encoding and say that this
is conformant XML?
Would the character set I use have to be Unicode/10646?
Or could an XML document use one of the JIS character sets?
The page at...
http://java.sun.com/xml/jaxp-1.1/examples/samples/weekly-euc-jp.xml
...states
<?xml version="1.0" encoding="euc-jp"?>
Does '-jp' (or "euc-jp" collectively) imply JIS?
If so, does this violate section 2.2 of the XML 1.0 standard?
Can you have a document that simultaneously satisfies Unicode and
JIS? Or (as is more likely : ) is my understanding flawed?
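To make my confusion concrete, here is the mental model I am testing (and I may well be wrong): the characters are always Unicode/10646, and only the byte serialization is euc-jp. A Python sketch of that reading, using Python's codec names:

```python
# The same abstract characters (Unicode code points), serialized two ways.
# The Japanese element names here are made up for illustration.
doc = '<?xml version="1.0"?>\n<\u9031\u9593>\u5929\u6c17</\u9031\u9593>'

as_utf8 = doc.encode('utf-8')
as_eucjp = doc.encode('euc_jp')

print(as_utf8 != as_eucjp)               # the byte patterns differ...
print(as_eucjp.decode('euc_jp') == doc)  # ...but the characters round-trip
```

If that reading is right, the euc-jp document does not "leave" Unicode at the character level at all; it just uses a non-UTF byte serialization, brought into play via the encoding declaration as per section 4.3.3.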
Regards,
Viranga
P.S. I am interested in the DoCoMo/WAP stuff purely as a source of "real"
Japanese XML/XHTML documents; we're not in the phone business.
P.P.S. I have looked at ICU but have had difficulty compiling it on a
Solaris box (our principal OS for new development is Solaris 8).
I'm a lurker on the icu list, noting with some hope the increasing
success other people seem to be having compiling it on Solaris.
P.P.P.S. For those who might be interested:
The group I work for is planning on going to Japan to find a
Japanese partner for the software we produce. We're essentially
an SGML/XML group that writes document management systems,
high performance information retrieval engines, ...
But we don't really have much (any) experience with East Asian
scripts and languages.
I'm one of the people responsible for making us Unicode conformant,
and for keeping an eye on the Unicode mailing list. Most of my job
involves writing C++ class libraries, database (for want of a better
word) "wizards", and most recently helping out with a demo (to show
that we can supply a toolkit for Japanese developers to use).
So, I have been spending some of my time hunting for Japanese documents,
preferably in Unicode, because we can (hopefully) do intelligent things
with the character properties in word parsers, finite state machines, ...
(I'm sure there are other things, I just can't think of them right now :)
Our string classes are essentially smart arrays of 8-bit, 16-bit
and 32-bit code units. We also use James Clark's parsers (SP and
Expat). We have seen references to JIS in his stuff, but would
rather stick to interfacing with the Unicode side, mainly because
it's so much easier supporting just the one thing internally; we
can deal with other character sets either by converting them to
Unicode, or by promising only storage and retrieval of the raw
data without interpreting it in any way.
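(As an aside, the reason we keep 8-, 16- and 32-bit arrays around is that the same characters occupy a different number of code units in each Unicode encoding form. A trivial illustration, using Python's codecs for the byte counts:

```python
s = '\u65e5\u672c\u8a9e'  # "Japanese (language)": three characters

print(len(s.encode('utf-8')))           # 9 code units: 3 bytes per character
print(len(s.encode('utf-16-le')) // 2)  # 3 16-bit code units (all BMP)
print(len(s.encode('utf-32-le')) // 4)  # 3 32-bit code units
```

A character outside the BMP would instead take 4 bytes in UTF-8, two 16-bit code units in UTF-16, and still one 32-bit code unit in UTF-32.)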
P.P.P.P.S. Which leads me to ask for a clarification of the interoperability
issue which David Starner introduced
> So EUC-JP <-> Shift-JIS produces different results than
> EUC-JP <-> Unicode <-> Shift-JIS.
Does one of the transformations produce lossy output or mutations,
or is it some other issue?