On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>> I think you are a little confused about what Unicode actually
>>> is... Unicode has nothing to do with code pages and nobody
>>> uses code pages any more except for compatibility with legacy
>>> applications (with good reason!).
>> Incorrect.
>> "Unicode is an effort to include all characters from previous
>> code pages into a single character enumeration that can be
>> used with a number of encoding schemes... In practice the
>> various Unicode character set encodings have simply been
>> assigned their own code page numbers, and all the other code
>> pages have been technically redefined as encodings for various
>> subsets of Unicode."
>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
> That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode
having "nothing to do with code pages." All UCS did was take a
bunch of existing code pages and standardize them into one
massive character set. For example, ISCII was a pre-existing
single-byte encoding, and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII
A code page is just a table of mappings; UCS is simply a much
larger, standardized table of such mappings.
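
To make that concrete, here is a rough sketch in D of what a
code page amounts to. The table below is hypothetical, invented
purely for illustration, not any actual standard's layout:

// A minimal sketch: a code page is nothing more than a
// byte-to-code-point table. The non-ASCII entry below is
// hypothetical, for illustration only.
immutable dchar[256] hypotheticalPage = () {
    dchar[256] t;
    foreach (i; 0 .. 128)
        t[i] = cast(dchar) i;   // the ASCII range passes through unchanged
    t[0xA1] = '\u0901';         // hypothetical mapping into some alphabet
    // ... remaining upper-half entries elided ...
    return t;
}();

dchar lookup(ubyte b) { return hypotheticalPage[b]; }

UCS simply merges many such tables into one enumeration large
enough that every table's characters get distinct code points.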
>>> You said that Phobos converts UTF-8 strings to UTF-32 before
>>> operating on them, but that's not true. As it iterates over
>>> UTF-8 strings it iterates over dchars rather than chars, but
>>> that's not in any way inefficient so I don't really see the
>>> problem.
>> And what's a dchar? Let's check:
>> dchar : unsigned 32 bit UTF-32
>> http://dlang.org/type.html
>> Of course that's inefficient: you are translating your whole
>> encoding over to a 32-bit encoding every time you need to
>> process it. Walter as much as said so up above.
> Given that all the machine registers are at least 32 bits
> already, it doesn't make the slightest difference. The only
> additional operations on top of ASCII are when it's a
> multi-byte character, and even then it's some simple bit
> manipulation, which is as fast as any variable-width encoding
> is going to get.
I see you've quietly abandoned your claim that Phobos doesn't
convert UTF-8 to UTF-32 internally. Perhaps converting to
UTF-32 is "as fast as any variable width encoding is going to
get," but my claim is that single-byte encodings will be faster.
> The only alternatives to a variable width encoding I can see
> are:
> - Single code page per string
> This is completely useless because now you can't concatenate
> strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to
be made that strings of different languages are sufficiently
different that there should be no multi-language strings. Is
this the best route? I'm not sure, but I certainly wouldn't
dismiss it out of hand.
> - Multiple code pages per string
> This just makes everything overly complicated, and decoding
> the actual character is far slower than with UTF-8.
I disagree; this would still be far faster than UTF-8,
particularly if you designed your header right.
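
A hypothetical layout, purely my own invention for the sake of
illustration, might look like this:

// Each run of characters shares one code page; a small header
// records the runs. A sketch, not a worked-out format.
struct Run
{
    ubyte codePage;   // which 256-entry table applies to this run
    uint  length;     // number of single-byte characters in the run
}

struct PagedString
{
    Run[]   runs;     // typically one entry per language used
    ubyte[] data;     // exactly one byte per character throughout
}

Indexing stays O(1) within a run, and finding the right run is
a scan over a handful of header entries rather than a decode
pass over the text itself.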
> - String with escape sequences to change code page
> You can no longer access characters in the middle or end of
> the string; you have to parse the entire string every time,
> which completely negates the benefit of a fixed-width encoding.
I didn't think of this possibility, but you may be right that
it's sub-optimal.
>>> Also your complaint that UTF-8 reserves the short characters
>>> for the English alphabet is not really relevant - the
>>> characters with longer encodings tend to be rarer (such as
>>> special symbols) or carry more information (such as Chinese
>>> characters, where the same sentence takes only about 1/3 the
>>> number of characters).
>> The vast majority of non-English alphabets in UCS can be
>> encoded in a single byte. It is your exceptions that are not
>> relevant.
> Well obviously... That's like saying "if you know what the
> exact contents of a file are going to be anyway you can
> compress it to a single byte!"
> I.e., it's possible to devise an encoding which will encode
> any given string to an arbitrarily small size. It's still
> completely useless because you'd have to know the string in
> advance...
No, it's not the same at all. The contents of an
arbitrary-length file cannot be compressed to a single byte;
you would have collisions galore. But since most non-English
alphabets have fewer than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining which language's code page to use. I don't
understand your analogy whatsoever.
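
Decoding such a string is then a straight table lookup. A
sketch, assuming the header has already identified which
256-entry table applies:

// One output character per input byte; no multi-byte sequences
// to parse.
dchar[] decodePage(const(ubyte)[] data, const(dchar)[] page)
{
    assert(page.length == 256);   // one entry per possible byte
    auto result = new dchar[](data.length);
    foreach (i, b; data)
        result[i] = page[b];      // plain array indexing, no branches
    return result;
}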
> - A useful encoding has to be able to handle every Unicode
> character
> - As I've shown, the only space-efficient way to do this is
> using a variable length encoding like UTF-8
You haven't shown this.
> - Given the frequency distribution of Unicode characters,
> UTF-8 does a pretty good job at encoding higher frequency
> characters in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with fewer than
256 characters in a single byte.
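
This is easy to verify with std.utf: every code point at or
above U+0080 costs at least two bytes in UTF-8.

import std.stdio;
import std.utf : codeLength;

void main()
{
    // U+0061 'a', U+0434 Cyrillic de, U+4E2D CJK "middle"
    foreach (dchar c; "aд中")
        writefln("U+%04X -> %s byte(s) in UTF-8", c, codeLength!char(c));
    // prints 1, 2, and 3 bytes respectively
}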
> - Yes you COULD encode non-English alphabets in a single byte,
> but doing so would be inefficient because it would mean the
> more frequently used characters take more bytes to encode.
Not sure what you mean by this.