On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for Windows, which uses UTF-16, is hardly a failure...
So you admit that UTF-8 hasn't been used on the vast majority of computers since the inception of Unicode. That's what I call limited success; thank you for agreeing with me. :)

I really don't understand your hatred for UTF-8 - it's simple to decode and encode, fast and space-efficient. Fixed-width encodings are not inherently fast; the only thing they are faster at is if you want to randomly access the Nth character instead of the Nth byte. In the rare cases that you need to do a lot of this kind of random access, there exists UTF-32...
Space-efficient? Do you even understand what a single-byte encoding is? Suffice it to say, a single-byte encoding beats UTF-8 on all these measures, not just one.

Any fixed-width encoding which can encode every Unicode character must use at least 3 bytes, and using 4 bytes is probably going to be faster because of alignment, so I don't see what the great improvement over UTF-32 is going to be.
Slaps head. You don't need "at least 3 bytes" because you're packing language info in the header. I don't think you even know what I'm talking about.

slicing does require decoding
Nope.
Of course it does, at least partially. There is no other way to know where the code points are.

I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different planes only exist for organisational purposes; they don't affect how characters are encoded.
Nobody's talking about different planes. I'm talking about all the different language character sets in this list:

http://en.wikipedia.org/wiki/List_of_Unicode_characters

?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, i.e. everything that you need to decode a character in the same place, not spread out between part of a character and a header.
Coherent means that the organizational pieces fit together and make sense conceptually, not that everything is stored together. My point is that putting the language info in a header seems much more coherent to me than ramming that info into every character.

but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some transformation that depends on the code point, such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc.: basically all the most common text operations can be done without any encoding or decoding.
Slicing by byte, which is the only way to slice without decoding, is useless; I have to laugh that you even include it. :) All these basic operations can be done very fast, often faster than UTF-8, in a single-byte encoding. Once you start talking code points, it's no contest: UTF-8 flat out loses.
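
Just to make the slicing point concrete in D terms (an illustration only; the example string is made up):

import std.conv  : to;
import std.range : take;

unittest
{
    string s = "héllo";              // the 'é' occupies two bytes in UTF-8
    assert(s.length == 6);           // 6 bytes for 5 characters
    auto bad = s[0 .. 2];            // byte slice: cuts 'é' in half, not valid UTF-8
    assert(s.take(2).to!string == "hé"); // slicing by code point has to decode
}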

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
All a code page is, is a table of mappings; UCS is just a much larger, standardized table of such mappings.

UCS has nothing to do with code pages; it was designed as a replacement for them. A code page is a strict subset of the possible characters; UCS is the entire set of possible characters.
"[I]t was designed as a replacement for them" by combining several of them into a master code page and removing redundancies. Functionally they are the same, and historically they maintain the same layout in at least some cases. To then say that UCS has "nothing to do with code pages" is just dense.

I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get", but my claim is that single-byte encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 strings to UTF-32 strings before it uses them.

i.e. the difference between this:

import std.conv : to;

string mystr = ...;
// convert the whole string to UTF-32 up front, allocating a second copy
dstring temp = mystr.to!dstring;
for (size_t i = 0; i < temp.length; ++i)
    process(temp[i]);

and this:
import std.utf : decode;

string mystr = ...;
size_t i = 0;
while (i < mystr.length) {
    // decode one code point starting at byte offset i; decode advances i past it
    dchar current = decode(mystr, i);
    process(current);
}

And if you can't see why the latter example is far more efficient I give up...
I take your point that phobos often decodes char by char as it iterates, but there are still functions in std.string that convert the entire string, as in your first example. The point is that you are forced to decode everything to UTF-32, whether one char at a time or as the entire string. Your latter example may be marginally more efficient, but it is only useful for functions that start at the beginning and walk the string in one direction, which not all operations do.

- Multiple code pages per string
This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8.
I disagree; this would still be far faster than UTF-8, particularly if you designed your header right.

The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8, which takes a few assembly instructions, has perfect locality and can be efficiently pipelined by the CPU.
Lol, you think a few potential cache misses are going to be slower than repeatedly decoding, whether in assembly and pipelined or not, every single UTF-8 character? :D

Then there's all the extra processing involved in combining the headers when you concatenate strings. Plus you lose the one benefit a fixed-width encoding has, because random access is no longer possible without first finding out which header controls the location you want to access.
There would be a few arithmetic operations on substring indices when concatenating strings; hardly anything.

Random access is not only still possible, it is incredibly fast in most cases: you just have to check first whether the header lists any two-byte encodings. That check can be done once and cached as a property of the string (set a boolean no_two_byte_encoding when the string is built and have the slice operator consult it before going ahead), just as you could add a property to UTF-8 strings to allow quick random access when they happen to be pure ASCII.

The difference is that only strings containing the two-byte-encoded Korean/Chinese/Japanese characters would require a bit more calculation for slicing in my scheme, whereas _every_ non-ASCII UTF-8 string requires full decoding to allow random access. This is a clear win for my single-byte encoding, though maybe not the complete demolition of UTF-8 you were hoping for. ;)
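
To make this concrete, here is a rough sketch of the kind of layout I have in mind. Every name in it (LangRun, HeaderString, noTwoByteRuns, byteOffset) is made up for illustration, and it is only a sketch of the idea, not a worked-out design:

struct LangRun
{
    ushort langId;    // which alphabet / code page this run uses
    size_t startChar; // index of the first character in this run
    size_t startByte; // byte offset where this run begins in the data
    bool   twoByte;   // true only for the few two-byte alphabets
}

struct HeaderString
{
    LangRun[] header;        // usually a single entry, a handful at most
    immutable(ubyte)[] data; // the encoded characters
    bool noTwoByteRuns;      // cached once when the string is built

    // Byte offset of the i-th character.
    size_t byteOffset(size_t i) const
    {
        if (noTwoByteRuns)
            return i;        // character index == byte index, O(1)
        // Otherwise walk the short header to find the run containing i.
        size_t r = header.length - 1;
        while (header[r].startChar > i)
            --r;
        auto run = header[r];
        return run.startByte + (i - run.startChar) * (run.twoByte ? 2 : 1);
    }
}

Indexing and slicing work off byteOffset: when noTwoByteRuns is set, which is the common case, it degenerates to plain array indexing.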

No, it's not the same at all. The contents of an arbitrary-length file cannot be compressed to a single byte; you would have collisions galore. But since most non-English alphabets have fewer than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining which language's code page to use. I don't understand your analogy whatsoever.

It's very simple - the more information you have about the type of data you are compressing at the time of writing the algorithm, the better the compression ratio you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why you have specialised compression algorithms for images, video, audio, etc.
This may be mostly true in general, but your specific example of compressing down to a byte is nonsense. For any arbitrarily long data, there are always limits to compression. What any of this has to do with my single-byte encoding, I have no idea.

It doesn't matter how few characters non-English alphabets have - unless you know WHICH alphabet it is beforehand, you can't store it in a single byte. Since any given character could be in any alphabet, the best you can do is look at the probabilities of different characters appearing and use shorter representations for more common ones. (This is the basis for all lossless compression.) The English alphabet plus 0-9 and basic punctuation are by far the most common characters used on computers, so it makes sense to use one byte for those and multiple bytes for rarer characters.
How many times have I said that "you know WHICH alphabet it is beforehand" because that info is stored in the header? That is why I specifically said, from my first post, that multi-language strings would have more complex headers, which I later pointed out could list all the different language substrings within a multi-language string. Your silly exposition of how compression works makes me wonder if you understand anything about how a single-byte encoding would work.

Perhaps it made sense to use one byte for ASCII characters and relegate _every other language_ to multiple bytes two decades ago. It doesn't make sense today.

- As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8
You haven't shown this.
If you had thought through your suggestion of multiple code pages per string you would see that I had.
You are not packaging and transmitting the code pages with the string, just as you do not ship the entire UCS with every UTF-8 string. A single-byte encoding is going to be more space-efficient for the vast majority of strings; everybody knows this.

No, it does a very bad job of this. Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with fewer than 256 characters in a single byte.
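
A quick worked example of the size difference (the header cost on the single-byte side is my assumption; the UTF-8 byte count is exact):

unittest
{
    string ru = "привет";    // 6 Cyrillic letters
    assert(ru.length == 12); // UTF-8 spends 2 bytes on every one of them
    // A single-byte scheme would need 6 bytes of text plus a small header
    // (say a few bytes naming the alphabet) -- roughly half the size.
}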

And strings with mixed characters would use lots of memory and be extremely slow. Common when using proper names, quotes, inline translations, graphical characters, etc. etc. Not to mention the added complexity to actually implement the algorithms.
Ah, you have finally stumbled across the path to a good argument, though I'm not sure how, given your seeming ignorance of how single-byte encodings work. :) There _is_ a degenerate case with my particular single-byte encoding (not the ones you list, which would still be faster and use less memory than UTF-8): strings that use many, if not all, character sets. So the worst case scenario might be something like a string that had 100 characters, every one from a different language. In that case, I think it would still be smaller than the equivalent UTF-8 string, but not by much.

There might be some complexity in implementing the algorithms, but on net, likely less than UTF-8, while being much more usable for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset to get the code point
Not sure why you chose to write this basic UTF-8 stuff out, other than to bluster on without much use.

Note that this is CONSTANT TIME, O(1), with minimal branching so it is well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU), only sequential memory access so no cache misses, and zero additional memory requirements
It is constant time _per character_. You have to do it for _every_ non-ASCII character in your string, so the decoding adds up.
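
Spelled out as code, that per-character work looks something like the sketch below. This is only an illustration, not phobos' actual std.utf.decode, and it skips all validation; now imagine running it for every non-ASCII character in the string:

// Reads the code point starting at str[i] and advances i past it.
// No validation of continuation bytes, overlong forms or surrogates.
dchar decodeOne(string str, ref size_t i)
{
    uint b = str[i++];
    if (b < 0x80)
        return cast(dchar) b;                // ASCII: a single byte, done
    // 11110xxx -> 3 continuation bytes, 1110xxxx -> 2, 110xxxxx -> 1
    int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;
    uint cp = b & (0x3F >> extra);           // payload bits of the lead byte
    foreach (_; 0 .. extra)
        cp = (cp << 6) | (str[i++] & 0x3F);  // six more payload bits per byte
    return cast(dchar) cp;
}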

Now compare your encoding:
1) Look up the offset in the header using binary search: O(log N) lots of branching
It is difficult to reason about the header, because it all depends on the number of languages used and how many substrings there are. There are worst-case scenarios that could approach something like O(log n), but they are extremely unlikely in real-world use. Most of the time, this would be O(1).

2) Look up the code page ID in a massive array of code pages to work out how many bytes per character
Hardly; this could be done by a lookup function that simply checks whether the language is one of the few alphabets that require two bytes.

3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into a number
Lol, I love how you think this is worth listing as a separate step for the few two-byte encodings, yet have no problem with doing this for every non-ASCII character in UTF-8.

5) Look up this new number in yet another large array specific to the code page
Why? The language byte and that number uniquely specify the character, just like your Unicode code point above. If you were simply encoding the UCS in a single-byte encoding, you would arrange your scheme so that these two bytes trivially generate the UCS code point.
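
In other words, something along these lines, where the per-language base value is hypothetical:

// Hypothetical: each language ID carries a base code point chosen so that
// base + byte reproduces the UCS code point directly.
dchar toCodePoint(dchar languageBase, ubyte b)
{
    return cast(dchar)(languageBase + b); // e.g. a Cyrillic run based near U+0400
}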

This is O(log N), has lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory requirements to store all those tables, and an algorithm that isn't even any easier to understand.
Wrong on practically every count, as detailed above.

Plus every other algorithm to operate on it except for decoding is insanely complicated.
They are still much _less_ complicated than UTF-8; that's the comparison that matters.
