25-May-2013 02:42, H. S. Teoh writes:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
24-May-2013 21:05, Joakim writes:
[...]

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operates directly
on UTF-8 without needing to convert to UTF-32 first.

As it stands there are no UTF-8-specific tables (yet), but there are tools to create the required abstraction by hand. I plan to grow one for std.regex, where it will be field-tested before making it into the public interface. In fact, the needs of std.regex are what prompted me to provide more Unicode machinery in the standard library.

Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer have
implicit conversion to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.

Yup, but let's get the correctness part first, then performance ;)
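The state-table idea discussed above is language-agnostic. As a rough illustration (a Python sketch, not std.uni's generated D code, and deliberately simplified: it skips surrogate and overlong-encoding checks), here is a toy table-driven UTF-8 decoder that works on the bytes directly instead of converting the whole string to UTF-32 first:

```python
# Toy table-driven UTF-8 decoder. The lead byte is classified once via
# a precomputed table; the rest is bit shifting over continuation bytes.
# Illustrative only: real decoders also reject surrogates and overlongs.

# Lead-byte table: lead byte -> (payload mask, continuation byte count).
LEAD = {}
for b in range(0x00, 0x80):
    LEAD[b] = (0x7F, 0)   # 0xxxxxxx: ASCII, no continuation bytes
for b in range(0xC2, 0xE0):
    LEAD[b] = (0x1F, 1)   # 110xxxxx: one continuation byte
for b in range(0xE0, 0xF0):
    LEAD[b] = (0x0F, 2)   # 1110xxxx: two continuation bytes
for b in range(0xF0, 0xF5):
    LEAD[b] = (0x07, 3)   # 11110xxx: three continuation bytes

def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes to a list of code points."""
    out, i = [], 0
    while i < len(data):
        if data[i] not in LEAD:
            raise ValueError(f"invalid lead byte at offset {i}")
        mask, n = LEAD[data[i]]
        cp = data[i] & mask
        for j in range(i + 1, i + 1 + n):
            if data[j] & 0xC0 != 0x80:   # continuation must be 10xxxxxx
                raise ValueError(f"bad continuation byte at offset {j}")
            cp = (cp << 6) | (data[j] & 0x3F)
        out.append(cp)
        i += n + 1
    return out
```

For example, `decode_utf8("héllo".encode("utf-8"))` yields the code points without ever materializing a UTF-32 string; an optimized generator like std.uni's would bake such tables into far faster, branch-light code.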


Want it small? Use compression schemes, which work perfectly well and
get you to the precious 1 byte per code point with exceptional speed.
http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.


BTW, the linked document discusses _standard_ compression, so anybody can decode the result. How you compress largely affects the compression ratio but not much beyond it.
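The core trick behind SCSU (the scheme in UTR #6) is a "window": text whose non-ASCII code points cluster in one 128-code-point range can be stored as one byte per code point. This is a much-simplified sketch of that idea with a single fixed window (a toy, NOT a conforming SCSU codec, which manages multiple dynamic windows via tag bytes):

```python
# Toy illustration of SCSU's window idea (UTR #6): ASCII passes through
# as-is; code points in [base, base + 0x80) become one high byte each.
# A real SCSU encoder switches windows on the fly with tag bytes.

def window_encode(text: str, base: int) -> bytes:
    out = bytearray()
    for cp in map(ord, text):
        if cp < 0x80:
            out.append(cp)                  # ASCII: low byte
        elif base <= cp < base + 0x80:
            out.append(0x80 + (cp - base))  # in-window: high byte
        else:
            raise ValueError(f"U+{cp:04X} falls outside the window")
    return bytes(out)

def window_decode(data: bytes, base: int) -> str:
    return "".join(chr(b) if b < 0x80 else chr(base + b - 0x80)
                   for b in data)
```

With a Cyrillic window (`base=0x0400`), `window_encode("Привет", 0x0400)` is 6 bytes, versus 12 bytes in UTF-8, which is exactly the "1 byte per code point" payoff mentioned above.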

In the bad ole days, HTML could be served in any number of random
encodings, often out of sync with what the server claimed the encoding
was, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but were actually fundamentally b0rken.
Sometimes webpages would show up mostly intact, but with a few
characters mangled, because of deviations / variations in codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.
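The classic mangled-characters failure mode is easy to reproduce: the server sends UTF-8 bytes, but the page (or the browser's default) claims Latin-1, so every non-ASCII character turns into two garbage characters. A two-line demonstration:

```python
# Mojibake in miniature: UTF-8 bytes mislabeled and decoded as Latin-1.
utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'
as_latin1 = utf8_bytes.decode("latin-1")  # what a mislabeled page shows
print(as_latin1)  # cafÃ©  -- each UTF-8 byte rendered as its own char
```

The page is "mostly intact" because ASCII bytes are identical in both encodings; only the multi-byte sequences get shredded, which is precisely why these bugs appeared to work and shipped anyway.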

+1 on these and others :)

--
Dmitry Olshansky
