Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Juan Manuel Cabo


░ⓌⓉⒻ░
╔╗░╔╗░╔╗╔╗╔╗░░
║║░║║░║║╚═╗╔═╝║╔═══╝░░
║║░║║░║║░░║║░░║╚═╗
║╚═╝╚═╝║╔╗║║╔╗║╔═╝╔╗░░
╚══╝╚╝╚╝╚╝╚╝░░╚╝░░


█░█░█░░▐░░▐░
█░█░█▐▀█▐▀█▐░█▐▀█▐▀█▐▀█░
█░█░█▐▄█▐▄█▐▄▀▐▄█▐░█▐░█░
█▄█▄█▐▄▄▐▄▄▐░█▐▄▄▐░█▐▄█░



--jm



Re: D on next-gen consoles and for game development

2013-05-25 Thread Patrick Down

On Saturday, 25 May 2013 at 05:29:31 UTC, deadalnix wrote:

This is technically possible, but you said you make few 
allocations. So with the tax on pointer writes or the reference 
counting, you'll pay a lot to collect very little garbage. I'm 
not sure the tradeoff is worthwhile.




Incidentally, I ran across this paper that talks about a 
reference-counting garbage collector that claims to address this 
issue.  Might be of interest to this group.


http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon03Pure.pdf

From the paper:

There are two primary problems with reference counting, namely:
(1) run-time overhead of incrementing and decrementing the reference count each time a pointer is copied, particularly on the stack; and
(2) inability to detect cycles and consequent necessity of including a second garbage collection technique to deal with cyclic garbage.
In this paper we present new algorithms that address these problems and describe a new multiprocessor garbage collector based on these techniques that achieves maximum measured pause times of 2.6 milliseconds over a set of eleven benchmark programs that perform significant amounts of memory allocation.
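
For a rough illustration of problem (1), here is a naive reference-counting pointer sketch in D (my sketch, not code from the paper): every copy and every destruction touches the count, and two objects that own each other through such pointers never reach a zero count, which is problem (2).

// Naive reference-counted pointer (illustrative sketch only).
struct RcPtr(T)
{
    private T* payload;
    private size_t* count;

    this(this)    // postblit: runs on every copy of the pointer
    {
        if (count) ++*count;
    }

    ~this()
    {
        if (count && --*count == 0)
        {
            // last owner: payload and count would be freed here
        }
    }
}

The paper's claim is that both of these costs can be reduced enough to make the approach competitive.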



Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim

On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:

25-May-2013 10:44, Joakim writes:
Yes, on the encoding, if it's a variable-length encoding like 
UTF-8, no,
on the code space.  I was originally going to title my post, 
"Why
Unicode?" but I have no real problem with UCS, which merely 
standardized
a bunch of pre-existing code pages.  Perhaps there are a lot 
of problems

with UCS also, I just haven't delved into it enough to know.


UCS is dead and gone. Next in line to "640K is enough for 
everyone".
I think you are confused.  UCS refers to the Universal Character 
Set, which is the backbone of Unicode:


http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, 
which I have never referred to.


Separate code spaces were the case before Unicode (and UTF-8). The 
problem is not only that without the header the text is meaningless 
(no easy slicing), but also that the encoding of the data after the 
header depends strongly on a variety of factors - a list of 
encodings, actually. Now everybody has to keep a (code) page per 
language to at least know if it's 2 bytes per char or 1 byte per 
char or whatever. And you still work on the basis that there are no 
combining marks and no region-specific stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed 
that.


Legacy. Hard to switch overnight. There are graphs that 
indicate that a few years from now you might never encounter a 
legacy encoding anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages.  I 
meant that there's not much of a difference between code pages 
with 2 bytes per char and the language character sets in UCS.



Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1 
byte per

char or whatever?"


It's coherent in its scheme to determine that. You don't need 
extra information synced to text unlike header stuff.
?!  It's okay because you deem it "coherent in its scheme?"  I 
deem headers much more coherent. :)



It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.  
Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, 
at least
doubling your string size in the process.  Correct me if I'm 
wrong, that

was what I read on the newsgroup sometime back.


Indeed you are - searching for a UTF-8 substring in a UTF-8 string 
doesn't do any decoding, and it returns you a slice of the original.
Perhaps substring search doesn't strictly require decoding but 
you have changed the subject: slicing does require decoding and 
that's the use case you brought up to begin with.  I haven't 
looked into it, but I suspect substring search not requiring 
decoding is the exception for UTF-8 algorithms, not the rule.


??? Simply makes no sense. There is no intersection between 
some legacy encodings as of now. Or do you want to add N*(N-1) 
cross-encodings for any combination of 2? What about 3 in one 
string?
I sketched two possible encodings above, none of which would 
require "cross-encodings."


We want monoculture! That is, to understand each other without all 
these "par-le-vu-france?" and codepages of various 
complexity (insanity).
I hate monoculture, but then I haven't had to decipher some 
screwed-up

codepage in the middle of the night. ;)


So you never had trouble with internationalization? What 
languages do you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't 
had to code with the terrible code pages system from the past.  I 
can read and speak multiple languages, but I don't use anything 
other than English text.



That said, you could standardize
on UCS for your code space without using a bad encoding like 
UTF-8, as I

said above.


UCS is a myth as of ~5 years ago. Early adopters of Unicode 
fell into that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above.  If that's a myth, 
Unicode is a myth. :)


This is it, but it's far more flexible in the sense that it allows 
multilingual strings just fine, and lone full-width Unicode 
code points as well.
That's only because it uses a more complex header than a single 
byte for the language, which I noted could be done with my 
scheme, by adding a more complex header, long before you 
mentioned this unicode compression scheme.



But I get the impression that it's only for sending over
the wire, ie transmision, so all the processing issues that 
UTF-8

introduces would still be there.


Use mime-type etc. Standards are always a bit stringy and 
suboptimal, their acceptance rate is one of chief advantages 
they have. Unicode has horrifically large momentum now and not 
a single organization aside from them tries to do this dirty 
work (=i18n).
You misunderstand.  I was saying that this unicode compression 
scheme doesn't help you with string processing, it is only for 
transmission and is probably fine for that, precisely because it 
seems to implement so

Re: DLang Spec rewrite (?)

2013-05-25 Thread Borden
I hasten to add that I don't mean to criticise the original 
writers of the DLang Spec for writing it in DDoc macros. So far, 
I've found the documentation fairly easy to follow (as plain 
text) and so I don't want to lose any of that should the spec be 
rewritten.


It's also possible (although, in my opinion, less preferable) to 
keep the spec written in DDoc macros but reformatted to allow for 
easier conversion to other formats...


DLang Spec rewrite (?)

2013-05-25 Thread Borden

Good afternoon, all,

I would still like to compile the D Lang Spec into EPUB (and 
possibly other formats) but, as we discussed in these threads:


http://forum.dlang.org/thread/bsbdpjyjubfxvmecw...@forum.dlang.org
http://forum.dlang.org/thread/uzdngvjzexukbgkxd...@forum.dlang.org

having the D Lang Specification written in DDoc macros is making 
it extremely difficult to work with.


I ask, therefore, what opposition would there be to me rewriting 
the DLang Spec files into another format that will be easier to 
parse and compile for the website, PDF, LaTeX, eBook and other 
formats? If the answer is 'minimal', 'go ahead' or 'it's your 
funeral', then my follow-up question is 'what format would be the 
easiest to write, debug and maintain?'


For greater clarity, I am NOT proposing to rewrite the 
DDoc-generated library documentation or any other pages outside 
of the spec. In the makefile, they are defined as the files 
covered in $(SPEC_ROOT).


With regards,


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Diggory

On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:

On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually 
is... Unicode has nothing to do with code pages and nobody 
uses code pages any more except for compatibility with legacy 
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous 
code pages into a single character enumeration that can be used 
with a number of encoding schemes... In practice the various 
Unicode character set encodings have simply been assigned their 
own code page numbers, and all the other code pages have been 
technically redefined as encodings for various subsets of 
Unicode."

http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode



That confirms exactly what I just said...

You said that phobos converts UTF-8 strings to UTF-32 before 
operating on them but that's not true. As it iterates over 
UTF-8 strings it iterates over dchars rather than chars, but 
that's not in any way inefficient so I don't really see the 
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole 
encoding over to a 32-bit encoding every time you need to 
process it.  Walter as much as said so up above.


Given that all the machine registers are at least 32-bits already 
it doesn't make the slightest difference. The only additional 
operations on top of ascii are when it's a multi-byte character, 
and even then it's some simple bit manipulation which is as fast 
as any variable width encoding is going to get.
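
For instance, decoding a two-byte sequence is only a couple of masks and a shift (a sketch, validation omitted):

// Lead byte 110xxxxx followed by continuation byte 10yyyyyy
// encodes the code point 00000xxxxxyyyyyy.
dchar decodeTwoByte(ubyte b0, ubyte b1)
{
    return cast(dchar)(((b0 & 0x1F) << 6) | (b1 & 0x3F));
}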


The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate 
strings of different code pages.


- Multiple code pages per string
This just makes everything overly complicated and is far slower 
to decode what the actual character is than UTF-8.


- String with escape sequences to change code page
Can no longer access characters in the middle or end of the 
string, you have to parse the entire string every time which 
completely negates the benefit of a fixed width encoding.


- An encoding wide enough to store every character
This is just UTF-32.



Also your complaint that UTF-8 reserves the short characters 
for the english alphabet is not really relevant - the 
characters with longer encodings tend to be rarer (such as 
special symbols) or carry more information (such as chinese 
characters where the same sentence takes only about 1/3 the 
number of characters).
The vast majority of non-english alphabets in UCS can be 
encoded in a single byte.  It is your exceptions that are not 
relevant.


Well obviously... That's like saying "if you know what the exact 
contents of a file are going to be anyway you can compress it to 
a single byte!"


ie. It's possible to devise an encoding which will encode any 
given string to an arbitrarily small size. It's still completely 
useless because you'd have to know the string in advance...


- A useful encoding has to be able to handle every unicode 
character
- As I've shown the only space-efficient way to do this is using 
a variable length encoding like UTF-8
- Given the frequency distribution of unicode characters, UTF-8 
does a pretty good job at encoding higher frequency characters in 
fewer bytes.
- Yes you COULD encode non-english alphabets in a single byte but 
doing so would be inefficient because it would mean the more 
frequently used characters take more bytes to encode.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Dmitry Olshansky

25-May-2013 12:58, Vladimir Panteleev writes:

On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:

This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding. Perhaps
you mean that the slowdown is minimal, but I doubt that also.


For the record, I noticed that programmers (myself included) that had an
incomplete understanding of Unicode / UTF exaggerate this point, and
sometimes needlessly assume that their code needs to operate on
individual characters (code points), when it is in fact not so - and
that code will work just fine as if it was written to handle ASCII. The
example Walter quoted (regex - assuming you don't want Unicode ranges or
case-insensitivity) is one such case.


+1
BTW regex, even with Unicode ranges and case-insensitivity, is doable, 
just not easy (yet).



Another thing I noticed: sometimes when you think you really need to
operate on individual characters (and that your code will not be correct
unless you do that), the assumption will be incorrect due to the
existence of combining characters in Unicode. Two of the often-quoted
use cases of working on individual code points are calculating the string
width (assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not accounted
for.  I believe the proper way to approach such tasks is to implement the
respective Unicode algorithms for it, which I believe are non-trivial
and for which the relative impact of the overhead of working with a
variable-width encoding is acceptable.


Another plus one. Algorithms defined on a code point basis are quite 
complex, so the benefit of not decoding won't be that large. The benefit 
of transparently special-casing ASCII in UTF-8 is far larger.
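
(A sketch of what that special-casing amounts to; illustrative only, not Phobos code:)

// All bytes of a multi-byte UTF-8 sequence are >= 0x80, so anything
// below 0x80 can be handled as plain ASCII on the spot.
void scanUtf8(const(char)[] s)
{
    foreach (char c; s)    // iterates code units, no decoding
    {
        if (c < 0x80)
        {
            // fast path: ASCII character, handle directly
        }
        else
        {
            // slow path: part of a multi-byte sequence
        }
    }
}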



Can you post some specific cases where the benefits of a constant-width
encoding are obvious and, in your opinion, make constant-width encodings
more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it
answers your points, though:

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/



--
Dmitry Olshansky


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Dmitry Olshansky

25-May-2013 13:05, Joakim writes:

On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

I think you stand alone in your desire to return to code pages.

Nobody is talking about going back to code pages.  I'm talking about
going to single-byte encodings, which do not imply the problems that you
had with code pages way back when.


Problem is what you outline is isomorphic with code-pages. Hence the 
grief of accumulated experience against them.

Code pages simply are no longer practical nor acceptable for a global
community. D is never going to convert to a code page system, and even
if it did, there's no way D will ever convince the world to abandon
Unicode, and so D would be as useless as EBCDIC.

I'm afraid you and others here seem to mentally translate "single-byte
encodings" to "code pages" in your head, then recoil in horror as you
remember all your problems with broken implementations of code pages,
even though those problems are not intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to discuss why
UTF-8 is used at all.  I had hoped for some technical evaluations of its
merits, but I seem to simply be dredging up a bunch of repressed
memories about code pages instead. ;)


Well, if somebody got a quest to redefine UTF-8 they *might* come up with 
something that is a bit faster to decode but shares the same properties. 
Hardly a lifesaver anyway.


The world may not "abandon Unicode," but it will abandon UTF-8, because
it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
proliferate until someone comes up with something better to show how
dumb they are.


Even children know XML is awful, redundant shit as an interchange format. 
The hierarchical document is a nice idea anyway.



Perhaps it won't be the D programming language that does
that, but it would be easy to implement my idea in D, so maybe it will
be a D-based library someday. :)


Implement Unicode compression scheme - at least that is standardized.



--
Dmitry Olshansky


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Dmitry Olshansky

25-May-2013 10:44, Joakim writes:

On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:

You seem to think that not only UTF-8 is bad encoding but also one
unified encoding (code-space) is bad(?).

Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.  My problem
is with these dumb variable-length encodings, so I was precise in the
title.



UCS is dead and gone. Next in line to "640K is enough for everyone".
Simply put, Unicode decided to take into account the whole diversity of 
languages instead of ~80% of them. Hard to add anything else. No offense 
meant, but it feels like you actually live in a universe that is 5-7 years 
behind the current state. UTF-16 (a successor to UCS) is no random-access 
either. And it's shitty beyond measure; UTF-8 is a shining gem in 
comparison.



Separate code spaces were the case before Unicode (and UTF-8). The
problem is not only that without the header the text is meaningless (no easy
slicing), but also that the encoding of the data after the header depends
strongly on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on the basis that there are no combining marks and no region-specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.


Legacy. Hard to switch overnight. There are graphs that indicate that 
a few years from now you might never encounter a legacy encoding anymore, 
only UTF-8/UTF-16.



 Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
char or whatever?"


It's coherent in its scheme to determine that. You don't need extra 
information synced to text unlike header stuff.



It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.


Indeed you are - searching for a UTF-8 substring in a UTF-8 string doesn't 
do any decoding, and it returns you a slice of the original.
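
A sketch of why that works (illustrative only, not the Phobos implementation): continuation bytes are always >= 0x80, so a valid UTF-8 needle can only match at a code point boundary of a valid UTF-8 haystack, and plain byte comparison is enough.

// Byte-wise substring search over UTF-8; the result is a slice of
// the original string and no decoding is performed.
const(char)[] findSub(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length == 0 || needle.length > haystack.length)
        return null;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[i .. i + needle.length] == needle)
            return haystack[i .. $];
    }
    return null;
}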





In fact it was even "better": nobody ever talked about a header, they just
assumed a codepage with some global setting. Imagine yourself creating
a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? Hmm, if that is in codepage XYZ
then ...).

I understand that people were frustrated with all the code pages out
there before UCS standardized them, but that is a completely different
argument than my problem with UTF-8 and variable-length encodings.  My
proposed simple, header-based, constant-width encoding could be
implemented with UCS and there go all your arguments about random code
pages.


No they don't - have you ever seen native Korean or Chinese codepages? 
Problems with your header-based approach are self-evident in the sense 
that there is no single sane way to deal with it on a cross-locale basis 
(which you simply ignore, as noted below).



This just shows you don't care for multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. For instance most
languages need to intersperse ASCII (also keep in mind e.g. HTML
markup). Books often feature citations in native language (or e.g.
latin) along with translations.

This is a small segment of use and it would be handled fine by an
alternate encoding.


??? Simply makes no sense. There is no intersection between some legacy 
encodings as of now. Or do you want to add N*(N-1) cross-encodings for 
any combination of 2? What about 3 in one string?



Now also take into account math symbols, currency symbols and beyond.
Also these days cultures are mixing in wild combinations so you might
need to see the text even if you can't read it. Unicode is not only
"encode characters from all languages". It needs to address universal
representation of symbolics used in writing systems at large.

I take your point that it isn't just languages, but symbols also.  I see
no reason why UTF-8 is a better encoding for that purpose than the kind
of simple encoding I've suggested.


We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity (insanity).

I hate monoculture, but then I haven't had to decipher some screwed-up
codepage in the middle of the night. ;)


So you never had trouble with internationalization? What languages do you 
use (read/speak/etc.)?



That said, you could standardize
on UCS for your code space without using a bad encoding like UTF-8, as I
said above.


UCS is a myth as of ~5 years ago.

Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread deadalnix

On Saturday, 25 May 2013 at 10:46:05 UTC, Ahuzhgairl wrote:

Hi,

In D, the : in a template parameter list only binds to 1 
parameter.
There is no way to specialize upon the entire template 
parameter list.
Therefore you can't do much with the pattern matching and it's 
not powerful.
Not a reasonable situation for a language aiming to be only the 
best.


What is needed is the ability to bind to the whole template 
parameter list:


template <class T> struct get_class;
template <class R, class C, class... A> struct get_class<R (C::*)(A...)> { typedef C type; };


Let's shorten the terms:

 @ 

And here's how this kind of specialization would work in D:

template A[B] { struct C {} }
template Foo[alias X, Y, Z @ X[Y].Z] { alias Z Foo; }
void main() { alias Foo[A[bool].C] xxx; }


You need a separate delimiter besides : which does not bind to 
individual parameters, but which binds to the set of parameters.


I propose @ as the character which shall be the delimiter for 
the arguments to the pattern match, and the pattern match.


On an unrelated note, I don't like the ! thing so I use []. 
Sorry for the confusion there.


z


Hi, I obviously don't know D that much, but I assume I do.

I have this feature that I can't even show a working example that 
exists in C++. I also can't come up with any use case, but I know 
this is mandatory to have.


As I assume I know D well enough, I assume I know that this is 
impossible in D, so I propose an improvement.


With that improvement, a new syntax is introduced to support some 
new feature that is barely defined and can be used in unknown 
situations.


I also explain myself using my own made-up syntax. I don't care 
if it conflicts with other language constructs as it is superior 
anyway.


Re: Best XML Library

2013-05-25 Thread Meta
I suggest you check the XMLP library by Michael Rynn. I haven't 
tried XML processing with D, so I don't know how good the different 
libraries are, but XMLP is in the review queue, which means 
it's highly possible it will become Phobos' standard XML 
library, and when that happens you will have an easy migration.


That is a good point. I suppose I'll take a look at that and
Tango's XML package.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Daniel Murphy
"Manu"  wrote in message 
news:mailman.137.1369448229.13711.digitalmar...@puremagic.com...
>>
>> One of the first, and best, decisions I made for D was it would be 
>> Unicode
>> front to back.
>>
>
> Indeed, excellent decision!
> So when we define operators for u × v and a · b, or maybe n²? ;)

When these have keys on standard keyboards. 




Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev 
wrote:

On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
Are you sure _you_ understand it properly?  Both encodings 
have to check every single character to test for whitespace, 
but the single-byte encoding simply has to load each byte in 
the string and compare it against the whitespace-signifying 
bytes, while the variable-length code has to first load and 
parse potentially 4 bytes before it can compare, because it 
has to go through the state machine that you linked to above.  
Obviously the constant-width encoding will be faster.  Did I 
really need to explain this?


It looks like you've missed an important property of UTF-8: 
lower ASCII remains encoded the same, and UTF-8 code units 
encoding non-ASCII characters cannot be confused with ASCII 
characters. Code that does not need Unicode code points can 
treat UTF-8 strings as ASCII strings, and does not need to 
decode each character individually - because a 0x20 byte will 
mean "space" regardless of context. That's why a function that 
splits a string by ASCII whitespace does NOT need to perform 
UTF-8 decoding.


I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not 
necessary to decode every UTF-8 character if you are simply 
comparing against ASCII space characters.  My mixup is because I 
was unaware whether every language used its own space character in 
UTF-8 or whether they reuse the ASCII space character; apparently it's 
the latter.


However, my overall point stands.  You still have to check 2-4 
times as many bytes if you do it the way Peter suggests, as 
opposed to a single-byte encoding.  There is a shortcut: you 
could also check the first byte to see if it's ASCII or not and 
then skip the right number of ensuing bytes in a character's 
encoding if it isn't ASCII, but at that point you have begun 
partially decoding the UTF-8 encoding, which you claimed wasn't 
necessary and which will degrade performance anyway.


On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it. 
There is no need to decode, you just treat the UTF-8 string as 
if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding 
UTF-8.


This code will count all spaces in a string whether it is 
encoded as ASCII or UTF-8:


int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8. 
You do not understand it. The reason you don't need to decode 
is because UTF-8 is self-synchronising.
Not quite.  The reason you don't need to decode is because of the 
particular encoding scheme chosen for UTF-8, a side effect of 
ASCII backwards compatibility and reusing the ASCII space 
character; it has nothing to do with whether it's 
self-synchronizing or not.


The code above tests for spaces only, but it works the same 
when searching for any substring or single character. It is no 
slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character," 
it works the same for any single ASCII character.


Of course it's slower than a fixed-width single-byte encoding.  
You have to check every single byte of a non-ASCII character in 
UTF-8, whereas a single-byte encoding only has to check a single 
byte per language character.  There is a shortcut if you 
partially decode the first byte in UTF-8, mentioned above, but 
you seem dead-set against decoding. ;)


Again, I urge you, please read up on UTF-8. It is very well 
designed.
I disagree.  It is very badly designed, but the ASCII 
compatibility does hack in some shortcuts like this, which still 
don't save its performance.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread H. S. Teoh
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
> >On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
> >>>If you want to split a string by ASCII whitespace (newlines,
> >>>tabs and spaces), it makes no difference whether the string is
> >>>in ASCII or UTF-8 - the code will behave correctly in either
> >>>case, variable-width-encodings regardless.
> >>Except that a variable-width encoding will take longer to decode
> >>while splitting, when compared to a single-byte encoding.
> >
> >No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly?  Both encodings have to
> check every single character to test for whitespace, but the
> single-byte encoding simply has to load each byte in the string and
> compare it against the whitespace-signifying bytes, while the
> variable-length code has to first load and parse potentially 4 bytes
> before it can compare, because it has to go through the state
> machine that you linked to above.  Obviously the constant-width
> encoding will be faster.  Did I really need to explain this?
[...]

Have you actually tried to write a whitespace splitter for UTF-8? Do you
realize that you can use an ASCII whitespace splitter for UTF-8 and it
will work correctly?

There is no need to decode UTF-8 for whitespace splitting at all. There
is no need to parse anything. You just iterate over the bytes and split
on 0x20. There is no performance difference over ASCII.

As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code
tries to play it safe by decoding every character, this is not necessary
in many cases.
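
A sketch of such a splitter (space, tab, CR and LF only; illustrative, not Phobos code):

// Splits on ASCII whitespace bytes. All bytes of a multi-byte UTF-8
// sequence are >= 0x80, so they can never be mistaken for whitespace.
string[] splitAsciiWhite(string s)
{
    string[] parts;
    size_t start = 0;
    foreach (i, char c; s)    // code units, no decoding
    {
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
        {
            if (i > start)
                parts ~= s[start .. i];
            start = i + 1;
        }
    }
    if (start < s.length)
        parts ~= s[start .. $];
    return parts;
}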


T

-- 
The best compiler is between your ears. -- Michael Abrash


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Vladimir Panteleev

On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
wrote:

On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines, 
tabs and spaces), it makes no difference whether the string 
is in ASCII or UTF-8 - the code will behave correctly in 
either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to 
decode while splitting, when compared to a single-byte 
encoding.


No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly?  Both encodings have 
to check every single character to test for whitespace, but the 
single-byte encoding simply has to load each byte in the string 
and compare it against the whitespace-signifying bytes, while 
the variable-length code has to first load and parse 
potentially 4 bytes before it can compare, because it has to go 
through the state machine that you linked to above.  Obviously 
the constant-width encoding will be faster.  Did I really need 
to explain this?


It looks like you've missed an important property of UTF-8: lower 
ASCII remains encoded the same, and UTF-8 code units encoding 
non-ASCII characters cannot be confused with ASCII characters. 
Code that does not need Unicode code points can treat UTF-8 
strings as ASCII strings, and does not need to decode each 
character individually - because a 0x20 byte will mean "space" 
regardless of context. That's why a function that splits a string 
by ASCII whitespace does NOT need to perform UTF-8 decoding.


I hope this clears up the misunderstanding :)


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Peter Alexander

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:

int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}


Oops. Missing a ++c in there, but I'm sure the point was made :-)
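
For reference, the corrected loop:

int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
    {
        if (*c == ' ')
            ++n;
        ++c;    // advance one code unit; still no decoding
    }
    return n;
}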


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Peter Alexander

On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
wrote:

On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines, 
tabs and spaces), it makes no difference whether the string 
is in ASCII or UTF-8 - the code will behave correctly in 
either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to 
decode while splitting, when compared to a single-byte 
encoding.


No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly?  Both encodings have 
to check every single character to test for whitespace, but the 
single-byte encoding simply has to load each byte in the string 
and compare it against the whitespace-signifying bytes, while 
the variable-length code has to first load and parse 
potentially 4 bytes before it can compare, because it has to go 
through the state machine that you linked to above.  Obviously 
the constant-width encoding will be faster.  Did I really need 
to explain this?


I suggest you read up on UTF-8. You really don't understand it. 
There is no need to decode, you just treat the UTF-8 string as if 
it is an ASCII string.


This code will count all spaces in a string whether it is encoded 
as ASCII or UTF-8:


int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8. 
You do not understand it. The reason you don't need to decode is 
because UTF-8 is self-synchronising.
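
(As an aside, a sketch of what self-synchronisation buys you: from any byte index you can find the start of the enclosing code point using only local information, because continuation bytes always have the form 10xxxxxx.)

size_t codePointStart(const(char)[] s, size_t i)
{
    // back up over continuation bytes to the lead byte
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}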


The code above tests for spaces only, but it works the same when 
searching for any substring or single character. It is no slower 
than fixed-width encoding for these operations.


Again, I urge you, please read up on UTF-8. It is very well 
designed.


Re: DMD under 64-bit Windows 7 HOWTO

2013-05-25 Thread Sébastien.Kunz-Jacques

On Saturday, 25 May 2013 at 13:24:56 UTC, Rainer Schuetze wrote:



On 25.05.2013 15:03, "Sébastien Kunz-Jacques" wrote:
On Tuesday, 18 December 2012 at 13:33:03 UTC, Gor Gyolchanyan 
wrote:


I hope I was helpful, because when I started to set up a 
development
environment under 64-bit Windows 7, I went through a lot of 
problems

to get
here and I'd love to have this HOWTO at that time.


I just tried this with the current beta (May 25, 2.063). It lacks the 
-m64 option. Was it present in some older beta?



-m64 isn't displayed in the usage screen (no idea why it is 
excluded there), but it is supported as well as -m32 (the 
default).


Thanks for the tip. I had incorrectly put quotes around -m64 
-L/NOLOGO and the resulting error message


unrecognized switch '-m64 -L/NOLOGO'

plus the lack of mention of -m64 in the dmd command-line help 
confused me.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
wrote:

On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines, 
tabs and spaces), it makes no difference whether the string 
is in ASCII or UTF-8 - the code will behave correctly in 
either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to 
decode while splitting, when compared to a single-byte 
encoding.


No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly?  Both encodings have 
to check every single character to test for whitespace, but the 
single-byte encoding simply has to load each byte in the string 
and compare it against the whitespace-signifying bytes, while the 
variable-length code has to first load and parse potentially 4 
bytes before it can compare, because it has to go through the 
state machine that you linked to above.  Obviously the 
constant-width encoding will be faster.  Did I really need to 
explain this?


On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu 
wrote:

On 5/25/13 3:33 AM, Joakim wrote:

On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy 
way than a
problem with UTF-8. You can do all the string algorithms, 
including
regex, by working with the UTF-8 directly rather than 
converting to

UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width 
encoding
can be as "full speed" as a constant-width encoding. Perhaps 
you mean

that the slowdown is minimal, but I doubt that also.


You mentioned this a couple of times, and I wonder what makes 
you so sure. On contemporary architectures small is fast and 
large is slow; betting on replacing larger data with more 
computation is quite often a win.
When has small ever been slow and large fast? ;) I'm talking 
about replacing larger data _and_ more computation, ie UTF-8, 
with smaller data and less computation, ie single-byte encodings, 
so it is an unmitigated win in that regard. :)


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Kenji Hara
2013/5/25 Ahuzhgairl 

> No,
>
> struct Foo(T) {
> static void f() { writeln("general"); }
> }
>
> struct Foo(T : A(B).alias C, A, B, C) {
> static void f() { writeln("special"); }
> }
>
> struct Bar(T) {
> struct Baz {}
> }
>
> struct Baz(T : A(B), A, B) {
> }
>
> void main() {
> Foo!(Bar!(int).Baz);
> Baz!(Bar!(int));
> }
>

As I already showed, Baz!(Bar!(int)); could work in D.

But, currently Foo!(Bar!(int).Baz); is not yet supported.

I'm opening a compiler enhancement for a related case,
http://d.puremagic.com/issues/show_bug.cgi?id=9022

and right now I updated the compiler patch to allow parameterizing an
enclosed type by name/type/alias.
https://github.com/D-Programming-Language/dmd/pull/1296
https://github.com/9rnsr/dmd/commit/b29726d30b0094b9e7c2e15f5802501cb686ee68

After it is merged, you can write it as follows.

import std.stdio;

struct Foo(T)
{
    static void f() { writeln("general"); }
}
struct Foo(T : A!(B).C, alias A, B, alias C)
{
    static void f() { writeln("special"); }
}

struct Bar(T) { struct Baz {} }

void main()
{
    Foo!(Bar!(int).Baz) x;
    x.f();  // prints "special"
}

Kenji Hara


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Peter Alexander

On Saturday, 25 May 2013 at 12:43:42 UTC, Ahuzhgairl wrote:

C++ example, works:

template <class T> struct A;
template <template <class> class X, class Y> struct A<X<Y>> {};

template <class T> struct B;

int main() {
    A<B<int>> a;
}


As we've shown, you can do this in D. Instead of template 
templates, you use alias.
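
For example, mirroring the C++ above (a sketch along the lines of the working D code further down in this thread):

struct B(T) {}

// X binds to the template B via alias, Y binds to int
struct A(T : X!(Y), alias X, Y)
{
    static void f() {}
}

void main()
{
    A!(B!(int)).f();
}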




But the following does not work:

struct Foo {};
template  struct B { Foo x; }

template  struct A;
template  struct A {}

int main() {
A<&B::x> a;
}


It's getting very hard to see what you're trying to do. I think 
it would help if you used real C++ and D syntax instead of 
inventing new syntax because I can't tell what you're trying to 
achieve and what semantics you expect of it.


Please post a small example of real, working, compilable C++ that 
shows what you want to do, and we'll show you how to do it in D 
(assuming it is possible).


Re: DMD under 64-bit Windows 7 HOWTO

2013-05-25 Thread Rainer Schuetze



On 25.05.2013 15:03, "Sébastien Kunz-Jacques" wrote:

On Tuesday, 18 December 2012 at 13:33:03 UTC, Gor Gyolchanyan wrote:


I hope I was helpful, because when I started to set up a development
environment under 64-bit Windows 7, I went through a lot of problems
to get
here and I'd love to have this HOWTO at that time.


I just tried this with the current beta (May 25, 2.063). It lacks the
-m64 option. Was it present in some older beta?



-m64 isn't displayed in the usage screen (no idea why it is excluded 
there), but it is supported as well as -m32 (the default).


Re: Are people using textmate for D programming?

2013-05-25 Thread Michael

Where do I put it?

Thanks,

Andrei


http://docs.sublimetext.info/en/latest/extensibility/syntaxdefs.html



Re: Are people using textmate for D programming?

2013-05-25 Thread Michael
OK, you convinced me to try. But my SublimeText OSX 
installation does not contain the D.tmPackage file described at 
https://github.com/alexrp/st2-d. Where do I put it?


Thanks,

Andrei


ST2 and ST3 have built-in D syntax highlighting.
ST3 is now in the beta stage, but has improved Mac OS X support. 
The ST3 beta is for registered users only, but it's worth the money.


Re: DMD under 64-bit Windows 7 HOWTO

2013-05-25 Thread Sébastien.Kunz-Jacques
On Tuesday, 18 December 2012 at 13:33:03 UTC, Gor Gyolchanyan 
wrote:

Good day, fellow D developers.
After spending much time figuring out how to make DMD work 
fluently under
64-bit Windows 7 I've realized that this is not a trivial task 
and lots of
people might have trouble with this, so I've decided to post my 
solution,

that might save people a lot of time.
As we know, there are compatibility problems with 32-bit DMD 
binaries,
because they are compiled using DMC back-end, which can only 
produce OMF
binaries, so in order to avoid problems with linking against 
externally
compiled libraries, it's much easier to stick to 64-bit 
binaries, so that
DMD will use the Visual Studio linker to produce compatible 
COFF binaries.
Another problem is that 32-bit DMD binaries are linked against 
obsolete
32-bit WinAPI libraries, which lack some very important 
functions, while
the 64-bit binaries are required to link with the 64-bit 
libraries,

supplied by the the Windows SDK.

And here's how this could be arranged:

1. Prepare your development folder.
1.1. Create a folder with no spaces in its full path.
1.2. Store its full path in the '%DEV_DIR_ROOT%' environment 
variable.

2. Get the Windows SDK.
2.1. Download the Windows SDK.
2.1.1. Navigate to 
'http://msdn.microsoft.com/en-US/windows//bb980924.aspx'

in a web browser.
2.1.2. Under section 2 (number '2' in a green circle) click on 
the bold

blue 'Install Now' link.
2.1.3. In the opened window click in the blue 'Download' button 
at the

bottom of the page.
2.1.4. Make sure, that the Windows SDK installer 
('winsdk_web.exe') is

downloaded.
2.2. Install the downloaded Windows SDK.
2.2.1. Navigate to the folder, where the Windows SDK installer 
was

downloaded in a file browser.
2.2.2. Double-click on the installer and agree to security 
warnings to

launch it.
2.2.3. Click next, read and agree to the license until you 
reach the

'Install Locations' screen.
2.2.4. Store the path under 'Destination Folder for Tools' in 
the

'%DEV_DIR_MSWINSDK%' (e.g. 'C:\Program Files (x86)\Microsoft
SDKs\Windows\v7.0A') and click 'Next >'.
2.3.3. On the 'Installation Options' uncheck everything except 
'x64

Libraries' and 'Visual C++ Compilers' and click 'Next >'.
2.3.4. Confirm that everything is correct and click 'Next >' to 
start

installing.
2.3.5. Make sure that the installation is completed 
successfully.
2.3.6. Store the path to the installed Visual Studio C++ 
compiler into the

'%DEV_DIR_MSVC%' environment variable (e.g. 'C:\Program Files
(x86)\Microsoft Visual Studio 10.0\VC').
3. Get the DMD.
3.1. Navigate to 'http://ftp.digitalmars.com/dmd2beta.zip' in a 
web browser.
3.2. Make sure, that the DMD compiler archive ('dmd2beta.zip') 
is

downloaded.
3.3. Unzip the archive into '%DEV_DIR_ROOT%\Tools', so that the 
'dmd2'
folder in the archive will end up in 
'%DEV_DIR_ROOT%\Tools\dmd2'.
3.4. Adapt the compiler configuration to the development 
environment.
3.4.1. Open the file 
'%DEV_DIR_ROOT%\Tools\dmd2\windows\bin\sc.ini' in a

text editor.
3.4.2. Replace the line with 'LIB=' with the line
'LIB="%DEV_DIR_WINSDK%\Lib\x64";"%DEV_DIR_MSVC%\lib\amd64";"%@P%\..\lib"'.
3.4.3. Add '-m64 -L/NOLOGO' to  the 'DFLAGS' variable.
3.4.4. Remove the lines with 'VCINSTALLDIR=' and 
'WindowsSdkDir='.

3.4.5. Replace the line with 'LINKCMD64=' with the line
'LINKCMD64="%DEV_DIR_MSVC%\bin\amd64\link.exe"'
Now "%DEV_DIR_ROOT%\Tools\dmd2\windows\bin\dmd.exe" will always 
use the
Windows SDK libraries and Visual C++ compiler to produce 64-bit 
COFF

binaries.
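
For reference, the relevant part of sc.ini would then look roughly like this (only a sketch; the default DFLAGS import paths may differ between DMD releases):

[Environment]
LIB="%DEV_DIR_WINSDK%\Lib\x64";"%DEV_DIR_MSVC%\lib\amd64";"%@P%\..\lib"
DFLAGS="-I%@P%\..\..\src\phobos" "-I%@P%\..\..\src\druntime\import" -m64 -L/NOLOGO
LINKCMD64="%DEV_DIR_MSVC%\bin\amd64\link.exe"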

I hope I was helpful, because when I started to set up a 
development
environment under 64-bit Windows 7, I went through a lot of 
problems to get

here and I'd love to have this HOWTO at that time.


I just tried this with the current beta (May 25, 2.063). It lacks 
the -m64 option. Was it present in some older beta?




Re: Are people using textmate for D programming?

2013-05-25 Thread Andrei Alexandrescu

On 5/25/13 5:08 AM, TommiT wrote:

I just tried out Sublime Text 2 and found it to be quite similar but
somewhat better than TextMate 2. And there's an improved D syntax
highlighter for it at: https://github.com/alexrp/st2-d

All the keywords seem to be there, indentation works etc.

Sublime Text does from time to time annoy you about buying the license,
but luckily there's google.


OK, you convinced me to try. But my SublimeText OSX installation does 
not contain the D.tmPackage file described at 
https://github.com/alexrp/st2-d. Where do I put it?


Thanks,

Andrei


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Ahuzhgairl

C++ example, works:

template <class T> struct A;
template <template <class> class X, class Y> struct A<X<Y>> {};

template <class T> struct B;

int main() {
    A<B<int>> a;
}



But the following does not work:

struct Foo {};
template  struct B { Foo x; }

template  struct A;
template  struct A {}

int main() {
A<&B::x> a;
}


D should be able to do both.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Andrei Alexandrescu

On 5/25/13 3:33 AM, Joakim wrote:

On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:

This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding
can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.


You mentioned this a couple of times, and I wonder what makes you so 
sure. On contemporary architectures small is fast and large is slow; 
betting on replacing larger data with more computation is quite often a win.



Andrei


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Kenji Hara
2013/5/25 Ahuzhgairl 

> Uneditable newsgroups. Simplest case.
>
> struct Bar(T) {}
>
> struct Foo(T : A(B), A, B) {
> static void f() {}
> }
>
> void main() {
> Foo!(Bar!(int)).f();
> }
>

It would work.

struct Bar(T) {}

struct Foo(T : A!(B), alias A, B) {   // 1, 2
    static void f() {}
}

void main() {
    Foo!(Bar!(int)).f();
}

1. You should use A!(B) instead of A(B).
2. A would match a template, so it should be received by a TemplateAliasParameter.

Kenji Hara


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Peter Alexander

On Saturday, 25 May 2013 at 12:13:42 UTC, Ahuzhgairl wrote:

Uneditable newsgroups. Simplest case.

struct Bar(T) {}

struct Foo(T : A(B), A, B) {
    static void f() {}
}

void main() {
    Foo!(Bar!(int)).f();
}


Two problems with that:

1. A(B) should be A!(B)
2. A won't bind to Bar because Bar is not a type, it is a 
template. A should be an alias.


This works:

struct Bar(T) {}

struct Foo(T : A!(B), alias A, B) {
    static void f() {}
}

void main() {
    Foo!(Bar!(int)).f();
}


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Vladimir Panteleev

On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines, 
tabs and spaces), it makes no difference whether the string is 
in ASCII or UTF-8 - the code will behave correctly in either 
case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to 
decode while splitting, when compared to a single-byte encoding.


No. Are you sure you understand UTF-8 properly?


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Ahuzhgairl

Uneditable newsgroups. Simplest case.

struct Bar(T) {}

struct Foo(T : A(B), A, B) {
    static void f() {}
}

void main() {
    Foo!(Bar!(int)).f();
}


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Ahuzhgairl

No,




struct Foo(T) {
    static void f() { writeln("general"); }
}

struct Foo(T : A(B).alias C, A, B, C) {
    static void f() { writeln("special"); }
}

struct Bar(T) {
    struct Baz {}
}

struct Baz(T : A(B), A, B) {
}

void main() {
    Foo!(Bar!(int).Baz);
    Baz!(Bar!(int));
}


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Peter Alexander

Is this what you're looking for?

import std.stdio;

struct Foo(T)
{
    static void bar() { writeln("general"); }
}

struct Foo(T : A[B], A, B)
{
    static void bar() { writeln("special"); }
}

void main()
{
    Foo!(int).bar();      // general
    Foo!(int[int]).bar(); // special
}


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Timon Gehr

On 05/25/2013 05:56 AM, H. S. Teoh wrote:

On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:

On 5/24/2013 7:16 PM, Manu wrote:

So when we define operators for u × v and a · b, or maybe n²? ;)


Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.


That would be most awesome!

Though it does raise the issue of how parsing would work, 'cos you
either have to assign a fixed precedence to each of these operators (and
there are a LOT of them in Unicode!),


I think this is what eg. fortress is doing.


or allow user-defined operators
with custom precedence and associativity,


This is what eg. Haskell, Coq are doing.
(Though Coq has the advantage of not allowing forward references, and 
hence inline parser customization is straightforward in Coq.)



which means nightmare for the
parser (it has to adapt itself to new operators as the code is
parsed/analysed,


It would be easier on the parsing side, since the parser would not fully 
parse expressions. Semantic analysis would resolve precedences. This is 
quite simple, and the current way the parser resolves operator 
precedences is less efficient anyways.



which then leads to issues with what happens if two
different modules define the same operator with conflicting precedence /
associativity).



This would probably be an error without explicit disambiguation, or 
follow the usual disambiguation rules. (trying all possibilities appears 
to be exponential in the number of conflicting operators in an 
expression in the worst case though.)


Re: D's limited template specialization abilities compared to C++

2013-05-25 Thread Ahuzhgairl

By extension,

template Foo[X, Y, Z @ X[Y], Y[Z]] { alias Y Foo; }


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev 
wrote:
You don't need to do that to slice a string. I think you mean 
to say that you need to decode each character if you want to 
slice the string at the N-th code point? But this is exactly 
what I'm trying to point out: how would you find this N? How 
would you know if it makes sense, taking into account combining 
characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point; what other 
way would you slice and have it make any sense?  Finding the N-th 
code point is much simpler with a constant-width encoding.


I'm leaving aside combining characters and those intrinsic 
language complexities baked into unicode in my previous analysis, 
but if you want to bring those in, that's actually an argument in 
favor of my encoding.  With my encoding, you know up front if 
you're using languages that have such complexity- just check the 
header- whereas with a chunk of random UTF-8 text, you cannot 
ever know that unless you decode the entire string once and 
extract knowledge of all the languages that are embedded.


For another similar example, let's say you want to run toUpper on 
a multi-language string, which contains English in the first half 
and some Asian script that doesn't define uppercase in the second 
half.  With my format, toUpper can check the header, then process 
the English half and skip the Asian half (I'm assuming that the 
substring indices for each language would be stored in this more 
complex header).  With UTF-8, you have to process the entire 
string, because you never know what random languages might be 
packed in there.
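
Purely as a hypothetical sketch (the names and layout here are invented for illustration, not a worked-out spec), such a header might look something like this:

// Hypothetical header-based, single-byte string format.
struct LangRun
{
    ubyte  lang;     // id of the language/character table for this run
    size_t offset;   // byte offset of the run within the payload
    size_t length;   // run length in bytes (one byte per character)
}

struct HeaderString
{
    LangRun[] header;    // the "more complex header"
    ubyte[]   payload;   // constant-width, single-byte characters
}

toUpper would then walk the header, transform only the runs whose language defines case, and copy the rest verbatim.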


UTF-8 is riddled with such performance bottlenecks, all to make 
it self-synchronizing.  But is anybody really using its less 
compact encoding to do some "self-synchronized" integrity 
checking?  I suspect almost nobody is.


If you want to split a string by ASCII whitespace (newlines, 
tabs and spaces), it makes no difference whether the string is 
in ASCII or UTF-8 - the code will behave correctly in either 
case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode 
while splitting, when compared to a single-byte encoding.


You cannot honestly look at those multiple state diagrams and 
tell me it's "simple."


I meant that it's simple to implement (and adapt/port to other 
languages). I would say that UTF-8 is quite cleverly designed, 
so I wouldn't say it's simple by itself.
Perhaps, maybe decoding is not so bad for the type of people who 
write the fundamental UTF-8 libraries.  But implementation does 
not merely refer to the UTF-8 libraries, but also all the code 
that tries to build on it for internationalized apps.  And with 
all the unnecessary additional complexity added by UTF-8, 
wrapping the average programmer's head around this mess likely 
leads to as many problems as broken code page implementations 
did back in the day. ;)


Re: Any plans to fix Issue 9044? aka Language stability question again

2013-05-25 Thread eles
On Saturday, 25 May 2013 at 10:07:29 UTC, Denis Shelomovskij 
wrote:


obviously contradicts my personal very loyal definition (e.g. I 
have nothing against breaking changes if they are in a good 
direction).


I very much like this definition.



Re: D on next-gen consoles and for game development

2013-05-25 Thread Benjamin Thaut

Am 25.05.2013 03:29, schrieb Manu:



Win64 works for me out of the box... ?


For me dmd produces type names like modulename.typename.subtypename, 
which will cause internal errors within the Visual Studio debugger in 
some cases. Also, debugging of static / global variables is not possible 
(even when __gshared) because they are also formatted like 
modulename.variablename.


Kind Regards
Benjamin Thaut


D's limited template specialization abilities compared to C++

2013-05-25 Thread Ahuzhgairl

Hi,

In D, the : in a template parameter list only binds to 1 
parameter.
There is no way to specialize upon the entire template parameter 
list.
Therefore you can't do much with the pattern matching and it's 
not powerful.
Not a reasonable situation for a language aiming to be only the 
best.


What is needed is the ability to bind to the whole template 
parameter list:


template <class T> struct get_class;
template <class R, class C, class... A> struct get_class<R (C::*)(A...)> { typedef C type; };


Let's shorten the terms:

 @ 

And here's how this kind of specialization would work in D:

template A[B] { struct C {} }
template Foo[alias X, Y, Z @ X[Y].Z] { alias Z Foo; }
void main() { alias Foo[A[bool].C] xxx; }


You need a separate delimiter besides : which does not bind to 
individual parameters, but which binds to the set of parameters.


I propose @ as the character which shall be the delimiter for the 
arguments to the pattern match, and the pattern match.


On an unrelated note, I don't like the ! thing so I use []. Sorry 
for the confusion there.


z


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Vladimir Panteleev

On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
Can you post some specific cases where the benefits of a 
constant-width encoding are obvious and, in your opinion, make 
constant-width encodings more useful than all the benefits of 
UTF-8?
Let's take one you listed above, slicing a string.  You have to 
either translate your entire string into UTF-32 so it's 
constant-width, which is apparently what Phobos does, or decode 
every single UTF-8 character along the way, every single time.  
A constant-width, single-byte encoding would be much easier to 
slice, while still using at most half the space.


You don't need to do that to slice a string. I think you mean to 
say that you need to decode each character if you want to slice 
the string at the N-th code point? But this is exactly what I'm 
trying to point out: how would you find this N? How would you 
know if it makes sense, taking into account combining characters, 
and all the other complexities of Unicode?


If you want to split a string by ASCII whitespace (newlines, tabs 
and spaces), it makes no difference whether the string is in 
ASCII or UTF-8 - the code will behave correctly in either case, 
regardless of the variable-width encoding.
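
As a quick D sketch of that point (a hypothetical helper, not 
something out of Phobos): the bytes of a multi-byte UTF-8 sequence 
are all >= 0x80, so a byte-level scan for ' ', '\t' or '\n' can 
never split a character in half.

import std.stdio;

// Split on ASCII whitespace by scanning code units (bytes) only.
// Safe on UTF-8: lead and continuation bytes of multi-byte sequences
// are all >= 0x80, so they can never be mistaken for ' ', '\t' or '\n'.
string[] splitAsciiWhitespace(string s)
{
    string[] parts;
    size_t start = 0;
    foreach (i, char c; s)              // char => code units, no decoding
    {
        if (c == ' ' || c == '\t' || c == '\n')
        {
            if (i > start)
                parts ~= s[start .. i];
            start = i + 1;
        }
    }
    if (start < s.length)
        parts ~= s[start .. $];
    return parts;
}

void main()
{
    writeln(splitAsciiWhitespace("héllo wörld\tfoo"));
    // ["héllo", "wörld", "foo"] -- no code point was ever decoded
}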


You cannot honestly look at those multiple state diagrams and 
tell me it's "simple."


I meant that it's simple to implement (and adapt/port to other 
languages). I would say that UTF-8 is quite cleverly designed, so 
I wouldn't say it's simple by itself.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread w0rp

This is dumb. You are dumb. Go away.


Any plans to fix Issue 9044? aka Language stability question again

2013-05-25 Thread Denis Shelomovskij
As those of you who write some non-toy projects in D know, from time 
to time your projects become unbuildable because of Issue 9044 [1] and you 
have to juggle files and randomly copy/move functions from one 
library to another to "detrigger" the issue, creating a mess marked "Issue 
9044 workaround". It becomes really annoying when your one-file project 
using an external library fails, as it forces you to juggle that 
library's files (e.g. VisualD's `cpp2d` project, which triggers the issue 
randomly).


I'd never complain about such things, but the language tends to be 
called stable by its main maintainers, and I'd like to finally see an 
official definition of this "stability", as it obviously contradicts my 
personal very loyal definition (e.g. I have nothing against breaking 
changes if they are in a good direction).



[1] http://d.puremagic.com/issues/show_bug.cgi?id=9044

--
Денис В. Шеломовский
Denis V. Shelomovskij


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev 
wrote:
Another thing I noticed: sometimes when you think you really 
need to operate on individual characters (and that your code 
will not be correct unless you do that), the assumption will be 
incorrect due to the existence of combining characters in 
Unicode. Two of the often-quoted use cases of working on 
individual code points is calculating the string width 
(assuming a fixed-width font), and slicing the string - both of 
these will break with combining characters if those are not 
accounted for. I believe the proper way to approach such tasks 
is to implement the respective Unicode algorithms for it, which 
I believe are non-trivial and for which the relative impact for 
the overhead of working with a variable-width encoding is 
acceptable.
Combining characters are examples of complexity baked into the 
various languages, so there's no way around that.  I'm arguing 
against layering more complexity on top, through UTF-8.


Can you post some specific cases where the benefits of a 
constant-width encoding are obvious and, in your opinion, make 
constant-width encodings more useful than all the benefits of 
UTF-8?
Let's take one you listed above, slicing a string.  You have to 
either translate your entire string into UTF-32 so it's 
constant-width, which is apparently what Phobos does, or decode 
every single UTF-8 character along the way, every single time.  A 
constant-width, single-byte encoding would be much easier to 
slice, while still using at most half the space.
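
To make the trade-off concrete, here is a small D sketch (a 
hypothetical helper name, not a Phobos function) of what "find the 
N-th character" costs: with a one-byte-per-character encoding it is 
a single array index, while with UTF-8 you either scan past 
continuation bytes like this or convert to dchar[] up front.

// In a constant-width, one-byte-per-character encoding this is just s[n].
// In UTF-8, finding the n-th code point means scanning from the start
// and skipping continuation bytes (those of the form 0b10xxxxxx).
size_t indexOfNthCodePoint(string s, size_t n)
{
    size_t seen = 0;
    foreach (i, char c; s)
    {
        if ((c & 0xC0) != 0x80)     // ASCII or lead byte => new code point
        {
            if (seen == n)
                return i;           // code-unit index where it starts
            ++seen;
        }
    }
    assert(0, "fewer than n+1 code points");
}

unittest
{
    assert(indexOfNthCodePoint("aéb", 2) == 3);   // 'é' occupies bytes 1-2
}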


Also, I don't think this has been posted in this thread. Not 
sure if it answers your points, though:


http://www.utf8everywhere.org/
That seems to be a call to using UTF-8 on Windows, with a lot of 
info on how best to do so, with little justification for why 
you'd want to do so in the first place.  For example,


"Q: But what about performance of text processing algorithms, 
byte alignment, etc?


A: Is it really better with UTF-16? Maybe so."

Not exactly a considered analysis of the two. ;)


And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and 
tell me it's "simple."  That said, the difficulty of _using_ 
UTF-8 is a much bigger problem than implementing a decoder 
in a library.


Re: Are people using textmate for D programming?

2013-05-25 Thread TommiT
I just tried out Sublime Text 2 and found it to be quite similar 
to, but somewhat better than, TextMate 2. And there's an improved D 
syntax highlighter for it at: https://github.com/alexrp/st2-d


All the keywords seem to be there, indentation works etc.

Sublime Text does from time to time annoy you about buying the 
license, but luckily there's google.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim

On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

I think you stand alone in your desire to return to code pages.
Nobody is talking about going back to code pages.  I'm talking 
about going to single-byte encodings, which do not imply the 
problems that you had with code pages way back when.


I have years of experience with code pages and the unfixable 
misery they produce. This has disappeared with Unicode. I find 
your arguments unpersuasive when stacked against my experience. 
And yes, I have made a living writing high performance code 
that deals with characters, and you are quite off base with 
claims that UTF-8 has inevitable bad performance - though there 
is inefficient code in Phobos for it, to be sure.
How can a variable-width encoding possibly compete with a 
constant-width encoding?  You have not articulated a reason for 
this.  Do you believe there is a performance loss with 
variable-width, but that it is not significant and therefore 
worth it?  Or do you believe it can be implemented with no loss?  
That is what I asked above, but you did not answer.


My grandfather wrote a book that consists of mixed German, 
French, and Latin words, using special characters unique to 
those languages. Another failing of code pages is it fails 
miserably at any such mixed language text. Unicode handles it 
with aplomb.
I see no reason why single-byte encodings wouldn't do a better 
job at such mixed-language text.  You'd just have to have a 
larger, more complex header or keep all your strings in a single 
language, with a different format to compose them together for 
your book.  This would be so much easier than UTF-8 that I cannot 
see how anyone could argue for a variable-length encoding instead.


I can't even write an email to Rainer Schütze in English under 
your scheme.
Why not?  You seem to think that my scheme doesn't implement 
multi-language text at all, whereas I pointed out, from the 
beginning, that it could be trivially done also.


Code pages simply are no longer practical nor acceptable for a 
global community. D is never going to convert to a code page 
system, and even if it did, there's no way D will ever convince 
the world to abandon Unicode, and so D would be as useless as 
EBCDIC.
I'm afraid you and others here seem to mentally translate 
"single-byte encodings" into "code pages", then recoil in horror 
as you remember all your problems with broken implementations of 
code pages, even though those problems are not intrinsic to 
single-byte encodings.


I'm not asking you to consider this for D.  I just wanted to 
discuss why UTF-8 is used at all.  I had hoped for some technical 
evaluations of its merits, but I seem to simply be dredging up a 
bunch of repressed memories about code pages instead. ;)


The world may not "abandon Unicode," but it will abandon UTF-8, 
because it's a dumb idea.  Unfortunately, such dumb ideas (XML, 
anyone?) often proliferate until someone comes up with something 
better to show how dumb they are.  Perhaps it won't be the D 
programming language that does that, but it would be easy to 
implement my idea in D, so maybe it will be a D-based library 
someday. :)



I'm afraid your quest is quixotic.
I'd argue the opposite, considering most programmers still can't 
wrap their head around UTF-8.  If someone can just get a 
single-byte encoding implemented and in front of them, I suspect 
it will be UTF-8 that will be considered quixotic. :D


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Vladimir Panteleev

On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way 
than a problem with UTF-8. You can do all the string 
algorithms, including regex, by working with the UTF-8 
directly rather than converting to UTF-32. Then the algorithms 
work at full speed.
I call BS on this.  There's no way working on a variable-width 
encoding can be as "full speed" as a constant-width encoding.  
Perhaps you mean that the slowdown is minimal, but I doubt that 
also.


For the record, I noticed that programmers (myself included) that 
had an incomplete understanding of Unicode / UTF exaggerate this 
point, and sometimes needlessly assume that their code needs to 
operate on individual characters (code points), when it is in 
fact not so - and that code will work just fine as if it was 
written to handle ASCII. The example Walter quoted (regex - 
assuming you don't want Unicode ranges or case-insensitivity) is 
one such case.


Another thing I noticed: sometimes when you think you really need 
to operate on individual characters (and that your code will not 
be correct unless you do that), the assumption will be incorrect 
due to the existence of combining characters in Unicode. Two of 
the often-quoted use cases of working on individual code points 
is calculating the string width (assuming a fixed-width font), 
and slicing the string - both of these will break with combining 
characters if those are not accounted for. I believe the proper 
way to approach such tasks is to implement the respective Unicode 
algorithms for it, which I believe are non-trivial and for which 
the relative impact for the overhead of working with a 
variable-width encoding is acceptable.


Can you post some specific cases where the benefits of a 
constant-width encoding are obvious and, in your opinion, make 
constant-width encodings more useful than all the benefits of 
UTF-8?


Also, I don't think this has been posted in this thread. Not sure 
if it answers your points, though:


http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
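
For a sense of what a decode step involves, here is a naive 
single-code-point decoder sketched in D. It is deliberately not the 
DFA decoder linked above, and it omits the overlong-form and 
surrogate checks a production decoder needs, which is part of why 
real decoders are more involved.

import std.stdio;

// Decode the code point starting at s[i] and advance i past it.
// Illustrative only: overlong encodings and surrogates are not rejected.
dchar decodeOne(string s, ref size_t i)
{
    ubyte b = cast(ubyte) s[i++];
    if (b < 0x80)
        return cast(dchar) b;                           // ASCII, one byte

    uint cp;
    int extra;
    if      ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }  // 2 bytes
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }  // 3 bytes
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }  // 4 bytes
    else throw new Exception("invalid UTF-8 lead byte");

    foreach (k; 0 .. extra)
    {
        ubyte c = cast(ubyte) s[i++];
        if ((c & 0xC0) != 0x80)
            throw new Exception("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    return cast(dchar) cp;
}

void main()
{
    string s = "aé€";                // 1-, 2- and 3-byte sequences
    size_t i = 0;
    while (i < s.length)
        writefln("U+%04X", cast(uint) decodeOne(s, i));
    // prints U+0061, U+00E9, U+20AC
}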


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Walter Bright

On 5/25/2013 12:33 AM, Joakim wrote:

At what cost?  Most programmers completely punt on unicode, because they just
don't want to deal with the complexity. Perhaps you can deal with it and don't
mind the performance loss, but I suspect you're in the minority.


I think you stand alone in your desire to return to code pages. I have years of 
experience with code pages and the unfixable misery they produce. This has 
disappeared with Unicode. I find your arguments unpersuasive when stacked 
against my experience. And yes, I have made a living writing high performance 
code that deals with characters, and you are quite off base with claims that 
UTF-8 has inevitable bad performance - though there is inefficient code in 
Phobos for it, to be sure.


My grandfather wrote a book that consists of mixed German, French, and Latin 
words, using special characters unique to those languages. Another failing of 
code pages is it fails miserably at any such mixed language text. Unicode 
handles it with aplomb.


I can't even write an email to Rainer Schütze in English under your scheme.

Code pages simply are no longer practical nor acceptable for a global community. 
D is never going to convert to a code page system, and even if it did, there's 
no way D will ever convince the world to abandon Unicode, and so D would be as 
useless as EBCDIC.


I'm afraid your quest is quixotic.


Shared libraries in dmd 2.063

2013-05-25 Thread Johannes Pfau
What's the official status of shared libraries in dmd 2.063? Is it
already deemed stable or can there still be breaking changes for dmd
2.064?

I'm asking because I think we should change the default visibility of D
functions in shared libraries. We want to encourage platform
independent code so good code should use the 'export' attribute
anyway.
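
As a tiny sketch of what that would look like for library authors, 
assuming the hidden-by-default model being suggested here (this is 
the proposal, not how dmd behaves today): only symbols marked with 
'export' would end up in the dynamic symbol table.

module mylib;                   // hypothetical library module

export void publicApi()         // meant to be part of the .so/.dll interface
{
    internalHelper();
}

private void internalHelper()   // with hidden-by-default visibility this
{                               // would not appear in the dynamic symbol table
    // ...
}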

Making all symbols public by default and templates is a bad combination
for performance as it stresses the runtime linker. Look at that gcc
page, they managed to create a templated library that takes 6 minutes
to load because of this!

http://gcc.gnu.org/wiki/Visibility
http://software.intel.com/en-us/articles/software-convention-models-using-elf-visibility-attributes
http://www.technovelty.org/code/why-symbol-visibility-is-good.html


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim

On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually 
is... Unicode has nothing to do with code pages and nobody uses 
code pages any more except for compatibility with legacy 
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous 
code pages into a single character enumeration that can be used 
with a number of encoding schemes... In practice the various 
Unicode character set encodings have simply been assigned their 
own code page numbers, and all the other code pages have been 
technically redefined as encodings for various subsets of 
Unicode."

http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode


Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these 
characters
3) A set of standardised encodings for efficiently encoding 
sequences of these characters
What makes you think I'm unaware of this?  I have repeatedly 
differentiated between UCS (1) and UTF-8 (3).


You said that phobos converts UTF-8 strings to UTF-32 before 
operating on them but that's not true. As it iterates over 
UTF-8 strings it iterates over dchars rather than chars, but 
that's not in any way inefficient so I don't really see the 
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole 
encoding over to a 32-bit encoding every time you need to process 
it.  Walter as much as said so up above.
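
Both behaviours can be seen in a few lines of D: iterating by char 
walks the raw UTF-8 code units, while iterating by dchar decodes 
each code point to a 32-bit value on the fly, which is the 
per-character cost being argued about here.

import std.stdio;

void main()
{
    string s = "père";                     // 'è' is two bytes in UTF-8

    // Code units: 5 iterations, no decoding.
    foreach (char c; s)
        writef("%02X ", cast(uint) c);
    writeln();                             // 70 C3 A8 72 65

    // Code points: 4 iterations, each one decoded to a 32-bit dchar.
    foreach (dchar c; s)
        writef("U+%04X ", cast(uint) c);
    writeln();                             // U+0070 U+00E8 U+0072 U+0065
}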


Also your complaint that UTF-8 reserves the short characters 
for the english alphabet is not really relevant - the 
characters with longer encodings tend to be rarer (such as 
special symbols) or carry more information (such as chinese 
characters where the same sentence takes only about 1/3 the 
number of characters).
The vast majority of non-english alphabets in UCS can be encoded 
in a single byte.  It is your exceptions that are not relevant.


Re: Low-Lock Singletons In D

2013-05-25 Thread Mehrdad

On Tuesday, 7 May 2013 at 20:17:43 UTC, QAston wrote:
No. A tutorial on memory consistency models would be too long 
to insert here. I don't know of a good online resource, does 
anyone?


Andrei


This was very helpful for me - focuses much more on the memory 
model itself than the c++11 part.


http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2


This was awesome/amazing.


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Diggory
I think you are a little confused about what unicode actually 
is... Unicode has nothing to do with code pages and nobody uses 
code pages any more except for compatibility with legacy 
applications (with good reason!).


Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these 
characters
3) A set of standardised encodings for efficiently encoding 
sequences of these characters


You said that phobos converts UTF-8 strings to UTF-32 before 
operating on them but that's not true. As it iterates over UTF-8 
strings it iterates over dchars rather than chars, but that's not 
in any way inefficient so I don't really see the problem.


Also your complaint that UTF-8 reserves the short characters for 
the english alphabet is not really relevant - the characters with 
longer encodings tend to be rarer (such as special symbols) or 
carry more information (such as chinese characters where the same 
sentence takes only about 1/3 the number of characters).


Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim

On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
One of the first, and best, decisions I made for D was it would 
be Unicode front to back.
That is why I asked this question here.  I think D is still one 
of the few programming languages with such unicode support.


This is more a problem with the algorithms taking the easy way 
than a problem with UTF-8. You can do all the string 
algorithms, including regex, by working with the UTF-8 directly 
rather than converting to UTF-32. Then the algorithms work at 
full speed.
I call BS on this.  There's no way working on a variable-width 
encoding can be as "full speed" as a constant-width encoding.  
Perhaps you mean that the slowdown is minimal, but I doubt that 
also.


That was the go-to solution in the 1980's, they were called 
"code pages". A disaster.
My understanding is that code pages were a "disaster" because 
they weren't standardized and often badly implemented.  If you 
used UCS with a single-byte encoding, you wouldn't have that 
problem.



> with the few exceptional languages with more than 256
characters encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This 
too was done in the 80's with "Shift-JIS" for Japanese, and 
some other wacky scheme for Korean, and a third nutburger one 
for Chinese.
Of course, you have to have more than one byte for those 
languages, because they have more than 256 characters.  So there 
will be no compression gain over UTF-8/16 there, but a big 
reduction in parsing complexity with a simpler encoding, 
particularly when dealing with multi-language strings.


I've had the misfortune of supporting all that in the old 
Zortech C++ compiler. It's AWFUL. If you think it's simpler, 
all I can say is you've never tried to write internationalized 
code with it.
Heh, I'm not saying "let's go back to badly defined code pages"; 
I'm saying "let's go back to single-byte encodings."  The two are 
separate arguments.


UTF-8 is heavenly in comparison. Your code is automatically 
internationalized. It's awesome.
At what cost?  Most programmers completely punt on unicode, 
because they just don't want to deal with the complexity.  
Perhaps you can deal with it and don't mind the performance loss, 
but I suspect you're in the minority.


Re: D on next-gen consoles and for game development

2013-05-25 Thread Paulo Pinto

Am 25.05.2013 03:29, schrieb Manu:

On 25 May 2013 04:20, Benjamin Thaut <c...@benjamin-thaut.de> wrote:
[...]
See, I have spend a decade on core tech/engine code meticulously
worrying about memory allocation. I don't think a GC is an outright no-go.
But we certainly don't have a GC that fits the bill.


Given that Android, Windows Phone 7/8 and PS Vita have system languages 
with GC, it does not seem to bother those developers.


Yes I know that most AAA studios are actually bypassing them and using C 
and C++ directly, but already having indie developers using D would be a 
great win.


One needs to start somewhere.



- Better Windows support. All of the development we do happens on
Windows and most of D's community does not care about Windows
support. I'm curious how long it will take until D will get proper
DLL support.




Yeah, this is partially why I missed the train for game development. I was 
too focused on FOSS issues, instead of focusing on making a game.



--
Paulo



Re: D on next-gen consoles and for game development

2013-05-25 Thread deadalnix

On Saturday, 25 May 2013 at 05:52:23 UTC, Manu wrote:
But it would be deterministic, and if the allocations are few, the cost 
should be negligible.



You'll pay a tax on pointer writes, not on allocations! It won't 
be negligible!
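
A minimal sketch of where that tax shows up, using a hypothetical 
Rc wrapper (not std.typecons.RefCounted or anything the runtime 
ships): every copy of the reference runs the postblit and touches 
the count, even though no allocation happens.

struct Rc(T)
{
    private T*      payload;
    private size_t* count;

    this(T value)
    {
        payload = new T;
        *payload = value;
        count = new size_t;
        *count = 1;
    }

    this(this)                  // postblit: runs on *every* copy
    {
        if (count !is null)
            ++*count;           // the per-copy "tax"
    }

    ~this()
    {
        if (count !is null && --*count == 0)
        {
            // a real implementation would free the payload here
        }
    }
}

void main()
{
    auto a = Rc!int(42);
    auto b = a;                 // copy: count goes to 2
    auto c = b;                 // copy: count goes to 3
}                               // three destructors, three decrements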


They're still non-deterministic though. And unless (even if?) they're 
precise, they might leak.



Not if they are precise. But this is another topic.

What does ObjC do? It seems to work okay on embedded hardware (although 
not particularly memory-constrained hardware).
Didn't ObjC recently reject GC in favour of refcounting?


ObjC is a horrible three-headed monster in that regard, and I 
don't think this is the way to go.


Re: D on next-gen consoles and for game development

2013-05-25 Thread Paulo Pinto

Am 25.05.2013 07:52, schrieb Manu:

On 25 May 2013 15:29, deadalnix <deadal...@gmail.com> wrote:

On Saturday, 25 May 2013 at 05:18:12 UTC, Manu wrote:

On 25 May 2013 15:00, deadalnix <deadal...@gmail.com> wrote:

On Saturday, 25 May 2013 at 01:56:42 UTC, Manu wrote:

Understand, I have no virtual-memory manager, it won't
page, it's not a
performance problem, it will just crash if I
mis-calculate this value.


So the GC is kind of out.


Yeah, I'm wondering if that's just a basic truth for embedded.
Can D implement a ref-counting GC? That would probably still be
okay, since
collection is immediate.


This is technically possible, but you said you make few allocations.
So with the tax on pointer write or the reference counting, you'll
pay a lot to collect very few garbages. I'm not sure the tradeoff is
worthwhile.


But it would be deterministic, and if the allocations are few, the cost
should be negligible.


Paradoxically, when you create little garbage, GCs are really good as
they don't need to trigger often. But if you need to add a tax on
each reference write/copy, you'll probably pay more tax than you get
out of it.


They're still non-deterministic though. And unless (even if?) they're
precise, they might leak.

What does ObjC do? It seems to work okay on embedded hardware (although
not particularly memory-constrained hardware).
Didn't ObjC recently reject GC in favour of refcounting?


Yes, but it was mainly because they were not able to get a stable, working 
GC that could cope with the Objective-C code available in the wild. It had 
quite a few issues.


Objective-C reference counting requires compiler and runtime support.

Basically it is based on how Cocoa does reference counting, but instead 
of requiring developers to manually write the [retain], [release] 
and [autorelease] messages, the compiler is able to infer them based on 
Cocoa memory access patterns.

Additionally it makes use of dataflow analysis to remove superfluous use 
of those calls.


There is a WWDC talk on iTunes where they explain that. I can look for 
it if there is interest.


Microsoft did the same thing with their C++/CX language extensions and 
COM for WinRT.


--
Paulo



Re: Why UTF-8/16 character encodings?

2013-05-25 Thread Joakim

On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
I remember those bad ole days of gratuitously-incompatible 
encodings. I
wish those days will never ever return again. You'd get a text 
file in
some unknown encoding, and the only way to make any sense of it 
was to
guess what encoding it might be and hope you get lucky. Not 
only so, the
same language often has multiple encodings, so adding support 
for a
single new language required supporting several new encodings 
and being
able to tell them apart (often with no info on which they are, 
if you're
lucky, or if you're unlucky, with *wrong* encoding type specs 
-- for
example, I *still* get email from outdated systems that claim 
to be

iso-8859 when it's actually KOI8R).

This is an argument for UCS, not UTF-8.

Prepending the encoding to the data doesn't help, because it's 
pretty
much guaranteed somebody will cut-n-paste some segment of that 
data and
save it without the encoding type header (or worse, some 
program will
try to "fix" broken low-level code by prepending a default 
encoding type
to everything, regardless of whether it's actually in that 
encoding or
not), thus ensuring nobody will be able to reliably recognize 
what

encoding it is down the road.
This problem already exists for UTF-8, breaking ASCII 
compatibility in the process:


http://en.wikipedia.org/wiki/Byte_order_mark

Well, at the very least it adds garbage data at the front of the 
stream, just as my header would do. ;)
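
To make the BOM point concrete, a small D sketch (a hypothetical 
helper, not a Phobos function): the optional three bytes EF BB BF 
sit in front of the payload, and any tool that treats the stream as 
plain ASCII-compatible bytes has to know to skip them.

import std.stdio;

// Drop a leading UTF-8 BOM (EF BB BF) if present.
string stripUtf8Bom(string s)
{
    if (s.length >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return s[3 .. $];
    return s;
}

void main()
{
    string withBom = "\xEF\xBB\xBFhello";
    writeln(stripUtf8Bom(withBom));            // hello
    assert(stripUtf8Bom("plain") == "plain");  // untouched without a BOM
}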


For all of its warts, Unicode fixed a WHOLE bunch of these 
problems, and
made cross-linguistic data sane to handle without pulling out 
your hair,
many times over.  And now we're trying to go back to that 
nightmarish

old world again? No way, José!
No, I'm suggesting going back to one element of that "old world," 
single-byte encodings, but using UCS or some other standardized 
character set to avoid all those incompatible code pages you had 
to deal with.


If you're really concerned about encoding size, just use a 
compression
library -- they're readily available these days. Internally, 
the program
can just use UTF-16 for the most part -- UTF-32 is really only 
necessary

if you're routinely delving outside BMP, which is very rare.
True, but you're still doubling your string size with UTF-16 and 
non-ASCII text.  My concerns are the following, in order of 
importance:


1. Lost programmer productivity due to these dumb variable-length 
encodings.  That is the biggest loss from UTF-8's complexity.


2. Lost speed and memory due to using either an unnecessarily 
complex variable-length encoding or because you translated 
everything to 32-bit UTF-32 to get back to constant-width.


3. Lost bandwidth from using a fatter encoding.
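
As a rough illustration of points 2 and 3, a D snippet comparing the 
byte size of the same short Greek string in the three Unicode 
encodings; a hypothetical single-byte Greek encoding such as ISO 
8859-7 would need 8 bytes for it.

import std.conv : to;
import std.stdio;

void main()
{
    string  u8  = "Ελληνικά";                  // 8 Greek letters
    wstring u16 = to!wstring(u8);
    dstring u32 = to!dstring(u8);

    writeln(u8.length  * char.sizeof);         // 16 bytes in UTF-8
    writeln(u16.length * wchar.sizeof);        // 16 bytes in UTF-16
    writeln(u32.length * dchar.sizeof);        // 32 bytes in UTF-32
}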

As far as Phobos is concerned, Dmitry's new std.uni module has 
powerful
code-generation templates that let you write code that operate 
directly
on UTF-8 without needing to convert to UTF-32 first. Well, OK, 
maybe
we're not quite there yet, but the foundations are in place, 
and I'm
looking forward to the day when string functions will no longer 
have
implicit conversion to UTF-32, but will directly manipulate 
UTF-8 using

optimized state tables generated by std.uni.
There is no way this can ever be as performant as a 
constant-width single-byte encoding.


+1.  Using your own encoding is perfectly fine. Just don't do 
that for

data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken 
encoding
issues that used to be rampant on the web before Unicode came 
along.


In the bad ole days, HTML could be served in any random number 
of
encodings, often out-of-sync with what the server claims the 
encoding
is, and browsers would assume arbitrary default encodings that 
for the
most part *appeared* to work but are actually fundamentally 
b0rken.

Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on 
codepage
interpretation, or non-standard characters being used in a 
particular
encoding. It was a total, utter mess, that wasted who knows how 
many
man-hours of programming time to work around. For data 
interchange on
the internet, we NEED a universal standard that everyone can 
agree on.
I disagree.  This is not an indictment of multiple encodings, it 
is one of multiple unspecified or _broken_ encodings.  Given how 
difficult UTF-8 is to get right, all you've likely done is 
replace multiple broken encodings with a single encoding with 
multiple broken implementations.


UTF-8, for all its flaws, is remarkably resilient to mangling 
-- you can
cut-n-paste any byte sequence and the receiving end can still 
make some
sense of it.  Not like the bad old days of codepages where you 
just get
one gigantic block of gibberish. A properly-synchronizing UTF-8 
function
can still recover legible data, maybe with only a few 
characters at the
ends truncated in the worst case. I don't see how any 
codepage-based

