Re: Why UTF-8/16 character encodings?
[ASCII-art banner spelling "WTF"] --jm
Re: D on next-gen consoles and for game development
On Saturday, 25 May 2013 at 05:29:31 UTC, deadalnix wrote: This is technically possible, but you said you make few allocations. So with the tax on pointer write or the reference counting, you'll pay a lot to collect very little garbage. I'm not sure the tradeoff is worthwhile.

Incidentally, I ran across this paper that talks about a reference-counted garbage collector that claims to address this issue. Might be of interest to this group. http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon03Pure.pdf

From the paper: There are two primary problems with reference counting, namely: (1) run-time overhead of incrementing and decrementing the reference count each time a pointer is copied, particularly on the stack; and (2) inability to detect cycles and consequent necessity of including a second garbage collection technique to deal with cyclic garbage. In this paper we present new algorithms that address these problems and describe a new multiprocessor garbage collector based on these techniques that achieves maximum measured pause times of 2.6 milliseconds over a set of eleven benchmark programs that perform significant amounts of memory allocation.
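For illustration, the per-copy cost the paper is attacking looks roughly like this in a naive reference-counted smart pointer. This is only a minimal sketch; the type and all names are hypothetical, not taken from the paper or from D's runtime:

struct RcPtr(T)
{
    private T* payload;
    private size_t* count;

    this(this)                        // postblit: runs on every copy of the pointer
    {
        if (count) ++*count;
    }

    ~this()
    {
        if (count && --*count == 0)
        {
            // free *payload and count here
        }
    }
}

Deferred schemes like the one in the paper avoid exactly these increments and decrements for stack copies, which is where most of them happen.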
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote: 25-May-2013 10:44, Joakim пишет: Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know. UCS is dead and gone. Next in line to "640K is enough for everyone". I think you are confused. UCS refers to the Universal Character Set, which is the backbone of Unicode: http://en.wikipedia.org/wiki/Universal_Character_Set You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to. Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without header text is meaningless (no easy slicing) but the fact that encoding of data after header strongly depends a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there is no combining marks and regional specific stuff :) Everybody is still keeping code pages, UTF-8 hasn't changed that. Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16. I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS. Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?" It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff. ?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :) It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong, that was what I read on the newsgroup sometime back. Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't do any decoding and it does return you a slice of a balance of original. Perhaps substring search doesn't strictly require decoding but you have changed the subject: slicing does require decoding and that's the use case you brought up to begin with. I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule. ??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string? I sketched two possible encodings above, none of which would require "cross-encodings." We want monoculture! That is to understand each without all these "par-le-vu-france?" and codepages of various complexity(insanity). I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;) So you never had trouble of internationalization? What languages do you use (read/speak/etc.)? This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past. 
I can read and speak multiple languages, but I don't use anything other than English text. That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above. UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't. UCS, the character set, as noted above. If that's a myth, Unicode is a myth. :) This is it but it's far more flexible in a sense that it allows multi-linguagal strings just fine and lone full-with unicode codepoints as well. That's only because it uses a more complex header than a single byte for the language, which I noted could be done with my scheme, by adding a more complex header, long before you mentioned this unicode compression scheme. But I get the impression that it's only for sending over the wire, ie transmision, so all the processing issues that UTF-8 introduces would still be there. Use mime-type etc. Standards are always a bit stringy and suboptimal, their acceptance rate is one of chief advantages they have. Unicode has horrifically large momentum now and not a single organization aside from them tries to do this dirty work (=i18n). You misunderstand. I was saying that this unicode compression scheme doesn't help you with string processing, it is only for transmission and is probably fine for that, precisely because it seems to implement so
Re: DLang Spec rewrite (?)
I hasten to add that I don't mean to criticise the original writers of the DLang Spec for writing it in DDoc macros. So far, I've found the documentation fairly easy to follow (as plain text) and so I don't want to lose any of that should the spec be rewritten. It's also possible (although, in my opinion, less preferable) to keep the spec written in DDoc macros but reformatted to allow for easier conversion to other formats...
DLang Spec rewrite (?)
Good afternoon, all, I would still like to compile the D Lang Spec into EPUB (and possibly other formats) but, as we discussed in these threads: http://forum.dlang.org/thread/bsbdpjyjubfxvmecw...@forum.dlang.org http://forum.dlang.org/thread/uzdngvjzexukbgkxd...@forum.dlang.org having the D Lang Specification written in DDoc macros is making it extremely difficult to work with. I ask, therefore, what opposition would there be to me rewriting the DLang Spec files into another format that will be easier to parse and compile for the website, PDF, Latex, eBook and other formats? If the answer is 'minimal', 'go ahead' or 'it's your funeral', then my follow-up question is 'what format would be the easiest to write, debug and maintain?' For greater clarity, I am NOT proposing to rewrite the DDoc-generated library documentation or any other pages outside of the spec. In the makefile, they are defined as the files covered in $(SPEC_ROOT). With regards,
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote: On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote: I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!). Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode That confirms exactly what I just said... You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem. And what's a dchar? Let's check: dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above. Given that all the machine registers are at least 32-bits already it doesn't make the slightest difference. The only additional operations on top of ascii are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get. The only alternatives to a variable width encoding I can see are: - Single code page per string This is completely useless because now you can't concatenate strings of different code pages. - Multiple code pages per string This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8. - String with escape sequences to change code page Can no longer access characters in the middle or end of the string, you have to parse the entire string every time which completely negates the benefit of a fixed width encoding. - An encoding wide enough to store every character This is just UTF-32. Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters). The vast majority of non-english alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant. Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!" ie. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance... - A useful encoding has to be able to handle every unicode character - As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8 - Given the frequency distribution of unicode characters, UTF-8 does a pretty good job at encoding higher frequency characters in fewer bytes. - Yes you COULD encode non-english alphabets in a single byte but doing so would be inefficient because it would mean the more frequently used characters take more bytes to encode.
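For reference, this is what the two iteration modes being discussed look like in D; a small sketch, not Phobos internals:

import std.stdio : writeln;

void main()
{
    string s = "héllo";                 // 'é' takes two UTF-8 code units

    foreach (char c; s)                 // raw code units, no decoding: 6 iterations
        writeln(cast(ubyte) c);

    foreach (dchar c; s)                // decoded code points: 5 iterations
        writeln(c);
}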
Re: Why UTF-8/16 character encodings?
25-May-2013 12:58, Vladimir Panteleev пишет: On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote: This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. For the record, I noticed that programmers (myself included) that had an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points), when it is in fact not so - and that code will work just fine as if it was written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case. +1 BTW regex even with Unicode ranges and case-insensitivity is doable just not easy (yet). Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable. Another plus one. Algorithms defined on code point basis are quite complex so that benefit of not decoding won't be that large. The benefit of transparently special-casing ASCII in UTF-8 is far larger. Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? Also, I don't think this has been posted in this thread. Not sure if it answers your points, though: http://www.utf8everywhere.org/ And here's a simple and correct UTF-8 decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ -- Dmitry Olshansky
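As a concrete example of the ASCII special-casing mentioned above, here is a sketch of a code point counter with an ASCII fast path (the function name is made up for illustration):

import std.utf : decode;

size_t countCodePoints(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        if (s[i] < 0x80)
            ++i;              // ASCII byte: one code unit, no decoding needed
        else
            decode(s, i);     // non-ASCII: decode advances i past the whole sequence
        ++n;
    }
    return n;
}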
Re: Why UTF-8/16 character encodings?
25-May-2013 13:05, Joakim пишет: On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote: I think you stand alone in your desire to return to code pages. Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when. Problem is what you outline is isomorphic with code-pages. Hence the grief of accumulated experience against them. Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC. I'm afraid you and others here seem to mentally translate "single-byte encodings" to "code pages" in your head, then recoil in horror as you remember all your problems with broken implementations of code pages, even though those problems are not intrinsic to single-byte encodings. I'm not asking you to consider this for D. I just wanted to discuss why UTF-8 is used at all. I had hoped for some technical evaluations of its merits, but I seem to simply be dredging up a bunch of repressed memories about code pages instead. ;) Well if somebody get a quest to redefine UTF-8 they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life saver anyway. The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas- XML anyone?- often proliferate until someone comes up with something better to show how dumb they are. Even children know XML is awful redundant shit as interchange format. The hierarchical document is a nice idea anyway. Perhaps it won't be the D programming language that does that, but it would be easy to implement my idea in D, so maybe it will be a D-based library someday. :) Implement Unicode compression scheme - at least that is standardized. -- Dmitry Olshansky
Re: Why UTF-8/16 character encodings?
25-May-2013 10:44, Joakim пишет: On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote: You seem to think that not only UTF-8 is bad encoding but also one unified encoding (code-space) is bad(?). Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know. My problem is with these dumb variable-length encodings, so I was precise in the title. UCS is dead and gone. Next in line to "640K is enough for everyone". Simply put Unicode decided to take into account all diversity of luggages instead of ~80% of these. Hard to add anything else. No offense meant but it feels like you actually live in universe that is 5-7 years behind current state. UTF-16 (a successor to UCS) is no random-access either. And it's shitty beyond measure, UTF-8 is a shining gem in comparison. Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without header text is meaningless (no easy slicing) but the fact that encoding of data after header strongly depends a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there is no combining marks and regional specific stuff :) Everybody is still keeping code pages, UTF-8 hasn't changed that. Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16. Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?" It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff. It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong, that was what I read on the newsgroup sometime back. Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't do any decoding and it does return you a slice of a balance of original. In fact it was even "better" nobody ever talked about header they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ then ...). I understand that people were frustrated with all the code pages out there before UCS standardized them, but that is a completely different argument than my problem with UTF-8 and variable-length encodings. My proposed simple, header-based, constant-width encoding could be implemented with UCS and there go all your arguments about random code pages. No they don't - have you ever seen native Korean or Chinese codepages? Problems with your header based approach are self-evident in a sense that there is no single sane way to deal with it on cross-locale basis (that you simply ignore as noted below). This just shows you don't care for multilingual stuff at all. Imagine any language tutor/translator/dictionary on the Web. For instance most languages need to intersperse ASCII (also keep in mind e.g. HTML markup). 
Books often feature citations in native language (or e.g. latin) along with translations. This is a small segment of use and it would be handled fine by an alternate encoding. ??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string? Now also take into account math symbols, currency symbols and beyond. Also these days cultures are mixing in wild combinations so you might need to see the text even if you can't read it. Unicode is not only "encode characters from all languages". It needs to address universal representation of symbolics used in writing systems at large. I take your point that it isn't just languages, but symbols also. I see no reason why UTF-8 is a better encoding for that purpose than the kind of simple encoding I've suggested. We want monoculture! That is to understand each without all these "par-le-vu-france?" and codepages of various complexity(insanity). I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;) So you never had trouble of internationalization? What languages do you use (read/speak/etc.)? That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above. UCS is a myth as of ~5 years ago.
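For what it's worth, the no-decoding substring search mentioned above can be sketched in a few lines: because UTF-8 lead bytes and continuation bytes occupy disjoint value ranges, a plain byte-level match can only occur on a real code point boundary. indexOfUtf8 is an invented name, not a Phobos function:

import std.algorithm : find;
import std.string : representation;

ptrdiff_t indexOfUtf8(string haystack, string needle)
{
    // Search the raw bytes; no decoding, and the result is a valid slice boundary.
    auto rest = find(haystack.representation, needle.representation);
    return rest.empty ? -1 : cast(ptrdiff_t)(haystack.length - rest.length);
}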
Re: D's limited template specialization abilities compared to C++
On Saturday, 25 May 2013 at 10:46:05 UTC, Ahuzhgairl wrote: Hi, In D, the : in a template parameter list only binds to 1 parameter. There is no way to specialize upon the entire template parameter list. Therefore you can't do much with the pattern matching and it's not powerful. Not a reasonable situation for a language aiming to be only the best. What is needed is the ability to bind to the whole template parameter list: template struct get_class; template struct get_class(C::*)(A...)> { typedef C type; }; Let's shorten the terms: @ And here's how this kind of specialization would work in D: template A[B] { struct C {} } template Foo[alias X, Y, Z @ X[Y].Z] { alias Z Foo; } void main() { alias Foo[A[bool].C] xxx; } You need a separate delimiter besides : which does not bind to individual parameters, but which binds to the set of parameters. I propose @ as the character which shall be the delimiter for the arguments to the pattern match, and the pattern match. On an unrelated note, I don't like the ! thing so I use []. Sorry for the confusion there. z Hi, I obviously don't know D that much, but I assume I do. I have this feature that I can't even show a working example that exists in C++. I also can't come up with any use case, but I know this is mandatory to have. As I assume I know D well enough, I assume I know that this is impossible in D, so I propose an improvement. With that improvement, a new syntax is introduced to support some new feature that is barely defined and it can be used in unknown situation. I also explain myself using my own made up syntax. I don't care if it conflict with other language construct as it is superior anyway.
Re: Best XML Library
I suggest you check the XMLP library by Michael Rynn. I haven't tried XML processing with D, so I don't know how good the different libraries are, but XMLP is on the review queue, which means it's highly possible it will become Phobos' standard XML library, and when that happens you will have an easy migration. That is a good point. I suppose I'll take a look at that and Tango's XML package.
Re: Why UTF-8/16 character encodings?
"Manu" wrote in message news:mailman.137.1369448229.13711.digitalmar...@puremagic.com... >> >> One of the first, and best, decisions I made for D was it would be >> Unicode >> front to back. >> > > Indeed, excellent decision! > So when we define operators for u × v and a · b, or maybe n²? ;) When these have keys on standard keyboards.
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote: Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need do perform UTF-8 decoding. I hope this clears up the misunderstanding :) OK, you got me with this particular special case: it is not necessary to decode every UTF-8 character if you are simply comparing against ASCII space characters. My mixup is because I was unaware if every language used its own space character in UTF-8 or if they reuse the ASCII space character, apparently it's the latter. However, my overall point stands. You still have to check 2-4 times as many bytes if you do it the way Peter suggests, as opposed to a single-byte encoding. There is a shortcut: you could also check the first byte to see if it's ASCII or not and then skip the right number of ensuing bytes in a character's encoding if it isn't ASCII, but at that point you have begun partially decoding the UTF-8 encoding, which you claimed wasn't necessary and which will degrade performance anyway. On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote: I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string. Not being aware of this shortcut doesn't mean not understanding UTF-8. This code will count all spaces in a string whether it is encoded as ASCII or UTF-8: int countSpaces(const(char)* c) { int n = 0; while (*c) if (*c == ' ') ++n; return n; } I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is because UTF-8 is self-synchronising. Not quite. The reason you don't need to decode is because of the particular encoding scheme chosen for UTF-8, a side effect of ASCII backwards compatibility and reusing the ASCII space character; it has nothing to do with whether it's self-synchronizing or not. The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than fixed-width encoding for these operations. It doesn't work the same "for any substring or single character," it works the same for any single ASCII character. Of course it's slower than a fixed-width single-byte encoding. You have to check every single byte of a non-ASCII character in UTF-8, whereas a single-byte encoding only has to check a single byte per language character. There is a shortcut if you partially decode the first byte in UTF-8, mentioned above, but you seem dead-set against decoding. 
;) Again, I urge you, please read up on UTF-8. It is very well designed. I disagree. It is very badly designed, but the ASCII compatibility does hack in some shortcuts like this, which still don't save its performance.
Re: Why UTF-8/16 character encodings?
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote: > On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote: > >On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: > >>>If you want to split a string by ASCII whitespace (newlines, > >>>tabs and spaces), it makes no difference whether the string is > >>>in ASCII or UTF-8 - the code will behave correctly in either > >>>case, variable-width-encodings regardless. > >>Except that a variable-width encoding will take longer to decode > >>while splitting, when compared to a single-byte encoding. > > > >No. Are you sure you understand UTF-8 properly? > Are you sure _you_ understand it properly? Both encodings have to > check every single character to test for whitespace, but the > single-byte encoding simply has to load each byte in the string and > compare it against the whitespace-signifying bytes, while the > variable-length code has to first load and parse potentially 4 bytes > before it can compare, because it has to go through the state > machine that you linked to above. Obviously the constant-width > encoding will be faster. Did I really need to explain this? [...] Have you actually tried to write a whitespace splitter for UTF-8? Do you realize that you can use an ASCII whitespace splitter for UTF-8 and it will work correctly? There is no need to decode UTF-8 for whitespace splitting at all. There is no need to parse anything. You just iterate over the bytes and split on 0x20. There is no performance difference over ASCII. As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code tries to play it safe by decoding every character, this is not necessary in many cases. T -- The best compiler is between your ears. -- Michael Abrash
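To make that concrete, here is a sketch of such a splitter: it only ever compares single bytes against ASCII whitespace and returns slices of the original UTF-8 string, never decoding anything (splitAsciiWhitespace is an invented name):

string[] splitAsciiWhitespace(string s)
{
    static bool isWs(char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }

    string[] parts;
    size_t start = 0;
    foreach (i, char c; s)            // iterates code units; i is a byte index
    {
        if (isWs(c))
        {
            if (i > start) parts ~= s[start .. i];   // slice of the original string
            start = i + 1;
        }
    }
    if (start < s.length) parts ~= s[start .. $];
    return parts;
}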
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote: On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly? Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need do perform UTF-8 decoding. I hope this clears up the misunderstanding :)
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote: int countSpaces(const(char)* c) { int n = 0; while (*c) if (*c == ' ') ++n; return n; } Oops. Missing a ++c in there, but I'm sure the point was made :-)
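For completeness, the corrected loop as a sketch in safe D, over a string instead of a C-style pointer:

int countSpaces(string s)
{
    int n = 0;
    foreach (char c; s)   // byte-wise; 0x20 never occurs inside a multi-byte sequence
        if (c == ' ')
            ++n;
    return n;
}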
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote: On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly? Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string. This code will count all spaces in a string whether it is encoded as ASCII or UTF-8: int countSpaces(const(char)* c) { int n = 0; while (*c) if (*c == ' ') ++n; return n; } I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is because UTF-8 is self-synchronising. The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than fixed-width encoding for these operations. Again, I urge you, please read up on UTF-8. It is very well designed.
Re: DMD under 64-bit Windows 7 HOWTO
On Saturday, 25 May 2013 at 13:24:56 UTC, Rainer Schuetze wrote: On 25.05.2013 15:03, "Sébastien Kunz-Jacques" wrote: On Tuesday, 18 December 2012 at 13:33:03 UTC, Gor Gyolchanyan wrote: I hope I was helpful, because when I started to set up a development environment under 64-bit Windows 7, I went through a lot of problems to get here and I'd love to have this HOWTO at that time. I just tried this with the current beta (May 25, 2.063). It lacks the -m64 option. Was it present in some older beta? -m64 isn't displayed in the usage screen (no idea why it is excluded there), but it is supported as well as -m32 (the default). Thanks for the tip. I had incorrectly put quotes around -m64 -L/NOLOGO, and the resulting error message, unrecognized switch '-m64 -L/NOLOGO', plus the lack of mention of -m64 in the dmd command-line help confused me.
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly? Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu wrote: On 5/25/13 3:33 AM, Joakim wrote: On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote: This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win. When has small ever been slow and large fast? ;) I'm talking about replacing larger data _and_ more computation, ie UTF-8, with smaller data and less computation, ie single-byte encodings, so it is an unmitigated win in that regard. :)
Re: D's limited template specialization abilities compared to C++
2013/5/25 Ahuzhgairl > No, > > struct Foo(T) { > static void f() { writeln("general"); } > } > > struct Foo(T : A(B).alias C, A, B, C) { > static void f() { writeln("special"); } > } > > struct Bar(T) { > struct Baz {} > } > > struct Baz(T : A(B), A, B) { > } > > void main() { > Foo!(Bar!(int).Baz); > Baz!(Bar!(int)); > } > As I already showed, Baz!(Bar!(int)); could work in D. But, currently Foo!(Bar!(int).Baz); is not yet supported. I'm opening a compiler enhancement for a related case, http://d.puremagic.com/issues/show_bug.cgi?id=9022 and right now I updated the compiler patch to allow parameterizing the enclosed type by name/type/alias. https://github.com/D-Programming-Language/dmd/pull/1296 https://github.com/9rnsr/dmd/commit/b29726d30b0094b9e7c2e15f5802501cb686ee68 After it is merged, you can write it as follows.

import std.stdio;

struct Foo(T) { static void f() { writeln("general"); } }

struct Foo(T : A!(B).C, alias A, B, alias C) { static void f() { writeln("special"); } }

struct Bar(T) { struct Baz {} }

void main() {
    Foo!(Bar!(int).Baz) x;
    x.f(); // prints "special"
}

Kenji Hara
Re: D's limited template specialization abilities compared to C++
On Saturday, 25 May 2013 at 12:43:42 UTC, Ahuzhgairl wrote: C++ example, works: template struct A; template class X, class Y> struct A> {}; template struct B; int main() { A> a; } As we've shown, you can do this in D. Instead of template templates, you use alias. But the following does not work: struct Foo {}; template struct B { Foo x; } template struct A; template struct A {} int main() { A<&B::x> a; } It's getting very hard to see what you're trying to do. I think it would help if you used real C++ and D syntax instead of inventing new syntax because I can't tell what you're trying to achieve and what semantics you expect of it. Please post a small example of real, working, compilable C++ that shows what you want to do, and we'll show you how to do it in D (assuming it is possible).
Re: DMD under 64-bit Windows 7 HOWTO
On 25.05.2013 15:03, "Sébastien Kunz-Jacques" wrote: On Tuesday, 18 December 2012 at 13:33:03 UTC, Gor Gyolchanyan wrote: I hope I was helpful, because when I started to set up a development environment under 64-bit Windows 7, I went through a lot of problems to get here and I'd love to have this HOWTO at that time. I just tried this with the current beta (May 25, 2.063). It lacks the -m64 option. Was it present in some older beta? -m64 isn't displayed in the usage screen (no idea why it is excluded there), but it is supported as well as -m32 (the default).
Re: Are people using textmate for D programming?
Where do I put it? Thanks, Andrei http://docs.sublimetext.info/en/latest/extensibility/syntaxdefs.html
Re: Are people using textmate for D programming?
OK, you convinced me to try. But my SublimeText OSX installation does not contain the D.tmPackage file described at https://github.com/alexrp/st2-d. Where do I put it? Thanks, Andrei ST2 and ST3 have built-in D syntax highlighting. ST3 is now in beta, but has improved Mac OS X support. The ST3 beta is for registered users only, but it's worth the money.
Re: DMD under 64-bit Windows 7 HOWTO
On Tuesday, 18 December 2012 at 13:33:03 UTC, Gor Gyolchanyan wrote: Good day, fellow D developers. After spending much time figuring out how to make DMD work fluently under 64-bit Windows 7, I've realized that this is not a trivial task and lots of people might have trouble with this, so I've decided to post my solution, which might save people a lot of time. As we know, there are compatibility problems with 32-bit DMD binaries, because they are compiled using the DMC back-end, which can only produce OMF binaries, so in order to avoid problems with linking against externally compiled libraries, it's much easier to stick to 64-bit binaries, so that DMD will use the Visual Studio linker to produce compatible COFF binaries. Another problem is that 32-bit DMD binaries are linked against obsolete 32-bit WinAPI libraries, which lack some very important functions, while the 64-bit binaries are required to link with the 64-bit libraries supplied by the Windows SDK. And here's how this could be arranged:
1. Prepare your development folder.
1.1. Create a folder with no spaces in its full path.
1.2. Store its full path in the '%DEV_DIR_ROOT%' environment variable.
2. Get the Windows SDK.
2.1. Download the Windows SDK.
2.1.1. Navigate to 'http://msdn.microsoft.com/en-US/windows//bb980924.aspx' in a web browser.
2.1.2. Under section 2 (number '2' in a green circle) click on the bold blue 'Install Now' link.
2.1.3. In the opened window click on the blue 'Download' button at the bottom of the page.
2.1.4. Make sure that the Windows SDK installer ('winsdk_web.exe') is downloaded.
2.2. Install the downloaded Windows SDK.
2.2.1. In a file browser, navigate to the folder where the Windows SDK installer was downloaded.
2.2.2. Double-click on the installer and agree to security warnings to launch it.
2.2.3. Click Next, read and agree to the license until you reach the 'Install Locations' screen.
2.2.4. Store the path under 'Destination Folder for Tools' in the '%DEV_DIR_MSWINSDK%' environment variable (e.g. 'C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A') and click 'Next >'.
2.3.3. On the 'Installation Options' screen, uncheck everything except 'x64 Libraries' and 'Visual C++ Compilers' and click 'Next >'.
2.3.4. Confirm that everything is correct and click 'Next >' to start installing.
2.3.5. Make sure that the installation completed successfully.
2.3.6. Store the path to the installed Visual Studio C++ compiler into the '%DEV_DIR_MSVC%' environment variable (e.g. 'C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC').
3. Get the DMD.
3.1. Navigate to 'http://ftp.digitalmars.com/dmd2beta.zip' in a web browser.
3.2. Make sure that the DMD compiler archive ('dmd2beta.zip') is downloaded.
3.3. Unzip the archive into '%DEV_DIR_ROOT%\Tools', so that the 'dmd2' folder in the archive will end up in '%DEV_DIR_ROOT%\Tools\dmd2'.
3.4. Adapt the compiler configuration to the development environment.
3.4.1. Open the file '%DEV_DIR_ROOT%\Tools\dmd2\windows\bin\sc.ini' in a text editor.
3.4.2. Replace the line with 'LIB=' with the line 'LIB="%DEV_DIR_WINSDK%\Lib\x64";"%DEV_DIR_MSVC%\lib\amd64";"%@P%\..\lib"'.
3.4.3. Add '-m64 -L/NOLOGO' to the 'DFLAGS' variable.
3.4.4. Remove the lines with 'VCINSTALLDIR=' and 'WindowsSdkDir='.
3.4.5. Replace the line with 'LINKCMD64=' with the line 'LINKCMD64="%DEV_DIR_MSVC%\bin\amd64\link.exe"'.
Now "%DEV_DIR_ROOT%\Tools\dmd2\windows\bin\dmd.exe" will always use the Windows SDK libraries and Visual C++ compiler to produce 64-bit COFF binaries.
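After the edits in steps 3.4.1-3.4.5, the [Environment] section of sc.ini might end up looking roughly like the sketch below; the placeholder variables are the ones defined above, the untouched -I flags are elided, and this is not a verified configuration:

[Environment]
LIB="%DEV_DIR_WINSDK%\Lib\x64";"%DEV_DIR_MSVC%\lib\amd64";"%@P%\..\lib"
DFLAGS=<existing -I flags> -m64 -L/NOLOGO
LINKCMD64="%DEV_DIR_MSVC%\bin\amd64\link.exe"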
I hope I was helpful, because when I started to set up a development environment under 64-bit Windows 7, I went through a lot of problems to get here and I'd love to have this HOWTO at that time. I just tried this with the current beta (may 25, 2.063). It lacks the -m64 option. Was it present in some older beta ?
Re: Are people using textmate for D programming?
On 5/25/13 5:08 AM, TommiT wrote: I just tried out Sublime Text 2 and found it to be quite similar but somewhat better than TextMate 2. And there's an improved D syntax highlighter for it at: https://github.com/alexrp/st2-d All the keywords seem to be there, indentation works etc. Sublime Text does from time to time annoy you about buying the license, but luckily there's google. OK, you convinced me to try. But my SublimeText OSX installation does not contain the D.tmPackage file described at https://github.com/alexrp/st2-d. Where do I put it? Thanks, Andrei
Re: D's limited template specialization abilities compared to C++
C++ example, works: template struct A; template class X, class Y> struct A> {}; template struct B; int main() { A> a; } But the following does not work: struct Foo {}; template struct B { Foo x; } template struct A; template struct A {} int main() { A<&B::x> a; } D should be able to do both.
Re: Why UTF-8/16 character encodings?
On 5/25/13 3:33 AM, Joakim wrote: On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote: This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win. Andrei
Re: D's limited template specialization abilities compared to C++
2013/5/25 Ahuzhgairl > Uneditable newsgroups. Simplest case. > > struct Bar(T) {} > > struct Foo(T : A(B), A, B) { > static void f() {} > } > > void main() { > Foo!(Bar!(int)).f(); > } > It would work:

struct Bar(T) {}

struct Foo(T : A!(B), alias A, B) { // 1, 2
    static void f() {}
}

void main() {
    Foo!(Bar!(int)).f();
}

1. You should use A!(B) instead of A(B).
2. A would match a template, so it should be received via a TemplateAliasParameter.

Kenji Hara
Re: D's limited template specialization abilities compared to C++
On Saturday, 25 May 2013 at 12:13:42 UTC, Ahuzhgairl wrote: Uneditable newsgroups. Simplest case. struct Bar(T) {} struct Foo(T : A(B), A, B) { static void f() {} } void main() { Foo!(Bar!(int)).f(); } Two problems with that: 1. A(B) should be A!(B) 2. A won't bind to Bar because Bar is not a type, it is a template. A should be an alias. This works: struct Bar(T) {} struct Foo(T : A!(B), alias A, B) { static void f() {} } void main() { Foo!(Bar!(int)).f(); }
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly?
Re: D's limited template specialization abilities compared to C++
Uneditable newsgroups. Simplest case. struct Bar(T) {} struct Foo(T : A(B), A, B) { static void f() {} } void main() { Foo!(Bar!(int)).f(); }
Re: D's limited template specialization abilities compared to C++
No, struct Foo(T) { static void f() { writeln("general"); } } struct Foo(T : A(B).alias C, A, B, C) { static void f() { writeln("special"); } } struct Bar(T) { struct Baz {} } struct Baz(T : A(B), A, B) { } void main() { Foo!(Bar!(int).Baz); Baz!(Bar!(int)); }
Re: D's limited template specialization abilities compared to C++
Is this what you're looking for?

struct Foo(T) { static void bar() { writeln("general"); } }

struct Foo(T : A[B], A, B) { static void bar() { writeln("special"); } }

void main() {
    Foo!(int).bar();      // general
    Foo!(int[int]).bar(); // special
}
Re: Why UTF-8/16 character encodings?
On 05/25/2013 05:56 AM, H. S. Teoh wrote: On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote: On 5/24/2013 7:16 PM, Manu wrote: So when we define operators for u × v and a · b, or maybe n²? ;) Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet. That would be most awesome! Though it does raise the issue of how parsing would work, 'cos you either have to assign a fixed precedence to each of these operators (and there are a LOT of them in Unicode!), I think this is what eg. fortress is doing. or allow user-defined operators with custom precedence and associativity, This is what eg. Haskell, Coq are doing. (Though Coq has the advantage of not allowing forward references, and hence inline parser customization is straighforward in Coq.) which means nightmare for the parser (it has to adapt itself to new operators as the code is parsed/analysed, It would be easier on the parsing side, since the parser would not fully parse expressions. Semantic analysis would resolve precedences. This is quite simple, and the current way the parser resolves operator precedences is less efficient anyways. which then leads to issues with what happens if two different modules define the same operator with conflicting precedence / associativity). This would probably be an error without explicit disambiguation, or follow the usual disambiguation rules. (trying all possibilities appears to be exponential in the number of conflicting operators in an expression in the worst case though.)
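A minimal sketch of "semantic analysis resolves precedences": the parser keeps an expression as a flat operand/operator list, and a later pass groups it using a (possibly user-defined) precedence table. All names here are invented for illustration, and only left-associative binary operators are handled:

import std.stdio : writeln;

// Precedence climbing over an already-parsed flat expression.
// operands.length == ops.length + 1.
string climb(string[] operands, string[] ops, int[string] prec,
             ref size_t i, int minPrec)
{
    string lhs = operands[i];
    while (i < ops.length && prec[ops[i]] >= minPrec)
    {
        string op = ops[i];
        ++i;                                               // move to the right operand
        string rhs = climb(operands, ops, prec, i, prec[op] + 1);
        lhs = "(" ~ lhs ~ " " ~ op ~ " " ~ rhs ~ ")";
    }
    return lhs;
}

void main()
{
    int[string] prec = ["+": 1, "×": 2, "·": 2];           // user-defined table
    size_t i = 0;
    writeln(climb(["a", "b", "c"], ["+", "×"], prec, i, 0)); // (a + (b × c))
}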
Re: D's limited template specialization abilities compared to C++
By extension, template Foo[X, Y, Z @ X[Y], Y[Z]] { alias Y Foo; }
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev wrote: You don't need to do that to slice a string. I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode? Slicing a string implies finding the N-th code point, what other way would you slice and have it make any sense? Finding the N-th point is much simpler with a constant-width encoding. I'm leaving aside combining characters and those intrinsic language complexities baked into unicode in my previous analysis, but if you want to bring those in, that's actually an argument in favor of my encoding. With my encoding, you know up front if you're using languages that have such complexity- just check the header- whereas with a chunk of random UTF-8 text, you cannot ever know that unless you decode the entire string once and extract knowledge of all the languages that are embedded. For another similar example, let's say you want to run toUpper on a multi-language string, which contains English in the first half and some Asian script that doesn't define uppercase in the second half. With my format, toUpper can check the header, then process the English half and skip the Asian half (I'm assuming that the substring indices for each language would be stored in this more complex header). With UTF-8, you have to process the entire string, because you never know what random languages might be packed in there. UTF-8 is riddled with such performance bottlenecks, all to make if self-synchronizing. But is anybody really using its less compact encoding to do some "self-synchronized" integrity checking? I suspect almost nobody is. If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding. You cannot honestly look at those multiple state diagrams and tell me it's "simple." I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself. Perhaps, maybe decoding is not so bad for the type of people who write the fundamental UTF-8 libraries. But implementation does not merely refer to the UTF-8 libraries, but also all the code that tries to build on it for internationalized apps. And with all the unnecessary additional complexity added by UTF-8, wrapping the average programmer's head around this mess likely leads to as many problems as broken code pages implementations did back in the day. ;)
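For concreteness, here is one way the proposed header-based format might be sketched, as I read the description above; every name and field is invented, nothing like this exists in Phobos, and it glosses over how language runs would be assigned:

struct LangRun
{
    ubyte langId;         // which single-byte character set this run uses
    size_t start, end;    // byte range [start .. end) of the run within data
}

struct HeaderString
{
    LangRun[] runs;       // the "more complex header" for multi-language text
    ubyte[] data;         // one byte per character inside each run
}

// toUpper could then skip runs whose language defines no case, e.g.:
//   foreach (run; s.runs)
//       if (languageHasCase(run.langId))              // hypothetical helper
//           upperCaseRun(s.data[run.start .. run.end], run.langId);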
Re: Any plans to fix Issue 9044? aka Language stability question again
On Saturday, 25 May 2013 at 10:07:29 UTC, Denis Shelomovskij wrote: obviously contradicts my personal very loyal definition (e.g. I have nothing against breaking changes if they are in a good direction). I very much like this definition.
Re: D on next-gen consoles and for game development
On 25.05.2013 03:29, Manu wrote: Win64 works for me out of the box... ? For me dmd produces type names like modulename.typename.subtypename, which causes internal errors within the Visual Studio debugger in some cases. Also, debugging of static / global variables is not possible (even when gshared) because they are also formatted like modulename.variablename. Kind Regards, Benjamin Thaut
D's limited template specialization abilities compared to C++
Hi, In D, the : in a template parameter list only binds to 1 parameter. There is no way to specialize upon the entire template parameter list. Therefore you can't do much with the pattern matching and it's not powerful. Not a reasonable situation for a language aiming to be only the best. What is needed is the ability to bind to the whole template parameter list: template struct get_class; template struct get_class(C::*)(A...)> { typedef C type; }; Let's shorten the terms: @ And here's how this kind of specialization would work in D: template A[B] { struct C {} } template Foo[alias X, Y, Z @ X[Y].Z] { alias Z Foo; } void main() { alias Foo[A[bool].C] xxx; } You need a separate delimiter besides : which does not bind to individual parameters, but which binds to the set of parameters. I propose @ as the character which shall be the delimiter for the arguments to the pattern match, and the pattern match. On an unrelated note, I don't like the ! thing so I use []. Sorry for the confusion there. z
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote: Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? Let's take one you listed above, slicing a string. You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time. A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space. You don't need to do that to slice a string. I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode? If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. You cannot honestly look at those multiple state diagrams and tell me it's "simple." I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself.
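To put the cost being argued about in code: finding the N-th code point in UTF-8 is a linear walk, while a fixed-width encoding indexes directly. A sketch using std.utf.stride; whether "N-th code point" is even the unit you want is the separate combining-character question raised above:

import std.utf : stride;

size_t codePointOffset(string s, size_t n)
{
    size_t i = 0;
    foreach (k; 0 .. n)
        i += stride(s, i);    // code-unit length of the code point starting at i
    return i;                 // byte offset of the n-th code point: O(n)
}

// With a fixed-width encoding the equivalent operation is just: offset = n.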
Re: Why UTF-8/16 character encodings?
This is dumb. You are dumb. Go away.
Any plans to fix Issue 9044? aka Language stability question again
As those of you who do write some non-toy projects in D know, from time to time your projects become unbuildable because of Issue 9044 [1] and you have to juggle with files and randomly copy/move functions from one library to another to "detrigger" the issue, creating a mess marked "Issue 9044 workaround". It becomes really annoying when your one-file project using an external library fails, as it forces you to juggle with that library's files (e.g. VisualD's `cpp2d` project, which triggers the issue randomly). I'd never complain about such things, but the language tends to be called stable by its main maintainers, and I'd like to finally see an official definition of this "stability", as it obviously contradicts my personal very loyal definition (e.g. I have nothing against breaking changes if they are in a good direction). [1] http://d.puremagic.com/issues/show_bug.cgi?id=9044 -- Денис В. Шеломовский Denis V. Shelomovskij
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev wrote: Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable. Combining characters are examples of complexity baked into the various languages, so there's no way around that. I'm arguing against layering more complexity on top, through UTF-8. Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? Let's take one you listed above, slicing a string. You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time. A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space. Also, I don't think this has been posted in this thread. Not sure if it answers your points, though: http://www.utf8everywhere.org/ That seems to be a call to using UTF-8 on Windows, with a lot of info on how best to do so, with little justification for why you'd want to do so in the first place. For example, "Q: But what about performance of text processing algorithms, byte alignment, etc? A: Is it really better with UTF-16? Maybe so." Not exactly a considered analysis of the two. ;) And here's a simple and correct UTF-8 decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ You cannot honestly look at those multiple state diagrams and tell me it's "simple." That said, the difficulty of _using_ UTF-8 is a much bigger than problem than implementing a decoder in a library.
Re: Are people using textmate for D programming?
I just tried out Sublime Text 2 and found it to be quite similar but somewhat better than TextMate 2. And there's an improved D syntax highlighter for it at: https://github.com/alexrp/st2-d All the keywords seem to be there, indentation works etc. Sublime Text does from time to time annoy you about buying the license, but luckily there's google.
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote: I think you stand alone in your desire to return to code pages. Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure. How can a variable-width encoding possibly compete with a constant-width encoding? You have not articulated a reason for this. Do you believe there is a performance loss with variable-width, but that it is not significant and therefore worth it? Or do you believe it can be implemented with no loss? That is what I asked above, but you did not answer. My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb. I see no reason why single-byte encodings wouldn't do a better job at such mixed-language text. You'd just have to have a larger, more complex header or keep all your strings in a single language, with a different format to compose them together for your book. This would be so much easier than UTF-8 that I cannot see how anyone could argue for a variable-length encoding instead. I can't even write an email to Rainer Schütze in English under your scheme. Why not? You seem to think that my scheme doesn't implement multi-language text at all, whereas I pointed out, from the beginning, that it could be trivially done also. Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC. I'm afraid you and others here seem to mentally translate "single-byte encodings" to "code pages" in your head, then recoil in horror as you remember all your problems with broken implementations of code pages, even though those problems are not intrinsic to single-byte encodings. I'm not asking you to consider this for D. I just wanted to discuss why UTF-8 is used at all. I had hoped for some technical evaluations of its merits, but I seem to simply be dredging up a bunch of repressed memories about code pages instead. ;) The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas- XML anyone?- often proliferate until someone comes up with something better to show how dumb they are. Perhaps it won't be the D programming language that does that, but it would be easy to implement my idea in D, so maybe it will be a D-based library someday. :) I'm afraid your quest is quixotic. I'd argue the opposite, considering most programmers still can't wrap their head around UTF-8. If someone can just get a single-byte encoding implemented and in front of them, I suspect it will be UTF-8 that will be considered quixotic. :D
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote: This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. For the record, I noticed that programmers (myself included) that had an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points), when it is in fact not so - and that code will work just fine as if it was written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case. Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable. Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? Also, I don't think this has been posted in this thread. Not sure if it answers your points, though: http://www.utf8everywhere.org/ And here's a simple and correct UTF-8 decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
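For example (a minimal sketch, assuming the grapheme helpers byGrapheme and Grapheme in the new std.uni are available), slicing by code point visibly breaks as soon as a combining mark is involved:

import std.stdio;
import std.range : take;
import std.uni : byGrapheme;

void main()
{
    // "é" written as 'e' + U+0301 (combining acute accent):
    // two code points, one user-perceived character.
    string s = "e\u0301!";

    // Taking the "first character" by code point strips the accent.
    writeln(s.take(1));            // prints just "e"

    // Grapheme-aware iteration keeps the combining mark with its base.
    writeln(s.byGrapheme.front[]); // prints "é"
}

So the constant-width-versus-variable-width question is separate from the question of what unit you should actually be operating on.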
Re: Why UTF-8/16 character encodings?
On 5/25/2013 12:33 AM, Joakim wrote: At what cost? Most programmers completely punt on unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority. I think you stand alone in your desire to return to code pages. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure. My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb. I can't even write an email to Rainer Schütze in English under your scheme. Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC. I'm afraid your quest is quixotic.
Shared libraries in dmd 2.063
What's the official status of shared libraries in dmd 2.063? Is it already deemed stable or can there still be breaking changes for dmd 2.064? I'm asking because I think we should change the default visibility of D functions in shared libraries. We want to encourage platform-independent code, so good code should use the 'export' attribute anyway. Making all symbols public by default, combined with templates, is bad for performance, as it stresses the runtime linker. Look at that gcc page: they managed to create a templated library that takes 6 minutes to load because of this! http://gcc.gnu.org/wiki/Visibility http://software.intel.com/en-us/articles/software-convention-models-using-elf-visibility-attributes http://www.technovelty.org/code/why-symbol-visibility-is-good.html
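For illustration, a minimal sketch of what export-by-intent looks like (the module and function names mylib, publicEntryPoint and helper are made up): only the symbol marked export needs to land in the dynamic symbol table, so the runtime linker has far fewer symbols to resolve at load time.

// mylib.d -- hypothetical library code
module mylib;

export int publicEntryPoint(int x)  // deliberately exposed from the DLL/.so
{
    return helper(x) * 2;
}

private int helper(int x)           // internal detail, no need to export
{
    return x + 1;
}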
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote: I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!). Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode Unicode is: 1) A standardised numbering of a large number of characters 2) A set of standardised algorithms for operating on these characters 3) A set of standardised encodings for efficiently encoding sequences of these characters What makes you think I'm unaware of this? I have repeatedly differentiated between UCS (1) and UTF-8 (3). You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem. And what's a dchar? Let's check: dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above. Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters). The vast majority of non-english alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
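For reference, a minimal sketch of what that per-character translation looks like in practice: iterating with dchar decodes each UTF-8 sequence into a 32-bit code point on the fly (the string itself is never copied to UTF-32, but every step pays the decoding cost), while iterating with char just walks the raw bytes.

import std.stdio;

void main()
{
    string s = "naïve";                   // 6 UTF-8 bytes, 5 code points

    foreach (dchar c; s)                  // decodes on the fly
        writef("U+%04X ", cast(uint) c);
    writeln();                            // U+006E U+0061 U+00EF U+0076 U+0065

    foreach (char b; s)                   // raw code units, no decoding
        writef("%02X ", cast(uint) b);
    writeln();                            // 6E 61 C3 AF 76 65
}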
Re: Low-Lock Singletons In D
On Tuesday, 7 May 2013 at 20:17:43 UTC, QAston wrote: No. A tutorial on memory consistency models would be too long to insert here. I don't know of a good online resource, does anyone? Andrei This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 This was awesome/amazing.
Re: Why UTF-8/16 character encodings?
I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!). Unicode is: 1) A standardised numbering of a large number of characters 2) A set of standardised algorithms for operating on these characters 3) A set of standardised encodings for efficiently encoding sequences of these characters You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem. Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters).
Re: Why UTF-8/16 character encodings?
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote: One of the first, and best, decisions I made for D was it would be Unicode front to back. That is why I asked this question here. I think D is still one of the few programming languages with such unicode support. This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. That was the go-to solution in the 1980's, they were called "code pages". A disaster. My understanding is that code pages were a "disaster" because they weren't standardized and often badly implemented. If you used UCS with a single-byte encoding, you wouldn't have that problem. > with the few exceptional languages with more than 256 characters encoded in two bytes. Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese. Of course, you have to have more than one byte for those languages, because they have more than 256 characters. So there will be no compression gain over UTF-8/16 there, but a big gain in parsing simplicity with a simpler encoding, particularly when dealing with multi-language strings. I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it. Heh, I'm not saying "let's go back to badly defined code pages" just because I'm saying "let's go back to single-byte encodings." The two are separate arguments. UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome. At what cost? Most programmers completely punt on unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
Re: D on next-gen consoles and for game development
On 25.05.2013 03:29, Manu wrote: On 25 May 2013 04:20, Benjamin Thaut <c...@benjamin-thaut.de> wrote: [...] See, I have spent a decade on core tech/engine code meticulously worrying about memory allocation. I don't think a GC is an outright no-go. But we certainly don't have a GC that fits the bill. Given that Android, Windows Phone 7/8 and PS Vita have system languages with GC, it does not seem to bother those developers. Yes, I know that most AAA studios are actually bypassing them and using C and C++ directly, but already having indie developers using D would be a great win. One needs to start somewhere. - Better Windows support. All of the development we do happens on Windows, and most of D's community does not care about Windows support. I'm curious how long it will take until D gets proper DLL support. Yeah, this is partially why I missed the train on game development. I was too focused on FOSS issues, instead of focusing on making a game. -- Paulo
Re: D on next-gen consoles and for game development
On Saturday, 25 May 2013 at 05:52:23 UTC, Manu wrote: But it would be deterministic, and if the allocations are few, the cost should be negligible. You'll pay a tax on pointer write, not on allocations ! It won't be negligible ! They're still non-deterministic though. And unless (even if?) they're precise, they might leak. Not if they are precise. But this is another topic. What does ObjC do? It seems to work okay on embedded hardware (although not particularly memory-constrained hardware). Didn't ObjC recently reject GC in favour of refcounting? ObjC is an horrible three headed monster in that regard, and I don't think this is the way to go.
Re: D on next-gen consoles and for game development
On 25.05.2013 07:52, Manu wrote: On 25 May 2013 15:29, deadalnix <deadal...@gmail.com> wrote: On Saturday, 25 May 2013 at 05:18:12 UTC, Manu wrote: On 25 May 2013 15:00, deadalnix <deadal...@gmail.com> wrote: On Saturday, 25 May 2013 at 01:56:42 UTC, Manu wrote: Understand, I have no virtual-memory manager, it won't page, it's not a performance problem, it will just crash if I mis-calculate this value. So the GC is kind of out. Yeah, I'm wondering if that's just a basic truth for embedded. Can D implement a ref-counting GC? That would probably still be okay, since collection is immediate. This is technically possible, but you said you make few allocations. So with the tax on pointer write or the reference counting, you'll pay a lot to collect very few garbages. I'm not sure the tradeoff is worthwhile. But it would be deterministic, and if the allocations are few, the cost should be negligible. Paradoxically, when you create little garbage, GCs are really good, as they don't need to trigger often. But if you need to add a tax on each reference write/copy, you'll probably pay more tax than you get out of it. They're still non-deterministic though. And unless (even if?) they're precise, they might leak. What does ObjC do? It seems to work okay on embedded hardware (although not particularly memory-constrained hardware). Didn't ObjC recently reject GC in favour of refcounting? Yes, but it was mainly because they could not get a stable working GC able to cope with the Objective-C code available in the wild. It had quite a few issues. Objective-C reference counting requires compiler and runtime support. Basically it is based on how Cocoa does reference counting, but instead of requiring the developers to manually write the [retain], [release] and [autorelease] messages, the compiler is able to infer them based on Cocoa memory access patterns. Additionally it makes use of dataflow analysis to remove superfluous use of those calls. There is a WWDC talk on iTunes where they explain that. I can look for it if there is interest. Microsoft did the same thing with their C++/CX language extensions and COM for WinRT. -- Paulo
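For what it's worth, here is a minimal hand-written sketch of the kind of counting being discussed (the RC type is made up for illustration, not anything in druntime): the postblit plays the role of the compiler-inserted "retain" and the destructor the "release". It is deterministic, but every copy of a handle pays the tax deadalnix mentions.

import core.stdc.stdlib : malloc, free;
import std.stdio;

struct RC(T)
{
    private static struct Payload { T value; size_t count; }
    private Payload* p;

    static RC make(T value)
    {
        RC rc;
        rc.p = cast(Payload*) malloc(Payload.sizeof);
        rc.p.value = value;
        rc.p.count = 1;
        return rc;
    }

    this(this)   // copy: the "retain" the compiler would insert under ARC
    {
        if (p) ++p.count;
    }

    ~this()      // scope exit: the corresponding "release"
    {
        if (p && --p.count == 0)
            free(p);
    }

    ref T get() { return p.value; }
}

void main()
{
    auto a = RC!int.make(42);
    {
        auto b = a;          // count: 1 -> 2
        writeln(b.get());    // 42
    }                        // b destroyed, count: 2 -> 1
    writeln(a.get());        // 42; memory freed when a goes out of scope
}

Note the trade-off: reclamation is immediate and predictable, but cycles are never collected, which is exactly why the Bacon-style hybrid collectors mentioned earlier exist.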
Re: Why UTF-8/16 character encodings?
On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote: I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R). This is an argument for UCS, not UTF-8. Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road. This problem already exists for UTF-8, breaking ASCII compatibility in the process: http://en.wikipedia.org/wiki/Byte_order_mark Well, at the very least adding garbage ASCII data in the front, just as my header would do. ;) For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José! No, I'm suggesting going back to one element of that "old world," single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with. If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare. True, but you're still doubling your string size with UTF-16 and non-ASCII text. My concerns are the following, in order of importance: 1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity. 2. Lost speed and memory due to using either an unnecessarily complex variable-length encoding or because you translated everything to 32-bit UTF-32 to get back to constant-width. 3. Lost bandwidth from using a fatter encoding. As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operate directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni. There is no way this can ever be as performant as a constant-width single-byte encoding. +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along. 
In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess, that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on. I disagree. This is not an indictment of multiple encodings, it is one of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding with multiple broken implementations. UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based