Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 17:44:22 -0400, Nick Sabalausky seewebsitetocontac...@semitwist.com wrote: On 3/7/2014 8:40 AM, Michel Fortin wrote: On 2014-03-07 03:59:55 +, bearophile bearophileh...@lycos.com said: Walter Bright: I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.) On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage). The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness. Well, it is *more* correct, as many western languages are more likely to just work in current Phobos in most cases. It's just that things still aren't completely correct overall. From my experience, I'd suggest these basic operations for a string range instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear

Someone should be able to mix all three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '' or ''. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation). If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants.
Having to do that should make it obvious that there's an inefficiency there: you're using an algorithm that wasn't tailored to work with strings, and more decoding than strictly necessary is being done. I actually like this suggestion quite a bit. +1 Reminds me of my proposal for Rust (https://github.com/mozilla/rust/issues/7043#issuecomment-19187984) -- Marco
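The multi-level API proposed above can be mimicked in Python as a rough illustration (this is not D and not the actual proposal; the names `StrRange`, `front_code_unit`, etc. are invented for this sketch, and the grapheme handling is a simplification of Unicode's full UAX #29 segmentation):

```python
import unicodedata

# Hypothetical sketch of a string wrapper exposing all three access levels.
# Graphemes are approximated by attaching combining marks to their base
# character; a real implementation would follow UAX #29.
class StrRange:
    def __init__(self, s):
        self.data = s.encode("utf-8")   # backing store: UTF-8 code units
        self.text = s                   # Python str: a code-point view

    def front_code_unit(self):
        return self.data[0]             # one UTF-8 byte, as an int

    def front_code_point(self):
        return self.text[0]             # one code point

    def front_grapheme(self):
        g = self.text[0]
        for ch in self.text[1:]:
            if unicodedata.combining(ch):
                g += ch                 # keep combining marks with their base
            else:
                break
        return g

s = StrRange("e\u0301tude")             # 'étude' with a combining acute accent
assert s.front_code_unit() == 101       # the byte for plain 'e'
assert s.front_code_point() == "e"      # the accent is a separate code point
assert s.front_grapheme() == "e\u0301"  # base character plus its accent
```

Making the caller pick one of the three names at each call site is exactly what makes the decoding cost, or its absence, visible in the code.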
Re: Major performance problem with std.array.front()
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote: Is there any hope of fixing this? I agree with Andrei. I don't think that there's really anything to fix. The problem is that there are roughly 3 levels at which string operations can be done:

1. By code unit
2. By code point
3. By grapheme

and which is correct depends on what you're trying to do. Phobos attempts to go for correctness by default without seriously impacting performance, so it treats all strings as ranges of dchar (so, level #2). If we went with #1, then pretty much any algorithm which operated on individual characters would be broken, as unless your strings are ASCII-only, code units are very much the wrong level to be operating on if you're trying to deal with characters. If we went with #3, then we'd have full correctness, but we'd tank performance. With #2, we're far more correct than is typically the case with C++ while still being reasonably performant. And those who want full performance can use immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme support in std.uni. We've gone to great lengths in Phobos to specialize on narrow strings in order to make it more efficient while still maintaining correctness, and anyone who really wants performance can do the same. But by operating on the code point level, we at least get a reasonable level of Unicode-correctness by default. With your suggestion, I'd fully expect most D programs to be wrong with regards to Unicode, because most programmers don't know or care about how Unicode works. And changing what we're doing now would be code breakage of astronomical proportions. It would essentially break all uses of range-based string code. Certainly, it would be the largest code breakage that D has seen in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we need to be darn sure that what we change to is significantly better and worth the huge amount of code breakage that it will cause.
I really don't think that there's any way to get this right. Regardless of which level you operate at by default - be it code unit, code point, or grapheme - it will be wrong a good chunk of the time. So, it becomes a question of which of the three has the best tradeoffs, and I think that our current solution of operating on code points by default does that. If there are things that we can do to better support operating on code units or graphemes for those who want it, then great. And it's great if we can find ways to make operating at the code point level more efficient or less prone to bugs due to not operating at the grapheme level. But I think that operating on the code point level like we currently do is by far the best approach. If anything, it's the fact that the language doesn't do that that's a bigger concern IMHO - the main place where that's an issue being the fact that foreach iterates by code unit by default. But I don't know of a good way to solve that other than treating all arrays of char, wchar, and dchar specially, and disabling their array operations like ranges do, so that you have to convert them to code units via the representation function in order to operate on them as code units - which Andrei has suggested a number of times before, but you've shot him down each time. If that were fixed, then at least we'd be consistent, which is usually the biggest complaint with regards to how D treats strings. But I really don't think that there's a magical fix for range-based string operations, and I think that our current approach is a good one. - Jonathan M Davis
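The three levels described above give three different answers even for a short Western string. A quick Python illustration (the grapheme clustering here is a simplification of full UAX #29 segmentation, good enough for plain combining marks):

```python
import unicodedata

s = "noe\u0301l"                     # "noël" written with a combining acute

# Level 1: code units -- the raw UTF-8 bytes
assert len(s.encode("utf-8")) == 6   # the combining mark alone takes 2 bytes

# Level 2: code points -- what Python 3's str (and a dchar range) sees
assert len(s) == 5

# Level 3: graphemes -- user-perceived characters
graphemes = []
for ch in s:
    if graphemes and unicodedata.combining(ch):
        graphemes[-1] += ch          # attach combining mark to its base
    else:
        graphemes.append(ch)
assert len(graphemes) == 4           # n, o, e+acute, l
```

An algorithm that reverses, truncates, or counts "characters" gives a different (and differently wrong) result depending on which of the three levels it picks.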
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 21:38:06 UTC, Nick Sabalausky wrote: On 3/9/2014 7:47 AM, w0rp wrote: My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.) Python 2 or 3 (out of curiosity)? If you're including Python 3, then that somewhat surprises me, as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.) Late reply here. Python 3 is a lot better in terms of Unicode support than 2. The situation in Python 2 was this:

1. The default string type is 'str', an immutable array of bytes.
2. 'str' could be in one of many encodings, including UTF-16, etc.
3. There is an extra 'unicode' type for when you want a Unicode string.
4. Python implicitly converts between the two, often in wrong ways, often causing exceptions to appear where you didn't expect them to.

In 3, this changed to:

1. The default string type is still named 'str', only now it's like the 'unicode' of olde.
2. 'bytes' is a new immutable array of bytes type, like the Python 2 'str'.
3. Conversion between 'str' and 'bytes' is always explicit.

However, Python 3 works at a code point level (probably some code unit level in fact), and you don't see very many algorithms which take, say, combining characters into account. So Python suffers from similar issues.
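The Python 3 behaviour described above can be demonstrated directly, including the remaining code-point-level limitation (standard Python, nothing D-specific):

```python
import unicodedata

s = "caf\u00e9"                 # 'café' with a precomposed é
b = s.encode("utf-8")           # str -> bytes must be explicit in Python 3
assert b.decode("utf-8") == s   # and so must bytes -> str

# Mixing the two types raises instead of silently converting (unlike Python 2):
try:
    b + s
except TypeError:
    pass                        # expected: no implicit bytes/str coercion
else:
    raise AssertionError("expected TypeError")

# But str still works at the code-point level: composed and decomposed
# forms of the same user-perceived text compare unequal...
assert "caf\u00e9" != "cafe\u0301"
# ...unless you normalize explicitly.
assert unicodedata.normalize("NFC", "cafe\u0301") == "caf\u00e9"
```

So Python 3 removed the implicit-conversion traps of Python 2, but its default string operations still stop one level short of graphemes, which is the same trade-off Phobos makes.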
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote: Ok, I have a plan. Each step will be separated by at least one version:

1. Implement decode() as an algorithm for string types, so one can write: string s; s.decode.algorithm... and suggest that people start doing that instead of: s.algorithm...
2. Emit a warning when people use std.array.front(s) with strings.
3. Deprecate std.array.front for strings.
4. Error for std.array.front for strings.
5. Implement new std.array.front for strings that doesn't decode.

What about this:

1. [as above] Implement decode() as an algorithm for string types, so one can write: string s; s.decode.algorithm... and suggest that people start doing that instead of: s.algorithm...
2. [as above] Emit a warning when people use std.array.front(s) with strings.
3. Implement new std.array.front for strings that doesn't decode, but keep the old one either forever(ish) or until way into D3 (3.03).
4. Deprecate std.array.front for strings (see 3.)
5. Error for std.array.front for strings (see 3.)

I know that one of the rules of D is that warnings should eventually become errors, but there is nothing wrong with waiting longer than a few months before something is an error or removed from the library, especially if it would cause loads of code to break (my own too, I suppose). As long as users are aware of it, they can start to make the transition in their own code little by little. In this case they will make the transition rather sooner than later, because nobody wants to suffer constant performance penalties. So for this particular change I'd suggest to wait patiently until it can finally be deprecated. Is this feasible?
Re: Major performance problem with std.array.front()
On Tuesday, 11 March 2014 at 02:07:19 UTC, Steven Schveighoffer wrote: On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. The array is small and does not escape. It could be allocated on the stack as an optimization.
Re: Major performance problem with std.array.front()
On 3/10/2014 12:23 AM, Walter Bright wrote: On 3/9/2014 9:19 PM, Nick Sabalausky wrote: On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str;
wstring wstr;
dstring dstr;

(str|wstr|dstr).byChar  // Always range of char
(str|wstr|dstr).byWchar // Always range of wchar
(str|wstr|dstr).byDchar // Always range of dchar

str.representation  // Range of ubyte
wstr.representation // Range of ushort
dstr.representation // Range of uint

str.byCodeUnit  // Range of char
wstr.byCodeUnit // Range of wchar
dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3. Do you mean:

1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

Responses:

1. Iterating by code unit: Useful for tweaking performance anytime decoding is unnecessary. For example, parsing a grammar where the bulk of the keywords and operators are ASCII. (Occasional uses of Unicode, like Unicode whitespace, can of course be handled easily enough by the lexer FSM.)
2. 'byCodeUnit' if we have 'representation': This one I have trouble answering, since I'm still unclear on the purpose of 'representation' (I wasn't even aware of it until a few days ago). I've been assuming there's some specific use-case I've overlooked where it's useful to iterate by code unit *while* treating the code units as if they weren't UTF-8/16/32 at all.
But since 'representation' is called *on* a string/wstring/dstring, they should already be UTF-8/16/32 anyway, not some other encoding that would necessitate using integer types. Or maybe it's just for working around problems with the auto-verification being too eager (I've run into those)? I admit I don't quite get 'representation'.

3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a static if chain every time you want to use code units inside generic code. Also, so in non-generic code you can change your data type without updating instances of 'by*char'.
4. Having 'byCodeUnit' work on UTF-32 dstrings: So generic code working on code units doesn't have to special-case UTF-32.
Re: Major performance problem with std.array.front()
On 3/10/2014 12:09 AM, Nick Sabalausky wrote: On 3/10/2014 12:23 AM, Walter Bright wrote: On 3/9/2014 9:19 PM, Nick Sabalausky wrote: On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str;
wstring wstr;
dstring dstr;

(str|wstr|dstr).byChar  // Always range of char
(str|wstr|dstr).byWchar // Always range of wchar
(str|wstr|dstr).byDchar // Always range of dchar

str.representation  // Range of ubyte
wstr.representation // Range of ushort
dstr.representation // Range of uint

str.byCodeUnit  // Range of char
wstr.byCodeUnit // Range of wchar
dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3. Do you mean:

1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

(3) 3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a static if chain every time you want to use code units inside generic code. Also, so in non-generic code you can change your data type without updating instances of 'by*char'. Just not sure I see a use for that.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote: With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there. This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos. Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way. self-imposed limitation For the greater good. I find this article very telling about why strings should be converted to UTF-8 as often as possible: http://www.utf8everywhere.org/ I agree 100% with its content; it's impossibly hard to have sane handling of encodings on Windows (even more so in a team) if not following the drastic rules the article exposes. This happens to be what Phobos gently mandates; UTF validation is certainly the lesser evil compared to the mess that everything becomes without it. How is mandating valid UTF-8 being overly pedantic? This is the sanest behaviour. Just use sanitizeUTF8 (http://vibed.org/api/vibe.utils.string/sanitizeUTF8) or an equivalent.
Re: Major performance problem with std.array.front()
I'm not sure I understood the point of this (long) thread. The main problem is that decode() is called even when it's not needed? Well, in that case it's not a problem only for strings. I found this problem also when I was writing other ranges, for example when I read binary data from a db stream. Front represents a single row, and I decode it every time even if it's not needed. On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[]. I have strongly objected to these proposals on the grounds that:

1. It is a MAJOR performance problem to do this.
2. Very, very few manipulations of strings ever actually need decoded values.
3. D is a systems/native programming language, and systems/native programming languages must not hide the underlying representation (I make similar arguments about proposals to make ints issue errors on overflow, etc.).
4. Users should choose when decode/encode happens, not the language.

and I have been successful at heading these off. But one slipped by me. See this in std.array:

@property dchar front(T)(T[] a) @safe pure
    if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
    size_t i = 0;
    return decode(a, i);
}

What that means is that if I implement an algorithm that accepts, as input, an InputRange of char's, it will ALWAYS try to decode it. This means that even: from.copy(to) will decode 'from', and then re-encode it for 'to'. And it will do it SILENTLY. The user won't notice, and he'll just assume that D performance sux. Even if he does notice, his options to make his code run faster are poor.
If the user wants decoding, it should be explicit, as in: from.decode.copy(encode!to) The USER should decide where and when the decoding goes. 'decode' should be just another algorithm. (Yes, I know that std.algorithm.copy() has some specializations to take care of this. But these specializations would have to be written for EVERY algorithm, which is thoroughly unreasonable. Furthermore, copy()'s specializations only apply if BOTH source and destination are arrays. If just one is, the decode/encode penalty applies.) Is there any hope of fixing this?
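The cost Walter describes for from.copy(to) can be mimicked in Python: a plain code-unit copy versus a silent decode/re-encode round trip (an illustrative analogue, not the Phobos code; no timings are claimed here, but the second version does strictly more work per character for an identical result):

```python
# Both functions produce identical output, but the second pays a full UTF-8
# decode and a full re-encode even though copying never needed either.
data = ("h\u00e9llo w\u00f6rld " * 1000).encode("utf-8")

def copy_code_units(src):
    return bytes(src)                           # straight byte-for-byte copy

def copy_via_decode(src):
    return src.decode("utf-8").encode("utf-8")  # silent round trip

assert copy_code_units(data) == copy_via_decode(data)
```

The output being byte-identical is exactly why the decode is pure waste in this case: the user never asked for code points, and never sees them.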
Re: Major performance problem with std.array.front()
On 3/10/2014 6:21 AM, ponce wrote: On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote: Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way. self-imposed limitation For the greater good. I find this article very telling about why strings should be converted to UTF-8 as often as possible: http://www.utf8everywhere.org/ I agree 100% with its content; it's impossibly hard to have sane handling of encodings on Windows (even more so in a team) if not following the drastic rules the article exposes. I may have missed it, but I don't see where it says anything about validation or immediate sanitation of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard-to-read font...)
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 11:04:43 UTC, Nick Sabalausky wrote: I may have missed it, but I don't see where it says anything about validation or immediate sanitation of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard-to-read font...) I should have highlighted it; their recommendations for proper encoding handling on Windows are in section 5 (How to do text on Windows). One of them is: "std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise)." I find it interesting that D tends to enforce this lesson learned from mixed-encoding codebases.
Re: Major performance problem with std.array.front()
On 3/9/2014 11:27 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of graphemes, it was still a design decision made of win. Care to argue that? It's simple: Breaking things on all non-English languages is worse than breaking things only on non-western[1] languages. It's still breakage, and that *is* bad, but there's no question which breakage is significantly larger. [1] (And yes, I realize "western" is a gross over-simplification here. The point is one working language vs several working languages.)
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu wrote: On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for it, it will become extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Such as giving up on that crappy language that keeps on breaking their code. Andrei That was more of an "if you are crazy enough to even consider such breakage, this is closer to my personal ideal" than an actual proposal ;)
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 19:43:57 UTC, Walter Bright wrote: On 3/7/2014 7:03 AM, Dicebot wrote: 1) It is a huge breakage and you have been refusing to do one even for more important problems. What is behind this sudden change of mind?

1. Performance Performance Performance Not important enough. D has always been a safe-by-default, fast-when-asked-to language, not the other way around. There is no fundamental performance problem here, only a lack of knowledge about Phobos.

2. The current behavior is surprising (it sure surprised me, I didn't notice it until I looked at the assembler to figure out why the performance sucked) That may imply that better documentation is needed. You were only surprised because of a wrong initial assumption about what the `char[]` type means.

3. Weirdnesses like ElementEncodingType ElementEncodingType is extremely annoying, but I think it is just a side effect of the bigger problem of how string algorithms are handled currently. It does not need to be that way.

4. Strange behavior differences between char[], char*, and InputRange!char types Again, there is nothing strange about it. `char[]` is a special type with special semantics that is defined in the documentation, and it consistently follows that definition in all but raw array indexing/slicing (which is what I find unfortunate but also beyond feasible fixing).

5. Funky anomalous issues with writing OutputRange!char (the put(T) must take a dchar) Bad, but not worth even a small breaking change.

2) lack of convenient .raw property which will effectively do cast(ubyte[]) I've done the cast as a workaround, but when working with generic code it turns out the ubyte type becomes viral - you have to use it everywhere. So all over the place you're having casts between ubyte <-> char in unexpected places. You also wind up with ugly ubyte -> dchar casts, with the commensurate risk that you goofed and have a truncation bug. Of course it is viral.
Because you never ever want to have char[] at all if you don't work with Unicode (or work with it on the raw byte level). And in that case it is your responsibility to do manual decoding when appropriate. Trying to dish out that performance often means going low-level with all the associated risks; there is nothing special about char[] here. It is not a common use case. Essentially, the auto-decode makes trivial code look better, but if you're writing a more comprehensive string processing program, and care about performance, it makes a regular ugly mess of things. And this is how it should be. Again, I am all for creating a language that favors performance-critical power programming needs over common/casual needs, but that is not what D is, and you have been making such choices consistently over quite a long time now (array literals that allocate, I will never forgive that). Suddenly changing your mind only because you have encountered this specific issue personally, as opposed to just reports, does not fit the role of a language author. It does not really matter if any new approach itself is good or bad - being unpredictable is reputation damage D simply can't afford.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote: I'm not sure I understood the point of this (long) thread. The main problem is that decode() is called even when it's not needed? I'd like to offer up one D 'user' perspective; it's just a single data point, but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points. So, my needs as a 'user' are:

* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points.

... you get the drift. If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held. BTW, to answer a question in the thread: yes, the data is stored left-to-right and visualised right-to-left.
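The code-point semantics this user asks for are essentially what Python 3's str already provides; a short Arabic sample makes the point (illustrative only, using meem + fatha + dal, where the fatha is a combining diacritic):

```python
s = "\u0645\u064e\u062f"            # meem + fatha (a diacritic) + dal

# length is the number of code points, not bytes
assert len(s) == 3
# indexing returns the nth code point; here, the combining fatha itself
assert s[1] == "\u064e"
# the raw UTF-8 representation is longer (each code point here is 2 bytes)
assert len(s.encode("utf-8")) == 6
```

Note that code-point indexing exposes diacritics as separate elements, which is exactly what lets this kind of application normalize inconsistently-sequenced input.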
Re: Major performance problem with std.array.front()
In Italian we need Unicode too. We have several accented letters, and often programming languages don't handle UTF-8 and other encodings so well... In D I never had any problem with this, and I work a lot on text processing. So my question: is there any problem I'm missing in D with Unicode support, or is it just a performance problem in algorithms? If the problem is performance in algorithms that use .front() but don't care to understand its data, why don't we add a .rawFront() property to be implemented only where it makes sense, and then a fallback like:

auto rawFront(R)(R range)
    if ( ... isrange ... !__traits(compiles, range.rawFront))
{
    return range.front;
}

In this way, in copy() and other algorithms we can use rawFront(), and it's backward compatible with other ranges too. But I guess I'm missing the point :) On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote: I'm not sure I understood the point of this (long) thread. The main problem is that decode() is called even when it's not needed? I'd like to offer up one D 'user' perspective; it's just a single data point, but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points. So, my needs as a 'user' are:

* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points ... you get the drift. If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held. BTW to answer a question in the thread, yes the data is left-to-right and visualised right-to-left.
Re: Major performance problem with std.array.front()
On 07.03.2014 03:37, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. After reading many of the attached posts, the question is: what could be D's future process for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote: On 07.03.2014 03:37, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. After reading many of the attached posts, the question is: what could be D's future process for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++. Historically, 2 approaches have been practiced:

1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary

I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote: On 07.03.2014 03:37, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. After reading many of the attached posts, the question is: what could be D's future process for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++. I'm a newbie here, but I've been waiting for D to mature for a long time. D IMHO has to stabilise now because:

* D needs a bigger community, so that the big fish who have learnt the ins and outs don't get bored and move on due to lack of kudos etc.
* To get the bigger community, D needs more _working_ libraries for major toolkits (GUI etc.).
* Libraries will cease to work if there is significant change in D, and then can stay broken because there isn't the inertial mass of other developers to maintain them after the initial developer has moved on. You can see that this has happened a LOT.
* Anyway, the D that I read about in TDPL is already very exciting for programmers like myself; we just want that, thanks.

Breaking changes can go into D3, if and whenever that is. Keep breaking D2 now and it risks just being forevermore a playpen for computer scientist types. Anyway, who cares what I think, but I think it reflects a lot of people's opinions too.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote: Historically, 2 approaches have been practiced: 1) argue a lot and then do nothing 2) suddenly change something and tell users it was necessary These are one and the same, just from two opposing points of view. I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned. You mean the whole policy on breaking changes?
Re: Major performance problem with std.array.front()
Historically, two approaches have been practiced: 1) argue a lot and then do nothing. This happens (I think) because Andrei and Walter really value yours and other experts' opinions, but nevertheless have to preserve the general way things work, to preserve the long-term future of D. They have to be open to persuasion, but it would have to be very compelling to get them to change basics now - it seems to me. D is at that difficult 90% stage that we all know about, where the boring, difficult stuff is left to do. People like to discuss interesting new stuff, which at the time seems oh-so-important.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:27:02 UTC, Vladimir Panteleev wrote: On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote: Historically, two approaches have been practiced: 1) argue a lot and then do nothing; 2) suddenly change something and tell users it was necessary. These are one and the same, just from two opposing points of view. /sarcasm :) I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned. You mean the whole policy on breaking changes? Yes. I have given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. This is exactly why I was so shocked that Walter even started this thread. If breaking changes are actually considered (rare or not), then it is absolutely critical to define the process for them and put a link to its description on the dlang.org front page.
Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 14:05:03 +, Andrea Fontana nos...@example.com wrote: In Italian we need Unicode too. We have several accented letters, and programming languages often don't handle UTF-8 and other encodings so well... In D I never had any problem with this, and I work a lot on text processing. So my question: is there any problem I'm missing in D's Unicode support, or is it just a performance problem in the algorithms? The only real problem, apart from the potential performance issues mentioned in this thread, is that indexing/slicing is done with code units. I think this: auto index = countUntil(...); auto slice = str[0 .. index]; is really the only problem with the current implementation. If we could start from scratch, I'd say we keep operating on code points by default but don't make strings arrays of char/wchar/dchar. Instead they should be special types which do all operations (especially indexing and slicing) on code points. This would be as safe as the current implementation and always consistent, but probably even slower in some cases. Then offer some nice way to get the raw data for algorithms which can deal with it. However, I think it's too late to make these changes.
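The countUntil-then-slice mismatch described above can be made concrete (a sketch of mine; the string is illustrative):

```d
import std.algorithm : countUntil;

void main()
{
    string s = "héllo";
    // countUntil auto-decodes, so it returns the number of *code points*
    // before the first 'l', which is 2 ('h' and 'é').
    auto index = s.countUntil('l');
    assert(index == 2);
    // Slicing, however, works on *code units*. 'é' occupies two UTF-8
    // code units, so the first 'l' actually starts at byte offset 3.
    assert(s[0 .. 3] == "hé");
    // s[0 .. index] would therefore cut 'é' in half, yielding invalid UTF-8.
}
```

The two indexing schemes silently disagree as soon as the string contains a non-ASCII character.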
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 13:18:50 UTC, Dicebot wrote: On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu wrote: On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for it, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Such as giving up on that crappy language that keeps on breaking their code. Andrei That was more about whether you are crazy enough to even consider such breakage; this is closer to my personal idea of perfection than an actual proposal ;) BTW, I don't believe it would be that bad, because there's a straightforward path of deprecation: First, std.range.front for narrow strings (and dchar, for consistency) can be marked as deprecated. The deprecation message can say: "Please specify .byCodePoint()/.byCodeUnit()", guiding users towards a better style (assuming one agrees that explicit is indeed better than implicit in this case). After some time, the functionality can be moved into a compatibility module, with the deprecated functions still there, but now additionally telling the user about the quick fix of importing that module. The deprecation period can be very long, and even if the functions should never be removed, at least everyone writing new code would do so in the new style.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: My app deals with Unicode Arabic text that is 'out there', and the Unicode™ support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points. So, my needs as a 'user' are:
* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string, it should be the number of code points.
* When I index my string, it should return the nth code point.
* When I manipulate my strings, I want to work with code points.
... you get the drift. Are you sure that code points are what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote: On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: My app deals with unicode arabic text that is 'out there', and the UnicodeTM support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points. So, my needs as a 'user' are: * I want to encode all incoming data immediately into unicode, usually UTF8, if isn't already. * I want to iterate over code points. I don't care about the raw data. * When I get the length of my string it should be the number of code points. * When I index my string it should return the nth code point. * When I manipulate my strings I want to work with code points ... you get the drift. Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters... I checked the terminology before posting so I'm pretty sure. Arabic has a code page for the logical characters, one code point for each letter of the alphabet and others for various diacritics. Because of the 'shaping' each logical character has various glyphs, found on other code pages. Text editing programs tend to store typed Arabic as the user entered it, and because there can be more than one diacritic per alphabetic letter the sequence varies as to how the user sequenced them.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote: On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: My app deals with unicode arabic text that is 'out there', and the UnicodeTM support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points. So, my needs as a 'user' are: * I want to encode all incoming data immediately into unicode, usually UTF8, if isn't already. * I want to iterate over code points. I don't care about the raw data. * When I get the length of my string it should be the number of code points. * When I index my string it should return the nth code point. * When I manipulate my strings I want to work with code points ... you get the drift. Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters... Adding to my other comment I don't expect a string type to understand arabic and merge the diacritics for me. In fact there are other symbols (code points) that can also be present, for instance instructions on how Quranic text is to be read. These issues have not been standardised and I would say are not well understood generally.
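The gap between code points and user-perceived characters that this subthread circles around can be shown with std.uni.byGrapheme. This is a sketch of mine; the particular Arabic letter and vowel mark are just an illustrative pair:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // Arabic letter beh (U+0628) followed by the combining vowel mark
    // fatha (U+064E): two code points, one user-perceived character.
    string s = "\u0628\u064E";
    assert(s.walkLength == 2);            // iterating by code point
    assert(s.byGrapheme.walkLength == 1); // iterating by grapheme
}
```

Any code-point-level indexing or length will therefore disagree with what an Arabic reader would call "the nth character".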
Re: Major performance problem with std.array.front()
On 3/7/2014 8:40 AM, Michel Fortin wrote: On 2014-03-07 03:59:55 +, bearophile bearophileh...@lycos.com said: Walter Bright: I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.) On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage). The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness. Well, it is *more* correct, as many western languages are more likely in current Phobos to just work in most cases. It's just that things still aren't completely correct overall. From my experience, I'd suggest these basic operations for a string range instead of the regular range interface: .empty .frontCodeUnit .frontCodePoint .frontGrapheme .popFrontCodeUnit .popFrontCodePoint .popFrontGrapheme .codeUnitLength (aka length) .codePointLength (for dchar[] only) .codePointLengthLinear .graphemeLengthLinear Someone should be able to mix all three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser, for instance, I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character against an ASCII one such as '<' or '>'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation). If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants.
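A rough sketch of how such an interface could look for UTF-8 strings. The type name and implementation details are mine; only the operation names follow the post:

```d
import std.utf : decodeFront;

struct StringRange
{
    string data;

    @property bool empty() { return data.length == 0; }
    @property char frontCodeUnit() { return data[0]; }
    @property dchar frontCodePoint()
    {
        auto tmp = data;        // copy so the original is not consumed
        return tmp.decodeFront();
    }
    void popFrontCodeUnit() { data = data[1 .. $]; }
    void popFrontCodePoint()
    {
        data.decodeFront();     // decodeFront advances the range it is given
    }
    @property size_t codeUnitLength() { return data.length; }
}

void main()
{
    auto r = StringRange("é<x");
    assert(r.frontCodePoint == 'é');  // decoded when needed
    r.popFrontCodePoint();
    assert(r.frontCodeUnit == '<');   // no decoding for ASCII matching
}
```

Each call site states which level it operates on, which is exactly the explicitness the post argues for.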
Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done. I actually like this suggestion quite a bit.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote: Yes. I have given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. So at what point are we going to discuss these things in the context of D-next? These topics have us group up and focus on compromises instead of ideals. As was said, D2 is at the 90% point. It only has room left for bug fixes. I think we would make much more productive use of our time and minds coming up with ideas that actually have a chance of coming to fruition, even if D3 ends up being half a decade away.
Re: Major performance problem with std.array.front()
On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
Re: Major performance problem with std.array.front()
On 3/10/2014 7:35 PM, Yota wrote: On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote: Yes. I have given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. So at what point are we going to discuss these things in the context of D-next? Not until (at least) the D2/Phobos implementations mature, the current issues get worked out, and the library/tool ecosystem grows and matures.
Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations. -Steve
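A small sketch (mine, not from the thread) of the distinction as it stands today: a fixed-size array can be initialized from a literal with runtime values without touching the heap, while a dynamic array literal allocates:

```d
void foo(int v, int w)
{
    int[2] x = [v, w]; // fixed-size: elements stored in place, no GC heap needed
    assert(x[0] == v && x[1] == w);

    int[] y = [v, w];  // dynamic array literal: allocates on the GC heap
    assert(y.length == 2 && y[0] == v);
}

void main()
{
    foo(3, 4);
}
```

The `int[2]` form is the workaround available when the allocation is unwanted.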
Re: Major performance problem with std.array.front()
On 3/10/14, 7:07 PM, Steven Schveighoffer wrote: On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. It actually can, seeing as x is a dead assignment :o). That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations. Fixing that: int[] foo(int v, int w) { return [v, w]; } This one would allocate. But analyses of varying complexity may eliminate a variety of allocation patterns. Andrei
Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 22:56:22 -0400, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 3/10/14, 7:07 PM, Steven Schveighoffer wrote: On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. It actually can, seeing as x is a dead assignment :o). Actually, it can't do anything, seeing as it's invalid code ;) That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations. Fixing that: int[] foo(int v, int w) { return [v, w]; } This one would allocate. But analyses of varying complexity may eliminate a variety of allocation patterns. I think you are missing what I'm saying, I don't want the allocation eliminated, but if we eliminate some allocations with [] and not others, it will be confusing. The path I'd always hoped we would go in was to make all array literals immutable, and make allocation of mutable arrays on the heap explicit. Adding eliding of some allocations for optimization is good, but I (and I think possibly Dicebot) think all array literals should not allocate. -Steve
Re: Major performance problem with std.array.front()
On 3/10/14, 8:05 PM, Steven Schveighoffer wrote: I think you are missing what I'm saying, I don't want the allocation eliminated, but if we eliminate some allocations with [] and not others, it will be confusing. The path I'd always hoped we would go in was to make all array literals immutable, and make allocation of mutable arrays on the heap explicit. Adding eliding of some allocations for optimization is good, but I (and I think possibly Dicebot) think all array literals should not allocate. I think so too. But that's irrelevant because arrays do allocate (at least behave as if they did) and that's how the cookie crumbles. D is a wonderful language, and is getting better literally by day. There is a lot more in using it in new and interesting ways, than in brooding about its inevitable imperfections. Andrei
Re: Major performance problem with std.array.front()
On 3/7/2014 6:33 PM, H. S. Teoh wrote: On Fri, Mar 07, 2014 at 11:13:50PM +, Sarath Kodali wrote: On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote: +1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit ddhrya http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search. - Sarath Oops, incomplete reply ... Since a single alphabet in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster. Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-( I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that by code unit is default. For better or worse, that ship has sailed. Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way: Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point. So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether, and work with graphemes. But that would be massive involvement. Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point, then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization, then neither by code unit, by code point, nor by grapheme is correct (except in certain language subsets). If you don't care about normalization, then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? To those who think the status quo is better: can you give an example of a real-life use case that demonstrates this? I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.
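A concrete instance of the normalization caveat being raised here, as a sketch using std.uni.normalize (the strings are illustrative):

```d
import std.uni : NFC, normalize;

void main()
{
    string a = "\u00E9";  // 'é' as a single precomposed code point
    string b = "e\u0301"; // 'e' followed by a combining acute accent
    // Identical to a reader, but unequal under code-point comparison:
    assert(a != b);
    // Only after normalization do the two compare equal:
    assert(normalize!NFC(a) == normalize!NFC(b));
}
```

So a code-point-level equality or search is only correct if both sides are known to be in the same normalization form.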
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote: I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that by code unit is default. For better or worse, that ship has sailed. Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way: Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point. So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :) I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but after thinking about the problem for a while I would agree on a solution along the lines of what you have suggested. I think Vladimir is definitely right when he's saying that when you have algorithms that deal with natural languages, simply working on the basis of a code unit isn't enough. I think it is also true that you need to select a particular algorithm for dealing with strings of characters, as there are many different algorithms you can use for different languages which behave differently, perhaps several in a single language. I also think Andrei is right when he is saying we need to minimise code breakage, and that the string decoding and encoding by default isn't the biggest of performance problems. I think our best option is to offer a function which creates a range in std.array for getting a range over raw character data, without decoding to code points.
myArray.someAlgorithm; // std.array.front used today, with decode calls
myArray.rawData.someAlgorithm; // new range which doesn't decode
Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.
myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // range of strings, maybe a range of ranges of characters, not dchars
Or even specialise the new algorithm so it looks for arrays and turns them into the ranges for you via the transformation myArray -> myArray.rawData:
myArray.byNaturalSymbol!SomeIndianEncodingHere;
Honestly, I'd leave the details of such an algorithm to Vladimir and not myself, because he's spent far more time looking into Unicode processing than I have. My knowledge of Unicode pretty much just comes from having to deal with foreign-language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.) This new set of algorithms taking settings for different encodings could first be implemented in a third-party library, tested there, and eventually submitted to Phobos, probably in std.string. There's my input, I'll duck before I'm beheaded.
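The hypothetical rawData above is close to what std.string.representation already provides: a view of the underlying code units that generic algorithms will not auto-decode. A sketch of mine:

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    string s = "héllo";
    assert(s.walkLength == 5);                // auto-decoded: 5 code points
    assert(s.representation.walkLength == 6); // raw ubyte[]: 6 UTF-8 code units
    assert(s.length == 6);                    // .length already counts code units
}
```

Algorithms applied to `s.representation` see plain bytes and pay no decoding cost.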
Re: Major performance problem with std.array.front()
- In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring. With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. For the other cases, ubyte[] is there.
Re: Major performance problem with std.array.front()
On 09/03/14 04:26, Andrei Alexandrescu wrote: 2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF-16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point one can just use str.byChar. This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)? Unit. s.byChar.front is a (possibly ref, possibly qualified) char. So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? In which case it seems to me a better solution -- safe strings by default, with an unsafe speed-focused solution available if you want it. ("Safe" here in the more general sense of "doesn't generate unexpected errors" rather than memory safety.)
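For reference, std.utf nowadays provides ranges along exactly these lines; a sketch of how they behave:

```d
import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar;

void main()
{
    string s = "héllo"; // 'é' is two UTF-8 code units
    assert(s.walkLength == 5);         // default: decoded code points
    assert(s.byChar.walkLength == 6);  // raw UTF-8 code units, no decoding
    assert(s.byWchar.walkLength == 5); // on-the-fly transcoding to UTF-16
    assert(s.byDchar.walkLength == 5); // explicit decoding to dchar
}
```

Iterating via byChar is the decoding-free path being discussed here.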
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote: On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether, and work with graphemes. But that would be massive involvement. Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point, then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization, then neither by code unit, by code point, nor by grapheme is correct (except in certain language subsets). If you don't care about normalization, then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. IMO, the normalization argument is overrated. I've yet to encounter a real-world case of a normalization problem: only hand-written counter-examples. Not saying it doesn't exist, just that: 1. It occurs only in special cases that the program should be aware of beforehand. 2. Arguably, it should be taken care of eagerly, or in a special pass. As for "the belief that iterating by code point has utility": I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.
AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to. AFAIK, the most common algorithm, case-insensitive search, *must* decode. There may still be cases where it does not work as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating by code units. To turn it the other way around: *what* are you guys doing that doesn't require decoding, and where performance is such a killer? To those who think the status quo is better: can you give an example of a real-life use case that demonstrates this? I do not know of a single bug report in regards to buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs from attempting to not decode strings, or from incorrectly decoding strings. They are being corrected on a continuous basis. Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least* the result is not a corrupted UTF-8 stream. Walter keeps grinding on about myCharArray.put('é') not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work. In particular, in all these cases, a simple call to representation will deactivate the feature, giving you the tools you want.
I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page. Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding. I think opt-out of decoding is just a much much much saner approach to string handling.
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 04:11:15 UTC, Nick Sabalausky wrote: What about this?: Anywhere we currently have a front() that decodes, such as your example: @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[])) { assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof); size_t i = 0; return decode(a, i); } We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange. Then we provide two functions: auto decode(someStringProtoRange) {...} auto raw(someStringProtoRange) {...} These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine decode/raw would probably also handle any length property (if it exists in the protorange) accordingly. This way, the user is forced to specify myStringRange.decode or myStringRange.raw as appropriate; otherwise myStringRange can't be used, since it isn't technically a range, only a protorange. (Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.) Strings can be iterated over by code unit, code point, grapheme, grapheme cluster (?), words, sentences, lines, paragraphs, and potentially other things. Therefore, it makes sense to require the same for ranges of dchar, too. Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni.
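A condensed sketch of the 'decode' half of this proposal; the names follow the post, but the implementation details are mine ('raw' here simply exposes the code units):

```d
import std.utf : decode;

struct Decoded
{
    string s; // the "protorange" payload

    @property bool empty() { return s.length == 0; }
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i); // decode one code point without consuming it
    }
    void popFront()
    {
        size_t i = 0;
        decode(s, i);        // find the code point's width in code units
        s = s[i .. $];
    }
}

auto decoded(string s) { return Decoded(s); }
auto raw(string s) { return cast(immutable(ubyte)[]) s; }

void main()
{
    auto r = "é!".decoded;
    assert(r.front == 'é');
    r.popFront();
    assert(r.front == '!');
    assert("é!".raw.length == 3); // 'é' is two code units, plus '!'
}
```

Until one of the two adapters is applied, the payload offers no front/popFront, so generic range algorithms refuse to touch it - which is the point.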
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to much the already existing `byGrapheme` in std.uni. There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings). `byCodeUnit` is essentially std.string.representation.
Re: Major performance problem with std.array.front()
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote: The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others. If I'm operating on a right-to-left language like Hebrew, what would I expect the result to be from something like countUntil? The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei I'm pretty sure that all string operations are actually front to back. If I recall correctly, even languages that read right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of display, and changes nothing in the code. As for countUntil, it would still work perfectly fine, as an RTL reader would expect the counting to start at the beginning, e.g. the right side. I'm pretty confident RTL is 100% supported. The only issue is the front/left ambiguity, and the only one I know of is the oddly named stripLeft function, which actually does a stripFront anyway. So I wouldn't worry about RTL. But as mentioned, it is languages like Indian ones, that have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e'). On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether, and work with graphemes. But that would be massive involvement.
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote: On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev Can we look at some example situations that this will break? Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string? This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
Re: Major performance problem with std.array.front()
On 2014-03-09 13:00:45 +, monarch_dodra monarchdo...@gmail.com said: AFAIK, the most common algorithm case insensitive search *must* decode. Not necessarily. While the unicode collation algorithms (which should be used to compare text) are defined in terms of code points, you could build a collation element table using code units as keys and bypass the decoding step for searching the table. I'm not sure if there would be a significant performance gain though. That remains an optimization though. The natural way to implement a Unicode algorithm is to base it on code points. -- Michel Fortin michel.for...@michelf.ca http://michelf.ca
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote: On Fri, Mar 07, 2014 at 10:35:46PM +, Sarath Kodali wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote: [...] Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization. So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit. +1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit ddhrya http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search. [...] That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said character may be a multibyte code point, or, in your case, multiple code points. Since that's the case, we might as well just forget about the distinction between character and string, and treat all such operations as substring operations (even if the operand is supposedly just 1 character long). This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding. That won't work, because your needle might be in a different normalization form than your haystack, thus a byte-by-byte comparison will not be able to find it.
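The Sanskrit ddhrya example above can be made concrete. A Python sketch (the composition shown is my assumption of the conjunct's code points: da, virama, dha, virama, ra, virama, ya) showing that one "character" is seven code points, so finding it is necessarily a substring search:

```python
# ddhrya conjunct: assumed composition da + virama + dha + virama + ra + virama + ya
ddhrya = "\u0926\u094D\u0927\u094D\u0930\u094D\u092F"
assert len(ddhrya) == 7            # one perceived character, seven code points

# Searching for this "character" is a substring search, not a char search
text = "XX" + ddhrya + "YY"
assert ddhrya in text
assert text.find(ddhrya) == 2
```

This supports the point that character search and substring search collapse into the same operation once multi-code-point characters are in play.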
Re: Major performance problem with std.array.front()
On 2014-03-09 14:12:28 +, Marc Schütz schue...@gmx.net said: That won't work, because your needle might be in a different normalization form than your haystack, thus a byte-by-byte comparison will not be able to find it. The core of the problem is that sometimes this byte-by-byte comparison is exactly what you want; when searching for some terminal character(s) in some kind of parser for instance. Other times you want to do a proper Unicode search using Unicode comparison algorithms; when the user is searching for a particular string in a text document for instance. The former is very easy to do with the current API. But what's the API for the latter? And how to make the correct API the obvious choice depending on the use case? These two questions are what this thread should be about. Although not unimportant, performance of std.array.front() and whether it should decode is a secondary issue in comparison. -- Michel Fortin michel.for...@michelf.ca http://michelf.ca
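The needle-vs-haystack normalization mismatch is straightforward to show. A minimal Python sketch (illustrative only; a real Unicode search would also need collation and locale handling, which this deliberately omits; `unicode_find` is a hypothetical helper name): normalizing both sides to the same form before comparing makes the byte-level search find what a raw comparison misses.

```python
import unicodedata

haystack = "caf\u00e9"        # "café" with precomposed é (NFC)
needle = "e\u0301"            # "é" typed as e + combining acute (NFD)

# A raw code-unit/code-point comparison misses the match entirely
assert needle not in haystack

# Hypothetical normalization-aware search: normalize both sides first
def unicode_find(haystack, needle, form="NFC"):
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle))

assert unicode_find(haystack, needle) == 3   # finds the é
```

This is the "proper Unicode search" half of the dichotomy; the raw byte comparison remains the right tool for the parser-delimiter case.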
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote: IMO, the normalization argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that: 1. It occurs only in special cases that the program should be aware of beforehand. 2. Arguably, it should be taken care of eagerly, or in a special pass. As for the belief that iterating by code point has utility. I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. We don't handle code points (when have you ever wanted to handle a combining character separately from the character it combines with?) You are just thinking of a subset of languages and locales. Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to. Searching, equality testing, copying, sorting, hashing, splitting, joining... I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default). AFAIK, the most common algorithm, case-insensitive search, *must* decode.
But it must also normalize and take locales into account, so by code point is insufficient (unless you are willing to ignore languages like Turkish). See Turkish I. http://en.wikipedia.org/wiki/Turkish_I Sure, if you just want to ignore normalization and several languages then by code point is just fine... but that's the point: by code point is incorrect in general. There may still be cases where it is not working as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating with code units. To turn it the other way around, *what* are you guys doing, that doesn't require decoding, and where performance is such a killer? Searching, equality testing, copying, sorting, hashing, splitting, joining... The performance thing can be fixed in the library, but my concern is (a) it takes a significant amount of code to do so (b) it complicates implementations. There are many, many algorithms in Phobos that are special-cased for strings, and I don't think it needs to be that way. To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs from attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis. Can you provide a link to a bug? Also, you haven't answered the question :-) Can you give a real-life example of a case where code point decoding was necessary where code units wouldn't have sufficed? You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
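The Turkish I problem referenced above can be sketched quickly. A Python illustration (chosen only because the casing tables are language-independent; this is not a claim about any D API): locale-blind lowercasing maps I to i, which is wrong for Turkish, and even locale-independent full case mapping is not one code point to one code point.

```python
# English/ASCII casing: I <-> i
assert "I".lower() == "i"

# Turkish has dotless ı (U+0131) and dotted İ (U+0130); in a Turkish
# locale, lower('I') should be 'ı' -- locale-blind lowering gets it wrong
assert "I".lower() != "\u0131"

# Full case mapping is not even 1:1 per code point:
# İ (U+0130) lowercases to 'i' followed by U+0307 (combining dot above)
assert "\u0130".lower() == "i\u0307"
assert len("\u0130".lower()) == 2
```

So a case-insensitive search that walks one code point at a time and compares lowered code points is already incorrect for some languages, independent of the code-unit vs code-point question.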
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote: On 3/8/14, 8:24 PM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote: What exactly is the consensus? From your wiki page I see One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type. I can tell you straight out: That will not happen for as long as I'm working on D. Why? From the cycle going in circles: because I think the breakage is way too large compared to the alleged improvement. All right. I was wondering if there was something more fundamental behind such an ultimatum. In fact I believe that that design is inferior to the current one regardless. I was hoping we could come to an agreement at least on this point. --- BTW, a thought struck me while thinking about the problem yesterday. char and dchar should not be implicitly convertible between one another, or comparable to the other. void main() { string s = "Привет"; foreach (c; s) assert(c != 'Ñ'); } Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type. This would break more[1] code, but it would avoid the silent failures of the earlier proposal. [1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:47:26 UTC, Marc Schütz wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Andrei has made it clear that the code breakage this would involve would be unacceptable.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote: On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote: On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev Can we look at some example situations that this will break? Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string? This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead. Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote: - In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring. With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there. This is an arbitrary self-imposed limitation caused by the choice of how strings are handled in Phobos.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote: The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others. If I'm operating on a right to left language like Hebrew, what would I expect the result to be from something like countUntil? The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei I'm pretty sure that all string operations are actually front to back. If I recall correctly, even languages that read right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of display, and changes nothing in the code. As for countUntil, it would still work perfectly fine, as an RTL reader would expect the counting to start at the beginning, i.e. the right side. I'm pretty confident RTL is 100% supported. The only issue is the front/left ambiguity, and the only one I know of is the oddly named stripLeft function, which actually does a stripFront anyway. So I wouldn't worry about RTL. Yeah, I think RTL strings are preceded by a code point that indicates RTL display. It was just something I mentioned because some operations might be confusing to the programmer. But as mentioned, it is languages like the Indian ones, which have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e'). True. I still question why anyone would want to do character-based operations on Unicode strings. I guess substring searches could even end up with the same problem in some cases if not implemented specifically for Unicode, for the same reason, but those should be far less common.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinitely* better than C++'s char-based strings. While imperfect in terms of graphemes, it was still a design decision made of win. Care to elaborate? I'd be tempted not to ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether and work with graphemes. But that would require massive involvement. As has been discussed, this does not make sense. Graphemes are also a concept which applies only to certain writing systems; all it would do is exchange one set of tradeoffs for another, without solving anything. Text isn't that simple.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote: As for the belief that iterating by code point has utility. I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. But you don't deal with Unicode. You deal with *text*. Unless you are implementing Unicode algorithms, code points solve nothing in the general case. Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Sorting a string has quite limited use in the general case, so I think this is another artificial example. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least* the result is not a corrupted UTF-8 stream. I think this is no worse than having all the combining marks clustered at the end of the string, attached to the last non-combining letter.
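The corrupted-UTF-8-stream point above is worth seeing concretely. A Python sketch (again used only because the byte-level behavior is language-independent; the string is a made-up example): sorting the raw code units of a multi-byte string tears an encoded character apart and yields invalid UTF-8, whereas sorting by code point at least keeps the result well-formed, even if graphemes get scrambled.

```python
s = "cass\u00e9"                       # "cassé" in NFC; é encodes as 2 UTF-8 bytes

# Sorting the UTF-8 code units separates é's lead and continuation bytes
scrambled = bytes(sorted(s.encode("utf-8")))
try:
    scrambled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
assert not still_valid                 # no longer valid UTF-8

# Sorting by code point keeps the text well-formed (if semantically dubious)
by_code_point = "".join(sorted(s))
assert by_code_point.encode("utf-8").decode("utf-8") == by_code_point
```

This is the asymmetry being argued: code-point iteration can be semantically wrong, but code-unit mutation can be structurally wrong.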
Re: Major performance problem with std.array.front()
Vladimir Panteleev: Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Sorting a string has quite limited use in the general case, It seems I am sorting arrays of mutable ASCII chars often enough :-) Some time ago I even asked for a helper function: https://d.puremagic.com/issues/show_bug.cgi?id=10162 Bye, bearophile
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 16:02:55 UTC, bearophile wrote: Vladimir Panteleev: Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Sorting a string has quite limited use in the general case, It seems I am sorting arrays of mutable ASCII chars often enough :-) What do you use this for? I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code.
Re: Major performance problem with std.array.front()
Vladimir Panteleev: What do you use this for? For lots of different reasons (counting, testing, histograms, to unique-ify, to allow binary searches, etc), you can find alternative solutions for every one of those use cases. I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code. So far I have needed to sort 7-bit ASCII chars. Bye, bearophile
Re: Major performance problem with std.array.front()
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: On 09/03/14 04:26, Andrei Alexandrescu wrote: 2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point can just use str.byChar. This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)? Unit. s.byChar.front is a (possibly ref, possibly qualified) char. So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 4:34 AM, Peter Alexander wrote: I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o). If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? split(ter) comes to mind. I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page. Awesome. Andrei
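The "counting characters" dispute hinges on the fact that each level of abstraction gives a different count for the same text. A small Python sketch (illustrative only; the grapheme count is stated as a comment because stdlib Python, like the Phobos of this thread, has no UAX #29 grapheme segmenter built in):

```python
import unicodedata

# é in decomposed (NFD) form: 'e' followed by U+0301 combining acute
e_nfd = unicodedata.normalize("NFD", "\u00e9")

assert len(e_nfd.encode("utf-8")) == 3   # 3 code units (UTF-8 bytes)
assert len(e_nfd) == 2                   # 2 code points
# 1 grapheme -- but counting graphemes requires a UAX #29
# segmentation implementation, which the stdlib does not provide
```

Whichever count a "length" function returns, it answers only one of three different questions, which is why the by-code-unit/by-code-point/by-grapheme split keeps coming up in this thread.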
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
Re: Major performance problem with std.array.front()
On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Such as giving up on that crappy language that keeps on breaking their code. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 6:34 AM, Jakob Ovrum wrote: On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings). noice `byCodeUnit` is essentially std.string.representation. Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte. Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 15:23:57 UTC, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote: On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote: On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev Can we look at some example situations that this will break? Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string? This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead. Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array. This was under the assumption that Nick's proposal (and my amendment to extend it to dchar because of graphemes e.a.) would be implemented. But I made the mistake of replying to posts as I read them, just to notice a few posts later that someone else already posted something to the same effect, or that made my point irrelevant. Sorry for the confusion.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu wrote: On 3/9/14, 4:34 AM, Peter Alexander wrote: I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o). It depends what you mean by cover :-) If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? I can't think of any case where you would want to count characters. * If you want an index to slice from, then you need code units. * If you want a buffer size, then you need code units. 
* If you are doing something like word wrapping then you need to count glyphs, which is not the same as counting code points (and that only works with mono-spaced fonts anyway -- with variable-width fonts you need to add up the widths of those glyphs) To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? split(ter) comes to mind. splitter is just an application of substring search, no? Substring search works the same with both code units and code points (e.g. strstr in C works with UTF-encoded strings without any need to decode). All you need to do is ensure that mismatched encodings in the delimiter are re-encoded (you want to do this for performance anyway): auto splitter(string str, dchar delim) { char[4] enc; return splitter(str, enc[0..encode(enc, delim)]); }
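The encode-the-needle trick in the splitter snippet above rests on UTF-8 being self-synchronizing: an encoded needle can never match the middle of another character. A Python sketch of the same idea (hypothetical example strings; Python's bytes operations stand in for D's code-unit-level search):

```python
# Re-encode the single-character delimiter, then search/split raw UTF-8
# bytes with no decoding of the haystack at all.
haystack = "h\u00e9llo w\u00f6rld".encode("utf-8")   # "héllo wörld"
needle = "\u00f6".encode("utf-8")                    # "ö" as UTF-8 bytes

# UTF-8 is self-synchronizing: no false positive can start mid-character
assert haystack.find(needle) != -1

parts = haystack.split(needle)
assert len(parts) == 2
assert parts[0].decode("utf-8") == "h\u00e9llo w"
assert parts[1].decode("utf-8") == "rld"
```

This is why substring-style operations need no per-element decoding, which is the performance argument being made.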
Re: Major performance problem with std.array.front()
On 3/9/14, 9:02 AM, bearophile wrote: Time ago I have even asked for a helper function: https://d.puremagic.com/issues/show_bug.cgi?id=10162 I commented on that and preapproved it. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar. Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 10:34 AM, Peter Alexander wrote: If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. But others such as edit distance or equal(some_string, some_wstring) will not. If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c = c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? I can't think of any case where you would want to count characters. wc (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: wc What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical. (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) I agree, which is why I think that although such arguments are not unwelcome, it's much better to find out by experiment. Break something in Phobos and see how much of your code is affected :)
Re: Major performance problem with std.array.front()
09-Mar-2014 21:45, Andrei Alexandrescu wrote: On 3/9/14, 10:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar. Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. copy to begin with. And it's about 80x faster with plain arrays. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
09-Mar-2014 21:16, Andrei Alexandrescu wrote: On 3/9/14, 4:34 AM, Peter Alexander wrote: I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. Was clearly meant to be: code point -- code unit One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. Code points help only in so far that many (~all) high-level algorithms in Unicode are described in terms of code points. Code points have properties; code units do not have anything. Code points with assigned semantic value are abstract characters. It's up to the programmer to implement a particular algorithm to make it as if decoding really happened, working directly on code units, or to do decoding and work with code points, which is simpler. Current std.uni offerings mostly work on code points and decode; a crucial building block to work directly on code units is in review: https://github.com/D-Programming-Language/phobos/pull/1685 I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o). Technically most apps just assume, say, that input comes in UTF-8 in normalization form C. Others, such as browsers, strive to get a uniform representation of any input, and do normalization of any input (oftentimes normalization turns out to be just a no-op). 
If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? Counting chars is dubious. But, for instance, collation is defined in terms of code points. Regex pattern matching is _defined_ in terms of code points (even the mystical level 3 Unicode support of it). So there is certain merit to working at that level. But hacking it to be this way isn't the way to go. The least intrusive change would be to generalize the current choice w.r.t. RA ranges of char/wchar. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 10:34 AM, Peter Alexander wrote: If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. But others such as edit distance or equal(some_string, some_wstring) will not. equal(string, wstring) should either not compile, or would be overloaded to do the right thing. In an ideal world, char, wchar, and dchar should not be comparable. Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is \r\n really two edits? What is an edit?) I can't think of any case where you would want to count characters. wc % echo € | wc -c 4 :-) (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work. Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire.
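The `wc -c` quip above is worth unpacking: `wc -c` counts bytes, i.e. UTF-8 code units, not characters. The euro sign is one code point but three code units, and `echo` appends a newline, which is where the 4 comes from. A quick Python check of the same arithmetic (Python only for illustration):

```python
# The euro sign U+20AC: one code point, three UTF-8 code units.
euro = "\u20ac"
assert len(euro) == 1                           # code points
assert len(euro.encode("utf-8")) == 3           # UTF-8 code units
# echo appends '\n', so `echo € | wc -c` reports 4 bytes.
assert len((euro + "\n").encode("utf-8")) == 4
```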
Re: Major performance problem with std.array.front()
On 3/9/14, 8:18 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote: On 3/8/14, 8:24 PM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote: What exactly is the consensus? From your wiki page I see One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type. I can tell you straight out: That will not happen for as long as I'm working on D. Why? From the cycle going in circles: because I think the breakage is way too large compared to the alleged improvement. All right. I was wondering if there was something more fundamental behind such an ultimatum. It's just factual information with no drama attached (i.e. I'm not threatening to leave the language, just plainly explaining I'll never approve that particular change). That said, a larger explanation is in order. There have been cases in the past when our community has worked itself into a froth over a non-issue and ultimately caused a language change imposed by the faction that shouted the loudest. The lazy keyword and recently the virtual keyword come to mind as cases in which the language leadership has been essentially annoyed into making a change it didn't believe in. I am all about listening to the community's needs and desires. But at some point there is a need to stick to one's guns in matters of judgment call. See e.g. https://d.puremagic.com/issues/show_bug.cgi?id=11837 for a very recent example in which reasonable people may disagree but at some point you can't choose both options. What we now have works as intended. As I mentioned, there is quite a bit more evidence that the design is useful to people than detrimental. Unicode is all about code points. Code units are incidental to each encoding. The fact that we recognize code points at language and library level is, in my opinion, a Good Thing(tm).
I understand that doesn't reach the ninth level of Nirvana and there are still issues to work on, and issues where good-looking code is actually incorrect. But I think we're overall in good shape. A regression from that to code unit level would be very destructive. Even a clear slight improvement that breaks backward compatibility would be destructive. So I wanted to limit the potential damage of this discussion. It is made only more dangerous by the fact that Walter himself started it, something that others didn't fail to tune into. The sheer fact that we got to contemplate an unbelievably massive breakage on no other evidence than one misuse case and for the sake of a possibly illusory improvement - that's a sign we need to grow up. We can't go about changing the language like this and aim to play in the big leagues. In fact I believe that that design is inferior to the current one regardless. I was hoping we could come to an agreement at least on this point. Sorry to disappoint. --- BTW, a thought struck me while thinking about the problem yesterday. char and dchar should not be implicitly convertible between one another, or comparable to the other. I think only the char -> dchar conversion works, and I can see arguments against it. Also comparison of char with dchar is dicey. But there are also cases in which it's legitimate to do that (e.g. assign ASCII chars etc) and this would be a breaking change. One good way to think about breaking changes is: if this change were executed to perfection, how much would that improve the overall quality of D? Because breakages _are_ overall - users don't care whether they come from this or the other part of the type system. Really puts things into perspective.
void main() { string s = "Привет"; foreach (c; s) assert(c != 'Ñ'); } Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type. This would break more[1] code, but it would avoid the silent failures of the earlier proposal. [1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change. Generally I think we should steer away from slight improvements of the language at the cost of breaking existing code. Instead, we must think of ways to improve the language without the breakage. You may want to pursue (bugzilla + pull request) adding the std.conv routines with the semantics you mentioned. Andrei
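The foreach example above is the crux of the char/dchar comparison problem: 'Ñ' is U+00D1, and the raw byte 0xD1 also occurs as a UTF-8 lead byte inside the Cyrillic word, so a code-unit-level scan reports a match that does not exist at the code-point level. The same effect, sketched in Python over the encoded bytes (illustration only, not the D semantics themselves):

```python
# 'Ñ' is U+00D1; the byte 0xD1 also appears inside the UTF-8
# encoding of "Привет" (it is the lead byte of 'р', U+0440).
text = "Привет"
units = text.encode("utf-8")
assert ord("Ñ") == 0xD1
assert 0xD1 in units        # spurious "match" at the code-unit level
assert "Ñ" not in text      # no match at the code-point level
```

This is why comparing a code unit with a dchar silently does the wrong thing rather than failing to compile.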
Re: Major performance problem with std.array.front()
On 3/9/14, 11:14 AM, Dmitry Olshansky wrote: 09-Mar-2014 21:45, Andrei Alexandrescu wrote: On 3/9/14, 10:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar. Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. copy to begin with. And it's about 80x faster with plain arrays. Question is if there are a bunch of them. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 11:19 AM, Peter Alexander wrote: On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 10:34 AM, Peter Alexander wrote: If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. But others such as edit distance or equal(some_string, some_wstring) will not. equal(string, wstring) should either not compile, or would be overloaded to do the right thing. These would be possible designs each with its pros and cons. The current design works out of the box across all encodings. It has its own pros and cons. Puts in perspective what should and shouldn't be. In an ideal world, char, wchar, and dchar should not be comparable. Probably. But that has nothing to do with equal() working. Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is \r\n really two edits? What is an edit?) Nothing philosophical - it's as cut and dried as it gets. An edit is as defined by the Levenshtein algorithm using code points as the unit of comparison. I can't think of any case where you would want to count characters. wc % echo € | wc -c 4 :-) Noice. (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work. That's a good enhancement for the current design as well - care to submit a request for it? 
Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire. The part that advocates a breaking change will not indeed lead anywhere. The parts where we improve Unicode support for D is very fertile. Andrei
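The edit-distance point above ("an edit is as defined by the Levenshtein algorithm using code points as the unit of comparison") is easy to make concrete: the same two strings get a different distance depending on whether the unit of comparison is a code point or a UTF-8 code unit. A Python sketch with a textbook Levenshtein implementation (illustration only):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance over any two sequences
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

a, b = "cass\u00e9", "casse"   # "cassé" (precomposed é) vs "casse"
assert levenshtein(a, b) == 1                                   # by code point
assert levenshtein(a.encode("utf-8"), b.encode("utf-8")) == 2   # by code unit
```

So the unit of comparison genuinely changes the answer, which is the sense in which code-point-level definitions carry real semantic weight.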
Re: Major performance problem with std.array.front()
09-Mar-2014 07:53, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote: I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion about what exactly we are counting here. This was debated... people should not be looking at individual code points, unless they really know what they're doing. Should they be looking at code units instead? No. They should only be looking at substrings. This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that. Basically the whole thread is about: how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done? The current situation is bad in that it undermines writing decode-less generic code. One easily falls into the auto-decode trap on the first .front, especially when called from some standard algorithm. The algo sees char[]/wchar[] and gets into decode mode via some special case. If it would do that with _all_ char/wchar random access ranges it'd be at least consistent. That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear. -- Dmitry Olshansky
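The "encode the needle, don't decode the haystack" idea above can be sketched in a few lines. Python shown purely to illustrate the encoding-level trick, not any Phobos API:

```python
# To find a dchar needle in UTF-8 text without decoding the haystack:
# encode the needle once, then do a plain substring search over the
# code units.
haystack = "cass\u00e9 encore".encode("utf-8")
needle = "\u00e9".encode("utf-8")
assert needle == b"\xc3\xa9"
assert needle in haystack
```

This works because UTF-8 is self-synchronizing: the encoded needle can never match starting in the middle of another character's byte sequence, so the byte-level search is exactly as correct as the decoded one.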
Re: Major performance problem with std.array.front()
09-Mar-2014 21:54, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: wc What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical. Technically it could use a word-breaking algorithm for words. Or count grapheme clusters, or count code points; it all may have value, depending on the user and writing system. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote: 09-Mar-2014 07:53, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote: I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion about what exactly we are counting here. This was debated... people should not be looking at individual code points, unless they really know what they're doing. Should they be looking at code units instead? No. They should only be looking at substrings. This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that. That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points. Basically the whole thread is about: how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done? The current situation is bad in that it undermines writing decode-less generic code. s/undermines writing/makes writing explicit/ One easily falls into the auto-decode trap on the first .front, especially when called from some standard algorithm. The algo sees char[]/wchar[] and gets into decode mode via some special case. If it would do that with _all_ char/wchar random access ranges it'd be at least consistent. That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear. We're engineers so we should quantify. Ideally that would be as simple as git grep isNarrowString|wc -l which currently prints 42 of all numbers :o). Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation. Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 14:57:32 UTC, Peter Alexander wrote: You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account. I don't understand what your argument is. Is it "by code point is not 100% correct, so let's just drop it and go for raw code units instead"? We *are* arguing about whether or not front/popFront should decode by dchar, right...? You mention the algorithms "Searching, equality testing, copying, sorting, hashing, splitting, joining..." I said by codepoint is not correct, but I still think it's a hell of a lot more accurate than by codeunit. Unless you want to ignore any and all algorithms that take a predicate? You say "unless you are willing to ignore languages like Turkish", but... If you don't decode front, then aren't you just ignoring *all* languages that basically aren't English? As I said, maybe by codepoint is not correct, but if it isn't, I think we should be moving further *into* the correct behavior by default, not away from it.
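The Turkish point deserves a concrete example: case mapping is neither a per-code-point substitution nor locale-independent. A Python sketch of two well-known cases (illustration of the Unicode rules, not of any D API):

```python
# German 'ß' case-folds to the two-character string "ss", so a folded
# string can be longer than the original.
assert "\u00df".casefold() == "ss"

# Turkish dotted capital 'İ' (U+0130): under the default (non-Turkish)
# rules it lowercases to 'i' plus U+0307 COMBINING DOT ABOVE -- two
# code points. A correct Turkish-locale mapping would give plain 'i'.
assert "\u0130".lower() == "i\u0307"
assert len("\u0130".lower()) == 2
```

So even flawless code-point iteration is not enough for case-insensitive matching; you additionally need the full case-folding tables and, for some languages, locale data.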
Re: Major performance problem with std.array.front()
09-Mar-2014 22:41, Andrei Alexandrescu wrote: On 3/9/14, 11:34 AM, Dmitry Olshansky wrote: This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that. That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points. Yup. It's still not a good idea to introduce this in std.algorithm in a non-generic way. That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear. We're engineers so we should quantify. Ideally that would be as simple as git grep isNarrowString|wc -l which currently prints 42 of all numbers :o). Add to that some uses of isSomeString and ElementEncodingType. 138 and 80 respectively. And in most cases it means that nice generic code was hacked to care about 2 types in particular. That is what bothers me. Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation. Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation. 1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a narrow string. Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now. 2. Likewise representation must be made something more explicit say byCodeUnit and work on any isNarrowString per above. The opposite of that is byCodePoint. 3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe? 4. We lack lots of good stuff from the Unicode standard. Some recently landed in std.uni. We need many more, and to deprecate the crappy ones in std.string (e.g. wrapping text is one). 5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16.
That together with 1, should IMHO solve most of our problems. 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote: Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation. Now you're talking. 1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a narrow string. Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now. Wait, why is dchar[] a narrow string? 2. Likewise representation must be made something more explicit say byCodeUnit and work on any isNarrowString per above. The opposite of that is byCodePoint. Fine. 3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe? We're stuck with that name. 4. We lack lots of good stuff from Unicode standard. Some recently landed in std.uni. We need many more, and deprecate crappy ones in std.string. (e.g. wrapping text is one) Add away. 5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16. That together with 1, should IMHO solve most of our problems. Great! 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 19:40:32 UTC, Andrei Alexandrescu wrote: 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei When I've wanted to write code especially for ASCII, I think it hasn't been for use in generic algorithms anyway. Mostly it's stuff for manipulating segments of memory in a particular way, as seen here in my library, which does some work to generate D code. https://github.com/w0rp/dsmoke/blob/master/source/smoke/string_util.d#L45 Anything else would be something like running through an algorithm and then copying data into a new array or similar, and that would miss the point. When it comes to generic algorithms and ASCII I think UTF-x is sufficient.
Re: Major performance problem with std.array.front()
09-Mar-2014 23:40, Andrei Alexandrescu wrote: On 3/9/14, 12:25 PM, Dmitry Olshansky wrote: Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation. Now you're talking. 1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a narrow string. Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now. Wait, why is dchar[] a narrow string? Indeed `...entity of char/wchar/dchar` -> `...entity of char/wchar`. 3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe? We're stuck with that name. Too bad, but we have renamed imports... if only they worked correctly. But let's not derail. [snip] Great, so this may be turned into a smallish DIP or bugzilla enhancements. 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost He certainly doesn't have things like case-insensitive matching or collation on his list. Some cute tables are what the UTF algorithms directly require for almost anything beyond simple-minded "find me a substring". Walter certainly would have a different stance the moment he observes the extra bulk of object code for these. (that can be avoided) How? I'm not talking about `x < 0x80` branches, those wouldn't cost a dime. I really don't feel strongly about the 6th point. I see it as a good idea to allow custom alphabets and reap performance benefits where it makes sense; the need for that is less urgent though. and that we should go farther into the future instead of catering to an obsolete representation. That is something I agree with. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On 3/9/2014 1:26 PM, Andrei Alexandrescu wrote: On 3/9/14, 6:34 AM, Jakob Ovrum wrote: `byCodeUnit` is essentially std.string.representation. Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte. Probably because char *is* D's type for UTF-8 code units.
Re: Major performance problem with std.array.front()
On 3/9/2014 11:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote: - In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring. With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there. This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos. Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.
Re: Major performance problem with std.array.front()
On 3/8/2014 9:15 PM, Michel Fortin wrote: Text is an interesting topic for never-ending discussions. It's also a good example for when non-programmers are surprised to hear that I *don't* see the world as binary black and white *because* of my programming experience ;) Problems like text-handling make it [painfully] obvious to programmers that reality is shades-of-grey - laymen don't often expect that!
Re: Major performance problem with std.array.front()
On 3/9/2014 7:47 AM, w0rp wrote: My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.) Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)
Re: Major performance problem with std.array.front()
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
Re: Major performance problem with std.array.front()
On 3/9/2014 6:34 AM, Jakob Ovrum wrote: `byCodeUnit` is essentially std.string.representation. Not at all. std.string.representation takes a string and casts it to the corresponding ubyte, ushort, or uint string. It doesn't work at all with InputRange!char.
Re: Major performance problem with std.array.front()
On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str; wstring wstr; dstring dstr;

(str|wchar|dchar).byChar   // Always range of char
(str|wchar|dchar).byWchar  // Always range of wchar
(str|wchar|dchar).byDchar  // Always range of dchar

str.representation   // Range of ubyte
wstr.representation  // Range of ushort
dstr.representation  // Range of uint

str.byCodeUnit   // Range of char
wstr.byCodeUnit  // Range of wchar
dstr.byCodeUnit  // Range of dchar
Re: Major performance problem with std.array.front()
On 3/10/2014 12:19 AM, Nick Sabalausky wrote:

(str|wchar|dchar).byChar   // Always range of char
(str|wchar|dchar).byWchar  // Always range of wchar
(str|wchar|dchar).byDchar  // Always range of dchar

Erm, naturally I meant (str|wstr|dstr)
Re: Major performance problem with std.array.front()
On 3/9/2014 9:19 PM, Nick Sabalausky wrote: On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str; wstring wstr; dstring dstr;

(str|wchar|dchar).byChar   // Always range of char
(str|wchar|dchar).byWchar  // Always range of wchar
(str|wchar|dchar).byDchar  // Always range of dchar

str.representation   // Range of ubyte
wstr.representation  // Range of ushort
dstr.representation  // Range of uint

str.byCodeUnit   // Range of char
wstr.byCodeUnit  // Range of wchar
dstr.byCodeUnit  // Range of dchar

I don't see much point to the latter 3.
Re: Major performance problem with std.array.front()
08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing. Andrei -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
08-Mar-2014 12:09, Dmitry Olshansky wrote: 08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing. Plus it won't help matters, you need both 'é' and "cassé" to have the same normalization. -- Dmitry Olshansky
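The normalization trap described here is easy to reproduce: the same word in NFC versus NFD compares unequal, and a code-point search for the precomposed 'é' fails against the decomposed spelling until both sides are normalized the same way. A Python sketch of exactly that failure mode (illustration only):

```python
import unicodedata

composed = "cass\u00e9"       # 'é' as the single code point U+00E9 (NFC)
decomposed = "casse\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT (NFD)

assert composed != decomposed             # equal text, unequal code points
assert "\u00e9" not in decomposed         # the search canFind would do fails
# Normalizing both sides to the same form repairs the comparison.
assert unicodedata.normalize("NFC", decomposed) == composed
assert "\u00e9" in unicodedata.normalize("NFC", decomposed)
```

This is why grapheme-aware search alone is not enough: the haystack and the needle must also agree on normalization form (or the comparison itself must normalize on the fly).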
Re: Major performance problem with std.array.front()
08-Mar-2014 05:18, Andrei Alexandrescu wrote: On 3/7/14, 12:48 PM, Dmitry Olshansky wrote: 07-Mar-2014 23:57, Andrei Alexandrescu wrote: On 3/6/14, 6:37 PM, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. [snip] Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type. Special casing was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous. I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works. Well, I've said it elsewhere - specialization was too fine grained. Either it's generic or it doesn't work. Making strings bidirectional ranges has been a very good choice within the constraints. There was already a string type, and that was immutable(char)[], and a bunch of code depended on that definition. Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it. I disagree. Also what hole? Let's say we keep it. Yesterday I had to write constraints like this: if((isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) || (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar))) Just to accept anything that works like an array of wchar, buffers and whatnot included. I expect that this should have been enough: isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar) Or maybe introduce something to indicate any DualRange of narrow chars. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote: Vladimir Panteleev: It's not about types, it's about algorithms. Given sufficiently refined types, it can be about types :-) Bye, bearophile I think Bear is onto something; we already solved an analogous problem in an elegant way - see SortedRange with assumeSorted etc. But for this to be convenient to use, I still think we should expand the current 'String Literal Postfix' types to include both normalization and graphemes.

Postfix  Type                Aka
c        immutable(char)[]   string
w        immutable(wchar)[]  wstring
d        immutable(dchar)[]  dstring
Re: Major performance problem with std.array.front()
On 3/8/14, 12:14 AM, Dmitry Olshansky wrote: 08-Mar-2014 12:09, Dmitry Olshansky wrote: 08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing. Plus it won't help matters, you need both 'é' and "cassé" to have the same normalization. Why? Couldn't the grapheme 'é' compare true with the character? I.e. the byGrapheme iteration normalizes on the fly. Andrei
Re: Major performance problem with std.array.front()
On 3/8/14, 12:09 AM, Dmitry Olshansky wrote: 08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Yah but I think they should support comparison with individual characters. No? Andrei