Re: Unicode handling comparison

2013-11-29 Thread Jakob Ovrum
On Wednesday, 27 November 2013 at 20:13:32 UTC, Dmitry Olshansky wrote: I could have sworn we had byGrapheme somewhere, well apparently not :( Simple attempt: https://github.com/D-Programming-Language/phobos/pull/1736
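For anyone following the byGrapheme discussion, here is a minimal sketch of how such a range can be used. It assumes a Phobos release in which std.uni exposes byGrapheme (it did not exist when this thread started), and graphemeCount is a hypothetical helper name chosen for illustration, not a Phobos function.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts user-perceived characters (grapheme clusters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "noe" followed by U+0308 COMBINING DIAERESIS, then "l"
    string s = "noe\u0308l";

    assert(s.length == 6);          // UTF-8 code units (U+0308 takes two bytes)
    assert(s.walkLength == 5);      // code points, via the default decoding
    assert(graphemeCount(s) == 4);  // grapheme clusters: n, o, e+diaeresis, l
}
```

The three counts correspond to the three abstraction levels discussed in the thread: code units, code points, and graphemes.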

Re: Unicode handling comparison

2013-11-28 Thread Walter Bright
On 11/27/2013 12:06 PM, Dmitry Olshansky wrote: 27-Nov-2013 18:45, David Nadlinger wrote: As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap. It's anything but cheap. At the minimum imagine crawling the string and iss

Re: Unicode handling comparison

2013-11-28 Thread Walter Bright
On 11/28/2013 10:19 AM, H. S. Teoh wrote: Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters. It doesn't have to be merely ASCII. You can do string substring searches without any need for decoding, for example. You don't even need decoding to d
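A rough sketch of the point about substring searches needing no decoding: UTF-8 is self-synchronizing, so a validly encoded needle can only match a validly encoded haystack at code point boundaries, and comparing raw code units is enough. The helper name containsBytes is hypothetical, and this is only one way to bypass decoding (via std.string.representation), not necessarily what Phobos does internally.

```d
import std.algorithm : find;
import std.string : representation;

// Hypothetical helper: substring test on raw UTF-8 code units, no decoding.
bool containsBytes(string haystack, string needle)
{
    // representation() reinterprets the string as immutable(ubyte)[]
    return haystack.representation.find(needle.representation).length != 0;
}

void main()
{
    // Non-ASCII text is found correctly without decoding.
    assert(containsBytes("caf\u00E9 au lait", "\u00E9 au"));

    // No normalization happens: decomposed and precomposed forms differ bytewise.
    assert(!containsBytes("noe\u0308l", "no\u00EBl"));
}
```

The second assertion shows the limit of this approach: it is byte-exact, so canonically equivalent strings in different normalization forms do not match.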

Re: Unicode handling comparison

2013-11-28 Thread Walter Bright
On 11/28/2013 11:32 AM, Dmitry Olshansky wrote: I had a (a bit cloudy) vision of settling encoded ranges problem once and for good. That includes defining notion of an encoded range that is 2 in one: some stronger (as in capabilities) range of code elements and the default decoded view imposed on

Re: Unicode handling comparison

2013-11-28 Thread Dmitry Olshansky
28-Nov-2013 17:24, monarch_dodra wrote: On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote: Sadly, I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into a

Re: Unicode handling comparison

2013-11-28 Thread monarch_dodra
On Thursday, 28 November 2013 at 18:55:44 UTC, Dicebot wrote: http://dlang.org/phobos/std_encoding.html#.AsciiString ? Yeah, that or just ubyte[]. The problem with both of these though, is printing :/ (which prints ugly as sin) Something like: struct AsciiChar { private char c; alia

Re: Unicode handling comparison

2013-11-28 Thread Dicebot
http://dlang.org/phobos/std_encoding.html#.AsciiString ?

Re: Unicode handling comparison

2013-11-28 Thread H. S. Teoh
On Thu, Nov 28, 2013 at 09:52:08AM -0800, Walter Bright wrote: > On 11/28/2013 5:24 AM, monarch_dodra wrote: > >Which operations are you thinking of in std.array that decode > >when they shouldn't? > > front() in std.array looks like: > > @property dchar front(T)(T[] a) @safe pure if (isNarrowStr

Re: Unicode handling comparison

2013-11-28 Thread Walter Bright
On 11/28/2013 5:24 AM, monarch_dodra wrote: Which operations are you thinking of in std.array that decode when they shouldn't? front() in std.array looks like: @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[])) { assert(a.length, "Attempting to fetch the front of an empty

Re: Unicode handling comparison

2013-11-28 Thread monarch_dodra
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote: Sadly, I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into account, but still happen to work with the

Re: Unicode handling comparison

2013-11-28 Thread bearophile
Walter Bright: This means that all algorithms on strings will be crippled as far as performance goes. If you want to sort an array of chars you need to use a dchar[], or code like this: char[] word = "just a test".dup; auto sword = cast(char[])word.representation.sort().release; See: http:
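The one-liner quoted above can be expanded into a self-contained sketch. The wrapper name sortAsciiChars is hypothetical, and the approach is assumed to be used only on ASCII text: sorting the individual code units of multi-byte UTF-8 sequences would produce invalid UTF-8.

```d
import std.algorithm : sort;
import std.string : representation;

// Hypothetical wrapper around the one-liner from the post.
// Sorts in place by viewing the char[] as ubyte[], which avoids the
// dchar decoding that a narrow string would otherwise get as a range.
// Assumption: the input is pure ASCII.
char[] sortAsciiChars(char[] text)
{
    return cast(char[]) text.representation.sort().release;
}

void main()
{
    char[] word = "just a test".dup;
    auto sword = sortAsciiChars(word);

    assert(sword == "  aejsstttu");
    assert(word == sword); // same memory, sorted in place
}
```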

Re: Unicode handling comparison

2013-11-28 Thread Jakob Ovrum
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote: Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes. http://dlang.org/glossar

Re: Unicode handling comparison

2013-11-28 Thread Walter Bright
On 11/27/2013 9:22 AM, Jakob Ovrum wrote: In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolb

Re: Unicode handling comparison

2013-11-27 Thread Wyatt
On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote: i18nString sounds like a range of graphemes to me. Maybe. If I had called it...say, "normalisedString"? Would you still think that? That was an off-the-cuff name because my morning brain imagined that this sort of thing wou

Re: Unicode handling comparison

2013-11-27 Thread Dmitry Olshansky
27-Nov-2013 20:18, Wyatt wrote: On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote: It honestly surprised me how many things in std.uni don't seem to work on ranges. Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)? -- Dmitry Olshansky

Re: Unicode handling comparison

2013-11-27 Thread Dmitry Olshansky
27-Nov-2013 20:22, Wyatt wrote: On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote: trouble following all that (e.g. Isn't "noe\u0308l" a grapheme Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: ".

Re: Unicode handling comparison

2013-11-27 Thread Dmitry Olshansky
27-Nov-2013 22:54, Jacob Carlborg wrote: On 2013-11-27 18:56, Dicebot wrote: +1 Working with graphemes is rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides continuous transition through abstraction levels here. It is important to make the costs ob

Re: Unicode handling comparison

2013-11-27 Thread Dmitry Olshansky
27-Nov-2013 22:12, H. S. Teoh wrote: On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote: On 11/27/13 7:43 AM, Jakob Ovrum wrote: On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expos

Re: Unicode handling comparison

2013-11-27 Thread Simen Kjærås
On 27.11.2013 19:07, Andrei Alexandrescu wrote: On 11/27/13 7:43 AM, Jakob Ovrum wrote: On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a ran

Re: Unicode handling comparison

2013-11-27 Thread Dmitry Olshansky
27-Nov-2013 18:45, David Nadlinger wrote: On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote: Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to f

Re: Unicode handling comparison

2013-11-27 Thread Charles Hixson
On 11/27/2013 08:53 AM, Jakob Ovrum wrote: On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote: I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything.

Re: Unicode handling comparison

2013-11-27 Thread Charles Hixson
On 11/27/2013 06:45 AM, David Nadlinger wrote: On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote: Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos see

Re: Unicode handling comparison

2013-11-27 Thread Jacob Carlborg
On 2013-11-27 18:56, Dicebot wrote: +1 Working with graphemes is rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides continuous transition through abstraction levels here. It is important to make the costs obvious. I think it's missing a final high

Re: Unicode handling comparison

2013-11-27 Thread Walter Bright
On 11/27/2013 8:18 AM, Wyatt wrote: It honestly surprised me how many things in std.uni don't seem to work on ranges. Many things in Phobos either predate ranges, or are written by people who aren't used to ranges and don't think in terms of ranges. It's an ongoing issue, and one we need to i

Re: Unicode handling comparison

2013-11-27 Thread H. S. Teoh
On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote: > On 11/27/13 7:43 AM, Jakob Ovrum wrote: > >On that note, I tried to use std.uni to write a simple example of how > >to correctly handle this in D, but it became apparent that std.uni > >should expose something like `byGrapheme`

Re: Unicode handling comparison

2013-11-27 Thread Andrei Alexandrescu
On 11/27/13 7:43 AM, Jakob Ovrum wrote: On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probab

Re: Unicode handling comparison

2013-11-27 Thread Dicebot
On Wednesday, 27 November 2013 at 17:37:48 UTC, Jakob Ovrum wrote: On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg wrote: On 2013-11-27 18:22, Jakob Ovrum wrote: What would it do that std.uni doesn't already? A class/struct that handles all these normalizations and other stuf

Re: Unicode handling comparison

2013-11-27 Thread H. S. Teoh
On Wed, Nov 27, 2013 at 06:22:41PM +0100, Jakob Ovrum wrote: > On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote: > >I don't remember if it was brought up before, but this makes me > >wonder if something like an i18nString should exist for cases > >where it IS important. Making i18n stuf

Re: Unicode handling comparison

2013-11-27 Thread Jakob Ovrum
On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg wrote: On 2013-11-27 18:22, Jakob Ovrum wrote: What would it do that std.uni doesn't already? A class/struct that handles all these normalizations and other stuff automatically. Sounds terrible :)

Re: Unicode handling comparison

2013-11-27 Thread Jacob Carlborg
On 2013-11-27 18:22, Jakob Ovrum wrote: What would it do that std.uni doesn't already? A class/struct that handles all these normalizations and other stuff automatically. -- /Jacob Carlborg

Re: Unicode handling comparison

2013-11-27 Thread Jakob Ovrum
On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote: I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe

Re: Unicode handling comparison

2013-11-27 Thread Jacob Carlborg
On 2013-11-27 17:15, Wyatt wrote: I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.

Re: Unicode handling comparison

2013-11-27 Thread Jakob Ovrum
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote: I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything. If they know, they know; if they don't, th

Re: Unicode handling comparison

2013-11-27 Thread Jakob Ovrum
On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote: Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of

Re: Unicode handling comparison

2013-11-27 Thread Gary Willoughby
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote: Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http

Re: Unicode handling comparison

2013-11-27 Thread Wyatt
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote: trouble following all that (e.g. Isn't "noe\u0308l" a grapheme Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable uni

Re: Unicode handling comparison

2013-11-27 Thread Dicebot
On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote: Seems like a pretty big "gotcha" from a usability standpoint; it's not exactly intuitive. I understand WHY this decision was made, but it feels like a source of code smell and weird string comparison errors. It probably is, but is

Re: Unicode handling comparison

2013-11-27 Thread Wyatt
On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote: The author also doesn't seem to understand the Unicode definitions of character and grapheme, which is a shame, because the difference is more or less the whole point of the post. I agree with the assertion that people SHOUL

Re: Unicode handling comparison

2013-11-27 Thread Wyatt
On Wednesday, 27 November 2013 at 14:45:32 UTC, David Nadlinger wrote: If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to y

Re: Unicode handling comparison

2013-11-27 Thread Jakob Ovrum
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote: Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ Most of the points are good, but the author seems to confus

Re: Unicode handling comparison

2013-11-27 Thread Jacob Carlborg
On 2013-11-27 16:07, Adam D. Ruppe wrote: Yeah, I saw it too. The fix is simple: https://github.com/D-Programming-Language/phobos/pull/1728 tbh this makes me think version(unittest) might just be considered harmful. I'm sure that code passed the tests, but only because a vital import was in a

Re: Unicode handling comparison

2013-11-27 Thread bearophile
David Nadlinger: If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the cod
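To make the normalize suggestion concrete, a small sketch using std.uni.normalize with its default form, NFC. The helper name codePointCount is hypothetical; the test string is the "noe\u0308l" example that comes up elsewhere in the thread.

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

// Hypothetical helper: number of code points after NFC normalization.
size_t codePointCount(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    string decomposed = "noe\u0308l";   // 'e' + U+0308 COMBINING DIAERESIS
    string composed = normalize!NFC(decomposed);

    assert(decomposed.walkLength == 5); // n, o, e, U+0308, l
    assert(composed.walkLength == 4);   // n, o, U+00EB, l
    assert(composed == "no\u00EBl");
    assert(codePointCount(decomposed) == 4);
}
```

Normalization only helps when a precomposed form exists for the sequence; combining sequences without one still span several code points, which is where grapheme-level iteration comes in.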

Re: Unicode handling comparison

2013-11-27 Thread Adam D. Ruppe
On Wednesday, 27 November 2013 at 15:03:37 UTC, Jacob Carlborg wrote: std/uni.d(6301): Error: undefined identifier tuple Yeah, I saw it too. The fix is simple: https://github.com/D-Programming-Language/phobos/pull/1728 tbh this makes me think version(unittest) might just be considered harmfu

Re: Unicode handling comparison

2013-11-27 Thread Jacob Carlborg
On 2013-11-27 15:45, David Nadlinger wrote: If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to

Re: Unicode handling comparison

2013-11-27 Thread David Nadlinger
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote: Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http

Re: Unicode handling comparison

2013-11-27 Thread monarch_dodra
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote: D+Phobos seem to fail most things (it produces BAFFLE): I still think we're doing pretty good. At least, we *handle* unicode at all (looking at you C++). And we handle *true* unicode, not BMP style UCS (looking at you Java/C#)

Re: Unicode handling comparison

2013-11-27 Thread Simen Kjærås
On 2013-11-27 13:46, bearophile wrote: Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435