On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:

> i18nString sounds like a range of graphemes to me.

Maybe. If I had called it, say, "normalisedString", would you still think that? It was an off-the-cuff name; my morning brain imagined this sort of thing would be useful for user input, where you can't make assumptions about its form.

> I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid.

Okay, hold up. It's a bit late to prevent everyone from diving down this rabbit hole, but let me be clear:

This really isn't about graphemes. Not really. They may be involved, but I think focusing on that obscures the point.

If you recall the original article, I don't think he's being unfair in expecting "noël" to have a length of four no matter how it was composed. I don't think it's unfair to expect that "noël".take(3) returns "noë", and I don't think it's unfair that reversing it should be "lëon". All the places where his expectations were defied (and more!) are implementation details.
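For concreteness, here is a rough sketch of those three expectations in D. It assumes a Phobos whose std.uni provides byGrapheme and byCodePoint; the helper names (graphemeCount, takeGraphemes, reverseGraphemes) are made up for illustration and are not part of any library.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, take, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helpers, named here for illustration only.

/// Number of user-perceived characters (grapheme clusters) in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// The first n graphemes of s, re-encoded as a string.
string takeGraphemes(string s, size_t n)
{
    return s.byGrapheme.take(n).byCodePoint.text;
}

/// s with its graphemes in reverse order, re-encoded as a string.
string reverseGraphemes(string s)
{
    // retro needs a bidirectional range, so buffer the graphemes first.
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // "noël" written as 'e' followed by a combining diaeresis (U+0308).
    string decomposed = "noe\u0308l";

    assert(decomposed.length == 6);     // UTF-8 code units
    assert(decomposed.walkLength == 5); // code points
    assert(graphemeCount(decomposed) == 4);

    assert(takeGraphemes(decomposed, 3) == "noe\u0308");  // "noë"
    assert(reverseGraphemes(decomposed) == "le\u0308on"); // "lëon"
}
```

The point is that the "expected" answers only appear once you go through the grapheme layer; the plain string gives 6 or 5 depending on how you count, and a library type could hide that choice from people who don't want to make it.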

While I stated before that I don't have anything against people learning more about Unicode, neither do I believe it's something a lot of people _need_ to worry about. I'm not saying the default string in D should change or anything crazy like that. All I'm suggesting is that, rather than telling people to read a small book about the most arcane stuff imaginable and then, when that doesn't take, explaining which tool does what, we could just tell them "Here, use this library type where you need it", with the admonishment that it may be too slow if abused. I think THAT could be useful.

> In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.

See, this sways me only a little bit. The reason is that convenience often greatly trumps elegance or performance. Sure, I COULD write something in C to look for obvious bad stuff in my syslog, but would I bother when I have a shell with pipes, grep, cut, and sed? None of this is to say I don't LIKE performance and elegance; but I live, work, and play on both sides of this spectrum, and I'd like to think they can peacefully coexist without too much fuss.

-Wyatt
