On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
09-Mar-2014 07:53, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
I don't understand this argument. Iterating by code unit is not
meaningless if you don't want to extract meaning from each unit
iteration. For example, if you're parsing JSON or XML, you only care
about the syntax characters, which are all ASCII. And there is no
confusion of "what exactly are we counting here".

This was debated... people should not be looking at individual code
points, unless they really know what they're doing.

Should they be looking at code units instead?

No. They should only be looking at substrings.

This. Anyhow, searching for a dchar makes sense for _some_ languages; the
problem is that it shouldn't decode the whole string but rather encode
the needle properly and search for that.

That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points.
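The "encode the needle" idea can be sketched roughly like this. `findChar` is a hypothetical helper name (not in Phobos); `std.utf.encode` and `std.algorithm.find` are real Phobos functions, and the sketch assumes UTF-8 haystacks:

```d
import std.utf : encode;
import std.algorithm.searching : find;

// Hypothetical helper: locate a dchar in a char[] haystack without
// decoding the haystack. The needle is encoded to its UTF-8 code units
// once, then found by plain substring (code-unit) search.
const(char)[] findChar(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // UTF-8 code units of the needle
    return haystack.find(buf[0 .. len]);
}

void main()
{
    assert(findChar("abcдef", 'д') == "дef");
    assert(findChar("abc", 'д') == "");
}
```

Conceptually the result is the same as decoding and comparing code points, since UTF-8 guarantees a code point's byte sequence can't appear inside another code point's sequence.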

Basically the whole thread is about:
how do I work efficiently (without decoding) with UTF-8/UTF-16 in the
cases where it obviously can be done?

The current situation is bad in that it undermines writing decode-less
generic code.

s/undermines writing/makes writing explicit/

One easily falls into the auto-decode trap on the first .front,
especially when it's called from some standard algorithm. The algorithm sees
char[]/wchar[] and gets into decode mode via some special case. If it
did that with _all_ char/wchar random-access ranges it'd be at
least consistent.

That, and wrapping your head around 2 sets of constraints. The amount of
code around the 2 types - char[]/wchar[] - is way too much, that much is clear.

We're engineers so we should quantify. Ideally that would be as simple as "git grep isNarrowString|wc -l" which currently prints 42 of all numbers :o).

Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation.
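A rough sketch of what .representation buys: it reinterprets a string as an array of its code units, so range algorithms operate on bytes and auto-decoding never kicks in. The names below (`std.string.representation`, `std.algorithm.count`) are real Phobos functions:

```d
import std.string : representation;
import std.algorithm.searching : count;

void main()
{
    string s = "наши"; // 4 Cyrillic letters, 2 UTF-8 bytes each

    // .representation exposes the raw code units as immutable(ubyte)[],
    // so no special-casing or decoding happens downstream.
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.count == 8); // counts code units

    // The auto-decoding path counts code points instead:
    assert(s.count == 4);
}
```

Inside Phobos, dispatching on isNarrowString and immediately dropping to .representation is one way to confine the decode/no-decode decision to a single place instead of scattering special cases.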


Andrei
