Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 17:44:22 -0400, Nick Sabalausky seewebsitetocontac...@semitwist.com wrote: On 3/7/2014 8:40 AM, Michel Fortin wrote: On 2014-03-07 03:59:55 +, bearophile bearophileh...@lycos.com said: Walter Bright: I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.) On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage). The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness. Well, it is *more* correct, as many western languages are more likely to just work in current Phobos in most cases. It's just that things still aren't completely correct overall. From my experience, I'd suggest these basic operations for a string range instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear

Someone should be able to mix all three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '' or ''. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation). If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants.
Having to do that should make it obvious that there's an inefficiency there: you're using an algorithm that wasn't tailored to work with strings, and more decoding than strictly necessary is being done. I actually like this suggestion quite a bit. +1 Reminds me of my proposal for Rust (https://github.com/mozilla/rust/issues/7043#issuecomment-19187984) -- Marco
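The multi-level API proposed above can be mimicked in Python as a rough illustration (this is not D and not the actual proposal; the names `StrRange`, `front_code_unit`, etc. are invented for this sketch, and the grapheme handling is a simplification of Unicode's full UAX #29 segmentation):

```python
import unicodedata

# Hypothetical sketch of a string wrapper exposing all three access levels.
# Graphemes are approximated by attaching combining marks to their base
# character; a real implementation would follow UAX #29.
class StrRange:
    def __init__(self, s):
        self.data = s.encode("utf-8")   # backing store: UTF-8 code units
        self.text = s                   # Python str: a code-point view

    def front_code_unit(self):
        return self.data[0]             # one UTF-8 byte, as an int

    def front_code_point(self):
        return self.text[0]             # one code point

    def front_grapheme(self):
        g = self.text[0]
        for ch in self.text[1:]:
            if unicodedata.combining(ch):
                g += ch                 # keep combining marks with their base
            else:
                break
        return g

s = StrRange("e\u0301tude")             # 'étude' with a combining acute accent
assert s.front_code_unit() == 101       # the byte for plain 'e'
assert s.front_code_point() == "e"      # the accent is a separate code point
assert s.front_grapheme() == "e\u0301"  # base character plus its accent
```

Making the caller pick one of the three names at each call site is exactly what makes the decoding cost, or its absence, visible in the code.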
Re: Major performance problem with std.array.front()
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote: Is there any hope of fixing this? I agree with Andrei. I don't think that there's really anything to fix. The problem is that there are roughly 3 levels at which string operations can be done:

1. By code unit
2. By code point
3. By grapheme

and which is correct depends on what you're trying to do. Phobos attempts to go for correctness by default without seriously impacting performance, so it treats all strings as ranges of dchar (so, level #2). If we went with #1, then pretty much any algorithm which operated on individual characters would be broken, as unless your strings are ASCII-only, code units are very much the wrong level to be operating on if you're trying to deal with characters. If we went with #3, then we'd have full correctness, but we'd tank performance. With #2, we're far more correct than is typically the case with C++ while still being reasonably performant. And those who want full performance can use immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme support in std.uni. We've gone to great lengths in Phobos to specialize on narrow strings in order to make it more efficient while still maintaining correctness, and anyone who really wants performance can do the same. But by operating on the code point level, we at least get a reasonable level of Unicode-correctness by default. With your suggestion, I'd fully expect most D programs to be wrong with regards to Unicode, because most programmers don't know or care about how Unicode works. And changing what we're doing now would be code breakage of astronomical proportions. It would essentially break all uses of range-based string code. Certainly, it would be the largest code breakage that D has seen in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we need to be darn sure that what we change to is significantly better and worth the huge amount of code breakage that it will cause.
I really don't think that there's any way to get this right. Regardless of which level you operate at by default - be it code unit, code point, or grapheme - it will be wrong a good chunk of the time. So, it becomes a question of which of the three has the best tradeoffs, and I think that our current solution of operating on code points by default does that. If there are things that we can do to better support operating on code units or graphemes for those who want it, then great. And it's great if we can find ways to make operating at the code point level more efficient or less prone to bugs due to not operating at the grapheme level. But I think that operating on the code point level like we currently do is by far the best approach. If anything, it's the fact that the language doesn't do that that's a bigger concern IMHO - the main place where that's an issue being the fact that foreach iterates by code unit by default. But I don't know of a good way to solve that other than treating all arrays of char, wchar, and dchar specially, and disabling their array operations like ranges do, so that you have to convert them to code units via the representation function in order to operate on them as code units - which Andrei has suggested a number of times before, but you've shot him down each time. If that were fixed, then at least we'd be consistent, which is usually the biggest complaint with regards to how D treats strings. But I really don't think that there's a magical fix for range-based string operations, and I think that our current approach is a good one. - Jonathan M Davis
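The three levels described above give three different answers even for a short Western string. A quick Python illustration (the grapheme clustering here is a simplification of full UAX #29 segmentation, good enough for plain combining marks):

```python
import unicodedata

s = "noe\u0301l"                     # "noël" written with a combining acute

# Level 1: code units -- the raw UTF-8 bytes
assert len(s.encode("utf-8")) == 6   # the combining mark alone takes 2 bytes

# Level 2: code points -- what Python 3's str (and a dchar range) sees
assert len(s) == 5

# Level 3: graphemes -- user-perceived characters
graphemes = []
for ch in s:
    if graphemes and unicodedata.combining(ch):
        graphemes[-1] += ch          # attach combining mark to its base
    else:
        graphemes.append(ch)
assert len(graphemes) == 4           # n, o, e+acute, l
```

An algorithm that reverses, truncates, or counts "characters" gives a different (and differently wrong) result depending on which of the three levels it picks.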
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 21:38:06 UTC, Nick Sabalausky wrote: On 3/9/2014 7:47 AM, w0rp wrote: My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.) Python 2 or 3 (out of curiosity)? If you're including Python 3, then that somewhat surprises me, as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.) Late reply here. Python 3 is a lot better in terms of Unicode support than 2. The situation in Python 2 was this:

1. The default string type is 'str', an immutable array of bytes.
2. 'str' could be in one of many encodings, including UTF-16, etc.
3. There is an extra 'unicode' type for when you want a Unicode string.
4. Python implicitly converts between the two, often in wrong ways, often causing exceptions to appear where you didn't expect them to.

In 3, this changed to:

1. The default string type is still named 'str', only now it's like the 'unicode' of olde.
2. 'bytes' is a new immutable array of bytes type, like the Python 2 'str'.
3. Conversion between 'str' and 'bytes' is always explicit.

However, Python 3 works at a code point level (probably some code unit level in fact), and you don't see very many algorithms which take, say, combining characters into account. So Python suffers from similar issues.
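The Python 3 behaviour described above can be demonstrated directly, including the remaining code-point-level limitation (standard Python, nothing D-specific):

```python
import unicodedata

s = "caf\u00e9"                 # 'café' with a precomposed é
b = s.encode("utf-8")           # str -> bytes must be explicit in Python 3
assert b.decode("utf-8") == s   # and so must bytes -> str

# Mixing the two types raises instead of silently converting (unlike Python 2):
try:
    b + s
except TypeError:
    pass                        # expected: no implicit bytes/str coercion
else:
    raise AssertionError("expected TypeError")

# But str still works at the code-point level: composed and decomposed
# forms of the same user-perceived text compare unequal...
assert "caf\u00e9" != "cafe\u0301"
# ...unless you normalize explicitly.
assert unicodedata.normalize("NFC", "cafe\u0301") == "caf\u00e9"
```

So Python 3 removed the implicit-conversion traps of Python 2, but its default string operations still stop one level short of graphemes, which is the same trade-off Phobos makes.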
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote: Ok, I have a plan. Each step will be separated by at least one version:

1. Implement decode() as an algorithm for string types, so one can write: string s; s.decode.algorithm... and suggest that people start doing that instead of: s.algorithm...
2. Emit a warning when people use std.array.front(s) with strings.
3. Deprecate std.array.front for strings.
4. Error for std.array.front for strings.
5. Implement new std.array.front for strings that doesn't decode.

What about this:

1. [as above] Implement decode() as an algorithm for string types, so one can write: string s; s.decode.algorithm... and suggest that people start doing that instead of: s.algorithm...
2. [as above] Emit a warning when people use std.array.front(s) with strings.
3. Implement new std.array.front for strings that doesn't decode, but keep the old one either forever(ish) or until way into D3 (3.03).
4. Deprecate std.array.front for strings (see 3.)
5. Error for std.array.front for strings (see 3.)

I know that one of the rules of D is that warnings should eventually become errors, but there is nothing wrong with waiting longer than a few months before something is an error or removed from the library, especially if it would cause loads of code to break (my own too, I suppose). As long as users are aware of it, they can start to make the transition in their own code little by little. In this case they will make the transition rather sooner than later, because nobody wants to suffer constant performance penalties. So for this particular change I'd suggest to wait patiently until it can finally be deprecated. Is this feasible?
Re: Major performance problem with std.array.front()
On Tuesday, 11 March 2014 at 02:07:19 UTC, Steven Schveighoffer wrote: On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. The array is small and does not escape. It could be allocated on the stack as an optimization.
Re: Major performance problem with std.array.front()
On 3/10/2014 12:23 AM, Walter Bright wrote: On 3/9/2014 9:19 PM, Nick Sabalausky wrote: On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str;
wstring wstr;
dstring dstr;

(str|wstr|dstr).byChar  // Always range of char
(str|wstr|dstr).byWchar // Always range of wchar
(str|wstr|dstr).byDchar // Always range of dchar

str.representation  // Range of ubyte
wstr.representation // Range of ushort
dstr.representation // Range of uint

str.byCodeUnit  // Range of char
wstr.byCodeUnit // Range of wchar
dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3. Do you mean:

1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

Responses:

1. Iterating by code unit: Useful for tweaking performance anytime decoding is unnecessary. For example, parsing a grammar where the bulk of the keywords and operators are ASCII. (Occasional uses of Unicode, like Unicode whitespace, can of course be handled easily enough by the lexer FSM.)
2. 'byCodeUnit' if we have 'representation': This one I have trouble answering, since I'm still unclear on the purpose of 'representation' (I wasn't even aware of it until a few days ago). I've been assuming there's some specific use-case I've overlooked where it's useful to iterate by code unit *while* treating the code units as if they weren't UTF-8/16/32 at all.
But since 'representation' is called *on* a string/wstring/dstring, they should already be UTF-8/16/32 anyway, not some other encoding that would necessitate using integer types. Or maybe it's just for working around problems with the auto-verification being too eager (I've run into those)? I admit I don't quite get 'representation'.

3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a static if chain every time you want to use code units inside generic code. Also, so in non-generic code you can change your data type without updating instances of 'by*char'.
4. Having 'byCodeUnit' work on UTF-32 dstrings: So generic code working on code units doesn't have to special-case UTF-32.
Re: Major performance problem with std.array.front()
On 3/10/2014 12:09 AM, Nick Sabalausky wrote: On 3/10/2014 12:23 AM, Walter Bright wrote: On 3/9/2014 9:19 PM, Nick Sabalausky wrote: On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str;
wstring wstr;
dstring dstr;

(str|wstr|dstr).byChar  // Always range of char
(str|wstr|dstr).byWchar // Always range of wchar
(str|wstr|dstr).byDchar // Always range of dchar

str.representation  // Range of ubyte
wstr.representation // Range of ushort
dstr.representation // Range of uint

str.byCodeUnit  // Range of char
wstr.byCodeUnit // Range of wchar
dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3. Do you mean:

1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

(3) 3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a static if chain every time you want to use code units inside generic code. Also, so in non-generic code you can change your data type without updating instances of 'by*char'. Just not sure I see a use for that.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote: With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there. This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos. Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way. self-imposed limitation For the greater good. I find this article very telling about why strings should be converted to UTF-8 as often as possible: http://www.utf8everywhere.org/ I agree 100% with its content; it's impossibly hard to have sane handling of encodings on Windows (even more so in a team) if not following the drastic rules the article exposes. This happens to be what Phobos gently mandates; UTF validation is certainly the lesser evil compared to the mess that everything becomes without it. How is mandating valid UTF-8 being overly pedantic? This is the sanest behaviour. Just use sanitizeUTF8 (http://vibed.org/api/vibe.utils.string/sanitizeUTF8) or an equivalent.
Re: Major performance problem with std.array.front()
I'm not sure I understood the point of this (long) thread. The main problem is that decode() is called even when it's not needed? Well, in that case it's not a problem only for strings. I found this problem also when I was writing other ranges, for example when I read binary data from a db stream. Front represents a single row, and I decode it every time even if it's not needed. On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[]. I have strongly objected to these proposals on the grounds that:

1. It is a MAJOR performance problem to do this.
2. Very, very few manipulations of strings ever actually need decoded values.
3. D is a systems/native programming language, and systems/native programming languages must not hide the underlying representation (I make similar arguments about proposals to make ints issue errors on overflow, etc.).
4. Users should choose when decode/encode happens, not the language.

and I have been successful at heading these off. But one slipped by me. See this in std.array:

@property dchar front(T)(T[] a) @safe pure
    if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
    size_t i = 0;
    return decode(a, i);
}

What that means is that if I implement an algorithm that accepts, as input, an InputRange of char's, it will ALWAYS try to decode it. This means that even: from.copy(to) will decode 'from', and then re-encode it for 'to'. And it will do it SILENTLY. The user won't notice, and he'll just assume that D performance sux. Even if he does notice, his options to make his code run faster are poor.
If the user wants decoding, it should be explicit, as in: from.decode.copy(encode!to) The USER should decide where and when the decoding goes. 'decode' should be just another algorithm. (Yes, I know that std.algorithm.copy() has some specializations to take care of this. But these specializations would have to be written for EVERY algorithm, which is thoroughly unreasonable. Furthermore, copy()'s specializations only apply if BOTH source and destination are arrays. If just one is, the decode/encode penalty applies.) Is there any hope of fixing this?
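The cost Walter describes for from.copy(to) can be mimicked in Python: a plain code-unit copy versus a silent decode/re-encode round trip (an illustrative analogue, not the Phobos code; no timings are claimed here, but the second version does strictly more work per character for an identical result):

```python
# Both functions produce identical output, but the second pays a full UTF-8
# decode and a full re-encode even though copying never needed either.
data = ("h\u00e9llo w\u00f6rld " * 1000).encode("utf-8")

def copy_code_units(src):
    return bytes(src)                           # straight byte-for-byte copy

def copy_via_decode(src):
    return src.decode("utf-8").encode("utf-8")  # silent round trip

assert copy_code_units(data) == copy_via_decode(data)
```

The output being byte-identical is exactly why the decode is pure waste in this case: the user never asked for code points, and never sees them.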
Re: Major performance problem with std.array.front()
On 3/10/2014 6:21 AM, ponce wrote: On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote: Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way. self-imposed limitation For the greater good. I find this article very telling about why strings should be converted to UTF-8 as often as possible: http://www.utf8everywhere.org/ I agree 100% with its content; it's impossibly hard to have sane handling of encodings on Windows (even more so in a team) if not following the drastic rules the article exposes. I may have missed it, but I don't see where it says anything about validation or immediate sanitation of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard-to-read font...)
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 11:04:43 UTC, Nick Sabalausky wrote: I may have missed it, but I don't see where it says anything about validation or immediate sanitation of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard-to-read font...) I should have highlighted it; their recommendations for proper encoding handling on Windows are in section 5 (How to do text on Windows). One of them is: "std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise)." I find it interesting that D tends to enforce this lesson learned from mixed-encoding codebases.
Re: Major performance problem with std.array.front()
On 3/9/2014 11:27 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of graphemes, it was still a design decision made of win. Care to argue that? It's simple: Breaking things on all non-English languages is worse than breaking things only on non-western[1] languages. It's still breakage, and that *is* bad, but there's no question which breakage is significantly larger. [1] (And yes, I realize "western" is a gross over-simplification here. The point is one working language vs several working languages.)
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu wrote: On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for it, it will become extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Such as giving up on that crappy language that keeps on breaking their code. Andrei That was more of an "if you are crazy enough to even consider such breakage, this is closer to my personal ideal" than an actual proposal ;)
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 19:43:57 UTC, Walter Bright wrote: On 3/7/2014 7:03 AM, Dicebot wrote: 1) It is a huge breakage and you have been refusing to do one even for more important problems. What is behind this sudden change of mind?

1. Performance Performance Performance Not important enough. D has always been a safe-by-default, fast-when-asked-to language, not the other way around. There is no fundamental performance problem here, only a lack of knowledge about Phobos.

2. The current behavior is surprising (it sure surprised me, I didn't notice it until I looked at the assembler to figure out why the performance sucked) That may imply that better documentation is needed. You were only surprised because of a wrong initial assumption about what the `char[]` type means.

3. Weirdnesses like ElementEncodingType ElementEncodingType is extremely annoying, but I think it is just a side effect of the bigger problem of how string algorithms are handled currently. It does not need to be that way.

4. Strange behavior differences between char[], char*, and InputRange!char types Again, there is nothing strange about it. `char[]` is a special type with special semantics that is defined in the documentation, and it consistently follows that definition in all but raw array indexing/slicing (which is what I find unfortunate but also beyond feasible fixing).

5. Funky anomalous issues with writing OutputRange!char (the put(T) must take a dchar) Bad, but not worth even a small breaking change.

2) lack of convenient .raw property which will effectively do cast(ubyte[]) I've done the cast as a workaround, but when working with generic code it turns out the ubyte type becomes viral - you have to use it everywhere. So all over the place you're having casts between ubyte <-> char in unexpected places. You also wind up with ugly ubyte -> dchar casts, with the commensurate risk that you goofed and have a truncation bug. Of course it is viral.
Because you never ever want to have char[] at all if you don't work with Unicode (or work with it on the raw byte level). And in that case it is your responsibility to do manual decoding when appropriate. Trying to dish out that performance often means going low-level with all the associated risks; there is nothing special about char[] here. It is not a common use case. Essentially, the auto-decode makes trivial code look better, but if you're writing a more comprehensive string processing program, and care about performance, it makes a regular ugly mess of things. And this is how it should be. Again, I am all for creating a language that favors performance-critical power programming needs over common/casual needs, but that is not what D is, and you have been making such choices consistently over quite a long time now (array literals that allocate, I will never forgive that). Suddenly changing your mind only because you have encountered this specific issue personally, as opposed to just reports, does not fit the role of a language author. It does not really matter if any new approach itself is good or bad - being unpredictable is reputation damage D simply can't afford.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote: I'm not sure I understood the point of this (long) thread. The main problem is that decode() is called even when it's not needed? I'd like to offer up one D 'user' perspective; it's just a single data point, but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points. So, my needs as a 'user' are:

* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points.

... you get the drift. If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held. BTW, to answer a question in the thread: yes, the data is stored left-to-right and visualised right-to-left.
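The code-point semantics this user asks for are essentially what Python 3's str already provides; a short Arabic sample makes the point (illustrative only, using meem + fatha + dal, where the fatha is a combining diacritic):

```python
s = "\u0645\u064e\u062f"            # meem + fatha (a diacritic) + dal

# length is the number of code points, not bytes
assert len(s) == 3
# indexing returns the nth code point; here, the combining fatha itself
assert s[1] == "\u064e"
# the raw UTF-8 representation is longer (each code point here is 2 bytes)
assert len(s.encode("utf-8")) == 6
```

Note that code-point indexing exposes diacritics as separate elements, which is exactly what lets this kind of application normalize inconsistently-sequenced input.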
Re: Major performance problem with std.array.front()
In Italian we need Unicode too. We have several accented letters, and often programming languages don't handle UTF-8 and other encodings so well... In D I never had any problem with this, and I work a lot on text processing. So my question: is there any problem I'm missing in D with Unicode support, or is it just a performance problem in algorithms? If the problem is performance in algorithms that use .front() but don't care to understand its data, why don't we add a .rawFront() property to be implemented only where it makes sense, and then a fallback like:

auto rawFront(R)(R range)
    if ( ... isrange ... !__traits(compiles, range.rawFront))
{
    return range.front;
}

In this way, in copy() and other algorithms we can use rawFront(), and it's backward compatible with other ranges too. But I guess I'm missing the point :) On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote: I'm not sure I understood the point of this (long) thread. The main problem is that decode() is called even when it's not needed? I'd like to offer up one D 'user' perspective; it's just a single data point, but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points. So, my needs as a 'user' are:

* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points ... you get the drift. If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held. BTW to answer a question in the thread, yes the data is left-to-right and visualised right-to-left.
Re: Major performance problem with std.array.front()
On 07.03.2014 03:37, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. After reading many of the attached posts, the question is: what could be D's future process for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote: On 07.03.2014 03:37, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. After reading many of the attached posts, the question is: what could be D's future process for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++. Historically, 2 approaches have been practiced:

1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary

I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote: On 07.03.2014 03:37, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. After reading many of the attached posts, the question is: what could be D's future process for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++. I'm a newbie here, but I've been waiting for D to mature for a long time. D IMHO has to stabilise now because:

* D needs a bigger community, so that the big fish who have learnt the ins and outs don't get bored and move on due to lack of kudos etc.
* To get the bigger community, D needs more _working_ libraries for major toolkits (GUI etc.).
* Libraries will cease to work if there is significant change in D, and then can stay broken because there isn't the inertial mass of other developers to maintain them after the initial developer has moved on. You can see that this has happened a LOT.
* Anyway, the D that I read about in TDPL is already very exciting for programmers like myself; we just want that, thanks.

Breaking changes can go into D3, if and whenever that is. Keep breaking D2 now and it risks just being forevermore a playpen for computer scientist types. Anyway, who cares what I think, but I think it reflects a lot of people's opinions too.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote: Historically, 2 approaches have been practiced: 1) argue a lot and then do nothing 2) suddenly change something and tell users it was necessary These are one and the same, just from two opposing points of view. I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned. You mean the whole policy on breaking changes?
Re: Major performance problem with std.array.front()
Historically, two approaches have been practiced: 1) argue a lot and then do nothing. This happens (I think) because Andrei and Walter really value yours and other experts' opinions, but nevertheless have to preserve the general way things work, to preserve the long-term future of D. They have to be open to persuasion, but it would have to be very compelling to get them to change basics now - it seems to me. D is at that difficult 90% stage that we all know about, where the boring, difficult stuff is left to do. People like to discuss interesting new stuff, which at the time seems oh-so-important.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:27:02 UTC, Vladimir Panteleev wrote: On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote: Historically, two approaches have been practiced: 1) argue a lot and then do nothing; 2) suddenly change something and tell users it was necessary. These are one and the same, just from two opposing points of view. /sarcasm :) I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned. You mean the whole policy on breaking changes? Yes. I have given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. This is exactly why I was so shocked that Walter even started this thread. If breaking changes are actually considered (rare or not), then it is absolutely critical to define the process for them and put a link to its description on the dlang.org front page.
Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 14:05:03 +, Andrea Fontana nos...@example.com wrote: In Italian we need Unicode too. We have several accented letters, and programming languages often don't handle UTF-8 and other encodings so well... In D I never had any problem with this, and I work a lot on text processing. So my question: is there any problem I'm missing in D's Unicode support, or is it just a performance problem in the algorithms? The only real problem, apart from the potential performance issues mentioned in this thread, is that indexing/slicing is done with code units. I think this: auto index = countUntil(...); auto slice = str[0 .. index]; is really the only problem with the current implementation. If we could start from scratch, I'd say we keep operating on code points by default but don't make strings arrays of char/wchar/dchar. Instead they should be special types which do all operations (especially indexing and slicing) on code points. This would be as safe as the current implementation and always consistent, but probably even slower in some cases. Then offer some nice way to get the raw data for algorithms which can deal with it. However, I think it's too late to make these changes.
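The countUntil-then-slice mismatch described above can be made concrete (a sketch of mine; the string is illustrative):

```d
import std.algorithm : countUntil;

void main()
{
    string s = "héllo";
    // countUntil auto-decodes, so it returns the number of *code points*
    // before the first 'l', which is 2 ('h' and 'é').
    auto index = s.countUntil('l');
    assert(index == 2);
    // Slicing, however, works on *code units*. 'é' occupies two UTF-8
    // code units, so the first 'l' actually starts at byte offset 3.
    assert(s[0 .. 3] == "hé");
    // s[0 .. index] would therefore cut 'é' in half, yielding invalid UTF-8.
}
```

The two indexing schemes silently disagree as soon as the string contains a non-ASCII character.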
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 13:18:50 UTC, Dicebot wrote: On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu wrote: On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for it, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Such as giving up on that crappy language that keeps on breaking their code. Andrei That was more about whether you are crazy enough to even consider such breakage; this is closer to my personal idea of perfection than an actual proposal ;) BTW, I don't believe it would be that bad, because there's a straightforward path of deprecation: First, std.range.front for narrow strings (and dchar, for consistency) can be marked as deprecated. The deprecation message can say: "Please specify .byCodePoint()/.byCodeUnit()", guiding users towards a better style (assuming one agrees that explicit is indeed better than implicit in this case). After some time, the functionality can be moved into a compatibility module, with the deprecated functions still there, but now additionally telling the user about the quick fix of importing that module. The deprecation period can be very long, and even if the functions should never be removed, at least everyone writing new code would do so in the new style.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: My app deals with Unicode Arabic text that is 'out there', and the Unicode™ support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points. So, my needs as a 'user' are:
* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string, it should be the number of code points.
* When I index my string, it should return the nth code point.
* When I manipulate my strings, I want to work with code points.
... you get the drift. Are you sure that code points are what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote: On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: My app deals with unicode arabic text that is 'out there', and the UnicodeTM support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points. So, my needs as a 'user' are: * I want to encode all incoming data immediately into unicode, usually UTF8, if isn't already. * I want to iterate over code points. I don't care about the raw data. * When I get the length of my string it should be the number of code points. * When I index my string it should return the nth code point. * When I manipulate my strings I want to work with code points ... you get the drift. Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters... I checked the terminology before posting so I'm pretty sure. Arabic has a code page for the logical characters, one code point for each letter of the alphabet and others for various diacritics. Because of the 'shaping' each logical character has various glyphs, found on other code pages. Text editing programs tend to store typed Arabic as the user entered it, and because there can be more than one diacritic per alphabetic letter the sequence varies as to how the user sequenced them.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote: On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote: My app deals with unicode arabic text that is 'out there', and the UnicodeTM support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points. So, my needs as a 'user' are: * I want to encode all incoming data immediately into unicode, usually UTF8, if isn't already. * I want to iterate over code points. I don't care about the raw data. * When I get the length of my string it should be the number of code points. * When I index my string it should return the nth code point. * When I manipulate my strings I want to work with code points ... you get the drift. Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters... Adding to my other comment I don't expect a string type to understand arabic and merge the diacritics for me. In fact there are other symbols (code points) that can also be present, for instance instructions on how Quranic text is to be read. These issues have not been standardised and I would say are not well understood generally.
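The gap between code points and user-perceived characters that this subthread circles around can be shown with std.uni.byGrapheme. This is a sketch of mine; the particular Arabic letter and vowel mark are just an illustrative pair:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // Arabic letter beh (U+0628) followed by the combining vowel mark
    // fatha (U+064E): two code points, one user-perceived character.
    string s = "\u0628\u064E";
    assert(s.walkLength == 2);            // iterating by code point
    assert(s.byGrapheme.walkLength == 1); // iterating by grapheme
}
```

Any code-point-level indexing or length will therefore disagree with what an Arabic reader would call "the nth character".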
Re: Major performance problem with std.array.front()
On 3/7/2014 8:40 AM, Michel Fortin wrote: On 2014-03-07 03:59:55 +, bearophile bearophileh...@lycos.com said: Walter Bright: I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.) On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage). The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness. Well, it is *more* correct, as many western languages are more likely in current Phobos to just work in most cases. It's just that things still aren't completely correct overall. From my experience, I'd suggest these basic operations for a string range instead of the regular range interface: .empty .frontCodeUnit .frontCodePoint .frontGrapheme .popFrontCodeUnit .popFrontCodePoint .popFrontGrapheme .codeUnitLength (aka length) .codePointLength (for dchar[] only) .codePointLengthLinear .graphemeLengthLinear Someone should be able to mix all three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser, for instance, I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character against an ASCII one such as '<' or '>'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation). If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants.
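A rough sketch of how such an interface could look for UTF-8 strings. The type name and implementation details are mine; only the operation names follow the post:

```d
import std.utf : decodeFront;

struct StringRange
{
    string data;

    @property bool empty() { return data.length == 0; }
    @property char frontCodeUnit() { return data[0]; }
    @property dchar frontCodePoint()
    {
        auto tmp = data;        // copy so the original is not consumed
        return tmp.decodeFront();
    }
    void popFrontCodeUnit() { data = data[1 .. $]; }
    void popFrontCodePoint()
    {
        data.decodeFront();     // decodeFront advances the range it is given
    }
    @property size_t codeUnitLength() { return data.length; }
}

void main()
{
    auto r = StringRange("é<x");
    assert(r.frontCodePoint == 'é');  // decoded when needed
    r.popFrontCodePoint();
    assert(r.frontCodeUnit == '<');   // no decoding for ASCII matching
}
```

Each call site states which level it operates on, which is exactly the explicitness the post argues for.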
Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done. I actually like this suggestion quite a bit.
Re: Major performance problem with std.array.front()
On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote: Yes. I have given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. So at what point are we going to discuss these things in the context of D-next? These topics have us group up and focus on compromises instead of ideals. As was said, D2 is at the 90% point. It only has room left for bug fixes. I think we would make much more productive use of our time and minds coming up with ideas that actually have a chance of coming to fruition, even if D3 ends up being half a decade away.
Re: Major performance problem with std.array.front()
On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
Re: Major performance problem with std.array.front()
On 3/10/2014 7:35 PM, Yota wrote: On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote: Yes. I have given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. So at what point are we going to discuss these things in the context of D-next? Not until (at least) the D2/Phobos implementations mature, the current issues get worked out, and the library/tool ecosystem grows and matures.
Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations. -Steve
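A small sketch (mine, not from the thread) of the distinction as it stands today: a fixed-size array can be initialized from a literal with runtime values without touching the heap, while a dynamic array literal allocates:

```d
void foo(int v, int w)
{
    int[2] x = [v, w]; // fixed-size: elements stored in place, no GC heap needed
    assert(x[0] == v && x[1] == w);

    int[] y = [v, w];  // dynamic array literal: allocates on the GC heap
    assert(y.length == 2 && y[0] == v);
}

void main()
{
    foo(3, 4);
}
```

The `int[2]` form is the workaround available when the allocation is unwanted.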
Re: Major performance problem with std.array.front()
On 3/10/14, 7:07 PM, Steven Schveighoffer wrote: On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. It actually can, seeing as x is a dead assignment :o). That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations. Fixing that: int[] foo(int v, int w) { return [v, w]; } This one would allocate. But analyses of varying complexity may eliminate a variety of allocation patterns. Andrei
Re: Major performance problem with std.array.front()
On Mon, 10 Mar 2014 22:56:22 -0400, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 3/10/14, 7:07 PM, Steven Schveighoffer wrote: On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright newshou...@digitalmars.com wrote: On 3/10/2014 6:47 AM, Dicebot wrote: (array literals that allocate, I will never forgive that). It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature. I think you forget about this: foo(int v, int w) { auto x = [v, w]; } Which cannot pre-allocate. It actually can, seeing as x is a dead assignment :o). Actually, it can't do anything, seeing as it's invalid code ;) That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations. Fixing that: int[] foo(int v, int w) { return [v, w]; } This one would allocate. But analyses of varying complexity may eliminate a variety of allocation patterns. I think you are missing what I'm saying, I don't want the allocation eliminated, but if we eliminate some allocations with [] and not others, it will be confusing. The path I'd always hoped we would go in was to make all array literals immutable, and make allocation of mutable arrays on the heap explicit. Adding eliding of some allocations for optimization is good, but I (and I think possibly Dicebot) think all array literals should not allocate. -Steve
Re: Major performance problem with std.array.front()
On 3/10/14, 8:05 PM, Steven Schveighoffer wrote: I think you are missing what I'm saying, I don't want the allocation eliminated, but if we eliminate some allocations with [] and not others, it will be confusing. The path I'd always hoped we would go in was to make all array literals immutable, and make allocation of mutable arrays on the heap explicit. Adding eliding of some allocations for optimization is good, but I (and I think possibly Dicebot) think all array literals should not allocate. I think so too. But that's irrelevant because arrays do allocate (at least behave as if they did) and that's how the cookie crumbles. D is a wonderful language, and is getting better literally by day. There is a lot more in using it in new and interesting ways, than in brooding about its inevitable imperfections. Andrei
Re: Major performance problem with std.array.front()
On 3/7/2014 6:33 PM, H. S. Teoh wrote: On Fri, Mar 07, 2014 at 11:13:50PM +, Sarath Kodali wrote: On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote: +1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit ddhrya http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search. - Sarath Oops, incomplete reply ... Since a single alphabet in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster. Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-( I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that by code unit is default. For better or worse, that ship has sailed. Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way: Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point. So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether, and work with graphemes. But that would be massive involvement. Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point, then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization, then neither by code unit, by code point, nor by grapheme is correct (except in certain language subsets). If you don't care about normalization, then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? To those who think the status quo is better: can you give an example of a real-life use case that demonstrates this? I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.
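A concrete instance of the normalization caveat being raised here, as a sketch using std.uni.normalize (the strings are illustrative):

```d
import std.uni : NFC, normalize;

void main()
{
    string a = "\u00E9";  // 'é' as a single precomposed code point
    string b = "e\u0301"; // 'e' followed by a combining acute accent
    // Identical to a reader, but unequal under code-point comparison:
    assert(a != b);
    // Only after normalization do the two compare equal:
    assert(normalize!NFC(a) == normalize!NFC(b));
}
```

So a code-point-level equality or search is only correct if both sides are known to be in the same normalization form.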
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote: I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that by code unit is default. For better or worse, that ship has sailed. Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way: Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point. So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :) I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but after thinking about the problem for a while I would agree on a solution along the lines of what you have suggested. I think Vladimir is definitely right when he's saying that when you have algorithms that deal with natural languages, simply working on the basis of a code unit isn't enough. I think it is also true that you need to select a particular algorithm for dealing with strings of characters, as there are many different algorithms you can use for different languages which behave differently, perhaps several in a single language. I also think Andrei is right when he is saying we need to minimise code breakage, and that the string decoding and encoding by default isn't the biggest of performance problems. I think our best option is to offer a function which creates a range in std.array for getting a range over raw character data, without decoding to code points.
myArray.someAlgorithm; // std.array.front used today, with decode calls
myArray.rawData.someAlgorithm; // new range which doesn't decode
Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.
myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // range of strings, maybe a range of ranges of characters, not dchars
Or even specialise the new algorithm so it looks for arrays and turns them into the ranges for you via the transformation myArray -> myArray.rawData:
myArray.byNaturalSymbol!SomeIndianEncodingHere;
Honestly, I'd leave the details of such an algorithm to Vladimir and not myself, because he's spent far more time looking into Unicode processing than I have. My knowledge of Unicode pretty much just comes from having to deal with foreign-language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.) This new set of algorithms taking settings for different encodings could first be implemented in a third-party library, tested there, and eventually submitted to Phobos, probably in std.string. There's my input, I'll duck before I'm beheaded.
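The hypothetical rawData above is close to what std.string.representation already provides: a view of the underlying code units that generic algorithms will not auto-decode. A sketch of mine:

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    string s = "héllo";
    assert(s.walkLength == 5);                // auto-decoded: 5 code points
    assert(s.representation.walkLength == 6); // raw ubyte[]: 6 UTF-8 code units
    assert(s.length == 6);                    // .length already counts code units
}
```

Algorithms applied to `s.representation` see plain bytes and pay no decoding cost.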
Re: Major performance problem with std.array.front()
- In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring. With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. For the other cases, ubyte[] is there.
Re: Major performance problem with std.array.front()
On 09/03/14 04:26, Andrei Alexandrescu wrote: 2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF-16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point one can just use str.byChar. This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)? Unit. s.byChar.front is a (possibly ref, possibly qualified) char. So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? In which case it seems to me a better solution -- safe strings by default, with an unsafe speed-focused solution available if you want it. ("Safe" here in the more general sense of "doesn't generate unexpected errors" rather than memory safety.)
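For reference, std.utf nowadays provides ranges along exactly these lines; a sketch of how they behave:

```d
import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar;

void main()
{
    string s = "héllo"; // 'é' is two UTF-8 code units
    assert(s.walkLength == 5);         // default: decoded code points
    assert(s.byChar.walkLength == 6);  // raw UTF-8 code units, no decoding
    assert(s.byWchar.walkLength == 5); // on-the-fly transcoding to UTF-16
    assert(s.byDchar.walkLength == 5); // explicit decoding to dchar
}
```

Iterating via byChar is the decoding-free path being discussed here.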
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote: On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether, and work with graphemes. But that would be massive involvement. Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point, then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization, then neither by code unit, by code point, nor by grapheme is correct (except in certain language subsets). If you don't care about normalization, then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. IMO, the normalization argument is overrated. I've yet to encounter a real-world case of a normalization problem: only hand-written counter-examples. Not saying it doesn't exist, just that: 1. It occurs only in special cases that the program should be aware of beforehand. 2. Arguably, it should be taken care of eagerly, or in a special pass. As for "the belief that iterating by code point has utility": I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.
AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to. AFAIK, the most common algorithm, case-insensitive search, *must* decode. There may still be cases where it does not work as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating by code units. To turn it the other way around: *what* are you guys doing that doesn't require decoding, and where performance is such a killer? To those who think the status quo is better: can you give an example of a real-life use case that demonstrates this? I do not know of a single bug report in regards to buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs from attempting to not decode strings, or from incorrectly decoding strings. They are being corrected on a continuous basis. Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least* the result is not a corrupted UTF-8 stream. Walter keeps grinding on about myCharArray.put('é') not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work. In particular, in all these cases, a simple call to representation will deactivate the feature, giving you the tools you want.
I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page. Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding. I think opt-out of decoding is just a much much much saner approach to string handling.
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 04:11:15 UTC, Nick Sabalausky wrote: What about this?: Anywhere we currently have a front() that decodes, such as your example: @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[])) { assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof); size_t i = 0; return decode(a, i); } We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange. Then we provide two functions: auto decode(someStringProtoRange) {...} auto raw(someStringProtoRange) {...} These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine decode/raw would probably also handle any length property (if it exists in the protorange) accordingly. This way, the user is forced to specify myStringRange.decode or myStringRange.raw as appropriate; otherwise myStringRange can't be used, since it isn't technically a range, only a protorange. (Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.) Strings can be iterated over by code unit, code point, grapheme, grapheme cluster (?), words, sentences, lines, paragraphs, and potentially other things. Therefore, it makes sense to require the same for ranges of dchar, too. Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni.
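A condensed sketch of the 'decode' half of this proposal; the names follow the post, but the implementation details are mine ('raw' here simply exposes the code units):

```d
import std.utf : decode;

struct Decoded
{
    string s; // the "protorange" payload

    @property bool empty() { return s.length == 0; }
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i); // decode one code point without consuming it
    }
    void popFront()
    {
        size_t i = 0;
        decode(s, i);        // find the code point's width in code units
        s = s[i .. $];
    }
}

auto decoded(string s) { return Decoded(s); }
auto raw(string s) { return cast(immutable(ubyte)[]) s; }

void main()
{
    auto r = "é!".decoded;
    assert(r.front == 'é');
    r.popFront();
    assert(r.front == '!');
    assert("é!".raw.length == 3); // 'é' is two code units, plus '!'
}
```

Until one of the two adapters is applied, the payload offers no front/popFront, so generic range algorithms refuse to touch it - which is the point.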
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to much the already existing `byGrapheme` in std.uni. There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings). `byCodeUnit` is essentially std.string.representation.
Re: Major performance problem with std.array.front()
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote: The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others. If I'm operating on a right-to-left language like Hebrew, what would I expect the result to be from something like countUntil? The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei I'm pretty sure that all string operations are actually front to back. If I recall correctly, even languages that read right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of display, and changes nothing in the code. As for countUntil, it would still work perfectly fine, as an RTL reader would expect the counting to start at the beginning, e.g. the right side. I'm pretty confident RTL is 100% supported. The only issue is the front/left ambiguity, and the only one I know of is the oddly named stripLeft function, which actually does a stripFront anyway. So I wouldn't worry about RTL. But as mentioned, it is languages like Indian ones, that have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e'). On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether, and work with graphemes. But that would be massive involvement.
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote: On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev Can we look at some example situations that this will break? Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string? This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
Re: Major performance problem with std.array.front()
On 2014-03-09 13:00:45 +, monarch_dodra monarchdo...@gmail.com said: AFAIK, the most common algorithm case insensitive search *must* decode. Not necessarily. While the unicode collation algorithms (which should be used to compare text) are defined in terms of code points, you could build a collation element table using code units as keys and bypass the decoding step for searching the table. I'm not sure if there would be a significant performance gain though. That remains an optimization though. The natural way to implement a Unicode algorithm is to base it on code points. -- Michel Fortin michel.for...@michelf.ca http://michelf.ca
Re: Major performance problem with std.array.front()
On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote: On Fri, Mar 07, 2014 at 10:35:46PM +, Sarath Kodali wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote: [...] Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization. So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit. +1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit ddhrya http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search. [...] That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said character may be a multibyte code point, or, in your case, multiple code points. Since that's the case, we might as well just forget about the distinction between character and string, and treat all such operations as substring operations (even if the operand is supposedly just 1 character long). This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding. That won't work, because your needle might be in a different normalization form than your haystack, thus a byte-by-byte comparison will not be able to find it.
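The Sanskrit ddhrya example above can be made concrete. A Python sketch (the composition shown is my assumption of the conjunct's code points: da, virama, dha, virama, ra, virama, ya) showing that one "character" is seven code points, so finding it is necessarily a substring search:

```python
# ddhrya conjunct: assumed composition da + virama + dha + virama + ra + virama + ya
ddhrya = "\u0926\u094D\u0927\u094D\u0930\u094D\u092F"
assert len(ddhrya) == 7            # one perceived character, seven code points

# Searching for this "character" is a substring search, not a char search
text = "XX" + ddhrya + "YY"
assert ddhrya in text
assert text.find(ddhrya) == 2
```

This supports the point that character search and substring search collapse into the same operation once multi-code-point characters are in play.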
Re: Major performance problem with std.array.front()
On 2014-03-09 14:12:28 +, Marc Schütz schue...@gmx.net said: That won't work, because your needle might be in a different normalization form than your haystack, thus a byte-by-byte comparison will not be able to find it. The core of the problem is that sometimes this byte-by-byte comparison is exactly what you want; when searching for some terminal character(s) in some kind of parser for instance. Other times you want to do a proper Unicode search using Unicode comparison algorithms; when the user is searching for a particular string in a text document for instance. The former is very easy to do with the current API. But what's the API for the latter? And how to make the correct API the obvious choice depending on the use case? These two questions are what this thread should be about. Although not unimportant, performance of std.array.front() and whether it should decode is a secondary issue in comparison. -- Michel Fortin michel.for...@michelf.ca http://michelf.ca
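The needle-vs-haystack normalization mismatch is straightforward to show. A minimal Python sketch (illustrative only; a real Unicode search would also need collation and locale handling, which this deliberately omits; `unicode_find` is a hypothetical helper name): normalizing both sides to the same form before comparing makes the byte-level search find what a raw comparison misses.

```python
import unicodedata

haystack = "caf\u00e9"        # "café" with precomposed é (NFC)
needle = "e\u0301"            # "é" typed as e + combining acute (NFD)

# A raw code-unit/code-point comparison misses the match entirely
assert needle not in haystack

# Hypothetical normalization-aware search: normalize both sides first
def unicode_find(haystack, needle, form="NFC"):
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle))

assert unicode_find(haystack, needle) == 3   # finds the é
```

This is the "proper Unicode search" half of the dichotomy; the raw byte comparison remains the right tool for the parser-delimiter case.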
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote: IMO, the normalization argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that: 1. It occurs only in special cases that the program should be aware of beforehand. 2. Arguably, it should be taken care of eagerly, or in a special pass. As for the belief that iterating by code point has utility. I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. We don't handle code points (when have you ever wanted to handle a combining character separately from the character it combines with?) You are just thinking of a subset of languages and locales. Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to. Searching, equality testing, copying, sorting, hashing, splitting, joining... I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default). AFAIK, the most common algorithm, case-insensitive search, *must* decode.
But it must also normalize and take locales into account, so by code point is insufficient (unless you are willing to ignore languages like Turkish). See Turkish I. http://en.wikipedia.org/wiki/Turkish_I Sure, if you just want to ignore normalization and several languages then by code point is just fine... but that's the point: by code point is incorrect in general. There may still be cases where it is not working as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating with code units. To turn it the other way around, *what* are you guys doing, that doesn't require decoding, and where performance is such a killer? Searching, equality testing, copying, sorting, hashing, splitting, joining... The performance thing can be fixed in the library, but my concern is (a) it takes a significant amount of code to do so (b) it complicates implementations. There are many, many algorithms in Phobos that are special-cased for strings, and I don't think it needs to be that way. To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs from attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis. Can you provide a link to a bug? Also, you haven't answered the question :-) Can you give a real-life example of a case where code point decoding was necessary where code units wouldn't have sufficed? You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
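The Turkish I problem referenced above can be sketched quickly. A Python illustration (chosen only because the casing tables are language-independent; this is not a claim about any D API): locale-blind lowercasing maps I to i, which is wrong for Turkish, and even locale-independent full case mapping is not one code point to one code point.

```python
# English/ASCII casing: I <-> i
assert "I".lower() == "i"

# Turkish has dotless ı (U+0131) and dotted İ (U+0130); in a Turkish
# locale, lower('I') should be 'ı' -- locale-blind lowering gets it wrong
assert "I".lower() != "\u0131"

# Full case mapping is not even 1:1 per code point:
# İ (U+0130) lowercases to 'i' followed by U+0307 (combining dot above)
assert "\u0130".lower() == "i\u0307"
assert len("\u0130".lower()) == 2
```

So a case-insensitive search that walks one code point at a time and compares lowered code points is already incorrect for some languages, independent of the code-unit vs code-point question.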
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote: On 3/8/14, 8:24 PM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote: What exactly is the consensus? From your wiki page I see One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type. I can tell you straight out: That will not happen for as long as I'm working on D. Why? From the cycle going in circles: because I think the breakage is way too large compared to the alleged improvement. All right. I was wondering if there was something more fundamental behind such an ultimatum. In fact I believe that that design is inferior to the current one regardless. I was hoping we could come to an agreement at least on this point. --- BTW, a thought struck me while thinking about the problem yesterday. char and dchar should not be implicitly convertible between one another, or comparable to the other. void main() { string s = "Привет"; foreach (c; s) assert(c != 'Ñ'); } Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type. This would break more[1] code, but it would avoid the silent failures of the earlier proposal. [1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:47:26 UTC, Marc Schütz wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Andrei has made it clear that the code breakage this would involve would be unacceptable.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote: On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote: On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev Can we look at some example situations that this will break? Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string? This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead. Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote: - In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring. With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there. This is an arbitrary self-imposed limitation caused by the choice of how strings are handled in Phobos.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote: The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others. If I'm operating on a right to left language like Hebrew, what would I expect the result to be from something like countUntil? The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei I'm pretty sure that all string operations are actually front to back. If I recall correctly, even languages that read right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of display, and changes nothing in the code. As for countUntil, it would still work perfectly fine, as an RTL reader would expect the counting to start at the beginning, i.e. the right side. I'm pretty confident RTL is 100% supported. The only issue is the front/left ambiguity, and the only one I know of is the oddly named stripLeft function, which actually does a stripFront anyway. So I wouldn't worry about RTL. Yeah, I think RTL strings are preceded by a code point that indicates RTL display. It was just something I mentioned because some operations might be confusing to the programmer. But as mentioned, it is languages like the Indian ones, which have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e'). True. I still question why anyone would want to do character-based operations on Unicode strings. I guess substring searches could even end up with the same problem in some cases if not implemented specifically for Unicode, for the same reason, but those should be far less common.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote: On topic, I think D's implicit default decode to dchar is *infinitely* better than C++'s char-based strings. While imperfect in terms of graphemes, it was still a design decision made of win. Care to elaborate? I'd be tempted not to ask how do we back out, but rather, how can we take this further? I'd love to ditch the whole char/dchar thing altogether and work with graphemes. But that would require massive involvement. As has been discussed, this does not make sense. Graphemes are also a concept which applies only to certain writing systems; all it would do is exchange one set of tradeoffs for another, without solving anything. Text isn't that simple.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote: As for the belief that iterating by code point has utility. I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. But you don't deal with Unicode. You deal with *text*. Unless you are implementing Unicode algorithms, code points solve nothing in the general case. Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Sorting a string has quite limited use in the general case, so I think this is another artificial example. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least* the result is not a corrupted UTF-8 stream. I think this is no worse than having all the combining marks clustered at the end of the string, attached to the last non-combining letter.
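The corrupted-UTF-8-stream point above is worth seeing concretely. A Python sketch (again used only because the byte-level behavior is language-independent; the string is a made-up example): sorting the raw code units of a multi-byte string tears an encoded character apart and yields invalid UTF-8, whereas sorting by code point at least keeps the result well-formed, even if graphemes get scrambled.

```python
s = "cass\u00e9"                       # "cassé" in NFC; é encodes as 2 UTF-8 bytes

# Sorting the UTF-8 code units separates é's lead and continuation bytes
scrambled = bytes(sorted(s.encode("utf-8")))
try:
    scrambled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
assert not still_valid                 # no longer valid UTF-8

# Sorting by code point keeps the text well-formed (if semantically dubious)
by_code_point = "".join(sorted(s))
assert by_code_point.encode("utf-8").decode("utf-8") == by_code_point
```

This is the asymmetry being argued: code-point iteration can be semantically wrong, but code-unit mutation can be structurally wrong.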
Re: Major performance problem with std.array.front()
Vladimir Panteleev: Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Sorting a string has quite limited use in the general case, It seems I am sorting arrays of mutable ASCII chars often enough :-) Some time ago I even asked for a helper function: https://d.puremagic.com/issues/show_bug.cgi?id=10162 Bye, bearophile
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 16:02:55 UTC, bearophile wrote: Vladimir Panteleev: Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Sorting a string has quite limited use in the general case, It seems I am sorting arrays of mutable ASCII chars often enough :-) What do you use this for? I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code.
Re: Major performance problem with std.array.front()
Vladimir Panteleev: What do you use this for? For lots of different reasons (counting, testing, histograms, to unique-ify, to allow binary searches, etc), you can find alternative solutions for every one of those use cases. I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code. So far I have needed to sort 7-bit ASCII chars. Bye, bearophile
Re: Major performance problem with std.array.front()
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: On 09/03/14 04:26, Andrei Alexandrescu wrote: 2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point can just use str.byChar. This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)? Unit. s.byChar.front is a (possibly ref, possibly qualified) char. So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 4:34 AM, Peter Alexander wrote: I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o). If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? split(ter) comes to mind. I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page. Awesome. Andrei
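The "counting characters" dispute hinges on the fact that each level of abstraction gives a different count for the same text. A small Python sketch (illustrative only; the grapheme count is stated as a comment because stdlib Python, like the Phobos of this thread, has no UAX #29 grapheme segmenter built in):

```python
import unicodedata

# é in decomposed (NFD) form: 'e' followed by U+0301 combining acute
e_nfd = unicodedata.normalize("NFD", "\u00e9")

assert len(e_nfd.encode("utf-8")) == 3   # 3 code units (UTF-8 bytes)
assert len(e_nfd) == 2                   # 2 code points
# 1 grapheme -- but counting graphemes requires a UAX #29
# segmentation implementation, which the stdlib does not provide
```

Whichever count a "length" function returns, it answers only one of three different questions, which is why the by-code-unit/by-code-point/by-grapheme split keeps coming up in this thread.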
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
Re: Major performance problem with std.array.front()
On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote: On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote: 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread. Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else. Such as giving up on that crappy language that keeps on breaking their code. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 6:34 AM, Jakob Ovrum wrote: On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings). noice `byCodeUnit` is essentially std.string.representation. Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte. Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 15:23:57 UTC, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote: On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote: On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev Can we look at some example situations that this will break? Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string? This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead. Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array. This was under the assumption that Nick's proposal (and my amendment to extend it to dchar because of graphemes e.a.) would be implemented. But I made the mistake of replying to posts as I read them, just to notice a few posts later that someone else already posted something to the same effect, or that made my point irrelevant. Sorry for the confusion.
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu wrote: On 3/9/14, 4:34 AM, Peter Alexander wrote: I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o). It depends what you mean by cover :-) If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? I can't think of any case where you would want to count characters. * If you want an index to slice from, then you need code units. * If you want a buffer size, then you need code units. 
* If you are doing something like word wrapping then you need to count glyphs, which is not the same as counting code points (and that only works with mono-spaced fonts anyway -- with variable-width fonts you need to add up the widths of those glyphs) To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? split(ter) comes to mind. splitter is just an application of substring search, no? Substring search works the same with both code units and code points (e.g. strstr in C works with UTF-encoded strings without any need to decode). All you need to do is ensure that mismatched encodings in the delimiter are re-encoded (you want to do this for performance anyway): auto splitter(string str, dchar delim) { char[4] enc; return splitter(str, enc[0..encode(enc, delim)]); }
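The encode-the-needle trick in the splitter snippet above rests on UTF-8 being self-synchronizing: an encoded needle can never match the middle of another character. A Python sketch of the same idea (hypothetical example strings; Python's bytes operations stand in for D's code-unit-level search):

```python
# Re-encode the single-character delimiter, then search/split raw UTF-8
# bytes with no decoding of the haystack at all.
haystack = "h\u00e9llo w\u00f6rld".encode("utf-8")   # "héllo wörld"
needle = "\u00f6".encode("utf-8")                    # "ö" as UTF-8 bytes

# UTF-8 is self-synchronizing: no false positive can start mid-character
assert haystack.find(needle) != -1

parts = haystack.split(needle)
assert len(parts) == 2
assert parts[0].decode("utf-8") == "h\u00e9llo w"
assert parts[1].decode("utf-8") == "rld"
```

This is why substring-style operations need no per-element decoding, which is the performance argument being made.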
Re: Major performance problem with std.array.front()
On 3/9/14, 9:02 AM, bearophile wrote: Time ago I have even asked for a helper function: https://d.puremagic.com/issues/show_bug.cgi?id=10162 I commented on that and preapproved it. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar. Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 10:34 AM, Peter Alexander wrote: If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. But others such as edit distance or equal(some_string, some_wstring) will not. If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c = c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? I can't think of any case where you would want to count characters. wc (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: wc What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical. (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) I agree, which is why I think that although such arguments are not unwelcome, it's much better to find out by experiment. Break something in Phobos and see how much of your code is affected :)
Re: Major performance problem with std.array.front()
09-Mar-2014 21:45, Andrei Alexandrescu wrote: On 3/9/14, 10:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar. Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. copy to begin with. And it's about 80x faster with plain arrays. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
09-Mar-2014 21:16, Andrei Alexandrescu wrote: On 3/9/14, 4:34 AM, Peter Alexander wrote: I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. Was clearly meant to be: code point -- code unit One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. Code points help only in so far that many (~all) high-level algorithms in Unicode are described in terms of code points. Code points have properties; code units do not have anything. Code points with assigned semantic value are abstract characters. It's up to the programmer to implement a particular algorithm to make it as if decoding really happened, working directly on code units, or to do decoding and work with code points, which is simpler. Current std.uni offerings mostly work on code points and decode; a crucial building block to work directly on code units is in review: https://github.com/D-Programming-Language/phobos/pull/1685 I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o). Technically most apps just assume, say, that input comes in UTF-8 in normalization form C. Others, such as browsers, strive to get a uniform representation of any input, and do normalization of any input (oftentimes normalization turns out to be just a no-op). 
If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? What happened to counting characters and such? Counting chars is dubious. But, for instance, collation is defined in terms of code points. Regex pattern matching is _defined_ in terms of code points (even the mystical level 3 Unicode support of it). So there is certain merit to working at that level. But hacking it to be this way isn't the way to go. The least intrusive change would be to generalize the current choice w.r.t. RA ranges of char/wchar. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 10:34 AM, Peter Alexander wrote: If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. But others such as edit distance or equal(some_string, some_wstring) will not. equal(string, wstring) should either not compile, or would be overloaded to do the right thing. In an ideal world, char, wchar, and dchar should not be comparable. Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is \r\n really two edits? What is an edit?) I can't think of any case where you would want to count characters. wc % echo € | wc -c 4 :-) (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work. Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire.
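The `wc -c` quip above is worth unpacking: `wc -c` counts bytes, i.e. UTF-8 code units, not characters. The euro sign is one code point but three code units, and `echo` appends a newline, which is where the 4 comes from. A quick Python check of the same arithmetic (Python only for illustration):

```python
# The euro sign U+20AC: one code point, three UTF-8 code units.
euro = "\u20ac"
assert len(euro) == 1                           # code points
assert len(euro.encode("utf-8")) == 3           # UTF-8 code units
# echo appends '\n', so `echo € | wc -c` reports 4 bytes.
assert len((euro + "\n").encode("utf-8")) == 4
```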
Re: Major performance problem with std.array.front()
On 3/9/14, 8:18 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote: On 3/8/14, 8:24 PM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote: What exactly is the consensus? From your wiki page I see One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type. I can tell you straight out: That will not happen for as long as I'm working on D. Why? From the cycle going in circles: because I think the breakage is way too large compared to the alleged improvement. All right. I was wondering if there was something more fundamental behind such an ultimatum. It's just factual information with no drama attached (i.e. I'm not threatening to leave the language, just plainly explaining I'll never approve that particular change). That said, a larger explanation is in order. There have been cases in the past when our community has worked itself into a froth over a non-issue and ultimately caused a language change imposed by the faction that shouted the loudest. The lazy keyword and recently the virtual keyword come to mind as cases in which the language leadership has been essentially annoyed into making a change it didn't believe in. I am all about listening to the community's needs and desires. But at some point there is a need to stick to one's guns in matters of judgment call. See e.g. https://d.puremagic.com/issues/show_bug.cgi?id=11837 for a very recent example in which reasonable people may disagree but at some point you can't choose both options. What we now have works as intended. As I mentioned, there is quite a bit more evidence that the design is useful to people than detrimental. Unicode is all about code points. Code units are incidental to each encoding. The fact that we recognize code points at language and library level is, in my opinion, a Good Thing(tm).
I understand that doesn't reach the ninth level of Nirvana and there are still issues to work on, and issues where good-looking code is actually incorrect. But I think we're overall in good shape. A regression from that to code unit level would be very destructive. Even a clear slight improvement that breaks backward compatibility would be destructive. So I wanted to limit the potential damage of this discussion. It is made only more dangerous by the fact that Walter himself started it, something that others didn't fail to tune into. The sheer fact that we got to contemplate an unbelievably massive breakage on no other evidence than one misuse case and for the sake of a possibly illusory improvement - that's a sign we need to grow up. We can't go about changing the language like this and aim to play in the big leagues. In fact I believe that that design is inferior to the current one regardless. I was hoping we could come to an agreement at least on this point. Sorry to disappoint. --- BTW, a thought struck me while thinking about the problem yesterday. char and dchar should not be implicitly convertible between one another, or comparable to the other. I think only the char -> dchar conversion works, and I can see arguments against it. Also comparison of char with dchar is dicey. But there are also cases in which it's legitimate to do that (e.g. assign ASCII chars etc) and this would be a breaking change. One good way to think about breaking changes is: if this change were executed to perfection, how much would that improve the overall quality of D? Because breakages _are_ overall - users don't care whether they come from this or the other part of the type system. Really puts things into perspective.
void main() { string s = "Привет"; foreach (c; s) assert(c != 'Ñ'); } Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type. This would break more[1] code, but it would avoid the silent failures of the earlier proposal. [1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change. Generally I think we should steer away from slight improvements of the language at the cost of breaking existing code. Instead, we must think of ways to improve the language without the breakage. You may want to pursue (bugzilla + pull request) adding the std.conv routines with the semantics you mentioned. Andrei
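The foreach example above is the crux of the char/dchar comparison problem: 'Ñ' is U+00D1, and the raw byte 0xD1 also occurs as a UTF-8 lead byte inside the Cyrillic word, so a code-unit-level scan reports a match that does not exist at the code-point level. The same effect, sketched in Python over the encoded bytes (illustration only, not the D semantics themselves):

```python
# 'Ñ' is U+00D1; the byte 0xD1 also appears inside the UTF-8
# encoding of "Привет" (it is the lead byte of 'р', U+0440).
text = "Привет"
units = text.encode("utf-8")
assert ord("Ñ") == 0xD1
assert 0xD1 in units        # spurious "match" at the code-unit level
assert "Ñ" not in text      # no match at the code-point level
```

This is why comparing a code unit with a dchar silently does the wrong thing rather than failing to compile.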
Re: Major performance problem with std.array.front()
On 3/9/14, 11:14 AM, Dmitry Olshansky wrote: 09-Mar-2014 21:45, Andrei Alexandrescu wrote: On 3/9/14, 10:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote: So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? That is correct. Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar. Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. copy to begin with. And it's about 80x faster with plain arrays. Question is if there are a bunch of them. Andrei
Re: Major performance problem with std.array.front()
On 3/9/14, 11:19 AM, Peter Alexander wrote: On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: On 3/9/14, 10:34 AM, Peter Alexander wrote: If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points. But others such as edit distance or equal(some_string, some_wstring) will not. equal(string, wstring) should either not compile, or would be overloaded to do the right thing. These would be possible designs each with its pros and cons. The current design works out of the box across all encodings. It has its own pros and cons. Puts in perspective what should and shouldn't be. In an ideal world, char, wchar, and dchar should not be comparable. Probably. But that has nothing to do with equal() working. Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is \r\n really two edits? What is an edit?) Nothing philosophical - it's as cut and dried as it gets. An edit is as defined by the Levenshtein algorithm using code points as the unit of comparison. I can't think of any case where you would want to count characters. wc % echo € | wc -c 4 :-) Noice. (Generally: I've always been very very very doubtful about arguments that start with I can't think of... because I've historically tried them so many times, and with terrible results.) Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work. That's a good enhancement for the current design as well - care to submit a request for it? 
Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire. The part that advocates a breaking change will not indeed lead anywhere. The parts where we improve Unicode support for D is very fertile. Andrei
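The edit-distance point above ("an edit is as defined by the Levenshtein algorithm using code points as the unit of comparison") is easy to make concrete: the same two strings get a different distance depending on whether the unit of comparison is a code point or a UTF-8 code unit. A Python sketch with a textbook Levenshtein implementation (illustration only):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance over any two sequences
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

a, b = "cass\u00e9", "casse"   # "cassé" (precomposed é) vs "casse"
assert levenshtein(a, b) == 1                                   # by code point
assert levenshtein(a.encode("utf-8"), b.encode("utf-8")) == 2   # by code unit
```

So the unit of comparison genuinely changes the answer, which is the sense in which code-point-level definitions carry real semantic weight.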
Re: Major performance problem with std.array.front()
09-Mar-2014 07:53, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote: I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion about what exactly we are counting here. This was debated... people should not be looking at individual code points, unless they really know what they're doing. Should they be looking at code units instead? No. They should only be looking at substrings. This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that. Basically the whole thread is about: how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done? The current situation is bad in that it undermines writing decode-less generic code. One easily falls into the auto-decode trap on the first .front, especially when called from some standard algorithm. The algo sees char[]/wchar[] and gets into decode mode via some special case. If it would do that with _all_ char/wchar random access ranges it'd be at least consistent. That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear. -- Dmitry Olshansky
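The "encode the needle, don't decode the haystack" idea above can be sketched in a few lines. Python shown purely to illustrate the encoding-level trick, not any Phobos API:

```python
# To find a dchar needle in UTF-8 text without decoding the haystack:
# encode the needle once, then do a plain substring search over the
# code units.
haystack = "cass\u00e9 encore".encode("utf-8")
needle = "\u00e9".encode("utf-8")
assert needle == b"\xc3\xa9"
assert needle in haystack
```

This works because UTF-8 is self-synchronizing: the encoded needle can never match starting in the middle of another character's byte sequence, so the byte-level search is exactly as correct as the decoded one.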
Re: Major performance problem with std.array.front()
09-Mar-2014 21:54, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote: wc What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical. Technically it could use a word-breaking algorithm for words. Or count grapheme clusters, or count code points; it all may have value, depending on the user and writing system. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote: 09-Mar-2014 07:53, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote: I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion about what exactly we are counting here. This was debated... people should not be looking at individual code points, unless they really know what they're doing. Should they be looking at code units instead? No. They should only be looking at substrings. This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that. That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points. Basically the whole thread is about: how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done? The current situation is bad in that it undermines writing decode-less generic code. s/undermines writing/makes writing explicit/ One easily falls into the auto-decode trap on the first .front, especially when called from some standard algorithm. The algo sees char[]/wchar[] and gets into decode mode via some special case. If it would do that with _all_ char/wchar random access ranges it'd be at least consistent. That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear. We're engineers so we should quantify. Ideally that would be as simple as git grep isNarrowString|wc -l which currently prints 42 of all numbers :o). Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation. Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 14:57:32 UTC, Peter Alexander wrote: You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account. I don't understand what your argument is. Is it "by code point is not 100% correct, so let's just drop it and go for raw code units instead"? We *are* arguing about whether or not front/popFront should decode by dchar, right...? You mention the algorithms "Searching, equality testing, copying, sorting, hashing, splitting, joining..." I said by codepoint is not correct, but I still think it's a hell of a lot more accurate than by codeunit. Unless you want to ignore any and all algorithms that take a predicate? You say "unless you are willing to ignore languages like Turkish", but... If you don't decode front, then aren't you just ignoring *all* languages that basically aren't English? As I said, maybe by codepoint is not correct, but if it isn't, I think we should be moving further *into* the correct behavior by default, not away from it.
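The Turkish point deserves a concrete example: case mapping is neither a per-code-point substitution nor locale-independent. A Python sketch of two well-known cases (illustration of the Unicode rules, not of any D API):

```python
# German 'ß' case-folds to the two-character string "ss", so a folded
# string can be longer than the original.
assert "\u00df".casefold() == "ss"

# Turkish dotted capital 'İ' (U+0130): under the default (non-Turkish)
# rules it lowercases to 'i' plus U+0307 COMBINING DOT ABOVE -- two
# code points. A correct Turkish-locale mapping would give plain 'i'.
assert "\u0130".lower() == "i\u0307"
assert len("\u0130".lower()) == 2
```

So even flawless code-point iteration is not enough for case-insensitive matching; you additionally need the full case-folding tables and, for some languages, locale data.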
Re: Major performance problem with std.array.front()
09-Mar-2014 22:41, Andrei Alexandrescu wrote: On 3/9/14, 11:34 AM, Dmitry Olshansky wrote: This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that. That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points. Yup. It's still not a good idea to introduce this in std.algorithm in a non-generic way. That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear. We're engineers so we should quantify. Ideally that would be as simple as git grep isNarrowString|wc -l which currently prints 42 of all numbers :o). Add to that some uses of isSomeString and ElementEncodingType. 138 and 80 respectively. And in most cases it means that nice generic code was hacked to care about 2 types in particular. That is what bothers me. Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation. Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation. 1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a narrow string. Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now. 2. Likewise representation must be made something more explicit say byCodeUnit and work on any isNarrowString per above. The opposite of that is byCodePoint. 3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe? 4. We lack lots of good stuff from the Unicode standard. Some recently landed in std.uni. We need many more, and to deprecate the crappy ones in std.string (e.g. wrapping text is one). 5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16.
That together with 1, should IMHO solve most of our problems. 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote: Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation. Now you're talking. 1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a narrow string. Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now. Wait, why is dchar[] a narrow string? 2. Likewise representation must be made something more explicit say byCodeUnit and work on any isNarrowString per above. The opposite of that is byCodePoint. Fine. 3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe? We're stuck with that name. 4. We lack lots of good stuff from Unicode standard. Some recently landed in std.uni. We need many more, and deprecate crappy ones in std.string. (e.g. wrapping text is one) Add away. 5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16. That together with 1, should IMHO solve most of our problems. Great! 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei
Re: Major performance problem with std.array.front()
On Sunday, 9 March 2014 at 19:40:32 UTC, Andrei Alexandrescu wrote: 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei When I've wanted to write code especially for ASCII, I think it hasn't been for use in generic algorithms anyway. Mostly it's stuff for manipulating segments of memory in a particular way, as seen here in my library, which does some work to generate D code. https://github.com/w0rp/dsmoke/blob/master/source/smoke/string_util.d#L45 Anything else would be something like running through an algorithm and then copying data into a new array or similar, and that would miss the point. When it comes to generic algorithms and ASCII I think UTF-x is sufficient.
Re: Major performance problem with std.array.front()
09-Mar-2014 23:40, Andrei Alexandrescu wrote: On 3/9/14, 12:25 PM, Dmitry Olshansky wrote: Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation. Now you're talking. 1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a narrow string. Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now. Wait, why is dchar[] a narrow string? Indeed `...entity of char/wchar/dchar` -> `...entity of char/wchar`. 3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe? We're stuck with that name. Too bad, but we have renamed imports... if only they worked correctly. But let's not derail. [snip] Great, so this may be turned into a smallish DIP or bugzilla enhancements. 6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc. Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost He certainly doesn't have things like case-insensitive matching or collation on his list. Some cute tables are what the UTF algorithms directly require for almost anything beyond simple-minded "find me a substring". Walter certainly would have a different stance the moment he observes the extra bulk of object code for these. (that can be avoided) How? I'm not talking about `x < 0x80` branches, those wouldn't cost a dime. I really don't feel strongly about the 6th point. I see it as a good idea to allow custom alphabets and reap performance benefits where it makes sense; the need for that is less urgent though. and that we should go farther into the future instead of catering to an obsolete representation. That is something I agree with. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On 3/9/2014 1:26 PM, Andrei Alexandrescu wrote: On 3/9/14, 6:34 AM, Jakob Ovrum wrote: `byCodeUnit` is essentially std.string.representation. Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte. Probably because char *is* D's type for UTF-8 code units.
Re: Major performance problem with std.array.front()
On 3/9/2014 11:21 AM, Vladimir Panteleev wrote: On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote: - In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring. With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there. This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos. Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.
Re: Major performance problem with std.array.front()
On 3/8/2014 9:15 PM, Michel Fortin wrote: Text is an interesting topic for never-ending discussions. It's also a good example for when non-programmers are surprised to hear that I *don't* see the world as binary black and white *because* of my programming experience ;) Problems like text-handling make it [painfully] obvious to programmers that reality is shades-of-grey - laymen don't often expect that!
Re: Major performance problem with std.array.front()
On 3/9/2014 7:47 AM, w0rp wrote: My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.) Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)
Re: Major performance problem with std.array.front()
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
Re: Major performance problem with std.array.front()
On 3/9/2014 6:34 AM, Jakob Ovrum wrote: `byCodeUnit` is essentially std.string.representation. Not at all. std.string.representation takes a string and casts it to the corresponding ubyte, ushort, or uint string. It doesn't work at all with InputRange!char.
Re: Major performance problem with std.array.front()
On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str; wstring wstr; dstring dstr;

(str|wchar|dchar).byChar   // Always range of char
(str|wchar|dchar).byWchar  // Always range of wchar
(str|wchar|dchar).byDchar  // Always range of dchar

str.representation   // Range of ubyte
wstr.representation  // Range of ushort
dstr.representation  // Range of uint

str.byCodeUnit   // Range of char
wstr.byCodeUnit  // Range of wchar
dstr.byCodeUnit  // Range of dchar
Re: Major performance problem with std.array.front()
On 3/10/2014 12:19 AM, Nick Sabalausky wrote:

(str|wchar|dchar).byChar   // Always range of char
(str|wchar|dchar).byWchar  // Always range of wchar
(str|wchar|dchar).byDchar  // Always range of dchar

Erm, naturally I meant (str|wstr|dstr)
Re: Major performance problem with std.array.front()
On 3/9/2014 9:19 PM, Nick Sabalausky wrote: On 3/9/2014 6:31 PM, Walter Bright wrote: On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote: Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni. I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc. 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string str; wstring wstr; dstring dstr;

(str|wchar|dchar).byChar   // Always range of char
(str|wchar|dchar).byWchar  // Always range of wchar
(str|wchar|dchar).byDchar  // Always range of dchar

str.representation   // Range of ubyte
wstr.representation  // Range of ushort
dstr.representation  // Range of uint

str.byCodeUnit   // Range of char
wstr.byCodeUnit  // Range of wchar
dstr.byCodeUnit  // Range of dchar

I don't see much point to the latter 3.
Re: Major performance problem with std.array.front()
08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing. Andrei -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
08-Mar-2014 12:09, Dmitry Olshansky wrote: 08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing. Plus it won't help matters, you need both 'é' and "cassé" to have the same normalization. -- Dmitry Olshansky
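The normalization trap described here is easy to reproduce: the same word in NFC versus NFD compares unequal, and a code-point search for the precomposed 'é' fails against the decomposed spelling until both sides are normalized the same way. A Python sketch of exactly that failure mode (illustration only):

```python
import unicodedata

composed = "cass\u00e9"       # 'é' as the single code point U+00E9 (NFC)
decomposed = "casse\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT (NFD)

assert composed != decomposed             # equal text, unequal code points
assert "\u00e9" not in decomposed         # the search canFind would do fails
# Normalizing both sides to the same form repairs the comparison.
assert unicodedata.normalize("NFC", decomposed) == composed
assert "\u00e9" in unicodedata.normalize("NFC", decomposed)
```

This is why grapheme-aware search alone is not enough: the haystack and the needle must also agree on normalization form (or the comparison itself must normalize on the fly).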
Re: Major performance problem with std.array.front()
08-Mar-2014 05:18, Andrei Alexandrescu wrote: On 3/7/14, 12:48 PM, Dmitry Olshansky wrote: 07-Mar-2014 23:57, Andrei Alexandrescu wrote: On 3/6/14, 6:37 PM, Walter Bright wrote: In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges. [snip] Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type. Special casing was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous. I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works. Well, I've said it elsewhere - specialization was too fine grained. Either it's generic or it doesn't work. Making strings bidirectional ranges has been a very good choice within the constraints. There was already a string type, and that was immutable(char)[], and a bunch of code depended on that definition. Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it. I disagree. Also what hole? Let's say we keep it. Yesterday I had to write constraints like this: if((isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) || (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar))) Just to accept anything that works like an array of wchar, buffers and whatnot included. I expect that this should have been enough: isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar) Or maybe introduce something to indicate any DualRange of narrow chars. -- Dmitry Olshansky
Re: Major performance problem with std.array.front()
On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote: Vladimir Panteleev: It's not about types, it's about algorithms. Given sufficiently refined types, it can be about types :-) Bye, bearophile I think Bear is onto something; we already solved an analogous problem in an elegant way - see SortedRange with assumeSorted etc. But for this to be convenient to use, I still think we should expand the current 'String Literal Postfix' types to include both normalization and graphemes.

Postfix  Type                Aka
c        immutable(char)[]   string
w        immutable(wchar)[]  wstring
d        immutable(dchar)[]  dstring
Re: Major performance problem with std.array.front()
On 3/8/14, 12:14 AM, Dmitry Olshansky wrote: 08-Mar-2014 12:09, Dmitry Olshansky wrote: 08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing. Plus it won't help matters, you need both 'é' and "cassé" to have the same normalization. Why? Couldn't the grapheme 'é' compare true with the character? I.e. the byGrapheme iteration normalizes on the fly. Andrei
Re: Major performance problem with std.array.front()
On 3/8/14, 12:09 AM, Dmitry Olshansky wrote: 08-Mar-2014 05:23, Andrei Alexandrescu wrote: On 3/7/14, 1:58 PM, Vladimir Panteleev wrote: On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote: On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote: No, it doesn't. import std.algorithm; void main() { auto s = "cassé"; assert(s.canFind('é')); } Hm, I'm not following? Works perfectly fine on my system? Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug. Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings. Yah but I think they should support comparison with individual characters. No? Andrei