Re: std.algorithm.remove and principle of least astonishment
Andrei Alexandrescu wrote: On 11/22/10 12:01 PM, Steven Schveighoffer wrote: On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 11:22 AM, Steven Schveighoffer wrote: You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that). That still stays valid. The thing is, sort doesn't sort arrays, it sorts random-access ranges. The thing is, *only* when one wants to create strings does one want to view the data type as a bidirectional range. When one wants to deal with chars as elements of a container, one doesn't want to be restricted to UTF requirements. If you don't want to be restricted to UTF requirements, use ubyte and ushort. You're saying you want to use UTF code points without any associated UTF meaning. And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain UTF-8 characters, while I know in my case they cannot. Then what's wrong with ubyte? Why do you encode as UTF something that you know isn't UTF? Would you put an integral value in a real even though you know it's only integral? I don't think that's a valid comparison, since we have integer types, but we don't have ASCII types. 
Here's the issue as I see it: there are very common use cases (and lots of existing C code) for a type which stores an ASCII character. I think we're seeing the exact same issue that causes people to mistakenly use 'uint' when they mean 'positive integer'. It LOOKS as though a char is a subset of dchar (ie, a dchar in the range 0..0x7F). It LOOKS as though a uint is a subset of int (ie, an int in the range 0..int.max). But in both cases, the possibility that the high bit could be set changes the semantics.
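Don's point can be made concrete in a few lines of D. A hedged sketch: a char with the high bit clear decodes as a complete code point, while a lone high-bit char is only a fragment of a multi-byte sequence, so the "char is a subset of dchar" intuition breaks down exactly there.

```d
import std.utf : decode, UTFException;

void main() {
    // High bit clear: a single char really is a complete code point.
    char[] ascii = ['A'];
    size_t i = 0;
    assert(decode(ascii, i) == 'A');

    // High bit set: a lone char is only the lead byte of a multi-byte
    // sequence (here, the first byte of UTF-8 'é'), not a character.
    char[] fragment = [cast(char) 0xC3];
    i = 0;
    try { decode(fragment, i); assert(false); }
    catch (UTFException) {} // truncated sequence, as expected
}
```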
Re: std.algorithm.remove and principle of least astonishment
On 22/11/2010 04:56, Andrei Alexandrescu wrote: On 11/21/10 22:09 CST, Rainer Deyke wrote: On 11/21/2010 17:31, Andrei Alexandrescu wrote: char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. What exactly are those guarantees? More exactly, that the following is true for any T: foreach(character; (T[]).init) { static assert(is(typeof(character) == T)); } static assert(std.range.isRandomAccessRange!(T[])); It is not true for char and wchar (the second assert fails). Another guarantee, similar in nature and roughly described, is that functions in std.algorithm should never fail or throw when using an array as an argument (assuming the other arguments are valid). So for example: std.algorithm.filter!("true")(anArray) should not throw, for any value of anArray. But it may if anArray is of type char[] or wchar[] and there is an encoding exception. I'll leave arguing over whether we want those guarantees to other subthreads, but it should be well agreed by now that the above is not guaranteed. -- Bruno Medeiros - Software Engineer
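The two broken guarantees Bruno describes can be written down directly. A minimal sketch using the Phobos traits the thread refers to:

```d
import std.range : ElementType, isRandomAccessRange;

void main() {
    // Holds for ordinary arrays: elements are T, random access works.
    static assert(is(ElementType!(int[]) == int));
    static assert(isRandomAccessRange!(int[]));

    // char[] breaks both: the range view yields decoded dchar code
    // points, and random access by code point can't be done in O(1),
    // so Phobos exposes narrow strings as bidirectional ranges only.
    static assert(is(ElementType!(char[]) == dchar));
    static assert(!isRandomAccessRange!(char[]));
}
```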
Re: std.algorithm.remove and principle of least astonishment
On 23/11/2010 18:15, foobar wrote: It's simple: a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with its libs fine-tuned for performance. Why? Java has mediocre libraries?? Are you serious about that opinion? -- Bruno Medeiros - Software Engineer
Re: std.algorithm.remove and principle of least astonishment
On 24/11/2010 13:07, Bruno Medeiros wrote: On 22/11/2010 04:56, Andrei Alexandrescu wrote: On 11/21/10 22:09 CST, Rainer Deyke wrote: On 11/21/2010 17:31, Andrei Alexandrescu wrote: char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. What exactly are those guarantees? More exactly, that the following is true for any T: foreach(character; (T[]).init) { static assert(is(typeof(character) == T)); } static assert(std.range.isRandomAccessRange!(T[])); It is not true for char and wchar (the second assert fails). Another guarantee, similar in nature and roughly described, is that functions in std.algorithm should never fail or throw when using an array as an argument (assuming the other arguments are valid). So for example: std.algorithm.filter!("true")(anArray) should not throw, for any value of anArray. But it may if anArray is of type char[] or wchar[] and there is an encoding exception. I'll leave arguing over whether we want those guarantees to other subthreads, but it should be well agreed by now that the above is not guaranteed. Actually, I'll reply here, on why I would like these guarantees: I think these guarantees are desirable due to a general design principle of mine that goes something like this: * Avoid bad abstractions: the abstraction should reflect intent as closely and clearly as possible. Yeah, that may not tell anyone much because it's very hard to objectively define whether an abstraction is bad or not, or better or worse than another. However, here are a few guidelines: - within the same level of functionality, things should be as simple and as orthogonal as possible. - don't confuse implementation with contract/interface/API. (note that I said confuse, not expose) char[] is not as orthogonal as possible. char[] does not reflect its underlying intent as clearly as it could. If it were defined in a struct, you could directly document the expectation that the underlying string must be a valid UTF-8 encoding. 
In fact, you could even make that a contract. If instead of an argument based on a design principle, you ask for concrete examples of why this is undesirable, well, I have no examples to give... I haven't used D enough to run into real-world examples, but I believe that whenever the above principle is violated, it is very likely that problems and/or annoyances will occur sooner or later. I should point out, however, that, at least for me, the undesirability of the current behavior is actually very low. Compared to other language issues (whether current ones or past ones), it does not seem that significant. For example, static arrays not being proper value types (plus their .init thing) was much worse; man, that annoyed the shit out of me. Then again, someone with more experience using D might encounter a more serious real-world case regarding the current behavior. Also, regarding this: On 22/11/2010 17:40, Andrei Alexandrescu wrote: Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. Casting to ubyte[] does solve the use case, I agree. It does so with a minor inconvenience (having to cast), but it's very minor and I don't think it's that significant. Rather, I'm more concerned with the use cases that actually want to use a char[] as a UTF-8 encoded string. As I mentioned above, I'm afraid of situations where this inconsistency might cause more significant inconveniences, maybe even bugs! -- Bruno Medeiros - Software Engineer
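Bruno's "make it a contract" idea can be sketched in a few lines. The struct name and helper below are hypothetical, and an invariant that re-validates the whole buffer would be far too slow for real use; the point is only that the UTF-8 expectation becomes explicit and checkable rather than an unstated convention on char[]:

```d
import std.utf : validate;

// Returns true iff s is well-formed UTF-8 (validate throws on error).
bool isWellFormed(const(char)[] s) {
    try { validate(s); return true; }
    catch (Exception) { return false; }
}

// Hypothetical wrapper documenting the expectation as a contract.
struct Utf8String {
    char[] data;
    invariant() { assert(isWellFormed(data)); }
}
```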
Re: std.algorithm.remove and principle of least astonishment
On 21/11/2010 18:23, Andrei Alexandrescu wrote: I have often reflected whether I'd do things differently if I could go back in time and join Walter when he invented D's strings. I might have done one or two things differently, but the gain would be marginal at best. In fact, it's not impossible the balance of things could have been hurt. Between speed, simplicity, effectiveness, abstraction, access to representation, and economy of means, D's strings are the best compromise out there that I know of, bar none by a wide margin. Those things you would have done differently, would any of them impact this particular issue? -- Bruno Medeiros - Software Engineer
Re: std.algorithm.remove and principle of least astonishment
On Wed, 24 Nov 2010 13:39:19 +0100 Don nos...@nospam.com wrote: I think we're seeing the exact same issue that causes people to mistakenly use 'uint' when they mean 'positive integer'. It LOOKS as though a char is a subset of dchar (ie, a dchar in the range 0..0x7F). Cannot be, in the sense of uint being a subset of ulong. That's why char, if not perfect, is a good name, providing the programmer with a hint about the actual semantics. What I don't understand is why people who need unsigned bytes do not use ubyte, but instead bug into char. Is this only because of C baggage? It LOOKS as though a uint is a subset of int (ie, an int in the range 0..int.max). This indeed is a big issue. I would prefer uint (= Natural) to be implemented as a subset of int: uint: 0 .. +7fffffff, int: -80000000 .. +7fffffff. Denis -- vit esse estrany ☣ spir.wikidot.com
Re: std.algorithm.remove and principle of least astonishment
spir schrieb: What I don't understand is why people who need unsigned bytes do not use ubyte, but instead bug into char. Is this only because of C baggage? probably because you can't write ubyte[] str = "asdf"; and they want to have ascii-chars in their ubyte arrays
Re: std.algorithm.remove and principle of least astonishment
On 11/24/10 9:35 AM, Daniel Gibson wrote: spir schrieb: What I don't understand is why people who need unsigned bytes do not use ubyte, but instead bug into char. Is this only because of C baggage? probably because you can't write ubyte[] str = "asdf"; and they want to have ascii-chars in their ubyte arrays Probably the assignment should be allowed. Andrei
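Until such an implicit conversion is allowed, the usual workaround is an explicit cast; a minimal sketch:

```d
void main() {
    // ubyte[] str = "asdf";   // rejected: no implicit string -> ubyte[]

    // Reinterpret the literal's bytes in place (keeps immutability):
    immutable(ubyte)[] a = cast(immutable(ubyte)[]) "asdf";

    // Or make a mutable copy first, then reinterpret:
    ubyte[] b = cast(ubyte[]) "asdf".dup;

    assert(a.length == 4 && b[0] == 'a');
}
```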
Re: std.algorithm.remove and principle of least astonishment
On Wed, 24 Nov 2010 16:35:59 +0100 Daniel Gibson metalcae...@gmail.com wrote: spir schrieb: What I don't understand is why people who need unsigned bytes do not use ubyte, but instead bug into char. Is this only because of C baggage? probably because you can't write ubyte[] str = "asdf"; and they want to have ascii-chars in their ubyte arrays Oh yes, sorry for the noise. Then, I don't see any other solution than having a proper ByteString type built into the compiler (that would indeed work for any single-byte encoding, not only ASCII), with a corresponding string literal pre/post-fix (one more ;-). Denis
Re: std.algorithm.remove and principle of least astonishment
Bruno Medeiros Wrote: On 23/11/2010 18:15, foobar wrote: It's simple: a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with its libs fine-tuned for performance. Why? Java has mediocre libraries?? Are you serious about that opinion? -- Bruno Medeiros - Software Engineer It all depends on the scale you use. If we equate programming with cooking, then using C++ is like trying to make a feast out of single atoms. Using Java would then be equivalent to buying at the supermarket. It's fine for most people. On this scale, the Naked Chef has his own farm with organic livestock and also a garden, so he can get the best ingredients.
Re: std.algorithm.remove and principle of least astonishment
On Tue, 23 Nov 2010 00:10:40 -0500 Jesse Phillips jessekphillip...@gmail.com wrote: Rainer Deyke Wrote: On 11/22/2010 11:55, Andrei Alexandrescu wrote: http://d.puremagic.com/issues/show_bug.cgi?id=5257 I think this bug is a symptom of a larger issue. The range abstraction is too fragile. If even you can't use the range abstraction correctly (in the library that defines this abstraction, no less), how can you expect anyone else to do so? Note that this issue with foreach has been discussed before. The suggested solution was to have it infer dchar instead of char (shot down since iterating char is useful and it is simple to add the type dchar). Maybe a range interface (as found in std.string) should take precedence over arrays in foreach? Or maybe foreach should only work with ranges and opApply (that would mean std.array would need to be imported to use foreach with arrays)? That wouldn't address your exact issue. I tend to agree with Andrei that you should be coding to the range interface, which will prevent any misuse of char/wchar. On the other hand, why can't I have a range of char (I mean get one from an array, not that I would ever want to)? Anyway, I agree char[] is a special case, but I also agree it isn't an issue. This issue may also be interpreted as one more sign that text types in general are special enough to require a distinct (set of) type(s). Which would not prevent freely using *char[] as a plain array (even if I personally cannot imagine what for). Denis
Re: std.algorithm.remove and principle of least astonishment
Andrei Alexandrescu Wrote: On 11/22/10 5:59 PM, foobar wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte. How would I go about printing DNA sequences then? Printing a ubyte should print its numeric value, and NOT a char. What's actually needed here is an ASCIIChar type or even a stricter DNAChar. D's [w|d]char types make no sense since they are NOT characters, and the concept doesn't fit for Unicode since, as someone else wrote, there are different levels of abstraction in Unicode (code point, code unit, grapheme). Naming matters, and having a cat called dog (char is actually a code unit) is a source of bugs. I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean, it would be odd if they were something else. This isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with cat. Why do you need to learn that mistake _AT_ALL_? It is odd for YOU to think otherwise because you have ALREADY learned and grown accustomed to using a cat every time you need a dog. That does not mean that this is indeed correct. This is the same issue people have with D's enum. You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people at a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 11/23/10 3:49 AM, foobar wrote: Andrei Alexandrescu Wrote: On 11/22/10 5:59 PM, foobar wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte. How would I go about printing DNA sequences then? Printing a ubyte should print its numeric value, and NOT a char. What's actually needed here is an ASCIIChar type or even a stricter DNAChar. Yes, and the language offers the abstraction abilities to define such types. D's [w|d]char types make no sense since they are NOT characters, and the concept doesn't fit for Unicode since, as someone else wrote, there are different levels of abstraction in Unicode (code point, code unit, grapheme). Naming matters, and having a cat called dog (char is actually a code unit) is a source of bugs. I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean, it would be odd if they were something else. This isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with cat. Why do you need to learn that mistake _AT_ALL_? It is odd for YOU to think otherwise because you have ALREADY learned and grown accustomed to using a cat every time you need a dog. That does not mean that this is indeed correct. This is the same issue people have with D's enum. You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people at a different location on the learning curve. This goes with many of your excellent implementations that get awful names. 
Very C++ on your part - you need to be a C++ guru just to write a hello world app. I think I don't understand what you're suggesting. Andrei
Re: std.algorithm.remove and principle of least astonishment
Andrei Alexandrescu Wrote: On 11/23/10 3:49 AM, foobar wrote: Andrei Alexandrescu Wrote: On 11/22/10 5:59 PM, foobar wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte. How would I go about printing DNA sequences then? Printing a ubyte should print its numeric value, and NOT a char. What's actually needed here is an ASCIIChar type or even a stricter DNAChar. Yes, and the language offers the abstraction abilities to define such types. D's [w|d]char types make no sense since they are NOT characters, and the concept doesn't fit for Unicode since, as someone else wrote, there are different levels of abstraction in Unicode (code point, code unit, grapheme). Naming matters, and having a cat called dog (char is actually a code unit) is a source of bugs. I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean, it would be odd if they were something else. This isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with cat. Why do you need to learn that mistake _AT_ALL_? It is odd for YOU to think otherwise because you have ALREADY learned and grown accustomed to using a cat every time you need a dog. That does not mean that this is indeed correct. This is the same issue people have with D's enum. You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people at a different location on the learning curve. This goes with many of your excellent implementations that get awful names. 
Very C++ on your part - you need to be a C++ guru just to write a hello world app. I think I don't understand what you're suggesting. Andrei It's simple: a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with its libs fine-tuned for performance. Why? Because from a regular programmer's POV, one who just wants to get things done (TM), Java is geared towards easy and quick use. There are many libs for all common use cases, there is a common style and good naming conventions, and 9/10 times you can write code by feel without spending half an hour reading documentation. There are no obscure function names in Latin or Greek (even if the Latin/Greek term is more precise in math terms). In short, Java is KISS, C++ is not. If you want D to succeed you need to acknowledge this and act accordingly. Make the common case trivial and the special case possible. char is NOT fine and is misleading. I'm not asking to change this right now and would accept a response like it's too late to change now or whatever. However, I do expect you to at least acknowledge the issue and not dismiss it. Your code might be excellent but it caters only to you and a small number of programmers who share your style. D will not succeed with the general programmer public until you start catering for the common people and stop dismissing their complaints. D2 is way more complex than D1 because of this (and the const system), and I'm singling you out because you are the main developer of D's standard lib and because you set the design goals/style of it.
Re: std.algorithm.remove and principle of least astonishment
On 11/23/10 12:15 PM, foobar wrote: Andrei Alexandrescu Wrote: On 11/23/10 3:49 AM, foobar wrote: Andrei Alexandrescu Wrote: On 11/22/10 5:59 PM, foobar wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte. How would I go about printing DNA sequences then? Printing a ubyte should print its numeric value, and NOT a char. What's actually needed here is an ASCIIChar type or even a stricter DNAChar. Yes, and the language offers the abstraction abilities to define such types. D's [w|d]char types make no sense since they are NOT characters, and the concept doesn't fit for Unicode since, as someone else wrote, there are different levels of abstraction in Unicode (code point, code unit, grapheme). Naming matters, and having a cat called dog (char is actually a code unit) is a source of bugs. I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean, it would be odd if they were something else. This isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with cat. Why do you need to learn that mistake _AT_ALL_? It is odd for YOU to think otherwise because you have ALREADY learned and grown accustomed to using a cat every time you need a dog. That does not mean that this is indeed correct. This is the same issue people have with D's enum. You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people at a different location on the learning curve. 
This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app. I think I don't understand what you're suggesting. Andrei It's simple: a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with its libs fine-tuned for performance. Why? Because from a regular programmer's POV, one who just wants to get things done (TM), Java is geared towards easy and quick use. There are many libs for all common use cases, there is a common style and good naming conventions, and 9/10 times you can write code by feel without spending half an hour reading documentation. There are no obscure function names in Latin or Greek (even if the Latin/Greek term is more precise in math terms). In short, Java is KISS, C++ is not. I don't think the dynamics of programming language success can be represented with a one-dimensional explanation. There are many other factors (marketing, perception, historical setting, etc. etc. etc.) Many languages offer easier and quicker ways to get things done than Java, which is quite verbose. And Java programmers in fact spend large amounts of time reading documentation for the massive APIs they are working with. I'm not framing that as a bad thing; I'm just clarifying why I think your attempt at explaining Java's success is not only incomplete, but wrong. If you want D to succeed you need to acknowledge this and act accordingly. Make the common case trivial and the special case possible. char is NOT fine and is misleading. I'm not asking to change this right now and would accept a response like it's too late to change now or whatever. However, I do expect you to at least acknowledge the issue and not dismiss it. What would be a good replacement name for char? Your code might be excellent but it caters only to you and a small number of programmers who share your style. I'm curious how you validated this assumption. 
D will not succeed with the general programmer public until you start catering for the common people and stop dismissing their complaints. Since you are trying to build the impression that this is a common pattern, you should have no trouble finding plenty of examples. D2 is way more complex than D1 because of this (and the const system), and I'm singling you out because you are the main developer of D's standard lib and because you set the design goals/style of it. I have had a Google alert tuned for the exact string "D programming language" for a good while. The general opinion that I seem to have gathered is that Phobos 2 is a major pro, not a con, in deciding to choose D2 versus D1. Andrei
Re: std.algorithm.remove and principle of least astonishment
On Tuesday, November 23, 2010 09:05:05 Andrei Alexandrescu wrote: You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people at a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app. I think I don't understand what you're suggesting. I think that what he's saying is that the names char, wchar, and dchar as UTF-8, UTF-16, and UTF-32 code units respectively make sense to you because you're used to them, but anyone learning D (particularly those used to char in other languages being an ASCII character) won't find it at all intuitive or obvious. Honestly, the only semi-reasonable alternative to char, wchar, and dchar that I can think of would be utf8, utf16, and utf32. But then everyone would be wondering where char was, and I'm not sure that it would really help any in the long run anyway. It would be more explicit, though. But given char and wchar_t in C++, I really don't think that it's much of a stretch to use char, wchar, and dchar. The only thing really different about it is that D insists that char is always a UTF-8 code unit rather than it really being useable as an ASCII character. - Jonathan M Davis
Re: std.algorithm.remove and principle of least astonishment
I think that what he's saying is that the names char, wchar, and dchar as UTF-8, UTF-16, and UTF-32 code units respectively make sense to you because you're used to them, but anyone learning D (particularly those used to char in other languages being an ASCII character) won't find it at all intuitive or obvious. They should first realize this is another language. Honestly, the only semi-reasonable alternative to char, wchar, and dchar that I can think of would be utf8, utf16, and utf32. But then everyone would be wondering where char was, and I'm not sure that it would really help any in the long run anyway. It would be more explicit, though. But given char and wchar_t in C++, I really don't think that it's much of a stretch to use char, wchar, and dchar. The only thing really different about it is that D insists that char is always a UTF-8 code unit rather than it really being useable as an ASCII character. That actually is an excellent idea: wiping all 3 of them and replacing them with these. -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Re: std.algorithm.remove and principle of least astonishment
Jonathan M Davis schrieb: On Tuesday, November 23, 2010 09:05:05 Andrei Alexandrescu wrote: You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people at a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app. I think I don't understand what you're suggesting. I think that what he's saying is that the names char, wchar, and dchar as UTF-8, UTF-16, and UTF-32 code units respectively make sense to you because you're used to them, but anyone learning D (particularly those used to char in other languages being an ASCII character) won't find it at all intuitive or obvious. And in Java a char is a 16-bit Unicode character that is generally handled as a code unit (since Java 1.5, 32-bit code points represented as surrogate pairs of 2 chars are also supported, but I don't know if that really works in the whole standard lib and if people actually use it). So also for Java programmers 1 char == 1 printed character, even though it supports more than ASCII. Honestly, the only semi-reasonable alternative to char, wchar, and dchar that I can think of would be utf8, utf16, and utf32. Naa, that sounds like it's a whole UTF-* string and not just a code point to me. utf8codepoint maybe? or utf8cp? .. That sucks, IMHO it should stay the way it is. But maybe an ASCII type (or maybe a more general 8-bit text type that also supports ISO-* charsets etc.?) would be helpful. One that string literals can implicitly be converted to (so ubyte[] or an alias of that won't work). Also the compiler would have to make sure that all characters of the string can be represented in ASCII (or ISO-*). Cheers, - Daniel
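Daniel's 8-bit text type could be approximated in library code today; a hypothetical sketch (the AsciiString name is made up, and implicit conversion from string literals is exactly the part that would still need compiler help):

```d
// Hypothetical AsciiString: plain byte storage, with the 7-bit
// constraint checked once, at construction.
struct AsciiString {
    private ubyte[] data;

    this(string s) {
        foreach (c; s)
            assert(c < 0x80, "non-ASCII character in AsciiString");
        data = cast(ubyte[]) s.dup;
    }

    // Ordinary array semantics: one element, one character.
    ubyte opIndex(size_t i) const { return data[i]; }
    @property size_t length() const { return data.length; }
}

void main() {
    auto h = AsciiString("hand: AKQJT");
    assert(h.length == 11 && h[0] == 'h');
}
```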
Re: std.algorithm.remove and principle of least astonishment
On 11/22/2010 00:08, Andrei Alexandrescu wrote: On 11/21/10 11:59 PM, Rainer Deyke wrote: That the range view and the array view provide direct access to the same data. Where do ranges state that assumption? Are you saying that arrays of T do not function as ranges of T when T is not a character type? One of the useful features of most arrays is that an array of T can be treated as a range of T. However, this feature is missing for arrays of char and wchar. This is not a guarantee by ranges, it's just a mistaken assumption. I'm not saying that this feature is guaranteed for all arrays, because it clearly isn't. I'm saying that this feature is present for T[] where T is not a character type, and missing for T[] where T is a character type. When writing code that is not intended to operate on character data, it is natural to use this feature. The code then breaks when it is used with character data. No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop. A function that operates on ranges would have an appropriate constraint, so it would work properly or not at all. foreach works fine with all arrays. It works, but produces different results when iterating over a character array than when iterating over a non-character array. Code can compile, have well-defined behavior, run, produce correct results in most cases, but still be wrong. Let's say I have an array and I want to iterate over the first ten items. My first instinct would be to write something like this: foreach (item; array[0 .. 10]) { doSomethingWith(item); } Simple, natural, readable code. Broken for arrays of char or wchar, but in a way that is difficult to detect. Why is it broken? Please try it to convince yourself of the contrary. I see, foreach still iterates over code units by default. 
Of course, this means that foreach over ranges doesn't work with strings, which in turn means that algorithms that use foreach over ranges are broken. Observe: import std.stdio; import std.algorithm; void main() { writeln(count!("true")("日本語")); // Three characters. } Output (compiled with Digital Mars D Compiler v2.050): 9 Fine. Use T[] generically in conjunction with the array primitives. If you plan to use them with the range primitives, you do as ranges do. If arrays can't operate as ranges, what's the point of giving them a range interface? Easy: - string_t becomes a keyword. - Syntactically speaking, string_t!T is the name of a type when T is a type. - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior. - For every other type T, string_t!T is an error. - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range. It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support and rename it to String, but then it wouldn't be a built-in string.) I very much prefer the current state of affairs. Care to support that with some arguments, or is it just a purely subjective preference? -- Rainer Deyke - rain...@eldwood.com
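To make the array-view/range-view discrepancy discussed above concrete, here is a small sketch (assuming a D2-era Phobos where std.range provides walkLength; this is not code from the thread):

```d
// The same char[] data yields different element counts depending on
// whether you use the array view or the range view.
import std.range;
import std.stdio;

void main()
{
    string s = "日本語";       // 3 code points, 9 UTF-8 code units
    writeln(s.length);         // array view: 9 (code units)
    writeln(walkLength(s));    // range view: 3 (code points)

    foreach (c; s)       {}    // default foreach: iterates char code units
    foreach (dchar c; s) {}    // explicit dchar: iterates code points
}
```

This is exactly why count!("true") above reports 9 rather than 3: it walks the data with foreach over the array rather than through the dchar range primitives.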
Re: std.algorithm.remove and principle of least astonishment
On Sun, 21 Nov 2010 19:21:27 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/21/10 7:00 PM, Jonathan M Davis wrote: Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to. I agree except for the majority of cases part. In fact the original design of range interfaces for char[] and wchar[] was to require byDchar() to get a bidirectional interface over the arrays of code units. That design, with which I experimented for a while, had two drawbacks: 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units. 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar(). I find these points most relevant. The issue is that the *char[] types actually are the mutable variants of the *string types, so one needs to use them as textual types, meaning as strings of code points. Thus, I do not think the most common case is to have them iterated as strings of code _units_. The second iteration of the design, which is currently in use, was to define in std.range the primitives such that char[] and wchar[] offer the bidirectional range interface by default.
I have gone through all algorithms in std.algorithm and std.string and noticed with amazed satisfaction that they almost always did the right thing, and that I could tweak the few that didn't to complete a satisfactory implementation. (indexOf has slipped through the cracks.) I think that experience with the current design speaks in its favor. This makes the safe and common case the default. One thing could be done to drive the point home: a function byCodeUnit() could be added that actually does iterate a char[] or a wchar[] one code unit at a time (and consequently restores their behavior as T[]). That function could be simply a cast to ubyte[]/ushort[], or it could introduce a random-access range. For sure, this would be useful in the cases where one really needs code units. And it would make clear that default iteration is _not_ over code units (thus avoiding part of the criticism). Maybe an alternative would be (or have been) to have a complete lexical distinction between (text) strings and true char arrays, whatever constness or mutability is wished: * char[] is always an array of plain unsigned ints * mutable strings can be defined using mutable(string) for text processing, still being indexed and iterated as strings of code _points_. Andrei Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
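The byCodeUnit() idea sketched above could look roughly like this, using the cast approach Andrei mentions (the name and signatures here are assumptions for illustration, not an actual Phobos API at the time):

```d
// Minimal sketch of a hypothetical byCodeUnit(): reinterpret the code
// units as plain integers so the ordinary array/range behavior
// (random access, element-wise iteration) is restored.
ubyte[] byCodeUnit(char[] s)
{
    return cast(ubyte[]) s;   // each ubyte is one UTF-8 code unit
}

ushort[] byCodeUnit(wchar[] s)
{
    return cast(ushort[]) s;  // each ushort is one UTF-16 code unit
}
```

With such a function, foreach (u; byCodeUnit(s)) unambiguously walks code units, and the default iteration over s itself can be documented as the code-point view.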
Re: std.algorithm.remove and principle of least astonishment
On Monday 22 November 2010 02:01:38 Rainer Deyke wrote: On 11/22/2010 00:08, Andrei Alexandrescu wrote: On 11/21/10 11:59 PM, Rainer Deyke wrote: That the range view and the array view provide direct access to the same data. Where do ranges state that assumption? Are you saying that arrays of T do not function as ranges of T when T is not a character type? I believe that he means that you either use them as ranges or you use them as arrays. Mixing the two sets of operations is asking for trouble. - Jonathan M Davis
Re: std.algorithm.remove and principle of least astonishment
On 11/22/2010 03:57, Jonathan M Davis wrote: On Monday 22 November 2010 02:01:38 Rainer Deyke wrote: Are you saying that arrays of T do not function as ranges of T when T is not a character type? I believe that he means that you either use them as ranges or you use them as arrays. Mixing the two sets of operations is asking for trouble. It is impossible to have a non-empty array without at some point using an array operation. If you can't mix array operations with range operations, then you can't use arrays as ranges. -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
On Sun, 21 Nov 2010 21:26:53 -0500 Michel Fortin michel.for...@michelf.com wrote: On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: That design, with which I experimented for a while, had two drawbacks: 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units. 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar(). Hello Michel, Well, basically these two arguments are the same: iterating by code unit isn't a good default. And I agree. But I'm unconvinced that iterating by dchar is the right default either. For one thing it has more overhead, and for another it still doesn't represent a character. This is an issue evoked in a previous thread some weeks ago. More on it below. Now, add graphemes to the equation and you have a representation that matches the user-perceived character concept, but for that you add another layer of decoding overhead and a variable-size data type to manipulate (a grapheme is a sequence of code points). And you have to use Unicode normalization when comparing graphemes. So is that a good default? Probably not. It might be correct in some sense, but it's totally overkill for most cases. It is not possible, as the writer of a text-processing lib or Text type, to define a right level of abstraction (code unit, code point, or grapheme) that would both be usually efficient and avoid unexpected failures for naive use of the tool. The only safe level in 99% of cases is the highest-level one, namely the grapheme. Only then can one be sure that, for instance, text.count("ä") will actually count the "ä"s in the source text. But in most cases, this is overkill.
It depends on what the text actually contains, *and* on what the programmer knows about it (I mean that texts may be plain ASCII, so that even unsigned byte strings would do the job, but if the programmer cannot guess it...). The tool writer cannot guess anything. My thinking is that there is no good default. If you write an XML parser, you'll probably want to work at the code point level; if you write a JSON parser, you can easily skip the overhead and work at the UTF-8 code unit level until you start parsing a string; if you write something to count the number of user-perceived characters or want to map characters to a font then you'll want graphemes... At least 3 factors must be taken into account: 1. The actual content of source texts. For instance, 99.999% of all texts won't ever hold code points . This tells which size should be used for code units. The safe general choice indeed being 32 bits. 2. The normalisation form of graphemes; whether they are decomposed (the right choice), or in unknown form or possibly in mixed forms, or as precomposed as possible. In the latter case (by far the most common one for western-language texts), if one can assert that every grapheme in every source text to be dealt with has a fully precomposed form (= 1 single code *point*), then the level of code points is safe enough. 3. Whether text is just transferred through an app or is also processed. Many apps just use some bits of input texts (files, user input, literals) as is, without any processing, and often output some of them, possibly concatenated. This is safe whatever the abstraction level of the text representation used; one can concat plain utf8 representing composite graphemes in decomposed form. But as soon as any text-processing routine is used (indexing, slicing, find, count, replace...), then questions arise about the correctness of the app.
And, as said already, to be able to safely choose any lower level of representation, the programmer must know about the content, its properties, its UCS coding. For instance, imagine you need to write an app dealing with texts containing phonetic symbols (IPA). How do you know which is the lowest safe level? * What is the common coding of IPA graphemes in UCS? * Can they be coded in various ways (yes!, too bad..) * What is the highest code point ever possibly needed? (== is utf8 or utf16 enough for code points?) * Do all graphemes have a fully precomposed form? * Can I be sure that all texts will actually be coded in precomposed form (this depends on text-producing tools), forever? Perhaps there should be simply no default; perhaps you should be forced to choose explicitly at which layer you want to operate each time you apply an algorithm to a string... and to make this less painful we could have functions in std.string acting as a thin layer over similar ones in std.algorithm that would automatically choose the right representation for the algorithm depending on the operation. My next project should be to write one Text type dealing at the highest-level -- if only to
Re: std.algorithm.remove and principle of least astonishment
On Sun, 21 Nov 2010 19:27:06 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: There is no easy notion of character in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as the default character unit is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character? I understand the concern, and that's why I strongly support formal abstractions that are supported by, but largely independent from, representations. If graphemes are to be modeled, D is in better shape than other languages. What we need to do is define a range byGrapheme() that accepts char[], wchar[], or dchar[]. Sure, D helps a lot. I agree with abstraction levels independent of internal representation in the general case (I think it's one major aspect and advantage of ranges, isn't it?). But it yields a huge efficiency issue in this very case. Namely, if one deals with a text at the level of graphemes while the representation is a string of code points, then every little routine has to reconstruct the graphemes on the fly. For instance, indexing 3 times will do 3 times the job of constructing a string of graphemes (up to the given indices). Thus, when one has to do text processing, even of the simplest kind, it is necessary to use a dedicated type (or any kind of tool using a high-level representation). (Analog to the need of first decoding code units into code points, only once, before dealing with code points -- but at a higher level.) See also my answer to Michel's post. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: std.algorithm.remove and principle of least astonishment
On Sun, 21 Nov 2010 20:11:23 -0500 Michel Fortin michel.for...@michelf.com wrote: On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges. It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way. True. There is no easy notion of character in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as the default character unit is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character? True, but only partially. The error of using utf16 to represent code points is far less serious in practice, because the code points concerned have about no chance to ever be present in any text one programmer will ever have to deal with. (This error was in fact initially caused by the standard people who first thought 16 bits was enough, so that 16-bit tools and encodings were created and used.) But I fully agree with what's the point of working with the intermediary representation (code points) when it doesn't represent a character?. *This* is wrong and may cause much damage. Actually, it means apps simply do not work correctly; a logical error; and one that can hardly be automatically detected. A side-issue is that in present times we mostly deal with source texts for which there exist precomposed characters, _and_ text-producing tools usually use them.
So that programmers who ignore the issue may think they are right. But both of those facts may soon be wrong. Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words: string str = "hello"; foreach (cu; str) {} // code unit iteration foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range. Yop, but the ability to iterate over graphemes, while the internal representation is a string of codes, or code units, is *not* what we need: text.count(c); would have to construct graphemes on the fly over the whole string. Every text-processing routine performed on a given text would have to do it on all or part of the text (indexing, for instance, would do it only up to the given index). Meaning every routine would have to do the job of constructing a string of graphemes (and normalising it) that should be done only once. Hope I'm clear. That is the reason why we need a proper Text type as a string of graphemes. The same abstraction offered by dchar (from code units to code points) is needed at a higher level (from code points to graphemes). Each element would be what I call a stack, a mini-array of dchars. Then, we can deal with it like with a plain ASCII or Latin-1 text:
c c c c c c c c c    dstring = dchar[]   -- coded string
c c c c c c c c c    text = stack[]      -- logical string
Here's a nice reference about unicode graphemes, word segmentation, and related algorithms.
http://unicode.org/reports/tr29/ I once implemented the algorithm used to construct graphemes out of code points, as a base for a grapheme-level Text type, with all common text-processing routines (*) (in Lua). I plan to do this for D in a short while. As said, it should be simpler thanks to D's true string types, which already abstract from lower-level issues. (*) Actually, once one has a string of graphemes/codes/code-units, the routines are the same whatever the kind of element. There could be a generic version in std.string. -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
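spir's two-level layout (coded string vs. logical string of "stacks") might be sketched in D like this (the type names follow his terminology and are assumptions, not an existing library):

```d
alias Stack = dchar[];   // one grapheme = a mini-array of code points
alias Text  = Stack[];   // logical string: one element per character

void main()
{
    // "é" in decomposed form is 'e' followed by a combining acute
    // accent (U+0301); here it forms a single logical character.
    Text t = [ ['e', '\u0301'], ['a'] ];
    assert(t.length == 2);       // 2 logical characters
    assert(t[0].length == 2);    // the first spans 2 code points
}
```

With this representation, indexing, counting, and slicing operate per logical character, at the cost of building (and normalising) the stacks once up front.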
Re: std.algorithm.remove and principle of least astonishment
On Sun, 21 Nov 2010 17:56:15 -0800 Jonathan M Davis jmdavisp...@gmx.com wrote: We could always define an abstract Character (or whatever you want to call it) which holds a character - regardless of whether it uses a grapheme or not - and make it relatively easy to iterate over Characters rather than dchars. This is not a solution, it would force constructing graphemes for each routine applied to a given text. You need to do it only once. It would be nice if they abolished graphemes though... What is the alternative? For a given set of base characters (say ASCII letters, cardinality NC) and a given set of combining marks (say Latin diacritics, cardinality ND), what is the number of combinations? If I'm right, the answer is NC * 2^ND (in other words, an astronomical number). We would need thousands of bits for each code point ;-) Also, we cannot predict the future. Think that for each new diacritic, you must double the number of precomposed characters, simply by adding this diacritic to every already existing combination. We cannot know what will be needed in a few years. The error UCS Unicode has made is the opposite one: to silently pretend that code points represent characters (I cannot believe that choosing the term abstract character to denote what is coded by a code point was innocent. It can only introduce confusion). They should have said that a code point represents, say, an abstract mark. And made clear that a character, meaning a logical text element, is represented by a mini-array of code points (what I call a code stack, see my other post for why). This would have avoided confusion from the start, and encouraged programmers to design proper, correct text representations -- at least for text processing. Now, and only because of that, everybody seems to discover the consequent issues 20 years too late. Even in unicode circles: I have tried to evoke this on the unicode mailing list several times in past years, with about no echo at all. People do not *want* to hear of it.
I think this has been a deliberate marketing choice for the UCS/Unicode standard. Probably they were afraid of reactions from programming communities if they had made clear that dealing with universal text required adding *2* levels of abstraction over plain ASCII. Another error was to promote using code units for space-efficiency. Else, there would be only 1 new level. It is quite possible that while D's handling of unicode is a huge improvement over other languages, by treating dchar as a full character essentially everywhere, we're opening ourselves up to a variety of bugs caused by graphemes which will be subtle and hard to find. But I'm not sure what the correct solution to that is. There is one general solution as long as efficiency is considered irrelevant: a text is represented as a string of graphemes. There is no solution with efficiency, because the cases for which this is overkill are the most common ones (as of now; but this will change with the growth of computing in Asian countries). Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-22 06:57:36 -0500, spir denis.s...@gmail.com said: (*) Actually, once one has a string of graphemes/codes/code-units, the routines are the same whatever the kind of element. There could be a generic version in std.string. Just to add to the complexity: graphemes aren't always equivalent to user-perceived characters either. Ligatures can contain more than one user-perceived character. If you're looking for the substring flourish in a string, should it fail to match when it encounters flourish just because of the fl (fl) ligature? On most Mac applications it matches both, thanks to sensible defaults in NSString's search and comparison algorithms. So perhaps we need yet another layer over graphemes to represent user-perceived characters. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-22 06:57:36 -0500, spir denis.s...@gmail.com said: On Sun, 21 Nov 2010 20:11:23 -0500 Michel Fortin michel.for...@michelf.com wrote: On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges. It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way. True. There is no easy notion of character in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as the default character unit is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character? True, but only partially. The error of using utf16 to represent code points is far less serious in practice, because the code points concerned have about no chance to ever be present in any text one programmer will ever have to deal with. (This error was in fact initially caused by the standard people who first thought 16 bits was enough, so that 16-bit tools and encodings were created and used.) But I fully agree with what's the point of working with the intermediary representation (code points) when it doesn't represent a character?. *This* is wrong and may cause much damage. Actually, it means apps simply do not work correctly; a logical error; and one that can hardly be automatically detected.
A side-issue is that in present times we mostly deal with source texts for which there exist precomposed characters, _and_ text-producing tools usually use them. So that programmers who ignore the issue may think they are right. But both of those facts may soon be wrong. Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words: string str = "hello"; foreach (cu; str) {} // code unit iteration foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range. Yop, but the ability to iterate over graphemes, while the internal representation is a string of codes, or code units, is *not* what we need: text.count(c); would have to construct graphemes on the fly over the whole string. I agree there might be a use case for a special data type allowing fast random access to graphemes and able to retain the precise count of graphemes. But if what you do only requires iterating over all graphemes, a wrapper range that converts to graphemes on the fly might be less overhead than building a separate data structure. In fact, this separate data structure to hold graphemes is probably going to require more memory, and more memory will fit worse in the processor's cache. Compare the cost in performance of a cache miss versus one or two comparisons to check if the next code point is part of the same grapheme and you might actually find the version that iterates by converting code points to graphemes on the fly faster for long strings.
As long as you don't need random access to the graphemes, I don't think you need a separate data structure. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: std.algorithm.remove and principle of least astonishment
On Mon, 22 Nov 2010 07:34:15 -0500 Michel Fortin michel.for...@michelf.com wrote: Just to add to the complexity: graphemes aren't always equivalent to user-perceived characters either. Ligatures can contain more than one user-perceived character. If you're looking for the substring flourish in a string, should it fail to match when it encounters flourish just because of the fl (fl) ligature? On most Mac applications it matches both, thanks to sensible defaults in NSString's search and comparison algorithms. That's true. I guess you're thinking of the distinction between the NFD/NFC canonical forms and the NFKD/NFKC ones (so-called compatibility forms). So perhaps we need yet another layer over graphemes to represent user-perceived characters. In my view, this is not the responsibility of a general-purpose tool. I guess, but may be wrong, that we are clearly entering the field of app logic and semantics. These are for me _not_ general-purpose points (but builtin type libraries often offer clearly non-general routines, like ones dealing with casing, or even less general: the set of ASCII letters). These issues would have to be dealt with either by apps or by domain-specific libraries. I find it wrong that Unicode even simply provides standard canonical forms for them (but fortunately common libs do not implement them AFAIK) denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: std.algorithm.remove and principle of least astonishment
On Mon, 22 Nov 2010 08:24:33 -0500 Michel Fortin michel.for...@michelf.com wrote: I agree there might be a use case for a special data type allowing fast random access to graphemes and able to retain the precise count of graphemes. But if what you do only requires iterating over all graphemes, a wrapper range that converts to graphemes on the fly might be less overhead than building a separate data structure. It's true as long as you can assert each string is iterated at most once. But the job of constructing an instance of UText (say, a grapheme string) should be exactly the same as what each iteration has to do on the fly. Or do I miss a point? Also, it's not only about indexing or iterating. Simply finding/counting/replacing given characters (I mean in the sense of graphemes) or slices requires the string to be not only grouped, but also normalised (else how is the routine supposed to recognise the same char in another form?). A heavy job as well, which you don't want to do twice. Grouping makes normalising easier (you only cope with a mini-array of codes at once, already known to represent a whole char) (and sorting the codes in stacks is easier as well). Finally, to avoid reprocessing already processed text, I had the idea of utf33 ;-) This is utf32 plus the guarantee that character forms are already normalised and sorted. denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-22 09:09:41 -0500, spir denis.s...@gmail.com said: On Mon, 22 Nov 2010 08:24:33 -0500 Michel Fortin michel.for...@michelf.com wrote: I agree there might be a use case for a special data type allowing fast random access to graphemes and able to retain the precise count of graphemes. But if what you do only requires iterating over all graphemes, a wrapper range that converts to graphemes on the fly might be less overhead than building a separate data structure. It's true as long as you can assert each string is iterated at most once. But the job of constructing an instance of UText (say, a grapheme string) should be exactly the same as what each iteration has to do on the fly. Or do I miss a point? I think you missed my point. My point was that decoding on the fly while iterating might be as fast or maybe faster in most cases (which don't include grapheme clusters) than if you had already predecoded the graphemes and stored them in a grapheme-oriented data structure. I say that mostly because the variable-length nature of a grapheme makes it hard to store one efficiently. That's my opinion, but debating it is rather pointless in the absence of an implementation of each to compare. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-22 08:57:39 -0500, spir denis.s...@gmail.com said: On Mon, 22 Nov 2010 07:34:15 -0500 Michel Fortin michel.for...@michelf.com wrote: Just to add to the complexity: graphemes aren't always equivalent to user-perceived characters either. Ligatures can contain more than one user-perceived character. If you're looking for the substring flourish in a string, should it fail to match when it encounters flourish just because of the fl (fl) ligature? On most Mac applications it matches both, thanks to sensible defaults in NSString's search and comparison algorithms. That's true. I guess you're thinking of the distinction between the NFD/NFC canonical forms and the NFKD/NFKC ones (so-called compatibility). So perhaps we need yet another layer over graphemes to represent user-perceived characters. In my view, this is not the responsibility of a general-purpose tool. I guess, but may be wrong, that we are clearly entering the field of app logic and semantics. These are for me _not_ general-purpose points (but builtin type libraries often offer clearly non-general routines, like ones dealing with casing, or even less general: the set of ASCII letters). These issues would have to be dealt with either by apps or by domain-specific libraries. Is searching for a word in a text file less general purpose than searching for a specific combination of graphemes forming that word? That the implementation to get it right is quite complex doesn't make a tool less general purpose. The sole reason searching works this way in most Mac OS X (and iOS) applications is that Apple implemented it at the core of its string type and made it the default way of searching substrings and comparing strings. It's dubious whether even half of Mac applications would have implemented the thing correctly otherwise. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: std.algorithm.remove and principle of least astonishment
On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? Because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to using std.range's abstraction, you submit to using it the way it is defined. I want to use char[] as an array. I want to sort the array; how do I do this? (assume array.sort as a property is deprecated, as it should be) The problem is that the library *won't let you* treat them as arrays. Some functions see char[] as an array, and some see it as a range of dchars, and you can't declare to those functions "No! this is an array!" or "No, this is a dchar range!" That is the main problem I see with how the current code works. BTW, you may not understand that we don't want to go back to the days of 'byDchar'. We want strings (including literals) to be a special type because they are a special type (not an array). -Steve
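For reference, the workaround Andrei suggested earlier in the thread for the sorting question is the cast-to-ubyte[] route; a sketch, assuming the data is known to be ASCII-only (std.algorithm's sort requires a random-access range, which char[] does not present):

```d
import std.algorithm;
import std.stdio;

void main()
{
    char[] hand = "KQA".dup;    // ASCII-only by assumption
    sort(cast(ubyte[]) hand);   // view the code units as a sortable array
    writeln(hand);              // AKQ
}
```

This sidesteps the range view entirely, which is exactly Steve's complaint: you must opt out of the dchar abstraction manually rather than declare your intent to the library.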
Re: std.algorithm.remove and principle of least astonishment
Easy: - string_t becomes a keyword. - Syntactically speaking, string_t!T is the name of a type when T is a type. - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior. - For every other type T, string_t!T is an error. - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range. It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support, and rename it to String, but then it wouldn't be a built-in string.) Or better, if you want both ranges and random access to do the same thing, convert it to byte[], short[], and int[]. -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-22 10:37:48 -0500, Steven Schveighoffer schvei...@yahoo.com said: BTW, you may not understand that we don't want to go back to the days of 'byDchar'. We want strings (including literals) to be a special type because they are a special type (not an array). It's amusing to read this from my perspective. In my project where I'm implementing the Objective-C object model, I implemented literal Objective-C strings a few days ago. It's basically a fourth string type understood by the compiler that generates a static NSString instance in the object file. String literals with no explicit type are implicitly converted whenever needed, so it really is painless to use: NSString str = "hello"; // implicit conversion, but only for compile-time constants Here you have your NSString, all stored as static data, no memory allocation at all. So you now have your special string type that works with literals and is not an array. But it's Cocoa-only. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: std.algorithm.remove and principle of least astonishment
On 11/22/10 9:37 AM, Steven Schveighoffer wrote: On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined. I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be) Why do you want to sort an array of char? Andrei
Re: std.algorithm.remove and principle of least astonishment
On Mon, 22 Nov 2010 12:07:55 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 9:37 AM, Steven Schveighoffer wrote: On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined. I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be) Why do you want to sort an array of char? You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. More points -- what about a redblacktree!(char)? Is that 'invalid'? I mean, it's automatically sorted, so what should I do, throw an error if you try to build one? Is an Array!char a string? What about an SList!char? The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. You might say oh, well that's stupid! but then so is using the index operator on a char[] array, no? I see no difference. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. 
Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. -Steve
Re: std.algorithm.remove and principle of least astonishment
On 11/22/10 11:22 AM, Steven Schveighoffer wrote: On Mon, 22 Nov 2010 12:07:55 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 9:37 AM, Steven Schveighoffer wrote: On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined. I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be) Why do you want to sort an array of char? You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. More points -- what about a redblacktree!(char)? Is that 'invalid'? I mean, it's automatically sorted, so what should I do, throw an error if you try to build one? No, it still has well-defined semantics. It just doesn't have much sense to it. Why would you use a redblacktree of char? Probably you want one of ubyte, so then why don't you say so? Is an Array!char a string? What about an SList!char? Depends on how Array or SList are defined.
D chose to convey char[] and wchar[] specific meaning revealing that they are sequences of code points, i.e. Unicode strings. The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying I want to use UTF code points without any associated UTF meaning. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. Example? You might say oh, well that's stupid! but then so is using the index operator on a char[] array, no? I see no difference. There is a difference. Often in a loop you know the index at which a code point starts. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. I think that's a great idea. Andrei
Re: std.algorithm.remove and principle of least astonishment
On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 11:22 AM, Steven Schveighoffer wrote: You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that). The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying I want to use UTF code points without any associated UTF meaning. A literal defining an array of ubytes or ushorts is considerably more painful than one of chars. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. Example? In some poker-hand detection code I've written in C++ (and actually in D too) in the past, I can use characters to represent each card. A straightforward way to do this is to add each 'card' to a string, then sort the string. This allows me to use string functions and regex to detect hand types. You can do the same with ubytes, but it's not as easy to understand. And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library.
The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot. If I could convey this in a way that allows me to keep the nice properties of char arrays (i.e. printing as strings), then I would be fine with the library's assumption unless I told it otherwise. But there is no way currently, the library steadfastly refuses to look at it any other way than a utf-8 code sequence. It doesn't help matters that the compiler steadfastly looks at them as arrays. What I want is for the compiler *and* the library to look at strings as not arrays, and for both to look at char[] as an array. So I can clearly define my intent of how I want them to treat such variables. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. Here I am continuing to argue. I swear I'll stop after this :) At least until I have my string type ready. -Steve
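Steven's card-as-character scheme can be sketched as follows (in Python rather than D, purely for illustration; the one-character-per-rank encoding and helper names are hypothetical, not from his code). The point is that once the hand is sorted as a plain character sequence, simple string tools like a regex backreference detect hand types:

```python
import re

# Hypothetical encoding: one ASCII character per card rank, low to high.
RANKS = "23456789TJQKA"

def canonical(hand: str) -> str:
    # Sorting the characters of the hand gives a canonical form --
    # exactly the "sort the string" step described above.
    return "".join(sorted(hand, key=RANKS.index))

def has_pair(hand: str) -> bool:
    # After sorting, equal ranks are adjacent, so a regex
    # backreference finds any pair.
    return re.search(r"(.)\1", canonical(hand)) is not None

assert canonical("KT2QJ") == "2TJQK"
assert has_pair("K2K93")
assert not has_pair("A2345")
```

The same works on arrays of ubyte, of course; the argument in the thread is only about how readable the character version is.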
Re: std.algorithm.remove and principle of least astonishment
On 11/22/10 12:01 PM, Steven Schveighoffer wrote: On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 11:22 AM, Steven Schveighoffer wrote: You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that). That still stays valid. The thing is, sort doesn't sort arrays, it sorts random-access ranges. The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying I want to use UTF code points without any associated UTF meaning. A literal defining an array of ubytes or ushorts is considerably more painful than one of chars. I've been thinking for a while to have to!(const(ubyte)[]) simply insert a cast when passed const(char)[]. The cast is sound - you are asking for a view of individual code units in a string. That should help with literals. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. Example? In some poker-hand detection code I've written in C++ (and actually in D too) in the past, I can use characters to represent each card. Why not ubytes?
A straightforward way to do this is to add each 'card' to a string, then sort the string. This allows me to use string functions and regex to detect hand types. You can do the same with ubytes, but it's not as easy to understand. Why? And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot. Then what's wrong with ubyte? Why do you encode as UTF something that you know isn't UTF? Would you put an integral in a real even though you know it's only integral? If I could convey this in a way that allows me to keep the nice properties of char arrays (i.e. printing as strings), then I would be fine with the library assuming unless I told it so. How would printing as strings be meaningful? I'd suspect you'd want to print a poker hand better than by using one character per card. Even if for some odd reason you want to print ubytes as characters in some exceptional situation, why don't you write a routine that does that and get over with? But there is no way currently, the library steadfastly refuses to look at it any other way than a utf-8 code sequence. It doesn't help matters that the compiler steadfastly looks at them as arrays. What I want is for the compiler *and* the library to look at strings as not arrays, and for both to look at char[] as an array. So I can clearly define my intent of how I want them to treat such variables. I totally understand where you're coming from. I believe you also understand where I'm coming from: within the constraints of making UTF built-in, integrated, efficient, and easy to understand, I think the current decisions taken by the language are good. To directly reply to your point: instead of ascribing your desired meaning to char[], you should use char[] for UTF-8 strings exclusively. 
For arrays of bytes, there's always ubyte[]. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. Here I am continuing to argue. I swear I'll stop after this :) At least until I have my string type ready. I suspect you'll notice before long that it's a considerably more difficult task than it might seem in the beginning, and that the result is bound to be less satisfactory than the current strings in at least some dimensions. But I welcome the initiative to bring a concrete abstraction (heh, oxymoron) on the table. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 11/22/10 4:01 AM, Rainer Deyke wrote: I see, foreach still iterates over code units by default. Of course, this means that foreach over ranges doesn't work with strings, which in turn means that algorithms that use foreach over ranges are broken. Observe: import std.stdio; import std.algorithm; void main() { writeln(count!("true")("日本語")); // Three characters. } Output (compiled with Digital Mars D Compiler v2.050): 9 Thanks. http://d.puremagic.com/issues/show_bug.cgi?id=5257 Andrei
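The 9-vs-3 discrepancy comes straight from UTF-8: each of the three characters in 日本語 occupies three code units, and foreach over a char[] walks code units. A quick check (Python used only for illustration):

```python
s = "日本語"

# Three code points -- what count ought to report ...
assert len(s) == 3

# ... but nine UTF-8 code units, which is what element-by-element
# iteration over the char[] representation actually visits.
assert len(s.encode("utf-8")) == 9
```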
Re: std.algorithm.remove and principle of least astonishment
On Nov 23, 10 01:40, Andrei Alexandrescu wrote: On 11/22/10 11:22 AM, Steven Schveighoffer wrote: On Mon, 22 Nov 2010 12:07:55 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 9:37 AM, Steven Schveighoffer wrote: On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined. I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be) Why do you want to sort an array of char? You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. Right, and D3 should simply disable using char and wchar as an independent type, like void, since using a single code unit makes next to no sense either. As a side-effect, no one can complain containers of char and wchar don't work as expected because it simply won't compile. Then we can rightfully say char[] and wchar[] are special. char c = 'A'; // error: A single code unit makes no sense. Make it a ubyte or dchar instead.
int[char] d; // error: Indexing by a code unit makes no sense. Make it an int[ubyte] or int[dchar] instead. :p More points -- what about a redblacktree!(char)? Is that 'invalid'? I mean, it's automatically sorted, so what should I do, throw an error if you try to build one? No, it still has well-defined semantics. It just doesn't have much sense to it. Why would you use a redblacktree of char? Probably you want one of ubyte, so then why don't you say so? Is an Array!char a string? What about an SList!char? Depends on how Array or SList are defined. D chose to convey char[] and wchar[] specific meaning revealing that they are sequences of code points, i.e. Unicode strings. The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying I want to use UTF code points without any associated UTF meaning. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. Example? One possible application could be (assume ASCII for a moment) pure bool slowIsAnagramOf(in char[] a, in char[] b) { auto c = a.dup, d = b.dup; sort(c); sort(d); return c == d; } You might say oh, well that's stupid! but then so is using the index operator on a char[] array, no? I see no difference. There is a difference. Often in a loop you know the index at which a code point starts. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. I think that's a great idea. Andrei
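KennyTM's slowIsAnagramOf routine translates directly (here into Python, purely for comparison): sort the code units of each string, assumed ASCII, and compare the results.

```python
def slow_is_anagram_of(a: str, b: str) -> bool:
    # Assuming ASCII input, sorting the code units of each string
    # and comparing them is an anagram test, mirroring the D version
    # above (dup + sort + ==).
    return sorted(a) == sorted(b)

assert slow_is_anagram_of("listen", "silent")
assert not slow_is_anagram_of("listen", "listens")
```

On non-ASCII input the D version would be sorting UTF-8 code units, which is exactly the operation whose meaning the thread is disputing.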
Re: std.algorithm.remove and principle of least astonishment
On 11/22/2010 11:55, Andrei Alexandrescu wrote: On 11/22/10 4:01 AM, Rainer Deyke wrote: I see, foreach still iterates over code units by default. Of course, this means that foreach over ranges doesn't work with strings, which in turn means that algorithms that use foreach over ranges are broken. Observe: import std.stdio; import std.algorithm; void main() { writeln(count!("true")("日本語")); // Three characters. } Output (compiled with Digital Mars D Compiler v2.050): 9 Thanks. http://d.puremagic.com/issues/show_bug.cgi?id=5257 I think this bug is a symptom of a larger issue. The range abstraction is too fragile. If even you can't use the range abstraction correctly (in the library that defines this abstraction no less), how can you expect anyone else to do so? At the very least, this is a sign that std.algorithm needs more thorough testing, and/or a thorough code review. This is far from the only use of foreach on a range in std.algorithm. It just happens to be the first example I found to illustrate my point. -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
Andrei Alexandrescu Wrote: On 11/22/10 12:01 PM, Steven Schveighoffer wrote: On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/22/10 11:22 AM, Steven Schveighoffer wrote: You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said well if somehow you do imagine such a situation, you achieve that by saying what you mean: cast the char[] to ubyte[] and sort that. That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that). That still stays valid. The thing is, sort doesn't sort arrays, it sorts random-access ranges. The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying I want to use UTF code points without any associated UTF meaning. A literal defining an array of ubytes or ushorts is considerably more painful than one of chars. I've been thinking for a while to have to!(const(ubyte)[]) simply insert a cast when passed const(char)[]. The cast is sound - you are asking for a view of individual code units in a string. That should help with literals. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. Example?
In some poker-hand detection code I've written in C++ (and actually in D too) in the past, I can use characters to represent each card. Why not ubytes? A straightforward way to do this is to add each 'card' to a string, then sort the string. This allows me to use string functions and regex to detect hand types. You can do the same with ubytes, but it's not as easy to understand. Why? And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot. Then what's wrong with ubyte? Why do you encode as UTF something that you know isn't UTF? Would you put an integral in a real even though you know it's only integral? If I could convey this in a way that allows me to keep the nice properties of char arrays (i.e. printing as strings), then I would be fine with the library assuming unless I told it so. How would printing as strings be meaningful? I'd suspect you'd want to print a poker hand better than by using one character per card. Even if for some odd reason you want to print ubytes as characters in some exceptional situation, why don't you write a routine that does that and get over with? But there is no way currently, the library steadfastly refuses to look at it any other way than a utf-8 code sequence. It doesn't help matters that the compiler steadfastly looks at them as arrays. What I want is for the compiler *and* the library to look at strings as not arrays, and for both to look at char[] as an array. So I can clearly define my intent of how I want them to treat such variables. I totally understand where you're coming from. I believe you also understand where I'm coming from: within the constraints of making UTF built-in, integrated, efficient, and easy to understand, I think the current decisions taken by the language are good. 
To directly reply to your point: instead of ascribing your desired meaning to char[], you should use char[] for UTF-8 strings exclusively. For arrays of bytes, there's always ubyte[]. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. Here I am continuing to argue. I swear I'll stop after this :) At least until I have my string type ready. I suspect you'll notice before long that it's a considerably more difficult task than it might seem in the beginning, and that the result is bound to be less satisfactory than the current strings in at least some dimensions. But I welcome the initiative to bring a concrete abstraction (heh, oxymoron) on the table. Andrei Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc.
Re: std.algorithm.remove and principle of least astonishment
On 11/22/10 5:59 PM, foobar wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte. D's [w|d]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstraction in unicode (code point, code unit, grapheme). Naming matters and having a cat called dog (char is actually a code unit) is a source of bugs. I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else. Andrei
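The "code unit, not character" naming that Andrei defends is easy to demonstrate: a character outside the Basic Multilingual Plane takes several code units in both UTF-8 and UTF-16, and only UTF-32 guarantees one code unit per code point (Python used for illustration):

```python
c = "\U0001d11e"  # MUSICAL SYMBOL G CLEF: one code point outside the BMP

# Four UTF-8 code units -- what D's char holds one of.
assert len(c.encode("utf-8")) == 4

# Two UTF-16 code units (a surrogate pair) -- D's wchar.
assert len(c.encode("utf-16-le")) // 2 == 2

# One UTF-32 code unit -- D's dchar, where unit and code point coincide.
assert len(c.encode("utf-32-le")) // 4 == 1
```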
Re: std.algorithm.remove and principle of least astonishment
On Monday 22 November 2010 16:45:43 Andrei Alexandrescu wrote: On 11/22/10 5:59 PM, foobar wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte. D's [w|d]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstraction in unicode (code point, code unit, grapheme). Naming matters and having a cat called dog (char is actually a code unit) is a source of bugs. I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else. The problem with char is that so many people are used to thinking of char as a character rather than a code unit. Once you get past that, though, it's fine. I think that it's very well thought out as it is. It just takes some getting used to. Unfortunately though, it seems thinking of a char as UTF-8 code unit and _never_ dealing with it as a character is hard for a lot of people to adjust to. - Jonathan M Davis
Re: std.algorithm.remove and principle of least astonishment
Rainer Deyke Wrote: On 11/22/2010 11:55, Andrei Alexandrescu wrote: http://d.puremagic.com/issues/show_bug.cgi?id=5257 I think this bug is a symptom of a larger issue. The range abstraction is too fragile. If even you can't use the range abstraction correctly (in the library that defines this abstraction no less), how can you expect anyone else to do so? Note that this issue with foreach has been discussed before. The suggested solution was to have foreach infer dchar instead of char (shot down since iterating char is useful and it is simple to add the type dchar). Maybe a range interface (as found in std.string) should take precedence over arrays in foreach? Or maybe foreach should only work with ranges and opApply (that would mean std.array would need to be imported to use foreach with arrays)? That wouldn't address your exact issue. I tend to agree with Andrei as you should be coding to the Range interface which will prevent any misuse of char/wchar. On the other hand, why can't I have a range of char (I mean get one from an array, not that I would ever want to)? Anyway, I agree char[] is a special case, but I also agree it isn't an issue.
Re: std.algorithm.remove and principle of least astonishment
On 11/20/10 9:42 PM, Rainer Deyke wrote: On 11/20/2010 16:58, Andrei Alexandrescu wrote: On 11/20/10 12:32 PM, Rainer Deyke wrote: std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences in account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool. The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior. The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one. The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem. char[] does not exhibit the same issues that vector<bool> has. The situation is very different, and again, trying to reduce one to another misses a lot of the picture. vector<bool> hides representation and in doing so becomes non-compliant with vector<T> which does expose representation. Worse, vector<bool> is not compliant with any concept, express or implied, which makes vector<bool> virtually unusable with generic code.
In contrast, char[] exposes a meaningful representation (array of code units) that is often useful, and obeys a slightly weaker formal abstraction (bidirectional range) which is also useful. It's simply a very different setup from vector<bool>, and again attempting to use one in predicting the fare of the other is a poor approach. Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly used when T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest. It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit. You may rest assured that if anything, strings are not a problem. The way the abstractions are laid out make D's strings the best approach to Unicode strings I know about. The above is only fallacious presupposition. Algorithms in Phobos are abstracted on the formal range interface, and as such you won't be exposed to risks when using them with strings. I'm not concerned about algorithms, I'm concerned about code that uses arrays directly. Like my Vector!char example, which I see you still haven't addressed. When you define your abstractions, you are free to decide how you want to go about them. The D programming language makes it unequivocally clear that char[] is an array of UTF-8 code units that offers a bidirectional range of code points. Same about wchar[] (replace UTF-8 with UTF-16). dchar[] is an array of UTF-32 code points which are equivalent to code units, and as such is a full random-access range.
If you define your own function that uses an array directly, such as sort(), then attempting to sort a char[] will get you exactly what you expect - you sort the code units in the array. The sort routine in the standard library is modeled to work with random access ranges, and will refuse to sort a char[]. I have often reflected whether I'd do things differently if I could go back in time and join Walter when he invented D's strings. I might have done one or two things differently, but the gain would be marginal at best. In fact, it's not impossible the balance of things could have been hurt. Between speed, simplicity, effectiveness, abstraction, access to representation, and economy of means, D's strings are the best compromise out there that I know of, bar none by a wide margin. Andrei
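A minimal sketch of the workaround Andrei describes: std.algorithm.sort refuses a char[] (it is not a random-access range of its elements), but casting to ubyte[] says what you mean and sorts the code units in place. The variable name is illustrative only.

```d
import std.algorithm : sort;

void main()
{
    char[] hand = "hello".dup;
    // sort(hand);           // does not compile: char[] is a bidirectional
    //                       // range of dchar, not a random-access range
    sort(cast(ubyte[]) hand); // sort the code units explicitly
    assert(hand == "ehllo");
}
```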
Re: std.algorithm.remove and principle of least astonishment
On 11/21/2010 11:23, Andrei Alexandrescu wrote: On 11/20/10 9:42 PM, Rainer Deyke wrote: On 11/20/2010 16:58, Andrei Alexandrescu wrote: The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one. The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem. char[] does not exhibit the same issues that vector<bool> has. The situation is very different, and again, trying to reduce one to another misses a lot of the picture. I agree that there are differences. For one thing, if you iterate over a std::vector<bool> you get actual booleans, albeit through an extra layer of indirection. If you iterate over char[] you might get chars or you might get dchars depending on the method you use for iterating. char[] isn't the equivalent of std::vector<bool>. It's worse. char[] is the equivalent of a vector<bool> that keeps the current behavior of std::vector<bool> when iterating through iterators, but gives access to bytes of packed booleans when using operator[]. vector<bool> hides representation and in doing so becomes non-compliant with vector<T> which does expose representation. Worse, vector<bool> is not compliant with any concept, express or implied, which makes vector<bool> virtually unusable with generic code. The ways in which std::vector<bool> differs from any other vector are well understood. It uses proxies instead of true references. Its iterators meet the requirements of input/output iterators (or in boost terms, readable, writable iterators with random access traversal). Any generic code written with these limitations in mind can use std::vector<T> freely. (The C++ standard library doesn't play nicely with std::vector<bool>, but that's another issue entirely.) std::vector<bool> is a useful type, it just isn't a std::vector. 
In that respect, its situation is analogous to that of char[]. It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit. You may rest assured that if anything, strings are not a problem. I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or may not be a character type. I see that you ignored my Vector!char example yet again. Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D). -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
On 11/21/10 6:12 PM, Rainer Deyke wrote: On 11/21/2010 11:23, Andrei Alexandrescu wrote: On 11/20/10 9:42 PM, Rainer Deyke wrote: On 11/20/2010 16:58, Andrei Alexandrescu wrote: The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one. The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem. char[] does not exhibit the same issues that vector<bool> has. The situation is very different, and again, trying to reduce one to another misses a lot of the picture. I agree that there are differences. For one thing, if you iterate over a std::vector<bool> you get actual booleans, albeit through an extra layer of indirection. If you iterate over char[] you might get chars or you might get dchars depending on the method you use for iterating. This is sensible because a string may be seen as a sequence of code points or a sequence of code units. Either view is useful. char[] isn't the equivalent of std::vector<bool>. It's worse. char[] is the equivalent of a vector<bool> that keeps the current behavior of std::vector<bool> when iterating through iterators, but gives access to bytes of packed booleans when using operator[]. I explained why char[] is better than vector<bool>. Ignoring the explanation and restating a fallacious conclusion based on an overstretched parallel does hardly much to push forward the discussion. Again: code units _are_ well-defined, useful to have access to, and good for a variety of uses. Please understand this. vector<bool> hides representation and in doing so becomes non-compliant with vector<T> which does expose representation. Worse, vector<bool> is not compliant with any concept, express or implied, which makes vector<bool> virtually unusable with generic code. 
The ways in which std::vector<bool> differs from any other vector are well understood. It uses proxies instead of true references. Its iterators meet the requirements of input/output iterators (or in boost terms, readable, writable iterators with random access traversal). Any generic code written with these limitations in mind can use std::vector<T> freely. (The C++ standard library doesn't play nicely with std::vector<bool>, but that's another issue entirely.) std::vector<bool> is a useful type, it just isn't a std::vector. In that respect, its situation is analogous to that of char[]. It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit. You may rest assured that if anything, strings are not a problem. I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or may not be a character type. I see that you ignored my Vector!char example yet again. I sure have replied to it, but probably my reply hasn't been read. Please allow me to paste it again: When you define your abstractions, you are free to decide how you want to go about them. The D programming language makes it unequivocally clear that char[] is an array of UTF-8 code units that offers a bidirectional range of code points. Same about wchar[] (replace UTF-8 with UTF-16). dchar[] is an array of UTF-32 code points which are equivalent to code units, and as such is a full random-access range. So it's up to you what Vector!char does. In D char[] is an array of code units that can be iterated as a bidirectional range of code points. I don't see anything cagey about that. 
Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D). I prefaced my assurances with logical arguments that I can only assume went unread. You are of course free to your opinion (though it would be great if it were more grounded in real reasons); the rest of us will continue enjoying D2 strings. Andrei
Re: std.algorithm.remove and principle of least astonishment
On Sunday 21 November 2010 16:12:14 Rainer Deyke wrote: It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit. You may rest assured that if anything, strings are not a problem. I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or not be a character type. I see that you ignored my Vector!char example yet again. Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D). Character arrays are arrays of code units and ranges of code points (of dchar specifically). If you want them to be treated as code points, access them as ranges. If you want to treat them as code units, access them as arrays. So, as far as character arrays go, there shouldn't be any problems. You just have to be aware of the difference between a char or wchar and a character. Now, as for Array!char or any other container which could be considered a sequence of code units, there, we could be in trouble if we want to treat them as code points rather than code units. I believe that ranges over them would be over code units rather than code points, and if that's the case, you're going to have to deal with char and wchar as arrays if you want to treat them as ranges of dchar. We should be able to get around the problem by special-casing the containers on char and wchar, but that would mean more work for anyone implementing a container where it would be reasonable to see its elements as a sequence of code units making up a string. It's quite doable though. - Jonathan M Davis
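Jonathan's distinction between the array view (code units) and the range view (code points) is directly observable; a small sketch, assuming a string containing one multi-byte character:

```d
import std.range : walkLength;

void main()
{
    string s = "héllo";        // 'é' occupies two UTF-8 code units
    assert(s.length == 6);     // array view: counts code units
    assert(s.walkLength == 5); // range view: counts code points
    assert(s[1] != 'é');       // indexing yields a lone code unit, not the character
}
```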
Re: std.algorithm.remove and principle of least astonishment
On Sunday 21 November 2010 16:48:53 Jonathan M Davis wrote: On Sunday 21 November 2010 16:12:14 Rainer Deyke wrote: It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit. You may rest assured that if anything, strings are not a problem. I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or not be a character type. I see that you ignored my Vector!char example yet again. Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D). Character arrays are arrays of code units and ranges of code points (of dchar specifically). If you want them to be treated as code points, access them as ranges. If you want to treat them as code units, access them as arrays. So, as far as character arrays go, there shouldn't be any problems. You just have to be aware of the difference between a char or wchar and a character. Now, as for Array!char or any other container which could be considered a sequence of code units, there, we could be in trouble if we want to treat them as code points rather than code units. I believe that ranges over them would be over code units rather than code points, and if that's the case, you're going to have to deal with char and wchar as arrays if you want to treat them as ranges of dchar. We should be able to get around the problem by special-casing the containers on char and wchar, but that would mean more work for anyone implementing a container where it would be reasonable to see its elements as a sequence of code units making up a string. It's quite doable though. 
Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to. - Jonathan M Davis
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges. It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way. There is no easy notion of character in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as the default character unit is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character? Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words: string str = "hello"; foreach (cu; str) {} // code unit iteration foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range. Here's a nice reference about unicode graphemes, word segmentation, and related algorithms. http://unicode.org/reports/tr29/ -- Michel Fortin michel.for...@michelf.com http://michelf.com/
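The codePoints and graphemes properties above are hypothetical, but later versions of Phobos did grow names for all three levels (byCodeUnit and byUTF in std.utf, byGrapheme in std.uni), so the proposal can be sketched roughly as:

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byUTF;

void main()
{
    string str = "hello";
    foreach (cu; str.byCodeUnit) {}  // code unit iteration (char)
    foreach (cp; str.byUTF!dchar) {} // code point iteration (dchar)
    foreach (gr; str.byGrapheme) {}  // grapheme iteration
}
```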
Re: std.algorithm.remove and principle of least astonishment
On 11/21/10 7:00 PM, Jonathan M Davis wrote: Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to. I agree except for the majority of cases part. In fact the original design of range interfaces for char[] and wchar[] was to require byDchar() to get a bidirectional interface over the arrays of code units. That design, with which I experimented for a while, had two drawbacks: 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units. 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar(). The second iteration of the design, which is currently in use, was to define in std.range the primitives such that char[] and wchar[] offer by default the bidirectional range interface. I have gone through all algorithms in std.algorithm and std.string and noticed with amazed satisfaction that they most always did the right thing, and that I could tweak the few that didn't to complete a satisfactory implementation. (indexOf has slipped through the cracks.) I think that experience with the current design is speaking in its favor. One thing could be done to drive the point home: a function byCodeUnit() could be added that actually does iterate a char[] or a wchar[] one code unit at a time (and consequently restores their behavior as T[]). 
That function could be simply a cast to ubyte[]/ushort[], or it could introduce a random-access range. Andrei
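A sketch of the byCodeUnit() floated here, in its simplest form as a reinterpreting cast (note this name later landed in std.utf with a different, range-based design; the functions below are illustrative only):

```d
// reinterpret the code units as plain integers, restoring T[] behavior
ubyte[]  byCodeUnit(char[] s)  { return cast(ubyte[]) s; }
ushort[] byCodeUnit(wchar[] s) { return cast(ushort[]) s; }

void main()
{
    char[] s = "héllo".dup;
    auto units = s.byCodeUnit; // random-access range of ubyte
    assert(units.length == 6);
    assert(units[1] == 0xC3);  // first byte of the two-unit 'é' sequence
}
```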
Re: std.algorithm.remove and principle of least astonishment
On 11/21/10 7:11 PM, Michel Fortin wrote: On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges. It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way. I agree. There is no easy notion of character in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as the default character unit is repeating same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character? I understand the concern, and that's why I strongly support formal abstractions that are supported by, but largely independent from, representations. If graphemes are to be modeled, D is in better shape than other languages. What we need to do is define a range byGrapheme() that accepts char[], wchar[], or dchar[]. Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. 
In other words: string str = "hello"; foreach (cu; str) {} // code unit iteration foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range. Here's a nice reference about unicode graphemes, word segmentation, and related algorithms. http://unicode.org/reports/tr29/ I agree except for the fact that in my experience you want to iterate over code points much more often than over code units. Iterating by code unit by default is almost always wrong. That's why D's strings offer the bidirectional interface by default. I have reasons to believe it was a good decision. Andrei
Re: std.algorithm.remove and principle of least astonishment
On Sunday 21 November 2010 17:21:27 Andrei Alexandrescu wrote: On 11/21/10 7:00 PM, Jonathan M Davis wrote: Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to. I agree except for the majority of cases part. In fact the original design of range interfaces for char[] and wchar[] was to require byDchar() to get a bidirectional interface over the arrays of code units. That design, with which I experimented for a while, had two drawbacks: 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units. 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar(). The second iteration of the design, which is currently in use, was to define in std.range the primitives such that char[] and wchar[] offer by default the bidirectional range interface. I have gone through all algorithms in std.algorithm and std.string and noticed with amazed satisfaction that they most always did the right thing, and that I could tweak the few that didn't to complete a satisfactory implementation. (indexOf has slipped through the cracks.) I think that experience with the current design is speaking in its favor. 
One thing could be done to drive the point home: a function byCodeUnit() could be added that actually does iterate a char[] or a wchar[] one code unit at a time (and consequently restores their behavior as T[]). That function could be simply a cast to ubyte[]/ushort[], or it could introduce a random-access range. Well, I don't know for certain whether people would normally want to iterate over Array!char as a char range or a dchar range. However, when thinking about the likely uses, it seems to me that if you really want a string, you'd likely be using a string rather than Array!char, so I figure that the most likely use case for Array!char would be to iterate over a range of char. But I could be totally wrong about that. As for character arrays, I do think that the normal use case is to want to see them as ranges of dchar rather than char or wchar. However, that can get a bit funny due to the fact that while the _programmer_ almost always views them that way, the _algorithms_ vary quite a bit more in whether they really want dchar or whether char or wchar works just fine. I do agree that the current design works quite well overall, though. If I were to change it, I'd probably make strings into structs which have an array property (giving access to the char[] or wchar[] array if you need it) and give the struct a range interface which was over dchar. To really make that work, though, you'd need uniform function call syntax (or things like str.splitlines() would quit working), and there could be other reasons why it would fall apart. However, it would quickly and easily make dchar iteration the default while still allowing access to the interior char[] or wchar[]. But since you'd still have to special case functions which actually wanted the char[] or wchar[], I'm not sure if you ultimately gain much - though it does fix the foreach error where it defaults to char or wchar. Overall, what we have works quite well. 
It _is_ a bit convoluted at times, but it's generally convoluted because of the nature of unicode rather than how we're implementing it. It's not perfect (unicode is too disgusting for perfection to be possible anyway), but it works _far_ better than any other language that I've used, and I actually understand unicode and its issues far better than I did before messing around with D strings. - Jonathan M Davis
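The strings-as-structs idea from this message could be sketched roughly as follows. The String name and layout are hypothetical (nothing like this exists in Phobos); the point is only that dchar iteration becomes the default while the raw code units stay reachable through .array:

```d
import std.range : walkLength;
import std.utf : decode, stride;

// hypothetical wrapper: a dchar range by default, code units via .array
struct String
{
    string array; // direct access to the underlying code units

    @property bool empty() const { return array.length == 0; }
    @property dchar front() const
    {
        size_t i = 0;
        return decode(array, i); // decode one code point at the front
    }
    void popFront() { array = array[stride(array, 0) .. $]; }
}

void main()
{
    auto s = String("héllo");
    assert(s.array.length == 6); // code units still exposed
    assert(s.walkLength == 5);   // but iteration is by code point
}
```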
Re: std.algorithm.remove and principle of least astonishment
On Sunday 21 November 2010 17:27:06 Andrei Alexandrescu wrote: On 11/21/10 7:11 PM, Michel Fortin wrote: On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges. It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way. I agree. There is no easy notion of character in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as the default character unit is repeating same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character? I understand the concern, and that's why I strongly support formal abstractions that are supported by, but largely independent from, representations. If graphemes are to be modeled, D is in better shape than other languages. What we need to do is define a range byGrapheme() that accepts char[], wchar[], or dchar[]. Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. 
In other words: We could always define an abstract Character (or whatever you want to call it) which holds a character - regardless of whether it uses a grapheme or not - and make it relatively easy to iterate over Characters rather than dchars. It would be nice if they abolished graphemes though... It is quite possible that while D's handling of unicode is a huge improvement over other languages, by treating dchar as a full character essentially everywhere, we're opening ourselves up for a variety of bugs caused by graphemes which will be subtle and hard to find. But I'm not sure what the correct solution to that is. - Jonathan M Davis
Re: std.algorithm.remove and principle of least astonishment
On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: That design, with which I experimented for a while, had two drawbacks: 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units. 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar(). Well, basically these two arguments are the same: iterating by code unit isn't a good default. And I agree. But I'm unconvinced that iterating by dchar is the right default either. For one thing it has more overhead, and for another it still doesn't represent a character. Now, add graphemes to the equation and you have a representation that matches the user-perceived character concept, but for that you add another layer of decoding overhead and a variable-size data type to manipulate (a grapheme is a sequence of code points). And you have to use Unicode normalization when comparing graphemes. So is that a good default? Probably not. It might be correct in some sense, but it's totally overkill for most cases. My thinking is that there is no good default. If you write an XML parser, you'll probably want to work at the code point level; if you write a JSON parser, you can easily skip the overhead and work at the UTF-8 code unit level until you start parsing a string; if you write something to count the number of user-perceived characters or want to map characters to a font then you'll want graphemes... Perhaps there should be simply no default; perhaps you should be forced to choose explicitly at which layer you want to operate each time you apply an algorithm on a string... 
and to make this less painful we could have functions in std.string acting as a thin layer over similar ones in std.algorithm that would automatically choose the right representation for the algorithm depending on the operation. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: std.algorithm.remove and principle of least astonishment
On 11/21/2010 17:31, Andrei Alexandrescu wrote: On 11/21/10 6:12 PM, Rainer Deyke wrote: I agree that there are differences. For one thing, if you iterate over a std::vector<bool> you get actual booleans, albeit through an extra layer of indirection. If you iterate over char[] you might get chars or you might get dchars depending on the method you use for iterating. This is sensible because a string may be seen as a sequence of code points or a sequence of code units. Either view is useful. I don't dispute that either view is useful. char[] isn't the equivalent of std::vector<bool>. It's worse. char[] is the equivalent of a vector<bool> that keeps the current behavior of std::vector<bool> when iterating through iterators, but gives access to bytes of packed booleans when using operator[]. I explained why char[] is better than vector<bool>. Ignoring the explanation and restating a fallacious conclusion based on an overstretched parallel does hardly much to push forward the discussion. I'm not interested in discussing if char[] is overall a better data structure than std::vector<bool>. I'm focusing on one particular property of both. std::vector<bool> fails to provide some of the guarantees of all other instances of std::vector<T>. This means that generic code that uses std::vector<T> needs to take special consideration of std::vector<bool> if it wants to work correctly when T = bool. This is an indisputable fact. char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. This means that generic code that uses T[] needs to take special consideration of char[] if it wants to work correctly when T = char. This is also an indisputable fact. I don't think it's much of a stretch to draw an analogy from std::vector<bool> to char[] based on this. However, even if std::vector<bool> did not exist, I would still consider this a design flaw of char[]. Again: code units _are_ well-defined, useful to have access to, and good for a variety of uses. Please understand this. 
Again, I understand this and don't dispute it. It's a complete non-sequitur to this discussion. I'm not arguing against the string type providing access to both code points and code units. I'm arguing against the string type having the name of the array when it doesn't share the behavior of an array. I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or not be a character type. I see that you ignored my Vector!char example yet again. I sure have replied to it, but probably my reply hasn't been read. Please allow me to paste it again: When you define your abstractions, you are free to decide how you want to go about them. The D programming language makes it unequivocally clear that char[] is an array of UTF-8 code units that offers a bidirectional range of code points. Same about wchar[] (replace UTF-8 with UTF-16). dchar[] is an array of UTF-32 code points which are equivalent to code units, and as such is a full random-access range. So it's up to you what Vector!char does. In D char[] is an array of code units that can be iterated as a bidirectional range of code points. I don't see anything cagey about that. Ah, I did read that, but it doesn't address my concerns about Vector!char at all. I'm aware that I can write Vector!char to act like a container of code units. I'm also aware that I can write Vector!char to automatically translate to code points. My concerns are these: - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char. - char[] sets a precedent of Container!char providing a dchar range interface. 
Other containers must choose to either follow this precedent or to avoid it. Either choice may require extra work when implementing the container. Either choice can lead to surprising behavior for the user of the container. -- Rainer Deyke - rain...@eldwood.com
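The two views being argued over can be seen in a few lines of D (a minimal sketch; the output comments assume the range primitives in std.range, which decode narrow strings to dchar, as Phobos does):

```d
import std.range;
import std.stdio;

void main()
{
    string s = "héllo";      // 'é' occupies two UTF-8 code units

    // Array view: length and indexing operate on code units.
    writeln(s.length);       // 6
    writeln(s[0]);           // 'h' (a char, i.e. one code unit)

    // Range view: front/popFront decode code points (dchar).
    size_t codePoints;
    for (auto r = s; !r.empty; r.popFront())
        ++codePoints;
    writeln(codePoints);     // 5
}
```

The same char[] thus reports two different element counts depending on whether it is consumed through the array interface or the range interface, which is exactly the property the thread disputes.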
Re: std.algorithm.remove and principle of least astonishment
On 11/21/10 22:09 CST, Rainer Deyke wrote: On 11/21/2010 17:31, Andrei Alexandrescu wrote: char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. What exactly are those guarantees? - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char. This is exactly where your point falls apart. I'm actually glad you wrote it down explicitly because this makes it simple to achieve the goal of putting you in the position to both understand where your point is wrong, and also the goal of putting you in the position for an aha moment or at least an "all right, grumble grumble" moment. What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it. But this is exactly the problem. If you want to use range primitives, you submit to the requirement of ranges. So you write the generic function to ask for ranges (with e.g. isForwardRange etc). Otherwise your code is incorrect. If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? Because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined. So: if you want to use char[] as an array with the built-in array interface, no problem. 
If you want to use char[] as a range with the range interface as defined by std.range, again no problem. But asking for one and then surreptitiously using the other is simply incorrect code. You can't use std.range while at the same time complaining you can't be bothered to read its docs. - char[] sets a precedent of Container!char providing a dchar range interface. Other containers must choose to either follow this precedent or to avoid it. Either choice may require extra work when implementing the container. Either choice can lead to surprising behavior for the user of the container. Encoded strings bring with them the necessity of encoding and decoding. That is an expected feature. It is up to your container whether it wants to do so or it needs to pass it to the client. I challenge you to define an alternative built-in string that fares better than string Comp. Before long you'll be overwhelmed by the various necessities imposed by your constraints. Andrei
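The distinction between the array interface and the range interface drawn above is easy to demonstrate (a minimal sketch; s[1] and the decoded front deliberately disagree on a multi-byte character):

```d
import std.range;
import std.stdio;

void main()
{
    string s = "déjà";                  // 6 code units, 4 code points

    // Array interface: direct code-unit access.
    writeln(s.length);                  // 6
    writefln("%02x", cast(ubyte) s[1]); // c3, first byte of 'é'

    // Range interface: decoded code points.
    writeln(s.front);                   // d
    writeln(s[1 .. $].front);           // é, decoded from two code units
}
```

Code that sticks to one interface or the other is consistent; mixing them on the same index is where the surprise lives.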
Re: std.algorithm.remove and principle of least astonishment
On 11/21/2010 21:56, Andrei Alexandrescu wrote: On 11/21/10 22:09 CST, Rainer Deyke wrote: On 11/21/2010 17:31, Andrei Alexandrescu wrote: char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. What exactly are those guarantees? That the range view and the array view provide direct access to the same data. One of the useful features of most arrays is that an array of T can be treated as a range of T. However, this feature is missing for arrays of char and wchar. - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char. What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it. No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop. But this is exactly the problem. If you want to use range primitives, you submit to the requirement of ranges. So you write the generic function to ask for ranges (with e.g. isForwardRange etc). Otherwise your code is incorrect. Again, my generic function declares the array as a local variable or a member variable. It cannot declare a generic range. If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. 
If you submit to use std.range's abstraction, you submit to using it the way it is defined. It absolutely is natural to mix these in code that is written without consideration for strings, especially when you consider that foreach also uses the range interface. Let's say I have an array and I want to iterate over the first ten items. My first instinct would be to write something like this: foreach (item; array[0 .. 10]) { doSomethingWith(item); } Simple, natural, readable code. Broken for arrays of char or wchar, but in a way that is difficult to detect. So: if you want to use char[] as an array with the built-in array interface, no problem. If you want to use char[] as a range with the range interface as defined by std.range, again no problem. But asking for one and then surreptitiously using the other is simply incorrect code. You can't use std.range while at the same time complaining you can't be bothered to read its docs. This would sound reasonable if I were using char[] directly. I'm not. I'm using T[] in a generic context. I may not have considered the case of T = char when I wrote the code. The code may even have originally used Widget[] before I decided to make it generic. I challenge you to define an alternative built-in string that fares better than string Comp. Before long you'll be overwhelmed by the various necessities imposed by your constraints. Easy: - string_t becomes a keyword. - Syntactically speaking, string_t!T is the name of a type when T is a type. - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior. - For every other type T, string_t!T is an error. - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range. It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. 
(I'd rather treat string_t as a library template with compiler support like and rename it to String, but then it wouldn't be a built-in string.) -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
On 11/21/10 11:59 PM, Rainer Deyke wrote: On 11/21/2010 21:56, Andrei Alexandrescu wrote: On 11/21/10 22:09 CST, Rainer Deyke wrote: On 11/21/2010 17:31, Andrei Alexandrescu wrote: char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. What exactly are those guarantees? That the range view and the array view provide direct access to the same data. Where do ranges state that assumption? One of the useful features of most arrays is that an array of T can be treated as a range of T. However, this feature is missing for arrays of char and wchar. This is not a guarantee by ranges, it's just a mistaken assumption. - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char. What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it. No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop. A function that operates on ranges would have an appropriate constraint so it would work properly or not at all. foreach works fine with all arrays. But this is exactly the problem. If you want to use range primitives, you submit to the requirement of ranges. So you write the generic function to ask for ranges (with e.g. isForwardRange etc). Otherwise your code is incorrect. Again, my generic function declares the array as a local variable or a member variable. It cannot declare a generic range. 
If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined. It absolutely is natural to mix these in code that is written without consideration for strings, especially when you consider that foreach also uses the range interface. Let's say I have an array and I want to iterate over the first ten items. My first instinct would be to write something like this: foreach (item; array[0 .. 10]) { doSomethingWith(item); } Simple, natural, readable code. Broken for arrays of char or wchar, but in a way that is difficult to detect. Why is it broken? Please try it to convince yourself of the contrary. So: if you want to use char[] as an array with the built-in array interface, no problem. If you want to use char[] as a range with the range interface as defined by std.range, again no problem. But asking for one and then surreptitiously using the other is simply incorrect code. You can't use std.range while at the same time complaining you can't be bothered to read its docs. This would sound reasonable if I were using char[] directly. I'm not. I'm using T[] in a generic context. I may not have considered the case of T = char when I wrote the code. The code may even have originally used Widget[] before I decided to make it generic. Fine. Use T[] generically in conjunction with the array primitives. If you plan to use them with the range primitives, you do as ranges do. I challenge you to define an alternative built-in string that fares better than string Comp. Before long you'll be overwhelmed by the various necessities imposed by your constraints. Easy: - string_t becomes a keyword. 
- Syntactically speaking, string_t!T is the name of a type when T is a type. - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior. - For every other type T, string_t!T is an error. - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range. It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support like and rename it to String, but then it wouldn't be a built-in string.) I very much prefer the current state of affairs. Andrei
Re: std.algorithm.remove and principle of least astonishment
On Mon, Nov 22, 2010 at 1:08 AM, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 11/21/10 11:59 PM, Rainer Deyke wrote: On 11/21/2010 21:56, Andrei Alexandrescu wrote: On 11/21/10 22:09 CST, Rainer Deyke wrote: - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char. What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it. No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop. A function that operates on ranges would have an appropriate constraint so it would work properly or not at all. foreach works fine with all arrays. One gotcha that seems to occur here is this code: foreach(index, character; someString) assert(someString[index] == character); I don't really have much that's meaningful to add to this discussion except to say that it shouldn't be easy to write code like the above. I spent a few hours today figuring out why that wouldn't work.
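The gotcha in that snippet can be made concrete (a small sketch; in a decoding foreach the index counts code units, not code points):

```d
import std.stdio;

void main()
{
    string s = "héllo";

    // Asking for dchar elements decodes code points, but the index
    // still advances by code units, so it skips a slot after every
    // multi-byte character:
    foreach (i, dchar c; s)
        writefln("%s: %s", i, c);   // 0 h, 1 é, 3 l, 4 l, 5 o

    // s[1] is the code unit 0xC3 (first byte of 'é'), not the code
    // point 'é', so assert(s[index] == character) can fail.
}
```

With a plain foreach (no dchar annotation) the element type is char and the indices line up again, which is part of why the behavior is hard to spot.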
Re: std.algorithm.remove and principle of least astonishment
Andrei Alexandrescu wrote: ... I agree except for the fact that in my experience you want to iterate over code points much more often than over code units. Iterating by code unit by default is almost always wrong. That's why D's strings offer the bidirectional interface by default. I have reasons to believe it was a good decision. Andrei Is there a plan to make std.string and std.algorithm more compatible with this view? Nearly all algorithms in std.string work with slices or substrings rather than code units or code points. I found it sometimes hard to mix and match that approach with the API that std.algorithm offers. Maybe I'm missing something.
Re: std.algorithm.remove and principle of least astonishment
On Fri, 19 Nov 2010 22:04:51 -0700 Rainer Deyke rain...@eldwood.com wrote: On 11/19/2010 16:40, Andrei Alexandrescu wrote: On 11/19/10 12:59 PM, Bruno Medeiros wrote: Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays). Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]? Let me clarify what I'm suggesting: * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts. * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods be appropriate, including opIndex. * string literals would be of type string and wstring, not char[] and wchar[]. * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents. Problem solved. You're welcome. (as John Hodgman would say) No? I don't think that would mark an improvement. You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++? I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? 
Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly? Hello Rainer, The original proposal by Bruno would simplify some project I have in mind (namely, a higher-level universal text type already evoked). The issues you point to intuitively seem relevant to me, but I cannot really understand any. Would you be kind enough to expand a bit on each question? (Thinking of people who know nothing of C++ -- yes, they exist ;-) Denis -- vit esse estrany ☣ spir.wikidot.com
Re: std.algorithm.remove and principle of least astonishment
On 11/20/2010 05:12, spir wrote: On Fri, 19 Nov 2010 22:04:51 -0700 Rainer Deyke rain...@eldwood.com wrote: You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++? I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly? Hello Rainer, The original proposal by Bruno would simplify some project I have in mind (namely, a higher-level universal text type already evoked). The issues you point to intuitively seem relevant to me, but I cannot really understand any. Would you be kind enough to expand a bit on each question? (Thinking of people who know nothing of C++ -- yes, they exist ;-) std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool. The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. 
Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior. Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest. -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
On 11/20/10 12:32 PM, Rainer Deyke wrote: On 11/20/2010 05:12, spir wrote: On Fri, 19 Nov 2010 22:04:51 -0700 Rainer Deyke rain...@eldwood.com wrote: You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++? I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly? Hello Rainer, The original proposal by Bruno would simplify some project I have in mind (namely, a higher-level universal text type already evoked). The issues you point to intuitively seem relevant to me, but I cannot really understand any. Would you be kind enough to expand a bit on each question? (Thinking of people who know nothing of C++ -- yes, they exist ;-) std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool. The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. 
Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior. The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one. D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges. Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest. It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. The above is only a fallacious presupposition. Algorithms in Phobos are abstracted on the formal range interface, and as such you won't be exposed to risks when using them with strings. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 11/20/2010 16:58, Andrei Alexandrescu wrote: On 11/20/10 12:32 PM, Rainer Deyke wrote: std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool. The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior. The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one. The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it were called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem. Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest. It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. Sorry, but no. 
It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit. The above is only fallacious presupposition. Algorithms in Phobos are abstracted on the formal range interface, and as such you won't be exposed to risks when using them with strings. I'm not concerned about algorithms, I'm concerned about code that uses arrays directly. Like my Vector!char example, which I see you still haven't addressed. -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
On 16/10/2010 20:51, Andrei Alexandrescu wrote: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays. Andrei "They are not arrays"? So why are they arrays then? :3 Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays). Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]? Let me clarify what I'm suggesting: * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts. * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods be appropriate, including opIndex. * string literals would be of type string and wstring, not char[] and wchar[]. * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents. Problem solved. You're welcome. (as John Hodgman would say) No? -- Bruno Medeiros - Software Engineer
Re: std.algorithm.remove and principle of least astonishment
On 11/19/10 12:59 PM, Bruno Medeiros wrote: On 16/10/2010 20:51, Andrei Alexandrescu wrote: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays. Andrei "They are not arrays"? So why are they arrays then? :3 Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays). Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]? Let me clarify what I'm suggesting: * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts. * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods be appropriate, including opIndex. * string literals would be of type string and wstring, not char[] and wchar[]. * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents. Problem solved. You're welcome. (as John Hodgman would say) No? I don't think that would mark an improvement. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 11/19/2010 16:40, Andrei Alexandrescu wrote: On 11/19/10 12:59 PM, Bruno Medeiros wrote: Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays). Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]? Let me clarify what I'm suggesting: * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts. * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods be appropriate, including opIndex. * string literals would be of type string and wstring, not char[] and wchar[]. * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents. Problem solved. You're welcome. (as John Hodgman would say) No? I don't think that would mark an improvement. You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++? I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? 
Do you see how easy it would be to implement it incorrectly? -- Rainer Deyke - rain...@eldwood.com
std.algorithm.remove and principle of least astonishment
Hello all, I decided to have a go at solving some easy programming puzzles with D2/Phobos to see how Phobos, especially ranges and std.algorithm, work out in simple real-world use cases (the puzzle in question is from hacker.org, by the way). The following code is a direct translation of a simple problem description to D (it is horrible from a performance point of view, but that's certainly no issue here). --- import std.algorithm; import std.conv; import std.stdio; // The original input string is longer, but irrelevant to this post. enum INPUT = "93752xxx746x27x1754xx90x93x238x44x75xx087509"; void main() { uint sum; auto tmp = INPUT.dup; size_t i; while ( i < tmp.length ) { char c = tmp[ i ]; if ( c == 'x' ) { tmp = remove( tmp, i ); i -= 2; } else { sum += to!uint( [ c ] ); ++i; } } writeln( sum ); } --- Quite contrary to what you would expect, the call to »remove« fails to compile with the following error messages: »std/algorithm.d(4287): Error: front(src) is not an lvalue« and »std/algorithm.d(4287): Error: front(tgt) is not an lvalue«. I am intentionally posting this to this NG and not to d.…D.learn, since this is a quite gross violation of the principle of least surprise in my eyes. If this isn't a bug, a better error message via a template constraint or a static assert would be something worth looking at in my opinion, since one would probably expect this to compile and not to fail within Phobos code. David
Re: std.algorithm.remove and principle of least astonishment
On Sat, 16 Oct 2010 14:29:59 -0400, klickverbot s...@klickverbot.at wrote: Hello all, I decided to have a go at solving some easy programming puzzles with D2/Phobos to see how Phobos, especially ranges and std.algorithm, work out in simple real-world use cases (the puzzle in question is from hacker.org, by the way). The following code is a direct translation of a simple problem description to D (it is horrible from a performance point of view, but that's certainly no issue here). --- import std.algorithm; import std.conv; import std.stdio; // The original input string is longer, but irrelevant to this post. enum INPUT = "93752xxx746x27x1754xx90x93x238x44x75xx087509"; void main() { uint sum; auto tmp = INPUT.dup; size_t i; while ( i < tmp.length ) { char c = tmp[ i ]; if ( c == 'x' ) { tmp = remove( tmp, i ); i -= 2; } else { sum += to!uint( [ c ] ); ++i; } } writeln( sum ); } --- Quite contrary to what you would expect, the call to »remove« fails to compile with the following error messages: »std/algorithm.d(4287): Error: front(src) is not an lvalue« and »std/algorithm.d(4287): Error: front(tgt) is not an lvalue«. My guess is that since INPUT is a string, phobos has unwisely decided to treat strings not as random access arrays of chars, but as a bidirectional range of dchar. This means that even though you can randomly access the characters (phobos can't take that away from you), it artificially imposes restrictions (such as making front an rvalue) where it wouldn't do the same to an int[] or ubyte[]. Andrei, I am increasingly seeing people struggling with the decision to make strings bidirectional ranges of dchar instead of what the compiler says they are. This needs a different solution. It's too confusing/difficult to deal with. I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. 
This means people will have to use these ranges when they want to treat them as bidir ranges of dchar, but the current situation is at least annoying, if not a complete turn-off to D. And it vastly simplifies code that uses ranges, since they now don't have to contain special cases for char[] and wchar[]. -Steve
Re: std.algorithm.remove and principle of least astonishment
In case it was not clear, this is what I want to achieve: »tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ];«
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 01:29 PM, klickverbot wrote: Hello all, I decided to have a go at solving some easy programming puzzles with D2/Phobos to see how Phobos, especially ranges and std.algorithm, work out in simple real-world use cases (the puzzle in question is from hacker.org, by the way). The following code is a direct translation of a simple problem description to D (it is horrible from a performance point of view, but that's certainly no issue here). --- import std.algorithm; import std.conv; import std.stdio; // The original input string is longer, but irrelevant to this post. enum INPUT = "93752xxx746x27x1754xx90x93x238x44x75xx087509"; void main() { uint sum; auto tmp = INPUT.dup; size_t i; while ( i < tmp.length ) { char c = tmp[ i ]; if ( c == 'x' ) { tmp = remove( tmp, i ); i -= 2; } else { sum += to!uint( [ c ] ); ++i; } } writeln( sum ); } --- Quite contrary to what you would expect, the call to »remove« fails to compile with the following error messages: »std/algorithm.d(4287): Error: front(src) is not an lvalue« and »std/algorithm.d(4287): Error: front(tgt) is not an lvalue«. I am intentionally posting this to this NG and not to d.…D.learn, since this is a quite gross violation of the principle of least surprise in my eyes. If this isn't a bug, a better error message via a template constraint or a static assert would be something worth looking at in my opinion, since one would probably expect this to compile and not to fail within Phobos code. David Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to the first character in a string is not allowed because the assignment might mess up the next character. It's a good test bed. Simply replacing this: auto tmp = INPUT.dup; with this: auto tmp = cast(ubyte[]) INPUT.dup; makes the program work and print 322 (you also must include std.conv). 
How do you all believe we could improve this example? 1. remove() could be specialized for char[] and wchar[] because it can be made to work with some effort and is a worthwhile algorithm for strings. 2. to!(ubyte[]) should work for char[] by making a copy and casting it to ubyte[]. So this should have worked: auto tmp = to!(ubyte[])(INPUT); to! is better than cast because it always does the right thing and never undermines type safety. Whadday'all think? Andrei
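[Editor's note: the cast-to-ubyte[] fix above works only because the puzzle input is pure ASCII, so every UTF-8 code unit is a complete character. That precondition, and a byte-level digit sum over the same data, can be sketched as follows. The sketch is in Python purely to illustrate the data-level point, not D/Phobos API; it is a simplified straight-line sum, not the backtracking loop from the original post, so its output differs from the 322 Andrei reports.]

```python
# Byte-level processing, as the cast(ubyte[]) version does in D.
# Safe only because every byte is < 0x80: one code unit == one character.
INPUT = b"93752xxx746x27x1754xx90x93x238x44x75xx087509"

# The precondition the ubyte[] cast silently relies on: pure ASCII data.
assert all(b < 0x80 for b in INPUT)

# Simplified sum of all digit characters, skipping the 'x' separators.
total = sum(int(chr(b)) for b in INPUT if b != ord("x"))
print(total)  # 152 for this (truncated) input
```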
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: My guess is that since INPUT is a string, phobos has unwisely decided to treat strings not as random access arrays of chars, but as a bidirectional range of dchar. s/un// :o) Andrei
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: Andrei, I am increasingly seeing people struggling with the decision to make strings bidirectional ranges of dchar instead of what the compiler says they are. This needs a different solution. It's too confusing/difficult to deal with. I'm not seeing that. I'm seeing strings working automagically with most of std.algorithm without ever destroying a wide string. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 10/16/10 9:47 PM, Andrei Alexandrescu wrote: Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to the first character in a string is not allowed because the assignment might mess up the next character. I see that there is a problem due to the difference between code units and code points, but why does the following work then? tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ]; This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't.
Re: std.algorithm.remove and principle of least astonishment
Andrei Alexandrescu wrote: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. Why did it suck? -- Tomek
Re: std.algorithm.remove and principle of least astonishment
On Sat, 16 Oct 2010 15:51:23 -0400, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays. The compiler thinks they are. And they look like arrays (T[] looks like an array to me no matter what T is). And I *want* an array of characters in most cases. If you want a special type for strings, make them a special type. D should not have this schizophrenic view of strings. Plus it strikes me as extremely unclean and bloated for every algorithm that might have a range of char's passed into it to treat it specially (ignoring what the compiler says). -Steve
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 13:51, Andrei Alexandrescu wrote: char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays. Then rename them to something else. Problem solved. -- Rainer Deyke - rain...@eldwood.com
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 09:56 PM, klickverbot wrote: On 10/16/10 9:47 PM, Andrei Alexandrescu wrote: Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to the first character in a string is not allowed because the assignment might mess up the next character. I see that there is a problem due to the difference between code units and code points, but why does the following work then? tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ]; This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't. Try it with ä or ░ instead of x.
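[Editor's note: Andrei's "try it with ä" can be made concrete: deleting a single code unit at the index of a multibyte character leaves invalid UTF-8, which is exactly why the slicing expression "works" for 'x' but not in general. A sketch in Python, used here only because the point is about UTF-8 bytes rather than D; the input string is hypothetical.]

```python
# 'ä' encodes to two UTF-8 code units (0xC3 0xA4). Dropping just one of
# them -- which is what tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ] does on a
# char[] -- splits the character and corrupts the string.
s = "17ä54"
units = bytearray(s.encode("utf-8"))
i = units.index(0xC3)                # index of the first code unit of 'ä'
removed = units[:i] + units[i + 1:]  # remove a single code unit
try:
    removed.decode("utf-8")
    print("still valid UTF-8")
except UnicodeDecodeError:
    print("corrupted: a lone continuation byte remains")
```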
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 02:58 PM, Tomek Sowiński wrote: Andrei Alexandrescu wrote: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. Why did it suck? Because 99% of the time you'd want to pass byDchar, but it was easy to forget. Then the algorithm would compile and run without byDchar, just with useless semantics. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 02:56 PM, klickverbot wrote: On 10/16/10 9:47 PM, Andrei Alexandrescu wrote: Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to the first character in a string is not allowed because the assignment might mess up the next character. I see that there is a problem due to the difference between code units and code points, but why does the following work then? tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ]; This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't. Strings are dual types. They have [] and .length but not with the semantics required by ranges. So formally they don't support isRandomAccessRange and hasLength. Andrei
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 02:56 PM, klickverbot wrote: On 10/16/10 9:47 PM, Andrei Alexandrescu wrote: Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to the first character in a string is not allowed because the assignment might mess up the next character. I see that there is a problem due to the difference between code units and code points, but why does the following work then? tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ]; This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't. To drive my point home: if you wanted to replace not 'x', but instead a multibyte character, your algorithm wouldn't work. It essentially assumes the string has one byte per character, and the needed cast to byte[] reflects that. If anything, I'd call this a success. Andrei
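[Editor's note: by contrast, the decode-then-operate view Phobos imposes (treating strings as ranges of dchar) survives such data, because whole characters are moved, never partial ones. A Python sketch of the same removal done at the code-point level; the data is hypothetical, with 'ä' standing in for any multibyte character.]

```python
# Operating on code points rather than code units: remove an 'x' from a
# string containing a multibyte character, then re-encode. Nothing breaks.
raw = "93ä7x2".encode("utf-8")      # the underlying char[]: UTF-8 code units
points = list(raw.decode("utf-8"))  # decode to code points, like dchar
points.remove("x")                  # remove by code point, not byte index
result = "".join(points).encode("utf-8")
print(result.decode("utf-8"))       # 93ä72 -- still valid UTF-8
```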
Re: std.algorithm.remove and principle of least astonishment
On Sat, 16 Oct 2010 17:28:17 -0400, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 10/16/2010 02:58 PM, Tomek Sowiński wrote: Andrei Alexandrescu napisał: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. Why it sucked? Because 99% of the times you'd want to pass byDchar, but it was easy to forget. Then the algorithm would compile and run without byDchar, just with useless semantics. So call it string, and make the compiler use it as the default type for string literals. -Steve
Re: std.algorithm.remove and principle of least astonishment
On 10/16/2010 03:14 PM, Steven Schveighoffer wrote: On Sat, 16 Oct 2010 15:51:23 -0400, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 10/16/2010 01:39 PM, Steven Schveighoffer wrote: I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. I did so. It was called byDchar and it would accept a string type. It sucked. char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays. The compiler thinks they are. And they look like arrays (T[] looks like an array to me no matter what T is). And I *want* an array of characters in most cases. If you want a special type for strings, make them a special type. D should not have this schizophrenic view of strings. Plus it strikes me as extremely unclean and bloated for every algorithm that might have a range of char's passed into it to treat it specially (ignoring what the compiler says). It would do wrong or useless things otherwise. I'd probably do some things differently if I started over, but given the circumstances I think std.algorithm does the best it could ever do with strings. Andrei