Re: Proposal for fixing dchar ranges

2014-03-19 Thread Dmitry Olshansky

19-Mar-2014 18:42, Marco Leise writes:

On Tue, 18 Mar 2014 23:18:16 +0400,
Dmitry Olshansky dmitry.o...@gmail.com wrote:


Related:
- What normalization do D strings use. Both Linux and
  MacOS X use UTF-8, but the binary representation of non-ASCII
  file names is different.


There is no single normalization to fix on.
D programs may be written for Linux only, for Mac-only or for both.


Normalizations C and D are the non-lossy ones and, as far as I
understood, equivalent. So I agree.



Right, the KC & KD ones are really all about fuzzy matching and searching.


IMO we should just provide ways to normalize strings.
(std.uni has 'normalize' for starters).


I wondered if anyone will actually read up on normalization
prior to touching Unicode strings. I didn't, Andrei didn't and
so on...
So I expect strA == strB to be common enough, just like floatA
== floatB until the news spread.


If that's of any comfort, other languages are even worse here. In C++ you 
are hopeless without ICU.



Since == is supposed to
compare for equivalence, could we hide all those details in
an opaque string type and offer correct comparison functions?


Well, turns out the Unicode standard ties equivalence to normalization 
forms. In other words unless both your strings are normalized the same 
way there is really no point in trying to compare them.


As for an opaque type - we could have, say, String!NFC and String!NFD or 
some-such. It would then make sure the normalization is the right one.
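A minimal sketch of that opaque-type idea (the struct and its name are made up for illustration; std.uni.normalize is assumed for the construction step):

```d
import std.uni : normalize, NFC;

// Hypothetical sketch: the constructor pins the normalization form,
// so equality always compares like with like.
struct NfcString
{
    private immutable(char)[] data;

    this(const(char)[] s)
    {
        // Normalize once, up front; every comparison afterwards is cheap.
        data = normalize!NFC(s).idup;
    }

    string toString() const { return data; }
}

void main()
{
    // Same text in two different input forms compares equal after construction.
    assert(NfcString("\u00E9") == NfcString("e\u0301"));
}
```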



- How do we handle sorting strings?


Unicode collation algorithm and provide ways to tweak the default one.


I wish I didn't look at the UCA. Jeez...
But yeah, that's the way to go.


Needless to say I had a nice jaw-dropping moment when I realized what an 
elephant I had missed with our std.uni (somewhere in the middle of the 
work).



Big frameworks like Java added a Collate class with predefined
constants for several languages. That's too much work for us.
But the API doesn't need to preclude adding those.


Indeed some kind of Collator is in order. On the use side of things it's 
simply a functor that compares strings. The fact that it's full of 
tables and the like is well hidden. The only thing above that is caching 
preprocessed strings, which may be useful for databases and string indexes.
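A sketch of what that could look like from the use side. Everything here is hypothetical: the `sortKey` step is where a real implementation would consult the collation element tables, and is faked with the identity mapping (i.e. plain code point order):

```d
import std.algorithm : cmp, sort;

// Hypothetical Collator: to callers it is just a functor over strings.
struct Collator
{
    // A real implementation would map s to its UCA sort key here;
    // the identity mapping stands in for the DUCET lookup.
    string sortKey(string s) const { return s; }

    bool opCall(string a, string b) const
    {
        return cmp(sortKey(a), sortKey(b)) < 0;
    }
}

void main()
{
    auto words = ["pear", "apple", "orange"];
    Collator coll;
    sort!((a, b) => coll(a, b))(words);
    assert(words == ["apple", "orange", "pear"]);
}
```

Caching the precomputed sort keys, as mentioned, would slot in naturally between `sortKey` and `opCall`.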



The topic matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we read the latest Unicode specs.


Well, I did. You seem motivated, would you like to join the group?


Yes, I'd like to see a Unicode 6.x approved stamp on D.
I didn't know that you already wrote all the simple algorithms
for 2.064. Those would have been my candidates to work on, too.
Is there anything that can be implemented in a day or two? :)



Cool, consider yourself enlisted :)
I reckon word and line breaking algorithms are a piece of cake compared to 
UCA. Given the power toys of CodepointSet and toTrie it shouldn't be 
that hard to come up with a prototype. Then we just move precomputed 
versions of the related tries to std/internal/ and that's it, ready for 
public consumption.



D (or any library for that matter) won't ever have all possible
tinkering that Unicode standard permits. So I expect D to be done with
Unicode one day simply by reaching a point of having all universally
applicable stuff (and stated defaults) plus having a toolbox to craft
your own versions of algorithms. This is the goal of new std.uni.


Sorting strings is a very basic feature, but as I learned now
also highly complex.  I expected some kind of tables for
download that would suffice, but the rules are pretty detailed.
E.g. in German phonebook order, ä/ö/ü has the same order as
ae/oe/ue.


This is tailoring, an awful thing that makes cultural differences what 
they are in Unicode ;)


What we need first and foremost is a DUCET-based version (the Default 
Unicode Collation Element Table).


--
Dmitry Olshansky


Re: Proposal for fixing dchar ranges

2014-03-19 Thread Marco Leise
On Thu, 20 Mar 2014 01:55:08 +0400,
Dmitry Olshansky dmitry.o...@gmail.com wrote:

 Well, turns out the Unicode standard ties equivalence to normalization 
 forms. In other words unless both your strings are normalized the same 
 way there is really no point in trying to compare them.
 
 As for opaque type - we could have say String!NFC and String!NFD or 
 some-such. It would then make sure the normalization is the right one.

And I thought of going the slow route, where normalized and
unnormalized strings can coexist and be compared. No NFD or
NFC, just UTF-8 strings.

Pros:
+ Learning about normalization isn't needed to use strings
  correctly. And few people do that.
+ Strings don't need to be normalized. Every modification to
  data is bad, e.g. when said string is fed back to the
  source. Think about a file name on a file system where a
  different normalization is a different file.

Cons:
- Comparisons for already normalized strings are unnecessarily
  slow. Maybe the normalization form (NFC, NFD, mixed) could be
  stored alongside the string.
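That "store the form alongside the string" idea could be sketched like this (all names hypothetical; std.uni.normalize is assumed for the slow path):

```d
import std.uni : normalize, NFC;

// Hypothetical tag: what we know about a string's normalization.
enum Form { unknown, nfc, nfd }

struct TaggedString
{
    string data;
    Form form = Form.unknown;

    bool sameText(TaggedString rhs) const
    {
        // Fast path: both sides carry the same known form, so the
        // binary comparison is already a correct equivalence test.
        if (form != Form.unknown && form == rhs.form)
            return data == rhs.data;
        // Slow path: normalize both sides before comparing.
        return normalize!NFC(data) == normalize!NFC(rhs.data);
    }
}

void main()
{
    auto a = TaggedString("\u00E9", Form.nfc);      // precomposed é
    auto b = TaggedString("e\u0301", Form.nfd);     // e + combining acute
    assert(a.sameText(b));   // equivalent text despite different bytes
    assert(a.data != b.data);
}
```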

 Cool, consider yourself enlisted :)
 I reckon word and line breaking algorithms are a piece of cake compared to 
 UCA. Given the power toys of CodepointSet and toTrie it shouldn't be 
 that hard to come up with a prototype. Then we just move precomputed 
 versions of the related tries to std/internal/ and that's it, ready for 
 public consumption.

Would a typical use case be to find the previous/next boundary
given a code unit index? E.g. the cursor sits on a word and
you want to jump to the start or end of it. Just iterating the
words and lines might not be too useful.

  D (or any library for that matter) won't ever have all possible
  tinkering that Unicode standard permits. So I expect D to be done with
  Unicode one day simply by reaching a point of having all universally
  applicable stuff (and stated defaults) plus having a toolbox to craft
  your own versions of algorithms. This is the goal of new std.uni.
 
  Sorting strings is a very basic feature, but as I learned now
  also highly complex.  I expected some kind of tables for
  download that would suffice, but the rules are pretty detailed.
  E.g. in German phonebook order, ä/ö/ü has the same order as
  ae/oe/ue.
 
 This is tailoring, an awful thing that makes cultural differences what 
 they are in Unicode ;)
 
 What we need first and foremost is a DUCET-based version (the Default 
 Unicode Collation Element Table).

Of course.

-- 
Marco



Re: Proposal for fixing dchar ranges

2014-03-18 Thread Marco Leise
The Unicode standard is too complex for general purpose
algorithms to do useful things on D strings. We don't see that
however, since our writing systems are sufficiently well
supported.

As an inspiration I'll leave a string here that contains
combined characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin as well as full width characters that span 2
characters in e.g. Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the unfonts package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feeling for
what the Unicode standard provides, then form use cases out of it.

For example when we talk about the length of a string we are
actually talking about 4 different things:

  - number of code units
  - number of code points
  - number of user perceived characters
  - display width using a monospace font

The same distinction applies for slicing, depending on use case.
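The first three of those counts can be demonstrated with today's Phobos (std.range.walkLength and std.uni.byGrapheme):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "cafe\u0301"; // "café" written with a combining acute accent

    assert(s.length == 6);                // code units (UTF-8 bytes)
    assert(s.walkLength == 5);            // code points (auto-decoded dchars)
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters

    // Display width (4 columns here) additionally needs East Asian Width
    // data, which Phobos does not expose directly.
}
```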

Related:
  - What normalization do D strings use. Both Linux and
MacOS X use UTF-8, but the binary representation of non-ASCII
file names is different.
  - How do we handle sorting strings?

The topic matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we read the latest Unicode specs. They are a
moving target. Don't expect to ever be done with full Unicode
support in D.

-- 
Marco



Re: Proposal for fixing dchar ranges

2014-03-18 Thread Dmitry Olshansky

18-Mar-2014 10:21, Marco Leise writes:

The Unicode standard is too complex for general purpose
algorithms to do useful things on D strings. We don't see that
however, since our writing systems are sufficiently well
supported.



As an inspiration I'll leave a string here that contains
combined characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin as well as full width characters that span 2
characters in e.g. Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the unfonts package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feeling for
what the Unicode standard provides, then form use cases out of it.


There is ICU and very few other things, like support in OSX frameworks 
(NSString). Industry in general kinda sucks on this point but 
desperately wants to improve.




For example when we talk about the length of a string we are
actually talking about 4 different things:

   - number of code units
   - number of code points
   - number of user perceived characters
   - display width using a monospace font

The same distinction applies for slicing, depending on use case.

Related:
   - What normalization do D strings use. Both Linux and
 MacOS X use UTF-8, but the binary representation of non-ASCII
 file names is different.


There is no single normalization to fix on.
D programs may be written for Linux only, for Mac-only or for both.

IMO we should just provide ways to normalize strings.
(std.uni has 'normalize' for starters).
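For example, two strings carrying the same text in different forms only compare equal after normalizing (a sketch using std.uni.normalize, which defaults to NFC):

```d
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";     // é as a single code point
    string decomposed  = "e\u0301";    // e + combining acute accent

    assert(precomposed != decomposed); // binary comparison sees two strings
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
}
```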


   - How do we handle sorting strings?


Unicode collation algorithm and provide ways to tweak the default one.


The topic matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we read the latest Unicode specs.


Well, I did. You seem motivated, would you like to join the group?


They are a
moving target. Don't expect to ever be done with full Unicode
support in D.


The 6.x standard line seems pretty stable to me. There is a point up to 
which providing support is worth it. After that, ROI drops steadily as 
the amount of work to specialize for each specific culture rises. At some 
point we can only talk about opening up ways to specialize.


D (or any library for that matter) won't ever have all possible 
tinkering that Unicode standard permits. So I expect D to be done with 
Unicode one day simply by reaching a point of having all universally 
applicable stuff (and stated defaults) plus having a toolbox to craft 
your own versions of algorithms. This is the goal of new std.uni.



--
Dmitry Olshansky


Re: Proposal for fixing dchar ranges

2014-03-12 Thread monarch_dodra

On Tuesday, 11 March 2014 at 18:26:36 UTC, Johannes Pfau wrote:
I think the problem here is that ranges / algorithms have to work 
on the same data type as slicing/indexing. If .front returns code 
units, then indexing/slicing should be done with code units. If it 
returns code points then slicing has to happen on code points for 
consistency, or it should be disallowed. (Slicing on code units is 
important - no doubt. But it is error prone and should be explicit 
in some way: string.sliceCP(a, b) or string.representation[a..b])


I think it is important to remember that in terms of 
ranges/algorithms, strings are not indexable, nor sliceable 
ranges.


The only way to generically slice a string in generic code, is 
to explicitly test that a range is actually a string, and then 
knowingly call an internal primitive that is NOT a part of the 
range traits.


So slicing/indexing *is* already disallowed, in terms of 
range/algorithms anyways.
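This is visible in the range traits themselves; with a current Phobos these all hold:

```d
import std.range.primitives : hasLength, hasSlicing,
    isBidirectionalRange, isRandomAccessRange;

// Narrow strings support [] and .length syntactically, yet the range
// traits deliberately deny it, so generic algorithms see them only as
// bidirectional ranges of dchar.
static assert( isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasSlicing!string);
static assert(!hasLength!string);

void main() {}
```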


Re: Proposal for fixing dchar ranges

2014-03-12 Thread monarch_dodra
On Tuesday, 11 March 2014 at 18:02:26 UTC, Steven Schveighoffer 
wrote:
No, where we are today is that in some cases, the language 
treats a char[] as an array of char, in other cases, it treats 
a char[] as a bi-directional dchar range.


-Steve


I want to mention something I've had trouble with recently, that 
I haven't seen mentioned yet, but is related:


The ambiguity of the lone char.

By that I mean: When a function accepts 'char' as an argument, it 
is (IMO) very hard to know what it is actually accepting:

1. An ASCII char in the 0 .. 128 range?
2. A code unit?
3. (heaven forbid) a code point in the 0 .. 256 range packed into 
a char?


Currently (fortunately? unfortunately?) the choice taken in our 
algorithms is 3, which is actually the 'safest' solution.


So if you write:
find("cassé", cast(char)'é');

It *will* correctly find the 'é', but it *won't* search for it in 
individual code units.




Another more pernicious case is that of output ranges. put is 
supposed to know how to convert any string/char width into any 
other string/char width.


Again, things become funky if you tell put to place a string, 
into a sink that accepts a char.


Is the sink actually telling you to feed it code units? Or ASCII?


Re: Proposal for fixing dchar ranges

2014-03-12 Thread Ary Borenszweig

On 3/10/14, 3:30 PM, Walter Bright wrote:

On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:

An idea to fix the whole problems I see with char[] being treated 
specially by phobos: introduce an actual string type, with char[] as 
backing, that is a dchar range, that actually dictates the rules we 
want. Then, make the compiler use this type for literals.


Proposals to make a string class for D have come up many times. I have a
kneejerk dislike for it. It's a really strong feature for D to have
strings be an array type, and I'll go to great lengths to keep it that way.


You can also look at Erlang, where strings are just lists of numbers. 
Eventually they realized it was a huge mistake and introduced another 
type, a binary string, which is much more efficient and works as expected.


I think making strings behave like arrays is a design mistake.


Re: Proposal for fixing dchar ranges

2014-03-12 Thread Andrei Alexandrescu

On 3/12/14, 6:24 AM, Ary Borenszweig wrote:

On 3/10/14, 3:30 PM, Walter Bright wrote:

On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:

An idea to fix the whole problems I see with char[] being treated 
specially by phobos: introduce an actual string type, with char[] as 
backing, that is a dchar range, that actually dictates the rules we 
want. Then, make the compiler use this type for literals.


Proposals to make a string class for D have come up many times. I have a
kneejerk dislike for it. It's a really strong feature for D to have
strings be an array type, and I'll go to great lengths to keep it that
way.


You can also look at Erlang, where strings are just lists of numbers.
Eventually they realized it was a huge mistake and introduced another
type, a binary string, which is much more efficient and works as expected.

I think making strings behave like arrays is a design mistake.


Erlang's mistake was different from what you believe was D's mistake. 
There is no comparison to be drawn.


Andrei



Re: Proposal for fixing dchar ranges

2014-03-12 Thread Ary Borenszweig

On 3/12/14, 1:53 PM, Andrei Alexandrescu wrote:

On 3/12/14, 6:24 AM, Ary Borenszweig wrote:

On 3/10/14, 3:30 PM, Walter Bright wrote:

On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:

An idea to fix the whole problems I see with char[] being treated 
specially by phobos: introduce an actual string type, with char[] as 
backing, that is a dchar range, that actually dictates the rules we 
want. Then, make the compiler use this type for literals.


Proposals to make a string class for D have come up many times. I have a
kneejerk dislike for it. It's a really strong feature for D to have
strings be an array type, and I'll go to great lengths to keep it that
way.


You can also look at Erlang, where strings are just lists of numbers.
Eventually they realized it was a huge mistake and introduced another
type, a binary string, which is much more efficient and works as
expected.

I think making strings behave like arrays is a design mistake.


Erlang's mistake was different from what you believe was D's mistake.
There is no comparison to be drawn.

Andrei


What's D's mistake then?



Re: Proposal for fixing dchar ranges

2014-03-12 Thread Andrei Alexandrescu

On 3/12/14, 10:29 AM, Ary Borenszweig wrote:

What's D's mistake then?


I don't think we made a mistake with D's strings. They could have been 
done better if we made all iteration requests explicit.


Andrei



Re: Proposal for fixing dchar ranges

2014-03-11 Thread John Colvin
On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin 
john.loughran.col...@gmail.com wrote:


On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
I proposed this inside the long "major performance problem 
with std.array.front" thread; I've also proposed it before, a 
long time ago.


But seems to be getting no attention buried in that thread, 
not even negative attention :)


An idea to fix the whole problems I see with char[] being 
treated specially by phobos: introduce an actual string type, 
with char[] as backing, that is a dchar range, that actually 
dictates the rules we want. Then, make the compiler use this 
type for literals.


e.g.:

struct string {
  immutable(char)[] representation;
  this(immutable(char)[] data) { representation = data; }
  ... // dchar range primitives
}

Then, a char[] array is simply an array of char[].

points:

1. No more issues with foreach(c; "cassé"), it iterates via 
dchar
2. No more issues with "cassé"[4], it is a static compiler 
error.

3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the 
compiler.
6. Any other special rules we come up with can be dictated by 
the library, and not ignored by the compiler.


Note, std.algorithm.copy(string1, mutablestring) will still 
decode/encode, but it's more explicit. It's EXPLICITLY a 
dchar range. Using std.algorithm.copy(string1.representation, 
mutablestring.representation) will avoid the issues.


I imagine only code that is currently UTF ignorant will 
break, and that code is easily 'fixed' by adding the 
'representation' qualifier.


-Steve


just to check I understand this fully:

in this new scheme, what would this do?

auto s = "cassé".representation;
foreach(i, c; s) write(i, ':', c, ' ');
writeln(s);

Currently - without the .representation - I get

0:c 1:a 2:s 3:s 4:e 5:̠6:`
cassé

or, to spell it out a bit more:
0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
cassé


The plan is for foreach on s to iterate by char, and foreach on 
"cassé" to iterate by dchar.


What this means is the accent will be iterated separately from 
the e, and likely gets put onto the colon after 5. However, the 
half code units that have no meaning in isolation (0xCC and 0x81) 
would not be iterated.


In your above code, using .representation would be equivalent 
to what it is now without .representation (i.e. over char), and 
without .representation would be equivalent to this on today's 
compiler (except faster):


foreach(i, dchar c; s)

-Steve
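A self-contained illustration of today's rule, using the combining-accent spelling from the output above (the element type in the foreach picks the decoding level):

```d
void main()
{
    string s = "casse\u0301"; // "cassé" with a combining acute accent

    // By code unit: 7 iterations (0:c 1:a 2:s 3:s 4:e 5:0xCC 6:0x81).
    int units;
    foreach (char c; s) ++units;
    assert(units == 7);

    // By code point: 6 iterations; the accent is decoded as its own step.
    int points;
    foreach (dchar c; s) ++points;
    assert(points == 6);

    // Note: with foreach (i, dchar c; s), i is still a code-unit index.
}
```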


Awesome, let's do this :)


Re: Proposal for fixing dchar ranges

2014-03-11 Thread John Colvin

On Monday, 10 March 2014 at 21:52:04 UTC, Walter Bright wrote:

On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
What in my proposal makes you think you don't have unfettered 
access? The underlying immutable(char)[] representation is 
accessible. In fact, you would have more access, since phobos 
functions would then work with a char[] like it's a proper array.


You divide the D world into two camps - those that use 'struct 
string', and those that use immutable(char)[] strings.


I would go so far as to say this is a good thing, as long as the 
'struct string' is transparently the default.


If you want good unicode support that works in a sane and 
relatively transparent manner, just write string, use literals as 
normal etc.
If you want a normal array of characters, that behaves sanely and 
consistently as an array, use char[] with relevant qualifiers.


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Kagamin
Automatic decoding by default itself is a WTF factor. The problem 
with it is it encourages unicode ignorance and pretends to work 
correctly, so it's harder for the developer to discover the 
incorrectness.


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Dicebot
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
e...@gnuk.net wrote:


It seems like this would be an even bigger breaking change 
than Walter's proposal though (right or wrong, slicing strings 
is very common).


You're the second person to mention that, I was not planning on 
disabling string slicing. Just random access to individual 
chars, and probably .length.


-Steve


It is unacceptable to have slicing which is not O(1) for basic 
types.


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Steven Schveighoffer

On Tue, 11 Mar 2014 09:11:22 -0400, Dicebot pub...@dicebot.lv wrote:


On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:

On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson e...@gnuk.net wrote:

It seems like this would be an even bigger breaking change than  
Walter's proposal though (right or wrong, slicing strings is very  
common).


You're the second person to mention that, I was not planning on  
disabling string slicing. Just random access to individual chars, and  
probably .length.




It is unacceptable to have slicing which is not O(1) for basic types.


It would be O(1), work just like it does today.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Dicebot
On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer 
wrote:

It would be O(1), work just like it does today.

-Steve


Today it works by allowing arbitrary index and not checking if 
resulting slice is valid UTF-8. Anything that implies decoding is 
O(n). What exactly do you have in mind for this?


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Steven Schveighoffer

On Tue, 11 Mar 2014 10:06:47 -0400, Dicebot pub...@dicebot.lv wrote:


On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer wrote:

It would be O(1), work just like it does today.

-Steve


Today it works by allowing arbitrary index and not checking if resulting  
slice is valid UTF-8. Anything that implies decoding is O(n). What  
exactly do you have in mind for this?


Well, a valid improvement would be to throw an exception when the slice  
didn't start/end on a valid code point. This is easily checkable in O(1)  
time, but I wouldn't recommend it to begin with, it may have huge  
performance issues. Typically, one does not arbitrarily slice up via some  
specific value, they use a function to get an index, and they don't care  
what the index value actually is.


Alternatively, it could be done via assert, to disable it during release  
mode. This might be acceptable.


But I would never expect any kind of indexing or slicing to use number of 
code points, which clearly requires O(n) decoding to determine its 
position. That would be disastrous.


-Steve
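That O(1) start/end check mentioned above is just a bit test on the UTF-8 byte: continuation bytes are exactly those of the form 10xxxxxx. A sketch:

```d
// True when i is a valid slice boundary in s: either one past the end,
// or an index that does not point at a UTF-8 continuation byte.
bool onBoundary(string s, size_t i)
{
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

void main()
{
    string s = "cass\u00E9"; // precomposed 'é': two code units at the end
    assert( onBoundary(s, 4)); // start of 'é'
    assert(!onBoundary(s, 5)); // middle of 'é': invalid slice point
    assert( onBoundary(s, 6)); // one past the end is fine for slicing
}
```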


Re: Proposal for fixing dchar ranges

2014-03-11 Thread H. S. Teoh
On Tue, Mar 11, 2014 at 12:49:40AM +, Meta wrote:
 On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:
 Walter Bright:
 
 In the last couple days, we also wound up annoying a valuable
 client with some minor breakage with std.json, reiterating how
 important it is to not break code if we can at all avoid it..
 
 There are still some breaking changes that I'd like to perform in
 D, like deprecating certain usages of the comma operator, etc.
[...]
 That damnable comma operator is one of the worst things that was
 inherited from C. IMO, it has no use outside the header of a for
 loop, and even there it's suspect.

I've always been of the opinion that the comma operator in a for loop
should be treated as special syntax, rather than a language-wide
operator. The comma operator must die. :P


T

-- 
Public parking: euphemism for paid parking. -- Flora


Re: Proposal for fixing dchar ranges

2014-03-11 Thread bearophile

Meta:

That damnable comma operator is one of the worst things that 
was inherited from C. IMO, it has no use outside the header of 
a for loop, and even there it's suspect.


The place for the discussion about the comma operator:
https://d.puremagic.com/issues/show_bug.cgi?id=2659

Bye,
bearophile


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Chris Williams
On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer 
wrote:
But I would never expect any kind of indexing or slicing to use 
number of code points, which clearly requires O(n) decoding 
to determine its position. That would be disastrous.


If the indexes put into the slice aren't by code-point, but 
people need to use proper helper functions to convert a 
code-point into an index, then we're basically back to where we 
are today.


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Steven Schveighoffer
On Tue, 11 Mar 2014 13:18:46 -0400, Chris Williams  
yoreanon-chr...@yahoo.co.jp wrote:



On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer wrote:
But I would never expect any kind of indexing or slicing to use number 
of code points, which clearly requires O(n) decoding to determine its 
position. That would be disastrous.


If the indexes put into the slice aren't by code-point, but people need  
to use proper helper functions to convert a code-point into an index,  
then we're basically back to where we are today.


No, where we are today is that in some cases, the language treats a char[]  
as an array of char, in other cases, it treats a char[] as a  
bi-directional dchar range.


What I'm proposing is we have a type that defines "This is what a string 
looks like", and it is consistent across all uses of the string, instead 
of the schizophrenic view we have now. I would also point out that quite a 
bit of deception and nonsense is needed to maintain that view, including 
things like assert(!hasLength!(char[]) && __traits(compiles, { char[] x; 
int y = x.length;})). The documentation for hasLength says "Tests if a 
given range has the length attribute", which is clearly a lie.


However, I want to define right here that an index is not a number of code 
points. One does not frequently get code point counts, one gets indexes. 
It has always been that way, and I'm not planning to change that. That you 
can't use an index to determine the number of code points that came before 
it is not an issue that arises frequently.


e.g., I want to find the first instance of "xyz" in a string; do I care 
how many code points it has to go through, or just at what point I have to 
slice the string to get it?


A previous poster brings up this incorrect code:

auto index = countUntil(str, "xyz");
auto newstr = str[index..$];

But it can easily be done this way also:

auto index = indexOf(str, "xyz");
auto codepts = walkLength(str[0..index]);
auto newstr = str[index..$];

Given how D works, I think it would be very costly and near impossible to  
somehow make the incorrect slice operation statically rejected. One simply  
has to be trained what a code point is, and what a code unit is. HOWEVER,  
for the most part, nobody needs to care. Strings work fine without having  
to randomly access specific code points or slice based on them. Using  
indexes works just fine.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Johannes Pfau
On Tue, 11 Mar 2014 14:02:26 -0400,
Steven Schveighoffer schvei...@yahoo.com wrote:

 A previous poster brings up this incorrect code:
 
 auto index = countUntil(str, "xyz");
 auto newstr = str[index..$];
 
 But it can easily be done this way also:
 
 auto index = indexOf(str, "xyz");
 auto codepts = walkLength(str[0..index]);
 auto newstr = str[index..$];
 
 Given how D works, I think it would be very costly and near
 impossible to somehow make the incorrect slice operation statically
 rejected. One simply has to be trained what a code point is, and what
 a code unit is. HOWEVER, for the most part, nobody needs to care.
 Strings work fine without having to randomly access specific code
 points or slice based on them. Using indexes works just fine.
 
 -Steve

Yes, you can work around the count problem, but then it is not
consistent across all uses of the string. What if the above code was
a generic template written for arrays? Then it silently fails for
strings and you have to special case it.

I think the problem here is that ranges / algorithms have to work on
the same data type as slicing/indexing. If .front returns code units,
then indexing/slicing should be done with code units. If it returns
code points then slicing has to happen on code points for consistency,
or it should be disallowed. (Slicing on code units is important - no
doubt. But it is error prone and should be explicit in some way:
string.sliceCP(a, b) or string.representation[a..b])


Re: Proposal for fixing dchar ranges

2014-03-11 Thread Steven Schveighoffer
On Tue, 11 Mar 2014 14:25:10 -0400, Johannes Pfau nos...@example.com  
wrote:



Yes, you can work around the count problem, but then it is not
consistent across all uses of the string. What if the above code was
a generic template written for arrays? Then it silently fails for
strings and you have to special case it.

I think the problem here is that ranges / algorithms have to work on
the same data type as slicing/indexing. If .front returns code units,
then indexing/slicing should be done with code units. If it returns
code points then slicing has to happen on code points for consistency,
or it should be disallowed. (Slicing on code units is important - no
doubt. But it is error prone and should be explicit in some way:
string.sliceCP(a, b) or string.representation[a..b])


I look at it a different way -- indexes are increasing, just not  
consecutive. If there is a way to say indexes are not linear, then that  
would be a good trait to expose.


For instance, think of a tree-map, which has keys that may not be  
consecutive. Should you be able to slice such a container? I'd say yes.  
But tree[0..5] may not get you the first 5 elements.


-Steve


Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
I proposed this inside the long "major performance problem with 
std.array.front" thread; I've also proposed it before, a long time ago.


But seems to be getting no attention buried in that thread, not even  
negative attention :)


An idea to fix the whole problems I see with char[] being treated  
specially by phobos: introduce an actual string type, with char[] as  
backing, that is a dchar range, that actually dictates the rules we want.  
Then, make the compiler use this type for literals.


e.g.:

struct string {
   immutable(char)[] representation;
   this(immutable(char)[] data) { representation = data; }
   ... // dchar range primitives
}

Then, a char[] array is simply an array of char[].

points:

1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.
3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the compiler.
6. Any other special rules we come up with can be dictated by the library,  
and not ignored by the compiler.


Note, std.algorithm.copy(string1, mutablestring) will still decode/encode,  
but it's more explicit. It's EXPLICITLY a dchar range. Using  
std.algorithm.copy(string1.representation, mutablestring.representation)  
will avoid the issues.


I imagine only code that is currently UTF-ignorant will break, and that  
code is easily 'fixed' by adding the 'representation' qualifier.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 09:35:44 -0400, Steven Schveighoffer  
schvei...@yahoo.com wrote:



Then, a char[] array is simply an array of char[].


An array of char even.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Dicebot
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
[...]


It will break any code that slices stored char[] strings directly, 
which may or may not be breaking UTF depending on how indices are 
calculated. It also adds one more runtime dependency to the language, 
but there are so many that it probably does not matter.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread H. S. Teoh
On Mon, Mar 10, 2014 at 09:35:44AM -0400, Steven Schveighoffer wrote:
[...]
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.
 
 e.g.:
 
 struct string {
immutable(char)[] representation;
this(char[] data) { representation = data;}
... // dchar range primitives
 }
 
 Then, a char[] array is simply an array of char[].
 
 points:
 
 1. No more issues with foreach(c; cassé), it iterates via dchar
 2. No more issues with cassé[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

I like this idea. Special-casing char[] in templates was a bad idea. It
makes Phobos code needlessly complex, and the inconsistent treatment of
char[] sometimes as an array of char and sometimes not causes silly
issues like foreach defaulting to char but range iteration defaulting to
dchar. Enclosing it in a struct means we can enforce string rules
separately from the fact that it's a char array.
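
The inconsistency is easy to demonstrate with current D/Phobos behavior
(a small check of the status quo, not part of the proposal):

```d
import std.range.primitives : front, ElementType;

void main()
{
    string s = "cassé"; // 6 code units, 5 code points

    // foreach with no explicit element type iterates code units (char)...
    size_t units = 0;
    foreach (c; s) ++units;
    assert(units == 6);

    // ...while the range primitives treat the same string as a dchar
    // range, and .length still counts code units.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'c');
    assert(s.length == 6);
}
```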


 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar
 range. Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.
 
 I imagine only code that is currently UTF ignorant will break, and
 that code is easily 'fixed' by adding the 'representation'
 qualifier.
[...]

The only concern I have is the current use of char[] and const(char)[]
as mutable strings, and the current implicit conversion from string to
const(char)[]. We would need similar wrappers for char[] and
const(char)[], and string and mutablestring must be implicitly
convertible to conststring, otherwise a LOT of existing code will break
in a major way. Plus, these wrappers should also expose the same dchar
range API with .representation giving a way to get at the raw code
units.


T

-- 
It is the quality rather than the quantity that matters. -- Lucius Annaeus 
Seneca


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer

On Mon, 10 Mar 2014 10:48:26 -0400, Dicebot pub...@dicebot.lv wrote:


On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
[...]




It will break any code that slices stored char[] strings directly, which  
may or may not be breaking UTF depending on how indices are calculated.


That is already broken. What I'm looking to do is remove the cruft and  
WTF factor of the current state of affairs (an array that's not an  
array).


Originally (in that long ago proposal) I had proposed to check for and  
disallow invalid slicing during runtime. In fact, it could be added if  
desired with the type defined by the library.


It also adds one more runtime dependency to the language, but there are  
so many that it probably does not matter.


alias string = immutable(char)[];

There isn't much extra dependency one must add to revert to the original  
behavior. In fact, one nice thing about this proposal is the compiler  
changes can be done and tested before any real meddling with the string  
type is done.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 10:54:50 -0400, H. S. Teoh hst...@quickfur.ath.cx  
wrote:




The only concern I have is the current use of char[] and const(char)[]
as mutable strings, and the current implicit conversion from string to
const(char)[]. We would need similar wrappers for char[] and
const(char)[], and string and mutablestring must be implicitly
convertible to conststring, otherwise a LOT of existing code will break
in a major way.


I agree that is a limitation of the proposal. It's more of a language-wide  
problem that one cannot make a struct that can be tail-const-ified.


One idea to begin with is to weakly bind to immutable(char)[] using alias  
this. That way, existing code devolves to current behavior. Then you pick  
off the primitives you want by defining them in the struct itself.
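
A minimal sketch of that weak binding (the codeUnits member is a
hypothetical example of a "picked off" primitive, not a proposed name):

```d
// Sketch: alias this lets existing code that expects immutable(char)[]
// keep compiling, while members defined on the struct itself take
// precedence over the devolved array.
struct String
{
    immutable(char)[] representation;
    alias representation this;

    // A primitive defined in the struct itself (name invented here).
    @property size_t codeUnits() { return representation.length; }
}

void main()
{
    auto s = String("hello");
    immutable(char)[] raw = s; // devolves to the backing array via alias this
    assert(raw == "hello");
    assert(s.codeUnits == 5);
    assert(s[0] == 'h');       // array operations still reachable, for now
}
```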



Plus, these wrappers should also expose the same dchar
range API with .representation giving a way to get at the raw code
units.


It already does that, representation is a public member.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer

On Mon, 10 Mar 2014 11:11:50 -0400, Boyd gaboonvi...@gmx.net wrote:

I personally love this idea, though I think it probably introduces too  
many silent breaking changes for it to be universally acceptable by D  
users.


What silent breaking changes?

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Dicebot
On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer 
wrote:
That is already broken. What I'm looking to do is remove the 
cruft and WTF factor of the current state of affairs (an 
array that's not an array).


Originally (in that long ago proposal) I had proposed to check 
for and disallow invalid slicing during runtime. In fact, it 
could be added if desired with the type defined by the library.


Broken as in you are not supposed to do it in user code? Yes. 
Broken as in it does the wrong thing? No. If your index is 
properly calculated, it is no different from casting to ubyte[] 
and then slicing. I am pretty sure even Phobos does it here and 
there.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Boyd
I personally love this idea, though I think it probably 
introduces too many silent breaking changes for it to be 
universally acceptable by D users.


Perhaps naming it 'String', and deprecating 'string' would make 
it more acceptable?



On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
[...]


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer

On Mon, 10 Mar 2014 11:20:49 -0400, Boyd gaboonvi...@gmx.net wrote:


UTF-8-aware slicing for strings would be an issue.


I'm not proposing to add this.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Boyd

UTF-8-aware slicing for strings would be an issue.

--
On Monday, 10 March 2014 at 15:13:26 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 11:11:50 -0400, Boyd gaboonvi...@gmx.net 
wrote:


I personally love this idea, though I think it probably 
introduces too many silent breaking changes for it to be 
universally acceptable by D users.


What silent breaking changes?

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Boyd
Ok, then you just destroyed my sole hypothetical objection to 
this.

---
On Monday, 10 March 2014 at 15:22:41 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 11:20:49 -0400, Boyd gaboonvi...@gmx.net 
wrote:



UTF-8-aware slicing for strings would be an issue.


I'm not proposing to add this.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer

On Mon, 10 Mar 2014 11:11:23 -0400, Dicebot pub...@dicebot.lv wrote:


On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer wrote:
That is already broken. What I'm looking to do is remove the cruft and  
WTF factor of the current state of affairs (an array that's not an  
array).


Originally (in that long ago proposal) I had proposed to check for and  
disallow invalid slicing during runtime. In fact, it could be added if  
desired with the type defined by the library.


Broken as in you are not supposed to do it in user code? Yes. Broken  
as in it does the wrong thing? No. If your index is properly calculated,  
it is no different from casting to ubyte[] and then slicing. I am pretty  
sure even Phobos does it here and there.


If the idea of ensuring the user cannot slice through a code point were  
added, you would still be able to slice via str.representation[a..b], or  
even str.ptr[a..b] if you were so sure of the length you didn't want it  
to be checked ;)


The idea behind the proposal is to make it fully backwards compatible with  
existing code, except for randomly accessing a char, and probably .length.  
Slicing would still work as it does now, but could be adjusted later.
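
One way such a runtime check could be sketched (my assumption here: a
slice boundary is valid iff it does not land on a UTF-8 continuation
byte, i.e. a byte of the form 10xxxxxx; the function names are invented
for illustration):

```d
import std.exception : enforce;

// A code-unit index is a boundary unless it points at a UTF-8
// continuation byte (high bits 10).
bool onBoundary(immutable(char)[] s, size_t i)
{
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

// Slice that refuses to split a code point at runtime.
immutable(char)[] checkedSlice(immutable(char)[] s, size_t a, size_t b)
{
    enforce(a <= b && b <= s.length && onBoundary(s, a) && onBoundary(s, b),
            "slice would split a code point");
    return s[a .. b];
}

void main()
{
    auto s = "cassé"; // 'é' occupies code units 4 and 5
    assert(checkedSlice(s, 0, 4) == "cass");
    bool threw = false;
    try { checkedSlice(s, 0, 5); } catch (Exception) { threw = true; }
    assert(threw); // index 5 falls inside the two-byte 'é'
}
```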


It will break existing code. To fix those breaks, you would need to use  
the char[] array directly via the representation member, or rethink your  
code to be UTF-correct. Basically, instead of pretending an array isn't an  
array, create a new mostly-compatible type that behaves as we want it to  
behave in all circumstances, not just when you use Phobos algorithms.


The breaks may be trivial to work around, and might seem annoying.  
However, they may expose actual UTF bugs, making your code more correct  
when you fix them.


The biggest problem right now is the lack of the ability to implicitly  
cast to tail-const with a custom struct. We can keep an alias-this link  
for those cases until we can fix that in the compiler.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Brad Anderson
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
[...]


Generally I think it's a good idea. Going a bit further, you could 
also enable Short String Optimization, but you'd have to 
encapsulate the backing array.


It seems like this would be an even bigger breaking change than 
Walter's proposal though (right or wrong, slicing strings is very 
common).


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer

On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson e...@gnuk.net wrote:

It seems like this would be an even bigger breaking change than Walter's  
proposal though (right or wrong, slicing strings is very common).


You're the second person to mention that, I was not planning on disabling  
string slicing. Just random access to individual chars, and probably  
.length.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread John Colvin
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
e...@gnuk.net wrote:


It seems like this would be an even bigger breaking change 
than Walter's proposal though (right or wrong, slicing strings 
is very common).


You're the second person to mention that, I was not planning on 
disabling string slicing. Just random access to individual 
chars, and probably .length.


-Steve


How is slicing any better than indexing?


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin  
john.loughran.col...@gmail.com wrote:



On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:

On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson e...@gnuk.net wrote:

It seems like this would be an even bigger breaking change than  
Walter's proposal though (right or wrong, slicing strings is very  
common).


You're the second person to mention that, I was not planning on  
disabling string slicing. Just random access to individual chars, and  
probably .length.


-Steve


How is slicing any better than indexing?


Because one can slice out a multi-code-unit code point, one cannot access  
it via index. Strings would be horribly crippled without slicing. Without  
indexing, they are fine.


A possibility is to allow index, but actually decode the code point at  
that index (error on invalid index). That might actually be the correct  
mechanism.
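
A sketch of that decoding index, using std.utf.decode (the function
name is invented for illustration; decode throws a UTFException when
the index lands mid-sequence, which would supply the "error on invalid
index" behavior):

```d
import std.utf : decode;

// Indexing that decodes the code point starting at the given code-unit
// index and returns it as a dchar.
dchar decodingIndex(string s, size_t i)
{
    return decode(s, i); // i is advanced locally; the dchar is returned
}

void main()
{
    auto s = "cassé";
    assert(decodingIndex(s, 0) == 'c');
    assert(decodingIndex(s, 4) == 'é'); // two code units, one dchar
}
```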


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 13:59:53 -0400, John Colvin  
john.loughran.col...@gmail.com wrote:



On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
[...]


I know warnings are disliked, but couldn't we make the slicing and  
indexing work as currently but issue a warning*? It's not ideal but it  
does mean we get backwards compatibility.


As I mentioned elsewhere (but repeating here for viewers), I was not  
planning on disabling slicing.


Indexing is rarely a feature one needs or should use, especially with  
encoded strings.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright  
newshou...@digitalmars.com wrote:



On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
An idea to fix the whole problems I see with char[] being treated  
specially by
phobos: introduce an actual string type, with char[] as backing, that  
is a dchar
range, that actually dictates the rules we want. Then, make the  
compiler use

this type for literals.


Proposals to make a string class for D have come up many times. I have a  
kneejerk dislike for it. It's a really strong feature for D to have  
strings be an array type, and I'll go to great lengths to keep it that  
way.


I wholly agree, they should be an array type. But what they are now is  
worse.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Walter Bright

On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:

An idea to fix the whole problems I see with char[] being treated specially by
phobos: introduce an actual string type, with char[] as backing, that is a dchar
range, that actually dictates the rules we want. Then, make the compiler use
this type for literals.


Proposals to make a string class for D have come up many times. I have a 
kneejerk dislike for it. It's a really strong feature for D to have strings be 
an array type, and I'll go to great lengths to keep it that way.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright  
newshou...@digitalmars.com wrote:



On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
An idea to fix the whole problems I see with char[] being treated  
specially by
phobos: introduce an actual string type, with char[] as backing, that  
is a dchar
range, that actually dictates the rules we want. Then, make the  
compiler use

this type for literals.


Proposals to make a string class for D have come up many times. I have a  
kneejerk dislike for it. It's a really strong feature for D to have  
strings be an array type, and I'll go to great lengths to keep it that  
way.


BTW, this escaped my view the first time reading your post, but I am NOT  
proposing a string *class*. In fact, I'm not proposing we change anything  
technical about strings, the code generated should be basically identical.  
What I'm proposing is to encapsulate what you can and can't do with a  
string in the type itself, instead of making the standard library flip  
over backwards to treat it as something else when the compiler treats it  
as a simple array of char.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Johannes Pfau
Am Mon, 10 Mar 2014 11:30:07 -0700
schrieb Walter Bright newshou...@digitalmars.com:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
  An idea to fix the whole problems I see with char[] being treated
  specially by phobos: introduce an actual string type, with char[]
  as backing, that is a dchar range, that actually dictates the rules
  we want. Then, make the compiler use this type for literals.
 
 Proposals to make a string class for D have come up many times. I
 have a kneejerk dislike for it. It's a really strong feature for D to
 have strings be an array type, and I'll go to great lengths to keep
 it that way.

Question: which type T doesn't have slicing, has an ElementType of
dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and
still satisfies isArray?

It's a string. Would you call that 'an array type'?

writeln(isArray!string);   //true
writeln(hasSlicing!string); //false
writeln(ElementType!string.stringof); //dchar
writeln(ElementEncodingType!string.stringof); //char

I wouldn't call that an array. Part of the problem is that you want
strings to be arrays (fixed-size elements, direct indexing) and Andrei
doesn't want them to be arrays (operating on code points = not fixed
size = not arrays).


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Johannes Pfau
Am Mon, 10 Mar 2014 13:55:00 -0400
schrieb Steven Schveighoffer schvei...@yahoo.com:

 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson e...@gnuk.net
 wrote:
 
  It seems like this would be an even bigger breaking change than
  Walter's proposal though (right or wrong, slicing strings is very
  common).
 
 You're the second person to mention that, I was not planning on
 disabling string slicing. Just random access to individual chars, and
 probably .length.
 
 -Steve

Unfortunately slicing by code units is probably the most important
safety issue with the current implementation: As was mentioned in the
other thread:

size_t index = str.countUntil('a');
auto slice = str[0..index];

This can be a safety and security issue. (I realize that this would
break lots of code so I'm not sure if we should/can fix it. But I think
this was the most important problem mentioned in the other thread.)
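
The mismatch is concrete with current behavior: countUntil walks the
string as a dchar range and returns a code-point count, while slicing
counts code units (a demonstration of the status quo, not new API):

```d
import std.algorithm.searching : countUntil;

void main()
{
    auto str = "éa";
    // countUntil iterates code points, so 'é' counts as one element...
    auto index = str.countUntil('a');
    assert(index == 1);
    // ...but slicing counts code units: 'é' is two of them, so this
    // slice rips the first byte off a two-byte sequence -- an invalid
    // UTF-8 fragment.
    auto slice = str[0 .. index];
    assert(slice.length == 1 && slice != "é");
}
```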


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Walter Bright

On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:

BTW, this escaped my view the first time reading your post, but I am NOT
proposing a string *class*.


Right, but here I used the term 'class' more generically, as in a 
user-defined type, i.e. struct or class. I should have been more clear.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Artem Tarasov

On Monday, 10 March 2014 at 18:50:28 UTC, Johannes Pfau wrote:


Question: which type T doesn't have slicing, has an ElementType 
of
dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == 
char and

still satisfies isArray?


In addition, hasLength!T == false, which totally freaked me out 
when I first discovered that.




Re: Proposal for fixing dchar ranges

2014-03-10 Thread John Colvin
On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin 
john.loughran.col...@gmail.com wrote:


On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
[...]


How is slicing any better than indexing?


Because one can slice out a multi-code-unit code point, one 
cannot access it via index. Strings would be horribly crippled 
without slicing. Without indexing, they are fine.


A possibility is to allow index, but actually decode the code 
point at that index (error on invalid index). That might 
actually be the correct mechanism.


-Steve


In order to be correct, both require exactly the same knowledge: 
the beginning of a code point, followed by the end of a code 
point. In the indexing case they just happen to belong to the 
same code point and happen to be one code unit from each other. 
I don't see how one is any more or less error-prone or 
fundamentally wrong than the other.


I do understand that slicing is more important however.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread H. S. Teoh
On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
 Am Mon, 10 Mar 2014 11:30:07 -0700
 schrieb Walter Bright newshou...@digitalmars.com:
 
  On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
   An idea to fix the whole problems I see with char[] being treated
   specially by phobos: introduce an actual string type, with char[]
   as backing, that is a dchar range, that actually dictates the
   rules we want. Then, make the compiler use this type for literals.
  
  Proposals to make a string class for D have come up many times. I
  have a kneejerk dislike for it. It's a really strong feature for D
  to have strings be an array type, and I'll go to great lengths to
  keep it that way.

I'm on the fence about this one. The nice thing about strings being an
array type is that it is a familiar concept to C coders, and it allows
array slicing for extracting substrings, etc., which fits nicely with
the C view of strings as character arrays. As a C coder myself, I like
it this way too. But the bad thing about strings being an array type is
that it's a holdover from C, and it allows slicing for extracting
substrings -- malformed substrings, by permitting a slice through the
middle of a multibyte (or multi-word) character.

Basically, the nice aspects of strings being arrays only apply when
you're dealing with ASCII (or mostly-ASCII) strings. These very same
nice aspects turn into problems when dealing with anything non-ASCII.
The only way the user can get it right using only array operations, is
if they understand the whole of Unicode in their head and are willing to
reinvent Unicode algorithms every time they slice a string or do some
operation on it. Since D purportedly supports Unicode by default, it
shouldn't be this way. D should *actually* support Unicode all the way
-- use proper Unicode algorithms for substring extraction, collation,
line-breaking, normalization, etc.. Being a systems language, of course,
means that D should allow you to get under the hood and do things
directly with the raw string representation -- but this shouldn't be the
*default* modus operandi.  The default should be a properly-encapsulated
string type with Unicode algorithms to operate on it (with the option of
reaching into the raw representation where necessary).


 Question: which type T doesn't have slicing, has an ElementType of
 dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and
 still satisfies isArray?
 
 It's a string. Would you call that 'an array type'?
 
   writeln(isArray!string);   //true
   writeln(hasSlicing!string); //false
   writeln(ElementType!string.stringof); //dchar
   writeln(ElementEncodingType!string.stringof); //char
 
 I wouldn't call that an array. Part of the problem is that you want
 string to be arrays (fixed size elements, direct indexing) and Andrei
 doesn't want them to be arrays (operating on code points = not fixed
 size = not arrays).

Exactly. What we have right now is a frankensteinian hybrid that's
neither fully an array, nor fully a Unicode string type. If we call the
current messy AA implementation split between compiler, aaA.d, and
object.di a design problem, then I'd call the current state of D strings
a design problem too. This underlying inconsistency is ultimately what
leads to the poor performance of strings in std.algorithm.

It's precisely because of this that I've given up on using std.algorithm
for strings altogether -- std.regex is far better: more flexible, more
expressive, and more performant, and specifically designed to operate on
strings. Nowadays I only use std.algorithm for non-string ranges
(because then the behaviour is actually consistent!!).


T

-- 
MS Windows: 64-bit overhaul of 32-bit extensions and a graphical shell for a 
16-bit patch to an 8-bit operating system originally coded for a 4-bit 
microprocessor, written by a 2-bit company that can't stand 1-bit of 
competition.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin  
john.loughran.col...@gmail.com wrote:



On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer wrote:


Because one can slice out a multi-code-unit code point, one cannot  
access it via index. Strings would be horribly crippled without  
slicing. Without indexing, they are fine.


A possibility is to allow index, but actually decode the code point at  
that index (error on invalid index). That might actually be the correct  
mechanism.




In order to be correct, both require exactly the same knowledge: The  
beginning of a code point, followed by the end of a code point. In the  
indexing case they just happen to be the same code-point and happen to  
be one code unit from each other. I don't see how one is any more or  
less error-prone or fundamentally wrong than the other.


Using indexing, you simply cannot get the single code unit that represents  
a multi-code-unit code point. It doesn't fit in a char. It's guaranteed to  
fail, whereas slicing will give you access to all the data in the  
string.


Now, with indexing actually decoding a code point, one can alias a[i] to  
a[i..$].front(), which means decode the first code point you come to at  
index i. This means indexing is slow(er), and returns a dchar. I think as  
a first step, that might be too much to add silently. I'd rather break it  
first, then add it back later.
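The aliasing idea above can be sketched with std.utf.decode; the name DecodingString is invented here for illustration and is not part of the proposal:

```d
import std.utf : decode;

// Hypothetical sketch: a[i] behaves like a[i .. $].front, i.e. indexing
// decodes the code point that starts at code-unit index i.
struct DecodingString
{
    immutable(char)[] representation;

    dchar opIndex(size_t i) const
    {
        size_t idx = i;
        // decode() reads the code point starting at idx and throws a
        // UTFException if idx points into the middle of one
        return decode(representation, idx);
    }
}

void main()
{
    auto s = DecodingString("cassé");   // 'é' is two code units here
    assert(s[0] == 'c');
    assert(s[4] == 'é');   // decodes the whole code point starting at 4
    // s[5] would throw: index 5 is mid-code-point
}
```

Note indexing is still by code unit, so it stays O(1); only the decode at the target position costs extra.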


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 14:54:22 -0400, Johannes Pfau nos...@example.com  
wrote:



Am Mon, 10 Mar 2014 13:55:00 -0400
schrieb Steven Schveighoffer schvei...@yahoo.com:


On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson e...@gnuk.net
wrote:

 It seems like this would be an even bigger breaking change than
 Walter's proposal though (right or wrong, slicing strings is very
 common).

You're the second person to mention that, I was not planning on
disabling string slicing. Just random access to individual chars, and
probably .length.

-Steve


Unfortunately slicing by code units is probably the most important
safety issue with the current implementation: As was mentioned in the
other thread:

size_t index = str.countUntil('a');
auto slice = str[0..index];

This can be a safety and security issue. (I realize that this would
break lots of code so I'm not sure if we should/can fix it. But I think
this was the most important problem mentioned in the other thread.)
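A small sketch of the hazard, assuming today's Phobos behaviour where countUntil iterates a string by code point while slicing counts code units:

```d
import std.algorithm.searching : countUntil;

void main()
{
    string str = "héllo a world";
    // countUntil sees str as a dchar range, so index counts code
    // points: h é l l o ' ' => 6
    auto index = str.countUntil('a');
    assert(index == 6);
    // ...but slicing counts code units, and 'é' occupies two of them,
    // so the slice silently loses the trailing space (with worse luck
    // it could split a multi-byte character outright):
    auto slice = str[0 .. index];
    assert(slice == "héllo");   // not "héllo " as intended
}
```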


Slicing can never be a code point based operation. It would be too slow  
(read linear complexity). What needs to be broken is the expectation that  
an index is the number of code points or characters in a string. Think of  
an index as a position that has no real meaning except they are ordered in  
the stream. Like a set of ordered numbers, not necessarily consecutive.  
The index 4 may not exist, while 5 does.
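The "index 4 may not exist, while 5 does" point can be made concrete with std.utf.stride, which reports how many code units the code point at a given index occupies (validIndices is a helper invented here for illustration):

```d
import std.utf : stride;

// Collect the set of "indices that exist": the code-unit offsets at
// which a code point starts.
size_t[] validIndices(string s)
{
    size_t[] starts;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        starts ~= i;
    return starts;
}

void main()
{
    // 'é' spans code units 3 and 4, so index 4 does not exist while
    // 5 does -- exactly the ordered-but-not-consecutive set described
    assert(validIndices("abcé!") == [0, 1, 2, 3, 5]);
}
```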


At this point, my proposal does not fix that particular problem, but I  
don't think there's any way to fix that problem except to train the user  
who wrote it not to do that. However, it does not leave us in a worse  
position.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 16:06:25 -0400, Steven Schveighoffer  
schvei...@yahoo.com wrote:



Think of an index as a position that has no real meaning except they are  
ordered in the stream. Like a set of ordered numbers, not necessarily  
consecutive. The index 4 may not exist, while 5 does.


I said that wrong, of course it has meaning. What I mean is that if you  
have two positions, the ordering will indicate where the  
characters/graphemes/code points occur in the stream, but their value will  
not be indicative of how far they are apart in terms of  
characters/graphemes/code points.


In other words, if I have two characters, at position p1 and p2, then

p1 > p2 => p1 comes later in the string than p2
p1 == p2 => p1 and p2 refer to the same character
p1 - p2 => not defined to any particular value.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Brad Anderson
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
e...@gnuk.net wrote:


It seems like this would be an even bigger breaking change 
than Walter's proposal though (right or wrong, slicing strings 
is very common).


You're the second person to mention that, I was not planning on 
disabling string slicing. Just random access to individual 
chars, and probably .length.


-Steve


Sorry, I misunderstood. That sounds reasonable.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Walter Bright

On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:

What strings are already is a user-defined type,


No, they are not.


but with horrible enforcement.


With no enforcement, and that is by design.

Keep in mind that D is a systems programming language, and that means unfettered 
access to strings.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread John Colvin
On Monday, 10 March 2014 at 20:00:07 UTC, Steven Schveighoffer 
wrote:
On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin 
john.loughran.col...@gmail.com wrote:


On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer 
wrote:


Because one can slice out a multi-code-unit code point, one 
cannot access it via index. Strings would be horribly 
crippled without slicing. Without indexing, they are fine.


A possibility is to allow index, but actually decode the code 
point at that index (error on invalid index). That might 
actually be the correct mechanism.




In order to be correct, both require exactly the same 
knowledge: The beginning of a code point, followed by the end 
of a code point. In the indexing case they just happen to be 
the same code-point and happen to be one code unit from each 
other. I don't see how one is any more or less error-prone or 
fundamentally wrong than the other.


Using indexing, you simply cannot get the single code unit that 
represents a multi-code-unit code point. It doesn't fit in a 
char. It's guaranteed to fail, whereas slicing will give you 
access to all the data in the string.




I think I understand your motivation now. Indexing never provides 
anything that slicing doesn't do more generally.


Now, with indexing actually decoding a code point, one can 
alias a[i] to a[i..$].front(), which means decode the first 
code point you come to at index i. This means indexing is 
slow(er), and returns a dchar. I think as a first step, that 
might be too much to add silently. I'd rather break it first, 
then add it back later.


-Steve


Of course that i has to be at the beginning of a code-point. 
Doesn't seem like that useful a feature and potentially very 
confusing for people who naively expect normal indexing.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread John Colvin

On Monday, 10 March 2014 at 19:48:34 UTC, H. S. Teoh wrote:

On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:

Am Mon, 10 Mar 2014 11:30:07 -0700
schrieb Walter Bright newshou...@digitalmars.com:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
  An idea to fix the whole problem I see with char[] being treated
  specially by phobos: introduce an actual string type, with char[]
  as backing, that is a dchar range, that actually dictates the
  rules we want. Then, make the compiler use this type for literals.
 
 Proposals to make a string class for D have come up many times. I
 have a kneejerk dislike for it. It's a really strong feature for D
 to have strings be an array type, and I'll go to great lengths to
 keep it that way.


I'm on the fence about this one. The nice thing about strings being an
array type is that it is a familiar concept to C coders, and it allows
array slicing for extracting substrings, etc., which fits nicely with
the C view of strings as character arrays. As a C coder myself, I like
it this way too. But the bad thing about strings being an array type is
that it's a holdover from C, and it allows slicing for extracting
substrings -- malformed substrings, by permitting slicing a multibyte
(multiword) character.

Basically, the nice aspects of strings being arrays only apply when
you're dealing with ASCII (or mostly-ASCII) strings. These very same
nice aspects turn into problems when dealing with anything non-ASCII.
The only way the user can get it right using only array operations is
if they understand the whole of Unicode in their head and are willing
to reinvent Unicode algorithms every time they slice a string or do
some operation on it. Since D purportedly supports Unicode by default,
it shouldn't be this way. D should *actually* support Unicode all the
way -- use proper Unicode algorithms for substring extraction,
collation, line-breaking, normalization, etc. Being a systems language,
of course, means that D should allow you to get under the hood and do
things directly with the raw string representation -- but this
shouldn't be the *default* modus operandi. The default should be a
properly-encapsulated string type with Unicode algorithms to operate on
it (with the option of reaching into the raw representation where
necessary).




You started off on the fence, but you seem pretty convinced by 
the end!


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 16:52:27 -0400, Walter Bright  
newshou...@digitalmars.com wrote:



On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:

What strings are already is a user-defined type,


No, they are not.


The functionality added via phobos can hardly be considered extraneous.  
One would not use strings without the library.



but with horrible enforcement.


With no enforcement, and that is by design.


The enforcement is opt-in. That is, you have to use phobos' templates in  
order to use them properly:


auto getIt(R)(R r, size_t idx)
{
   if(idx < r.length)
  return r[idx];
}

The above compiles fine for strings. However, it does not compile fine if  
you do:


auto getIt(R)(R r, size_t idx) if(hasLength!R && isRandomAccessRange!R)

Any other range will fail to compile for the more strict version and the  
simple implementation without template constraints. In other words, the  
compiler doesn't believe the same thing phobos does. Shooting one's foot  
is quite easy.
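The mismatch is easy to demonstrate with today's Phobos traits; the static asserts below reflect current behaviour:

```d
import std.range.primitives : hasLength, hasSlicing, isRandomAccessRange;
import std.traits : isArray;

void main()
{
    // The compiler happily treats string as an array...
    string s = "hello";
    auto c = s[1];      // indexing compiles; c is a single code unit (char)
    assert(c == 'e');

    // ...while Phobos' traits insist it is not random-access:
    static assert(isArray!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string);   // despite s.length compiling
    static assert(!hasSlicing!string);  // despite s[0 .. 2] compiling
}
```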


Keep in mind that D is a systems programming language, and that means  
unfettered access to strings.


Access is fine, with clear intentions. And we do not have unfettered  
access. I cannot sort a mutable string of ASCII characters without first  
converting it to ubyte[].
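For the record, the ubyte[] workaround looks like this on today's compiler; it is only safe because the contents are pure ASCII:

```d
import std.algorithm.sorting : sort;

void main()
{
    char[] s = "dcba".dup;
    // sort(s);  // does not compile: narrow strings fail the
    //           // random-access constraints on sort
    sort(cast(ubyte[]) s);   // reinterpret as raw bytes, then sort
    assert(s == "abcd");     // the cast aliases s, so s is now sorted
}
```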


What in my proposal makes you think you don't have unfettered access? The  
underlying immutable(char)[] representation is accessible. In fact, you  
would have more access, since phobos functions would then work with a  
char[] like it's a proper array.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 16:54:34 -0400, John Colvin   
john.loughran.col...@gmail.com wrote:
Of course that i has to be at the beginning of a code-point. Doesn't  
seem like that useful a feature and potentially very confusing for  
people who naively expect normal indexing.


What it would do is remove the confusion of is(typeof(r.front) !=   
typeof(r[0]))


Naivety is to be expected when you have made your C-derived language's  
default string type an encoded UTF8 array called char[]. It doesn't  
magically make D programs UTF aware.


I would suggest that a lofty goal is for the string type to be completely  
safe, and efficient, and only allow raw access via the .representation  
member. But I don't think, given the current code base,
that we can achieve that in one proposal. It has to be gradual. This is a  
first step.


-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread John Colvin
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
I proposed this inside the long major performance problem with 
std.array.front, I've also proposed it before, a long time ago.


But seems to be getting no attention buried in that thread, not 
even negative attention :)


An idea to fix the whole problem I see with char[] being 
treated specially by phobos: introduce an actual string type, 
with char[] as backing, that is a dchar range, that actually 
dictates the rules we want. Then, make the compiler use this 
type for literals.


e.g.:

struct string {
    immutable(char)[] representation;
    this(immutable(char)[] data) { representation = data; }
    ... // dchar range primitives
}
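A hypothetical fleshing-out of the elided "dchar range primitives" above, named String here since `string` is the type being defined (a sketch under that assumption, not actual proposal code):

```d
import std.utf : decode, stride;

struct String
{
    immutable(char)[] representation;

    @property bool empty() const { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i);   // decode the leading code point
    }

    void popFront()
    {
        // advance by the width of the leading code point
        representation = representation[stride(representation, 0) .. $];
    }
}

void main()
{
    import std.algorithm.comparison : equal;
    // iterates by dchar, as the proposal requires
    assert(String("cassé").equal("cassé"d));
}
```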

Then, a char[] array is simply an array of char.

points:

1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.

3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the 
compiler.
6. Any other special rules we come up with can be dictated by 
the library, and not ignored by the compiler.


Note, std.algorithm.copy(string1, mutablestring) will still 
decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
range. Using std.algorithm.copy(string1.representation, 
mutablestring.representation) will avoid the issues.


I imagine only code that is currently UTF ignorant will break, 
and that code is easily 'fixed' by adding the 'representation' 
qualifier.


-Steve


just to check I understand this fully:

in this new scheme, what would this do?

auto s = "cassé".representation;
foreach(i, c; s) write(i, ':', c, ' ');
writeln(s);

Currently - without the .representation - I get

0:c 1:a 2:s 3:s 4:e 5:̠6:`
cassé

or, to spell it out a bit more:
0:c 1:a 2:s 3:s 4:e 5:0xCC 6:0x81
cassé


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Chris Williams
On Monday, 10 March 2014 at 18:13:14 UTC, Steven Schveighoffer 
wrote:
Indexing is rarely a feature one needs or should use, 
especially with encoded strings.


If I was writing something like a chat or terminal window, I 
would want to be able to jump to chunks of text based on some 
sort of buffer length, then search for actual character 
boundaries. Similarly, if I was indexing text, I don't care what 
the underlying data is just whether any particular set of n-bytes 
have been seen together among some document. For the latter case, 
I don't need to be able to interpret the data as text while 
indexing, but once I perform an actual search and want to jump 
the user to that line in the file, being able to take a byte 
offset that I had stored in the index and convert that to a 
textual position would be good.
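One plausible way to do what Chris describes here -- turning a stored byte offset back into a valid textual position -- is to snap the offset to the nearest preceding code-point boundary (snapToBoundary is a name invented for this sketch):

```d
// UTF-8 continuation bytes all match the bit pattern 0b10xxxxxx,
// so walking backwards past them finds the start of the code point.
size_t snapToBoundary(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "a€b";                   // '€' occupies code units 1..3
    assert(snapToBoundary(s, 2) == 1);  // mid-'€' snaps back to its start
    assert(snapToBoundary(s, 4) == 4);  // 'b' is already a boundary
}
```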


I do think that D should have something like

alias String8 = UTF!char;
alias String16 = UTF!wchar;
alias String32 = UTF!dchar;

And that those sit on top of an underlying immutable(xchar)[] 
buffer, providing variants of things like foreach and length 
based on code-point or grapheme boundaries. But I don't think 
there's any value in reinterpretting string. Not being a struct 
or an object, it doesn't have the extensibility to be useful for 
all the variations of access that working with Unicode and the 
underlying bytes warrants.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Walter Bright

On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:

What in my proposal makes you think you don't have unfettered access? The
underlying immutable(char)[] representation is accessible. In fact, you would
have more access, since phobos functions would then work with a char[] like it's
a proper array.


You divide the D world into two camps - those that use 'struct string', and 
those that use immutable(char)[] strings.


 I imagine only code that is currently UTF ignorant will break,

This also makes it a non-starter.


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin  
john.loughran.col...@gmail.com wrote:



On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
I proposed this inside the long major performance problem with  
std.array.front, I've also proposed it before, a long time ago.


But seems to be getting no attention buried in that thread, not even  
negative attention :)


An idea to fix the whole problem I see with char[] being treated  
specially by phobos: introduce an actual string type, with char[] as  
backing, that is a dchar range, that actually dictates the rules we  
want. Then, make the compiler use this type for literals.


e.g.:

struct string {
    immutable(char)[] representation;
    this(immutable(char)[] data) { representation = data; }
    ... // dchar range primitives
}

Then, a char[] array is simply an array of char.

points:

1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.
3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the compiler.
6. Any other special rules we come up with can be dictated by the  
library, and not ignored by the compiler.


Note, std.algorithm.copy(string1, mutablestring) will still  
decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.  
Using std.algorithm.copy(string1.representation,  
mutablestring.representation) will avoid the issues.


I imagine only code that is currently UTF ignorant will break, and that  
code is easily 'fixed' by adding the 'representation' qualifier.


-Steve


just to check I understand this fully:

in this new scheme, what would this do?

auto s = "cassé".representation;
foreach(i, c; s) write(i, ':', c, ' ');
writeln(s);

Currently - without the .representation - I get

0:c 1:a 2:s 3:s 4:e 5:̠6:`
cassé

or, to spell it out a bit more:
0:c 1:a 2:s 3:s 4:e 5:0xCC 6:0x81
cassé


The plan is for foreach on s to iterate by char, and foreach on "cassé"  
to iterate by dchar.


What this means is the accent will be iterated separately from the e, and  
likely gets put onto the colon after 5. However, the half code units that  
have no meaning anywhere (0xCC and 0x81) would not be iterated.


In your above code, using .representation would be equivalent to what it  
is now without .representation (i.e. over char), and without  
.representation would be equivalent to this on today's compiler (except  
faster):


foreach(i, dchar c; s)
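That equivalence can be checked on today's compiler; the foreach below decodes as it goes, yielding code-unit start indices rather than code-point counts:

```d
void main()
{
    string s = "casse\u0301";   // 'é' as e + combining acute (units 5 and 6)
    size_t[] indices;
    dchar[]  chars;
    foreach (i, dchar c; s)     // today's spelling of the proposed default
    {
        indices ~= i;
        chars   ~= c;
    }
    // the accent decodes as a single dchar starting at unit 5; the raw
    // half code units 0xCC and 0x81 never show up
    assert(indices == [0, 1, 2, 3, 4, 5]);
    assert(chars == "casse\u0301"d);
}
```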

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Steven Schveighoffer
On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright  
newshou...@digitalmars.com wrote:



On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
What in my proposal makes you think you don't have unfettered access?  
The
underlying immutable(char)[] representation is accessible. In fact, you  
would
have more access, since phobos functions would then work with a char[]  
like it's

a proper array.


You divide the D world into two camps - those that use 'struct string',  
and those that use immutable(char)[] strings.


Really? It's not that divisive. However, the situation is certainly better  
than today's world of those who use 'string' and those who use  
'string.representation'. Those who use string.representation would  
actually get much more use out of it. Those who use string would see no  
changes.



  I imagine only code that is currently UTF ignorant will break,

This also makes it a non-starter.


You're the guardian of changes to the language, clearly holding a veto on  
any proposals. But this doesn't come across as very open-minded,  
especially from someone who wanted to do something that would change the  
fundamental treatment of strings last week.


IMO, breaking incorrect code is a good idea, and worth at least exploring.

-Steve


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Walter Bright

On 3/10/2014 3:26 PM, Steven Schveighoffer wrote:

On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright newshou...@digitalmars.com
wrote:

This also makes it a non-starter.


You're the guardian of changes to the language, clearly holding a veto on any
proposals. But this doesn't come across as very open-minded, especially from
someone who wanted to do something that would change the fundamental treatment
of strings last week.


I deserve that criticism. On the other hand, I've pretty much given up on fixing 
std.array.front() because of that. In the last couple days, we also wound up 
annoying a valuable client with some minor breakage with std.json, reiterating 
how important it is to not break code if we can at all avoid it.




IMO, breaking incorrect code is a good idea, and worth at least exploring.


Breaking broken code, yes.



Re: Proposal for fixing dchar ranges

2014-03-10 Thread bearophile

Walter Bright:

In the last couple days, we also wound up annoying a valuable 
client with some minor breakage with std.json, reiterating how 
important it is to not break code if we can at all avoid it.


There are still some breaking changes that I'd like to perform in 
D, like deprecating certain usages of the comma operator, etc.


Bye,
bearophile


Re: Proposal for fixing dchar ranges

2014-03-10 Thread Meta

On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:

Walter Bright:

In the last couple days, we also wound up annoying a valuable 
client with some minor breakage with std.json, reiterating how 
important it is to not break code if we can at all avoid it.


There are still some breaking changes that I'd like to perform 
in D, like deprecating certain usages of the comma operator, 
etc.


Bye,
bearophile


That damnable comma operator is one of the worst things that was 
inherited from C. IMO, it has no use outside the header of a for 
loop, and even there it's suspect.