Re: Range of chars (narrow string ranges)

2015-04-30 Thread Kagamin via Digitalmars-d

On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
Time has shown, however, that UTF8 has pretty much won. wchar 
only exists for the Windows API and Java


Also NSString. It used to support UTF-16 and C encoding. AFAIK, 
the latter later evolved into UTF-8.


Re: Range of chars (narrow string ranges)

2015-04-29 Thread Chris via Digitalmars-d
On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis 
wrote:

On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
This sounds like a good starting point for a transition plan. 
One important thing, though, would be to do some benchmarking 
with and without autodecoding, to see if it really boosts 
performance in a way that would justify the transition.


Well, personally, I think that it's worth it even if the 
performance is identical (and it's a guarantee that it's going 
to be better without autodecoding - it's just a question of how 
much better - since it's going to have less work to do without 
autodecoding). Simply operating at the code point level like we 
do now is the worst of all worlds in terms of flexibility and 
correctness. As long as the Unicode is normalized, operating at 
the code unit level is the most efficient, and decoding is 
often unnecessary for correctness, and if you need to decode, 
then you really need to go up to the grapheme level in order to 
be operating on the full character, meaning that operating on 
code points really has the same problems as operating on code 
units as far as correctness goes. So, it's less performant 
without actually being correct. It just gives the illusion of 
correctness.


By treating strings as ranges of code units, you don't take a 
performance hit when you don't need to, and it forces you to 
actually consider something like byDchar or byGrapheme if you 
want to operate on full Unicode characters. It's similar to 
how operating on UTF-16 code units as if they were characters 
(as Java and C# generally do) frequently gives the incorrect 
impression that you're handling Unicode correctly, because you 
have to work harder at coming up with characters that can't fit 
in a single code unit, whereas with UTF-8, anything but ASCII 
is screwed if you treat code units as code points. Treating 
code points as if they were full characters like we're doing 
now in Phobos with ranges just makes it that much harder to 
notice that you're not handling Unicode correctly.


Also, treating strings as ranges of code units makes it so that 
they're not so special and actually are treated like every 
other type of array, which eliminates a lot of the special 
casing that we're forced to do right now, and it eliminates all 
of the confusion that folks keep running into when string 
doesn't work with many functions, because it's not a 
random-access range or doesn't have length, or because the 
resulting range isn't the same type (copy would be a prime 
example of a function that doesn't work with char[] when it 
should). By leaving in autodecoding, we're basically leaving in 
technical debt in D permanently. We'll forever have to be 
explaining it to folks and forever have to be working around it 
in order to achieve either performance or correctness.


What we have now isn't performant, correct, or flexible, and 
we'll be forever paying for that if we don't get rid of 
autodecoding.


I don't criticize Andrei in the least for coming up with it, 
since if you don't take graphemes into account (and he didn't 
know about them at the time), it seems like a great idea and 
allows us to be correct by default and performant if we put 
some effort into it, but after having seen how it's worked out, 
how much code has to be special-cased, how much confusion there 
is over it, and how it's not actually correct anyway, I think 
that it's quite clear that autodecoding was a mistake. And at 
this point, it's mainly a question of how we can get rid of it 
without being too disruptive and whether we can convince Andrei 
that it makes sense to make the change, since he seems to still 
think that autodecoding is fine in spite of the fact that it's 
neither performant nor correct.


It may be that the decision will be that it's too disruptive to 
remove autodecoding, but I think that that's really a question 
of whether we can find a way to do it that doesn't break tons 
of code rather than whether it's worth the performance or 
correctness gain.


- Jonathan M Davis


Ok, I see. Well, if we don't want to repeat C++'s mistakes, we 
should fix it before it's too late. Since I'm dealing a lot with 
strings (non-ASCII) and depend on Unicode (and correctness!), I 
would be more than happy to test any changes to Phobos with my 
programs to see if it screws up anything.


Re: Range of chars (narrow string ranges)

2015-04-29 Thread Jonathan M Davis via Digitalmars-d

On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
This sounds like a good starting point for a transition plan. 
One important thing, though, would be to do some benchmarking 
with and without autodecoding, to see if it really boosts 
performance in a way that would justify the transition.


Well, personally, I think that it's worth it even if the 
performance is identical (and it's a guarantee that it's going to 
be better without autodecoding - it's just a question of how much 
better - since it's going to have less work to do without 
autodecoding). Simply operating at the code point level like we 
do now is the worst of all worlds in terms of flexibility and 
correctness. As long as the Unicode is normalized, operating at 
the code unit level is the most efficient, and decoding is often 
unnecessary for correctness, and if you need to decode, then you 
really need to go up to the grapheme level in order to be 
operating on the full character, meaning that operating on code 
points really has the same problems as operating on code units as 
far as correctness goes. So, it's less performant without 
actually being correct. It just gives the illusion of correctness.


By treating strings as ranges of code units, you don't take a 
performance hit when you don't need to, and it forces you to 
actually consider something like byDchar or byGrapheme if you 
want to operate on full Unicode characters. It's similar to how 
operating on UTF-16 code units as if they were characters (as 
Java and C# generally do) frequently gives the incorrect 
impression that you're handling Unicode correctly, because you 
have to work harder at coming up with characters that can't fit 
in a single code unit, whereas with UTF-8, anything but ASCII is 
screwed if you treat code units as code points. Treating code 
points as if they were full characters like we're doing now in 
Phobos with ranges just makes it that much harder to notice that 
you're not handling Unicode correctly.
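
For illustration, a minimal sketch of the three levels, assuming 
the std.utf.byDchar and std.uni.byGrapheme helpers named above are 
available in the Phobos being used: "e" followed by a combining 
acute accent is one grapheme, two code points, and three UTF-8 
code units.

    import std.range.primitives : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byDchar;

    void main()
    {
        string s = "e\u0301";                  // 'e' + combining acute accent
        assert(s.length == 3);                 // UTF-8 code units
        assert(s.byDchar.walkLength == 2);     // code points
        assert(s.byGrapheme.walkLength == 1);  // graphemes (full characters)
    }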


Also, treating strings as ranges of code units makes it so that 
they're not so special and actually are treated like every other 
type of array, which eliminates a lot of the special casing that 
we're forced to do right now, and it eliminates all of the 
confusion that folks keep running into when string doesn't work 
with many functions, because it's not a random-access range or 
doesn't have length, or because the resulting range isn't the 
same type (copy would be a prime example of a function that 
doesn't work with char[] when it should). By leaving in 
autodecoding, we're basically leaving in technical debt in D 
permanently. We'll forever have to be explaining it to folks and 
forever have to be working around it in order to achieve either 
performance or correctness.
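
A short sketch of that special status as the range traits report 
it today (assuming the current std.range.primitives behavior for 
narrow strings):

    import std.range.primitives : ElementType, hasLength, isRandomAccessRange;

    // char[] is an array, but as a range it isn't treated like other arrays:
    static assert(is(ElementType!(char[]) == dchar)); // front autodecodes
    static assert(!isRandomAccessRange!(char[]));     // no random access as a range
    static assert(!hasLength!(char[]));               // .length is code units, so hidden
    // ...whereas any other array is a full random-access range with length:
    static assert(isRandomAccessRange!(int[]) && hasLength!(int[]));

    void main() {}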


What we have now isn't performant, correct, or flexible, and 
we'll be forever paying for that if we don't get rid of 
autodecoding.


I don't criticize Andrei in the least for coming up with it, 
since if you don't take graphemes into account (and he didn't 
know about them at the time), it seems like a great idea and 
allows us to be correct by default and performant if we put some 
effort into it, but after having seen how it's worked out, how much 
code has to be special-cased, how much confusion there is over 
it, and how it's not actually correct anyway, I think that it's 
quite clear that autodecoding was a mistake. And at this point, 
it's mainly a question of how we can get rid of it without being 
too disruptive and whether we can convince Andrei that it makes 
sense to make the change, since he seems to still think that 
autodecoding is fine in spite of the fact that it's neither 
performant nor correct.


It may be that the decision will be that it's too disruptive to 
remove autodecoding, but I think that that's really a question of 
whether we can find a way to do it that doesn't break tons of 
code rather than whether it's worth the performance or 
correctness gain.


- Jonathan M Davis


Re: Range of chars (narrow string ranges)

2015-04-29 Thread Chris via Digitalmars-d

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
Would it be much work to have example code or even an 
experimental module that gets rid of auto-decoding, so we 
could see what would be affected in general and how actual 
code we have would be affected by it?


The topic keeps coming up again and again, and while I'm in 
favor of anything that enhances performance, I'm afraid of 
having to refactor large chunks of my code. This fear may be 
unfounded, but I would need some examples to visualize 
the problem.


Honestly, most code won't care. If we just switched out all of 
the auto-decoding right now, pretty much anything using only 
ASCII would just work, and most anything that's trying to 
manipulate ASCII characters in a Unicode string will just work, 
whereas code that's specifically manipulating Unicode 
characters might have problems (e.g. comparing front with a 
dchar will no longer have the same result, since front would 
just be the first code unit rather than necessarily the first 
code point). Since most Phobos range-based functions which 
operate on strings are special-cased on strings already, many 
of them would continue to just work (e.g. find returns the same 
range type as what's passed to it even if it's given a string, 
so it might just work with the change, or it might need to be 
tweaked slightly), and those that would need tweaking would 
generally either need to call encode on an argument to make it 
match the string type in cases where string types mix (e.g. "foo".find("fo"d) 
would need to call encode on "fo"d to make it a string for 
comparison), or the caller would need to use std.utf.byDchar or 
std.uni.byGrapheme to operate on code points or graphemes 
rather than code units.


The two biggest places in Phobos that would potentially have 
problems are functions that special-cased strings but still 
used front and those which have to return a new range type. 
e.g. filter would be a good example, because it's forced to 
return a new range type. Right now, it would filter on dchars, 
but with the change, it would filter on the code unit type 
(most typically char). If you're filtering on ASCII characters, 
it wouldn't matter aside from the fact that the resulting range 
would have an element type of char rather than dchar, but if 
you're filtering on Unicode characters, it wouldn't work 
anymore. For situations like that, you'd be forced to use 
std.utf.byDchar or std.uni.byGrapheme. However, since most 
string code tends to operate on substrings rather than 
characters, I don't know how common it even is to use a 
function like filter on a string (as opposed to a range of 
strings). Such code might actually be fairly rare.


So, there _are_ a few functions which stop working the same way 
in a potentially silent manner if we just made it so that front 
didn't autodecode anymore. However, in general, because Phobos 
almost always special-cases strings, calls to Phobos functions 
probably wouldn't need to change in most cases, and when they 
do, a call to byDchar would restore the old behavior. But of 
course, we'd want to do the transition in a way that didn't 
result in silent behavioral changes that would break code, even 
though in most cases, it wouldn't matter, because most code 
will be operating on ASCII strings even if the strings 
themselves contain Unicode - e.g. 
unicodeString.find(asciiString) is far more common than 
unicodeString.find(otherUnicodeString).


I suspect that the code that's at the greatest risk is code 
that checks for is(Unqual!(ElementType!Range) == dchar) to 
operate on strings and wrapper ranges around strings, since it 
would then only match the cases where byDchar had been used. In 
general though, the code that's going to run into the most 
trouble is user code that contains range-based functions 
similar to what you might find in Phobos rather than code 
that's simply using the Phobos functions like startsWith and 
find - i.e. if you're writing range-based code that worries 
about doing stuff like special-casing strings or which 
specifically needs to operate on code points, then you're going 
to have to make changes, whereas to a great extent, if all 
you're doing is passing strings to Phobos functions, your code 
will tend to just work.


To actually see what the impact would be, we'd have to just 
change Phobos, I think, and then see what the impact was on 
user code. It could be surprising how much or how little it 
affects things, though in most cases, I expect that it'll mean 
that code will just work. And if we really wanted to do that, 
we could create a version flag that turned of autodecoding and 
version the changes in Phobos appropriately to see what we got. 
In many cases, if we simply made sure that Phobos functions 
which special-cased strings didn't use front directly but 
instead didn't care whether they were operating on ranges of 
char, wchar, or dchar, t

Re: Range of chars (narrow string ranges)

2015-04-28 Thread Jonathan M Davis via Digitalmars-d
On Tuesday, 28 April 2015 at 21:57:31 UTC, Vladimir Panteleev 
wrote:
On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis 
wrote:
But of course, we'd want to do the transition in a way that 
didn't result in silent behavioral changes that would break 
code,


One proposal is to make char and dchar comparisons illegal 
(after all, they are comparing different things - a UTF-8 code 
unit with a code point, and even though in some cases this 
comparison makes sense, in many it doesn't). That would solve 
most silent breakages at the expense of more not-so-silent 
breakages.


It would, but it doesn't necessarily play nicely with the 
promotion rules, and since the character types tend to be treated 
as integral types, I suspect that it would be problematic in a 
number of cases. I also suspect that it's not something that 
Walter would go for given his typical attitude about conversions 
(though I don't know). It's definitely an interesting thought, 
but I doubt that it would fly.


And if we really wanted to do that, we could create a version 
flag that turned off autodecoding and version the changes in 
Phobos appropriately to see what we got.


Shameless self-promotion alert: An alternative is a GitHub 
fork. You can easily install and try out D forks with Digger, 
it's two commands:


digger build master+jmdavis/phobos/noautodecode
digger install


Well, that may very well be what needs to happen as an 
experiment, but if we want to actually transition to not having 
autodecoding, we need a transition plan in master itself rather 
than a fork, and a temporary version would be one way to do that.


After thinking about the situation some over the past few days 
though, I think that what we need to do to begin with is to make 
it so that as many functions in Phobos as possible don't care 
whether they're dealing with ranges of char or dchar so that 
they'll work regardless of what front does on strings (either by 
simply not using front on strings - or by making it so that the 
code will work whether front returns char or dchar). And that will 
reduce the number of changes that will have to be done in Phobos 
via versioning or deprecation or whatever we'd have to do to 
actually remove autodecoding. I suspect that it would mean that 
very little would have to be versioned or deprecated if/when we 
make the switch.
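
For illustration, a sketch of what "not caring" can look like in a 
range-based function; countChar is a hypothetical helper, and the 
point is that the same body works whether front yields dchar (as 
today) or char (without autodecoding):

    import std.range.primitives;      // empty, front, popFront, ElementType, ...
    import std.traits : isSomeChar;

    // Hypothetical helper: counts an ASCII character without caring whether
    // the range yields char, wchar, or dchar elements.
    size_t countChar(R)(R r, dchar c)
        if (isInputRange!R && isSomeChar!(ElementType!R))
    {
        size_t n;
        for (; !r.empty; r.popFront())
            if (r.front == c)
                ++n;
        return n;
    }

    void main()
    {
        import std.utf : byChar;
        assert(countChar("a,b,c", ',') == 2);         // autodecoded front (dchar)
        assert(countChar("a,b,c".byChar, ',') == 2);  // code-unit front (char)
    }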


The bigger problem though is probably 3rd party range-based 
functions that use front with strings or check for dchar 
elements, rather than Phobos itself or code that simply uses 
Phobos, since much of that would just work even if we outright 
switched front from autodecoding to non-autodecoding, and most 
of what wouldn't can be made to work by making it so that those 
functions don't care whether they're dealing with autodecoded 
strings or not.


- Jonathan M Davis


Re: Range of chars (narrow string ranges)

2015-04-28 Thread Jonathan M Davis via Digitalmars-d

On Tuesday, 28 April 2015 at 23:26:14 UTC, Damian wrote:
I second that! If we all make the switch, perhaps Walter will 
too? :D


Walter isn't necessarily the one we have to convince in this 
case. He'll be very concerned about avoiding breaking existing 
code, so we'd need a solid transition plan, but he very much 
wants to get rid of autodecoding, so he'll welcome it if we can 
do it cleanly. The bigger problem is convincing Andrei, since he 
seems to think that even discussing the issue is a waste of time 
and takes away from more important topics. And I don't dispute 
that there are other important topics, and coming back to this 
one over and over again is arguably a problem, but if we can just 
figure out how to make the transition and get it over with, then 
it wouldn't need to keep getting discussed like this.


- Jonathan M Davis


Re: Range of chars (narrow string ranges)

2015-04-28 Thread Damian via Digitalmars-d

On Tuesday, 28 April 2015 at 23:15:40 UTC, H. S. Teoh wrote:
On Tue, Apr 28, 2015 at 09:57:29PM +, Vladimir Panteleev 
via Digitalmars-d wrote:
On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis 
wrote:
>But of course, we'd want to do the transition in a way that 
>didn't result in silent behavioral changes that would break code,

One proposal is to make char and dchar comparisons illegal (after 
all, they are comparing different things - a UTF-8 code unit with 
a code point, and even though in some cases this comparison makes 
sense, in many it doesn't). That would solve most silent breakages 
at the expense of more not-so-silent breakages.

>And if we really wanted to do that, we could create a version 
>flag that turned off autodecoding and version the changes in 
>Phobos appropriately to see what we got.

Shameless self-promotion alert: An alternative is a GitHub fork. 
You can easily install and try out D forks with Digger, it's two 
commands:


digger build master+jmdavis/phobos/noautodecode
digger install


Oooh, Jonathan has the code ready? Haha, maybe I'll start using 
that instead of git master! ;-)


T


I second that! If we all make the switch, perhaps Walter will 
too? :D


Re: Range of chars (narrow string ranges)

2015-04-28 Thread H. S. Teoh via Digitalmars-d
On Tue, Apr 28, 2015 at 09:57:29PM +, Vladimir Panteleev via Digitalmars-d 
wrote:
> On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
> >But of course, we'd want to do the transition in a way that didn't
> >result in silent behavioral changes that would break code,
> 
> One proposal is to make char and dchar comparisons illegal (after all,
> they are comparing different things - a UTF-8 code unit with a code
> point, and even though in some cases this comparison makes sense, in
> many it doesn't).  That would solve most silent breakages at the
> expense of more not-so-silent breakages.
> 
> >And if we really wanted to do that, we could create a version flag
> >that turned off autodecoding and version the changes in Phobos
> >appropriately to see what we got.
> 
> Shameless self-promotion alert: An alternative is a GitHub fork. You
> can easily install and try out D forks with Digger, it's two commands:
> 
> digger build master+jmdavis/phobos/noautodecode
> digger install

Oooh, Jonathan has the code ready? Haha, maybe I'll start using that
instead of git master! ;-)


T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day 
and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a 
clever lawsuit / That Windowsland will finally fall, / But thousands writing 
open source code / Like mice who nibble through a wall. -- The Linux-nationale 
by Greg Baker


Re: Range of chars (narrow string ranges)

2015-04-28 Thread Vladimir Panteleev via Digitalmars-d

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
But of course, we'd want to do the transition in a way that 
didn't result in silent behavioral changes that would break 
code,


One proposal is to make char and dchar comparisons illegal (after 
all, they are comparing different things - a UTF-8 code unit 
with a code point, and even though in some cases this comparison 
makes sense, in many it doesn't). That would solve most silent 
breakages at the expense of more not-so-silent breakages.
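
For illustration, a sketch of the silent difference such a rule 
would catch: with autodecoding, front yields the decoded code 
point, while indexing yields a lone UTF-8 code unit, and both 
comparisons compile today:

    import std.range.primitives : front;

    void main()
    {
        string s = "école";
        assert(s.front == 'é');  // true today: front decodes to code point U+00E9
        assert(s[0] != 'é');     // compiles, but compares the raw code unit 0xC3
    }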


And if we really wanted to do that, we could create a version 
flag that turned off autodecoding and version the changes in 
Phobos appropriately to see what we got.


Shameless self-promotion alert: An alternative is a GitHub fork. 
You can easily install and try out D forks with Digger, it's two 
commands:


digger build master+jmdavis/phobos/noautodecode
digger install


Re: Range of chars (narrow string ranges)

2015-04-28 Thread Jonathan M Davis via Digitalmars-d

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
Would it be much work to have example code or even an 
experimental module that gets rid of auto-decoding, so we could 
see what would be affected in general and how actual code we 
have would be affected by it?


The topic keeps coming up again and again, and while I'm in 
favor of anything that enhances performance, I'm afraid of 
having to refactor large chunks of my code. This fear may be 
unfounded, but I would need some examples to visualize 
the problem.


Honestly, most code won't care. If we just switched out all of 
the auto-decoding right now, pretty much anything using only 
ASCII would just work, and most anything that's trying to 
manipulate ASCII characters in a Unicode string will just work, 
whereas code that's specifically manipulating Unicode characters 
might have problems (e.g. comparing front with a dchar will no 
longer have the same result, since front would just be the first 
code unit rather than necessarily the first code point). Since 
most Phobos range-based functions which operate on strings are 
special-cased on strings already, many of them would continue to 
just work (e.g. find returns the same range type as what's passed 
to it even if it's given a string, so it might just work with the 
change, or it might need to be tweaked slightly), and those that 
would need tweaking would generally either need to call encode on 
an argument to make it match the string type in cases where string 
types mix (e.g. 
"foo".find("fo"d) would need to call encode on "fo"d to make it a 
string for comparison), or the caller would need to use 
std.utf.byDchar or std.uni.byGrapheme to operate on code points 
or graphemes rather than code units.
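
For instance, a sketch of the mixed-encoding case mentioned above 
as it behaves today; the explicit conversion in the second call is 
just one way such a call site might be adapted if the implicit 
decoding went away:

    import std.algorithm : find;
    import std.conv : to;

    void main()
    {
        string haystack = "foo";
        dstring needle = "fo"d;

        // Works today: find decodes both sides and compares dchars.
        assert(haystack.find(needle) == "foo");

        // Converting the needle explicitly works today and would keep
        // working if the implicit decoding went away.
        assert(haystack.find(needle.to!string) == "foo");
    }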


The two biggest places in Phobos that would potentially have 
problems are functions that special-cased strings but still used 
front and those which have to return a new range type. e.g. 
filter would be a good example, because it's forced to return a 
new range type. Right now, it would filter on dchars, but with 
the change, it would filter on the code unit type (most typically 
char). If you're filtering on ASCII characters, it wouldn't 
matter aside from the fact that the resulting range would have an 
element type of char rather than dchar, but if you're filtering 
on Unicode characters, it wouldn't work anymore. For situations 
like that, you'd be forced to use std.utf.byDchar or 
std.uni.byGrapheme. However, since most string code tends to 
operate on substrings rather than characters, I don't know how 
common it even is to use a function like filter on a string (as 
opposed to a range of strings). Such code might actually be 
fairly rare.
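
A sketch of that filter case (byDchar from std.utf is the helper 
mentioned; whether it's needed depends on whether the predicate 
involves non-ASCII characters):

    import std.algorithm : equal, filter;
    import std.utf : byDchar;

    void main()
    {
        // ASCII predicate: same result whether elements are code units or
        // code points, since ASCII characters are single code units.
        assert("a b c".filter!(c => c != ' ').equal("abc"));

        // Non-ASCII predicate: must see whole code points, so spell out
        // byDchar instead of relying on front decoding implicitly.
        assert("café".byDchar.filter!(c => c != 'é').equal("caf"));
    }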


So, there _are_ a few functions which stop working the same way 
in a potentially silent manner if we just made it so that front 
didn't autodecode anymore. However, in general, because Phobos 
almost always special-cases strings, calls to Phobos functions 
probably wouldn't need to change in most cases, and when they do, 
a call to byDchar would restore the old behavior. But of course, 
we'd want to do the transition in a way that didn't result in 
silent behavioral changes that would break code, even though in 
most cases, it wouldn't matter, because most code will be 
operating on ASCII strings even if the strings themselves contain 
Unicode - e.g. unicodeString.find(asciiString) is far more common 
than unicodeString.find(otherUnicodeString).


I suspect that the code that's at the greatest risk is code that 
checks for is(Unqual!(ElementType!Range) == dchar) to operate on 
strings and wrapper ranges around strings, since it would then 
only match the cases where byDchar had been used. In general 
though, the code that's going to run into the most trouble is 
user code that contains range-based functions similar to what you 
might find in Phobos rather than code that's simply using the 
Phobos functions like startsWith and find - i.e. if you're 
writing range-based code that worries about doing stuff like 
special-casing strings or which specifically needs to operate on 
code points, then you're going to have to make changes, whereas 
to a great extent, if all you're doing is passing strings to 
Phobos functions, your code will tend to just work.
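
A sketch of the kind of constraint being described, with process 
as a hypothetical user function: today it accepts plain strings 
because ElementType!string is dchar, but without autodecoding it 
would only match ranges that had byDchar applied explicitly:

    import std.range.primitives : ElementType, isInputRange;
    import std.traits : Unqual;

    // Hypothetical user function constrained to ranges of dchar.
    void process(R)(R r)
        if (isInputRange!R && is(Unqual!(ElementType!R) == dchar))
    {
        // ... operate on code points ...
    }

    void main()
    {
        process("matches today, because ElementType!string is dchar");
    }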


To actually see what the impact would be, we'd have to just 
change Phobos, I think, and then see what the impact was on user 
code. It could be surprising how much or how little it affects 
things, though in most cases, I expect that it'll mean that code 
will just work. And if we really wanted to do that, we could 
create a version flag that turned off autodecoding and version the 
changes in Phobos appropriately to see what we got. In many 
cases, if we simply made sure that Phobos functions which 
special-cased strings didn't use front directly but instead 
didn't care whether they were operating on ranges of char, wchar, 
or dchar, then we wouldn't even need to version anything (e.g. 
find could easily 

Re: Range of chars (narrow string ranges)

2015-04-28 Thread Chris via Digitalmars-d

On Monday, 27 April 2015 at 17:49:04 UTC, Jonathan M Davis wrote:

On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
On Sat, Apr 25, 2015 at 02:27:45AM +, Jonathan M Davis via 
Digitalmars-d wrote:

[...]
I suppose that a related alternative would be to change it so 
that strings aren't considered ranges anymore (at least 
temporarily), and force folks to use stuff like byChar or 
byDChar (or whatever those functions are) whenever they use 
strings as ranges. And actually, that _would_ allow us to get 
rid of the autodecoding without rearranging modules. Later, we 
could change them to being ranges of their actual element 
types, or we could just force folks to be explicit forever in 
an effort to make the Unicode issues clear, if we thought that 
that were better (though it would probably be better to just 
change front and friends later to work with strings again but 
not autodecode). And if an algorithm would work with either 
autodecoding or without it, then maybe it could be 
special-cased to accept strings as ranges, only forcing it in 
the cases where the behavior of the algorithm would change 
based on whether autodecoding were used or not.

Hmmm. I'm not sure what all of the repercussions of such an 
approach would be, but the more I think about it, the more 
tempting it seems to me.

[...]

I would vote for this approach, if we ever decide to get rid of 
autodecoding. I'm OK with either option -- get rid of 
autodecoding, or keep it and use it consistently. What I am 
*not* OK with is the present, and growing, schizophrenic 
mixture of autodecoding and non-autodecoding string functions 
in Phobos. This inconsistency is going to come back to bite us 
later.


I expect that the two biggest problems causing the current 
situation are


1. Andrei and Walter don't seem to agree on the issue (Andrei 
seems to think that it's not a big deal to leave in the 
autodecoding).


2. While most of the core devs want to get rid of the 
autodecoding, it's a big enough change that we're afraid to do 
it and/or aren't sure of how we could do it without being too 
disruptive.


So, Walter has been pushing the schizophrenic approach in an 
effort to work around the problem. If the core devs could agree 
on an approach to removing autodecoding that wasn't too 
disruptive and somehow get Andrei to go along with it, then we 
could do that and fix the problem, but otherwise, Walter is 
just going to push for the schizophrenic approach, because it 
at least partially fixes the autodecoding problem, and enough 
of the core devs want to ditch the autodecoding that at least 
some of those changes are likely to make it in.


Honestly, I think that we need to figure out what the best 
options are for killing autodecoding and then figure out how to 
convince Andrei of it, but I haven't a clue how to convince 
Andrei unless maybe a solution which isn't very disruptive can 
be found, but it seems like every time the issue comes up, he 
gets annoyed that we're spending time on something unimportant. 
I do think that this limbo needs to stop though, and I think 
that it's clear that while autodecoding seemed like a good idea 
at first (especially if code points really were full characters 
instead of having to worry about graphemes), ultimately, 
autodecoding is a mistake.


- Jonathan M Davis


Would it be much work to have example code or even an 
experimental module that gets rid of auto-decoding, so we could 
see what would be affected in general and how actual code we have 
would be affected by it?


The topic keeps coming up again and again, and while I'm in favor 
of anything that enhances performance, I'm afraid of having to 
refactor large chunks of my code. This fear may be unfounded, 
but I would need some examples to visualize the 
problem.


Re: Range of chars (narrow string ranges)

2015-04-27 Thread Jonathan M Davis via Digitalmars-d

On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
On Sat, Apr 25, 2015 at 02:27:45AM +, Jonathan M Davis via 
Digitalmars-d wrote:

[...]
I suppose that a related alternative would be to change it so 
that strings aren't considered ranges anymore (at least 
temporarily), and force folks to use stuff like byChar or 
byDChar (or whatever those functions are) whenever they use 
strings as ranges. And actually, that _would_ allow us to get 
rid of the autodecoding without rearranging modules. Later, we 
could change them to being ranges of their actual element 
types, or we could just force folks to be explicit forever in 
an effort to make the Unicode issues clear, if we thought that 
that were better (though it would probably be better to just 
change front and friends later to work with strings again but 
not autodecode). And if an algorithm would work with either 
autodecoding or without it, then maybe it could be 
special-cased to accept strings as ranges, only forcing it in 
the cases where the behavior of the algorithm would change 
based on whether autodecoding were used or not.

Hmmm. I'm not sure what all of the repercussions of such an 
approach would be, but the more I think about it, the more 
tempting it seems to me.

[...]

I would vote for this approach, if we ever decide to get rid of 
autodecoding. I'm OK with either option -- get rid of 
autodecoding, or keep it and use it consistently. What I am 
*not* OK with is the present, and growing, schizophrenic 
mixture of autodecoding and non-autodecoding string functions 
in Phobos. This inconsistency is going to come back to bite us 
later.


I expect that the two biggest problems causing the current 
situation are


1. Andrei and Walter don't seem to agree on the issue (Andrei 
seems to think that it's not a big deal to leave in the 
autodecoding).


2. While most of the core devs want to get rid of the 
autodecoding, it's a big enough change that we're afraid to do it 
and/or aren't sure of how we could do it without being too 
disruptive.


So, Walter has been pushing the schizophrenic approach in an 
effort to work around the problem. If the core devs could agree 
on an approach to removing autodecoding that wasn't too 
disruptive and somehow get Andrei to go along with it, then we 
could do that and fix the problem, but otherwise, Walter is just 
going to push for the schizophrenic approach, because it at least 
partially fixes the autodecoding problem, and enough of the core 
devs want to ditch the autodecoding that at least some of those 
changes are likely to make it in.


Honestly, I think that we need to figure out what the best 
options are for killing autodecoding and then figure out how to 
convince Andrei of it, but I haven't a clue how to convince 
Andrei unless maybe a solution which isn't very disruptive can be 
found, but it seems like every time the issue comes up, he gets 
annoyed that we're spending time on something unimportant. I do 
think that this limbo needs to stop though, and I think that it's 
clear that while autodecoding seemed like a good idea at first 
(especially if code points really were full characters instead of 
having to worry about graphemes), ultimately, autodecoding is a 
mistake.


- Jonathan M Davis


Re: Range of chars (narrow string ranges)

2015-04-27 Thread H. S. Teoh via Digitalmars-d
On Sat, Apr 25, 2015 at 02:27:45AM +, Jonathan M Davis via Digitalmars-d 
wrote:
[...]
> I suppose that a related alternative would be to change it so that
> strings aren't considered ranges anymore (at least temporarily), and
> force folks to use stuff like byChar or byDChar (or whatever those
> functions are) whenever they use strings as ranges. And actually, that
> _would_ allow us to get rid of the autodecoding without rearranging
> modules. Later, we could change them to being ranges of their actual
> element types, or we could just force folks to be explicit forever in
> an effort to make the Unicode issues clear, if we thought that that
> were better (though it would probably be better to just change front and
> friends later to work with strings again but not autodecode). And if
> an algorithm would work with either autodecoding or without it, then
> maybe it could be special cased to accept strings as ranges, only
> forcing it in the cases where the behavior of the algorithm would
> change based on whether autodecoding were used or not.
> 
> Hmmm. I'm not sure what all of the repercussions of such an approach
> would be, but the more I think about it, the more tempting it seems to
> me.
[...]

I would vote for this approach, if we ever decide to get rid of
autodecoding. I'm OK with either option -- get rid of autodecoding, or
keep it and use it consistently. What I am *not* OK with is the present,
and growing, schizophrenic mixture of autodecoding and non-autodecoding
string functions in Phobos. This inconsistency is going to come back to
bite us later.


T

-- 
One reason that few people are aware there are programs running the internet is 
that they never crash in any significant way: the free software underlying the 
internet is reliable to the point of invisibility. -- Glyn Moody, from the 
article "Giving it all away"


Re: Range of chars (narrow string ranges)

2015-04-25 Thread ketmar via Digitalmars-d
On Fri, 24 Apr 2015 13:44:43 -0700, Walter Bright wrote:

> I'm afraid we are stuck with autodecoding, as taking it out may be far
> too disruptive.

the more time passes, the harder autodecode is to kill. kill it while it's 
not too late. make the next DMD release 2.100 and KILL AUTODECODE for 
good.



Re: Range of chars (narrow string ranges)

2015-04-24 Thread Jonathan M Davis via Digitalmars-d
On Saturday, 25 April 2015 at 02:04:02 UTC, Steven Schveighoffer 
wrote:

On 4/24/15 9:02 PM, Walter Bright wrote:

On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
This is pretty easy. We just have to create a string type that 
is backed by, but isn't simply an alias to, an array of char.


Just shoot me now!



Yeah, that's the reaction I figured I'd get ;) But it doesn't 
hurt to keep trying since we keep coming back to this over, and 
over, and over, and over...


Honestly, even if that were the ideal way to go (and I don't 
think that it is), I'd expect that to be even more disruptive 
than trying to rearrange the modules so that front and friends 
don't autodecode for strings.


I suppose that a related alternative would be to change it so 
that strings aren't considered ranges anymore (at least 
temporarily), and force folks to use stuff like byChar or byDChar 
(or whatever those functions are) whenever they use strings as 
ranges. And actually, that _would_ allow us to get rid of the 
autodecoding without rearranging modules. Later, we could change 
them to being ranges of their actual element types, or we could 
just force folks to be explicit forever in an effort to make the 
Unicode issues clear, if we thought that that were better (though 
it would probably be better to just change front and friends later 
to work with strings again but not autodecode). And if an 
algorithm would work with either autodecoding or without it, then 
maybe it could be special-cased to accept strings as ranges, only 
forcing it in the cases where the behavior of the algorithm 
would change based on whether autodecoding were used or not.
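
A sketch of what the "explicit only" style might look like for 
callers, using the byChar/byDchar helpers mentioned above (their 
availability in a given Phobos release is an assumption here):

    import std.range.primitives : walkLength;
    import std.utf : byChar, byDchar;

    void main()
    {
        string s = "café";

        // The caller picks the level explicitly instead of getting an
        // implicit range of dchar:
        assert(s.byChar.walkLength == 5);   // UTF-8 code units
        assert(s.byDchar.walkLength == 4);  // code points
    }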


Hmmm. I'm not sure what all of the repercussions of such an 
approach would be, but the more I think about it, the more 
tempting it seems to me.


- Jonathan M Davis


Re: Range of chars (narrow string ranges)

2015-04-24 Thread Steven Schveighoffer via Digitalmars-d

On 4/24/15 9:02 PM, Walter Bright wrote:

On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:

This is pretty easy. We just have to create a string type that is
backed by, but
isn't simply an alias to, an array of char.


Just shoot me now!



Yeah, that's the reaction I figured I'd get ;) But it doesn't hurt to 
keep trying since we keep coming back to this over, and over, and over, 
and over...


-Steve


Re: Range of chars (narrow string ranges)

2015-04-24 Thread Walter Bright via Digitalmars-d

On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:

This is pretty easy. We just have to create a string type that is backed by, but
isn't simply an alias to, an array of char.


Just shoot me now!



Re: Range of chars (narrow string ranges)

2015-04-24 Thread Steven Schveighoffer via Digitalmars-d

On 4/24/15 4:44 PM, Walter Bright wrote:


I'm afraid we are stuck with autodecoding, as taking it out may be far
too disruptive.


This is pretty easy. We just have to create a string type that is backed 
by, but isn't simply an alias to, an array of char.


-Steve


Re: Range of chars (narrow string ranges)

2015-04-24 Thread Jonathan M Davis via Digitalmars-d

On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:

On 4/24/2015 11:52 AM, H. S. Teoh via Digitalmars-d wrote:
I really wish we would just *make the darn decision* already, 
whether to kill off autodecoding or not, and MAKE IT CONSISTENT 
ACROSS PHOBOS, instead of introducing this schizophrenic dichotomy 
where some functions give you a range of dchar while others give 
you a range of char/wchar, and the two don't work well together. 
This is totally going to make a laughing stock of D one day.


Some facts:

1. When I started D, there was a lot of speculation about 
whether the world would settle on UTF8, UTF16, or UTF32. So D 
supports natively all three. Time has shown, however, that UTF8 
has pretty much won. wchar only exists for the Windows API and 
Java; dchar strings pretty much don't exist in the wild.


2. dchar is very useful as a character type, but not as a 
string type.


3. Pretty much none of the algorithms in Phobos work when 
presented with a range of chars or wchars. This is not even 
documented.


4. Autodecoding is inefficient, especially considering that few 
algorithms actually need decoding. Re-encoding the result back 
to UTF8 is another inefficiency.


I'm afraid we are stuck with autodecoding, as taking it out may 
be far too disruptive.


But all is not lost. The Phobos algorithms can all be fixed to 
not care about autodecoding. The changes I've made to 
std.string all reflect that.


https://github.com/D-Programming-Language/phobos/pulls/WalterBright


I really think that leaving things with autodecoding in some 
cases and not in others is just asking for trouble. Even if we 
manage to figure out how to fix it so that Phobos doesn't 
autodecode in any of its algorithms without breaking any user 
code in the process, that then leaves user code with the problem, 
and since Phobos _wouldn't_ have the problem, it then would be 
all the more confusing.


It _is_ possible to get rid of it entirely without breaking code 
if we move the array range primitives to a new module and later 
deprecate the old ones, though that would probably mean breaking 
up std.array into submodules and deprecating _all_ of it in favor 
of its submodules, since anyone importing std.array would then 
have the old array range primitives rather than the new ones - or 
both, causing conflicts. And it's made worse by the fact that 
std.range publicly imports std.array. So, yes, it _is_ ugly. But 
it _can_ be done.


If we leave autodecoding in and just work around it everywhere in 
Phobos, it's just going to forever screw with user code and 
confuse users. They get confused enough by it as it is, and at 
least now, they're running into it in Phobos where we can explain 
it, whereas if they don't see it with Phobos and only with their 
own code, then they're going to think that they're doing 
something wrong and potentially get very frustrated.


I definitely share the concern that removing autodecoding 
outright will be too disruptive, but at the same time, I don't 
know if we can afford to go halfway with it.


Re: Range of chars (narrow string ranges)

2015-04-24 Thread Walter Bright via Digitalmars-d

On 4/24/2015 3:29 PM, Brad Anderson wrote:

I haven't really followed the autodecoding conversations. The problem is that
front on char ranges decodes, right?


Nope. Only front on narrow string arrays. Ranges aren't autodecoded.
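
A small sketch of that distinction, with CharRange as a 
hypothetical user-defined range: front on a char[] decodes, front 
on a range of char does not:

    import std.range.primitives : ElementType;

    // Hypothetical user-defined range of char.
    struct CharRange
    {
        string data;
        @property bool empty() const { return data.length == 0; }
        @property char front() const { return data[0]; }  // a code unit, not decoded
        void popFront() { data = data[1 .. $]; }
    }

    static assert(is(ElementType!string == dchar));    // narrow string array: front decodes
    static assert(is(ElementType!CharRange == char));  // wrapper range: no autodecoding

    void main() {}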



Is there a quick way to tell which functions
are auto decoding so we can have a list of candidates for replacement? It'd be
good for hackweek.


If they accept ranges, and don't special case narrow strings, then they 
autodecode.



I'm reminded of this conversation
http://forum.dlang.org/post/xgnurdjcqiyatpvnw...@forum.dlang.org
which contains a partial list of candidates.


PR's exist for most of these now.


Following your lead with
implementing these lazy versions (without autodecoding) would be good hackweek
projects.


Yup.



Finally, there is this http://goo.gl/Wmotu4 list from
http://forum.dlang.org/post/lvmydbvjivsvmwtim...@forum.dlang.org that has some
good candidates for hackweek I think.


Yes, we should have an answer for each of the Boost string algorithms.



Are we collecting hackweek ideas anywhere?


Andrei?


Re: Range of chars (narrow string ranges)

2015-04-24 Thread Brad Anderson via Digitalmars-d

On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:

[snip]
I'm afraid we are stuck with autodecoding, as taking it out may 
be far too disruptive.


No!

But all is not lost. The Phobos algorithms can all be fixed to 
not care about autodecoding. The changes I've made to 
std.string all reflect that.


Yay!

I haven't really followed the autodecoding conversations. The 
problem is that front on char ranges decodes, right? Is there a 
quick way to tell which functions are auto decoding so we can 
have a list of candidates for replacement? It'd be good for 
hackweek.


I'm reminded of this conversation 
http://forum.dlang.org/post/xgnurdjcqiyatpvnw...@forum.dlang.org
which contains a partial list of candidates. Following your lead 
with implementing these lazy versions (without autodecoding) 
would be good hackweek projects.


Finally, there is this http://goo.gl/Wmotu4 list from 
http://forum.dlang.org/post/lvmydbvjivsvmwtim...@forum.dlang.org 
that has some good candidates for hackweek I think.


Are we collecting hackweek ideas anywhere?


Re: Range of chars (narrow string ranges)

2015-04-24 Thread Martin Nowak via Digitalmars-d
On 04/24/2015 10:44 PM, Walter Bright wrote:
> 4. Autodecoding is inefficient, especially considering that few
> algorithms actually need decoding. Re-encoding the result back to UTF8
> is another inefficiency.
> 
> I'm afraid we are stuck with autodecoding, as taking it out may be far
> too disruptive.
> 
> But all is not lost. The Phobos algorithms can all be fixed to not care
> about autodecoding. The changes I've made to std.string all reflect that.

It probably won't be too disruptive to optimize algorithms such as
filter to return a range of chars, but only if we support such ranges as
narrow strings everywhere.



Re: Range of chars (narrow string ranges)

2015-04-24 Thread Walter Bright via Digitalmars-d

On 4/24/2015 11:52 AM, H. S. Teoh via Digitalmars-d wrote:

I really wish we would just *make the darn decision* already, whether to
kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS PHOBOS,
instead of introducing this schizophrenic dichotomy where some functions
give you a range of dchar while others give you a range of char/wchar,
and the two don't work well together. This is totally going to make a
laughing stock of D one day.


Some facts:

1. When I started D, there was a lot of speculation about whether the world 
would settle on UTF8, UTF16, or UTF32. So D supports natively all three. Time 
has shown, however, that UTF8 has pretty much won. wchar only exists for the 
Windows API and Java; dchar strings pretty much don't exist in the wild.


2. dchar is very useful as a character type, but not as a string type.

3. Pretty much none of the algorithms in Phobos work when presented with a range 
of chars or wchars. This is not even documented.


4. Autodecoding is inefficient, especially considering that few algorithms 
actually need decoding. Re-encoding the result back to UTF8 is another inefficiency.
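
For illustration, one common case where no decoding is needed; 
this sketch uses std.string.representation to get at the raw code 
units (byChar or a similar lazy wrapper would be an alternative):

    import std.algorithm : canFind;
    import std.string : representation;

    void main()
    {
        string text = "naïve implementation";

        // Decoded search: every multi-byte sequence becomes a dchar first.
        assert(text.canFind("implementation"));

        // Code-unit search: plain byte comparison, same answer for valid UTF-8.
        assert(text.representation.canFind("implementation".representation));
    }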


I'm afraid we are stuck with autodecoding, as taking it out may be far too 
disruptive.


But all is not lost. The Phobos algorithms can all be fixed to not care about 
autodecoding. The changes I've made to std.string all reflect that.


https://github.com/D-Programming-Language/phobos/pulls/WalterBright


Re: Range of chars (narrow string ranges)

2015-04-24 Thread H. S. Teoh via Digitalmars-d
On Fri, Apr 24, 2015 at 08:39:36PM +0200, Martin Nowak via Digitalmars-d wrote:
> Just want to make this a bit more visible.
> https://github.com/D-Programming-Language/phobos/pull/3206#issuecomment-95681812
> 
> We just added entabber to Phobos, and AFAIK, it's the first range
> algorithm that transforms narrow strings to a range of chars, instead
> of decoding the original string and returning a range of dchars.
> 
> Most of Phobos can't handle such ranges the way it handles strings, and you'd
> have to decode them using byDchar to work with them.

I really wish we would just *make the darn decision* already, whether to
kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS PHOBOS,
instead of introducing this schizophrenic dichotomy where some functions
give you a range of dchar while others give you a range of char/wchar,
and the two don't work well together. This is totally going to make a
laughing stock of D one day.


T

-- 
Guns don't kill people. Bullets do.