On Mon, 10 Mar 2014 17:44:22 -0400,
Nick Sabalausky seewebsitetocontac...@semitwist.com wrote:
On 3/7/2014 8:40 AM, Michel Fortin wrote:
On 2014-03-07 03:59:55 +, bearophile bearophileh...@lycos.com said:
Walter Bright:
I understand this all too well. (Note that we currently have
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
Is there any hope of fixing this?
I agree with Andrei. I don't think that there's really anything to fix. The
problem is that there's roughly 3 levels at which string operations can be
done
1. By code unit
2. By code point
3. By
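The three levels listed above can be illustrated with a small D sketch. It assumes a Phobos release where narrow strings are still auto-decoded by range primitives and where `std.uni.byGrapheme` is available; the helper names `codeUnits`, `codePoints` and `graphemes` are made up for this example and are not part of any library.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes) in s.
size_t codeUnits(string s) { return s.length; }

/// Number of code points; walkLength auto-decodes narrow strings.
size_t codePoints(string s) { return s.walkLength; }

/// Number of graphemes (user-perceived characters).
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // "casse" followed by U+0301 COMBINING ACUTE ACCENT,
    // i.e. the decomposed spelling of "cassé".
    string s = "casse\u0301";

    assert(codeUnits(s) == 7);  // 5 ASCII bytes + 2 bytes for U+0301
    assert(codePoints(s) == 6); // 'e' and the accent are separate code points
    assert(graphemes(s) == 5);  // 'e' + accent form one grapheme
}
```

The same string gives three different "lengths" depending on the level, which is roughly why no single default suits every operation.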
On Sunday, 9 March 2014 at 21:38:06 UTC, Nick Sabalausky wrote:
On 3/9/2014 7:47 AM, w0rp wrote:
My knowledge of Unicode pretty much just comes from having
to deal with foreign language customers and discovering the
problems
with the code unit abstraction most languages seem to use.
(Java
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
Ok, I have a plan. Each step will be separated by at least one
version:
1. implement decode() as an algorithm for string types, so one
can write:
string s;
s.decode.algorithm...
suggest that people start doing that
On Tuesday, 11 March 2014 at 02:07:19 UTC, Steven Schveighoffer
wrote:
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
newshou...@digitalmars.com wrote:
On 3/10/2014 6:47 AM, Dicebot wrote:
(array literals that allocate, I will never forgive that).
It was done that way simply to get it
On 3/10/2014 12:23 AM, Walter Bright wrote:
On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
On 3/9/2014 6:31 PM, Walter Bright wrote:
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be better names
than `raw`
and `decode`, to match the
On 3/10/2014 12:09 AM, Nick Sabalausky wrote:
On 3/10/2014 12:23 AM, Walter Bright wrote:
On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
On 3/9/2014 6:31 PM, Walter Bright wrote:
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
With all due respect, D string type is exclusively for UTF-8
strings.
If it is not valid UTF-8, it should never have been a D string
in the
first place. In the other cases, ubyte[] is there.
This is an arbitrary self-imposed
I'm not sure I understood the point of this (long) thread.
Is the main problem that decode() is also called when it is not needed?
Well, in that case it's not a problem only for strings. I found
this problem also when I was writing other ranges. For example
when I read binary data from db stream. Front
On 3/10/2014 6:21 AM, ponce wrote:
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
Yea, I've had problems before - completely unnecessary problems that
were *not* helpful or indicative of latent bugs - which were a direct
result of Phobos being overly pedantic and eager about
On Monday, 10 March 2014 at 11:04:43 UTC, Nick Sabalausky wrote:
I may have missed it, but I don't see where it says anything
about validation or immediate sanitation of invalid sequences.
It's mostly UTF-16 sucks and so does Windows (not that I'm
necessarily disagreeing with it). (ot: Kinda
On 3/9/2014 11:27 AM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On topic, I think D's implicit default decode to dchar is *infinity*
times better than C++'s char-based strings. While imperfect in terms
of grapheme, it was still a design decision
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu
wrote:
On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
2) It is regression back to C++ days of
no-one-cares-about-Unicode
pain. Thinking about strings as character
On Friday, 7 March 2014 at 19:43:57 UTC, Walter Bright wrote:
On 3/7/2014 7:03 AM, Dicebot wrote:
1) It is a huge breakage and you have been refusing to do one
even for more
important problems. What is behind this sudden change of mind?
1. Performance Performance Performance
Not important
On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
I'm not sure I understood the point of this (long) thread.
Is the main problem that decode() is also called when it is not needed?
I'd like to offer up one D 'user' perspective, it's just a single
data point but perhaps useful. I write
In Italian we need Unicode too. We have several accented letters
and often programming languages don't handle UTF-8 and other
encodings so well...
In D I never had any problem with this, and I work a lot on text
processing.
So my question: is there any problem I'm missing in D with
Unicode
On 07.03.2014 03:37, Walter Bright wrote:
In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
encoding and decoding of char ranges.
After reading many of the attached posts, the question is: what
could be D's future design for introducing breaking changes, it's
not a
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
On 07.03.2014 03:37, Walter Bright wrote:
In "Lots of low hanging fruit in Phobos" the issue came up
about the automatic
encoding and decoding of char ranges.
After reading many of the attached posts, the question is: what
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
Historically 2 approaches have been practiced:
1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary
These are one and the same, just from the two opposing points of
view.
I also think that
Historically 2 approaches have been practiced:
1) argue a lot and then do nothing
This happens (I think) because Andrei and Walter really value
your and other experts' opinions, but nevertheless have to
preserve the general way things work to preserve the long term
future of D. They
On Monday, 10 March 2014 at 14:27:02 UTC, Vladimir Panteleev
wrote:
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
Historically 2 approaches have been practiced:
1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary
These are one and the
On Mon, 10 Mar 2014 14:05:03 +,
Andrea Fontana nos...@example.com wrote:
In Italian we need Unicode too. We have several accented letters
and often programming languages don't handle UTF-8 and other
encodings so well...
In D I never had any problem with this, and I work a lot on text
On Monday, 10 March 2014 at 13:18:50 UTC, Dicebot wrote:
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu
wrote:
On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
2) It is regression back to C++ days of
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
My app deals with Unicode Arabic text that is 'out there', and
the Unicode™ support for Arabic is not that well thought out,
so the data is often (always) inconsistent in terms of
sequencing diacritics etc. Even the code page can vary.
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
My app deals with Unicode Arabic text that is 'out there', and
the Unicode™ support for Arabic is not that well thought out,
so the data is often (always) inconsistent in
On 3/7/2014 8:40 AM, Michel Fortin wrote:
On 2014-03-07 03:59:55 +, bearophile bearophileh...@lycos.com said:
Walter Bright:
I understand this all too well. (Note that we currently have a
different silent problem: unnoticed large performance problems.)
On the other hand your change
On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote:
Yes. I have given up on this idea at some point as there
seemed to be consensus that no breaking changes will even be
considered for D2 and those that come from fixing bugs are not
worth the fuss.
So at what point are we going to
On 3/10/2014 6:47 AM, Dicebot wrote:
(array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having them not
allocate is an optimization, it doesn't change the nature.
On 3/10/2014 7:35 PM, Yota wrote:
On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote:
Yes. I have given up on this idea at some point as there seemed to
be consensus that no breaking changes will even be considered for D2
and those that come from fixing bugs are not worth the fuss.
So
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
newshou...@digitalmars.com wrote:
On 3/10/2014 6:47 AM, Dicebot wrote:
(array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having
them not allocate is an optimization,
On 3/10/14, 7:07 PM, Steven Schveighoffer wrote:
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
newshou...@digitalmars.com wrote:
On 3/10/2014 6:47 AM, Dicebot wrote:
(array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly.
On Mon, 10 Mar 2014 22:56:22 -0400, Andrei Alexandrescu
seewebsiteforem...@erdani.org wrote:
On 3/10/14, 7:07 PM, Steven Schveighoffer wrote:
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
newshou...@digitalmars.com wrote:
On 3/10/2014 6:47 AM, Dicebot wrote:
(array literals that
On 3/10/14, 8:05 PM, Steven Schveighoffer wrote:
I think you are missing what I'm saying, I don't want the allocation
eliminated, but if we eliminate some allocations with [] and not others,
it will be confusing. The path I'd always hoped we would go in was to
make all array literals immutable,
On 3/7/2014 6:33 PM, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 11:13:50PM +, Sarath Kodali wrote:
On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
+1
In Indian languages, a character consists of one or more UNICODE
code points. For example, in Sanskrit ddhrya
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On topic, I think D's implicit default decode to dchar is
*infinity* times better than C++'s char-based strings. While
imperfect in terms of grapheme, it was still a design decision
made of win.
I'd be tempted to not ask how do we
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:
I'm leaning the same way too. But I also think Andrei is right
that, at this point in time, it'd be a terrible move to change
things so that by code unit is default. For better or worse,
that ship has sailed.
Perhaps we *can*
- In lots of places, I've discovered that Phobos did UTF
decoding (thus murdering performance) when it didn't need to.
Such cases included format (now fixed), appender (now fixed),
startsWith (now fixed - recently), skipOver (still unfixed).
These have caused latent bugs in my programs that
On 09/03/14 04:26, Andrei Alexandrescu wrote:
2. Add byChar that returns a random-access range iterating a string by
character. Add byWchar that does on-the-fly transcoding to UTF16. Add
byDchar that accepts any range of char and does decoding. And such
stuff. Then whenever one wants to go
On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On topic, I think D's implicit default decode to dchar is
*infinity* times better than C++'s char-based strings. While
imperfect in terms of grapheme, it was still a
On Friday, 7 March 2014 at 04:11:15 UTC, Nick Sabalausky wrote:
What about this?:
Anywhere we currently have a front() that decodes, such as your
example:
@property dchar front(T)(T[] a) @safe pure if
(isNarrowString!(T[]))
{
assert(a.length, Attempting to fetch the front of an
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be better
names than `raw` and `decode`, to match the already existing
`byGrapheme` in std.uni.
There already is a std.uni.byCodePoint. It is a higher order
range that accepts
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu
wrote:
The current approach is a cut above treating strings as arrays
of bytes
for some languages, and still utterly broken for others. If I'm
operating on a right to left language like Hebrew, what would
I expect
the result to be
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
2) It is regression back to C++ days of
no-one-cares-about-Unicode pain. Thinking about strings as
character arrays is so natural and convenient that if
language/Phobos won't punish you for that, it will be extremely
widespread.
Not
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to
generalize, almost any code that uses std.algorithm
On 2014-03-09 13:00:45 +, monarch_dodra monarchdo...@gmail.com said:
AFAIK, the most common algorithm "case insensitive search" *must* decode.
Not necessarily. While the Unicode collation algorithms (which should
be used to compare text) are defined in terms of code points, you could
build
On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 10:35:46PM +, Sarath Kodali wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev
wrote:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
wrote:
[...]
Clearly one might argue that
On 2014-03-09 14:12:28 +, Marc Schütz schue...@gmx.net said:
That won't work, because your needle might be in a different
normalization form than your haystack, thus a byte-by-byte comparison
will not be able to find it.
The core of the problem is that sometimes this byte-by-byte
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
IMO, the normalization argument is overrated. I've yet to
encounter a real-world case of normalization: only hand written
counter-examples. Not saying it doesn't exist, just that:
1. It occurs only in special cases that the program
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu
wrote:
On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu
wrote:
What exactly is the consensus? From your wiki page I see One
of the
proposals in the thread is to switch the
On Sunday, 9 March 2014 at 13:47:26 UTC, Marc Schütz wrote:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
2) It is regression back to C++ days of
no-one-cares-about-Unicode pain. Thinking about strings as
character arrays is so natural and convenient that if
language/Phobos won't
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's?
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
- In lots of places, I've discovered that Phobos did UTF
decoding (thus murdering performance) when it didn't need to.
Such cases included format (now fixed), appender (now fixed),
startsWith (now fixed - recently), skipOver (still
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu
wrote:
The current approach is a cut above treating strings as
arrays of bytes
for some languages, and still utterly broken for others. If
I'm
operating on a right to
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On topic, I think D's implicit default decode to dchar is
*infinity* times better than C++'s char-based strings. While
imperfect in terms of grapheme, it was still a design decision
made of win.
Care to elaborate?
I'd be tempted
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
As for the belief that iterating by code point has utility. I
have to strongly disagree. Unicode is composed of codepoints,
and that is what we handle. The fact that it can be encoded
and stored as UTF is an implementation detail.
Vladimir Panteleev:
Seriously, Bearophile suggested ABCD.sort(), and it took
about 6 pages (!) for someone to point out this would be wrong.
Sorting a string has quite limited use in the general case,
It seems I am sorting arrays of mutable ASCII chars often enough
:-)
Some time ago I have
On Sunday, 9 March 2014 at 16:02:55 UTC, bearophile wrote:
Vladimir Panteleev:
Seriously, Bearophile suggested ABCD.sort(), and it took
about 6 pages (!) for someone to point out this would be
wrong.
Sorting a string has quite limited use in the general case,
It seems I am sorting arrays
Vladimir Panteleev:
What do you use this for?
For lots of different reasons (counting, testing, histograms, to
unique-ify, to allow binary searches, etc), you can find
alternative solutions for every one of those use cases.
I can think of sort being useful e.g. to see which characters
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
On 09/03/14 04:26, Andrei Alexandrescu wrote:
2. Add byChar that returns a random-access range iterating a string by
character. Add byWchar that does on-the-fly transcoding to UTF16. Add
byDchar that accepts any range of char and does decoding.
On 3/9/14, 4:34 AM, Peter Alexander wrote:
I think this is the main confusion: the belief that iterating by code
point has utility.
If you care about normalization then neither by code unit, by code
point, nor by grapheme are correct (except in certain language subsets).
I suspect that code
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu
wrote:
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
So IIUC iterating over s.byChar would not encounter the
decoding-related
speed hits that Walter is concerned about?
That is correct.
Unless I'm missing something, all
On 3/9/14, 6:47 AM, Marc Schütz schue...@gmx.net wrote:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
2) It is regression back to C++ days of no-one-cares-about-Unicode
pain. Thinking about strings as character arrays is so natural and
convenient that if language/Phobos won't punish
On 3/9/14, 6:34 AM, Jakob Ovrum wrote:
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be better names
than `raw` and `decode`, to match the already existing `byGrapheme` in
std.uni.
There already is a std.uni.byCodePoint. It is a
On Sunday, 9 March 2014 at 15:23:57 UTC, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
Can we look at some example situations that this
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu
wrote:
On 3/9/14, 4:34 AM, Peter Alexander wrote:
I think this is the main confusion: the belief that iterating
by code
point has utility.
If you care about normalization then neither by code unit, by
code
point, nor by grapheme
On 3/9/14, 9:02 AM, bearophile wrote:
Some time ago I even asked for a helper function:
https://d.puremagic.com/issues/show_bug.cgi?id=10162
I commented on that and preapproved it.
Andrei
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
So IIUC iterating over s.byChar would not encounter the decoding-related
speed hits that Walter is concerned about?
That is
On 3/9/14, 10:34 AM, Peter Alexander wrote:
If we assume strings are normalized then substring search, equality
testing, sorting all work the same with either code units or code points.
But others such as edit distance or equal(some_string, some_wstring)
will not.
If you don't care about
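The `equal(some_string, some_wstring)` case mentioned above can be sketched in D as follows. This assumes a Phobos release that still auto-decodes narrow strings and that ships `std.utf.byCodeUnit` (which may postdate this thread); the helper names `sameByCodePoint` and `sameByCodeUnit` are invented for illustration.

```d
import std.algorithm.comparison : equal;
import std.utf : byCodeUnit;

/// Compares after auto-decoding both sides to dchar.
bool sameByCodePoint(string a, wstring b)
{
    return equal(a, b);
}

/// Compares raw code units: UTF-8 bytes against UTF-16 units.
bool sameByCodeUnit(string a, wstring b)
{
    return equal(a.byCodeUnit, b.byCodeUnit);
}

void main()
{
    string  s = "h\u00E9";  // UTF-8: 'h' plus two bytes for U+00E9
    wstring w = "h\u00E9"w; // UTF-16: 'h' plus one unit for U+00E9

    assert(sameByCodePoint(s, w)); // same code points
    assert(!sameByCodeUnit(s, w)); // different encodings, different units
}
```

Comparing across encodings is one of the cases where the code unit view and the code point view genuinely give different answers.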
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu
wrote:
wc
What should wc produce on a Sanskrit text?
The problem is that such questions quickly become philosophical.
(Generally: I've always been very very very doubtful about
arguments that start with I can't think of... because
09-Mar-2014 21:45, Andrei Alexandrescu wrote:
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
So IIUC iterating over s.byChar would not encounter the
decoding-related
speed
09-Mar-2014 21:16, Andrei Alexandrescu wrote:
On 3/9/14, 4:34 AM, Peter Alexander wrote:
I think this is the main confusion: the belief that iterating by code
point has utility.
If you care about normalization then neither by code unit, by code
point, nor by grapheme are correct (except in
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu
wrote:
On 3/9/14, 10:34 AM, Peter Alexander wrote:
If we assume strings are normalized then substring search,
equality
testing, sorting all work the same with either code units or
code points.
But others such as edit distance or
On 3/9/14, 8:18 AM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote:
On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
What exactly is the consensus? From your wiki page I see One of
On 3/9/14, 11:14 AM, Dmitry Olshansky wrote:
09-Mar-2014 21:45, Andrei Alexandrescu wrote:
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
So IIUC iterating over s.byChar
On 3/9/14, 11:19 AM, Peter Alexander wrote:
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
On 3/9/14, 10:34 AM, Peter Alexander wrote:
If we assume strings are normalized then substring search, equality
testing, sorting all work the same with either code units or code
09-Mar-2014 07:53, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
I don't understand this argument. Iterating by code unit is not
meaningless if you don't want to extract meaning from each unit
iteration. For example, if you're parsing JSON or XML,
09-Mar-2014 21:54, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
wc
What should wc produce on a Sanskrit text?
The problem is that such questions quickly become philosophical.
Technically it could use a word-breaking algorithm for words.
Or
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
09-Mar-2014 07:53, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
I don't understand this argument. Iterating by code unit is not
meaningless if you don't want to extract meaning from each unit
On Sunday, 9 March 2014 at 14:57:32 UTC, Peter Alexander wrote:
You have mentioned case-insensitive searching, but I think I've
adequately demonstrated that this doesn't work in general by
code point: you need to normalize and take locales into account.
I don't understand your argument.
09-Mar-2014 22:41, Andrei Alexandrescu wrote:
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
This. Anyhow searching dchar makes sense for _some_ languages, the
problem is that it shouldn't decode the whole string but rather encode
the needle properly and search that.
That's just an
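The idea of encoding the needle instead of decoding the haystack can be sketched like this. It is only an illustration using existing Phobos calls (`std.utf.encode`, `std.string.representation`, `std.algorithm.searching.countUntil`); the function name `indexOfEncoded` is hypothetical and this is not how Phobos itself implements searching.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.utf : encode;

/// Returns the code-unit index of needle in haystack, or -1 if absent.
/// The haystack is never decoded; only the needle is encoded, once.
ptrdiff_t indexOfEncoded(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // UTF-8 encode the needle
    // Search the raw bytes for the encoded byte sequence.
    return haystack.representation
                   .countUntil(buf[0 .. len].representation);
}

void main()
{
    string s = "na\u00EFve"; // 'n' 'a' U+00EF (2 bytes) 'v' 'e'

    assert(indexOfEncoded(s, '\u00EF') == 2);
    assert(indexOfEncoded(s, 'v') == 4); // index counts bytes, not code points
    assert(indexOfEncoded(s, 'z') == -1);
}
```

A byte-level search is safe for valid UTF-8 because the encoding is self-synchronizing: the encoded form of one code point cannot match in the middle of another. It does not address normalization, which is a separate concern raised elsewhere in the thread.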
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
Okay putting potential breakage aside.
Let me sketch an additive way of improving the current situation.
Now you're talking.
1. Say we recognize any indexable entity of char/wchar/dchar, that
however has .front returning a dchar as a narrow string.
On Sunday, 9 March 2014 at 19:40:32 UTC, Andrei Alexandrescu
wrote:
6. Take into account ASCII and maybe other alphabets? Should
be as
trivial as .assumeASCII and then on you march with all of
std.algo/etc.
Walter is against that. His main argument is that UTF already
covers ASCII with only
09-Mar-2014 23:40, Andrei Alexandrescu wrote:
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
Okay putting potential breakage aside.
Let me sketch an additive way of improving the current situation.
Now you're talking.
1. Say we recognize any indexable entity of char/wchar/dchar, that
however
On 3/9/2014 1:26 PM, Andrei Alexandrescu wrote:
On 3/9/14, 6:34 AM, Jakob Ovrum wrote:
`byCodeUnit` is essentially std.string.representation.
Actually not because for reasons that are unclear to me people really
want the individual type to be char, not ubyte.
Probably because char *is*
On 3/9/2014 11:21 AM, Vladimir Panteleev wrote:
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
- In lots of places, I've discovered that Phobos did UTF decoding
(thus murdering performance) when it didn't need to. Such cases
included format (now fixed), appender (now fixed), startsWith
On 3/8/2014 9:15 PM, Michel Fortin wrote:
Text is an interesting topic for never-ending discussions.
It's also a good example for when non-programmers are surprised to hear
that I *don't* see the world as binary black and white *because* of my
programming experience ;)
Problems like
On 3/9/2014 7:47 AM, w0rp wrote:
My knowledge of Unicode pretty much just comes from having
to deal with foreign language customers and discovering the problems
with the code unit abstraction most languages seem to use. (Java and
Python suffer from similar issues, but they don't really have
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw`
and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring,
dstring,
On 3/9/2014 6:34 AM, Jakob Ovrum wrote:
`byCodeUnit` is essentially std.string.representation.
Not at all. std.string.representation takes a string and casts it to the
corresponding ubyte, ushort, uint string.
It doesn't work at all with InputRange!char
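What `std.string.representation` does to an array-backed string can be shown in a few lines of D. This is a minimal sketch; the helper name `utf8Bytes` is made up for the example.

```d
import std.string : representation;

/// Reinterprets a string as its underlying UTF-8 bytes (no copy, no decoding).
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    string s = "h\u00E9";
    auto bytes = utf8Bytes(s);
    assert(bytes.length == 3);
    assert(bytes[0] == 'h');
    assert(bytes[1] == 0xC3 && bytes[2] == 0xA9); // UTF-8 for U+00E9

    // For a wstring the result is an array of ushort.
    wstring w = "h\u00E9"w;
    immutable(ushort)[] units = w.representation;
    assert(units.length == 2 && units[1] == 0x00E9);
}
```

Because `representation` takes an array, it does not help when the input is a generic input range of `char`, which is the limitation pointed out above.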
On 3/9/2014 6:31 PM, Walter Bright wrote:
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be better names
than `raw`
and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar',
On 3/10/2014 12:19 AM, Nick Sabalausky wrote:
(str|wchar|dchar).byChar // Always range of char
(str|wchar|dchar).byWchar // Always range of wchar
(str|wchar|dchar).byDchar // Always range of dchar
Erm, naturally I meant (str|wstr|dstr)
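The "always a range of char / wchar / dchar" behaviour proposed above can be tried out with the `byChar`, `byWchar` and `byDchar` functions found in `std.utf` in current Phobos releases. This sketch assumes such a release (they may not have existed when this message was written); the counting helpers `countChars`, `countWchars` and `countDchars` are invented for the example.

```d
import std.range : walkLength;
import std.utf : byChar, byWchar, byDchar;

/// Code-unit counts after transcoding any string type to UTF-8/16/32.
size_t countChars(S)(S s)  { return s.byChar.walkLength; }
size_t countWchars(S)(S s) { return s.byWchar.walkLength; }
size_t countDchars(S)(S s) { return s.byDchar.walkLength; }

void main()
{
    string  s = "h\u00E9";
    wstring w = "h\u00E9"w;
    dstring d = "h\u00E9"d;

    // Whatever the source encoding, byChar yields UTF-8 code units...
    assert(countChars(s) == 3 && countChars(w) == 3 && countChars(d) == 3);
    // ...byWchar yields UTF-16 code units...
    assert(countWchars(s) == 2 && countWchars(w) == 2 && countWchars(d) == 2);
    // ...and byDchar yields code points.
    assert(countDchars(s) == 2 && countDchars(w) == 2 && countDchars(d) == 2);
}
```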
On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
On 3/9/2014 6:31 PM, Walter Bright wrote:
On 3/9/2014 6:08 AM, Marc Schütz schue...@gmx.net wrote:
Also, `byCodeUnit` and `byCodePoint` would probably be better names
than `raw`
and `decode`, to match the already existing `byGrapheme` in std.uni.
08-Mar-2014 05:23, Andrei Alexandrescu wrote:
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
08-Mar-2014 12:09, Dmitry Olshansky wrote:
08-Mar-2014 05:23, Andrei Alexandrescu wrote:
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
No, it doesn't.
import
08-Mar-2014 05:18, Andrei Alexandrescu wrote:
On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
07-Mar-2014 23:57, Andrei Alexandrescu wrote:
On 3/6/14, 6:37 PM, Walter Bright wrote:
In Lots of low hanging fruit in Phobos the issue came up about the
automatic encoding and decoding of char ranges.
On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote:
Vladimir Panteleev:
It's not about types, it's about algorithms.
Given sufficiently refined types, it can be about types :-)
Bye,
bearophile
I think Bear is onto something, we already solved an analogous
problem in an elegant
On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
08-Mar-2014 12:09, Dmitry Olshansky wrote:
08-Mar-2014 05:23, Andrei Alexandrescu wrote:
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev
On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
08-Mar-2014 05:23, Andrei Alexandrescu wrote:
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
No, it doesn't.
import