Re: On the meaning of string.length

2014-11-21 Thread Dmitry Olshansky via Digitalmars-d-announce


20-Nov-2014 16:50, Adam D. Ruppe wrote:

On Wednesday, 19 November 2014 at 21:00:50 UTC, Ary Borenszweig wrote:

In Ruby `length` returns the number of Unicode characters


What is a Unicode character? Even in UTF-32, one printed character might
be made up of two Unicode code points. Or sometimes, two printed
characters might come from a single code point.


Perl goes for grapheme cluster as character. I'd say that's probably the 
closest thing to it.


Sadly, being a systems language, we can't go so far as to create a
per-process table of cached graphemes and then use an index into that
table as a "character" ;)
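
To make the "table of cached graphemes" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `simple_clusters` and `GraphemeTable` are hypothetical names, and the splitter only attaches combining marks to the preceding code point, which is far short of real grapheme cluster segmentation (UAX #29).

```python
import unicodedata


def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    A rough approximation only, NOT full UAX #29 grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


class GraphemeTable:
    """Hypothetical per-process table: each distinct cluster gets one index."""

    def __init__(self):
        self._index = {}     # cluster -> table index
        self._clusters = []  # table index -> cluster

    def intern(self, cluster):
        if cluster not in self._index:
            self._index[cluster] = len(self._clusters)
            self._clusters.append(cluster)
        return self._index[cluster]

    def encode(self, text):
        return [self.intern(c) for c in simple_clusters(text)]

    def decode(self, ids):
        return "".join(self._clusters[i] for i in ids)


table = GraphemeTable()
word = "e\u0301te\u0301"   # "été" spelled with combining accents, 5 code points
ids = table.encode(word)
print(len(word), ids)       # 5 [0, 1, 0]
```

With such a table the length of `ids` matches the number of printed characters, at the cost of a global, growing lookup structure.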


--
Dmitry Olshansky

Re: On the meaning of string.length

2014-11-20 Thread Adam D. Ruppe via Digitalmars-d-announce

On Wednesday, 19 November 2014 at 21:00:50 UTC, Ary Borenszweig 
wrote:

In Ruby `length` returns the number of Unicode characters


What is a Unicode character? Even in UTF-32, one printed
character might be made up of two Unicode code points. Or
sometimes, two printed characters might come from a single code
point.
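
Both cases can be checked with a short Python sketch (Python's `len` on a `str` counts code points). The sample strings are my own choice, not taken from the thread: "e" plus a combining acute accent for the first case, and the "fi" ligature U+FB01 for the second.

```python
import unicodedata

# Case 1: one printed character, two code points.
decomposed = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # the same printed character as a single code point
print(len(decomposed), len(precomposed))  # 2 1

# NFC normalization composes the pair into the single code point,
# so the two spellings compare equal only after normalizing.
print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Case 2: two printed letters, one code point.
ligature = "\ufb01"      # U+FB01 LATIN SMALL LIGATURE FI
print(len(ligature))                                  # 1
print(unicodedata.normalize("NFKC", ligature))        # fi
print(len(unicodedata.normalize("NFKC", ligature)))   # 2
```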

Re: On the meaning of string.length

2014-11-19 Thread Walter Bright via Digitalmars-d-announce


On 11/19/2014 7:06 AM, Upvoter wrote:

On Wednesday, 19 November 2014 at 14:33:05 UTC, Adam D. Ruppe wrote:

I think the auto decoding in phobos was and is a mistake.

I agree when you say auto decoding is a good choice.


Uh-oh!

Re: On the meaning of string.length

2014-11-19 Thread Ary Borenszweig via Digitalmars-d-announce


On 11/19/14, 11:33 AM, Adam D. Ruppe wrote:

I answered a random C# stackoverflow question about why string.length
returns the value it does with some rationale defending code units
instead of "characters" - basically, I typed up a defense of D's
string-as-array behavior.


In Ruby `length` returns the number of Unicode characters and `bytesize`
returns the number of bytes. I prefer this use of the names.
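
The same split can be sketched in Python rather than Ruby, as an analogue only: `len` on a `str` counts code points, and the byte count depends on the encoding you pick. The helper names `char_length` and `byte_size` are hypothetical, chosen to mirror the Ruby method names.

```python
def char_length(s):
    """Number of code points, roughly what Ruby's `length` reports."""
    return len(s)


def byte_size(s, encoding="utf-8"):
    """Number of bytes in the given encoding, like Ruby's `bytesize`."""
    return len(s.encode(encoding))


for text in ["hello", "na\u00efve", "\u65e5\u672c\u8a9e"]:
    print(text, char_length(text), byte_size(text))
# hello 5 5
# naïve 5 6
# 日本語 3 9
```

Note that "number of code points" is still not "number of printed characters" once combining marks are involved.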

Re: On the meaning of string.length

2014-11-19 Thread Upvoter via Digitalmars-d-announce

On Wednesday, 19 November 2014 at 14:33:05 UTC, Adam D. Ruppe
wrote:
I answered a random C# stackoverflow question about why
string.length returns the value it does with some rationale
defending code units instead of "characters" - basically, I
typed up a defense of D's string-as-array behavior.

To my surprise, my answer got an enormous number of votes* so I
decided to post it to reddit too.

http://www.reddit.com/r/programming/comments/2mqghp/why_does_stringlength_count_code_units_instead_of/

This is really encouraging to me that there's been such a
positive response. The question every so often comes up here
too, people saying string.length should give number of
characters, and of course, we have the automatic UTF decoding
done in Phobos that comes up from time to time.

It looks like D, the language, made the right decisions here.

This reddit comment applies to the phobos thing though:

"Most people like to pick on surrogate pairs here, and decry
languages which don't handle them "properly", but I think it's
important to point out that handling surrogate pairs as a
single character doesn't in any way fix the underlying issue --
many multiple-codepoint sequences are a single logical glyph
even if you use 32 bit wide chars."
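
A small Python sketch of that point, with helper names (`utf16_units`, `utf32_units`) and sample strings that are my own assumptions: widening the code unit to 32 bits removes the surrogate pair, but a base letter plus a combining mark still takes two units.

```python
def utf16_units(s):
    """Count 16-bit code units, the metric C#'s string.Length reports."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s):
    """Count 32-bit code units, i.e. code points."""
    return len(s.encode("utf-32-le")) // 4


emoji = "\U0001F600"   # outside the BMP: needs a surrogate pair in UTF-16
accented = "e\u0301"   # "e" + combining acute: one glyph, two code points

print(utf16_units(emoji), utf32_units(emoji))        # 2 1
print(utf16_units(accented), utf32_units(accented))  # 2 2
```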

I know this has been said a lot of times... but I think the
auto decoding in phobos was and is a mistake. The bigger
question is what I posited on stackoverflow: "Moreover, what's
the point? Why does these metrics matter?" Similarly with
std.algorithm on strings, why would you ever want to call sort
on a string? Well, I can think of a few reasons, like checking
on the frequency of letters, but I think we should see what
happens if Phobos changes from autodecoding to compile error
when that would occur. Then we can fix it by casting to
.representation or whatever to work with code units or manually
adding a .utfDecode to work with dchars and make the decision
explicitly.

That'd offer a way forward and I suspect would break less code
than we might think.
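
As an illustration of making that decision explicitly, here is a Python analogue, not D: Python has no `.representation` or `.utfDecode`, so `str.encode` stands in for working with code units and plain `str` iteration for working with decoded code points. The sample text is an assumption.

```python
from collections import Counter

text = "\u00e9t\u00e9"  # "été" with precomposed accents

# Explicit choice 1: work on UTF-8 code units (bytes).
code_units = text.encode("utf-8")
print(len(code_units))  # 5
# Sorting code units scrambles multi-byte sequences, so the result
# is no longer valid UTF-8.
try:
    bytes(sorted(code_units)).decode("utf-8")
    sorted_units_valid = True
except UnicodeDecodeError:
    sorted_units_valid = False
print(sorted_units_valid)  # False

# Explicit choice 2: work on decoded code points.
print(len(text))                    # 3
print("".join(sorted(text)))        # téé
print(Counter(text).most_common(1)) # [('é', 2)]
```

Sorting or counting only makes sense at the code point level here, which is the kind of decision the proposal would force the caller to spell out.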

* stack overflow votes are a silly thing, a somewhat easy
answer like this gets a bazillion whereas difficult questions
with difficult answers get me one, maybe two votes. oh well.

One more upvote.
I agree when you say auto decoding is a good choice.

Additionally, it gives good compatibility with the Linux API,
unlike the Windows API, whose Unicode versions take WideChars as
string parameters (always two bytes).

And finally, for someone like me who writes software for his own
use, UTF-8 doesn't change anything since I'm French and every
char fits in one byte...
