Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread H. S. Teoh
On Fri, Apr 27, 2012 at 04:12:25AM +0200, Matt Peterson wrote:
 On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:
 When I get the time? Hah... I really need to get my lazy bum back to
 working on the new AA implementation first. I think that would
 contribute greater value than optimizing Unicode algorithms. :-) I
 was hoping *somebody* would be inspired by my idea and run with it...
 
 I actually recently wrote a lexer generator for D that wouldn't be
 that hard to adapt to something like this.

That's awesome! Would you like to give it a shot? ;-)

Also, I'm in love with lexer generators... I'd love to make good use of
your lexer generator if the code is available somewhere.


T

-- 
Nothing in the world is more distasteful to a man than to take the path
that leads to himself. -- Herman Hesse


Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread H. S. Teoh
On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
[...]
 Crazy stuff! Some of them look rather similar to Arabic or Korean's
 Hangul (sp?), at least to my untrained eye. And then others are just
 *really* interesting-looking, like:
 
 http://www.omniglot.com/writing/12480.htm
 http://www.omniglot.com/writing/ayeri.htm
 http://www.omniglot.com/writing/oxidilogi.htm
 
 You're right though, if I were in charge of Unicode and tasked with
 handling some of those, I think I'd just say "Screw it. Unicode is now
 deprecated. Use ASCII instead. Doesn't have the characters for your
 language? Tough! Fix your language!" :)

You think that's crazy, huh? Check this out:

http://www.omniglot.com/writing/sumerian.htm

Now take a deep breath...

... this writing was *actually used* in ancient times. Yeah.

Which means it probably has a Unicode block assigned to it, right now.
:-)


  When I get the time? Hah... I really need to get my lazy bum back to
  working on the new AA implementation first. I think that would
  contribute greater value than optimizing Unicode algorithms. :-) I
  was hoping *somebody* would be inspired by my idea and run with
  it...
 
 
 Heh, yea. It is a tempting project, but my plate's overflowing too.
 (Now if only I could make the same happen to my bank account...!)
[...]

On the other hand though, sometimes it's refreshing to take a break from
serious low-level core-language D code, and just write plain ole "normal
boring" application code in D. It's good to be reminded just how easy
and pleasant it is to write application code in D.

For example, just today I was playing around with a regex-based version
of formattedRead: you pass in a regex and a bunch of pointers, and the
function uses compile-time introspection to convert regex matches into
the correct value types. So you could call it like this:

int year;
string month;
int day;
regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, year, month, day);

Basically, each pair of parentheses corresponds with a pointer argument;
non-capturing parentheses (?:) can be used for grouping without
assigning to an item.
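A minimal sketch of how such a regexRead could work (hypothetical: the name and signature are taken from this post, not from an actual Phobos function; this version leans on std.regex.matchFirst and std.conv.to, and skips the fromString hook described below):

```d
import std.conv : to;
import std.regex : matchFirst, regex;

// Hypothetical sketch: bind each capture group to the corresponding
// ref argument, converting the matched substring via std.conv.to.
void regexRead(Args...)(string input, string pattern, ref Args args)
{
    auto m = matchFirst(input, regex(pattern));
    assert(!m.empty, "input did not match pattern");
    foreach (i, ref arg; args)
        arg = to!(Args[i])(m[i + 1]);  // m[0] is the whole match
}

void main()
{
    int year, day;
    string month;
    regexRead("2012 April 27", `(\d{4})\s+(\w+)\s+(\d{2})`, year, month, day);
    assert(year == 2012 && month == "April" && day == 27);
}
```

Note that non-capturing groups `(?:...)` fall out for free here, since only capturing groups appear among `m[1 .. $]`.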

Its current implementation is still kinda crude, but it does support
assigning to user-defined types if you define a fromString() method that
does the requisite conversion from the matching substring.

The next step is to standardize on enums in user-defined types that
specify a regex substring to be used for matching items of that type, so
that the caller doesn't have to know what kind of string pattern is
expected by fromString(). I envision something like this:

struct MyDate {
    enum stdFmt = `(\d{4}-\d{2}-\d{2})`;
    enum americanFmt = `(\d{2}-\d{2}-\d{4})`;
    static MyDate fromString(Char)(Char[] value) { ... }
}
...
string label1, label2;
MyDate dt1, dt2;
regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.stdFmt~`\s*$`, label1, dt1);
regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.americanFmt~`\s*$`, label2, dt2);

So the user can specify, in the regex, which date format to use in
parsing the dates.

I think this is a vast improvement over the current straitjacketed
formattedRead. ;-) And it's so much fun to code (and use).


T

-- 
Let X be the set not defined by this sentence...


Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Brad Anderson

On Friday, 27 April 2012 at 00:25:44 UTC, H. S. Teoh wrote:

On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:
H. S. Teoh hst...@quickfur.ath.cx wrote in message
news:mailman.2173.1335475413.4860.digitalmar...@puremagic.com...

[...]
 And don't forget that some code points (notably from the CJK block)
 are specified as double-width, so if you're trying to do text
 layout, you'll want yet a different length (layoutLength?).

Correction: the official term for this is "full-width" (as opposed to
the "half-width" of the typical European scripts).

 Interesting. Kinda makes sense that such a thing exists, though: The
 CJK characters (even the relatively simple Japanese *kanas) are
 detailed enough that they need to be larger to achieve the same
 readability. And that's the *non*-double-length ones. So I don't
 doubt there are ones that need to be tagged as "Draw Extra Big!!" :)

Have you seen U+9F98? It's an insanely convoluted glyph composed of
*three copies* of an already extremely complex glyph.

http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

(And yes, that huge thing is supposed to fit inside a SINGLE
character... what *were* those ancient Chinese scribes thinking?!)

 For example, I have my font size in Windows Notepad set to a
 comfortable value. But when I want to use hiragana or katakana, I
 have to go into the settings and increase the font size so I can
 actually read it (well, to what *little* extent I can even read it
 in the first place ;) ). And those kanas tend to be among the
 simplest CJK characters.

 (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
 never for real coding/writing.)

LOL... love the fact that you felt obligated to justify your use of
notepad. :-P

 So we really need all four lengths. Ain't unicode fun?! :-)

 No kidding. The *one* thing I really, really hate about Unicode is
 the fact that most (if not all) of its complexity actually *is*
 necessary.

We're lucky the more imaginative scribes of the world have either been
dead for centuries or have restricted themselves to writing fictional
languages. :-) The inventions of the dead ones have been codified and
simplified by the unfortunate people who inherited their overly complex
systems (*cough*CJK glyphs*cough), and the inventions of the living ones
are largely ignored by the world due to the fact that, well, their
scripts are only useful for writing fictional languages. :-)

So despite the fact that there is still some crazy convoluted stuff out
there, such as Arabic or Indic scripts with pair-wise substitution rules
in Unicode, overall things are relatively tame. At least the
subcomponents of CJK glyphs are no longer productive (actively being
used to compose new characters by script users) -- can you imagine the
insanity if Unicode had to support composition by those radicals and
subparts? Or if Unicode had to support a script like this one:

http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

whose components are graphically composed in, shall we say, entirely
non-trivial ways (see the composed samples at the bottom of the page)?

 Unicode *itself* is indisputably necessary, but I sure do miss ASCII.

In an ideal world, where memory is not an issue and bus width is
indefinitely wide, a Unicode string would simply be a sequence of
integers (of arbitrary size). Things like combining diacritics, etc.,
would have dedicated bits/digits for representing them, so there's no
need of the complexity of UTF-8, UTF-16, etc. Everything fits into a
single character. Every possible combination of diacritics on every
possible character has a unique representation as a single integer.
String length would be equal to glyph count.

In such an ideal world, screens would also be of indefinitely detailed
resolution, so anything can fit inside a single grid cell, so there's
no need of half-width/double-width distinctions. You could port ancient
ASCII-centric C code just by increasing sizeof(char), and things would
Just Work.

Yeah I know. Totally impossible. But one can dream, right? :-)

[...]
 I've been thinking about unicode processing recently. Traditionally,
 we have to decode narrow strings into UTF-32 (aka dchar) then do
 table lookups and such. But unicode encoding and properties, etc.,
 are static information (at least within a single unicode release).
 So why bother with hardcoding tables and stuff at all?

 What we *really* should be doing, esp. for commonly-used functions
 like computing various lengths, is to automatically process said
 tables and encode the computation in finite-state machines that can
 then be optimized at the FSM level (there are known algos for
 generating optimal FSMs), codegen'd, and then optimized again at the
 assembly level by the compiler. These FSMs will operate at the
 native narrow string char type level, so that there will be no need
 for explicit 

Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Dmitry Olshansky

On 27.04.2012 5:36, H. S. Teoh wrote:

On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
[...]

Heh, any usage of Notepad *needs* to be justified. For example, it has an
undo buffer of exactly ONE change.




Come on, notepad is really nice at one job only: getting rid of the style 
and fonts of a copied text fragment. I use it as a clean-up scratch pool 
daily. Would be a shame if they ever added fonts and layout to it ;)



--
Dmitry Olshansky


Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Dmitry Olshansky

On 27.04.2012 1:23, H. S. Teoh wrote:

On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:

James Miller ja...@aatch.net wrote in message
news:qdgacdzxkhmhojqce...@forum.dlang.org...

I'm writing an introduction/tutorial to using strings in D, paying
particular attention to the complexities of UTF-8 and 16. I realised
that when you want the number of characters, you normally actually
want to use walkLength, not length. Is it reasonable for the
compiler to pick this up during semantic analysis and point out this
situation?

It's just a thought because a lot of the time, using length will get
the right answer, but for the wrong reasons, resulting in lurking
bugs. You can always cast to immutable(ubyte)[] or
immutable(short)[] if you want to work with the actual bytes anyway.


I find that most of the time I actually *do* want to use length. Don't
know if that's common, though, or if it's just a reflection of my
particular use-cases.

Also, keep in mind that (unless I'm mistaken) walkLength does *not*
return the number of characters (ie, graphemes), but merely the
number of code points - which is not the same thing (due to the existence
of the [confusingly-named] combining characters).

[...]

And don't forget that some code points (notably from the CJK block) are
specified as double-width, so if you're trying to do text layout,
you'll want yet a different length (layoutLength?).

So we really need all four lengths. Ain't unicode fun?! :-)

Array length is simple. walkLength is already implemented. Grapheme
length requires recognition of 'combining characters' (or rather,
ignoring said characters), and layout length requires recognizing
widthless, single- and double-width characters.

I've been thinking about unicode processing recently. Traditionally, we
have to decode narrow strings into UTF-32 (aka dchar) then do table
lookups and such. But unicode encoding and properties, etc., are static
information (at least within a single unicode release). So why bother
with hardcoding tables and stuff at all?


Of course they are generated.



What we *really* should be doing, esp. for commonly-used functions like
computing various lengths, is to automatically process said tables and
encode the computation in finite-state machines that can then be
optimized at the FSM level (there are known algos for generating optimal
FSMs),


FSAs are based on tables, so it all runs in a circle - only the layout 
changes. Yet the speed gains of non-decoding are huge.


codegen'd, and then optimized again at the assembly level by the
compiler. These FSMs will operate at the native narrow string char type
level, so that there will be no need for explicit decoding.

The generation algo can then be run just once per unicode release, and
everything will Just Work.


This year Unicode in D will receive a nice upgrade.
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#

Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)

--
Dmitry Olshansky


Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Nick Sabalausky
Dmitry Olshansky dmitry.o...@gmail.com wrote in message 
news:jndkji$23ni$2...@digitalmars.com...
 On 27.04.2012 5:36, H. S. Teoh wrote:
 On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
 [...]
 Heh, any usage of Notepad *needs* to be justified. For example, it has 
 an
 undo buffer of exactly ONE change.


 Come on, notepad is really nice at one job only: getting rid of the style 
 and fonts of a copied text fragment. I use it as a clean-up scratch pool 
 daily. Would be a shame if they ever added fonts and layout to it ;)


That's the #1 biggest thing I use it for!! :) And yes, daily.

I frequently wish I had a global "Don't include style in the clipboard" 
setting, and maybe a *separate* "Copy with style" command. Or at least a 
standard "copy without style" or "remove style from clipboard" command. 
*Something*. 99% of the times I copy/paste text I *don't* want to include 
style. Drives me crazy.




Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Nick Sabalausky
H. S. Teoh hst...@quickfur.ath.cx wrote in message 
news:mailman.1.1335507187.22023.digitalmar...@puremagic.com...
 On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
 [...]
 Crazy stuff! Some of them look rather similar to Arabic or Korean's
 Hangul (sp?), at least to my untrained eye. And then others are just
 *really* interesting-looking, like:

 http://www.omniglot.com/writing/12480.htm
 http://www.omniglot.com/writing/ayeri.htm
 http://www.omniglot.com/writing/oxidilogi.htm

 You're right though, if I were in charge of Unicode and tasked with
 handling some of those, I think I'd just say "Screw it. Unicode is now
 deprecated. Use ASCII instead. Doesn't have the characters for your
 language? Tough! Fix your language!" :)

 You think that's crazy, huh? Check this out:

 http://www.omniglot.com/writing/sumerian.htm

 Now take a deep breath...

 ... this writing was *actually used* in ancient times. Yeah.


Jesus, I could *easily* mistake that for hardware schematics. That's wild.




Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Dmitry Olshansky

On 27.04.2012 12:31, Nick Sabalausky wrote:

Dmitry Olshansky dmitry.o...@gmail.com wrote in message
news:jndkji$23ni$2...@digitalmars.com...

 On 27.04.2012 5:36, H. S. Teoh wrote:
 On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
 [...]
 Heh, any usage of Notepad *needs* to be justified. For example, it
 has an undo buffer of exactly ONE change.

 Come on, notepad is really nice at one job only: getting rid of the
 style and fonts of a copied text fragment. I use it as a clean-up
 scratch pool daily. Would be a shame if they ever added fonts and
 layout to it ;)

That's the #1 biggest thing I use it for!! :) And yes, daily.

I frequently wish I had a global "Don't include style in the
clipboard" setting, and maybe a *separate* "Copy with style" command.
Or at least a standard "copy without style" or "remove style from
clipboard" command. *Something*. 99% of the times I copy/paste text I
*don't* want to include style. Drives me crazy.



Yup, I certainly wouldn't mind a separate "copy with my font settings" ;)

--
Dmitry Olshansky


Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread H. S. Teoh
On Fri, Apr 27, 2012 at 12:20:13PM +0400, Dmitry Olshansky wrote:
 On 27.04.2012 1:23, H. S. Teoh wrote:
[...]
 What we *really* should be doing, esp. for commonly-used functions
 like computing various lengths, is to automatically process said
 tables and encode the computation in finite-state machines that can
 then be optimized at the FSM level (there are known algos for
 generating optimal FSMs),
 
 FSAs are based on tables, so it all runs in a circle - only the
 layout changes. Yet the speed gains of non-decoding are huge.

Yes, but hand-coded tables tend to go out of date, be prone to bugs, or
miss optimizations done by an FSA generator (e.g. a lexer generator).
Collapsing FSA states, for example, can greatly reduce table size and
speed things up.


  codegen'd, and then optimized again at the assembly level by the
 compiler. These FSMs will operate at the native narrow string char
 type level, so that there will be no need for explicit decoding.
 
 The generation algo can then be run just once per unicode release,
 and everything will Just Work.
 
 This year Unicode in D will receive a nice upgrade.
 http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#
 
 Anyway, keep me posted if these FSAs ever come to spoil your
 sleep ;)
[...]

One area where autogenerated Unicode algos will be very useful is in
normalization. Unicode normalization is non-trivial, to say the least;
it involves looking up various character properties and performing
mappings between them in a specified order.
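For reference, Phobos did eventually gain a normalization routine (std.uni.normalize, landed after this thread as part of the std.uni overhaul mentioned above); a decomposed sequence composes to a single precomposed code point under NFC:

```d
import std.uni : normalize, NFC;

void main()
{
    string decomposed = "e\u0301";               // 'e' + U+0301 combining acute
    assert(decomposed.length == 3);              // three UTF-8 code units
    string composed = normalize!NFC(decomposed);
    assert(composed == "\u00E9");                // precomposed 'é'
    assert(composed.length == 2);                // two UTF-8 code units
}
```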

If we can encode this process as an FSA, then we can let an automated FSA
optimizer produce code that maps directly between the (non-decoded!)
source string and the target (non-decoded!) normalized string. Similar
things can be done for string concatenation (which requires
arbitrarily-distant scanning in either direction from the joining point,
though in normal use cases the distance should be very short).


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG


Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Nathan M. Swan

On Friday, 27 April 2012 at 06:12:01 UTC, H. S. Teoh wrote:

On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
[...]
 Crazy stuff! Some of them look rather similar to Arabic or Korean's
 Hangul (sp?), at least to my untrained eye. And then others are just
 *really* interesting-looking, like:

 http://www.omniglot.com/writing/12480.htm
 http://www.omniglot.com/writing/ayeri.htm
 http://www.omniglot.com/writing/oxidilogi.htm

 You're right though, if I were in charge of Unicode and tasked with
 handling some of those, I think I'd just say "Screw it. Unicode is
 now deprecated. Use ASCII instead. Doesn't have the characters for
 your language? Tough! Fix your language!" :)

You think that's crazy, huh? Check this out:

http://www.omniglot.com/writing/sumerian.htm

Now take a deep breath...

... this writing was *actually used* in ancient times. Yeah.

Which means it probably has a Unicode block assigned to it, right now.
:-)

It was actually the first human writing ever. Which Phoenician scribe
knew that his innovation of the alphabet would make programming easier
thousands of years later?





Re: Notice/Warning on narrowStrings .length

2012-04-27 Thread Nick Sabalausky
H. S. Teoh hst...@quickfur.ath.cx wrote in message 
news:mailman.1.1335507187.22023.digitalmar...@puremagic.com...

 For example, just today I was playing around with a regex-based version
 of formattedRead: you pass in a regex and a bunch of pointers, and the
 function uses compile-time introspection to convert regex matches into
 the correct value types. So you could call it like this:

 int year;
 string month;
 int day;
 regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, year, month, day);
 [...]

That's pretty cool.




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Nick Sabalausky
James Miller ja...@aatch.net wrote in message 
news:qdgacdzxkhmhojqce...@forum.dlang.org...
 I'm writing an introduction/tutorial to using strings in D, paying 
 particular attention to the complexities of UTF-8 and 16. I realised that 
 when you want the number of characters, you normally actually want to use 
 walkLength, not length. Is it reasonable for the compiler to pick this up 
 during semantic analysis and point out this situation?

 It's just a thought because a lot of the time, using length will get the 
 right answer, but for the wrong reasons, resulting in lurking bugs. You 
 can always cast to immutable(ubyte)[] or immutable(short)[] if you want to 
 work with the actual bytes anyway.

I find that most of the time I actually *do* want to use length. Don't know 
if that's common, though, or if it's just a reflection of my particular 
use-cases.

Also, keep in mind that (unless I'm mistaken) walkLength does *not* return 
the number of characters (ie, graphemes), but merely the number of code 
points - which is not the same thing (due to the existence of the 
[confusingly-named] combining characters).




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Jonathan M Davis
On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
 Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
 the number of characters (ie, graphemes), but merely the number of code
 points - which is not the same thing (due to the existence of the
 [confusingly-named] combining characters).

You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's 
internals) deals with graphemes. It all operates on code points, and strings 
are considered to be ranges of code points, not graphemes. So, as far as 
ranges go, walkLength returns the actual length of the range. That's _usually_ 
the number of characters/graphemes as well, but it's certainly not 100% 
correct. We'll need further unicode facilities in Phobos to deal with that 
though, and I doubt that strings will ever change to be treated as ranges of 
graphemes, since that would be incredibly expensive computationally. We have 
enough performance problems with strings as it is. What we'll probably get is 
extra functions to deal with normalization (and probably something to count 
the number of graphemes) and probably a wrapper type that does deal in 
graphemes.

Regardless, you're right about walkLength returning the number of code points 
rather than graphemes, because strings are considered to be ranges of dchar.
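The three counts in question can be seen side by side (std.uni.byGrapheme arrived in Phobos after this thread, so treat it as a then-future illustration):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' + U+0301 combining acute: one user-perceived character.
    string s = "e\u0301";
    assert(s.length == 3);                 // UTF-8 code units
    assert(s.walkLength == 2);             // code points (dchars)
    assert(s.byGrapheme.walkLength == 1);  // graphemes
}
```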

- Jonathan M Davis


Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread H. S. Teoh
On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
 James Miller ja...@aatch.net wrote in message 
 news:qdgacdzxkhmhojqce...@forum.dlang.org...
  I'm writing an introduction/tutorial to using strings in D, paying
  particular attention to the complexities of UTF-8 and 16. I realised
  that when you want the number of characters, you normally actually
  want to use walkLength, not length. Is it reasonable for the
  compiler to pick this up during semantic analysis and point out this
  situation?
 
  It's just a thought because a lot of the time, using length will get
  the right answer, but for the wrong reasons, resulting in lurking
  bugs. You can always cast to immutable(ubyte)[] or
  immutable(short)[] if you want to work with the actual bytes anyway.
 
 I find that most of the time I actually *do* want to use length. Don't
 know if that's common, though, or if it's just a reflection of my
 particular use-cases.
 
 Also, keep in mind that (unless I'm mistaken) walkLength does *not*
 return the number of characters (ie, graphemes), but merely the
  number of code points - which is not the same thing (due to the existence
 of the [confusingly-named] combining characters).
[...]

And don't forget that some code points (notably from the CJK block) are
specified as double-width, so if you're trying to do text layout,
you'll want yet a different length (layoutLength?).

So we really need all four lengths. Ain't unicode fun?! :-)

Array length is simple. walkLength is already implemented. Grapheme
length requires recognition of 'combining characters' (or rather,
ignoring said characters), and layout length requires recognizing
widthless, single- and double-width characters.
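A rough illustration of what a "layoutLength" might look like (the name comes from this thread, not from Phobos; the width classification below is a tiny hand-picked subset of the real East Asian Width data, purely for demonstration):

```d
import std.uni : isMark;

// Hypothetical layoutLength sketch: combining marks occupy 0 columns,
// a few illustrative full-width ranges occupy 2, everything else 1.
// A real implementation would consume the full East Asian Width table.
size_t layoutLength(string s)
{
    size_t cols = 0;
    foreach (dchar c; s)  // auto-decodes to code points
    {
        if (isMark(c))                        // zero-width combining mark
            continue;
        if ((c >= 0x1100 && c <= 0x115F)      // Hangul Jamo
            || (c >= 0x3000 && c <= 0x303E)   // CJK symbols/punctuation
            || (c >= 0x4E00 && c <= 0x9FFF)   // CJK unified ideographs
            || (c >= 0xFF00 && c <= 0xFF60))  // full-width forms
            cols += 2;
        else
            cols += 1;
    }
    return cols;
}

void main()
{
    assert(layoutLength("abc") == 3);
    assert(layoutLength("日本語") == 6);   // three full-width glyphs
    assert(layoutLength("e\u0301") == 1);  // combining acute adds no width
}
```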

I've been thinking about unicode processing recently. Traditionally, we
have to decode narrow strings into UTF-32 (aka dchar) then do table
lookups and such. But unicode encoding and properties, etc., are static
information (at least within a single unicode release). So why bother
with hardcoding tables and stuff at all?

What we *really* should be doing, esp. for commonly-used functions like
computing various lengths, is to automatically process said tables and
encode the computation in finite-state machines that can then be
optimized at the FSM level (there are known algos for generating optimal
FSMs), codegen'd, and then optimized again at the assembly level by the
compiler. These FSMs will operate at the native narrow string char type
level, so that there will be no need for explicit decoding.

The generation algo can then be run just once per unicode release, and
everything will Just Work.
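As a degenerate example of the idea, the code-point count of a UTF-8 string reduces to a one-state byte classifier - exactly the kind of machine such a generator would boil the tables down to, shown here hand-written for illustration:

```d
// Count code points without decoding: in UTF-8, continuation bytes
// have the form 0b10xxxxxx; every other byte starts a new code point.
size_t codePointCount(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)          // iterate raw code units, no decoding
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++n;
    return n;
}

void main()
{
    assert(codePointCount("hello") == 5);
    assert(codePointCount("héllo") == 5);  // 'é' is two bytes
}
```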


T

-- 
Give me some fresh salted fish, please.


Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Nick Sabalausky
Jonathan M Davis jmdavisp...@gmx.com wrote in message 
news:mailman.2166.1335463456.4860.digitalmar...@puremagic.com...
 On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
 Also, keep in mind that (unless I'm mistaken) walkLength does *not* 
 return
 the number of characters (ie, graphemes), but merely the number of code
 points - which is not the same thing (due to the existence of the
 [confusingly-named] combining characters).

 You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's
 internals) deals with graphemes. It all operates on code points, and 
 strings
 are considered to be ranges of code points, not graphemes. So, as far as
 ranges go, walkLength returns the actual length of the range. That's 
 _usually_
 the number of characters/graphemes as well, but it's certainly not 100%
 correct. We'll need further unicode facilities in Phobos to deal with that
 though, and I doubt that strings will ever change to be treated as ranges 
 of
 graphemes, since that would be incredibly expensive computationally. We 
 have
 enough performance problems with strings as it is. What we'll probably get 
 is
 extra functions to deal with normalization (and probably something to 
 count
 the number of graphemes) and probably a wrapper type that does deal in
 graphemes.


Yea, I'm not saying that walkLength should deal with graphemes. Just that if 
someone wants "the number of characters", then neither length *nor* 
walkLength is guaranteed to be correct.




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Nick Sabalausky
H. S. Teoh hst...@quickfur.ath.cx wrote in message 
news:mailman.2173.1335475413.4860.digitalmar...@puremagic.com...
 On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
 James Miller ja...@aatch.net wrote in message
 news:qdgacdzxkhmhojqce...@forum.dlang.org...
  I'm writing an introduction/tutorial to using strings in D, paying
  particular attention to the complexities of UTF-8 and 16. I realised
  that when you want the number of characters, you normally actually
  want to use walkLength, not length. Is it reasonable for the
  compiler to pick this up during semantic analysis and point out this
  situation?
 
  It's just a thought because a lot of the time, using length will get
  the right answer, but for the wrong reasons, resulting in lurking
  bugs. You can always cast to immutable(ubyte)[] or
  immutable(short)[] if you want to work with the actual bytes anyway.

 I find that most of the time I actually *do* want to use length. Don't
 know if that's common, though, or if it's just a reflection of my
 particular use-cases.

 Also, keep in mind that (unless I'm mistaken) walkLength does *not*
 return the number of characters (ie, graphemes), but merely the
  number of code points - which is not the same thing (due to the existence
 of the [confusingly-named] combining characters).
 [...]

 And don't forget that some code points (notably from the CJK block) are
 specified as double-width, so if you're trying to do text layout,
 you'll want yet a different length (layoutLength?).


Interesting. Kinda makes sense that such a thing exists, though: The CJK 
characters (even the relatively simple Japanese *kanas) are detailed enough 
that they need to be larger to achieve the same readability. And that's the 
*non*-double-length ones. So I don't doubt there are ones that need to be 
tagged as "Draw Extra Big!!" :)

For example, I have my font size in Windows Notepad set to a comfortable 
value. But when I want to use hiragana or katakana, I have to go into the 
settings and increase the font size so I can actually read it (Well, to what 
*little* extent I can even read it in the first place ;) ). And those kanas 
tend to be among the simplest CJK characters.

(Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for 
real coding/writing).

 So we really need all four lengths. Ain't unicode fun?! :-)


No kidding. The *one* thing I really, really hate about Unicode is the fact 
that most (if not all) of its complexity actually *is* necessary.

Unicode *itself* is indisputably necessary, but I sure do miss ASCII.

 Array length is simple. walkLength is already implemented. Grapheme
 length requires recognition of 'combining characters' (or rather,
 ignoring said characters), and layout length requires recognizing
 widthless, single- and double-width characters.


Yup.

 I've been thinking about unicode processing recently. Traditionally, we
 have to decode narrow strings into UTF-32 (aka dchar) then do table
 lookups and such. But unicode encoding and properties, etc., are static
 information (at least within a single unicode release). So why bother
 with hardcoding tables and stuff at all?

 What we *really* should be doing, esp. for commonly-used functions like
 computing various lengths, is to automatically process said tables and
 encode the computation in finite-state machines that can then be
 optimized at the FSM level (there are known algos for generating optimal
 FSMs), codegen'd, and then optimized again at the assembly level by the
 compiler. These FSMs will operate at the native narrow string char type
 level, so that there will be no need for explicit decoding.

 The generation algo can then be run just once per unicode release, and
 everything will Just Work.


While I find that very interesting... I'm afraid I don't actually understand 
your suggestion :/ (I do understand FSMs and how they work, though.) Could 
you give a little example of what you mean?




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread H. S. Teoh
On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:
 H. S. Teoh hst...@quickfur.ath.cx wrote in message 
 news:mailman.2173.1335475413.4860.digitalmar...@puremagic.com...
[...]
  And don't forget that some code points (notably from the CJK block)
  are specified as double-width, so if you're trying to do text
  layout, you'll want yet a different length (layoutLength?).
 

Correction: the official term for this is "full-width" (as opposed to
the "half-width" of the typical European scripts).


 Interesting. Kinda makes sense that such a thing exists, though: The CJK
 characters (even the relatively simple Japanese *kanas) are detailed
 enough that they need to be larger to achieve the same readability.
 And that's the *non*-double-length ones. So I don't doubt there are ones
 that need to be tagged as "Draw Extra Big!!" :)

Have you seen U+9F98? It's an insanely convoluted glyph composed of
*three copies* of an already extremely complex glyph.

http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

(And yes, that huge thing is supposed to fit inside a SINGLE
character... what *were* those ancient Chinese scribes thinking?!)


 For example, I have my font size in Windows Notepad set to a
 comfortable value. But when I want to use hiragana or katakana, I have
 to go into the settings and increase the font size so I can actually
 read it (Well, to what *little* extent I can even read it in the first
 place ;) ). And those kanas tend to be among the simplest CJK
 characters.
 
 (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
 never for real coding/writing).

LOL... love the fact that you felt obligated to justify your use of
notepad. :-P


  So we really need all four lengths. Ain't unicode fun?! :-)
 
 
 No kidding. The *one* thing I really, really hate about Unicode is the
 fact that most (if not all) of its complexity actually *is* necessary.

We're lucky the more imaginative scribes of the world have either been
dead for centuries or have restricted themselves to writing fictional
languages. :-) The inventions of the dead ones have been codified and
simplified by the unfortunate people who inherited their overly complex
systems (*cough*CJK glyphs*cough), and the inventions of the living ones
are largely ignored by the world due to the fact that, well, their
scripts are only useful for writing fictional languages. :-)

So despite the fact that there are still some crazy convoluted stuff out
there, such as Arabic or Indic scripts with pair-wise substitution rules
in Unicode, overall things are relatively tame. At least the
subcomponents of CJK glyphs are no longer productive (actively being
used to compose new characters by script users) -- can you imagine the
insanity if Unicode had to support composition by those radicals and
subparts? Or if Unicode had to support a script like this one:

http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

whose components are graphically composed in, shall we say, entirely
non-trivial ways (see the composed samples at the bottom of the page)?


 Unicode *itself* is undisputably necessary, but I do sure miss ASCII.

In an ideal world, where memory is not an issue and bus width is
indefinitely wide, a Unicode string would simply be a sequence of
integers (of arbitrary size). Things like combining diacritics, etc.,
would have dedicated bits/digits for representing them, so there's no
need of the complexity of UTF-8, UTF-16, etc.. Everything fits into a
single character. Every possible combination of diacritics on every
possible character has a unique representation as a single integer.
String length would be equal to glyph count.

In such an ideal world, screens would also be of indefinitely detailed
resolution, so anything can fit inside a single grid cell, so there's no
need of half-width/double-width distinctions.  You could port ancient
ASCII-centric C code just by increasing sizeof(char), and things would
Just Work.

Yeah I know. Totally impossible. But one can dream, right? :-)


[...]
  I've been thinking about unicode processing recently. Traditionally,
  we have to decode narrow strings into UTF-32 (aka dchar) then do
  table lookups and such. But unicode encoding and properties, etc.,
  are static information (at least within a single unicode release).
  So why bother with hardcoding tables and stuff at all?
 
  What we *really* should be doing, esp. for commonly-used functions
  like computing various lengths, is to automatically process said
  tables and encode the computation in finite-state machines that can
  then be optimized at the FSM level (there are known algos for
  generating optimal FSMs), codegen'd, and then optimized again at the
  assembly level by the compiler. These FSMs will operate at the
  native narrow string char type level, so that there will be no need
  for explicit decoding.
 
  The generation algo can then be run just once per unicode release,
  and everything will Just Work.
 
 

Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Nick Sabalausky
H. S. Teoh hst...@quickfur.ath.cx wrote in message 
news:mailman.2179.1335486409.4860.digitalmar...@puremagic.com...

 Have you seen U+9F98? It's an insanely convoluted glyph composed of
 *three copies* of an already extremely complex glyph.

 http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

 (And yes, that huge thing is supposed to fit inside a SINGLE
 character... what *were* those ancient Chinese scribes thinking?!)


Yikes!


 For example, I have my font size in Windows Notepad set to a
 comfortable value. But when I want to use hiragana or katakana, I have
 to go into the settings and increase the font size so I can actually
 read it (Well, to what *little* extent I can even read it in the first
 place ;) ). And those kanas tend to be among the simplest CJK
 characters.

 (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
 never for real coding/writing).

 LOL... love the fact that you felt obligated to justify your use of
 notepad. :-P


Heh, any usage of Notepad *needs* to be justified. For example, it has an 
undo buffer of exactly ONE change. And the stupid thing doesn't even handle 
Unix-style newlines. *Everything* handles Unix-style newlines these days, 
even on Windows. Windows *BATCH* files even accept Unix-style newlines, for 
God's sake! But not Notepad.

It is nice in its leanness and no-nonsense-ness. But it desperately needs 
some updates.

At least it actually supports Unicode though. (Which actually I find 
somewhat surprising.)

'Course, this is all XP. For all I know maybe they have finally updated it 
in MS OSX, erm, I mean Vista and Win7...


  So we really need all four lengths. Ain't unicode fun?! :-)
 

 No kidding. The *one* thing I really, really hate about Unicode is the
 fact that most (if not all) of its complexity actually *is* necessary.

 We're lucky the more imaginative scribes of the world have either been
 dead for centuries or have restricted themselves to writing fictional
 languages. :-) The inventions of the dead ones have been codified and
 simplified by the unfortunate people who inherited their overly complex
 systems (*cough*CJK glyphs*cough), and the inventions of the living ones
 are largely ignored by the world due to the fact that, well, their
 scripts are only useful for writing fictional languages. :-)

 So despite the fact that there are still some crazy convoluted stuff out
 there, such as Arabic or Indic scripts with pair-wise substitution rules
 in Unicode, overall things are relatively tame. At least the
 subcomponents of CJK glyphs are no longer productive (actively being
 used to compose new characters by script users) -- can you imagine the
 insanity if Unicode had to support composition by those radicals and
 subparts? Or if Unicode had to support a script like this one:

 http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

 whose components are graphically composed in, shall we say, entirely
 non-trivial ways (see the composed samples at the bottom of the page)?


That's insane!

And yet, very very interesting...


 While I find that very intersting...I'm afraid I don't actually
 understand your suggestion :/ (I do understand FSM's and how they
 work, though) Could you give a little example of what you mean?
 [...]

 Currently, std.uni code (argh the pun!!)

Hah! :)

 is hand-written with tables of
 which character belongs to which class, etc.. These hand-coded tables
 are error-prone and unnecessary. For example, think of computing the
 layout width of a UTF-8 stream. Why waste time decoding into dchar, and
 then doing all sorts of table lookups to compute the width? Instead,
 treat the stream as a byte stream, with certain sequences of bytes
 evaluating to length 2, others to length 1, and yet others to length 0.

 A lexer engine is perfectly suited for recognizing these kinds of
 sequences with optimal speed. The only difference from a real lexer is
 that instead of spitting out tokens, it keeps a running total (layout)
 length, which is output at the end.

 So what we should do is to write a tool that processes Unicode.txt (the
 official table of character properties from the Unicode standard) and
 generates lexer engines that compute various Unicode properties
 (grapheme count, layout length, etc.) for each of the UTF encodings.

 This way, we get optimal speed for these algorithms, plus we don't need
 to manually maintain tables and stuff, we just run the tool on
 Unicode.txt each time there's a new Unicode release, and the correct
 code will be generated automatically.


I see. I think that's a very good observation, and a great suggestion. In 
fact, I'd imagine it'd be considerably simpler than a typical lexer 
generator. Much less of the fancy regexy-ness would be needed. Maybe put 
together a pull request if you get the time...?
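A toy sketch of the decode-free processing described above (Python, purely illustrative; the actual proposal is generated D code driven by the Unicode data tables): counting UTF-8 code points needs no decoding at all, because every byte that is not a continuation byte (10xxxxxx) starts a new code point.

```python
def utf8_length(data: bytes) -> int:
    # A byte of the form 10xxxxxx (i.e. top two bits == 10) continues a
    # multi-byte sequence; every other byte begins a new code point.
    # Counting the non-continuation bytes therefore yields the
    # code-point count with a single pass and no decode to UTF-32.
    return sum((b & 0xC0) != 0x80 for b in data)

s = "héllo"  # 5 code points, 6 UTF-8 bytes ('é' encodes as two bytes)
assert utf8_length(s.encode("utf-8")) == 5
```

This is exactly the kind of byte-class computation a generated FSM would perform, except the real generator would also handle properties like grapheme count and layout width.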




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Jonathan M Davis
On Thursday, April 26, 2012 17:26:40 H. S. Teoh wrote:
 Currently, std.uni code (argh the pun!!) is hand-written with tables of
 which character belongs to which class, etc.. These hand-coded tables
 are error-prone and unnecessary. For example, think of computing the
 layout width of a UTF-8 stream. Why waste time decoding into dchar, and
 then doing all sorts of table lookups to compute the width? Instead,
 treat the stream as a byte stream, with certain sequences of bytes
 evaluating to length 2, others to length 1, and yet others to length 0.
 
 A lexer engine is perfectly suited for recognizing these kinds of
 sequences with optimal speed. The only difference from a real lexer is
 that instead of spitting out tokens, it keeps a running total (layout)
 length, which is output at the end.
 
 So what we should do is to write a tool that processes Unicode.txt (the
 official table of character properties from the Unicode standard) and
 generates lexer engines that compute various Unicode properties
 (grapheme count, layout length, etc.) for each of the UTF encodings.
 
 This way, we get optimal speed for these algorithms, plus we don't need
 to manually maintain tables and stuff, we just run the tool on
 Unicode.txt each time there's a new Unicode release, and the correct
 code will be generated automatically.

That's a fantastic idea! Of course, that leaves the job of implementing it... 
:)

- Jonathan M Davis


Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread H. S. Teoh
On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
[...]
 Heh, any usage of Notepad *needs* to be justified. For example, it has an 
 undo buffer of exactly ONE change.

Don't laugh too hard. The original version of vi also had an undo buffer
of depth 1. In fact, one of the *current* vi's still only has an undo
buffer of depth 1. (Fortunately vim is much much saner.)


 And the stupid thing doesn't even handle Unix-style newlines.
 *Everything* handles Unix-style newlines these days, even on Windows.
 Windows *BATCH* files even accept Unix-style newlines, for 
 God's sake! But not Notepad.
 
 It is nice in its leanness and no-nonsense-ness. But it desperately needs 
 some updates.

Back in the day, my favorite editor ever was Norton Editor. It's tiny
(only about 50k or less, IIRC) yet had innovative (for its day)
features... like split pane editing, ^V which flips capitalization to
EOL (so a single function serves for both uppercasing and lowercasing,
and you just apply it twice to do a single word).  Unfortunately it's a
DOS-only program.  I think it works in the command prompt, but I've
never tested it (the modern windows command prompt is subtly different
from the old DOS command prompt, so things may not quite work as they
used to).

It's ironic how useless Notepad is compared to an ancient DOS program
from the dinosaur age.


 At least it actually supports Unicode though. (Which actually I find 
 somewhat surprising.)

Now in that, at least, it surpasses Norton Editor. :-) But had Norton
not been bought out by Symantec, we'd have a modern, much more powerful
version of NE today. But, oh well. Things have moved on. Vim beats the
crap out of NE, Notepad, and just about any GUI editor out there. It
also beats the snot out of emacs, but I don't want to start *that*
flamewar. :-P


[...]
  http://www.arthaey.com/conlang/ashaille/writing/sarapin.html
 
  whose components are graphically composed in, shall we say, entirely
  non-trivial ways (see the composed samples at the bottom of the
  page)?
 
 
 That's insane!
 
 And yet, very very interesting...

Here's more:

http://www.omniglot.com/writing/conscripts2.htm

Imagine if some of the more complicated scripts there were actually used
in a real language, and Unicode had to support it...  Like this one:

http://www.omniglot.com/writing/talisman.htm

Or, if you *really* wanna go all-out:

http://www.omniglot.com/writing/ssioweluwur.php

(Check out the sample text near the bottom of the page and gape in
awe at what creative minds let loose can produce... and horror at the
prospect of Unicode being required to support it.)


[...]
  Currently, std.uni code (argh the pun!!)
 
 Hah! :)
 
  is hand-written with tables of which character belongs to which
  class, etc.. These hand-coded tables are error-prone and
  unnecessary. For example, think of computing the layout width of a
  UTF-8 stream. Why waste time decoding into dchar, and then doing all
  sorts of table lookups to compute the width? Instead, treat the
  stream as a byte stream, with certain sequences of bytes evaluating
  to length 2, others to length 1, and yet others to length 0.
 
  A lexer engine is perfectly suited for recognizing these kinds of
  sequences with optimal speed. The only difference from a real lexer
  is that instead of spitting out tokens, it keeps a running total
  (layout) length, which is output at the end.
 
  So what we should do is to write a tool that processes Unicode.txt
  (the official table of character properties from the Unicode
  standard) and generates lexer engines that compute various Unicode
  properties (grapheme count, layout length, etc.) for each of the UTF
  encodings.
 
  This way, we get optimal speed for these algorithms, plus we don't
  need to manually maintain tables and stuff, we just run the tool on
  Unicode.txt each time there's a new Unicode release, and the correct
  code will be generated automatically.
 
 
 I see. I think that's a very good observation, and a great suggestion.
 In fact, I'd imagine it'd be considerably simpler than a typical lexer
 generator. Much less of the fancy regexy-ness would be needed. Maybe
 put together a pull request if you get the time...?
[...]

When I get the time? Hah... I really need to get my lazy bum back to
working on the new AA implementation first. I think that would
contribute greater value than optimizing Unicode algorithms. :-) I was
hoping *somebody* would be inspired by my idea and run with it...


T

-- 
What do you mean the Internet isn't filled with subliminal messages? What about 
all those buttons marked submit??


Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Andrej Mitrovic
On 4/27/12, H. S. Teoh hst...@quickfur.ath.cx wrote:
 It's ironic how useless Notepad is compared to an ancient DOS program
 from the dinosaur age.

If you run "edit" in the command prompt or the run dialog (well, assuming
you had a win32 box somewhere), you'd actually get a pretty decent
dos-based editor that is still better than Notepad. It has split
windows, a tab stop setting, and even a whole bunch of color settings.
:P


Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Nick Sabalausky
H. S. Teoh hst...@quickfur.ath.cx wrote in message 
news:mailman.2182.1335490591.4860.digitalmar...@puremagic.com...

 Now in that, at least, it surpasses Norton Editor. :-) But had Norton
 not been bought out by Symantec, we'd have a modern, much more powerful
 version of NE today. But, oh well. Things have moved on. Vim beats the
 crap out of NE, Notepad, and just about any GUI editor out there. It
 also beats the snot out of emacs, but I don't want to start *that*
 flamewar. :-P


We didn't start that flamewar,
It was always burning,
Since the world's been turning...


 Here's more:

 http://www.omniglot.com/writing/conscripts2.htm

 Imagine if some of the more complicated scripts there were actually used
 in a real language, and Unicode had to support it...  Like this one:

 http://www.omniglot.com/writing/talisman.htm

 Or, if you *really* wanna go all-out:

 http://www.omniglot.com/writing/ssioweluwur.php

 (Check out the sample text near the bottom of the page and gape in
 awe at what creative minds let loose can produce... and horror at the
 prospect of Unicode being required to support it.)


Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul 
(sp?), at least to my untrained eye. And then others are just *really* 
interesting-looking, like:

http://www.omniglot.com/writing/12480.htm
http://www.omniglot.com/writing/ayeri.htm
http://www.omniglot.com/writing/oxidilogi.htm

You're right though, if I were in charge of Unicode and tasked with handling 
some of those, I think I'd just say "Screw it. Unicode is now deprecated. 
Use ASCII instead. Doesn't have the characters for your language? Tough! Fix 
your language!" :)


 When I get the time? Hah... I really need to get my lazy bum back to
 working on the new AA implementation first. I think that would
 contribute greater value than optimizing Unicode algorithms. :-) I was
 hoping *somebody* would be inspired by my idea and run with it...


Heh, yea. It is a tempting project, but my plate's overflowing too. (Now if 
only I could make the same happen to my bank account...!)




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Nick Sabalausky
Andrej Mitrovic andrej.mitrov...@gmail.com wrote in message 
news:mailman.2183.1335491333.4860.digitalmar...@puremagic.com...
 On 4/27/12, H. S. Teoh hst...@quickfur.ath.cx wrote:
 It's ironic how useless Notepad is compared to an ancient DOS program
 from the dinosaur age.

 If you run edit in command prompt or the run dialog (well, assuming
 you had a win32 box somewhere), you'd actually get a pretty decent
 dos-based editor that is still better than Notepad. It has split
 windows, a tab stop setting, and even a whole bunch of color settings.
 :P

Heh, I remember that :)

Holy crap, even in XP, they updated it to use the Windows standard key 
combos for cut/copy/paste. I had no idea, all this time. Back in DOS, it 
used that old Shift-Ins stuff.




Re: Notice/Warning on narrowStrings .length

2012-04-26 Thread Matt Peterson

On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:
 When I get the time? Hah... I really need to get my lazy bum back to
 working on the new AA implementation first. I think that would
 contribute greater value than optimizing Unicode algorithms. :-) I was
 hoping *somebody* would be inspired by my idea and run with it...


I actually recently wrote a lexer generator for D that wouldn't be that
hard to adapt to something like this.


Notice/Warning on narrowStrings .length

2012-04-23 Thread James Miller
I'm writing an introduction/tutorial to using strings in D, 
paying particular attention to the complexities of UTF-8 and 16. 
I realised that when you want the number of characters, you 
normally actually want to use walkLength, not length. Is it 
reasonable for the compiler to pick this up during semantic 
analysis and point out this situation?


It's just a thought because a lot of the time, using length will 
get the right answer, but for the wrong reasons, resulting in 
lurking bugs. You can always cast to immutable(ubyte)[] or 
immutable(ushort)[] if you want to work with the actual code units 
anyway.
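The discrepancy being described is easy to demonstrate. In D, the string "日本語" has length 9 (UTF-8 code units) but walkLength 3 (code points); here is the same distinction sketched by analogy in Python:

```python
s = "日本語"
utf8 = s.encode("utf-8")

# Each of these three CJK code points encodes as three UTF-8 bytes.
print(len(utf8))  # 9 -- code units, like D's string.length
print(len(s))     # 3 -- code points, like std.range.walkLength
```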


Re: Notice/Warning on narrowStrings .length

2012-04-23 Thread Adam D. Ruppe

On Monday, 23 April 2012 at 23:01:59 UTC, James Miller wrote:
Is it reasonable for the compiler to pick this up during 
semantic analysis and point out this situation?



Maybe... but it is important that this works:

string s;

if(s.length)
   do_something(s);

since that's always right and quite common.


Re: Notice/Warning on narrowStrings .length

2012-04-23 Thread bearophile

James Miller:

I realised that when you want the number of characters, you 
normally actually want to use walkLength, not length.


As with strlen() in C, unfortunately the result of 
walkLength(somestring) is computed every time you call it, 
because it doesn't get cached.
A partial improvement for this situation is to ensure that 
walkLength(somestring) is strongly pure, and that the D 
compiler is able to hoist this invariant pure computation out of 
loops.



Is it reasonable for the compiler to pick this up during 
semantic analysis and point out this situation?


This is not easy to do, because sometimes you want to know the 
number of code points, and sometimes the number of code units.
I even remember a proposal to rename the length field to 
another name for narrow strings, to avoid such bugs.


---

Adam D. Ruppe:


Maybe... but it is important that this works:

string s;

if(s.length)
   do_something(s);

since that's always right and quite common.


Better:

if (!s.empty)
    do_something(s);

(or even better, built-in non-nulls, usable for strings too).

Bye,
bearophile


Re: Notice/Warning on narrowStrings .length

2012-04-23 Thread James Miller

On Monday, 23 April 2012 at 23:52:41 UTC, bearophile wrote:

James Miller:

I realised that when you want the number of characters, you 
normally actually want to use walkLength, not length.


As with strlen() in C, unfortunately the result of 
walkLength(somestring) is computed every time you call it, 
because it doesn't get cached.
A partial improvement for this situation is to ensure that 
walkLength(somestring) is strongly pure, and that the D 
compiler is able to hoist this invariant pure computation out of 
loops.



Is it reasonable for the compiler to pick this up during 
semantic analysis and point out this situation?


This is not easy to do, because sometimes you want to know the 
number of code points, and sometimes the number of code units.
I even remember a proposal to rename the length field to 
another name for narrow strings, to avoid such bugs.


I was thinking about that. This is quite a vague suggestion, more 
just throwing the idea out there and seeing what people think. I 
am aware of the issue of walkLength being computed every time, 
rather than being a constant lookup. One option would be to make 
it only a warning in @safe code, so the worst-case scenario is that 
you mark the function as @trusted. I feel this fits in with the 
idea of @safe quite well, since you have to explicitly tell the 
compiler that you know what you're doing.


Another option would be to have some sort of general lint tool 
that picks up on these kinds of potential errors, though that is 
a lot bigger scope...


--
James Miller


Re: Notice/Warning on narrowStrings .length

2012-04-23 Thread bearophile

James Miller:

Another option would be to have some sort of general lint tool 
that picks up on these kinds of potential errors, though that 
is a lot bigger scope...


Lots of people in D.learn don't even use -wi -property, so go 
figure how many will use a lint :-)


To a first approximation you can rely only on what people see 
when compiling with plain dmd foo.d, the most basic compilation 
use. More serious programmers thankfully activate warnings.


Bye,
bearophile


Re: Notice/Warning on narrowStrings .length

2012-04-23 Thread Jonathan M Davis
On Tuesday, April 24, 2012 01:01:57 James Miller wrote:
 I'm writing an introduction/tutorial to using strings in D,
 paying particular attention to the complexities of UTF-8 and 16.
 I realised that when you want the number of characters, you
 normally actually want to use walkLength, not length. Is it
 reasonable for the compiler to pick this up during semantic
 analysis and point out this situation?
 
 It's just a thought because a lot of the time, using length will
 get the right answer, but for the wrong reasons, resulting in
 lurking bugs. You can always cast to immutable(ubyte)[] or
 immutable(ushort)[] if you want to work with the actual code units
 anyway.

At this point, I don't think that it makes any sense to give a warning for 
this. The compiler can't possibly know whether using length is a good idea or 
correct in any particular set of code. If we really want to do something to 
tackle the problem, then we should create a new string type which better 
solves the issues. There's a _lot_ more than just their length to worry 
about, due to the fact that strings are variable-length encoded.

There has been talk of creating a new string type, and there has been talk of 
creating the concept of a variable-length encoded range which better handles 
all of this stuff, but no proposal thus far has gotten anywhere.

As for walkLength being O(n) in many cases (as discussed elsewhere in this 
thread), I don't think that it's that big a deal. If you know what it's doing, 
you know that it's O(n), and it's simple enough to simply save the result if 
you need to call it multiple times.
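As a hedged Python analogy of "simply save the result": memoize the O(n) walk so repeated calls pay the cost only once. Here functools.lru_cache stands in for either a manual saved variable or the compiler hoisting a strongly pure call:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def walk_length(s: str) -> int:
    # O(n) walk over the UTF-8 code units on the first call; later
    # calls with the same string are served from the cache, which is
    # the automated version of saving the result yourself.
    return sum((b & 0xC0) != 0x80 for b in s.encode("utf-8"))

assert walk_length("日本語") == 3
assert walk_length("日本語") == 3  # second call hits the cache
```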

- Jonathan M Davis