Re: String Literal Docs

2010-06-23 Thread Justin Spahr-Summers
On Mon, 21 Jun 2010 15:20:16 -0500, Ellery Newcomer ellery-newco...@utulsa.edu wrote:
 Are your diagrams solely concerned with the lexer? Because I have a 
 (messy) parser grammar which I'm a bit more confident about if you're 
 interested. 

I can't speak for Alix, but I would absolutely be interested. I'm 
working on an Objective-D preprocessor and my parsing still has lots 
of holes, even besides the stuff I have marked to-do. A strict reading 
of the website has already turned up a few inaccuracies.


Re: String Literal Docs

2010-06-22 Thread div0

On 20/06/2010 20:14, Nick Sabalausky wrote:

div0d...@users.sourceforge.net  wrote in message
news:hvlok6$1rf...@digitalmars.com...

On 20/06/2010 18:55, Nick Sabalausky wrote:

div0d...@users.sourceforge.net   wrote in message
news:hvkrsc$2r5...@digitalmars.com...


It says multiple of 2, not even number of digits.


multiple of 2 == even number

Even as in even vs odd


I also said 'To me that implies'. Please don't take what I said out of
context and be a smart arse about it. There's more than enough of that
goes on round here.



That wasn't my intent, sorry if it came across that way. It sounded to me
like you were implying there was a difference between multiple of 2 and
even number. If that wasn't the case, then I guess I'm just not sure what
you were really getting at.




What I was getting at is that if you use the w suffix, then surely you 
would expect the number of hex digits to be a multiple of 4 not 2.


If there are only 6 digits, what then? Are the missing ones inferred to be
0, is it a compile error, or something else?


Because of the use of the 2, I inferred from the spec that the suffixes 
were not supposed to be allowed. If it had said even number of digits, 
I'd have been more inclined to think that the suffixes are legal.


Either way, it just highlights that the spec isn't sufficiently clear.
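
To make the digit-count question concrete, here is a minimal sketch in D. It assumes (this is not confirmed by the spec or by dmd) that a postfix would require whole code units, i.e. 2 hex digits per char, 4 per wchar and 8 per dchar. The names digitsPerCodeUnit and leftoverDigits are hypothetical helpers for illustration, not anything found in lexer.c.

```d
import std.ascii : isHexDigit;

// Hypothetical helper: hex digits needed to fill one code unit.
// Assumption: 'w' means 2-byte units, 'd' means 4-byte units,
// anything else ('c' or no postfix) means 1-byte units.
size_t digitsPerCodeUnit(char postfix)
{
    return postfix == 'w' ? 4 : postfix == 'd' ? 8 : 2;
}

// Counts the hex digits in the body (skipping whitespace and anything
// else) and returns how many digits are left over after filling whole
// code units. Zero means the body divides evenly.
size_t leftoverDigits(string hexBody, char postfix)
{
    size_t digits = 0;
    foreach (char c; hexBody)
        if (isHexDigit(c))
            ++digits;
    return digits % digitsPerCodeUnit(postfix);
}

void main()
{
    assert(leftoverDigits("dead beef", 'c') == 0);
    assert(leftoverDigits("dead beef", 'w') == 0);
    assert(leftoverDigits("dead be", 'w') == 2); // the 6-digit case
    assert(leftoverDigits("dead beef", 'd') == 0);
}
```

The sketch only detects the 6-digit case. Whether the leftover digits should be zero-padded or rejected is exactly the question the spec leaves open.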

--
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk


Re: String Literal Docs

2010-06-21 Thread Alix Pexton

On 20/06/2010 22:46, Alix Pexton wrote:

On 20/06/2010 21:37, Ellery Newcomer wrote:

On 06/20/2010 03:01 PM, Alix Pexton wrote:

On 19/06/2010 21:12, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.



I think I will take the plunge and base my diagrams on the source of
DMD. After looking at the code in lexer.c, it does not seem as far
beyond my rusty old c++ parsing skills as I had expected! Massive credit
to Walter for having a codebase that is as mature as DMD without it
turning into a labyrinth of preprocessor macros and cryptic comefroms.

This will mean however that my little project may take a little longer,
sigh...

A...


Do share. I've always been too lazy to read lexer.c, and from this
discussion, it sounds like there are a few spots where my own lexer
grammar is incorrect (or at least differs from dmd).



of course ^^

A...


Well, I think I have got my head around lexer.c now, and its various 
peculiarities, like 000377. being a valid float (although not 
according to my shiny new, limited edition copy of tDPL (fig2.2 p35)^^).


The weirdness occurs because some of the corner cases are handled not
by the neat little state machine that validates reals, but in the
scanner at the point where it recognises a number beginning with a zero. 
The productions in lex.html represent the range of inputs that are 
accepted by the state machine without taking into account that the 
scanner rejects the sequence ._ (which makes sense as that is the 
identifier _ in the outer scope).


Andrei's analysis in tDPL also points out that 0xp0 is a valid hexfloat, 
but a strict reading of lex.html would not allow it.


Overall the diagram for hexfloat is much simpler than the one for 
decimalfloat, which I think will have to be split into 3 


A...

PS, octal must die!


Re: String Literal Docs

2010-06-21 Thread Ellery Newcomer

On 06/21/2010 02:21 PM, Alix Pexton wrote:

On 20/06/2010 22:46, Alix Pexton wrote:

On 20/06/2010 21:37, Ellery Newcomer wrote:

On 06/20/2010 03:01 PM, Alix Pexton wrote:

On 19/06/2010 21:12, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like
those
on JSON.org, and of course I didn't get far before I ran into
something
odd.



I think I will take the plunge and base my diagrams on the source of
DMD. After looking at the code in lexer.c, it does not seem as far
beyond my rusty old c++ parsing skills as I had expected! Massive
credit
to Walter for having a codebase that is as mature as DMD without it
turning into a labyrinth of preprocessor macros and cryptic
comefroms.

This will mean however that my little project may take a little longer,
sigh...

A...


Do share. I've always been too lazy to read lexer.c, and from this
discussion, it sounds like there are a few spots where my own lexer
grammar is incorrect (or at least differs from dmd).



of course ^^

A...


Well, I think I have got my head around lexer.c now, and its various
peculiarities, like 000377. being a valid float (although not
according to my shiny new, limited edition copy of tDPL (fig2.2 p35)^^).


Oh wow. That's a sweet little diagram. Those dots are hard to see though.



The weirdness occurs because some of the corner cases are handled not
by the neat little state machine that validates reals, but in the
scanner at the point where it recognises a number beginning with a zero.
The productions in lex.html represent the range of inputs that are
accepted by the state machine without taking into account that the
scanner rejects the sequence ._ (which makes sense as that is the
identifier _ in the outer scope).


to hell with lexer.c. I'm not changing anything.



Andrei's analysis in tDPL also points out that 0xp0 is a valid hexfloat,
but a strict reading of lex.html would not allow it.

Overall the diagram for hexfloat is much simpler than the one for
decimalfloat, which I think will have to be split into 3 

A...

PS, octal must die!


I'll settle for modified syntax 0c123. But yeah.

Are your diagrams solely concerned with the lexer? Because I have a 
(messy) parser grammar which I'm a bit more confident about if you're 
interested.


Re: String Literal Docs

2010-06-21 Thread Alix Pexton

On 21/06/2010 21:20, Ellery Newcomer wrote:


Are your diagrams solely concerned with the lexer? Because I have a
(messy) parser grammar which I'm a bit more confident about if you're
interested.


So far I have only covered the lexer, but most of it needs redoing in 
light of the errors in the DMD docs, but I am hoping to cover the whole 
spec, eventually...


The more I do the quicker I'm able to make them as my workflow evolves, 
so it's hard to say how long it will take...


A...


Re: String Literal Docs

2010-06-20 Thread Alix Pexton

On 20/06/2010 01:09, div0 wrote:

On 19/06/2010 23:17, Ellery Newcomer wrote:


All I can say is

auto w = x"dead beef"w;

results in

Error: invalid UTF-8 sequence

on dmd 2.047


Then you've found a bug, you know what to do:

http://d.puremagic.com/issues/



Hmn, that would seem to indicate to me that the postfix is being allowed 
when the hex represents a valid UTF sequence, but not otherwise.


I didn't do too much testing myself as I know next to zilch about string 
internals 


The text that describes hex strings says that they have to have an even 
number of digits, but this would seem to imply that they have to have a 
multiple of 4 or 8 for wstrings and dstrings respectively, which makes 
sense, but I'm not sure that can be verified in the lexing of a string 
literal without insane lookahead rules 


But, then I guess that is why the spec says that hex strings are exempt 
from the valid UTF rule, and in that case hexstrings should really make 
byte arrays rather than strings, but failing that, always chars and not 
anything wider.


A...


Re: String Literal Docs

2010-06-20 Thread div0

On 20/06/2010 11:03, Alix Pexton wrote:

On 20/06/2010 01:09, div0 wrote:

On 19/06/2010 23:17, Ellery Newcomer wrote:


All I can say is

auto w = x"dead beef"w;

results in

Error: invalid UTF-8 sequence

on dmd 2.047


Then you've found a bug, you know what to do:

http://d.puremagic.com/issues/



Hmn, that would seem to indicate to me that the postfix is being allowed
when the hex represents a valid UTF sequence, but not otherwise.

I didn't do too much testing myself as I know next to zilch about string
internals 

The text that describes hex strings says that they have to have an even
number of digits, but this would seem to imply that they have to have a
multiple of 4 or 8 for wstrings and dstrings respectively, which makes
sense, but I'm not sure that can be verified in the lexing of a string
literal without insane lookahead rules 


It says multiple of 2, not even number of digits. To me that implies 
it's always 2 and the suffix acceptance is just a bug. It could be made 
more clear though.




But, then I guess that is why the spec says that hex strings are exempt
from the valid UTF rule, and in that case hexstrings should really make
byte arrays rather than strings, but failing that, always chars and not
anything wider.

A...


Yeah, hex strings should probably have the type ubyte[]

If you're using them to put arbitrary binary in your program you're almost
certainly going to cast the array to something else anyway, so char[], 
wchar[], dchar[] all seem a bit pointless and as they allow invalid utf, 
making them ?char[] seems wrong.
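
As an illustration of the ubyte[] idea, here is a minimal run-time sketch in D. hexToBytes is a hypothetical helper, not how dmd treats hex strings. It assumes whitespace between digits is skipped and that an odd number of digits is an error.

```d
import std.ascii : isHexDigit, isWhite;
import std.exception : enforce;

// Hypothetical helper: turns the body of a hex string into raw bytes.
// No UTF validation is done, because the result is ubyte[], not text.
ubyte[] hexToBytes(string hexBody)
{
    ubyte[] bytes;
    uint pending = 0;   // value accumulated for the current byte
    size_t digits = 0;  // hex digits seen so far

    foreach (char c; hexBody)
    {
        if (isWhite(c))
            continue;
        enforce(isHexDigit(c), "unexpected character in hex string");

        // '0'..'9' map to 0..9; letters are lower-cased and map to 10..15.
        uint nibble = c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10;
        pending = (pending << 4) | nibble;

        // Every second digit completes one byte.
        if (++digits % 2 == 0)
        {
            bytes ~= cast(ubyte) pending;
            pending = 0;
        }
    }
    enforce(digits % 2 == 0, "odd number of hex digits");
    return bytes;
}

void main()
{
    ubyte[] expected = [0xde, 0xad, 0xbe, 0xef];
    assert(hexToBytes("dead beef") == expected);
}
```

With a byte-array result the c, w and d postfixes have nothing to act on, which matches the argument that they are pointless for hex strings.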




Re: String Literal Docs

2010-06-20 Thread Nick Sabalausky
div0 d...@users.sourceforge.net wrote in message 
news:hvkrsc$2r5...@digitalmars.com...

 It says multiple of 2, not even number of digits.

multiple of 2 == even number

Even as in even vs odd


 Yeah, hex strings should probably have the type ubyte[]

 If you're using them to put arbitrary binary in your program you're almost
 certainly going to cast the array to something else anyway, so char[], 
 wchar[], dchar[] all seem a bit pointless and as they allow invalid utf, 
 making them ?char[] seems wrong.


You have me completely convinced.




Re: String Literal Docs

2010-06-20 Thread div0

On 20/06/2010 18:55, Nick Sabalausky wrote:

div0d...@users.sourceforge.net  wrote in message
news:hvkrsc$2r5...@digitalmars.com...


It says multiple of 2, not even number of digits.


multiple of 2 == even number

Even as in even vs odd


I also said 'To me that implies'. Please don't take what I said out of 
context and be a smart arse about it. There's more than enough of that 
goes on round here.


I read the spec. as specifying that the hex characters should be in 
groups of 2, I also take it as implying that the suffixes are not 
applicable. You're more than welcome to your own take on it.




Re: String Literal Docs

2010-06-20 Thread Alix Pexton

On 19/06/2010 21:12, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.



I think I will take the plunge and base my diagrams on the source of 
DMD. After looking at the code in lexer.c, it does not seem as far 
beyond my rusty old c++ parsing skills as I had expected! Massive credit 
to Walter for having a codebase that is as mature as DMD without it 
turning into a labyrinth of preprocessor macros and cryptic comefroms.


This will mean however that my little project may take a little longer, 
sigh...


A...


Re: String Literal Docs

2010-06-20 Thread Alix Pexton

On 20/06/2010 20:14, Nick Sabalausky wrote:

div0d...@users.sourceforge.net  wrote in message
news:hvlok6$1rf...@digitalmars.com...

On 20/06/2010 18:55, Nick Sabalausky wrote:

div0d...@users.sourceforge.net   wrote in message
news:hvkrsc$2r5...@digitalmars.com...


It says multiple of 2, not even number of digits.


multiple of 2 == even number

Even as in even vs odd


I also said 'To me that implies'. Please don't take what I said out of
context and be a smart arse about it. There's more than enough of that
goes on round here.



That wasn't my intent, sorry if it came across that way. It sounded to me
like you were implying there was a difference between multiple of 2 and
even number. If that wasn't the case, then I guess I'm just not sure what
you were really getting at.




From looking at the source, I now know that all string literals can 
have a postfix, and that as far as lexing goes, all strings are in UTF8. 
I've not tracked down yet where the value of the postfix is applied,
but I'm fairly certain that it would be easy enough to turn off the UTF 
verification for the hexstrings at that end.
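
A minimal sketch of what "all string literals can have a postfix" could look like at the lexing stage. This is not the code in lexer.c; scanPostfix is a hypothetical function and the sketch assumes the same check runs immediately after the closing delimiter of every kind of string literal.

```d
// Hypothetical helper: if the character at pos is one of the string
// postfixes c, w or d, consume it and return it. Otherwise leave pos
// alone and return '\0' to mean "no postfix".
char scanPostfix(string src, ref size_t pos)
{
    if (pos < src.length)
    {
        immutable c = src[pos];
        if (c == 'c' || c == 'w' || c == 'd')
        {
            ++pos;
            return c;
        }
    }
    return '\0';
}

void main()
{
    string src = "`back tick`d;";
    size_t pos = 11; // index just past the closing back tick

    assert(scanPostfix(src, pos) == 'd');
    assert(pos == 12);

    // The next character is ';', so nothing is consumed.
    assert(scanPostfix(src, pos) == '\0');
    assert(pos == 12);
}
```

Because the postfix is only recorded here and not interpreted, any width-dependent check (such as UTF validation) would have to happen later, wherever the postfix value is applied.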


As far as making my diagrams, I don't think it matters, for now...

A...


Re: String Literal Docs

2010-06-20 Thread Ellery Newcomer

On 06/20/2010 03:01 PM, Alix Pexton wrote:

On 19/06/2010 21:12, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.



I think I will take the plunge and base my diagrams on the source of
DMD. After looking at the code in lexer.c, it does not seem as far
beyond my rusty old c++ parsing skills as I had expected! Massive credit
to Walter for having a codebase that is as mature as DMD without it
turning into a labyrinth of preprocessor macros and cryptic comefroms.

This will mean however that my little project may take a little longer,
sigh...

A...


Do share. I've always been too lazy to read lexer.c, and from this 
discussion, it sounds like there are a few spots where my own lexer 
grammar is incorrect (or at least differs from dmd).




Re: String Literal Docs

2010-06-20 Thread Alix Pexton

On 20/06/2010 21:37, Ellery Newcomer wrote:

On 06/20/2010 03:01 PM, Alix Pexton wrote:

On 19/06/2010 21:12, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.



I think I will take the plunge and base my diagrams on the source of
DMD. After looking at the code in lexer.c, it does not seem as far
beyond my rusty old c++ parsing skills as I had expected! Massive credit
to Walter for having a codebase that is as mature as DMD without it
turning into a labyrinth of preprocessor macros and cryptic comefroms.

This will mean however that my little project may take a little longer,
sigh...

A...


Do share. I've always been too lazy to read lexer.c, and from this
discussion, it sounds like there are a few spots where my own lexer
grammar is incorrect (or at least differs from dmd).



of course ^^

A...


String Literal Docs

2010-06-19 Thread Alix Pexton
I've been sketching some grammar diagrams for D2.0, a little like those 
on JSON.org, and of course I didn't get far before I ran into something odd.


In the section of www.digitalmars.com/d/2.0/lex.html on string literals, 
the productions imply that the [c|w|d] postfix is allowed on Wysiwyg, 
DoubleQuote and Hex strings and not on either Delimited or Token 
strings, which didn't make a lot of sense to me, so I tested it with DMD 
(v2.046, win)...


---

import std.stdio;

void main(){
    auto t1 = "double quoted"; // OK
    auto t2 = `back tick`d; // OK
    auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
    auto t4 = q"/delimited/"d; // OK
    auto t5 = q{if}d; // OK

    writefln("all literals A-OK!");
}

---

This makes sense to me, HexStrings with wide chars would have made my 
brain scream 


So, to correct the documentation, the postfix needs to be removed from 
HexString and added to DelimitedString and TokenString.


I tried to see if this was already reported in the bug tracker but
couldn't see anything close.


On a slightly quieter note, there is also a spare underscore in the 
definition of HexidecimalDigit as it extends DecimalDigit which 
already has an underscore.


I also noticed a bug in the tracker related to initial underscores in 
float literals, if the diagrams start getting too puzzling I might look
into that ^^


A...

PS, my copy of tDPL is in the post, yay!


Re: String Literal Docs

2010-06-19 Thread Ellery Newcomer

On 06/19/2010 03:12 PM, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.

In the section of www.digitalmars.com/d/2.0/lex.html on string literals,
the productions imply that the [c|w|d] postfix is allowed on Wysiwyg,
DoubleQuote and Hex strings and not on either Delimited or Token
strings, which didn't make a lot of sense to me, so I tested it with DMD
(v2.046, win)...

---

import std.stdio;

void main(){
    auto t1 = "double quoted"; // OK
    auto t2 = `back tick`d; // OK
    auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
    auto t4 = q"/delimited/"d; // OK
    auto t5 = q{if}d; // OK

    writefln("all literals A-OK!");
}

---

This makes sense to me, HexStrings with wide chars would have made my
brain scream 



http://d.puremagic.com/issues/show_bug.cgi?id=4351


but I'm not so sure about the hex string one. I think you just gave it 
invalid unicode. E.g., this compiles fine:


auto w = x"1e1d 1e1f"w;

on dmd 2.047

but what it results in is pretty screwy.



So, to correct the documentation, the postfix needs to be removed from
HexString and added to DelimitedString and TokenString.

I tried to see if this was already reported in the bug tracker but
couldn't see anything close.

On a slightly quieter note, there is also a spare underscore in the
definition of HexidecimalDigit as it extends DecimalDigit which
already has an underscore.

I also noticed a bug in the tracker related to initial underscores in
float literals, if the diagrams start getting too puzzling I might look
into that ^^


What what?



A...

PS, my copy of tDPL is in the post, yay!


Re: String Literal Docs

2010-06-19 Thread div0

On 19/06/2010 22:16, Ellery Newcomer wrote:

On 06/19/2010 03:12 PM, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.

In the section of www.digitalmars.com/d/2.0/lex.html on string literals,
the productions imply that the [c|w|d] postfix is allowed on Wysiwyg,
DoubleQuote and Hex strings and not on either Delimited or Token
strings, which didn't make a lot of sense to me, so I tested it with DMD
(v2.046, win)...

---

import std.stdio;

void main(){
    auto t1 = "double quoted"; // OK
    auto t2 = `back tick`d; // OK
    auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
    auto t4 = q"/delimited/"d; // OK
    auto t5 = q{if}d; // OK

    writefln("all literals A-OK!");
}

---

This makes sense to me, HexStrings with wide chars would have made my
brain scream 



http://d.puremagic.com/issues/show_bug.cgi?id=4351


but I'm not so sure about the hex string one. I think you just gave it
invalid unicode. E.g., this compiles fine:


Hex strings are specifically exempted from the requirement for valid utf.




Re: String Literal Docs

2010-06-19 Thread Ellery Newcomer

On 06/19/2010 04:26 PM, div0 wrote:

On 19/06/2010 22:16, Ellery Newcomer wrote:

On 06/19/2010 03:12 PM, Alix Pexton wrote:

I've been sketching some grammar diagrams for D2.0, a little like those
on JSON.org, and of course I didn't get far before I ran into something
odd.

In the section of www.digitalmars.com/d/2.0/lex.html on string literals,
the productions imply that the [c|w|d] postfix is allowed on Wysiwyg,
DoubleQuote and Hex strings and not on either Delimited or Token
strings, which didn't make a lot of sense to me, so I tested it with DMD
(v2.046, win)...

---

import std.stdio;

void main(){
    auto t1 = "double quoted"; // OK
    auto t2 = `back tick`d; // OK
    auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
    auto t4 = q"/delimited/"d; // OK
    auto t5 = q{if}d; // OK

    writefln("all literals A-OK!");
}

---

This makes sense to me, HexStrings with wide chars would have made my
brain scream 



http://d.puremagic.com/issues/show_bug.cgi?id=4351


but I'm not so sure about the hex string one. I think you just gave it
invalid unicode. E.g., this compiles fine:


Hex strings are specifically exempted from the requirement for valid utf.




All I can say is

auto w = x"dead beef"w;

results in

Error: invalid UTF-8 sequence

on dmd 2.047


Re: String Literal Docs

2010-06-19 Thread div0

On 19/06/2010 23:17, Ellery Newcomer wrote:


All I can say is

auto w = x"dead beef"w;

results in

Error: invalid UTF-8 sequence

on dmd 2.047


Then you've found a bug, you know what to do:

http://d.puremagic.com/issues/
