Re: std.d.lexer: pre-voting review / discussion

2013-09-27 Thread Dominikus Dittes Scherkl
On Thursday, 26 September 2013 at 16:47:09 UTC, Jos van Uden 
wrote:

Is U+001A really meant to end the source file?
According to the Unicode specification this is a replacement character, like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means end of media)?

More important to me is that all the space characters beyond ASCII are not considered whitespace.


I imagine the lexer follows the language specification:

http://dlang.org/lex.html#EndOfFile


I know. What I wanted to say is: the language specification has a bug here (at least it is strange to interpret the replacement character as end of file, but not end of media), and the handling of Unicode space characters is not nice.
If this is not the right place to discuss that matter, please point me to a better place.


Re: std.d.lexer: pre-voting review / discussion

2013-09-27 Thread Mehrdad

On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:
The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity.


The alternative is to go for some ambiguous function/template parameter parsing and resolve at the end, but as template arguments are themselves ambiguous types/expressions/symbols, the amount of complexity in the parser is doomed to explode.



Pretty sure a GLR parser handles that well within O(n^2) space. 
Nothing exponential necessary...


Re: std.d.lexer: pre-voting review / discussion

2013-09-27 Thread Mehrdad

On Saturday, 28 September 2013 at 01:23:48 UTC, Mehrdad wrote:

On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:
The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity.


The alternative is to go for some ambiguous function/template parameter parsing and resolve at the end, but as template arguments are themselves ambiguous types/expressions/symbols, the amount of complexity in the parser is doomed to explode.



Pretty sure a GLR parser handles that well within O(n^2) space. 
Nothing exponential necessary...


I meant time... Though it's true for space too.


Re: std.d.lexer: pre-voting review / discussion

2013-09-26 Thread Dominikus Dittes Scherkl

Hello.

I'm not sure if this belongs here, but I think there is a bug at the very start of the Lexer chapter:


Is U+001A really meant to end the source file?
According to the Unicode specification this is a replacement character, like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means end of media)?

I don't know if anybody has ever ended his source that way, or if it was tested.


More important to me is that all the space characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B, up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why?
Ok, the set is much larger, but for end-of-line the Unicode versions (U+2028 and U+2029) are added. This seems inconsistent to me.
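The mismatch can be seen directly with std.uni. A minimal sketch (the `isDLexerWhitespace` helper is mine, not part of std.d.lexer; the ASCII-only set is taken from the lexer spec):

```d
import std.algorithm.searching : canFind;
import std.uni : isWhite;

// Whitespace per the D lexer spec is ASCII only: space, tab,
// vertical tab and form feed.
bool isDLexerWhitespace(dchar c)
{
    return " \t\v\f"d.canFind(c);
}

unittest
{
    // U+00A0 (NBSP) is whitespace for Unicode (category Zs)...
    assert(isWhite('\u00A0'));
    // ...but not for the D lexer, even though the Unicode line
    // separators U+2028/U+2029 do count as end-of-line in D.
    assert(!isDLexerWhitespace('\u00A0'));
}
```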


Re: std.d.lexer: pre-voting review / discussion

2013-09-26 Thread Jos van Uden

On 26-9-2013 17:41, Dominikus Dittes Scherkl wrote:

Hello.

I'm not sure if this belongs here, but I think there is a bug at the very start of the Lexer chapter:

Is U+001A really meant to end the source file?
According to the Unicode specification this is a replacement character, like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means end of media)?

I don't know if anybody has ever ended his source that way, or if it was tested.

More important to me is that all the space characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B, up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why?
Ok, the set is much larger, but for end-of-line the Unicode versions (U+2028 and U+2029) are added. This seems inconsistent to me.


I imagine the lexer follows the language specification:

http://dlang.org/lex.html#EndOfFile


Re: std.d.lexer: pre-voting review / discussion

2013-09-25 Thread Jacob Carlborg

On 2013-09-25 04:48, Brian Schott wrote:


Changes since last time:

https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master


I had some comments and a couple of minor things like spelling errors:

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.


* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?


* Personally I would like to see a short description for each unit test of what it's testing


* Could you remove debug code and other code that is commented out:

- 344
- 1172
- 1226, is that needed?
- 3165-3166
- 3197-3198
- 3392
- 3410
- 3434

Spelling errors:

* forwarad - 292
* commemnt - 2031
* sentenels - 299
* messsage - 301
* underliying - 2454
* alloctors - 3230
* strightforward - 2276

I guess these line numbers might be off now. My original comments were made September 12.


For reference see:

http://forum.dlang.org/thread/jsnhlcbulwyjuqcqo...@forum.dlang.org?page=7#post-l0rsje:24jf9:241:40digitalmars.com

--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-25 Thread Brian Schott

On Wednesday, 25 September 2013 at 09:36:43 UTC, Jacob Carlborg
wrote:
* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.


This is the default behavior that happens when you don't
configure an error callback.

* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?


I've added one.

* Could you remove debug code and other code that is commented 
out:


Most of this is now gone.


Spelling errors:


These were fixed weeks ago.


Re: std.d.lexer: pre-voting review / discussion

2013-09-25 Thread Jacob Carlborg
On Wednesday, 25 September 2013 at 16:52:43 UTC, Brian Schott 
wrote:



This is the default behavior that happens when you don't
configure an error callback.


I see.


I've added one.


Thanks.


Most of this is now gone.


That's good.


These were fixed weeks ago.


Great, I just never got a response to that.

--
/Jacob Carlborg



Re: std.d.lexer: pre-voting review / discussion

2013-09-24 Thread Brian Schott

On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
I had some comments that nobody addressed. Mostly about firing 
several instances of the lexer with the same identifier pool.


Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.


Re: std.d.lexer: pre-voting review / discussion

2013-09-24 Thread Brian Schott

On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:
I've been busy with things that aren't D-related recently, but 
I should have time the rest of this week to address the lexer.


Changes since last time:

https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master

Test coverage is up to 85% now and a few bugs have been fixed.


Re: std.d.lexer: pre-voting review / discussion

2013-09-24 Thread deadalnix
On Wednesday, 25 September 2013 at 02:23:36 UTC, Brian Schott 
wrote:

On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
I had some comments that nobody addressed. Mostly about firing 
several instances of the lexer with the same identifier pool.


Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.


Yes, ideally as an alias parameter. This is a show stopper for many usages.


Re: std.d.lexer: pre-voting review / discussion

2013-09-17 Thread Dicebot

Discussion has slowed down and thread is fading away.

I personally have not noticed any crucial objections (ones that 
can't be figured out during normal pull request review process). 
Brian, what is your state on this? Is there something you want to 
add/change before the voting gets initiated?


Re: std.d.lexer: pre-voting review / discussion

2013-09-17 Thread Dicebot

On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:

On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:

Discussion has slowed down and thread is fading away.

I personally have not noticed any crucial objections (ones 
that can't be figured out during normal pull request review 
process). Brian, what is your state on this? Is there 
something you want to add/change before the voting gets 
initiated?


I had some comments that nobody addressed. Mostly about firing 
several instances of the lexer with the same identifier pool.


Well, I suppose only Brian can address it :) I am only checking 
if there are no commonly repeated objections that are likely to 
ruin the voting. There are plenty of comments of course and I 
would recommend Brian to at least make some statement about those 
before voting but that is his call.


There is a possibility to mention one issue as a blocker when 
voting though, that will be taken into consideration when 
counting actual votes.


Re: std.d.lexer: pre-voting review / discussion

2013-09-17 Thread deadalnix

On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:

Discussion has slowed down and thread is fading away.

I personally have not noticed any crucial objections (ones that 
can't be figured out during normal pull request review 
process). Brian, what is your state on this? Is there something 
you want to add/change before the voting gets initiated?


I had some comments that nobody addressed. Mostly about firing 
several instances of the lexer with the same identifier pool.


Re: std.d.lexer: pre-voting review / discussion

2013-09-17 Thread Brian Schott

On Tuesday, 17 September 2013 at 16:48:44 UTC, Dicebot wrote:

On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:

On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:

Discussion has slowed down and thread is fading away.

I personally have not noticed any crucial objections (ones 
that can't be figured out during normal pull request review 
process). Brian, what is your state on this? Is there 
something you want to add/change before the voting gets 
initiated?


I had some comments that nobody addressed. Mostly about firing 
several instances of the lexer with the same identifier pool.


Well, I suppose only Brian can address it :) I am only checking 
if there are no commonly repeated objections that are likely to 
ruin the voting. There are plenty of comments of course and I 
would recommend Brian to at least make some statement about 
those before voting but that is his call.


There is a possibility to mention one issue as a blocker when 
voting though, that will be taken into consideration when 
counting actual votes.


I've been busy with things that aren't D-related recently, but I 
should have time the rest of this week to address the lexer.


Re: std.d.lexer: pre-voting review / discussion

2013-09-17 Thread Dicebot

On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:
I've been busy with things that aren't D-related recently, but 
I should have time the rest of this week to address the lexer.


Ok, please notify me then once you feel ready for voting / extra 
review ( public at dicebot.lv )


Thanks for your work.


Re: std.d.lexer: pre-voting review / discussion

2013-09-17 Thread ilya-stromberg

On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
std.d.lexer is standard module for lexing D code, written by 
Brian Schott


Brian, have you got any plans to write a common lexer/parser generator based on a context-free grammar (CFG) or parsing expression grammar (PEG), not only for the D grammar?


Something like this:
https://github.com/PhilippeSigaud/Pegged


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Walter Bright

On 9/11/2013 10:10 PM, deadalnix wrote:

See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.


I don't believe that, because you can see just about anything in the lookahead tokens and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread sclytrack

On Thursday, 12 September 2013 at 03:31:42 UTC, H. S. Teoh wrote:
On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis 
wrote:

On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
 Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum 
values (they
must start with a lowercase letter), which is why we went with 
adding
an _ on the end. And it's pretty much as close as you can get 
to the
keyword without actually using the keyword, which is a plus 
IMHO

(though from the sounds of it, H.S. Teoh would consider that a
negative due to possible confusion with the keyword).

[...]

Actually, the main issue I have is that some of the enum values 
end with
_ while others don't. This is inconsistent. I'd rather have 
consistency
than superficial resemblance to the keywords as typed. Either 
*all* of
the enum values should end with _, or *none* of them should.  
Having a
mixture of both is an eyesore, and leads to people wondering, "should I add a _ at the end or not?"

If people insist that the 'default' keyword absolutely must be
represented as TokenType.default_ (I really don't see why), 
then *all*
TokenType values should end with _. But honestly, I find that 
really
ugly. Writing something like kwDefault, or tokenTypeDefault, 
would be

far better.

Sigh, Andrei was right. Once the bikeshed is up for painting, 
even the

rainbow won't suffice. :-P


T


Delphi would use

  TokenType   ttDefault
  MyType  mtDefault


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-12 00:02, H. S. Teoh wrote:


Constructors with many arguments are evil, because you can't just modify
one setting and use defaults for the rest; you have to specify all of
them up to the last default argument you have to change.


Currently this is possible:

LexerConfig config = { iterStyle: IterationStyle.everything, tokenStyle: 
TokenStyle.source };


I would really like the following to be possible as well:

void foo (LexerConfig config);

foo({ iterStyle: IterationStyle.everything, tokenStyle: 
TokenStyle.source });
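The limitation being discussed can be shown with simplified stand-in types (the enum/field names below are assumptions, not the actual std.d.lexer API): the struct-initializer syntax only works in a declaration, so today you have to declare a named config and then pass it.

```d
// Simplified stand-ins for the real std.d.lexer types.
enum IterationStyle { codeOnly, everything }
enum TokenStyle { default_, source }

struct LexerConfig
{
    IterationStyle iterStyle;
    TokenStyle tokenStyle;
}

bool usesSourceStyle(LexerConfig config)
{
    return config.tokenStyle == TokenStyle.source;
}

unittest
{
    // Works: struct-initializer syntax in a declaration...
    LexerConfig config = { iterStyle: IterationStyle.everything,
                           tokenStyle: TokenStyle.source };
    assert(usesSourceStyle(config));
    // ...but `usesSourceStyle({ iterStyle: ... })` at the call site
    // is not valid D, which is the wished-for feature above.
}
```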


--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-11 22:37, Brian Schott wrote:


It's possible, but I fear that it would make the code a mess.


Really? It seems you only need to change two places in the code:

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L1724

And

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L400

Although I just did a quick search for "line".

--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-11 21:57, Piotr Szturmaj wrote:


Delphi designers realized this problem years ago and they came up with a
solution:
http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers


Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have, for example, a function named scope.


Scala does it as well: `keyword`, if I recall correctly. It seems like you can put basically anything between the backticks in Scala.


--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-12 00:36, Martin Nowak wrote:


Also a convenience function that reads a file and processes UTF BOM
marks would be nice (see toUtf8
https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429),
but that could as well fit somewhere else into phobos.


Sounds like that would fit in std.file or similar.

--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-11 17:01, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott


Finally :)

* How does it handle errors? Just returns TokenType.invalid?

* Personally I think the module is too big. I would go with:

- std.d.lexer.token
- std.d.lexer.tokentype
- std.d.lexer.lexer - contains the rest
- std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig
- CircularRange, StringCache, possibly put somewhere else. I assume this 
can be used for other things than lexing?

- Trie related code, same as above

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.


If we do decide it should throw, it should absolutely _not_ throw a 
plain Exception. Create a new type, LexException or similar. I hate when 
code throws plain Exceptions, it makes it useless to catch them.


I would also expect this LexException to contain a Token. It shouldn't 
be needed to parse the exception message to get line and column information.


* I like that you overall use clear and descriptive variable and 
function names. Except sbox: 
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L3265


* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?


* I would like to see a short description for each unit test of what it's testing. Personally I have started with this style:


@describe("byToken")
{
    @context("valid string literal")
    {
        @it("should return a token with the type TokenType.stringLiteral") unittest
        {
            // test
        }

        @it("should return a token with the correct lexeme") unittest
        {
            // test
        }
    }
}

Better formatted: http://pastebin.com/Dx78Vw6r

People here might think that would be a bit too verbose. The following 
would be ok as well:


@describe("short description of the unit test") unittest { }

* Could you remove debug code and other code that is commented out:

- 344
- 1172
- 1226, is that needed?
- 3165-3166
- 3197-3198
- 3392
- 3410
- 3434

Spelling errors:

* forwarad - 292
* commemnt - 2031
* sentenels - 299
* messsage - 301
* underliying - 2454
* alloctors - 3230
* strightforward - 2276

--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread deadalnix
On Thursday, 12 September 2013 at 06:17:04 UTC, Walter Bright 
wrote:

On 9/11/2013 10:10 PM, deadalnix wrote:
See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.


I don't believe that, because you can see just about anything in the lookahead tokens and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.


I plan to, but you know how it is: the best optimization is the one that goes from a non-working to a working state.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Brian Schott
I got some time to work on the lexer this evening. Changeset 
here: 
https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d


The DDoc page has moved here: 
http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html


* There are a few more unit tests now
* bitAnd renamed to amp
* slice rename to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread qznc
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright 
wrote:

On 9/11/2013 6:30 PM, deadalnix wrote:

Indeed. What solution do you have in mind ?


The solution dmd uses is to put in an intermediary layer that 
saves the lookahead tokens in a linked list.


I think this is the right approach. It can probably be another 
function, we can put into std.range and reuse for other 
lexer/parsers. The lexer or the parser should not be made more 
complex for this.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread dennis luehring

just an idea regarding your string[TokenType.max + 1] in

immutable(string[TokenType.max + 1]) tokenValues = [..]

it seems that you try to reduce the memory usage

wouldn't it be a nice idea to generate a combined immutable string at compile time, like this one

"...pragmaexportpackageprivate..."

and generate string slice accesses?

immutable string big_one = generated_from(token_list);
immutable string export_token = big_one[10 .. 16];
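A minimal CTFE sketch of this combined-string idea (the token list and slice offsets here are hypothetical; a real table would be derived from TokenType):

```d
// Hypothetical token spellings; the real table would come from TokenType.
immutable string[] tokenList = ["pragma", "export", "package", "private"];

// Concatenate all spellings into one buffer (runs at compile time via CTFE).
string generateFrom(immutable string[] tokens)
{
    string result;
    foreach (t; tokens)
        result ~= t;
    return result;
}

enum string bigOne = generateFrom(tokenList); // "pragmaexportpackageprivate"

// Each token value is then just a slice into the shared buffer.
enum string exportToken = bigOne[6 .. 12];

static assert(bigOne == "pragmaexportpackageprivate");
static assert(exportToken == "export");
```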




Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread dennis luehring

On 12.09.2013 10:15, Brian Schott wrote:

I got some time to work on the lexer this evening. Changeset
here:
https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d

The DDoc page has moved here:
http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

* There are a few more unit tests now
* bitAnd renamed to amp
* slice rename to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config



problem: many occurrences of the same string

you should use constants for the tokens (and others)

string asm_token = "asm";
...

immutable(string[TokenType.max + 1]) tokenValues = [
  ...
  asm_token
  ...
]

and reuse these constants in your optimization

maybe you can replace these lines with something getting fed with asm_token and giving the same result, but without 'a' and "sm" as split-up and different magic values - maybe a nice template or subrange...


case 'a': if (input[1..$].equal("sm")) return TokenType.asm_; else
...
break;

and in your unit tests

for example

on auto expected =




Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread deadalnix

On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
Most important goal of this review is to determine any API / 
design problems. Any internal implementation tweaks may happen 
after inclusion to Phobos but it is important to assure that no 
breaking changes will be required any time soon after module 
will get wider usage.




One quick remark: we need some kind of value provider that can be reused across different lexing passes, and can be used outside the lexer.


If I process a module and have to kick in a new lexing phase because of a mixin, I want to generate identifiers out of the same pool.
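A minimal sketch of what such a shared pool could look like: a string-interning cache, so that two lexing passes (e.g. one for a string mixin) hand out the same identifier instance. The names are assumptions, not the actual std.d.lexer API.

```d
// Hypothetical reusable identifier pool (string interning).
struct IdentifierPool
{
    private string[const(char)[]] cache;

    // Return the pooled copy of `ident`, allocating it only once.
    string intern(const(char)[] ident)
    {
        if (auto hit = ident in cache)
            return *hit;
        auto owned = ident.idup;
        cache[owned] = owned;
        return owned;
    }
}

unittest
{
    IdentifierPool pool;
    auto a = pool.intern("foo");
    auto b = pool.intern("foo".dup); // different input buffer
    // Same pooled instance: compare pointers, not just contents.
    assert(a.ptr is b.ptr);
}
```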


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Robert Schadek
On 09/12/2013 12:09 AM, Manfred Nowak wrote:
 Walter Bright wrote:

 Since the very beginning.

 One example is determining if something is a declaration or an
 expression. 
 I see now, that you wrote about parsing---not about lexing.

 Btw. I wrote an LALR(1)-parser for an early version of D. This means a 
 lookahead of one was sufficient---or I made terrible mistakes. 

 -manfred
I had problems with it, especially with

IdentifierList:
    Identifier . IdentifierList
    TemplateInstance . IdentifierList

And Bison also complained.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread H. S. Teoh
On Thu, Sep 12, 2013 at 07:00:09AM +0200, deadalnix wrote:
 On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:
 I still don't understand why backtracking is necessary in the first
 place. I would've thought a modern parser should be well able to
 encode seen tokens in its state so that backtracking is never
 necessary.  Or does D grammar have tricky bits that cannot be handled
 this way, that I'm unaware of?
 
 
 The problem is that it can cause an exponential (and I literally mean
 exponential here) amount of complexity.
 
 In some cases, the complexity is manageable, but in others that don't
 make any sense (it has to be noted that even full lexing doesn't make
 any sense here). For instance:
 
 int foo()() {}
^
 
 When you are at the caret position, you don't know if you face a
 function declaration or a template declaration. You could go for
 some ambiguous parsing, but each template argument can itself be a
 type, an expression or a symbol.
[...]

This can be handled by using an intermediate grammar rule. Reduce all
(...) into an intermediate type, say ArgList, so the reduction
happens something like this:

int  foo   ()  ()  {}
Type Ident ArgList ArgList ^

Then have the rule:

CompileTimeArgs ::= ArgList
RuntimeArgs ::= ArgList
TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed
whether they are compile-time arguments or runtime arguments. It's only
after you see the next '(' or '{' that you decide whether ArgList should
reduce to CompileTimeArgs or RuntimeArgs.

ArgList itself, of course, will accept all possible parameters (both
runtime and compile-time): types, expressions, symbols. Then when you
reduce it to RuntimeArgs, you reject stuff that can't be interpreted as
parameter declarations.
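The decide-after-the-fact idea above can also be sketched at the token level: scan past the first balanced (...) group without committing, and classify it by what follows. A second '(' means the first group held compile-time (template) parameters. The token kinds below are hypothetical placeholders, not std.d.lexer's.

```d
// Hypothetical token kinds; only the bracketing structure matters here.
enum Tok { lparen, rparen, lbrace, other }

// Scan past the first balanced (...) group; a '(' right after it means
// the group held compile-time (template) parameters.
bool startsTemplateDecl(const Tok[] toks)
{
    assert(toks.length > 0 && toks[0] == Tok.lparen);
    size_t i = 0;
    int depth = 0;
    do
    {
        if (toks[i] == Tok.lparen) ++depth;
        else if (toks[i] == Tok.rparen) --depth;
        ++i;
    }
    while (depth > 0 && i < toks.length);
    return i < toks.length && toks[i] == Tok.lparen;
}

unittest
{
    with (Tok)
    {
        // int foo()() {}  ->  ()() : template declaration
        assert(startsTemplateDecl([lparen, rparen, lparen, rparen, lbrace]));
        // int foo() {}    ->  (){  : plain function declaration
        assert(!startsTemplateDecl([lparen, rparen, lbrace]));
    }
}
```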


T

-- 
There are two ways to write error-free programs; only the third one works.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Manfred Nowak
Robert Schadek wrote:

 especially with IdentifierList:

I see the shift/reduce conflict if you were indeed trying to syntactically differentiate between template identifiers and other identifiers.

-manfred



Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Manfred Nowak
deadalnix wrote:
 When you are at the caret position, you don't know

If one ever reaches that position one uses petty lexer/grammar definitions.

-manfred


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Timon Gehr

On 09/11/2013 08:49 PM, Walter Bright wrote:






3. I assumed [an instance of] TokenType is a type.


(is(TokenType) is true).


But it's not, it's an enum.
Even the document says it's a 'type', but it's not a type.


It's a type tag. The tag uniquely determines the type. (As in 'the type 
of a token', as opposed to 'the type of an expression'.)



4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be awkward
if .. took on new meanings in D. Calling it 'dotdot' would be clearer.
Ditto for the rest. For example that is done better, '*' is called
'star', rather than 'dereference'.


FWIW, I use Tok!"...". I.e. a UDL for specifying kinds of tokens when 
interfacing with the parser. Some other kinds of tokens get a canonical 
representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is 
the kind of signed integer literal tokens etc.




5. The LexerConfig initialization should be a constructor rather than a 
sequence of assignments.


Using the { a:2, b:3 }-style initialization syntax?


6. No clue how lookahead works with this.


Eg. use a CircularBuffer adapter range. I have an implementation 
currently coupled with my own lexer implementation. If there is 
interest, I could factor it out.


Lookahead is realized as follows in the parser:

(assume 'code' is the circular buffer range.)

auto saveState(){muteerr++; return code.pushAnchor();} // saves the 
state and mutes all error messages until the state is restored


void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer 
grows automatically to keep around tokens still reachable by an anchor. 
(The range only needs small constant space besides the buffer to support 
this functionality, though it is unable to detect usage errors.)



This approach is typically more efficient than using a free list on 
contemporary architectures.
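The anchor mechanism described above can be sketched on a plain growable buffer (the real implementation uses a circular buffer; the names here mirror the post but the code is a hypothetical simplification):

```d
// Position marker handed out by pushAnchor and consumed by popAnchor.
struct Anchor { size_t pos; }

struct TokenStream
{
    int[] tokens;        // stand-in for buffered lexed tokens
    size_t pos;
    size_t anchorCount;  // while > 0, earlier tokens must be kept alive

    bool empty() const { return pos >= tokens.length; }
    int front() const { return tokens[pos]; }
    void popFront() { ++pos; }

    // Mark the current position so the parser can rewind to it.
    Anchor pushAnchor() { ++anchorCount; return Anchor(pos); }

    // Rewind; once anchorCount drops to zero, a real circular buffer
    // could recycle the space before `pos`.
    void popAnchor(Anchor a) { --anchorCount; pos = a.pos; }
}

unittest
{
    auto s = TokenStream([1, 2, 3]);
    auto save = s.pushAnchor();
    s.popFront();
    s.popFront();
    assert(s.front == 3);
    s.popAnchor(save);   // speculative parse failed: rewind
    assert(s.front == 1);
}
```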




Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread H. S. Teoh
On Thu, Sep 12, 2013 at 08:10:18PM +0400, Dmitry Olshansky wrote:
 12-Sep-2013 19:39, Timon Gehr wrote:
 On 09/11/2013 08:49 PM, Walter Bright wrote:
 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be
 awkward if .. took on new meanings in D. Calling it 'dotdot' would
 be clearer.  Ditto for the rest. For example that is done better,
 '*' is called 'star', rather than 'dereference'.
 
 FWIW, I use Tok!"...". I.e. a UDL for specifying kinds of tokens
 when interfacing with the parser. Some other kinds of tokens get a
 canonical representation. Eg. Tok!"i" is the kind of identifier
 tokens, Tok!"0" is the kind of signed integer literal tokens etc.
 
 I like this.
 Not only this has the benefit of not colliding with keywords. I also
 imagine that it could be incredibly convenient to get back the
 symbolic representation of a token (when token used as parameter to
 AST-node say BinaryExpr!(Tok!"+")). And truth be told we all know
 how tokens look in symbolic form so learning a pack of names for
 them feels pointless.

+1.  This is superior to both the ad hoc _ suffix and my ad hoc
prefixing approach.  Tok!"default" is maximally readable, and requires
no silly convolutions like _ or 'kw' / 'tokenType' prefixes.

I vote for Tok!"..." to denote token types.

Question: what's the implementation of Tok? Does it fit into an enum?
What's the underlying representation? I imagine some kind of canonical
mapping into an integral type would be desired, to maximize runtime
performance.


T

-- 
There are three kinds of people in the world: those who can count, and those 
who can't.
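One possible answer to the implementation question, sketched in D: map each spelling to its index in a compile-time table, so every Tok!"x" is an ordinary integral constant usable in switches. The spelling table below is a tiny hypothetical subset, not the real token set.

```d
// Hypothetical compile-time table of token spellings.
immutable string[] spellings = ["default", "+", "..", "{"];

// Tok!"x" resolves to the index of "x" in the table, at compile time.
template Tok(string s)
{
    import std.algorithm.searching : countUntil;
    enum ptrdiff_t idx = spellings.countUntil(s);
    static assert(idx >= 0, "unknown token spelling: " ~ s);
    enum size_t Tok = idx;
}

static assert(Tok!"default" == 0);
static assert(Tok!"+" == 1);
static assert(Tok!".." == 2);
// Tok!"nonsense" would fail to compile via the static assert.
```

Since each Tok!"..." instantiation folds to a plain integer, it costs nothing at runtime and can back an enum-like token-type field.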


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Dmitry Olshansky

12-Sep-2013 12:05, Jacob Carlborg wrote:

On 2013-09-11 17:01, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott


Finally :)

* How does it handle errors? Just returns TokenType.invalid?

* Personally I think the module is too big. I would go with:

- std.d.lexer.token
- std.d.lexer.tokentype


These could be one module. There is really no meaningful way to use 
token type separately from token.



- std.d.lexer.lexer - contains the rest
- std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig


On the contrary, I see this breakdown as pointless - do you really want to use the config without the lexer?




- CircularRange, StringCache, possibly put somewhere else. I assume this
can be used for other things than lexing?
- Trie related code, same as above


The reason is that no good public interface is defined. Basically the same as with Trie in the new std.uni module - it needs its own review.




* I see that errorMessage throws an exception. Do we really want that? I
would expect it to just return an invalid token.

If we do decide it should throw, it should absolutely _not_ throw a
plain Exception. Create a new type, LexException or similar. I hate when
code throws plain Exceptions, it makes it useless to catch them.

I would also expect this LexException to contain a Token. It shouldn't
be needed to parse the exception message to get line and column
information.


Better yet to have a std exception hierarchy... so that all parsing 
modules can be tied to ParseException. So this needs to be resolved in a 
forward-compatible way.



--
Dmitry Olshansky


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Dmitry Olshansky

12-Sep-2013 19:39, Timon Gehr wrote:

On 09/11/2013 08:49 PM, Walter Bright wrote:

4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be awkward
if .. took on new meanings in D. Calling it 'dotdot' would be clearer.
Ditto for the rest. For example that is done better, '*' is called
'star', rather than 'dereference'.


FWIW, I use Tok!"...". I.e. a UDL for specifying kinds of tokens when
interfacing with the parser. Some other kinds of tokens get a canonical
representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is
the kind of signed integer literal tokens etc.


I like this.
Not only does this have the benefit of not colliding with keywords, I also 
imagine that it could be incredibly convenient to get back the symbolic 
representation of a token (when a token is used as a parameter to an AST node, 
say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in 
symbolic form, so learning a pack of names for them feels pointless.



6. No clue how lookahead works with this.


Eg. use a CircularBuffer adapter range. I have an implementation
currently coupled with my own lexer implementation. If there is
interest, I could factor it out.

Lookahead is realized as follows in the parser:

(assume 'code' is the circular buffer range.)

auto saveState(){muteerr++; return code.pushAnchor();} // saves the
state and mutes all error messages until the state is restored

void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
grows automatically to keep around tokens still reachable by an anchor.
(The range only needs small constant space besides the buffer to support
this functionality, though it is unable to detect usage errors.)


This approach is typically more efficient than using a free list on
contemporary architectures.



This ^^ is how. In fact std.d.lexer internally does a similar thing with 
non-RA ranges of bytes.
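For illustration, the anchor scheme described in the quoted post could be sketched like this in D. This is a minimal sketch: the names `pushAnchor`/`popAnchor` follow the post, but the growth strategy and buffer layout are assumptions, not the actual implementation.

```d
// Growable circular token buffer with anchor-based O(1) backtracking.
struct Anchor { size_t pos; }

struct CircularTokenBuffer(T)
{
    private T[] buf;
    private size_t front_, back_;   // monotonically increasing positions
    private size_t[] anchors;       // stack of active anchors

    void put(T tok)
    {
        if (buf.length == 0)
            buf = new T[16];
        if (back_ - oldest() == buf.length)   // full: double the buffer
        {
            auto nbuf = new T[buf.length * 2];
            foreach (i; oldest() .. back_)
                nbuf[i % nbuf.length] = buf[i % buf.length];
            buf = nbuf;
        }
        buf[back_ % buf.length] = tok;
        ++back_;
    }

    @property bool empty() const { return front_ == back_; }
    @property T front() { return buf[front_ % buf.length]; }
    void popFront() { ++front_; }

    Anchor pushAnchor()
    {
        anchors ~= front_;
        return Anchor(front_);
    }

    void popAnchor(Anchor a)   // anchors must nest properly, as in the post
    {
        anchors = anchors[0 .. $ - 1];
        front_ = a.pos;
    }

    // Oldest position still reachable: the outermost anchor, or front_.
    private size_t oldest() const { return anchors.length ? anchors[0] : front_; }
}
```

The parser's `saveState`/`restoreState` from the post would then be thin wrappers over `pushAnchor`/`popAnchor`; tokens before the outermost anchor are overwritten as the buffer wraps.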



--
Dmitry Olshansky


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Timon Gehr

On 09/11/2013 05:01 PM, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott

 Input 

Code: https://github.com/Hackerpilot/phobos/tree/master/std/d

Documentation:
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

...


(Commenting on what's visible in the documentation only for now.)

auto config = ...
... .byToken(config) ...

Seems to be a natural candidate for manual partial specialization.

enum config = ...
... .byToken!config() ...

uint line; ushort column; // is there overflow checking?

Check to see if the token is of the same type and has the same string 
representation as the given token.


Tokens with the same string representation always are of the same type, 
so this seems redundant.


Furthermore, I'd expect (!a.opCmp(b)) === (a == b).

Why provide the operator overloads at all? They don't implement 
essential or natural functionality.



includeSpecialTokens. It's not clear what this flag does.


If the input range supports slicing, the caching layer aliases itself 
away and the lexing process is much more efficient.


It might be more sensible to require the user to manually wrap his range.

pure nothrow bool isOperator(const TokenType t);
pure nothrow bool isOperator(ref const Token t);
pure nothrow bool isKeyword(const TokenType t);
pure nothrow bool isKeyword(ref const Token t);
...

IMO we should get rid of these.



TokenType naming seems inconsistent. eg: & is amp, = is assign, == is 
equal, but &= is bitAndEqual and && is logicAnd


IMO better: & is and, = is assign, &= is andAssign and && is andAnd.

Of course, it might be best to use a template instead. Tok!"&", Tok!"&=" 
and Tok!"&&".




Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread deadalnix

On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:
This can be handled by using an intermediate grammar rule. Reduce all
(...) into an intermediate type, say ArgList, so the reduction
happens something like this:

int  foo   ()  ()  {}
Type Ident ArgList ArgList ^

Then have the rule:

CompileTimeArgs ::= ArgList
RuntimeArgs ::= ArgList
TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed
whether they are compile-time arguments or runtime arguments. It's only
after you see the next '(' or '{' that you decide whether ArgList should
reduce to CompileTimeArgs or RuntimeArgs.

ArgList itself, of course, will accept all possible parameters (both
runtime and compile-time): types, expressions, symbols. Then when you
reduce it to RuntimeArgs, you reject stuff that can't be interpreted as
parameter declarations.



And then you have to backtrack the parsing instead of the lexing. 
You just moved the problem around. You'll have to create some 
temporary AST nodes that are later fixed into what they really are.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread H. S. Teoh
On Thu, Sep 12, 2013 at 06:09:41PM +0200, deadalnix wrote:
 On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:
 This can be handled by using an intermediate grammar rule. Reduce all
 (...) into an intermediate type, say ArgList, so the reduction
 happens something like this:
 
  int  foo   ()  ()  {}
  Type Ident ArgList ArgList ^
 
 Then have the rule:
 
  CompileTimeArgs ::= ArgList
  RuntimeArgs ::= ArgList
  TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
  FuncDecl ::= Type Ident RuntimeArgs '{' ...
 
 So first, all (...) gets parsed to ArgList, but it's not yet fixed
 whether they are compile-time arguments or runtime arguments. It's
 only after you see the next '(' or '{' that you decide whether
 ArgList should reduce to CompileTimeArgs or RuntimeArgs.
 
 ArgList itself, of course, will accept all possible parameters (both
 runtime and compile-time): types, expressions, symbols. Then when you
 reduce it to RuntimeArgs, you reject stuff that can't be interpreted
 as parameter declarations.
 
 
 And then you got to backtrack the parsing instead of the lexing. You
 just moved the problem around. You'll have to create some temporary
 ast nodes that then will fix into what they really are.

No. You can just use ArgListItem for both runtime args and compile-time
args. Once you decided which one it is, wrong arguments are rejected at
semantic time (which you have to do anyway).

Let's take a concrete example. Say we're parsing this invalid code:

int foo(alias A)(alias B) {}

You'd go through these steps:

1) Parse initial prefix of declaration:

int foo(alias A)(alias B) {}
   ^
AST:
FuncDecl
 |--RetType: int
 |--Ident: foo
 \--[ being built ]

2) Parse first (...):

int foo(alias A)(alias B) {}
^
AST:
FuncDecl
 |--RetType: int
 |--Ident: foo
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: A
 \--[ being built ]

   I'm skipping the intermediate steps here, it's obvious how to
   construct AliasArg from the usual parsing process.

3) Parse second (...):

int foo(alias A)(alias B) {}
  ^
AST:
FuncDecl
 |--RetType: int
 |--Ident: foo
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: A
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: B
 \--[ being built ]

4) At this point, you now know the first ArgList is CompileTimeArgList,
and the second is RuntimeArgList, so you can just change the type
fields (along with narrowing FuncDecl to TemplateFuncDecl):

AST:
TemplateFuncDecl (was: FuncDecl)
 |--RetType: int
 |--Ident: foo
 |--CompileTimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: A
 |--RuntimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: B
 \--[ being built ]

Since you're still constructing FuncDecl, your current parsing context
should still have a direct reference to the partially-constructed
FuncDecl node, which in turn has a direct reference to both ArgList
child nodes. So this is just dereferencing a couple of pointers. No
backtracking.

5) Finish parsing the declaration:

int foo(alias A)(alias B) {}
^
AST:
TemplateFuncDecl
 |--RetType: int
 |--Ident: foo
 |--CompileTimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: A
 |--RuntimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: B
 \--FuncBody
 \-- CompoundStatement
  \-- [empty body]

6) Run semantic:
   - Create local symbol table for foo, etc..
   - Run semantic on CompileTimeArgList:
  - Check AliasArg for validity
  - Run semantic on AliasArg: add A to function's local symbol
table, etc.
   - Run semantic on RuntimeArgList:
  - Check AliasArg for validity: ERROR: cannot have alias parameter
at runtime.
  - (Add B to local symbol table)(skipped due to previous error)
   - (Run semantic on FuncBody)(skipped due to previous error)
   - (Run semantic on RetType (verify return type match, etc.))(skipped
 due to previous error)
   - (Add function to parent scope symbol table)(skipped due to previous
 error)

So, no backtracking is necessary.

Of course, it sounds like DMD's parser doesn't work this way, but that's
a limitation of DMD's parser, not an *inherent* need for backtracking.
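The re-tagging in step 4 can be sketched with a mutable node kind. This is a hedged illustration of the idea only: the node kinds, class shape, and function name are assumptions, not DMD's real AST.

```d
// Narrowing ArgList nodes once the parser knows which list is which.
enum NodeKind { argList, compileTimeArgList, runtimeArgList }

class Node
{
    NodeKind kind;
    Node[] children;
    this(NodeKind k) { kind = k; }
}

// Called after the last ')': by then the parser knows whether the
// declaration had one parameter list or two.
void narrowArgLists(Node[] argLists)
{
    if (argLists.length == 2)
    {
        argLists[0].kind = NodeKind.compileTimeArgList;  // first (...)
        argLists[1].kind = NodeKind.runtimeArgList;      // second (...)
    }
    else if (argLists.length == 1)
    {
        argLists[0].kind = NodeKind.runtimeArgList;      // plain function
    }
    // Only the tags change; no child nodes are rebuilt, so no backtracking.
}
```

Since the partially-built FuncDecl already holds references to both ArgList children, this is just the "dereference a couple of pointers" step described above.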


T

-- 
I see that you JS got Bach.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-12 18:21, Dmitry Olshansky wrote:


On the contrary, I see this breakdown as pointless - do you really want to use
config without the lexer?


std.d.lexer.config might be a bit unnecessary. But yes, in general I 
would like to see smaller modules.


--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jacob Carlborg

On 2013-09-12 17:39, Timon Gehr wrote:


Using the { a:2, b:3 }-style initialization syntax?


Unfortunately that's currently not possible to pass to functions, see my 
other post:


http://forum.dlang.org/thread/jsnhlcbulwyjuqcqo...@forum.dlang.org?page=6#post-l0ro6h:249mk:241:40digitalmars.com

--
/Jacob Carlborg


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Timon Gehr

On 09/12/2013 06:40 PM, H. S. Teoh wrote:


I vote for Tok!"..." to denote token types.

Question: what's the implementation of Tok?  Does it fit into an enum?
What's the underlying representation? I imagine some kind of canonical
mapping into an integral type would be desired, to maximize runtime
performance.



This is just a quick hack. One would maybe want to avoid the unwarranted 
copies and linear searches at compile time:


import std.algorithm;

private enum symbols = ["i", ".", "..", ",", "0", "/+...+/"];

struct TokenType{
private uint _tag; // or smaller type
private this(int tag){_tag=tag;}
string toString(){ // (optional)
static array(R)(R s){ string[] r; foreach(x;s) r~=x; return r; }
static immutable strs=array(symbols.map!((a)=>`Tok!"`~a~`"`));
return strs[_tag];
}
}

template Tok(string name)if(symbols.countUntil(name)!=-1){
enum Tok=TokenType(cast(uint)symbols.countUntil(name));
}

import std.stdio;
void main(){
enum x = Tok!"i";
writeln(x);
writeln(Tok!"0");
}



Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Martin Nowak

On 09/12/2013 08:17 AM, Walter Bright wrote:

I don't believe that, because you can see about anything for tokens in
lookahead and so have to duplicate nearly the whole lexer anyway for the
'fast path', but you're free to try it out and prove me wrong.


Me neither and if you have to duplicate the code for this it's a double 
loss.
I think it would be more fruitful to batch lexing, i.e. fill a whole 
memory page (4Kb) with tokens, then go back to parsing until you need more.
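The batching idea above could be sketched like this in D. It is a hedged sketch only: the `TokenPage` name, the fill-callback shape, and the capacity are all assumptions for illustration.

```d
// Lex a page worth of tokens into a fixed buffer, parse from it,
// and refill only when the page runs dry.
struct TokenPage(T, size_t capacity = 256)
{
    T[capacity] page;
    size_t len, pos;

    // fill writes up to `capacity` tokens into the slice and returns
    // how many it produced (0 signals end of input).
    bool refill(size_t delegate(T[]) fill)
    {
        len = fill(page[]);
        pos = 0;
        return len > 0;
    }

    bool empty() const { return pos == len; }
    T front() const { return page[pos]; }
    void popFront() { ++pos; }
}
```

The parser drains the page through the usual range primitives and calls `refill` between batches, so the lexer's hot loop runs uninterrupted over one cache-friendly block at a time.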


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Martin Nowak

On 09/12/2013 05:39 PM, Timon Gehr wrote:


Lookahead is realized as follows in the parser:

(assume 'code' is the circular buffer range.)

auto saveState(){muteerr++; return code.pushAnchor();} // saves the
state and mutes all error messages until the state is restored

void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
grows automatically to keep around tokens still reachable by an anchor.
(The range only needs small constant space besides the buffer to support
this functionality, though it is unable to detect usage errors.)


Do you think it's possible to use CircularBufferRange itself as the anchor?
Then calling save() would return a CircularBufferRange and you could 
scratch the two functions above. I had some promising experiments in 
that direction, but the implicit save on copy is annoying because you 
end up with anchors from temporary copies.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Robert Schadek
On 09/12/2013 04:21 AM, Martin Nowak wrote:
 On 09/12/2013 03:39 AM, Walter Bright wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

 The solution dmd uses is to put in an intermediary layer that saves the
 lookahead tokens in a linked list.

 Linked list sounds bad.
 Do you have a rough idea how often lookahead is needed, i.e. is it
 performance relevant? If so it might be worth tuning.
Maybe some fixed size stack vector with 64 elements or so and some
linked list for the unusual case would help ..


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Dmitry Olshansky

13-Sep-2013 00:46, Robert Schadek wrote:

On 09/12/2013 04:21 AM, Martin Nowak wrote:

On 09/12/2013 03:39 AM, Walter Bright wrote:

On 9/11/2013 6:30 PM, deadalnix wrote:

Indeed. What solution do you have in mind ?


The solution dmd uses is to put in an intermediary layer that saves the
lookahead tokens in a linked list.


Linked list sounds bad.
Do you have a rough idea how often lookahead is needed, i.e. is it
performance relevant? If so it might be worth tuning.



Maybe some fixed size stack vector with 64 elements or so and some
linked list for the unusual case would help ..


And an extra branch to detect which one is currently the case? No, thanks ;)

--
Dmitry Olshansky


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Timon Gehr

On 09/12/2013 10:03 PM, Martin Nowak wrote:

On 09/12/2013 05:39 PM, Timon Gehr wrote:


Lookahead is realized as follows in the parser:

(assume 'code' is the circular buffer range.)

auto saveState(){muteerr++; return code.pushAnchor();} // saves the
state and mutes all error messages until the state is restored

void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
grows automatically to keep around tokens still reachable by an anchor.
(The range only needs small constant space besides the buffer to support
this functionality, though it is unable to detect usage errors.)


Do you think it's possible to use CircularBufferRange itself as anchor.


Well, this implementation is geared towards the usage pattern found in 
the top-down parser. If the first anchor is not restored last, it will 
not generally work. This does not conform to the range interface. 
Implementing .save requires a more advanced and less efficient 
implementation.



Then calling save() would return a CircularBufferRange and you could
scratch the two functions above. I had some promising experiments in
that direction,


What did you do? The way I think I'd approach it would be to maintain a 
binary min-heap using a few pointers in each range instance.



but the implicit save on copy is annoying because you
end up with anchors from temporary copies.


Is this referring to suboptimal efficiency? I guess that is rather hard 
to address in a safe way. Basically, you'd want to forget about anchor 
management in case it can be proven that there is another range on the 
same buffer that stays at most as progressed during the whole lifetime 
of the range under consideration.


Of course, it is always possible to do it unsafely.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Robert Schadek
On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:
 Maybe some fixed size stack vector with 64 elements or so and some
 linked list for the unusual case would help ..

 And an extra branch to detect which one is currently the case? No,
 thanks ;)

I would think that branching and using the vector is faster than
constructing and linking nodes in the LL
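Robert's proposal could be sketched like this in D. This is a hedged illustration of the hybrid layout only; the 64-slot size is the number from the post, and a growable overflow array stands in for the linked list.

```d
// Fixed small array for the common lookahead depth, with a growable
// overflow array for the rare deep-lookahead case.
struct LookaheadBuffer(T, size_t N = 64)
{
    T[N] small;      // hot path: no allocation
    T[] overflow;    // cold path: lookahead deeper than N tokens
    size_t len;

    void push(T tok)
    {
        if (len < N)
            small[len] = tok;    // the common case
        else
            overflow ~= tok;     // the rare, allocating case
        ++len;
    }

    T opIndex(size_t i)
    {
        assert(i < len);
        return i < N ? small[i] : overflow[i - N];
    }
}
```

Dmitry's objection is precisely the `i < N` branch on every access; whether it beats pointer-chasing a list is the kind of question the benchmarking suggested below would settle.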


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Jonathan M Davis
On Thursday, September 12, 2013 23:55:23 Robert Schadek wrote:
 On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:
  Maybe some fixed size stack vector with 64 elements or so and some
  linked list for the unusual case would help ..
  
  And an extra branch to detect which one is currently the case? No,
  thanks ;)
 
 I would think that branching and using the vector is faster than using
 the constructing and linking nodes in the LL

Well, it sounds like there are several ideas to try out in this thread. I 
guess that only benchmarking and profiling will really tell us which (if any) 
are better though.

- Jonathan M Davis


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Walter Bright

On 9/12/2013 1:15 AM, Brian Schott wrote:

I got some time to work on the lexer this evening. Changeset here:
https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d


The DDoc page has moved here:
http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html


Great!



* There are a few more unit tests now


I strongly recommend running the unit tests with -cov. A lexer should be able to 
get near 100% coverage with the unit tests.



* bitAnd renamed to amp
* slice rename to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config




Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Brian Schott
On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright 
wrote:
I strongly recommend running the unit tests with -cov. A lexer 
should be able to get near 100% coverage with the unit tests.


Some of the code is only present to be used at compile-time to 
generate switch statements inside of the lexer. -cov doesn't count 
code that's executed at compile time as covered, and it 
couldn't show meaningful line numbers on code that's generated 
and mixed in.


That being said, it's currently 70% covered. I'll be making 
another pass over the code later to fill in some of the gaps.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread Walter Bright

On 9/12/2013 4:57 PM, Brian Schott wrote:

On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright wrote:

I strongly recommend running the unit tests with -cov. A lexer should be able
to get near 100% coverage with the unit tests.


Some of the code is only present to be used at compile-time to generate switch
statements inside of the lexer. -cov doesn't show code that's executed at
compile-time to be covered, and it couldn't show meaningful line numbers on code
that's generated and mixed in.


That's right.



That being said, it's currently 70% covered. I'll be making another pass over
the code later to fill in some of the gaps.


Thanks.


Re: std.d.lexer: pre-voting review / discussion

2013-09-12 Thread deadalnix
We are talking about parameters here, not arguments. But overall, 
that isn't the point.


If I have int a = 3 as an argument in the first set of (), I 
still have no clue if it is a runtime or compile time parameter. 
But ref int a = 3 ?


Doing so, you end up with a lot of ambiguous nodes (like 
FunctionArgumentOrValueTemplateParameter) and have to manage that 
complexity, or have to backtrack on the parsing and fix ambiguous 
nodes when you have enough information.


std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Dicebot
std.d.lexer is standard module for lexing D code, written by 
Brian Schott


 Input 

Code: https://github.com/Hackerpilot/phobos/tree/master/std/d

Documentation:
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

Initial discussion:
http://forum.dlang.org/thread/dpdgcycrgfspcxenz...@forum.dlang.org

Usage example in real project:
https://github.com/Hackerpilot/Dscanner
(as stdx.d.lexer, Brian, please correct me if versions do not 
match)


 Information for reviewers 

(yes, I am mostly copy-pasting this :P)

Goal of this thread is to detect if there are any outstanding
issues that need to be fixed before formal yes/no voting
happens. If no critical objections arise, voting will begin
next week. Otherwise it depends on the time the module 
author needs to implement suggestions.


Please take this part seriously: If you identify problems along 
the

way, please note if they are minor, serious, or showstoppers.
(http://wiki.dlang.org/Review/Process). This information later
will be used to determine if library is ready for voting.

If there are any frequent Phobos contributors / core developers
please pay extra attention to submission code style and fitting
into overall Phobos guidelines and structure.

The most important goal of this review is to determine any API / 
design problems. Any internal implementation tweaks may happen 
after inclusion to Phobos, but it is important to ensure that no 
breaking changes will be required any time soon after the module 
gets wider usage.


 Information request from module author 

Performance was a major discussed topic in previous thread. Could 
you please provide benchmarking data for version currently 
ongoing the review?


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Tove

On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
std.d.lexer is standard module for lexing D code, written by 
Brian Schott


I remember reading there were some interesting hash-advances in 
dmd recently.


http://forum.dlang.org/thread/kq7ov0$2o8n$1...@digitalmars.com?page=1

maybe it's worth benchmarking those hashes for std.d.lexer as 
well.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 11:43 AM, Johannes Pfau wrote:

But to be honest I'm not sure how important this really is. I think it
should help for more responsive IDEs but maybe parsing is not a
bottleneck at all?



It is important, and I'm glad you brought it up. The LexerConfig can provide a 
spot to put a starting line/column value.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Manfred Nowak
Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version?

-manfred


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 11:45 AM, qznc wrote:

On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott

Documentation:
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html


The documentation for Token twice says measured in ASCII characters or UTF-8
code units, which sounds confusing to me.

Is it UTF-8, which includes ASCII? Then it should not be or.


Pedantically, it is just UTF-8 code units, which are a superset of ASCII.



Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Johannes Pfau
On Wed, 11 Sep 2013 17:01:58 +0200, Dicebot pub...@dicebot.lv wrote:

 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott
 

Question / Minor issue:

As we already have a range based interface I'd love to have partial
lexing / parsing, especially for IDEs.

Say I have this source code:

1: module a;
2: 
3: void test(int a)
4: {
5: [...]
6: }
7: 
8: void test2()
9: [...]


Then I first do a full parse pass over the source. Now line 5 is being
edited. I know from the full parse that line 5 is part of a
FunctionDeclaration which starts at line 3 and ends at line 6. Now I'd
like to re-parse only that part:


FunctionDeclaration decl = document.getDeclByLine(5);
decl.reparse(/*start_line=*/ 3, docBuffer);


I think these are the two critical points related to this for the
proposed std.lexer:

* How can I tell the lexer to start lexing at line/character n? Of
  course the input could be sliced, but then line number and position
  information in the Token struct is wrong.
* I guess std.lexer slices the original input? This could make things
  difficult if the file buffer is edited in place. But this can
  probably be dealt with outside of std.lexer. (By reallocating all
  .value members)


(And once this is working, an example in the docs would be great)

But to be honest I'm not sure how important this really is. I think it
should help for more responsive IDEs but maybe parsing is not a
bottleneck at all?


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 8:01 AM, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott


Thank you, Brian! This is important work.

Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or something 
like that.


2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should 
be an enum so a switch can be done for speed.


3. I assumed TokenType is a type. But it's not, it's an enum. Even the document 
says it's a 'type', but it's not a type.


4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name 
rather than a token name. This would be awkward if .. took on new meanings in D. 
Calling it 'dotdot' would be clearer. Ditto for the rest. For an example where this 
is done better: '*' is called 'star' rather than 'dereference'.


5. The LexerConfig initialization should be a constructor rather than a sequence 
of assignments. LexerConfig documentation is awfully thin. For example, 
'tokenStyle' is explained as being 'Token style', whatever that is.


6. No clue how lookahead works with this. Parsing D requires arbitrary 
lookahead.

7. uint line; Should indicate that lines start with '1', not '0'. Ditto for 
columns.

8. 'default_' Again with the awful practice of appending _.

9. Need to insert intra-page navigation links, such as when 'byToken()' appears 
in the text, it should be link to where byToken is described.




Goal of this thread is to detect if there are any outstanding
issues that need to be fixed before formal yes/no voting
happens. If no critical objections arise, voting will begin
next week. Otherwise it depends on the time the module author needs to 
implement suggestions.


I believe the state of the documentation is a showstopper, and needs to be 
extensively fleshed out before it can be considered ready for voting.





Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Brian Schott
The choice of ending token names with underscores was made 
according to the Phobos style guide.


http://dlang.org/dstyle.html


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 12:10 PM, Brian Schott wrote:

The choice of ending token names with underscores was made according to the
Phobos style guide.

http://dlang.org/dstyle.html


I didn't realize that was in the style guide. I guess I can't complain about it, 
then :-)


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Jonathan M Davis
On Wednesday, September 11, 2013 11:49:44 Walter Bright wrote:
 1. I don't like the _ suffix for keywords. Just call it kwimport or
 something like that.

Appending _ is what we do in Phobos when we need to use a keyword, and it's 
even called out in the style guide: http://dlang.org/dstyle.html

Now, if it makes sense to use something other than a keyword, that's probably 
better, but in cases where it really makes the most sense to use a keyword but 
we can't (e.g. std.traits.FunctionAttribute), we append _ to the keyword.

What makes the most sense in this case, I don't know, because I haven't looked
at either the documentation or the code yet, but if Brian is appending _ to
keywords, he's following the official style guide.

- Jonathan M Davis


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread qznc

On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
std.d.lexer is standard module for lexing D code, written by 
Brian Schott


Documentation:
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html


The documentation for Token twice says measured in ASCII 
characters or UTF-8 code units, which sounds confusing to me.


Is it UTF-8, which includes ASCII? Then it should not be or.

This is nitpicking. Overall, I like the proposal. Great work!


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 12:17 PM, Manfred Nowak wrote:

Walter Bright wrote:

Parsing D requires arbitrary lookahead.


Why---and since which version?


Since the very beginning.

One example is determining if something is a declaration or an expression.



Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Jonathan M Davis
On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
 Walter Bright wrote:
  Parsing D requires arbitrary lookahead.
 
 Why---and since which version?

IIRC, it's at least partially caused by the .. token. You have to look ahead to 
figure out whether it's .. or a floating point literal. There may be other cases 
as well, but that particular one was actually complained about by Robert 
Schadek in his talk at dconf 2013.
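The ambiguity Jonathan mentions can be shown in a few lines. This is a hedged sketch of the decision point only, not std.d.lexer's actual code: having consumed digits and hit a '.', the lexer must peek one more character to tell "1.2" (a float literal) from "1..2" (integer, '..' token, integer).

```d
// Decide whether a '.' that follows digits begins a fractional part.
bool dotStartsFractionalPart(string src, size_t dotPos)
{
    // dotPos indexes a '.' that immediately follows digits.
    assert(dotPos < src.length && src[dotPos] == '.');
    // A second '.' means the digits were an integer followed by '..'.
    return dotPos + 1 < src.length && src[dotPos + 1] != '.';
}
```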

- Jonathan M Davis


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Dicebot

On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:

On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh 
wrote:
I disagree. I think it's more readable to use a consistent prefix,
like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
clear you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing
Phobos code ;)


How would that break Phobos code? Phobos code doesn't even use
std.d.lexer right now.


Phobos code must conform to its style guide. You can't change it 
without changing existing Phobos code that relies on it. 
Inconsistent style is the worst of all options.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 12:13:07PM -0700, Walter Bright wrote:
 On 9/11/2013 12:10 PM, Brian Schott wrote:
 The choice of ending token names with underscores was made according
 to the Phobos style guide.
 
 http://dlang.org/dstyle.html
 
 I didn't realize that was in the style guide. I guess I can't
 complain about it, then :-)

I disagree. I think it's more readable to use a consistent prefix, like
kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear
you're referring to token types, not the actual keyword.


T

-- 
One Word to write them all, One Access to find them, One Excel to count
them all, And thus to Windows bind them. -- Mike Champion


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 11:49:44AM -0700, Walter Bright wrote:
 On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian
 Schott
 
 Thank you, Brian! This is important work.
 
 Not a thorough review, just some notes from reading the doc file:
 
 1. I don't like the _ suffix for keywords. Just call it kwimport or
 something like that.

I agree.


 2. The example uses an if-then sequence of isBuiltType, isKeyword,
 etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value of
each token, and at the same time needing to group them into categories
for syntax highlighting purposes. I'd suggest a function for converting
the TokenType enum value into a TokenCategory enum. Perhaps something
like:

enum TokenCategory { BuiltInType, Keyword, ... }

// Convert TokenType into TokenCategory
TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch
over that. This maximizes performance.

Implementation-wise, I'd suggest either a hash table on TokenType, or
perhaps even encoding the category into the TokenType enum values, for
example:

enum TokenCategory {
BuiltInType, Keyword, ...
}

enum TokenType {
IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
...
FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
}

Then the category function can be simply:

TokenCategory category(TokenType tt) {
return cast(TokenCategory)(tt >> 16);
}

Though admittedly, this is a bit hackish. But if you're going for speed,
this would be quite fast.
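To make the intended usage concrete, user code could then dispatch on the recovered category in a single switch. This is a hypothetical sketch building on the encoding above; the `highlight` function and the printed color names are invented for illustration:

```d
import std.stdio : writeln;

// Hypothetical: dispatch once per token on its category rather than
// testing isBuiltInType/isKeyword/... in sequence.
void highlight(TokenType tt)
{
    final switch (category(tt))
    {
    case TokenCategory.BuiltInType:
        writeln("color: blue");
        break;
    case TokenCategory.Keyword:
        writeln("color: bold");
        break;
    }
}
```

A `final switch` also gets the compiler to verify that every category is handled.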


 3. I assumed TokenType is a type. But it's not, it's an enum. Even
 the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's
referring to what type of token it is (or would you prefer TokenKind?).


 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be
 awkward if .. took on new meanings in D. Calling it 'dotdot' would
 be clearer. Ditto for the rest. For example that is done better, '*'
 is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on
context. It would be weird (and unnecessarily obscure) for parser code
to refer to 'dereference' when parsing expressions. :)


 5. The LexerConfig initialization should be a constructor rather than
 a sequence of assignments. LexerConfig documentation is awfully thin.
 For example, 'tokenStyle' is explained as being 'Token style',
 whatever that is.

I'm on the fence about this one. Setting up the configuration before
starting the lexing process makes sense to me.

Perhaps one way to improve this is to rename LexerConfig to DLexer, and
make byToken a member function (or call it via UFCS):

DLexer lex;
lex.iterStyle = ...;
// etc.

foreach (token; lex.byToken()) {
...
}

This reads better: you're getting a list of tokens from the lexer 'lex',
as opposed to getting something from byToken(config), which doesn't
really *say* what it's doing: is it tokenizing the config object?


 6. No clue how lookahead works with this. Parsing D requires arbitrary
 lookahead.

What has this got to do with lexing? The parser needs to keep track of
its state independently of the lexer anyway, doesn't it?  I'm not sure
how DMD's parser works, but the way I usually write parsers is that
their state encodes the series of tokens encountered so far, so they
don't need the lexer to save any tokens for them. If they need to refer
to tokens, they create partial AST trees on the parser stack that
reference said tokens. I don't see why it should be the lexer's job to
keep track of such things.


[...]
 8. 'default_' Again with the awful practice of appending _.

Yeah I didn't like that. Prefixing keywords with kw or kw_ is much
better, instead of the confusing way of appending _ to some token types
but not others.


 9. Need to insert intra-page navigation links, such as when
 'byToken()' appears in the text, it should be link to where byToken
 is described.

Agreed.


[...]
 I believe the state of the documentation is a showstopper, and needs
 to be extensively fleshed out before it can be considered ready for
 voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an
important readability issue IMO.

Also, it's unclear what types of input is supported -- the code example
only uses string input, but what about file input? Does it support
byLine? Or does it need me to slurp the entire file contents into an
in-memory buffer first?

Now, somebody pointed out that there's currently no way to tell it that
the data you're 

Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
 I disagree. I think it's more readable to use a consistent prefix,
 like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
 clear you're referring to token types, not the actual keyword.
 
 Not unless you want to change the style guide and break existing
 Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use
std.d.lexer right now.


T

-- 
Being able to learn is a great learning; being able to unlearn is a greater 
learning.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Brian Schott
On Wednesday, 11 September 2013 at 19:29:48 UTC, Jonathan M Davis 
wrote:

On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:

Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version?


IIRC, it's at least partially caused by the .. token. You have to look
ahead to figure out whether it's .. or a floating point literal. There
may be other cases as well, but that particular one was actually
complained about by Robert Schadek in his talk at dconf 2013.

- Jonathan M Davis


Yeah. D requires lookahead in both lexing and parsing. Some 
examples:


* String literals such as q{}
* If statements need to determine the difference between if 
(expression) and if (type identifier = expression)
* Determining if something is a lamba expression or a function 
literal expression
* Determining if something is a declaration or a statement or an 
expression.
* Differentiating between (type).identifier and a lambda 
expression
* Differentiating between a type and an expression inside a 
typeid expression
* Differentiating between associative array key type and tuple 
slices in type suffixes
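The if-statement case, for example, cannot be disambiguated at the opening parenthesis; both forms below begin with the same token sequence (an illustrative snippet, not from the original post):

```d
// Expression form: `x` is simply tested for truth.
if (x)
    doSomething();

// Declaration form: valid D -- declares `y`, then tests it for truth.
if (MyType y = getValue())
    use(y);
```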


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Dicebot

On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
I disagree. I think it's more readable to use a consistent 
prefix, like
kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's 
clear

you're referring to token types, not the actual keyword.


Not unless you want to change the style guide and break existing 
Phobos code ;)


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Piotr Szturmaj

On 11.09.2013 20:49, Walter Bright wrote:

On 9/11/2013 8:01 AM, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott


Thank you, Brian! This is important work.

Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or
something like that.

8. 'default_' Again with the awful practice of appending _.


Delphi designers realized this problem years ago and they came up with a 
solution: 
http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers


Basically, Delphi allows escaping reserved identifiers with a '&'. I 
wonder how D solves that problem when interfacing to COM classes if they 
have for example a function named scope.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Jonathan M Davis
On Wednesday, September 11, 2013 15:29:33 Jonathan M Davis wrote:
 On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
  Walter Bright wrote:
   Parsing D requires arbitrary lookahead.
  
  Why---and since which version?
 
 IIRC, it's at least partially caused by the .. token. You have to look ahead
 to figure out whether it's .. or a floating point literal. There may be
 other cases as well, but that particular one was actually complained about
 by Robert Schadek in his talk at dconf 2013.

LOL. You were asking about _parsing_ requiring arbitrary lookahead, and I 
answered as to why lexing requires more than one character of lookahead. I 
should read more carefully...

- Jonathan M Davis


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
 On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
 I disagree. I think it's more readable to use a consistent prefix,
 like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
 clear you're referring to token types, not the actual keyword.
 
 Not unless you want to change the style guide and break existing
 Phobos code ;)
 
 How would that break Phobos code? Phobos code doesn't even use
 std.d.lexer right now.
 
 Phobos code must conform to its style guide. You can't change it
 without changing existing Phobos code that relies on it.
 Inconsistent style is the worst of all options.

This doesn't violate Phobos style guidelines:

enum TokenType {
kwInt,
kwFloat,
kwDouble,
...
kwFunction,
kwScope,
... // etc.
}

The token representing a particular keyword is not the same thing as the
keyword itself. There's nothing that says a D lexer must use token type
names that are derived from the D keywords it represents.

In fact, using a kw prefix for tokens representing keywords is *more*
consistent, because it doesn't require inserting _ for some enum values
but not for others.

I never said anything about changing Phobos style guidelines.


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide',
doesn't seem to remove the bugs on my system! -- Mike Dresser


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 12:55 PM, H. S. Teoh wrote:

3. I assumed TokenType is a type. But it's not, it's an enum. Even
the document says it's a 'type', but it's not a type.


I don't understand this complaint. I thought it's pretty clear that it's
referring to what type of token it is (or would you prefer TokenKind?).


A type is a well-understood concept in D. This is using 'type' to mean something 
completely different. This is very jarring. Your idea of using 'kind' is vastly 
superior.




5. The LexerConfig initialization should be a constructor rather than
a sequence of assignments. LexerConfig documentation is awfully thin.
For example, 'tokenStyle' is explained as being 'Token style',
whatever that is.


I'm on the fence about this one. Setting up the configuration before
starting the lexing process makes sense to me.


Yes, using a constructor :-)


6. No clue how lookahead works with this. Parsing D requires arbitrary
lookahead.


What has this got to do with lexing?


The lexer needs to support backing up and going forward again. That's how the 
dmd lexer works.




Also, it's unclear what types of input is supported -- the code example
only uses string input, but what about file input? Does it support
byLine? Or does it need me to slurp the entire file contents into an
in-memory buffer first?


The point of having the input be an InputRange is you DO NOT CARE if the input 
is coming from a string, file, etc.




Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 8:01 AM, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott


One litmus test of the validity of the design is to see if one can build a D 
parser out of it.


Otherwise we'll likely face the unpleasant task of needing to deprecate 
std.d.lexer.



Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Brian Schott

On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
2. The example uses an if-then sequence of isBuiltType, 
isKeyword,

etc. Should be an enum so a switch can be done for speed.


I believe this is probably a result of having a separate enum 
value of
each token, and at the same time needing to group them into 
categories
for syntax highlighting purposes. I'd suggest a function for 
converting
the TokenType enum value into a TokenCategory enum. Perhaps 
something

like:

enum TokenCategory { BuiltInType, Keyword, ... }

// Convert TokenType into TokenCategory
TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and 
switch

over that. This maximizes performance.

Implementation-wise, I'd suggest either a hash table on 
TokenType, or
perhaps even encoding the category into the TokenType enum 
values, for

example:

enum TokenCategory {
BuiltInType, Keyword, ...
}

enum TokenType {
IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
...
FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
}

Then the category function can be simply:

TokenCategory category(TokenType tt) {
return cast(TokenCategory)(tt >> 16);
}

Though admittedly, this is a bit hackish. But if you're going 
for speed,

this would be quite fast.


There are already plenty of hackish things in that module, so 
this would fit right in.



4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be
awkward if .. took on new meanings in D. Calling it 'dotdot' 
would
be clearer. Ditto for the rest. For example that is done 
better, '*'

is called 'star', rather than 'dereference'.


I agree. Especially since '*' can also mean multiplication, 
depending on
context. It would be weird (and unnecessarily obscure) for 
parser code

to refer to 'dereference' when parsing expressions. :)


If you read the docs/code you'll see that * is called star 
:-). The slice -> dotdot rename is pretty simple to do.


5. The LexerConfig initialization should be a constructor 
rather than
a sequence of assignments. LexerConfig documentation is 
awfully thin.

For example, 'tokenStyle' is explained as being 'Token style',
whatever that is.


I'm on the fence about this one. Setting up the configuration 
before

starting the lexing process makes sense to me.

Perhaps one way to improve this is to rename LexerConfig to 
DLexer, and

make byToken a member function (or call it via UFCS):

DLexer lex;
lex.iterStyle = ...;
// etc.

foreach (token; lex.byToken()) {
...
}

This reads better: you're getting a list of tokens from the 
lexer 'lex',
as opposed to getting something from byToken(config), which 
doesn't
really *say* what it's doing: is it tokenizing the config 
object?


byToken is a free function because its first parameter is the 
range to tokenize. This allows you to use UFCS on the range 
(e.g. sourceCode.byToken()). Putting it in a struct/class would 
break this.


6. No clue how lookahead works with this. Parsing D requires 
arbitrary

lookahead.


What has this got to do with lexing? The parser needs to keep 
track of
its state independently of the lexer anyway, doesn't it?  I'm 
not sure
how DMD's parser works, but the way I usually write parsers is 
that
their state encodes the series of tokens encountered so far, so 
they
don't need the lexer to save any tokens for them. If they need 
to refer
to tokens, they create partial AST trees on the parser stack 
that
reference said tokens. I don't see why it should be the lexer's 
job to

keep track of such things.


For parsing, you'll likely want to use array() to grab all the 
tokens. But there are other uses such as syntax highlighting that 
only need one token at a time.



9. Need to insert intra-page navigation links, such as when
'byToken()' appears in the text, it should be link to where 
byToken

is described.


Agreed.


I'll work on this later this evening.



[...]
I believe the state of the documentation is a showstopper, and 
needs
to be extensively fleshed out before it can be considered 
ready for

voting.


I think the API can be improved. The LexerConfig -> DLexer 
rename is an

important readability issue IMO.


I'm not convinced (yet?).

Also, it's unclear what types of input is supported -- the code 
example
only uses string input, but what about file input? Does it 
support
byLine? Or does it need me to slurp the entire file contents 
into an

in-memory buffer first?


The input is a range of bytes, which the lexer assumes is UTF-8. 
The lexer works much faster on arrays than it does on arbitrary 
ranges though. Dmitry Olshansky implemented a circular buffer in 
this module that's used to cache the 

Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Brian Schott
On Wednesday, 11 September 2013 at 20:37:11 UTC, Walter Bright 
wrote:

On 9/11/2013 8:01 AM, Dicebot wrote:
std.d.lexer is standard module for lexing D code, written by 
Brian Schott


One litmus test of the validity of the design is to see if one 
can build a D parser out of it.


Otherwise we'll likely face the unpleasant task of needing to 
deprecate std.d.lexer.


I'm not sure if you've followed the announce mailing list 
recently, but this lexer is already used in both the DScanner and 
DCD projects. There is a parser module that goes along with this 
lexer included in DScanner (and used by DCD as well). It's not 
ready for review though. I'm still working on improving its error 
recovery.


DScanner is a tool for D source that's able to generate 
syntax-highlighted HTML, perform syntax validation, generate the 
AST in XML format, generate CTAGS, list imports, and count lines 
of code.


DCD is an autocompletion client/server program that has 
integrations for several text editors.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Michel Fortin

On 2013-09-11 18:49:44 +, Walter Bright newshou...@digitalmars.com said:

6. No clue how lookahead works with this. Parsing D requires arbitrary 
lookahead.


Since this is a forward range, you can save() it at one location and 
continue parsing, and then restart parsing from the savepoint.

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324

This also means that the underlying range used as input to the lexer 
must be a forward range that you can save() too.
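A minimal sketch of that pattern, assuming `byToken` returns a forward range as described (`sourceCode`, `config`, and `speculationSucceeded` are placeholders):

```d
auto tokens = sourceCode.byToken(config);
auto saved = tokens.save;   // remember this position

tokens.popFront();          // speculate: consume some tokens
if (!speculationSucceeded)
    tokens = saved;         // rewind by restoring the savepoint
```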


--
Michel Fortin
michel.for...@michelf.ca
http://michelf.ca



Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
 On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
 2. The example uses an if-then sequence of isBuiltType, isKeyword,
 etc. Should be an enum so a switch can be done for speed.
 
 I believe this is probably a result of having a separate enum value
 of each token, and at the same time needing to group them into
 categories for syntax highlighting purposes. I'd suggest a function
 for converting the TokenType enum value into a TokenCategory enum.
 Perhaps something like:
 
  enum TokenCategory { BuiltInType, Keyword, ... }
 
  // Convert TokenType into TokenCategory
  TokenCategory category(TokenType tt) { ... }
 
 Then in user code, you call category() on the token type, and switch
 over that. This maximizes performance.
 
 Implementation-wise, I'd suggest either a hash table on TokenType, or
 perhaps even encoding the category into the TokenType enum values,
 for example:
 
  enum TokenCategory {
  BuiltInType, Keyword, ...
  }
 
  enum TokenType {
  IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
  FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
  ...
  FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
  }
 
 Then the category function can be simply:
 
  TokenCategory category(TokenType tt) {
  return cast(TokenCategory)(tt >> 16);
  }
 
 Though admittedly, this is a bit hackish. But if you're going for
 speed, this would be quite fast.
 
 There are already plenty of hackish things in that module, so this
 would fit right in.

Better hope it passes code review, though. :)


 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be
 awkward if .. took on new meanings in D. Calling it 'dotdot' would
 be clearer. Ditto for the rest. For example that is done better, '*'
 is called 'star', rather than 'dereference'.
 
 I agree. Especially since '*' can also mean multiplication, depending
 on context. It would be weird (and unnecessarily obscure) for parser
 code to refer to 'dereference' when parsing expressions. :)
 
 If you read the docs/code you'll see that * is called star :-).
 The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.


 5. The LexerConfig initialization should be a constructor rather
 than a sequence of assignments. LexerConfig documentation is awfully
 thin.  For example, 'tokenStyle' is explained as being 'Token
 style', whatever that is.
 
 I'm on the fence about this one. Setting up the configuration before
 starting the lexing process makes sense to me.
 
 Perhaps one way to improve this is to rename LexerConfig to DLexer,
 and make byToken a member function (or call it via UFCS):
 
  DLexer lex;
  lex.iterStyle = ...;
  // etc.
 
  foreach (token; lex.byToken()) {
  ...
  }
 
 This reads better: you're getting a list of tokens from the lexer
 'lex', as opposed to getting something from byToken(config), which
 doesn't really *say* what it's doing: is it tokenizing the config
 object?
 
 byToken is a free function because its first parameter is the range to
 tokenize. This allows you to use UFCS on the range (e.g.
 sourceCode.byToken()). Putting it in a struct/class would break this.

Very good point. In that case, would it be possible to make byToken
accept configuration parameters inline? That is:

auto tokens = File(mysource)
.byLine() // does this actually work??
.byToken(IterationStyle.everything, ...);

This seems better suited for UFCS-style chains, as opposed to needing to
create a separate config object outside the chain just so you can use it
inside. And writing LexerConfig(...) inline seems a bit cumbersome:

auto tokens = File(mysource)
.byLine() // does this actually work??
.byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.


 6. No clue how lookahead works with this. Parsing D requires
 arbitrary lookahead.
 
 What has this got to do with lexing? The parser needs to keep track
 of its state independently of the lexer anyway, doesn't it?  I'm not
 sure how DMD's parser works, but the way I usually write parsers is
 that their state encodes the series of tokens encountered so far, so
 they don't need the lexer to save any tokens for them. If they need
 to refer to tokens, they create partial AST trees on the parser stack
 that reference said tokens. I don't see why it should be the lexer's
 job to keep track of such things.
 
 For parsing, you'll likely want to use array() to grab all the
 tokens. But there are other uses such as syntax highlighting that
 only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser
doesn't need arbitrary slicing 

Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
 On 9/11/2013 12:55 PM, H. S. Teoh wrote:
 3. I assumed TokenType is a type. But it's not, it's an enum. Even
 the document says it's a 'type', but it's not a type.
 
 I don't understand this complaint. I thought it's pretty clear that
 it's referring to what type of token it is (or would you prefer
 TokenKind?).
 
 A type is a well-understood concept in D. This is using 'type' to
 mean something completely different. This is very jarring. Your idea
 of using 'kind' is vastly superior.

No, in typical D code, user-defined type names are capitalized, such as
InputRange, Duration, Complex, ByLine, etc.. They do not have 'Type' in
their names (we don't call them InputRangeType, for example). So this
usage doesn't conflict at all.


 5. The LexerConfig initialization should be a constructor rather
 than a sequence of assignments. LexerConfig documentation is awfully
 thin.  For example, 'tokenStyle' is explained as being 'Token
 style', whatever that is.
 
 I'm on the fence about this one. Setting up the configuration before
 starting the lexing process makes sense to me.
 
 Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just modify
one setting and use defaults for the rest; you have to specify all of
them up to the last default argument you have to change.
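The complaint can be made concrete (hypothetical field and enum names; not the actual LexerConfig layout):

```d
// Field assignment: name only the setting you change; defaults elsewhere.
LexerConfig cfg;
cfg.tokenStyle = TokenStyle.source;

// Positional constructor: every earlier default must be restated
// just to reach the one argument you care about.
auto cfg2 = LexerConfig(IterationStyle.codeOnly,
                        StringBehavior.compiler,
                        TokenStyle.source);
```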


 6. No clue how lookahead works with this. Parsing D requires
 arbitrary lookahead.
 
 What has this got to do with lexing?
 
 The lexer needs to support backing up and going forward again.
 That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been
seen so far is the parser's job, not the lexer's. LALR(1) parsers, for
example, deal with multiple similar-looking constructs without requiring
more than a single token lookahead. The trick is that you do not assume
a particular grammar production until it has been fully disambiguated.
The grammar, if not conducive to this, can usually be rearranged to make
it possible.  For example, if the grammar has two production rules:

StatementA ::= TokenA TokenB TokenC TokenD
StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell,
given TokenA in the current input, which type of statement it's going to
be. However, these rules can be rewritten like this:

StatementABPrefix ::= TokenA TokenB TokenC
StatementA ::= StatementABPrefix TokenD
StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no
backtracking is ever needed. Any information encoded by the first 3
tokens can be stored in the AST node of StatementABPrefix, so it's
recoverable without needing to save previous tokens.
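In a hand-written recursive-descent parser the same left-factoring looks like this (a sketch with placeholder names; `peek`, `consume`, and the node classes are invented for illustration):

```d
// Parse the shared prefix once; a single token of lookahead then decides.
Node parseStatementAB()
{
    auto prefix = parseABPrefix();      // TokenA TokenB TokenC
    if (peek() == TokenType.tokenD)
        return new StatementA(prefix, consume());
    else
        return new StatementB(prefix, consume());
}
```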

Unless, you're saying that D grammar is more complex than LALR(1)... If
so, I'd like to see an example of where this is the case.


 Also, it's unclear what types of input is supported -- the code
 example only uses string input, but what about file input? Does it
 support byLine? Or does it need me to slurp the entire file contents
 into an in-memory buffer first?
 
 The point of having the input be an InputRange is you DO NOT CARE if
 the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must
take. Must it be a range of characters? Am I allowed to pass in a range
of *strings* (representing source lines)? This needs to be clearly
stated.

The fact that tokens slice the input also needs to be clearly stated,
since that precludes the use of transient ranges like byLine (your token
values will be garbled by the time you get to them).
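One way to stay safe is to slurp the file into a stable buffer before lexing, instead of feeding the lexer a transient range (a sketch; `byToken` and `config` as used elsewhere in this thread):

```d
import std.file : readText;

auto source = readText("app.d");       // whole file in one stable buffer
auto tokens = source.byToken(config);  // tokens may slice `source` safely
```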


T

-- 
Let's call it an accidental feature. -- Larry Wall


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Manfred Nowak
Walter Bright wrote:

 Since the very beginning.
 
 One example is determining if something is a declaration or an
 expression. 

I see now, that you wrote about parsing---not about lexing.

Btw. I wrote an LALR(1)-parser for an early version of D. This means a 
lookahead of one was sufficient---or I made terrible mistakes. 

-manfred


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 1:44 PM, Brian Schott wrote:

I'm not sure if you've followed the announce mailing list recently, but this
lexer is already used in both the DScanner and DCD projects.


Awesome! This is just what I asked for.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Martin Nowak

On 09/11/2013 05:01 PM, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian Schott


Nice!


Please take this part seriously: If you identify problems along the
way, please note if they are minor, serious, or showstoppers.
(http://wiki.dlang.org/Review/Process). This information later
will be used to determine if library is ready for voting.

I only had a short look at the source code and was left with a good 
impression, i.e. it looks like a lexer written in D.


I do like how you made the lexer configurable to serve the different 
needs. Also it was very easy to setup and use.


The appended underscore for keywords is much nicer than messing with 
uppercase/lowercase which was what I did.


One thing that bothered me was the usage of const(ubyte)[] as input.
If you operate on UTF-8, it should take const(char)[]; you can cast it
to const(ubyte)[] internally.


Also a convenience function that reads a file and processes UTF BOM 
marks would be nice (see toUtf8 
https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), 
but that could as well fit somewhere else into phobos.



 Information request from module author 

Performance was a major discussed topic in previous thread. Could you
please provide benchmarking data for version currently ongoing the review?


Nice too, it works as fast as my D lexer 
(https://github.com/dawgfoto/lexer) which is about as fast as DMD's 
lexer (when compiled with LDC).


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Kapps
On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj 
wrote:

On 11.09.2013 20:49, Walter Bright wrote:

On 9/11/2013 8:01 AM, Dicebot wrote:
std.d.lexer is standard module for lexing D code, written by 
Brian Schott


Thank you, Brian! This is important work.

Not a thorough review, just some notes from reading the doc 
file:


1. I don't like the _ suffix for keywords. Just call it 
kwimport or

something like that.

8. 'default_' Again with the awful practice of appending _.


Delphi designers realized this problem years ago and they came 
up with a solution: 
http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers


Basically, Delphi allows escaping reserved identifiers with a 
'&'. I wonder how D solves that problem when interfacing to COM 
classes if they have for example a function named scope.


C# allows this as well. For example, you can have:
void MyVoid(string @int) { ... }
The name is actually int for purposes of reflection and the like, 
the @ just tells the compiler that it's not a keyword.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Piotr Szturmaj

On 12.09.2013 01:22, Kapps wrote:

On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:

On 11.09.2013 20:49, Walter Bright wrote:

On 9/11/2013 8:01 AM, Dicebot wrote:

std.d.lexer is standard module for lexing D code, written by Brian
Schott


Thank you, Brian! This is important work.

Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or
something like that.

8. 'default_' Again with the awful practice of appending _.


Delphi designers realized this problem years ago and they came up with
a solution:
http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers


Basically, Delphi allows escaping reserved identifiers with a '&'. I
wonder how D solves that problem when interfacing to COM classes if
they have for example a function named scope.


C# allows this as well. For example, you can have:
void MyVoid(string @int) { ... }
The name is actually int for purposes of reflection and the like, the @
just tells the compiler that it's not a keyword.


Yes, and this is important to actually have the real name without a '_' 
or any other prefix or suffix. Is there an enhancement proposal for this 
in the bugzilla?


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 3:09 PM, Michel Fortin wrote:

On 2013-09-11 18:49:44 +, Walter Bright newshou...@digitalmars.com said:


6. No clue how lookahead works with this. Parsing D requires arbitrary 
lookahead.


Since this is a forward range, you can save() it at one location and continue
parsing, and then restart parsing from the savepoint.
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324

This also means that the underlying range used as input to the lexer must be a
forward range that you can save() too.


That's correct, but that implies re-lexing the tokens, which has negative 
performance implications.
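For readers following along, here is a minimal sketch of the save()-based backtracking under discussion, using a plain array as a stand-in for the token range (this is not the actual std.d.lexer API). Walter's objection is that if save() checkpoints the lazy lexing range itself, restoring it means every token between the checkpoint and the cursor gets lexed a second time:

```d
import std.range;

void main()
{
    auto tokens = ["if", "(", "x", ")"]; // stand-in for a token range
    auto saved = tokens.save;            // checkpoint before speculating
    tokens.popFront();
    tokens.popFront();
    assert(tokens.front == "x");
    tokens = saved;                      // backtrack to the checkpoint
    assert(tokens.front == "if");
}
```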




Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 3:02 PM, H. S. Teoh wrote:

No, in typical D code, user-defined type names are capitalized, such as
InputRange, Duration, Complex, ByLine, etc.. They do not have 'Type' in
their names (we don't call them InputRangeType, for example). So this
usage doesn't conflict at all.


It's not a type as is understood in D. It's an enumerated value.



Constructors with many arguments are evil, because you can't just modify
one setting and use defaults for the rest; you have to specify all of
them up to the last default argument you have to change.


It's not that big a deal, and certainly not evil.



6. No clue how lookahead works with this. Parsing D requires
arbitrary lookahead.


What has this got to do with lexing?


The lexer needs to support backing up and going forward again.
That's how the dmd lexer works.


Then the parser should be fixed.


The parser works fine. It doesn't need fixing.



The point of having the input be an InputRange is you DO NOT CARE if
the input is coming from a string, file, etc.


Yes, but the docs currently don't say what form this input range must
take. Must it be a range of characters?


It does say that.



Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread H. S. Teoh
On Wed, Sep 11, 2013 at 04:42:34PM -0700, Walter Bright wrote:
 On 9/11/2013 3:02 PM, H. S. Teoh wrote:
[...]
 Yes, but the docs currently don't say what form this input range
 must take. Must it be a range of characters?
 
 It does say that.

But then the code example proceeds to pass byLine() to it. Is that
correct? If it is, then the docs need to be updated, because last time I
checked, byLine() isn't a range of char, but a range of char *arrays*.


T

-- 
Real men don't take backups. They put their source on a public
FTP-server and let the world mirror it. -- Linus Torvalds


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Brian Schott

On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
But then the code example proceeds to pass byLine() to it. Is that
correct? If it is, then the docs need to be updated, because last time I
checked, byLine() isn't a range of char, but a range of char *arrays*.



T


The example doesn't pass the result of byLine to the byToken 
function directly.
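For illustration, one way the example can bridge the two: flatten byLine's char[] lines into a single character range before handing it to the lexer. The commented byToken call is hypothetical, since the module's exact entry point is what is under review:

```d
import std.algorithm : joiner;
import std.array : array;
import std.file : write;
import std.stdio : File, KeepTerminator;

void main()
{
    write("example.d", "int x;\n"); // stand-in source file

    // byLine yields char[] lines, not single chars; joiner flattens
    // them, and KeepTerminator.yes preserves the newlines the lexer
    // needs for correct line numbers.
    auto source = File("example.d").byLine(KeepTerminator.yes)
                                   .joiner
                                   .array;
    assert(source == "int x;\n");
    // auto tokens = source.byToken(); // hypothetical std.d.lexer call
}
```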


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread deadalnix
On Wednesday, 11 September 2013 at 19:46:25 UTC, Brian Schott 
wrote:
Yeah. D requires lookahead in both lexing and parsing. Some 
examples:


* String literals such as q{}
* If statements need to determine the difference between if 
(expression) and if (type identifier = expression)
* Determining if something is a lambda expression or a function
literal expression
* Determining if something is a declaration or a statement or 
an expression.


Does not require lookahead.

* Differentiating between (type).identifier and a lambda 
expression


Yup !

* Differentiating between a type and an expression inside a 
typeid expression


Does not require lookahead.

* Differentiating between associative array key type and tuple 
slices in type suffixes


ditto.
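For reference, the `if (type identifier = expression)` case from the list above: both forms begin with the same tokens, so the parser cannot tell them apart without scanning ahead:

```d
class Base {}
class Derived : Base {}

void main()
{
    int x = 1;

    // if (expression): the condition is an ordinary expression
    if (x + 1) {}

    // if (type identifier = expression): declares p, then tests it;
    // the parser only learns which form this is by looking ahead
    Base obj = new Derived;
    if (auto p = cast(Derived) obj)
        assert(p !is null);
}
```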


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread deadalnix
On Wednesday, 11 September 2013 at 23:44:55 UTC, Walter Bright 
wrote:

On 9/11/2013 3:09 PM, Michel Fortin wrote:
On 2013-09-11 18:49:44 +, Walter Bright 
newshou...@digitalmars.com said:


6. No clue how lookahead works with this. Parsing D requires 
arbitrary lookahead.


Since this is a forward range, you can save() it at one location and
continue parsing, and then restart parsing from the savepoint.
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324

This also means that the underlying range used as input to the lexer
must be a forward range that you can save() too.


That's correct, but that implies re-lexing the tokens, which 
has negative performance implications.


Indeed. What solution do you have in mind ?


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread deadalnix

On Wednesday, 11 September 2013 at 20:28:06 UTC, H. S. Teoh wrote:

On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh 
wrote:

On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh 
wrote:
I disagree. I think it's more readable to use a consistent prefix,
like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
clear you're referring to token types, not the actual keyword.


Not unless you want to change the style guide and break 
existing

Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use
std.d.lexer right now.

Phobos code must conform to its style guide. You can't change it
without changing existing Phobos code that relies on it.
Inconsistent style is the worst of all options.


This doesn't violate Phobos style guidelines:

enum TokenType {
kwInt,
kwFloat,
kwDouble,
...
kwFunction,
kwScope,
... // etc.
}



Int, Function, Scope, Import are all valid identifiers.

Random abbreviations like kw are really bad. It is even worse when
they don't make anything shorter.


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Walter Bright

On 9/11/2013 6:30 PM, deadalnix wrote:

Indeed. What solution do you have in mind ?


The solution dmd uses is to put in an intermediary layer that saves the 
lookahead tokens in a linked list.
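A rough sketch of that intermediary layer: dmd keeps the lookahead tokens in a linked list; a slice is used here for brevity, and all names are assumptions rather than dmd's actual code:

```d
import std.range;

// Tokens lexed past the cursor are buffered so the parser can peek
// arbitrarily far ahead without re-lexing the source.
struct BufferedLexer(R)
{
    R source;                 // underlying token producer
    ElementType!R[] buffer;   // tokens lexed but not yet consumed

    // Ensure at least n+1 tokens are buffered, then return the n-th.
    auto peek(size_t n)
    {
        while (buffer.length <= n && !source.empty)
        {
            buffer ~= source.front;
            source.popFront();
        }
        return buffer[n];
    }

    // Consume and return the front token.
    auto next()
    {
        auto t = peek(0);
        buffer = buffer[1 .. $];
        return t;
    }
}

void main()
{
    auto lex = BufferedLexer!(string[])(["if", "(", "x", ")"]);
    assert(lex.peek(2) == "x"); // arbitrary lookahead, nothing re-lexed
    assert(lex.next() == "if"); // consumption still proceeds in order
}
```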


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread deadalnix
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright 
wrote:

On 9/11/2013 6:30 PM, deadalnix wrote:

Indeed. What solution do you have in mind ?


The solution dmd uses is to put in an intermediary layer that 
saves the lookahead tokens in a linked list.


But then you have an extra step when looking up every token, plus
memory management overhead. How big is the performance improvement?


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Jonathan M Davis
On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
 Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must 
start with a lowercase letter), which is why we went with adding an _ on the 
end. And it's pretty much as close as you can get to the keyword without 
actually using the keyword, which is a plus IMHO (though from the sounds of 
it, H.S. Teoh would consider that a negative due to possible confusion with 
the keyword).

- Jonathan M Davis
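For illustration, the trailing-underscore convention Jonathan describes (member names here are assumptions, not necessarily the module's actual enum):

```d
// Keyword token names: lowercase first letter per the Phobos style
// guide, with a trailing _ because the bare keyword is not a legal
// identifier.
enum TokenType
{
    int_,      // the `int` keyword
    default_,  // the `default` keyword
    identifier,
}

void main()
{
    auto t = TokenType.default_;
    assert(t != TokenType.int_);
    // reflection sees the underscored name, so tools can strip the suffix
    assert(__traits(allMembers, TokenType)[1] == "default_");
}
```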


Re: std.d.lexer: pre-voting review / discussion

2013-09-11 Thread Martin Nowak

On 09/12/2013 03:30 AM, deadalnix wrote:


That's correct, but that implies re-lexing the tokens, which has
negative performance implications.


Indeed. What solution do you have in mind ?


Buffering the tokens would work. There are some ways to promote input 
ranges to forward ranges, but there are also some pitfalls, like the 
implicit save on copy.


I have two prototypes for a generic input range buffer.

https://gist.github.com/dawgfoto/2187220 - uses growing ring buffer
https://gist.github.com/dawgfoto/1257196 - uses ref counted lookahead 
buffers in a singly linked list


The lexer itself has a ringbuffer for input ranges.
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2278
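A minimal fixed-capacity ring buffer in the spirit of that internal buffer (the layout and names are assumptions, not the lexer's actual implementation):

```d
struct RingBuffer(T, size_t capacity)
{
    T[capacity] data;
    size_t head, count;

    bool empty() const { return count == 0; }
    bool full()  const { return count == capacity; }

    // append at the logical end, wrapping around the fixed storage
    void push(T value)
    {
        assert(!full);
        data[(head + count) % capacity] = value;
        ++count;
    }

    // remove from the logical front
    T pop()
    {
        assert(!empty);
        auto v = data[head];
        head = (head + 1) % capacity;
        --count;
        return v;
    }
}

void main()
{
    RingBuffer!(char, 4) buf;
    foreach (c; "lex") buf.push(c);
    assert(buf.pop() == 'l');
    buf.push('!');               // reuses the freed slot
    assert(buf.pop() == 'e');
}
```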

