Re: Formal review of std.lexer

2014-05-05 Thread Dmitry Olshansky via Digitalmars-d

17-Mar-2014 02:13, Martin Nowak wrote:

On 02/22/2014 09:31 PM, "Marc Schütz" wrote:

But that still doesn't explain why a custom hash table implementation is
necessary. Maybe a lightweight wrapper around built-in AAs is sufficient?


I'm also wondering what benefit this hash table provides.


Getting back to this. The custom hash map was originally a product of 
optimization; the benefits over built-in AAs are:

a) Allocation was amortized by allocating nodes in batches.
b) A custom hash function could be used with the built-in string type.

Not sure how much of that stands today.
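
For comparison, the lightweight wrapper around built-in AAs suggested 
above might look like the sketch below; it gives up both (a) and (b), 
but is far less code. Names here are illustrative, not a proposed API.

struct AAStringCache
{
    private string[const(ubyte)[]] interned;

    // Return the canonical copy of the given byte sequence,
    // allocating it on first sight.
    string intern(const(ubyte)[] bytes)
    {
        if (auto hit = bytes in interned)
            return *hit;
        auto copy = bytes.idup;   // one allocation per distinct string
        auto s = cast(string) copy;
        interned[copy] = s;
        return s;
    }
}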

--
Dmitry Olshansky


Re: Formal review of std.lexer

2014-04-15 Thread Alix Pexton via Digitalmars-d

On 14/04/2014 10:34 PM, Brian Schott wrote:

ubyte[] is required for the lexer to work quickly. That lexer range is
designed to keep track of the column and line numbers.


I can understand that speed requires the input to be handled as bytes 
instead of chars, but the restriction to a ubyte[] over a 
random-access range seems un-Dlang-y to me.


If the LexerRange is only there to optionally add line/column numbering 
support, then I think it needs a much more descriptive name and much 
better documentation.



That function's purpose is to determine if the current code unit
signifies the end of an identifier/keyword. When lexing "fortunate", the
lexer would spot "for", and then call the separating function with the
index of "u". In the case of D, it would say that this does NOT end a
word, and "fortunate" should be lexed as an identifier. If it was called
with an index of a space or parenthesis, it should return true.


Somehow I had skipped the salient part of the documentation for this, so 
my apologies (at first I thought that the DDoc output I was reading must 
have been out of sync with the source (it was, but not that much), but 
further investigations suggest some form of temporary blindness).


This description squares with what I had surmised from reading the code, 
and I can see why it would be more efficient than the common technique 
of comparing every matched identifier to the list of keywords. I do 
wonder, however, whether there might be another way to attain the same 
efficiency without the need for the separating function (I should have 
replied to this last night when my ideas on the matter were clearer; 
sleep seems to have stolen my theoretical alternative ><).


I'm also curious about the decision that the signature of the separating 
function should take the offset of the character that needs to be 
tested. A more Dlang-y thing to do, in my mind, would be to pass a range 
that begins at the first character after the matched keyword.



If more than one prefix is dispatched to the same handler, the handler
cannot blindly call popFront or popFrontN. If the lexer author knows
that the handler will only be called in certain places, then checking
!empty is a waste of cycles. I don't think it's possible for the lexer
generator to enforce this.


I don't think the lexer needs to do anything extra to ensure that a 
handler can begin its work both without repeating calls to .empty and 
without blindly calling popFront. To do so requires that the input be 
treated as a forward range. Before reading a token, store the .save of 
the input, then advance the input as the token is matched, counting the 
consumed elements. When the handler is called it will have the option of 
including the token by adding to the count and returning a slice that 
begins at the .save'd position, or ignoring the length of the match and 
returning a slice that begins at the position after the matched token.
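
A sketch of that technique (illustrative names; assumes a forward range 
of code units):

import std.ascii : isAlphaNum;
import std.range : isForwardRange, takeExactly;

auto matchWord(R)(ref R input) if (isForwardRange!R)
{
    auto start = input.save; // position before the match
    size_t consumed;
    while (!input.empty && isAlphaNum(input.front))
    {
        input.popFront();
        ++consumed;
    }
    // Include the whole match in the returned "slice"...
    return start.takeExactly(consumed);
    // ...or ignore the count and slice from the current position
    // instead, excluding the matched prefix.
}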


I strongly believe that a model that requires the user to reason about a 
library providing ranges that are "logically !empty" is a misstep.



XML doesn't seem to have very distinct lexing/parsing phases like JSON
markup or Java code does. This lexer generator may not be the best way
to deal with XML.


I don't really know how the grammar in the XML standard compares to 
others in general. It is certainly more convoluted than the published 
grammar for Dlang, but we all know that that doesn't quite match the 
language. Some criticise XML for being too complicated, but in some 
respects it is even more complicated than that; it seems ironic that it 
was supposed to be simple.


But I might have typed too soon when I wrote that this lexer might not 
be able to fully tokenize XML in a single pass. As I was drifting off to 
sleep last night I had the idea of using the extraFields of 
TokenStructure to add a token pointer (or equivalent construct) to the 
tokens, making a singly linked list and allowing me to return multiple 
tokens from a token handler. I just need to convince myself that it is a 
sane thing to do.
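
A sketch of that idea, assuming the proposed module's TokenStructure, 
which takes a declaration string as its extraFields argument (IdType 
stands in for whatever token id type the lexer uses):

alias XmlToken = TokenStructure!(IdType, q{
    /// Further token(s) produced by the same handler call, if any,
    /// forming a singly linked list.
    typeof(this)* next;
});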



If you have specific problems, please list them.


I think the calculator example should be improved or replaced so that 
the documentation's first example uses all the string[]s that can be 
passed to the lexer. Perhaps a simple symbolic calculator that accepts 
letter sequences as variables and reserves pi as a keyword?
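
For instance, the constants for such a calculator might look like the 
following (hypothetical, abridged token sets, just to show each of the 
string[] parameters in use):

enum string[] staticTokens = ["+", "-", "*", "/", "(", ")", "="];
enum string[] dynamicTokens = ["number", "identifier", "whitespace"];
enum string[] possibleDefaultTokens = ["pi"]; // keyword shaped like an identifier
enum string[] tokenHandlers = [ // prefix/handler-name pairs, abridged
    "0", "lexNumber",
    " ", "lexWhitespace",
];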


I think that the section describing the token handlers needs to fully 
document the primitives the user has access to for lexing a token, the 
state to expect them to be in when a handler is called, and the state 
they should be left in when it returns.


If tracking lines and columns is optional and supported only when 
wrapping the input with LexerRange, then I think that by default Tokens 
should not contain fields for them. Perhaps an enum qstring that 
declares the required fields could be introduced and an example shown 
where it is concatenated with user declarations to be passed in vi

Re: Formal review of std.lexer

2014-04-14 Thread Brian Schott

On Monday, 14 April 2014 at 20:43:58 UTC, Alix Pexton wrote:
I know the official review period is long past but I'd not had 
a good look at this module until this past weekend.


Last year I had been working on my own xml lexer/parser, but as yet I 
have nothing to show for it, so I took a look at this proposal with an 
eye towards using it to make my efforts easier.


Andrei's posts about the possible design of a generic lexer had 
also influenced me, so I was expecting to find similarities 
between this module and my own work, albeit with the added 
benefits of being generic (in the good way). I have, however, 
found it very difficult to understand much of it, which I 
entirely put down to my own deficiencies with templates and 
especially the use of mixins.


In the example Dlang lexer, the constructor takes a ubyte[] as 
input and wraps it in a LexerRange struct which defines the 
normal input range primitives as well as various functions for 
lookahead. It is not documented whether the lexer needs these 
extra features or if they are only provided for use within the 
tokenising functions that are supplied to the template by the 
user. If they are used by the core of the lexer then it would 
seem to preclude the use of any other type of input that cannot 
be coerced into a ubyte[] without the effort on the part of the 
user to implement the same interface.


ubyte[] is required for the lexer to work quickly. That lexer 
range is designed to keep track of the column and line numbers.


I think the description of the functionality required of the 
tokenSeparatingFunction that the user must supply needs to be much 
better. If I understand correctly, it is intended to differentiate 
between keywords and identifiers which begin with a keyword. The more 
I think about this the less certain I am.


That function's purpose is to determine if the current code unit 
signifies the end of an identifier/keyword. When lexing 
"fortunate", the lexer would spot "for", and then call the 
separating function with the index of "u". In the case of D, it 
would say that this does NOT end a word, and "fortunate" should 
be lexed as an identifier. If it was called with an index of a 
space or parenthesis, it should return true.
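
In code, the decision looks something like this minimal sketch for a 
D-like language, assuming ASCII input (per the docs the actual 
parameter is an offset into the input rather than the code unit 
itself, and the name here is illustrative):

bool isSeparating(ubyte c)
{
    // '_', letters, and digits can continue an identifier, so they
    // do NOT end the word: "for" followed by 'u' keeps lexing as
    // the identifier "fortunate".
    if (c == '_'
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9'))
        return false;
    return true; // space, '(', ';', and the like end the word
}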


When the lexer dispatches to a token handler, the front of the 
range is left pointing to the beginning of the character 
sequence that was matched, allowing it to be included in the 
returned token. However, many of the handlers in the example 
Dlang lexer begin with a number of blind popFront calls to jump 
to the end of that match. I am aware that in well-meaning code 
this is a case of the range being logically !empty, but I also 
wonder how often it might get overlooked when two matches of 
different lengths are dispatched to the same handler. (I had a 
similar situation in my own code, and my solution was to have a 
variable storing the .save of my inputRange and count how many 
chars were consumed since it was updated. This way I could 
either return the whole match or part of it in the token or 
discard it and include only what came after it.) As there has 
been some contention about the correct use of the range 
primitives of late, I will refrain from making any other 
comment on their use in this module, especially as I am no 
longer sure that I have been using them correctly myself.


If more than one prefix is dispatched to the same handler, the 
handler cannot blindly call popFront or popFrontN. If the lexer 
author knows that the handler will only be called in certain 
places, then checking !empty is a waste of cycles. I don't think 
it's possible for the lexer generator to enforce this.
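
A self-contained illustration of the trade-off (hypothetical names; 
not the module's API):

struct Bytes
{
    const(ubyte)[] data;
    bool empty() const { return data.length == 0; }
    ubyte front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

// Blind skip: correct only when the caller guarantees that n bytes
// remain, e.g. because every prefix dispatched to the handler is
// exactly n bytes long.
void skipBlind(ref Bytes r, size_t n)
{
    foreach (i; 0 .. n)
        r.popFront(); // range error if fewer than n bytes remain
}

// Checked skip: always safe, but pays the emptiness test that is
// wasted cycles whenever the guarantee already holds.
void skipChecked(ref Bytes r, size_t n)
{
    foreach (i; 0 .. n)
    {
        if (r.empty) break;
        r.popFront();
    }
}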


In the short time that I have been looking at the features of 
this lexer I have not been able to figure out a way of writing 
a standards-compliant XML parser without having to lex some 
parts of the document at least twice, or subverting the token 
handlers to change behaviour according to context. Several 
non-compliant single-pass XML lexers would be possible, but 
they would not be able to process documents that used some 
(admittedly obscure and often overlooked) features. The only 
scalable technique that I can think of to allow XML to be lexed 
in a single pass in a fully spec-compliant way would be to 
allow handlers to return multiple tokens. I am not sure how 
feasible this would be or what mechanism would be best to 
implement it.


XML doesn't seem to have very distinct lexing/parsing phases like 
JSON markup or Java code does. This lexer generator may not be 
the best way to deal with XML.


On the whole I think the overall design of the module shows 
promise but requires polish to make it both more idiomatically 
Dlang-y and easier for the user to build upon (both by 
documentation and interface).


If you have specific problems, please list them.

On a side note, related to the example lexer for Dlang, I believe 
the predicate function isDocComment will produce false 
positives for the

Re: Formal review of std.lexer

2014-04-14 Thread Brian Schott

On Sunday, 16 March 2014 at 22:13:16 UTC, Martin Nowak wrote:

On 02/22/2014 09:31 PM, "Marc Schütz" wrote:
But that still doesn't explain why a custom hash table 
implementation is
necessary. Maybe a lightweight wrapper around built-in AAs is 
sufficient?


I'm also wondering what benefit this hash table provides.


The hash table was added by Dmitry a while back. I assume it's 
because he had numbers to back it up.




Re: Formal review of std.lexer

2014-04-14 Thread Alix Pexton

On 21/02/2014 12:12 PM, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module
focuses on generic lexer generation, as discussed in the matching
voting thread.

Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

Example topics to evaluate during review:
  - Feasibility of overall design and concept
  - Quality of documentation
  - Consistency with existing Phobos modules
  - Overall quality of implementation

Initial review will end on March the 8th.



I know the official review period is long past but I'd not had a good 
look at this module until this past weekend.


Last year I had been working on my own xml lexer/parser, but as yet I 
have nothing to show for it, so I took a look at this proposal with an 
eye towards using it to make my efforts easier.


Andrei's posts about the possible design of a generic lexer had also 
influenced me, so I was expecting to find similarities between this 
module and my own work, albeit with the added benefits of being generic 
(in the good way). I have, however, found it very difficult to 
understand much of it, which I entirely put down to my own deficiencies 
with templates and especially the use of mixins.


In the example Dlang lexer, the constructor takes a ubyte[] as input and 
wraps it in a LexerRange struct which defines the normal input range 
primitives as well as various functions for lookahead. It is not 
documented whether the lexer needs these extra features or if they are 
only provided for use within the tokenising functions that are supplied 
to the template by the user. If they are used by the core of the lexer 
then it would seem to preclude the use of any other type of input that 
cannot be coerced into a ubyte[] without the effort on the part of the 
user to implement the same interface.


I think the description of the functionality required of the 
tokenSeparatingFunction that the user must supply needs to be much 
better. If I understand correctly, it is intended to differentiate 
between keywords and identifiers which begin with a keyword. The more 
I think about this the less certain I am.


When the lexer dispatches to a token handler, the front of the range is 
left pointing to the beginning of the character sequence that was 
matched, allowing it to be included in the returned token. However, many 
of the handlers in the example Dlang lexer begin with a number of blind 
popFront calls to jump to the end of that match. I am aware that in 
well-meaning code this is a case of the range being logically !empty, 
but I also wonder how often it might get overlooked when two matches of 
different lengths are dispatched to the same handler. (I had a similar 
situation in my own code, and my solution was to have a variable storing 
the .save of my inputRange and count how many chars were consumed since 
it was updated. This way I could either return the whole match or part 
of it in the token or discard it and include only what came after it.) 
As there has been some contention about the correct use of the range 
primitives of late, I will refrain from making any other comment on 
their use in this module, especially as I am no longer sure that I have 
been using them correctly myself.


In the short time that I have been looking at the features of this lexer 
I have not been able to figure out a way of writing a standards-compliant 
XML parser without having to lex some parts of the document at least 
twice, or subverting the token handlers to change behaviour according to 
context. Several non-compliant single-pass XML lexers would be possible, 
but they would not be able to process documents that used some 
(admittedly obscure and often overlooked) features. The only scalable 
technique that I can think of to allow XML to be lexed in a single pass 
in a fully spec-compliant way would be to allow handlers to return 
multiple tokens. I am not sure how feasible this would be or what 
mechanism would be best to implement it.


On the whole I think the overall design of the module shows promise 
but requires polish to make it both more idiomatically Dlang-y and 
easier for the user to build upon (both by documentation and interface).


On a side note, related to the example lexer for Dlang, I believe the 
predicate function isDocComment will produce false positives for the 
following comment delimiters, which to my knowledge are not valid DDoc 
delimiters...


//* //+ /*+ /*/ /+* /+/

As the Dlang lexer is not part of the review proper I have not inspected 
it carefully; this function just happens to be the first one declared in 
that example.


Again, my apologies for the tardiness of this review.

A...


Re: Formal review of std.lexer

2014-03-16 Thread Martin Nowak

On 02/22/2014 09:31 PM, "Marc Schütz" wrote:

But that still doesn't explain why a custom hash table implementation is
necessary. Maybe a lightweight wrapper around built-in AAs is sufficient?


I'm also wondering what benefit this hash table provides.


Re: Formal review of std.lexer

2014-03-16 Thread Dicebot
Initial review has finished. Voting will be delayed because Brian 
is currently busy and Walter's ongoing scopebuffer proposal needs 
to be processed first (per agreement with both Brian and Walter).


Anyone late for the review can still leave comments; I am sure Brian 
will take them into consideration when making last-minute changes 
before voting.


Re: Formal review of std.lexer

2014-03-10 Thread Dicebot
On Wednesday, 26 February 2014 at 18:07:37 UTC, Jacob Carlborg 
wrote:

On 2014-02-26 00:25, Dicebot wrote:

Don't know if it makes sense to introduce random package categorization. 
I'd love to see more hierarchy in Phobos too, but we'd first need to 
agree on package separation principles.


Then that's what we need to do. I don't want any more top level 
modules. There are already too many.


As much as I hate to say it, such a hierarchy is worth a DIP. 
Once it is formalized, I can proceed with it in the review queue 
as if it were a new module proposal.


Re: Formal review of std.lexer

2014-03-10 Thread Dicebot

Reminder about benchmarks.

By the way, is the generated lexer usable at CTFE? Imaginary use 
case: easier DSL implementation.


Re: Formal review of std.lexer

2014-03-03 Thread Dicebot

Bringing this back to the front page.


Re: Formal review of std.lexer

2014-02-26 Thread Jacob Carlborg

On 2014-02-26 00:25, Dicebot wrote:


Don't know if it makes sense to introduce random package categorization.
I'd love to see more hierarchy in Phobos too, but we'd first need to
agree on package separation principles.


Then that's what we need to do. I don't want any more top level modules. 
There are already too many.


--
/Jacob Carlborg


Re: Formal review of std.lexer

2014-02-26 Thread Jacob Carlborg

On 2014-02-26 16:18, Andrei Alexandrescu wrote:


I think we wouldn't want to add one more package for each language
supported.


That's exactly what we want, preferably in a common package:

std.language.d.lexer
std.language.d.ast
std.language.d.parser

std.language.xml.lexer
std.language.xml.parser
std.language.xml.dom

What do you suggest, having multiple lexers for different languages in 
the same module?


--
/Jacob Carlborg


Re: Formal review of std.lexer

2014-02-26 Thread Jacob Carlborg

On 2014-02-26 07:58, Brian Schott wrote:


I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d


I agree with Brian. Although I would have a common package for all 
languages:


std.language.d.lexer
std.language.d.ast
std.language.d.parser

--
/Jacob Carlborg


Re: Formal review of std.lexer

2014-02-26 Thread Andrei Alexandrescu

On 2/25/14, 10:58 PM, Brian Schott wrote:

On Tuesday, 25 February 2014 at 23:17:56 UTC, Meta wrote:

std.lexer could be the umbrella for a bunch of different lexers. Then
we could have std.lexer.xml, std.lexer.json, etc.


I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d


I think we wouldn't want to add one more package for each language 
supported.


Andrei



Re: Formal review of std.lexer

2014-02-26 Thread Dicebot
On Wednesday, 26 February 2014 at 07:00:33 UTC, Brian Schott 
wrote:

On Monday, 24 February 2014 at 19:05:35 UTC, Dicebot wrote:
Brian, do you have benchmarks for this proposal similar to 
ones you have provided in old review threads? (vs DMD frontend 
lexer)


Not yet, but I can create them before the review period ends.


Thanks, that will be very interesting information to consider in 
the context of DDMD.


Re: Formal review of std.lexer

2014-02-25 Thread Brian Schott

On Tuesday, 25 February 2014 at 23:17:56 UTC, Meta wrote:
std.lexer could be the umbrella for a bunch of different 
lexers. Then we could have std.lexer.xml, std.lexer.json, etc.


I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d


Re: Formal review of std.lexer

2014-02-25 Thread Brian Schott

On Monday, 24 February 2014 at 19:05:35 UTC, Dicebot wrote:
Brian, do you have benchmarks for this proposal similar to ones 
you have provided in old review threads? (vs DMD frontend lexer)


Not yet, but I can create them before the review period ends.


Re: Formal review of std.lexer

2014-02-25 Thread Brian Schott
On Tuesday, 25 February 2014 at 20:21:17 UTC, Andrei Alexandrescu 
wrote:
I think it would be great if we rigged things such that 
std.lexer simplifies flint and makes it faster. From what I saw 
Brian has a very similar approach. I'm hoping for a few 
improvements, which I'll share soon.


Andrei


I'll be waiting for this list. In the meantime, I'm making a 
change to the string interning process to make it more threading 
friendly.


Re: Formal review of std.lexer

2014-02-25 Thread Dicebot
On Tuesday, 25 February 2014 at 20:48:08 UTC, Jacob Carlborg 
wrote:
I don't like it being a new top level module. I would name it 
std.language.lexer or std.lang.lexer.


Don't know if it makes sense to introduce random package 
categorization. I'd love to see more hierarchy in Phobos too, but 
we'd first need to agree on package separation principles.


Re: Formal review of std.lexer

2014-02-25 Thread Meta
On Tuesday, 25 February 2014 at 20:48:08 UTC, Jacob Carlborg 
wrote:

On 2014-02-21 13:12, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal 
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed 
module focuses on generic lexer generation, as discussed in the 
matching voting thread.


I don't like it being a new top level module. I would name it 
std.language.lexer or std.lang.lexer.


std.lexer could be the umbrella for a bunch of different lexers. 
Then we could have std.lexer.xml, std.lexer.json, etc.


Re: Formal review of std.lexer

2014-02-25 Thread Jacob Carlborg

On 2014-02-21 13:12, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module
focuses on generic lexer generation, as discussed in the matching
voting thread.


I don't like it being a new top level module. I would name it 
std.language.lexer or std.lang.lexer.


--
/Jacob Carlborg


Re: Formal review of std.lexer

2014-02-25 Thread Andrei Alexandrescu

On 2/24/14, 7:18 PM, Adam Wilson wrote:

Note that as badly as I want std.lexer to be included, I want it to
pass a rigorous review. This review (and, if passing, subsequent
inclusion) has opened up an opportunity to start using D at work that I
did not expect, and so I am kind of excited about it.


I think it would be great if we rigged things such that std.lexer 
simplifies flint and makes it faster. From what I saw Brian has a very 
similar approach. I'm hoping for a few improvements, which I'll share soon.


Andrei




Re: Formal review of std.lexer

2014-02-25 Thread Dicebot

On Tuesday, 25 February 2014 at 03:52:01 UTC, H. S. Teoh wrote:
Sadly, I haven't had the time to review the proposed code as closely as 
I'd like.


Don't hesitate to ask for an extension of the review period if you need 
it. Making a good review is more important than meeting some arbitrary 
deadline.


Re: Formal review of std.lexer

2014-02-24 Thread H. S. Teoh
On Mon, Feb 24, 2014 at 07:18:58PM -0800, Adam Wilson wrote:
[...]
> Note that as badly as I want std.lexer to be included, I want it to
> pass a rigorous review. This review (and, if passing, subsequent
> inclusion) has opened up an opportunity to start using D at work
> that I did not expect, and so I am kind of excited about it.
[...]

Me too, I'm looking forward to a well-designed lexer generator in
Phobos. I will have lots of uses for it.

Sadly, I haven't had the time to review the proposed code as closely as
I'd like.


T

-- 
What doesn't kill me makes me stranger.


Re: Formal review of std.lexer

2014-02-24 Thread Adam Wilson

On Mon, 24 Feb 2014 17:22:32 -0800, Meta  wrote:


On Tuesday, 25 February 2014 at 00:28:26 UTC, Adam Wilson wrote:

[SNIP]


You're throwing what I said way out of proportion. I was replying to the  
statement:


"No criticism should stop this module being accepted, as we do not
have any other lexer in the runtime anyway. Therefore I suggest
we accept std.lexer until a better solution comes up."

I don't agree with this. Obviously std.lexer is well-written and has  
been through a few rounds of iteration, but that doesn't mean we should  
accept it without due diligence to ensure that we won't be regretting  
some overlooked, poorly-designed or badly-named piece of functionality  
down the road. "Good enough because we don't yet have anything better"  
is a bad idea. It seems to me that what Brian has written is much better  
than "good enough", but I don't think that it should be accepted into  
Phobos *solely* because we don't have anything else. If the community  
decides that it is a worthwhile addition, then great, but that must not  
happen *until* it has passed rigorous review, just like every other  
recent Phobos module.


Fair enough. I guess I am just still touchy after the way std.signals was  
shot down. There weren't great technical arguments for shooting it down  
and so I feel that a good piece of code that would've been immediately  
useful and accepted by the community was rejected over some pretty silly  
fears.


Note that as badly as I want std.lexer to be included, I want it to pass  
a rigorous review. This review (and, if passing, subsequent inclusion) has  
opened up an opportunity to start using D at work that I did not expect,  
and so I am kind of excited about it.


--
Adam Wilson
GitHub/IRC: LightBender
Aurora Project Coordinator


Re: Formal review of std.lexer

2014-02-24 Thread Meta

On Tuesday, 25 February 2014 at 00:28:26 UTC, Adam Wilson wrote:

[SNIP]


You're throwing what I said way out of proportion. I was replying 
to the statement:


"No criticism should stop this module being accepted, as we do not
have any other lexer in the runtime anyway. Therefore I suggest
we accept std.lexer until a better solution comes up."

I don't agree with this. Obviously std.lexer is well-written and 
has been through a few rounds of iteration, but that doesn't mean 
we should accept it without due diligence to ensure that we won't 
be regretting some overlooked, poorly-designed or badly-named 
piece of functionality down the road. "Good enough because we 
don't yet have anything better" is a bad idea. It seems to me 
that what Brian has written is much better than "good enough", 
but I don't think that it should be accepted into Phobos *solely* 
because we don't have anything else. If the community decides 
that it is a worthwhile addition, then great, but that must not 
happen *until* it has passed rigorous review, just like every 
other recent Phobos module.


Re: Formal review of std.lexer

2014-02-24 Thread Adam Wilson

On Mon, 24 Feb 2014 15:36:43 -0800, Meta  wrote:


On Monday, 24 February 2014 at 23:07:07 UTC, Adam Wilson wrote:
Well, we keep voting down replacement candidates, which incidentally,  
is exactly what happened with the std.signals replacement, so I view  
this as an orthogonal issue to whether or not it should be included  
after passing a review. I don't think the fact that a module might not  
be perfect after review should stop us from approving a module that  
offers completely new functionality AND passed a review. Handling the  
problems after inclusion is what bugzilla is for.


I guess std.signals was a bad example, as there *was* a proposed  
replacement. However, there were real problems with the replacement that  
made it not suitable for inclusion. If I recall, these were largely API  
issues, which are the hardest to change. If we had accepted the new  
std.signals despite the issues raised, several years down the road it  
might turn out to be as broken as the old implementation (no offense to  
the author, this is just for the sake of argument), and we are unable to  
replace it for fear of breaking code. There are then 2 options: support  
the old API with its broken behaviour in the same module as the new API,  
or introduce std.signals2 or the like, which people have shown  
resistance to in the past. I think that it's very important to be  
careful as to what goes into Phobos at this point, as it's going to  
become increasingly difficult to change anything.


Ok. Then by that standpoint we have two remaining options. A) Don't ever  
change existing code because you might break someone else's dependent  
code by accident. Even without an API change, a change in how a function  
does its processing can result in incorrect behavior in downstream code.  
B) Never introduce new modules because we are terrified that the API might  
not be right in five years.


It is unrealistic in the extreme to demand that a new module in Phobos  
meet some arbitrary future compatibility bar. We routinely make changes to  
Phobos that break people's code in subtle ways that don't produce compiler  
errors and we never hear about it because no sane programmer expects a  
standard library to be a static thing that will never ever ever change in  
any way whatsoever so that they can expect the exact same behavior five  
years from now. No, they go fix it and move on.


"OMG! We can't add this! It might not be the right API in the future!" By  
that reasoning we should halt all work on Phobos ever. After all, we might  
break something or a new API might not be perfect in five years after a  
new language feature allows an obviously better way of implementing that  
API. As reasoning goes, the best word I can come up with to describe it is  
"naive". It is a purely fear-based response based on some unspecified and  
unknown future circumstance. Newsflash: you can't predict the future,  
which, incidentally, is why APIs change!


The reason the std.signals replacement got voted down was that it wasn't  
API-compatible with the original. This is despite the fact that nobody  
stood up to say that they use the current implementation and like it.  
Anybody who did use it said that they are desperate for a replacement. Now  
the current std.signals is universally reviled and almost completely  
unused, but we can't change it because of the mythical silent user who  
might be out there coding under a rock with no Internet and isn't voicing  
their opinion. Well, guess what: if you don't speak up you have no right  
to blame it on the people who did and decided to change it. You had your  
chance; stop whining and go update your code.


We are programmers. It is our JOB to deal with changes in the API/ABI/ISA.  
We don't have the RIGHT to whine about change because dealing with change  
is what we DO. If we aren't dealing with change we're working on something  
that isn't used at all and therefore pointless. I'm all for pointless  
pursuits, but don't hamstring me and my non-pointless pursuits with them.


I wanted std.signals in the new form badly, but now I can't have it, and  
Aurora is going to suffer mightily for it. Now I get to go off and make my  
own substandard iteration of Signals, because some people whined about how  
the API wasn't backwards compatible, even though it's widely acknowledged  
that almost nobody uses it and those that do would've been happy to move  
to the new signals.


Note, I am not talking about voting down modules that are obviously poorly  
implemented; I am talking about modules that work well and can do the task  
they purport to do with as few bugs as possible. If the module can do what  
it says it can, and the API isn't horrible, then we should default to  
adding it. Griping about the API five years from now is just bikeshedding:  
write up a pull and get that reviewed; chances are high it'll get rejected  
because the changes are worse anyway.


The only difficult

Re: Formal review of std.lexer

2014-02-24 Thread Meta

On Monday, 24 February 2014 at 23:07:07 UTC, Adam Wilson wrote:
Well, we keep voting down replacement candidates, which 
incidentally, is exactly what happened with the std.signals 
replacement, so I view this as an orthogonal issue to whether 
or not it should be included after passing a review. I don't 
think the fact that a module might not be perfect after review 
should stop us from approving a module that offers completely 
new functionality AND passed a review. Handling the problems 
after inclusion is what bugzilla is for.


I guess std.signals was a bad example, as there *was* a proposed 
replacement. However, there were real problems with the 
replacement that made it not suitable for inclusion. If I recall, 
these were largely API issues, which are the hardest to change. 
If we had accepted the new std.signals despite the issues 
raised, several years down the road it might turn out to be as 
broken as the old implementation (no offense to the author, this 
is just for the sake of argument), and we are unable to replace 
it for fear of breaking code. There are then 2 options: support 
the old API with its broken behaviour in the same module as the 
new API, or introduce std.signals2 or the like, which people have 
shown resistance to in the past. I think that it's very important 
to be careful as to what goes into Phobos at this point, as it's 
going to become increasingly difficult to change anything.


Re: Formal review of std.lexer

2014-02-24 Thread Adam Wilson

On Mon, 24 Feb 2014 14:32:53 -0800, Meta  wrote:


On Monday, 24 February 2014 at 22:14:34 UTC, Dejan Lekic wrote:
No criticism should stop this module being accepted, as we do not have  
any other lexer in the runtime anyway. Therefore I suggest we accept  
std.lexer until a better solution comes up. Naturally anyone should be  
encouraged to provide a better solution by submitting a pull request to  
Phobos developers...


The problem is that this is what has been done before, and now we are  
more or less stuck with outdated, sometimes poorly-written, often buggy  
modules (std.signals being one example).


Well, we keep voting down replacement candidates, which incidentally, is  
exactly what happened with the std.signals replacement, so I view this as  
an orthogonal issue to whether or not it should be included after passing  
a review. I don't think the fact that a module might not be perfect after  
review should stop us from approving a module that offers completely new  
functionality AND passed a review. Handling the problems after inclusion  
is what bugzilla is for.


--
Adam Wilson
GitHub/IRC: LightBender
Aurora Project Coordinator


Re: Formal review of std.lexer

2014-02-24 Thread Meta

On Monday, 24 February 2014 at 22:14:34 UTC, Dejan Lekic wrote:
No criticism should stop this module being accepted, as we do 
not have any other lexer in the runtime anyway. Therefore I 
suggest we accept std.lexer until a better solution comes up. 
Naturally anyone should be encouraged to provide a better 
solution by submitting a pull request to Phobos developers...


The problem is that this is what has been done before, and now we 
are more or less stuck with outdated, sometimes poorly-written, 
often buggy modules (std.signals being one example).


Re: Formal review of std.lexer

2014-02-24 Thread Dejan Lekic

On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal 
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed 
module focuses on generic lexer generation, as discussed in the 
matching voting thread.


Docs: 
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d


Example topics to evaluate during review:
 - Feasibility of overall design and concept
 - Quality of documentation
 - Consistency with existing Phobos modules
 - Overall quality of implementation

Initial review will end on March the 8th.


No criticism should stop this module being accepted, as we do not 
have any other lexer in the runtime anyway. Therefore I suggest 
we accept std.lexer until a better solution comes up. Naturally 
anyone should be encouraged to provide a better solution by 
submitting a pull request to Phobos developers...


So far I haven't seen a better lexer for D source than Brian's 
std.lexer. If anyone has, please let me know.


Re: Formal review of std.lexer

2014-02-24 Thread Dicebot
Brian, do you have benchmarks for this proposal similar to ones 
you have provided in old review threads? (vs DMD frontend lexer)


Re: Formal review of std.lexer

2014-02-22 Thread Marc Schütz
On Friday, 21 February 2014 at 22:13:30 UTC, Andrej Mitrovic 
wrote:

On 2/21/14, Joseph Cassman wrote:
1. StringCache is a custom hash table. It looks like its primary
role is to reduce some sort of duplication.


It's used for string interning.


But that still doesn't explain why a custom hash table 
implementation is necessary. Maybe a lightweight wrapper around 
built-in AAs is sufficient?


(I also apologize if this has already been asked and answered in 
the last review thread, which is unfortunately too long to read 
in a short time.)


Re: Formal review of std.lexer

2014-02-21 Thread Brian Schott
On Friday, 21 February 2014 at 22:14:36 UTC, Andrej Mitrovic 
wrote:

Hmm, maybe I should make a pull request upstream?


You should.


Re: Formal review of std.lexer

2014-02-21 Thread Andrei Alexandrescu

On 2/21/14, 9:51 PM, Brian Schott wrote:

Does this mean that you're finally getting approval to release your
lexer generator?


Affirmative. It'll happen Real Soon Now(tm).

Andrei



Re: Formal review of std.lexer

2014-02-21 Thread Andrej Mitrovic
On 2/21/14, Andrej Mitrovic  wrote:
> I've used it to learn how to use DLexer. Here's my more-documented
> version of the above, if it helps anyone; it also takes the D
> parser's whitespace function and uses that to track column
> numbers, injects a toString like I've mentioned, and adds some
> static asserts for sanity checks:
>
> https://github.com/AndrejMitrovic/lexer-demo

Hmm, maybe I should make a pull request upstream?


Re: Formal review of std.lexer

2014-02-21 Thread Andrej Mitrovic
On 2/21/14, Joseph Cassman  wrote:
> 1. StringCache is a custom hash table. It looks like its primary
> role is to reduce some sort of duplication.

It's used for string interning.

> 3. I tried to understand the reason for and usage of the
> "extraFields" parameter in "TokenStructure" but couldn't figure
> it out. Could some more explanation of its intent and usage be
> provided?

You could use it to inject a toString method. Here's what I did when I used it:

alias Token = TokenStructure!(TokID,
    q{
        /// Better string representation
        void toString(scope void delegate(const(char)[]) sink) const
        {
            import std.conv;
            import %s;

            sink("(");
            sink(this.line.to!string);
            sink(":");
            sink(this.column.to!string);
            sink(")");
            sink(": ");
            sink(this.text ? this.text : tokToString(this.type));
        }
    }.format(__MODULE__)
);

Side note: note how I've had to inject an import of my own module,
because the string is mixed in at the declaration site of a module
which I can't modify, where IIRC tokToString didn't exist since
it's located in *my* module. It's interesting how you can use
this feature in D, IOW:

-
module mymodule;
import other_module;

void foo() { }

void main()
{
    // The mixed-in string imports mymodule, so foo() resolves even
    // though the mixin itself happens inside other_module.
    mixin_elsewhere!("import mymodule; foo();")();
}
-

-
module other_module;

void mixin_elsewhere(string str)()
{
    mixin(str); // mixes the caller-supplied statements in here
}
-

P.S. the DLexer links here: https://github.com/Hackerpilot/lexer-demo/

I've used it to learn how to use DLexer. Here's my more-documented
version of the above, if it helps anyone; it also takes the D
parser's whitespace function and uses that to track column
numbers, injects a toString like I've mentioned, and adds some
static asserts for sanity checks:

https://github.com/AndrejMitrovic/lexer-demo


Re: Formal review of std.lexer

2014-02-21 Thread Brian Schott
Does this mean that you're finally getting approval to release 
your lexer generator?


On Friday, 21 February 2014 at 17:06:23 UTC, Andrei Alexandrescu 
wrote:

Can we please defer this by one week?

Thanks,

Andrei




Re: Formal review of std.lexer

2014-02-21 Thread Dicebot

On Friday, 21 February 2014 at 17:15:41 UTC, Jacob Carlborg wrote:

On 2014-02-21 18:06, Andrei Alexandrescu wrote:


Can we please defer this by one week?


Just make the review period one week longer.


This. There is no real rationale behind the existing default review 
period, so extending it until March the 15th won't cause any 
issues.


Re: Formal review of std.lexer

2014-02-21 Thread Jacob Carlborg

On 2014-02-21 18:06, Andrei Alexandrescu wrote:


Can we please defer this by one week?


Just make the review period one week longer.

--
/Jacob Carlborg


Re: Formal review of std.lexer

2014-02-21 Thread Andrei Alexandrescu

On 2/21/14, 2:12 PM, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module
focuses on generic lexer generation, as discussed in the matching
voting thread.

Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

Example topics to evaluate during review:
  - Feasibility of overall design and concept
  - Quality of documentation
  - Consistency with existing Phobos modules
  - Overall quality of implementation

Initial review will end on March the 8th.


Can we please defer this by one week?

Thanks,

Andrei



Re: Formal review of std.lexer

2014-02-21 Thread Joseph Cassman

On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal 
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed 
module focuses on generic lexer generation, as discussed in the 
matching voting thread.


Docs: 
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d


Thanks for all the work, Brian. I read through the previous threads 
about the development of this code (links at the bottom) and I 
can see a lot of effort has gone into it. So the following 
comments may come across as uninformed, but hopefully they will 
be helpful.


1. StringCache is a custom hash table. It looks like its primary 
role is to reduce some sort of duplication. Hash tables, though, 
are difficult to get right. So perhaps a benchmark comparison 
could be made against the built-in hash table to show what savings 
it brings? Since it is in the public interface, should its payload 
also be public? Although it is built using GC.malloc, how about 
the in-the-works std.allocator module? Perhaps version 1 could use 
GC.malloc, but it would be nice if a later PR made it possible to 
use a custom allocator.


2. I like the fact that a range interface is provided. I realize 
that the previous discussions stipulated the use of ubyte to 
avoid encoding work during scanning. The reasoning about 
performance makes sense to me. That being the case, could a code 
example be provided showing how to use this module to scan a 
UTF-8 encoded string? Even if this is going to focus only on 
scanning code files, the D language spec allows for arbitrary 
Unicode in a code file. How is this possible? (I have a general 
idea, just looking for some explicit code sample help).


3. I tried to understand the reason for and usage of the 
"extraFields" parameter in "TokenStructure" but couldn't figure 
it out. Could some more explanation of its intent and usage be 
provided?


4. Do you want the commented-out pragma statement left over on 
line 601?


5. Should the template "TokenId" perhaps be something like 
"generateTokenId" instead? I am not sure what an "Id" for a token 
means. Is it an integral hash value? I had difficulty seeing how it 
ties in with the concept of "value" in the header documentation. 
If this is a numerical hash of a string token, why is the string 
still stored and used in "tokenStringRepresentation"? I probably 
am missing something big, but couldn't the number be used to 
represent the string everywhere, saving on time and space?


6. I tried but had difficulty understanding the difference 
between the four token types -- "staticTokens", "dynamicTokens", 
"possibleDefaultTokens", "tokenHandlers" -- provided as arguments 
to "Lexer". What is a token that has a value that changes versus 
a token that does not change? I am not sure where to put my token 
definitions.


7. Just thinking about using the module and I would like to use 
it to make a scanner for xml, json, csv, c/c++, etc. I wasn't 
able to figure out how to do so, however. The initial code 
example is nice. But could some additional guidance be provided? 
Also, I wasn't sure how to make use of a lexer once created. The 
documentation focuses well on how to initialize a "Lexer" but 
could some guidance also be provided on how to use one past 
initialization?


8. Andrei's trie search 
(http://forum.dlang.org/thread/eeenynxifropasqcu...@forum.dlang.org?page=4#post-l2nm7m:2416e1:241:40digitalmars.com) 
seemed like a really interesting idea. And I saw in that thread 
you continued with his ideas. Does this module incorporate that 
work? Or was it less performant in the end?


9. I ran "dmd -cov" against the module and got zero percent unit 
test coverage. Perhaps adding some test code will help clarify 
usage patterns?


You have put a lot of work into this code so I apologize if the 
above comes across as picking it apart. Just some questions I had 
in trying to make use of the code. Hopefully some of it is 
helpful.


Joseph

Other related posts
http://forum.dlang.org/thread/jsnhlcbulwyjuqcqo...@forum.dlang.org
http://forum.dlang.org/thread/dpdgcycrgfspcxenz...@forum.dlang.org
http://forum.dlang.org/thread/eeenynxifropasqcu...@forum.dlang.org



Re: Formal review of std.lexer

2014-02-21 Thread Vladimir Panteleev

On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:

http://wiki.dlang.org/Review/std.lexer


First of all, thank you for the great work. This is a very 
important project.


I'll begin with reviewing the documentation.


Summary


Some simple explanation of the terminology and concepts would be 
nice. At least a link to Wikipedia.



Create the string array costants for your language.


Typo ("costants")


Examples:


An inline complete example of a very simple language would be 
nice.



A lexer for D is available here.


Although good to have, this is too much to take in all at once, 
for documentation purposes.



A lexer for Lua is available here.


Nary a comment in sight. This might serve as the example lexer if 
only it were better commented. The comments could be copy-pasted 
from the module documentation; even that would make the code much 
easier to grok.



Template Parameter Definitions


What does this mean? Parameters to what template?

Can this section be moved to inside the documentation of Lexer, 
and Lexer be moved to the first documented symbol in the file?


A function that serves as the default token lexing function. 
For most languages this will be the identifier lexing function.


Should the function signature and contracts be explained here?

This function must return bool and take a single size_t 
argument representing the number of bytes to skip over before 
looking for a separating character.


I think it's better to describe the signature in D syntax rather 
than English.


A listing of tokens whose value is variable, such as 
whitespace, identifiers, number literals, and string literals.


No mention of how the list is represented (is it an array? what 
type of elements should the array have? how are the array values 
used?); the reader is left to figure that out from the example 
below.


Template for determining the type used for a token type. 
Selects the smallest unsigned integral type that is able to 
hold the value staticTokens.length + dynamicTokens.length + 
possibleDefaultTokens.length. For example if there are 20 
static tokens, 30 dynamic tokens, and 10 possible default 
tokens, this template will alias itself to ubyte, as 20 + 30 + 
10 < ubyte.max.


Should this be documented? I understand that this will be 
instantiated only once, by std.lexer.
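
For reference, the described selection amounts to something like this 
(illustrative, not the module's actual code):

template SmallestIdType(size_t tokenCount)
{
    static if (tokenCount < ubyte.max)
        alias SmallestIdType = ubyte;
    else static if (tokenCount < ushort.max)
        alias SmallestIdType = ushort;
    else
        alias SmallestIdType = uint;
}

static assert(is(SmallestIdType!(20 + 30 + 10) == ubyte));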


Utility declarations should preferably be at the end of the 
module, so that they appear last in the documentation.


Generates the token type identifier for the given symbol. There 
are two special cases:


Are these magic constants necessary? Why not declare them as 
enums?


 In all cases this template will alias itself to a constant of 
type IdType. This template will fail at compile time if symbol 
is not one of the staticTokens, dynamicTokens, or 
possibleDefaultTokens.


Style nit: D symbols should be wrapped in the $(D ...) macro.


== overload for the token type.


Is it really necessary to document opEquals?

But since it's here: how does it interact with extraFields?


The Column number at which this token occurs.


There was a lot of bikeshedding regarding the correct terminology 
to use when adding -vcolumns to DMD ( 
https://github.com/D-Programming-Language/dmd/pull/3077 ). I 
think the documentation should at least mention what exactly it 
is counting.


A function that serves as the default token lexing function. 
For most languages this will be the identifier lexing function.


What should the function's name be? How will it interact with 
Lexer? (It's not clear that this refers to the 
defaultTokenFunction parameter, especially after the previous 
list item, popFront, is a different piece of the puzzle.)


The documentation for Lexer's arguments seems to be thrown all 
around the module. I suggest documenting them only once, all in 
Lexer's DDoc, adding example signatures, and moving Lexer to the 
top of the module.



Examples:
struct CalculatorLexer


I think this should be expanded into a full, well-documented 
example featured in the module DDoc.



_popFront();


Where did that come from? Perhaps you meant Lexer's DDoc to say 
"which should call this mixin's _popFront()", and the DDoc 
escaped the _ character? If so, why not use a named mixin to 
disambiguate instead of _?



struct LexerRange;
struct StringCache;


These are thoroughly documented, but will they be used by 
anything other than std.lexer?