Re: Formal review of std.lexer
17-Mar-2014 02:13, Martin Nowak wrote: On 02/22/2014 09:31 PM, "Marc Schütz" wrote: But that still doesn't explain why a custom hash table implementation is necessary. Maybe a lightweight wrapper around built-in AAs is sufficient? I'm also wondering what benefit this hash table provides. Getting back to this. The custom hash map was originally a product of optimization; the benefits over built-in AAs were: a) Allocation was amortized by allocating nodes in batches. b) It allowed a custom hash function to be used with a built-in type (string). Not sure how much of that stands today. -- Dmitry Olshansky
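For illustration, the two optimizations Dmitry lists could look something like the following toy interning table. This is a minimal sketch of the idea only, not the actual StringCache code; the names (InternTable, fnv1a) are made up for the example.

// Sketch only: batched node allocation (a) plus a custom string hash (b).
struct InternTable
{
    private static struct Node { string key; Node* next; }

    private Node*[1024] buckets;
    private Node[] batch;       // nodes are carved out of this block
    private size_t nextNode;

    // b) a custom hash function over the built-in string type
    private static uint fnv1a(string s) pure nothrow @safe
    {
        uint h = 2166136261u;
        foreach (char c; s)
        {
            h ^= c;
            h *= 16777619u;
        }
        return h;
    }

    string intern(string s)
    {
        immutable slot = fnv1a(s) & (buckets.length - 1);
        for (auto n = buckets[slot]; n !is null; n = n.next)
            if (n.key == s)
                return n.key;   // already interned

        // a) amortize allocation: one GC allocation per 256 nodes
        if (nextNode == batch.length)
        {
            batch = new Node[256];
            nextNode = 0;
        }
        auto node = &batch[nextNode++];
        *node = Node(s, buckets[slot]);
        buckets[slot] = node;
        return s;
    }
}

unittest
{
    InternTable table;
    auto a = table.intern("hello");
    auto b = table.intern("hello".idup); // equal contents, distinct memory
    assert(a is b);                      // collapsed to a single copy
}

A built-in AA always hashes with the type's default hash and allocates per entry, which is exactly what the two points above avoid.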
Re: Formal review of std.lexer
On 14/04/2014 10:34 PM, Brian Schott wrote: ubyte[] is required for the lexer to work quickly. That lexer range is designed to keep track of the column and line numbers. I can understand that speed requires the input to be handled as bytes instead of chars, but the restriction to a ubyte[] over a RandomAccessRange seems to me un-Dlang-y. If the LexerRange is only there to optionally add line/column numbering support, then I think it needs a much more descriptive name and much better documentation. That function's purpose is to determine if the current code unit signifies the end of an identifier/keyword. When lexing "fortunate", the lexer would spot "for", and then call the separating function with the index of "u". In the case of D, it would say that this does NOT end a word, and "fortunate" should be lexed as an identifier. If it was called with an index of a space or parenthesis, it should return true. Somehow I had skipped the salient part of the documentation for this, so my apologies (at first I thought that the DDoc output I was reading must have been out of sync with the source (it was, but not that much), but further investigation suggests some form of temporary blindness). This description squares with what I had surmised from reading the code, and I can see why it would be more efficient than the common technique of comparing every matched identifier to the list of keywords. I do wonder, however, if there might be another way to attain the same efficiency without the need for the separating function (I should have replied to this last night when my ideas on the matter were clearer; sleep seems to have stolen my theoretical alternative ><). I'm also curious about the decision that the signature of the separating function should take the offset of the character that needs to be tested. A more Dlang-y thing to do, in my mind, would be to pass a range that begins at the first character after the matched keyword. If more than one prefix is dispatched to the same handler, the handler cannot blindly call popFront or popFrontN. If the lexer author knows that the handler will only be called in certain places, then checking !empty is a waste of cycles. I don't think it's possible for the lexer generator to enforce this. I don't think the lexer needs to do anything extra to ensure that a handler can begin its work both without repeating calls to .empty and without blindly calling popFront. To do so requires that the input be treated as a forward range. Before reading a token, store the .save of the input, then advance the input as the token is matched, counting the consumed elements. When the handler is called it will have the option of including the token by adding to the count and returning a slice that begins at the .save'd position, or of ignoring the length of the match and returning a slice that begins at the position after the matched token (see the sketch at the end of this post). I strongly believe that a model that requires the user to reason about a library providing ranges that are "logically !empty" is a misstep. XML doesn't seem to have very distinct lexing/parsing phases like JSON markup or Java code does. This lexer generator may not be the best way to deal with XML. I don't really know how the grammar in the XML standard compares to others in general. It is certainly more convoluted than the published grammar for Dlang, but we all know that that doesn't quite match the language.
Some criticise XML for being too complicated, but in some respects it is even more complicated than that; it seems ironic that it was supposed to be simple. But I might have typed too soon when I wrote that this lexer might not be able to fully tokenize XML in a single pass. As I was drifting off to sleep last night I had the idea of using the extraFields of TokenStructure to add a token pointer (or equivalent construct) to the tokens, making a singly linked list and allowing me to return multiple tokens from a token handler. I just need to convince myself that it is a sane thing to do. If you have specific problems, please list them. I think the calculator example should be improved or replaced so that the documentation's first example uses all the string[]s that can be passed to the lexer. Perhaps a simple symbolic calculator that accepts letter sequences as variables and reserves pi as a keyword? I think that the section that describes the token handlers needs to fully document the primitives that the user has access to in order to lex a token, what state to expect them to be in when the handler is called, and what state they should be in when it returns. If tracking lines and columns is optional and supported only when wrapping the input with LexerRange, then I think that by default Tokens should not contain fields for them. Perhaps an enum qstring that declares the required fields can be introduced, and an example shown where it is concatenated with user declarations to be passed in via extraFields.
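To make the save-and-count scheme described above concrete, here is a minimal sketch. It is an illustration only, not code from the module; lexWord is a made-up name and the three-element match is hard-coded for brevity.

import std.range;
import std.algorithm : equal;

// Remember where the token began, count what is consumed, and let the
// handler decide whether the matched prefix is part of the token.
auto lexWord(R)(ref R r)
{
    auto begin = r.save;     // position before the match
    size_t consumed;
    foreach (i; 0 .. 3)      // pretend the lexer just matched "for"
    {
        r.popFront();
        ++consumed;
    }
    // Include the match in the returned token; a handler that wants to
    // exclude it would ignore `begin` and slice from `r` instead.
    return begin.takeExactly(consumed);
}

unittest
{
    string input = "fortunate";
    auto tok = lexWord(input);
    assert(tok.equal("for"));
    assert(input.equal("tunate")); // the range advanced past the match
}

Either way, no handler ever needs to call popFront blindly on a range it has not checked.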
Re: Formal review of std.lexer
On Monday, 14 April 2014 at 20:43:58 UTC, Alix Pexton wrote: I know the official review period is long past, but I'd not had a good look at this module until this past weekend. Last year I had been working on my own xml lexer/parser but as of yet I have nothing to show for it, so I took a look at this proposal with an eye towards using it to make my efforts easier. Andrei's posts about the possible design of a generic lexer had also influenced me, so I was expecting to find similarities between this module and my own work, albeit with the added benefits of being generic (in the good way). I have, however, found it very difficult to understand much of it, which I entirely put down to my own deficiencies with templates and especially the use of mixins. In the example Dlang lexer, the constructor takes a ubyte[] as input and wraps it in a LexerRange struct which defines the normal input range primitives as well as various functions for lookahead. It is not documented whether the lexer needs these extra features or if they are only provided for use within the tokenising functions that are supplied to the template by the user. If they are used by the core of the lexer then it would seem to preclude the use of any other type of input that cannot be coerced into a ubyte[] without effort on the part of the user to implement the same interface. ubyte[] is required for the lexer to work quickly. That lexer range is designed to keep track of the column and line numbers. I think the description of the functionality required of the tokenSeparatingFunction that the user must supply needs to be much better. If I understand correctly, it is intended to differentiate between keywords and identifiers which begin with a keyword. The more I think about this, the less certain I am. That function's purpose is to determine if the current code unit signifies the end of an identifier/keyword. When lexing "fortunate", the lexer would spot "for", and then call the separating function with the index of "u". In the case of D, it would say that this does NOT end a word, and "fortunate" should be lexed as an identifier. If it was called with an index of a space or parenthesis, it should return true. When the lexer dispatches to a token handler, the front of the range is left pointing to the beginning of the character sequence that was matched, allowing it to be included in the returned token. However, many of the handlers in the example Dlang lexer begin with a number of blind popFront calls to jump to the end of that match. I am aware that in well-meaning code this is a case of the range being logically !empty, but I also wonder how often it might get overlooked when two matches of different lengths are dispatched to the same handler. (I had a similar situation in my own code, and my solution was to have a variable storing the .save of my inputRange and count how many chars were consumed since it was updated. This way I could either return the whole match or part of it in the token, or discard it and include only what came after it.) As there has been some contention about the correct use of the range primitives of late, I will refrain from making any other comment on their use in this module, especially as I am no longer sure that I have been using them correctly myself. If more than one prefix is dispatched to the same handler, the handler cannot blindly call popFront or popFrontN. If the lexer author knows that the handler will only be called in certain places, then checking !empty is a waste of cycles.
I don't think it's possible for the lexer generator to enforce this. In the short time that I have been looking at the features of this lexer I have not been able to figure out a way of writing a standards-compliant XML parser without having to lex some parts of the document at least twice, or subverting the token handlers to change behaviour according to context. Several non-compliant single-pass XML lexers would be possible, but they would not be able to process documents that used some (admittedly obscure and often overlooked) features. The only scalable technique that I can think of to allow XML to be lexed in a single pass in a fully spec-compliant way would be to allow handlers to return multiple tokens. I am not sure how feasible this would be or what mechanism would be best to implement it. XML doesn't seem to have very distinct lexing/parsing phases like JSON markup or Java code does. This lexer generator may not be the best way to deal with XML. On the whole I think the overall design of the module shows promise but requires polish to make it both more idiomatically Dlang-y and easier for the user to build upon (both by documentation and interface). If you have specific problems, please list them. On a side note related to the example lexer for Dlang, I believe the predicate function isDocComment will produce false positives for the comment delimiters listed in my original review (//* //+ /*+ /*/ /+* /+/), which to my knowledge are not valid DDoc delimiters.
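Returning to the "fortunate" example above, the behaviour Brian describes could look something like this sketch. It is my own simplified illustration: in the real module the function is mixed into the lexer and receives only the byte offset, whereas here it takes the input explicitly so it can stand alone.

// Decide whether the byte at `offset` ends an identifier/keyword.
bool isSeparating(const(ubyte)[] input, size_t offset) pure nothrow @safe
{
    if (offset >= input.length)
        return true;                       // end of input ends the word
    immutable c = input[offset];
    if (c >= 'a' && c <= 'z') return false;
    if (c >= 'A' && c <= 'Z') return false;
    if (c >= '0' && c <= '9') return false;
    if (c == '_') return false;
    return true;                           // space, '(', operator, ...
}

unittest
{
    auto src = cast(const(ubyte)[]) "fortunate";
    assert(!isSeparating(src, 3));  // 'u' continues the word: identifier
    auto src2 = cast(const(ubyte)[]) "for (x; y; z)";
    assert(isSeparating(src2, 3));  // ' ' ends the word: keyword "for"
}

The payoff is that the lexer only ever compares the matched prefix once, instead of checking every completed identifier against the keyword list.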
Re: Formal review of std.lexer
On Sunday, 16 March 2014 at 22:13:16 UTC, Martin Nowak wrote: On 02/22/2014 09:31 PM, "Marc Schütz" wrote: But that still doesn't explain why a custom hash table implementation is necessary. Maybe a lightweight wrapper around built-in AAs is sufficient? I'm also wondering what benefit this hash table provides. The hash table was added by Dmitry a while back. I assume it's because he had numbers to back it up.
Re: Formal review of std.lexer
On 21/02/2014 12:12 PM, Dicebot wrote: http://wiki.dlang.org/Review/std.lexer This is a follow-up by Brian to his earlier proposal (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module focuses on generic lexer generation, as discussed in the matching voting thread. Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d Example topics to evaluate during review:

- Feasibility of overall design and concept
- Quality of documentation
- Consistency with existing Phobos modules
- Overall quality of implementation

Initial review will end on March the 8th. I know the official review period is long past, but I'd not had a good look at this module until this past weekend. Last year I had been working on my own xml lexer/parser but as of yet I have nothing to show for it, so I took a look at this proposal with an eye towards using it to make my efforts easier. Andrei's posts about the possible design of a generic lexer had also influenced me, so I was expecting to find similarities between this module and my own work, albeit with the added benefits of being generic (in the good way). I have, however, found it very difficult to understand much of it, which I entirely put down to my own deficiencies with templates and especially the use of mixins. In the example Dlang lexer, the constructor takes a ubyte[] as input and wraps it in a LexerRange struct which defines the normal input range primitives as well as various functions for lookahead. It is not documented whether the lexer needs these extra features or if they are only provided for use within the tokenising functions that are supplied to the template by the user. If they are used by the core of the lexer then it would seem to preclude the use of any other type of input that cannot be coerced into a ubyte[] without effort on the part of the user to implement the same interface. I think the description of the functionality required of the tokenSeparatingFunction that the user must supply needs to be much better. If I understand correctly, it is intended to differentiate between keywords and identifiers which begin with a keyword. The more I think about this, the less certain I am. When the lexer dispatches to a token handler, the front of the range is left pointing to the beginning of the character sequence that was matched, allowing it to be included in the returned token. However, many of the handlers in the example Dlang lexer begin with a number of blind popFront calls to jump to the end of that match. I am aware that in well-meaning code this is a case of the range being logically !empty, but I also wonder how often it might get overlooked when two matches of different lengths are dispatched to the same handler. (I had a similar situation in my own code, and my solution was to have a variable storing the .save of my inputRange and count how many chars were consumed since it was updated. This way I could either return the whole match or part of it in the token, or discard it and include only what came after it.) As there has been some contention about the correct use of the range primitives of late, I will refrain from making any other comment on their use in this module, especially as I am no longer sure that I have been using them correctly myself.
In the short time that I have been looking at the features of this lexer I have not been able to figure out a way of writing a standards-compliant XML parser without having to lex some parts of the document at least twice, or subverting the token handlers to change behaviour according to context. Several non-compliant single-pass XML lexers would be possible, but they would not be able to process documents that used some (admittedly obscure and often overlooked) features. The only scalable technique that I can think of to allow XML to be lexed in a single pass in a fully spec-compliant way would be to allow handlers to return multiple tokens. I am not sure how feasible this would be or what mechanism would be best to implement it. On the whole I think the overall design of the module shows promise but requires polish to make it both more idiomatically Dlang-y and easier for the user to build upon (both by documentation and interface). On a side note related to the example lexer for Dlang, I believe the predicate function isDocComment will produce false positives for the following comment delimiters, which to my knowledge are not valid DDoc delimiters... //* //+ /*+ /*/ /+* /+/ As the Dlang lexer is not part of the review proper I have not inspected it carefully; this function just happens to be the first one declared in that example. Again, my apologies for the tardiness of this review. A...
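For reference, a stricter predicate that rejects those six delimiters might look like the sketch below. This is my own illustration, not the code under review; it assumes that only ///, /** ... */ (excluding the empty /**/), and /++ ... +/ count as DDoc comments.

bool isDocComment(string comment) pure nothrow @safe
{
    if (comment.length < 3 || comment[0] != '/')
        return false;
    if (comment[1] == '/')
        return comment[2] == '/';                        // ///
    if (comment[1] == '*')
        return comment[2] == '*' && comment != "/**/";   // /** ... */
    if (comment[1] == '+')
        return comment[2] == '+';                        // /++ ... +/
    return false;
}

unittest
{
    assert(isDocComment("/// doc"));
    assert(isDocComment("/** doc */"));
    assert(isDocComment("/++ doc +/"));
    assert(!isDocComment("//* not doc"));
    assert(!isDocComment("//+ not doc"));
    assert(!isDocComment("/*+ not doc */"));
    assert(!isDocComment("/*/ not doc */"));
    assert(!isDocComment("/+* not doc +/"));
    assert(!isDocComment("/+/ not doc +/"));
}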
Re: Formal review of std.lexer
On 02/22/2014 09:31 PM, "Marc Schütz" wrote: But that still doesn't explain why a custom hash table implementation is necessary. Maybe a lightweight wrapper around built-in AAs is sufficient? I'm also wondering what benefit this hash table provides.
Re: Formal review of std.lexer
Initial review has finished. Voting will be delayed because Brian is currently busy and Walter's scopebuffer proposal is still being processed (per agreement with both Brian and Walter). Anyone late for the review can still leave comments; I am sure Brian will take them into consideration when making last-minute changes before voting.
Re: Formal review of std.lexer
On Wednesday, 26 February 2014 at 18:07:37 UTC, Jacob Carlborg wrote: On 2014-02-26 00:25, Dicebot wrote: Don't know if it makes sense to introduce random package categorization. I'd love to see more hierarchy in Phobos too, but we'd first need to agree on package separation principles. Then that's what we need to do. I don't want any more top-level modules. There are already too many. As much as I hate to say it, such a hierarchy is worth a DIP. Once it is formalized, I can proceed with it in the review queue as if it were a new module proposal.
Re: Formal review of std.lexer
Reminder about benchmarks. By the way, is the generated lexer usable in CTFE? Imaginary use case: easier DSL implementation.
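(For context, the imagined DSL use would be something like the hypothetical sketch below, with a trivial hand-written, CTFE-able tokenize standing in for the generated lexer. Whether the generated Lexer's internals would actually run under CTFE is the open question.)

// tokenize is a made-up stand-in, not part of the proposal.
string[] tokenize(string src) pure @safe
{
    string[] result;
    size_t start;
    foreach (i, c; src)
    {
        if (c == ' ')
        {
            if (i > start)
                result ~= src[start .. i];
            start = i + 1;
        }
    }
    if (start < src.length)
        result ~= src[start .. $];
    return result;
}

enum tokens = tokenize("x + y * 2");   // enum forces compile-time evaluation
static assert(tokens.length == 5 && tokens[1] == "+");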
Re: Formal review of std.lexer
Bringing this back to the front page.
Re: Formal review of std.lexer
On 2014-02-26 00:25, Dicebot wrote: Don't know if it makes sense to introduce random package categorization. I'd love to see more hierarchy in Phobos too, but we'd first need to agree on package separation principles. Then that's what we need to do. I don't want any more top-level modules. There are already too many. -- /Jacob Carlborg
Re: Formal review of std.lexer
On 2014-02-26 16:18, Andrei Alexandrescu wrote: I think we wouldn't want to add one more package for each language supported. That's exactly what we want, preferably in a common package:

std.language.d.lexer
std.language.d.ast
std.language.d.parser
std.language.xml.lexer
std.language.xml.parser
std.language.xml.dom

What do you suggest, having multiple lexers for different languages in the same module? -- /Jacob Carlborg
Re: Formal review of std.lexer
On 2014-02-26 07:58, Brian Schott wrote: I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d

I agree with Brian. Although I would have a common package for all languages:

std.language.d.lexer
std.language.d.ast
std.language.d.parser

-- /Jacob Carlborg
Re: Formal review of std.lexer
On 2/25/14, 10:58 PM, Brian Schott wrote: On Tuesday, 25 February 2014 at 23:17:56 UTC, Meta wrote: std.lexer could be the umbrella for a bunch of different lexers. Then we could have std.lexer.xml, std.lexer.json, etc. I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d

I think we wouldn't want to add one more package for each language supported. Andrei
Re: Formal review of std.lexer
On Wednesday, 26 February 2014 at 07:00:33 UTC, Brian Schott wrote: On Monday, 24 February 2014 at 19:05:35 UTC, Dicebot wrote: Brian, do you have benchmarks for this proposal similar to the ones you provided in the old review threads? (vs the DMD frontend lexer) Not yet, but I can create them before the review period ends. Thanks, that will be very interesting information to consider in the context of DDMD.
Re: Formal review of std.lexer
On Tuesday, 25 February 2014 at 23:17:56 UTC, Meta wrote: std.lexer could be the umbrella for a bunch of different lexers. Then we could have std.lexer.xml, std.lexer.json, etc. I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d
Re: Formal review of std.lexer
On Monday, 24 February 2014 at 19:05:35 UTC, Dicebot wrote: Brian, do you have benchmarks for this proposal similar to the ones you provided in the old review threads? (vs the DMD frontend lexer) Not yet, but I can create them before the review period ends.
Re: Formal review of std.lexer
On Tuesday, 25 February 2014 at 20:21:17 UTC, Andrei Alexandrescu wrote: I think it would be great if we rigged things such that std.lexer simplifies flint and makes it faster. From what I saw Brian has a very similar approach. I'm hoping for a few improvements, which I'll share soon. Andrei I'll be waiting for this list. In the meantime, I'm making a change to the string interning process to make it more threading-friendly.
Re: Formal review of std.lexer
On Tuesday, 25 February 2014 at 20:48:08 UTC, Jacob Carlborg wrote: I don't like it being a new top-level module. I would name it std.language.lexer or std.lang.lexer. Don't know if it makes sense to introduce random package categorization. I'd love to see more hierarchy in Phobos too, but we'd first need to agree on package separation principles.
Re: Formal review of std.lexer
On Tuesday, 25 February 2014 at 20:48:08 UTC, Jacob Carlborg wrote: On 2014-02-21 13:12, Dicebot wrote: http://wiki.dlang.org/Review/std.lexer This is a follow-up by Brian to his earlier proposal (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module focuses on generic lexer generation, as discussed in the matching voting thread. I don't like it being a new top-level module. I would name it std.language.lexer or std.lang.lexer. std.lexer could be the umbrella for a bunch of different lexers. Then we could have std.lexer.xml, std.lexer.json, etc.
Re: Formal review of std.lexer
On 2014-02-21 13:12, Dicebot wrote: http://wiki.dlang.org/Review/std.lexer This is a follow-up by Brian to his earlier proposal (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module focuses on generic lexer generation, as discussed in the matching voting thread. I don't like it being a new top-level module. I would name it std.language.lexer or std.lang.lexer. -- /Jacob Carlborg
Re: Formal review of std.lexer
On 2/24/14, 7:18 PM, Adam Wilson wrote: Note that as badly as I want std.lexer to be included, I want it to pass a rigorous review. This review (and, if passing, subsequent inclusion) has opened up an opportunity to start using D at work that I did not expect, and so I am kind of excited about it. I think it would be great if we rigged things such that std.lexer simplifies flint and makes it faster. From what I saw Brian has a very similar approach. I'm hoping for a few improvements, which I'll share soon. Andrei
Re: Formal review of std.lexer
On Tuesday, 25 February 2014 at 03:52:01 UTC, H. S. Teoh wrote: Sadly, I haven't had the time to review the proposed code as closely as I'd like. Don't hesitate to ask to extend the review period if you need to. Making a good review is more important than meeting some arbitrary deadline.
Re: Formal review of std.lexer
On Mon, Feb 24, 2014 at 07:18:58PM -0800, Adam Wilson wrote: [...]
> Note that as badly as I want std.lexer to be included, I want it to
> pass a rigorous review. This review (and, if passing, subsequent
> inclusion) has opened up an opportunity to start using D at work
> that I did not expect, and so I am kind of excited about it.
[...] Me too, I'm looking forward to a well-designed lexer generator in Phobos. I will have lots of uses for it. Sadly, I haven't had the time to review the proposed code as closely as I'd like. T -- What doesn't kill me makes me stranger.
Re: Formal review of std.lexer
On Mon, 24 Feb 2014 17:22:32 -0800, Meta wrote: On Tuesday, 25 February 2014 at 00:28:26 UTC, Adam Wilson wrote: [SNIP] You're throwing what I said way out of proportion. I was replying to the statement: "No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up." I don't agree with this. Obviously std.lexer is well-written and has been through a few rounds of iteration, but that doesn't mean we should accept it without due diligence to ensure that we won't be regretting some overlooked, poorly-designed or badly-named piece of functionality down the road. "Good enough because we don't yet have anything better" is a bad idea. It seems to me that what Brian has written is much better than "good enough", but I don't think that it should be accepted into Phobos *solely* because we don't have anything else. If the community decides that it is a worthwhile addition, then great, but that must not happen *until* it has passed rigorous review, just like every other recent Phobos module. Fair enough. I guess I am just still touchy after the way std.signals was shot down. There weren't great technical arguments for shooting it down, and so I feel that a good piece of code that would've been immediately useful and accepted by the community was rejected over some pretty silly fears. Note that as badly as I want std.lexer to be included, I want it to pass a rigorous review. This review (and, if passing, subsequent inclusion) has opened up an opportunity to start using D at work that I did not expect, and so I am kind of excited about it. -- Adam Wilson GitHub/IRC: LightBender Aurora Project Coordinator
Re: Formal review of std.lexer
On Tuesday, 25 February 2014 at 00:28:26 UTC, Adam Wilson wrote: [SNIP] You're throwing what I said way out of proportion. I was replying to the statement: "No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up." I don't agree with this. Obviously std.lexer is well-written and has been through a few rounds of iteration, but that doesn't mean we should accept it without due diligence to ensure that we won't be regretting some overlooked, poorly-designed or badly-named piece of functionality down the road. "Good enough because we don't yet have anything better" is a bad idea. It seems to me that what Brian has written is much better than "good enough", but I don't think that it should be accepted into Phobos *solely* because we don't have anything else. If the community decides that it is a worthwhile addition, then great, but that must not happen *until* it has passed rigorous review, just like every other recent Phobos module.
Re: Formal review of std.lexer
On Mon, 24 Feb 2014 15:36:43 -0800, Meta wrote: On Monday, 24 February 2014 at 23:07:07 UTC, Adam Wilson wrote: Well, we keep voting down replacement candidates, which, incidentally, is exactly what happened with the std.signals replacement, so I view this as an orthogonal issue to whether or not it should be included after passing a review. I don't think the fact that a module might not be perfect after review should stop us from approving a module that offers completely new functionality AND passed a review. Handling the problems after inclusion is what bugzilla is for. I guess std.signals was a bad example, as there *was* a proposed replacement. However, there were real problems with the replacement that made it not suitable for inclusion. If I recall, these were largely API issues, which are the hardest to change. If we had accepted the new std.signals despite the issues raised, several years down the road it might turn out to be as broken as the old implementation (no offense to the author, this is just for the sake of argument), and we would be unable to replace it for fear of breaking code. There are then 2 options: support the old API with its broken behaviour in the same module as the new API, or introduce std.signals2 or the like, which people have shown resistance to in the past. I think that it's very important to be careful as to what goes into Phobos at this point, as it's going to become increasingly difficult to change anything. Ok. Then by that standpoint we have two remaining options. A) Don't ever change existing code, because you might break someone else's dependent code by accident. Even without an API change, a change in how a function does its processing can result in incorrect behavior in downstream code. B) Never introduce new modules, because we are terrified that the API might not be right in five years. It is unrealistic in the extreme to demand that a new module in Phobos meet some arbitrary future-compatibility bar. We routinely make changes to Phobos that break people's code in subtle ways that don't produce compiler errors, and we never hear about it, because no sane programmer expects a standard library to be a static thing that will never ever change in any way whatsoever so that they can expect the exact same behavior five years from now. No, they go fix it and move on. "OMG! We can't add this! It might not be the right API in the future!" By that reasoning we should halt all work on Phobos, ever. After all, we might break something, or a new API might not be perfect in five years after a new language feature allows an obviously better way of implementing that API. As reasoning goes, the best word I can come up with to describe it is "naive". It is a purely fear-based response to some unspecified and unknown future circumstance. Newsflash: you can't predict the future, which, incidentally, is why APIs change! The reason the std.signals replacement got voted down was that it wasn't API-compatible with the original. This is despite the fact that nobody stood up to say that they use the current implementation and like it. Anybody who did use it said that they are desperate for a replacement. No, the current std.signals is universally reviled and almost completely unused, but we can't change it because of the mythical silent user who might be out there coding under a rock with no Internet and isn't voicing their opinion. Well, guess what: if you don't speak up, you have no right to blame the people who did and decided to change it.
You had your chance; stop whining and go update your code. We are programmers. It is our JOB to deal with changes in the API/ABI/ISA. We don't have the RIGHT to whine about change, because dealing with change is what we DO. If we aren't dealing with change, we're working on something that isn't used at all and is therefore pointless. I'm all for pointless pursuits, but don't hamstring me and my non-pointless pursuits with them. I wanted std.signals in the new form badly, but now I can't have it, and Aurora is going to suffer mightily for it. Now I get to go off and make my own substandard iteration of Signals, because some people whined about how the API wasn't backwards compatible, even though it's widely acknowledged that almost nobody uses it and those that do would've been happy to move to the new signals. Note, I am not talking about voting down modules that are obviously poorly implemented; I am talking about modules that work well and can do the task they purport to do with as few bugs as possible. If the module can do what it says it can, and the API isn't horrible, then we should default to adding it. Griping about the API five years from now is just bikeshedding; write up a pull and get that reviewed. Chances are high it'll get rejected because the changes are worse anyway. The only difficult
Re: Formal review of std.lexer
On Monday, 24 February 2014 at 23:07:07 UTC, Adam Wilson wrote: Well, we keep voting down replacement candidates, which, incidentally, is exactly what happened with the std.signals replacement, so I view this as an orthogonal issue to whether or not it should be included after passing a review. I don't think the fact that a module might not be perfect after review should stop us from approving a module that offers completely new functionality AND passed a review. Handling the problems after inclusion is what bugzilla is for. I guess std.signals was a bad example, as there *was* a proposed replacement. However, there were real problems with the replacement that made it not suitable for inclusion. If I recall, these were largely API issues, which are the hardest to change. If we had accepted the new std.signals despite the issues raised, several years down the road it might turn out to be as broken as the old implementation (no offense to the author, this is just for the sake of argument), and we would be unable to replace it for fear of breaking code. There are then 2 options: support the old API with its broken behaviour in the same module as the new API, or introduce std.signals2 or the like, which people have shown resistance to in the past. I think that it's very important to be careful as to what goes into Phobos at this point, as it's going to become increasingly difficult to change anything.
Re: Formal review of std.lexer
On Mon, 24 Feb 2014 14:32:53 -0800, Meta wrote: On Monday, 24 February 2014 at 22:14:34 UTC, Dejan Lekic wrote: No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up. Naturally anyone should be encouraged to provide a better solution by submitting a pull request to the Phobos developers... The problem is that this is what has been done before, and now we are more or less stuck with outdated, sometimes poorly-written, often buggy modules (std.signals being one example). Well, we keep voting down replacement candidates, which, incidentally, is exactly what happened with the std.signals replacement, so I view this as an orthogonal issue to whether or not it should be included after passing a review. I don't think the fact that a module might not be perfect after review should stop us from approving a module that offers completely new functionality AND passed a review. Handling the problems after inclusion is what bugzilla is for. -- Adam Wilson GitHub/IRC: LightBender Aurora Project Coordinator
Re: Formal review of std.lexer
On Monday, 24 February 2014 at 22:14:34 UTC, Dejan Lekic wrote: No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up. Naturally anyone should be encouraged to provide a better solution by submitting a pull request to Phobos developers... The problem is that this is what has been done before, and now we are more or less stuck with outdated, sometimes poorly-written, often buggy modules (std.signals being one example).
Re: Formal review of std.lexer
On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote: http://wiki.dlang.org/Review/std.lexer This is a follow-up by Brian to his earlier proposal (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module focuses on generic lexer generation, as discussed in the matching voting thread. Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d Example topics to evaluate during review:

- Feasibility of overall design and concept
- Quality of documentation
- Consistency with existing Phobos modules
- Overall quality of implementation

Initial review will end on March the 8th. No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up. Naturally anyone should be encouraged to provide a better solution by submitting a pull request to the Phobos developers... So far I haven't seen a better lexer for D source than Brian's std.lexer. If anyone has, please let me know.
Re: Formal review of std.lexer
Brian, do you have benchmarks for this proposal similar to the ones you provided in the old review threads? (vs the DMD frontend lexer)
Re: Formal review of std.lexer
On Friday, 21 February 2014 at 22:13:30 UTC, Andrej Mitrovic wrote: On 2/21/14, Joseph Cassman wrote: 1. StringCache is a custom hash table. It looks like its primary role is to reduce some sort of duplication. It's used for string interning. But that still doesn't explain why a custom hash table implementation is necessary. Maybe a lightweight wrapper around built-in AAs is sufficient? (I also apologize if this has already been asked and answered in the last review thread, which is unfortunately too long to read in a short time.)
Re: Formal review of std.lexer
On Friday, 21 February 2014 at 22:14:36 UTC, Andrej Mitrovic wrote: Hmm, maybe I should make a pull request upstream? You should.
Re: Formal review of std.lexer
On 2/21/14, 9:51 PM, Brian Schott wrote: Does this mean that you're finally getting approval to release your lexer generator? Affirmative. It'll happen Real Soon Now(tm). Andrei
Re: Formal review of std.lexer
On 2/21/14, Andrej Mitrovic wrote:
> I've used it to learn how to use DLexer; here's my more-documented
> version of the above if it helps anyone. It also takes the D
> parser's whitespace function and uses that to track column
> numbers, injects a toString like I've mentioned, and adds some
> static asserts for sanity checks:
>
> https://github.com/AndrejMitrovic/lexer-demo

Hmm, maybe I should make a pull request upstream?
Re: Formal review of std.lexer
On 2/21/14, Joseph Cassman wrote:
> 1. StringCache is a custom hash table. It looks like its primary
> role is to reduce some sort of duplication.

It's used for string interning.

> 3. I tried to understand the reason for and usage of the
> "extraFields" parameter in "TokenStructure" but couldn't figure
> it out. Could some more explanation of its intent and usage be
> provided?

You could use it to inject a toString method. Here's what I did when I used it:

alias Token = TokenStructure!(TokID,
q{
    /// Better string representation
    void toString(scope void delegate(const(char)[]) sink) const
    {
        import std.conv;
        import %s;

        sink("(");
        sink(this.line.to!string);
        sink(":");
        sink(this.column.to!string);
        sink(")");
        sink(": ");
        sink(this.text ? this.text : tokToString(this.type));
    }
}.format(__MODULE__)
);

Side-note: Note how I've had to inject an import statement for my own module, because the string is mixed in at the declaration site of a module which I can't modify, where IIRC tokToString didn't exist because it's located in *my* module. It's interesting how you can use this feature in D, IOW:

module mymodule;
import other_module;

void foo() { }

mixin_elsewhere!("import mymodule; foo();");

-

module other_module;

void mixin_elsewhere(string str)() { mixin(str); }

P.S. the DLexer links here: https://github.com/Hackerpilot/lexer-demo/

I've used it to learn how to use DLexer; here's my more-documented version of the above if it helps anyone. It also takes the D parser's whitespace function and uses that to track column numbers, injects a toString like I've mentioned, and adds some static asserts for sanity checks:

https://github.com/AndrejMitrovic/lexer-demo
Re: Formal review of std.lexer
Does this mean that you're finally getting approval to release your lexer generator? On Friday, 21 February 2014 at 17:06:23 UTC, Andrei Alexandrescu wrote: Can we please defer this by one week? Thanks, Andrei
Re: Formal review of std.lexer
On Friday, 21 February 2014 at 17:15:41 UTC, Jacob Carlborg wrote: On 2014-02-21 18:06, Andrei Alexandrescu wrote: Can we please defer this by one week? Just make the review period one week longer. This. There is no real rationale behind the existing default review period, so extending it until March the 15th won't cause any issues.
Re: Formal review of std.lexer
On 2014-02-21 18:06, Andrei Alexandrescu wrote: Can we please defer this by one week? Just make the review period one week longer. -- /Jacob Carlborg
Re: Formal review of std.lexer
On 2/21/14, 2:12 PM, Dicebot wrote: http://wiki.dlang.org/Review/std.lexer This is a follow-up by Brian to his earlier proposal (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module focuses on generic lexer generation, as discussed in the matching voting thread. Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d Example topics to evaluate during review:

- Feasibility of overall design and concept
- Quality of documentation
- Consistency with existing Phobos modules
- Overall quality of implementation

Initial review will end on March the 8th. Can we please defer this by one week? Thanks, Andrei
Re: Formal review of std.lexer
On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote: http://wiki.dlang.org/Review/std.lexer This is a follow-up by Brian to his earlier proposal (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module focuses on generic lexer generation, as discussed in the matching voting thread. Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

Thanks for all the work, Brian. I read through the previous threads about the development of this code (links at the bottom) and I can see a lot of effort has gone into it. So the following comments may come across as uninformed, but hopefully they will be helpful.

1. StringCache is a custom hash table. It looks like its primary role is to reduce some sort of duplication. Hash tables, though, are difficult to get right. So perhaps a benchmark comparison could be made against the built-in hash table to show what savings it brings? Since it is in the public interface, should its payload also be public? Although it is built using GC.malloc, how about the in-the-works std.allocator module? Perhaps a version 1 could use GC.malloc, but if a later PR could make it possible to use a custom allocator, that would be nice.

2. I like the fact that a range interface is provided. I realize that the previous discussions stipulated the use of ubyte to avoid encoding work during scanning. The reasoning about performance makes sense to me. That being the case, could a code example be provided showing how to use this module to scan a UTF-8 encoded string? Even if this is going to focus only on scanning code files, the D language spec allows for arbitrary Unicode in a code file. How is this possible? (I have a general idea, just looking for some explicit code sample help; see the sketch at the end of this post.)

3. I tried to understand the reason for and usage of the "extraFields" parameter in "TokenStructure" but couldn't figure it out. Could some more explanation of its intent and usage be provided?

4. Do you want the commented-out pragma statement left over on line 601?

5. Should the template "TokenId" perhaps be something like "generateTokenId" instead? I am not sure what an "Id" for a token means. Is it an integral hash value? I had difficulty seeing how it ties in with the concept of "value" in the header documentation. If this is a numerical hash of a string token, why is the string still stored and used in "tokenStringRepresentation"? I probably am missing something big, but couldn't the number be used to represent the string everywhere, saving on time and space?

6. I tried but had difficulty understanding the difference between the four token categories -- "staticTokens", "dynamicTokens", "possibleDefaultTokens", "tokenHandlers" -- provided as arguments to "Lexer". What is a token that has a value that changes versus a token that does not change? I am not sure where to put my token definitions.

7. Just thinking about using the module: I would like to use it to make a scanner for xml, json, csv, c/c++, etc. I wasn't able to figure out how to do so, however. The initial code example is nice, but could some additional guidance be provided? Also, I wasn't sure how to make use of a lexer once created. The documentation focuses well on how to initialize a "Lexer", but could some guidance also be provided on how to use one past initialization?

8. Andrei's trie search (http://forum.dlang.org/thread/eeenynxifropasqcu...@forum.dlang.org?page=4#post-l2nm7m:2416e1:241:40digitalmars.com) seemed like a really interesting idea.
And I saw in that thread you continued with his ideas. Does this module incorporate that work? Or was it less performant in the end?

9. I ran "dmd -cov" against the module and got zero percent unit test coverage. Perhaps adding some test code will help clarify usage patterns?

You have put a lot of work into this code, so I apologize if the above comes across as picking it apart. Just some questions I had in trying to make use of the code. Hopefully some of it is helpful.

Joseph

Other related posts:
http://forum.dlang.org/thread/jsnhlcbulwyjuqcqo...@forum.dlang.org
http://forum.dlang.org/thread/dpdgcycrgfspcxenz...@forum.dlang.org
http://forum.dlang.org/thread/eeenynxifropasqcu...@forum.dlang.org
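On question 2, one common approach (my assumption here, not something the proposal's docs show) is to view the string's UTF-8 code units in place via std.string.representation, which avoids both copying and decoding:

import std.string : representation;

void main()
{
    string source = "auto π = 3.14159;"; // arbitrary Unicode in the code
    immutable(ubyte)[] bytes = source.representation; // same memory, as bytes
    assert(bytes.length == source.length);
}

The lexer then sees plain code units, and my understanding (not a documented guarantee) is that non-ASCII bytes simply pass through inside identifiers, string literals, and comments.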
Re: Formal review of std.lexer
On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:
> http://wiki.dlang.org/Review/std.lexer

First of all, thank you for the great work. This is a very important project. I'll begin by reviewing the documentation.

> Summary

Some simple explanation of the terminology and concepts would be nice. At least a link to Wikipedia.

> Create the string array costants for your language.

Typo ("costants").

> Examples:

An inline complete example of a very simple language would be nice.

> A lexer for D is available here.

Although good to have, this is too much to take in all at once, for documentation purposes.

> A lexer for Lua is available here.

Nary a comment in sight. This might serve as the example lexer if only it was better commented. The comments can be copy-pasted from the module documentation; even that would make the code much easier to grok.

> Template Parameter Definitions

What does this mean? Parameters to what template? Can this section be moved to inside the documentation of Lexer, and Lexer be moved to the first documented symbol in the file?

> A function that serves as the default token lexing function. For most languages this will be the identifier lexing function.

Should the function signature and contracts be explained here?

> This function must return bool and take a single size_t argument representing the number of bytes to skip over before looking for a separating character.

I think it's better to describe the signature in D syntax rather than English.

> A listing of tokens whose value is variable, such as whitespace, identifiers, number literals, and string literals.

No mention of how the list is represented (is it an array? what type of elements should the array have? how are the array values used?); the reader is left to figure that out from the example below.

> Template for determining the type used for a token type. Selects the smallest unsigned integral type that is able to hold the value staticTokens.length + dynamicTokens.length + possibleDefaultTokens.length. For example if there are 20 static tokens, 30 dynamic tokens, and 10 possible default tokens, this template will alias itself to ubyte, as 20 + 30 + 10 < ubyte.max.

Should this be documented? I understand that this will be instantiated only once, by std.lexer. Utility declarations should preferably be at the end of the module, so that they appear last in the documentation.

> Generates the token type identifier for the given symbol. There are two special cases:

Are these magic constants necessary? Why not declare them as enums?

> In all cases this template will alias itself to a constant of type IdType. This template will fail at compile time if symbol is not one of the staticTokens, dynamicTokens, or possibleDefaultTokens.

Style nit: D symbols should be wrapped in the $(D ...) macro.

> == overload for the the token type.

Is it really necessary to document opEquals? But since it's here: how does it interact with extraFields?

> The Column number at which this token occurs.

There was a lot of bikeshedding regarding the correct terminology to use when adding -vcolumns to DMD (https://github.com/D-Programming-Language/dmd/pull/3077). I think the documentation should at least mention what exactly it is counting.

> A function that serves as the default token lexing function. For most languages this will be the identifier lexing function.

What should the function's name be? How will it interact with Lexer?
(It's not clear that this refers to the defaultTokenFunction parameter, especially coming right after the previous list item, popFront, which is a different piece of the puzzle.)

The documentation for Lexer's arguments seems to be thrown all around the module. I suggest documenting them only once, all in Lexer's DDoc, adding example signatures, and moving Lexer to the top of the module.

> Examples:
> struct CalculatorLexer

I think this should be expanded into a full, well-documented example featured in the module DDoc.

> _popFront();

Where did that come from? Perhaps you meant Lexer's DDoc to say "which should call this mixin's _popFront()", and the DDoc escaped the _ character? If so, why not use a named mixin to disambiguate instead of _?

> struct LexerRange;
> struct StringCache;

These are thoroughly documented, but will they be used by anything other than std.lexer?
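As a reference point for the "smallest unsigned integral type" selection quoted above, the described behaviour amounts to only a few lines. This is a sketch with my own name for the template, not the module's actual code:

// Pick the narrowest unsigned type whose range can number `count` tokens.
template SmallestIdType(size_t count)
{
    static if (count < ubyte.max)
        alias SmallestIdType = ubyte;
    else static if (count < ushort.max)
        alias SmallestIdType = ushort;
    else
        alias SmallestIdType = uint;
}

static assert(is(SmallestIdType!(20 + 30 + 10) == ubyte)); // the doc's example
static assert(is(SmallestIdType!70_000 == uint));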