Re: What is the compilation model of D?
On Wednesday, 25 July 2012 at 20:25:19 UTC, David Piepgrass wrote: I hope someone can give more details about this. TDPL chapter 11 "Scaling Up". That's where I was looking. As I said already, TDPL does not explain how compilation works, especially not anything about the low-level semantic analysis which has me most curious. Strange, because it seems to me this chapter answers all your previous questions. What exact details are you interested in?
Re: What is the compilation model of D?
On Wednesday, 25 July 2012 at 19:54:31 UTC, David Piepgrass wrote: It keeps diving deeper and deeper to find anything it can "start" with. One it finds that, it'll just build everything back up in whatever order is necessary. I hope someone can give more details about this. TDPL chapter 11 "Scaling Up".
Re: Study: build times for D programs
On Tuesday, 24 July 2012 at 15:06:58 UTC, Andrei Alexandrescu wrote: Nevertheless, I think there is value in the study. We're looking at a real nontrivial application that wasn't written for a study, but for actual use, and that implements the same design and same functionality in both languages. OK. And it could serve as a basis for further variations: * introduce some feature (e.g., ranges), measure impact * measure impact of multiple features alone and in combination Of course, trivial changes would unlikely yield anything useful, but I believe there is a way to produce valuable data in a controlled research.
Re: Study: build times for D programs
On Tuesday, 24 July 2012 at 14:34:58 UTC, Andrei Alexandrescu wrote: the D source is in D1 and should be adjusted to compile with D2), That would provide performance (compilation and run-time) for D1 only (with D2 compiler). Performance of a typical D2 app would likely be different.
Re: Let's stop parser Hell
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote: On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath wrote: If you are interested in parsing, than I wouldn't recommend the dragon book, but this one It really is an awesome book. This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool! Yes, I'm also reading it and find it amusingly high-quality.
Re: just an idea (!! operator)
On Friday, 13 July 2012 at 13:46:10 UTC, David Nadlinger wrote: I guess that this operator is only really worth it in languages where every type is nullable, though. David It might mean identity (return the argument unchanged) for value types.
Re: D versionning
On Friday, 13 July 2012 at 06:52:25 UTC, Adam Wilson wrote: I hope Walter isn't against this, because I'm not seeing much community disagreement with this... I would not be against having development and stable versions, but the price is not trivial: every pull request must be done in at least two branches, probably diverging significantly. And most benefits are already available: we have the git version and the last stable version (of course, the latter would be without the latest bug-fixes). That would mean slower progress in applying existing pull requests. (There are 100+ of those, aren't there?) Also, nobody is preventing any person that considers this to be very important from creating a fork of stable branch and applying bug-fixes there. If this happens to be a very useful option, then it could be accepted as a policy. So my point of view is that it might be too early to have such policy yet.
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 17:51:32 UTC, Andrei Alexandrescu wrote: On 7/12/12 1:40 PM, David Piepgrass wrote: 1. Most importantly, the C++ template approach is a big pain for large-scale systems, because in such systems you want to create DLLs/SOs and C++ has neither a standard ABI nor a safe way to pass around template instantiations between DLLs (in the presence of changes to internal implementation details). Similar problems exist for D, yes? It's a lot easier to define a standard ABI for classes than to solve the cross-DLL template problem. The thing is, that can be done in an opt-in manner. People who want methods in the root of the hierarchy can define a root that defines them. But there's no way to opt out of inheriting Object. Basically it's nice to not force people to buy into a constrained environment without necessity. +1 4. Template bloat is no big deal on desktops but it becomes a bigger problem as the target system gets smaller. Maybe some compromise should be made to ensure D remains powerful and capable on small targets. I think virtuals are an equally, if not worse, problem in memory-constrained systems. The simple solution people choose for such - they use them judiciously. Here actually templates may be at an advantage because of their opt-in quality. +1. Such seams give flexibility that otherwise would be extremely hard to achieve. There were two proposals yesterday that I liked. Taken together, they address all the problems that were raised with const functions in Object: 1. Provide a 'safe workaround' for const, for caching and lazy evaluation (implement it carefully to avoid breaking the guarantees of immutable) We should explore this option in any case. At this point I'm starting to believe (a) we're doing the right thing by marginalizing the four methods aside from this issue, (b) caching is good for other things than this particular problem. +2 (Especially if it would be possible to redefine 'lazy', so that it would evaluate at most once.)
Re: just an idea (!! operator)
On Thursday, 12 July 2012 at 17:35:05 UTC, Alex Rønne Petersen wrote: But on the other hand, C# has had it from day one and it's still widely used and encouraged today: It has been introduced in C# 2.0 and quickly gained high popularity.
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 14:58:29 UTC, deadalnix wrote: I think this is not a problem as big as it is stated. Most of that code will be executed close to never, and 60Mb isn't a big deal for any modern computer, not even for most cell phones now. L1 cache size is.
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 13:49:29 UTC, Andrei Alexandrescu wrote: Too complicated. I think we can afford one comparison. One comparison for each of these basic usages. Would branch prediction work fine in each use case? Anyway I vote for removing those methods from the root class.
Re: just an idea (!! operator)
On Thursday, 12 July 2012 at 12:51:38 UTC, Jacob Carlborg wrote: On 2012-07-12 13:35, Jonas Drewsen wrote: Or the operator?? could be borrowed from c# auto a = foo ?? new Foo(); is short for: auto a = foo is null ? new Foo() : foo; /Jonas I really like that operator. The existential operator, also known as the Elvis operator :) . It's available in many languages with slightly different semantics. +1
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 13:39:54 UTC, Steven Schveighoffer wrote: On Thu, 12 Jul 2012 09:20:47 -0400, Andrei Alexandrescu wrote: If we define alternative free generic functions in object.d for the four culprit methods (and have the compiler, druntime, and stdlib use them instead of the methods), those functions can check whether a given class object has overridden the old-style functions. In that case, that means we're dealing with legacy classes and proceed the old-style way. Otherwise, proceed the new way. Hm... I don't like this, it slows down a very basic function. I think if we want a solution that allows old code to work, why not what Timon suggested? Have a base class for Object (RawObject was suggested) that does not implement the opFunctions. It would still break code, but would be easy to fix (just specify your class derives from Object, not RawObject). -Steve
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 13:41:52 UTC, Roman D. Boiko wrote: ... ups. I meant +1.
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 12:43:01 UTC, Roman D. Boiko wrote: On Thursday, 12 July 2012 at 12:36:18 UTC, RivenTheMage wrote: On Thursday, 12 July 2012 at 12:06:49 UTC, Roman D. Boiko wrote: Jon Skeet wrote on this long ago: http://msmvps.com/blogs/jon_skeet/archive/2008/12/05/redesigning-system-object-java-lang-object.aspx The fact that every object has a monitor associated with it was a mistake in Java, and was unfortunately copied in .NET. This promotes the bad practice of locking on "this" and on types - both of which are typically publicly accessible references. I believe that unless a reference is exposed explicitly for the purpose of locking (like ICollection.SyncRoot) then you should avoid locking on any reference which other code knows about. This has been discussed multiple times on the D forum, I believe. Do you mean Monitor or all other issues from that post as well? Do you have any links? I would be interested to know conclusions. OK, I found one myself from this post: http://michelf.com/weblog/2012/mutex-synchonization-in-d/
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 12:36:18 UTC, RivenTheMage wrote: On Thursday, 12 July 2012 at 12:06:49 UTC, Roman D. Boiko wrote: Jon Skeet wrote on this long ago: http://msmvps.com/blogs/jon_skeet/archive/2008/12/05/redesigning-system-object-java-lang-object.aspx The fact that every object has a monitor associated with it was a mistake in Java, and was unfortunately copied in .NET. This promotes the bad practice of locking on "this" and on types - both of which are typically publicly accessible references. I believe that unless a reference is exposed explicitly for the purpose of locking (like ICollection.SyncRoot) then you should avoid locking on any reference which other code knows about. This has been discussed multiple times on the D forum, I believe. Do you mean Monitor or all other issues from that post as well? Do you have any links? I would be interested to know conclusions.
Re: All right, all right! Interim decision regarding qualified Object methods
On Thursday, 12 July 2012 at 04:15:48 UTC, Andrei Alexandrescu wrote: Required reading prior to this: http://goo.gl/eXpuX You destroyed, we listened. [...] What say you? Andrei I agree, they are not needed. This situation is similar to C#, but there the problem doesn't seem to be so severe. Jon Skeet wrote on this long ago: http://msmvps.com/blogs/jon_skeet/archive/2008/12/05/redesigning-system-object-java-lang-object.aspx http://stackoverflow.com/questions/3096028/why-gethashcode-is-in-object-class
Re: Congratulations to the D Team!
On Wednesday, 11 July 2012 at 18:21:24 UTC, Steven Schveighoffer wrote: ... It also seems to allow abuses. For example: class A { private int _x; public @property x() const { return _x; } } class B : A { private int _x2; public @property x() { return _x2++; } } Now I've completely changed the logistics of the x property so that it's essentially become mutable. In effect, I overrode the const piece of x completely to make it non-const without a cast, and anyone calling x() on an A instance cannot trust that it won't increment the effective value of x(). I think the solution to this overall problem is simply to make object.opEquals use the most derived types available for comparison. -Steve I thought about this example. You can have a const method returning different values anyway (instead of returning what base class' method would return), and A._x has not been mutated. From my point of view, it is just a different concept, but arguably not a worse one.
Re: Congratulations to the D Team!
On Wednesday, 11 July 2012 at 18:10:04 UTC, H. S. Teoh wrote: On Wed, Jul 11, 2012 at 08:01:44PM +0200, deadalnix wrote: ... Did you saw the proposal of feep/tgehr on #d ? It basically state that you can overload a const method with a non const one if : - You don't mutate any data that belong to the parent. - You are prevented to create any immutable instance of that classe or any subclasse. +1, very good, I like this idea! It is a trade-off, but relatively nice one, IMO.
Re: Inherited const when you need to mutate
On Wednesday, 11 July 2012 at 14:42:43 UTC, Andrei Alexandrescu wrote: On 7/11/12 10:33 AM, Timon Gehr wrote: Any takers for Cached? It would be good to assess its level of usefulness first. Andrei Well, it is inefficient. The idea here is to assess functionality provided in order to decide whether to go down this route or not. Andrei Probably using a separate thread would be better for voting or discussing, since this one already has many branches of ideas.
Re: Inherited const when you need to mutate
On Wednesday, 11 July 2012 at 14:42:43 UTC, Andrei Alexandrescu wrote: On 7/11/12 10:33 AM, Timon Gehr wrote: Any takers for Cached? It would be good to assess its level of usefulness first. Andrei Well, it is inefficient. The idea here is to assess functionality provided in order to decide whether to go down this route or not. Andrei Probably using a separate thread would be better for voting, since this one already has many branches of ideas.
Re: Inherited const when you need to mutate
On Wednesday, 11 July 2012 at 12:55:39 UTC, Andrei Alexandrescu wrote: Any takers for Cached? It would be good to assess its level of usefulness first. As for me, lazy computation (with caching) would likely be the last feature I really miss in D.
Re: Let's stop parser Hell
On Tuesday, 10 July 2012 at 20:25:12 UTC, Jonathan M Davis wrote: On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote: FWIW, this is what most HTML parsers are doing. Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix. Not having control over parser or source code causes problems. Ability to deliver useful functionality (see my post above) is a different use case.
Re: Let's stop parser Hell
On Tuesday, 10 July 2012 at 19:41:29 UTC, Philippe Sigaud wrote: On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr wrote: Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily. FWIW, this is what most HTML parsers are doing. Ah, right. I can get it for HTML/XML. JSON also, maybe. I was thinking of parsing a programming language (C, D, etc) Consider me half-convinced :) It would still generate errors. But would enable a lot of useful functionality: autocompletion, refactoring, symbol documentation in a tooltip, displaying method overloads with parameters as-you-type, go to definition, etc.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:37:56 UTC, Roman D. Boiko wrote: Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency. That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development. One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed. It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios. Since Pegged doesn't use Packrat algorithm, this solution might be either not relevant or not applicable, but I doubt that there will be any fundamental problem with error recovery. Unpleasant debugging experience, however, should be relevant for any parser that uses backtracking heavily.
Re: getNext
By the way, this thread is quite old. getNext looks similar to what I wanted when first dealt with ranges, but now it looks too heavyweight. What happened to this proposal anyway? Was it deferred, discarded, or what?
Re: getNext
On Monday, 9 July 2012 at 20:21:18 UTC, Mehrdad wrote: that's not what I meant, but I think another solution is better anyway: Why isn't transform taking in both an input and an output range in the first place? Of course they might be the same, but they don't have to be. Could you make a detailed proposal? I still completely don't understand neither the problem you're trying to solve, nor the solution you have in mind. I was reading all your posts in this thread very carefully.
Re: getNext
On Monday, 9 July 2012 at 15:34:35 UTC, Mehrdad wrote: On Monday, 9 July 2012 at 15:16:52 UTC, Andrei Alexandrescu wrote: the only primitive of output ranges is "put". Andrei Sorry? I don't know what you mean template isOutputRange(R,E) Returns true if R is an output range for elements of type E. An output range is defined functionally as a range that supports the operation void put(R, E)(ref R r, E e);
Re: Let's stop parser Hell
On Monday, 9 July 2012 at 06:35:38 UTC, Jacob Carlborg wrote: I'm not expert in this field but I've heard that it's harder to get good error reporting with a parser generator. Good point!
Re: Let's stop parser Hell
David, I would suggest you to create a proof-of-concept prototype and pay attention to some non-trivial use cases. This way you'd likely be able to reduce major risks significantly before making any serious time investments.
Re: Let's stop parser Hell
I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized) s/sews/seams
Re: Let's stop parser Hell
On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote: It's been too long since I was actively working on parsers to give any details, but it is my understanding that because a hand-written parser is optimized for a specific grammar, it's going to be faster. My aim is to find out any potential bottlenecks and ensure that those are possible to get rid of. So, let's try. I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized) code where necessary, including those needed to take advantage of some particular aspects of D grammar. Thus I claim that optimization for D grammar is possible. Also, look at dmd and dmc vs other compilers. They use hand-written parsers and are generally much faster than their competitors. Due to which particular design or implementation decisions? Would it be possible to name a few which are relevant in this context? One thing to remember about hand-written parsers vs generative ones though is that they usually are completely different in terms of the type of parser that you write (e.g. hand-written parsers are generally recursive-decent parser whereas generative ones usually use bottom-up parsers). So far discussion goes in favor of LL(*) parser like ANTLR, which is top-down recursive-descent. Either Pegged will be optimized with LL(*) algorithms, or a new parser generator created.
Re: Let's stop parser Hell
On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote: most of the focus right now from people interested in parsing seems to be on pegged and parser generators (which are very cool and in some ways much more interesting, but I seriously question that that's the performant way to go if you're looking to parse D specifically). Can you provide a *specific* example of performance optimization which a custom D compiler would have, but parser generator would be unlikely to catch up.
Re: Let's stop parser Hell
On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote: On 08-Jul-12 13:05, Tobias Pankrath wrote: Yup, LL(*) is my favorite so far. That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative. We should also consider using GLR if LL(*) doesn't work. GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken. http://www.antlr.org/papers/LL-star-PLDI11.pdf describes some disadvantages of GLR with respect to LL(*).
Re: Let's stop parser Hell
On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote: I'm not seeing any tremendous error-handling difficulty in my idea. Anyway, I missed the part about information being thrown away...? (I will use the word "you" to refer to some abstract person or compiler.) Error handling could mean either error reporting and stopping after the first error, or reporting several errors and continuing parsing + semantic analysis whenever possible, so that the user would get partial functionality like code completion, information about method overloads, or "go to definition" / find all usages, etc. The first method is the less powerful option, but still requires constructing a meaningful error message which could help the user. There are many possible error recovery strategies. For example, you might decide to insert some token according to the parsing information available up to the moment when error is discovered, if that would fix the problem and allow to continue parsing. Another option is to ignore a series of further tokens (treat them a whitespace, for example), until parser is able to continue its work. Also there are many heuristics for typical error situations. All of these can only be performed if parser gathers the syntax information which can be inferred from source code according to the grammar rules. In stage 2 you have only performed some basic analysis, like, e.g., matched braces to define some hierarchy. This means that at the time when you find a missing brace, for example, you cannot tell anything more than that braces don't match. Or, if the user inserts a brace in an incorrect location, it is only possible to say that it is either missing somewhere and somewhere else another brace is missing, or that it is missing in one place, and some other brace is redundant. In many cases you won't even notice that a brace is incorrectly placed, and pass the resulting tree to the 3rd stage. You don't take any hint from grammar about the exact locations where some token should be present. Now, stage 3 heavily depends on the output of stage 2. As I demonstrated in some examples, it could get the output which implies incorrect structure, even if that has not been found in the previous stage. You would need to analyze so much information attempting to find the real roots of a problem, that effectively it would involve duplicating (implicitly) the logic of previous stage, but with combinatorial complexity growth. The problems you would need to deal with are much more complex than I described here. So you wouldn't be able to deliver error recovery at all, or (if you could) it would be either trivial or would require needlessly complex logic. Breaking the system at the wrong boundary (when the number of dependencies that cross the boundary is large) always introduces artificial complexity. Described above is the illustration of what I call information loss. I may have described something not as clearly as needed, but I didn't have the goal to provide a thorough and verifiable analysis. I speculated and simplified a lot. If you decide to ignore this, it is not a problem and I don't state that you will fail any of your goals. Just admit that __for me__ this approach is not a viable option. Everything written above is IMHO and there may be many ways to resolve the problems with various degrees of success.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:40:05 UTC, Andrei Alexandrescu wrote: http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers Thanks, nice article. Also found another post: http://www.antlr.org/wiki/display/~admin/2008/11/30/Example+tree+rewriting+with+patterns, which is related to some other discussion in this thread :)
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote: I see what you mean. And I think that I expressed myself badly. Let me rephrase. When the memory hierarchy is deep, every level would require at least one bit position. Or even every base class would require a separate bit. (I think that the former + several bits to distinguish among hierarchies.) Otherwise it would not be easy to check by a bit-mask. Even if the above is incorrect (and that is likely since I didn't try to encode that compactly for the real grammar), I think that in general that information would only be possible to store in a fairly large integral. Especially if we try to generate parser from grammar, and thus can't do fine-tuning to pack the information tightly. This overhead would be paid per each AST node __instance__. But that is not necessary, since we could store information in a lookup table only once per node __type__.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote: This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again. So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:05:05 UTC, Dmitry Olshansky wrote: That would not always work, but yes that's what we should do IMHO. I didn't do a deeper research on this, just shared my current understanding. When that doesn't work in a particular case, we will need to find the problem and solve that. It's not what packrats do however (when I say algorithm I mean specifically packrat). And PEGs are commonly associated with packrats. As Philippe Sigaud admitted, that is an incorrect association. it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it. Simply stop tracking respective state. Process others in parallel as I described above. Yeah it could be done, but it doesn't buy you a thing in a lot of cases unlike in regular grammar. The problem is that state is not as simple as in regex for instance (even in regex that must capture some submatches DFA won't do BTW). Thus you can't merge seemingly identical states because they depend on (different) interpretation of previous part of input. Well, if you must track the path to a particular state, then DFA simply doesn't match the problem and there's nothing we can do. That's why DFA in ANTLR is only used to do lookahead or lexing, it doesn't maintain path through states during parsing. See above. Is maintaining path really necessary? I thought that DFA with additional states could be created for any particular situation of ambiguity, etc. This needs further research. Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate. The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar). Finite and proportional to input size? Another way to say why an O(n) memory requirement for packrats? Sorry, I incorrectly included the O(n) multiplier. It should be finite. When you go to the next character, you don't need previous states any more. Yup. But I'm not talking definitions here I'm talking the notion of "integrated lexing" and lexing is certainly about characters or was it? I thought integrated lexing was about doing it along with parsing and specifying its rules directly inside the grammar. From the former we get structural context to narrow-down available options, and from the latter we have a uniform grammar description.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote: enum : SyntaxElement { AST_EXPRESSION = 0x0001___, AST_UNARY_EXPR= 0x_0001__ | This would cause wasting space (probably a lot). Definitely it would not be easy to implement in a parser generator, when various language properties are not known beforehand for fine-grained tuning. This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case. For example: another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table. Depending on implementation, that might introduce the multiplier overhead of table access per each comparison (and there would be many in case of searching for nodes of specific type). I would expect the regex engine to call the isA function as one of it's operations. Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR too. But actually it is not so difficult to implement in a very similar way to what you described. I was thinking about a lookup table, but different from a traditional inheritance table. It would be indexed by AST node type (integral enum value), and store various classification information as bits. Maybe this is what you meant and I misunderstood you... Example is here: https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d (sorry, it doesn't show how to do classification, and has a different context, but I hope you get the idea). The advantage over storing hierarchical information directly in each token is obviously memory usage.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote: it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parenthesis are not matched up properly and then it can make reasonable guesses about where those missing braces/parenthesis were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it. There could be multiple errors that compensate each other and make your phase 2 succeed and prevent phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to create a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover from it by inserting an additional token or ignoring tokens until parser is able to continue its work properly. All this would suffer for the same reason: you loose information.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:43:58 UTC, David Piepgrass wrote: Still... it's a fun concept, and even if the initial parsing ends up using the good-old lex-parse approach, semantic analysis and lowering can benefit from a tree parser. Tree parsing, of course, is just a generalization of linear parsing and so a tree parser generator (TPG) could work equally well for flat input. Exactly. Semantic analysis is what would benefit from that, as well as client code (for example, refactoring tools, etc.) Parser would become quite complicated. Probably it would need to perform complex tree navigation (which might decrease performance proportionally to the average tree depth or even more, if parser algorithms were themselves fast). Any non-trivial query would require direct character manipulation (like comparing sub-strings of captures with string literals, etc.). Probably you would get frequent cache misses because of the need to jump throughout the (incomplete) AST. It would definitely be problematic to build an immutable AST model, thus (IMO) significantly and needlessly constraining you and users of your library. And likely you would have to deal with many more problems __without__ significant gains.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote: On 08-Jul-12 01:24, Roman D. Boiko wrote: But PEG itself doesn't require backtracking parsing, does it? It does. Specifically the algorithm is to try option A, try option B, try option C... There is no algorithm, only __grammar__. It specifies that option A has higher priority than option B in expression A / B / C. But it doesn't forbid, for example, to analyze all in parallel (track state transitions) for each character consequently, as a proper DFA implementation would do, until some option succeeds and all previous fail. A greedy regex would also have to check all successor options (C in the exapmle above) to determine the longest one. it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it. Simply stop tracking respective state. Process others in parallel as I described above. Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate. The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar). Tokens.. there is no such term in use if we talk about 'pure' PEG. Terminal symbols. Characters. Nope. According to the definition, PEG is a set of non-terminal symbols, terminal symbols, production rules, and a starting non-terminal. Terminals are not necessarily characters.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:27:14 UTC, Dmitry Olshansky wrote: So my view on the matter: whether or not lexer is integrated is two-fold question: notation vs implementation. E.g. it may make regular lexer-like things a part of notation thus having rules like: Identifier -> Alpha (Alpha|Number|_)* Yet it may or may use the same engine for _all_ _rules_, in fact if implementation optimize things by ripping off regular parts of grammar to small DFA engines it would make some kind of lexer behind the scenes! This is the implementation I have in mind as the only sane way to parse PEG efficiently. Plus specializing DFA to only check for those terminals which are allowed according to available structural information at the moment of their invocation. Anyway highly hybrid parsers are all the rage now, and reasons are obvious - they runs fast and can be bend by some magic to quite convoluted grammars of modern languages ;)
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote: You may misunderstood me as well, the point is still the same: there are 2 things - notation and implementation, the fact that lexer is integrated in notation like in PEGs is not related to the fact that PEGs in their classic definition never use term Token and do backtracking parsing essentially on character level. But PEG itself doesn't require backtracking parsing, does it? So that's an implementation detail, or a specific algorithm. Lexer separation seems to be independent of this. As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that grammar is properly defined, so that there is only one way to split source into tokens. Tokens.. there is no such term in use if we talk about 'pure' PEG. Terminal symbols.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote: And given the backtracking nature of PEGs you'll do your distributed thing many times over or ... spend a lot of RAM to remember not to redo it. I recall lexing takes even more then parsing itself. I think that your conclusions are about statistical evidences of PEG misuses and poor PEG parser implementations. My point was that there is nothing fundamentally worse in having lexer integrated with parser, but there are performance advantages of having to check less possible cases when the structural information is available (so that lexSmth could be called when Smth is expected, thus requiring less switch branches if switch is used). As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that grammar is properly defined, so that there is only one way to split source into tokens.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote: I'd like to add that if we give tree parsing first-class treatment, I believe the most logical approach to parsing has three or more stages instead of the traditional two (lex+parse): 1. Lexer 2. Tree-ification 3. Parsing to AST (which may itself use multiple stages, e.g. parse the declarations first, then parse function bodies later) The new stage two simply groups things that are in parenthesis and braces. So an input stream such as the following: I bet that after stage 2 you would have performed almost the same amount of work (in other words, spent almost the same time) as you would if you did full parsing. After that you would need to iterate the whole tree (possibly multiple times), modify (or recreate if the AST is immutable) its nodes, etc. Altogether this might be a lot of overhead. My opinion is that tree manipulation is something that should be available to clients of parser-as-a-library or even of parser+semantic analyzer, but not necessarily advantageous for parser itself.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu wrote: On 7/7/12 3:17 PM, Dmitry Olshansky wrote: I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage. Interesting. I'll have to ask for more details on why. Andrei +1. Personally I associate PEG with a parser that includes a __distributed__ lexer inside, which gives the advantage of having to check less alternatives at each step (that's similar to deterministic parsing). If lexer is separate, it seems to be more difficult to scan for only a subset of possible tokens.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote: Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the tokenizer? Since I didn't understand your question I assume that my statement was somehow incorrect (likely because I made some wrong assumptions about ANTLR). I didn't know about its existence until today and still don't understand it completely. What I think I understood is that it uses DFA for deciding which grammar rule to apply instead of doing backtracking. I also think that it uses DFA for low-level scanning (I'm not sure). The idea to introduce DFA for both determining which rule to apply and lexing of terminal symbols appeared to me much earlier, and the suggestion to introduce them into Pegged is one of options which I think could extremely improve performance.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 19:58:37 UTC, Roman D. Boiko wrote: On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote: http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method OK, at least I didn't misunderstand. So my question was whether the alternative that I described and which exists in PEG is somehow worse than OPP. Wikipedia seems to provide an answer to that: OPP tend to be simpler. (I didn't investigate this topic further.) OK, now I agree, the need to perform several nested calls in order to parse some expression is costly enough to consider OPP superior to a general PEG for expressions. At first I was surprised when found that D doesn't define operator precedence explicitly, but instead provides a hierarchy of expression types. Then I somehow concluded that the approaches are equivalent. (I started learning parsing techniques only in February '12.) Since then I never reconsidered my past conclusions.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote: http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method OK, at least I didn't misunderstand. So my question was whether the alternative that I described and which exists in PEG is somehow worse than OPP. Wikipedia seems to provide an answer to that: OPP tend to be simpler. (I didn't investigate this topic further.)
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote: Another idea that I've never realized is to add operator precedence grammar to pegged. Of course it should fit naturally with traditional PEG, for instance taking responsibility for parsing expressions. But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote: I was thinking the same thing. My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should to be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops. Funny, we've discussed an idea to introduce some hybrid of regex and xpath for querying, searching and traversing AST with Dmitry a few weeks ago. A custom NDepend-like Code Query Language was the major alternative we considered. Discussion started on this forum and continued via email. In this vision I do not use classes and inheritance for my AST. Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code. Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST. Exactly. This idea first came to me in April after I implemented the first top-down recursive descent custom parser for a D subset. I tried Visitor pattern before that and wasn't happy. There are some subtle difficulties which I believe will be possible to overcome, most important being the need to introduce a mechanism for hierarchical classification (like a pow expression being an assign expression at the same time).
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote: Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me. At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:49:09 UTC, deadalnix wrote: I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals. +1
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote: I added dstrings because 1- at the time (a few months ago), the lists here were awash in UTF-32 discussions and I thought that'd be the way to go anyway 2- other D parsing libraries seemed to go to UTF32 also (CTPG) 3- I wanted to be able to parse mathematical notation like nabla, derivatives, etc. which all have UTF32 symbols. I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it. Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency. That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote: Interesting, I thought that PEG ⊂ CFG holds. See section 3.4 of http://bford.info/pub/lang/peg.pdf It contains a simple proof that a non-context-free language (a^n) (b^n) (c^n) can be described with PEG syntax. There are infinitely many such languages.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 15:42:30 UTC, Chad J wrote: On 07/05/2012 08:28 AM, Roman D. Boiko wrote: Pegged may eventually become standard, if it will be performanceoptimized and a bit more customizable Interesting, I thought you were hand-writing this stuff. I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works. What kind of things did you want in terms of customizability? There are many possible customization point which would make it a better fit for my project while also being useful in general. The most critical was the one you implemented: ability to define a custom parse tree. I also needed the ability to use immutable structures (almost) everywhere, while currently they must be mutable. Also it would be great to avoid UTF conversions (instead use functions and types templated by the string type). I also wanted to add ability to reuse parts of previously generated AST which correspond to source code that has not been changed, or to identical source code parsed previously. (This would allow incremental parsing, e.g., while user is typing.) The latter would require to track source code positions separately, or even generating them on demand (this is the feature actively criticized by deadalnix in some previous discussions but central to DCT architecture). AST would only hold information about widths of its nodes. I've also written some notes (10-15 suggestions) while studying Pegged code, which will be shared later. However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote: Proper CFG parsers all are liner-time anyway. To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs. Also any PEG defines a language with linear-time parsing. Some non-context-free languages can be described with PEG.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 11:33:18 UTC, Dmitry Olshansky wrote: Yup, LL(*) is my favorite so far. As for debugging and error recovery they are always a problem. It's interesting, that I also wanted to use DFA for non-terminals (similarly to LL(*)), but was considering their usage as pure performance optimization. Concerns raised in the article seem to be very reasonable, I will likely reconsider my previous plans... This forum is an endless source of high-quality feed-back for me.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote: I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser. Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 10:05:30 UTC, Dmitry Olshansky wrote: I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". Proper CFG parsers all are liner-time anyway. Exactly, that is also my point. I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk So far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable. When I implemented recursive-descent parser by hand in one of early drafts of DCT, I strongly felt the need to generalize code in a way which in retrospect I would call PEG-like. The structure of my hand-written recursive-descent parser was a one-to-one mapping to an implemented subset of D specification, and I considered it problematic, because it was needed to duplicate the same structure in the resulting AST. PEG is basically a language that describes both, the implementation of parser, and the language syntax. It greatly reduces implicit code duplication. I think that generated code can be made almost as fast as a hand-written parser for a particular language (probably, a few percent slower). Especially if that language is similar to D (context-free, with fine-grained hierarchical grammar). Optimizations might require to forget about strictly following any theoretical algorithm, but that should be OK.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote: This work on parsers might be a good place for me to dive in. I have an interest in parsers and familiarity with LL, LALR, PEGs, and even Pratt parsers (fun!), but I am still inexperienced. ... One thing that has always concerned me about PEGs is that they always say PEGs are slower than traditional two-phase LALR(1) or LL(k) parsers. However, I have never seen any benchmarks. Does anyone know exactly how much performance you lose in an (optimized) PEG compared to an (optimized) LALR/LL parser + LL/regex lexer? I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 01:47:21 UTC, H. S. Teoh wrote: Frankly, I don't know what DCT stands for (anyone?) That's a working (not final) name of my project, a.k.a. The D Compiler Tools. I've used this acronym in previous threads but didn't explain this time. Current state is very far from completion, it is in research phase. Link: https://github.com/roman-d-boiko/dct, but I'd suggest to contact me at r...@d-coding.com if anyone is interested.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 22:32:29 UTC, deadalnix wrote: Well you did the mistake to introduce an over complex mechanism in log(n) to get back to source code when an obvious one in O(1) exists. Let me doubt. I'm fine with that, let you doubt.
Re: dfmt - D source code formatter
On Thursday, 5 July 2012 at 22:25:15 UTC, Walter Bright wrote: I agree that whatever is inside the comment should be left alone. I was more talking about lining up comment blocks, etc. Why would that be more difficult to do than code formatting? I'm wondering.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote: Why not program instead of doing bureaucracy ? To avoid programming things which are not needed or don't fit. I've thrown away several implementations already... time to reflect a little :) But, actually, for DCT I do know what I need to do. That suggestion was with respect to any candidate for becoming standard. Everybody understands that their way (likely differently than others).
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 20:28:32 UTC, Philippe Sigaud wrote: Hmm 72 € by Springer, 55 € on Amazon. Something is not right. Paperback vs perfect bound maybe? http://www.komkon.org/~sher/books/parsing_techniques_2008.pdf Not sure that it is legal, but definitely free.
Re: Let's stop parser Hell
isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / lookup]. This one calculates a sum of all identifier code units. Included for comparison. isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup]. Check whether an identifier is a keyword using AA (dictionary) lookup. isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [ns / lookup]. Switch by identifier length at the top, then recursively switch by each character. Clearly a winner, but I think it can be improved further. isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [ns / lookup]. Binary search in an ordered array of keywords. isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [ns / lookup]. Ditto, search is linear. isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [ns / lookup]. Lookup a keyword in a trie, created by Dmitry. This will be improved. isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [ns / lookup]. The same, but only determines whether an identifier is a keyword, not which exactly. Total: 104183 identifiers + 17488 keywords. Analyzed the largest Phobos file (DateTime? I didn't check the name.) Results are consistent for other files, too.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 19:54:39 UTC, Philippe Sigaud wrote: On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu wrote: I'll be glad to buy for you any book you might feel you need for this. Again, there are few things more important for D right now than exploiting its unmatched-by-competition features to great ends. I don't want the lack of educational material to hold you down. Please continue working on this and let me know of what you need. That's nice of you, if a bit extreme for a public mailing list :) Andrei, money is no problem :) Interest in the field of parsing is no problem. Interest in D future is no problem. Having a demanding job, and three children, is a problem. No, scratch that, you know what I mean. I have four, from 1 to 7 years old... Wouldn't call them a problem, though :))) But hey, Roman is doing interesting things on keyword parsing right now, and changing the parser generated by Pegged is not difficult. We will see where this thread lead. (Roman, you should send your results here, because I'm still taken aback by the built-in AA speed compared to linear array look-up for 100 keywords). Well, I wouldn't call those "results" yet. Just some benchmarks. Here they are: isKeyword_Dummy (baseline): 427 [microsec] total, 28 [nanosec / lookup]. isKeyword_Dictionary: 1190 [microsec] total, 214 [nanosec / lookup]. isKeyword_SwitchByLengthThenByChar: 466 [microsec] total, 83 [nanosec / lookup]. isKeyword_BinaryArrayLookup: 2722 [microsec] total, 490 [nanosec / lookup]. isKeyword_LinearArrayLookup: 13822 [microsec] total, 2490 [nanosec / lookup]. isKeyword_UnicodeTrie: 1317 [microsec] total, 237 [nanosec / lookup]. isKeyword_UnicodeTrieBoolLookup: 1072 [microsec] total, 193 [nanosec / lookup]. Total: 22949 identifiers + 5551 keywords. isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [nanosec / lookup]. isKeyword_Dictionary: 4247 [microsec] total, 242 [nanosec / lookup]. isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [nanosec / lookup]. isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [nanosec / lookup]. isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [nanosec / lookup]. isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [nanosec / lookup]. isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [nanosec / lookup]. Total: 104183 identifiers + 17488 keywords. As Dmitry said, we might hit a CTFE wall: memory consumption, computation speed, ... (*channelling Andrei*: then we will correct whatever makes CTFE a problem. Right) Philippe (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote: Count me as interested. CTFE needs more correctness & speed though. So to put it blantly - no it's not possible right NOW. BUT it doesn't prevent us from planing and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch. IMO, first priority is to make code generated by Pegged work fast, and second priority - make code generation itself fast.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 18:28:21 UTC, Andrei Alexandrescu wrote: On 7/5/12 2:16 PM, Philippe Sigaud wrote: So Pegged or any other generator should *not* get the community focus right now. Pegged should be the focus. +10 (can I vote ten times?) My plan would be as follow: 1- assemble a group of people knowing parsing. I don't think I'm exactly knowledgeable, but I'm ready to be part of such a group. 2- create a github process. 3- translate an existing parser / adapt a D parser for Phobos. I'm ready to be part of this (I'm sure I'll learn in the process) 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding) 5- submit it to Phobos and have it adopted. Sounds good. Replace 1-2 years with 1-2 months. Well, probably with 3-4 months... :)
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote: On 2012-07-05 18:32, Roman D. Boiko wrote: My vote would be for Pegged, I guess. As much as I'm flattered by that, my current impression is Pegged is very far from being performant. As a proof-of-concept that, in D, it's possible to parse a string and create a parse tree at compile-time and then generate code from this, it's also successful. Go D! As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition), I'm sure I'll learn many things in there. So, if anyone is willing to change the code generated by Pegged, I'm game. The results you showed me on keyword parsing are very interesting! But, my impression is that the need for a 'D'-only parser and lexer is far greater and much more imediate that the need for a parser generator. All the reasons advanced upthread ask for a D parser, not a generic generator. Parser generators are for those of us interested in having DSLs or macros in D. So Pegged or any other generator should *not* get the community focus right now. I'm sure it can generate **much** faster code. I'm going to focus on its part that generates D parser (i.e., to make it significantly faster and able to efficiently parse-as-you-type). Actually, I'm sure it will be able to beat any other parser with respect to performance. :) 1. So my plan is the following: invite whoever would want to help. 2. Prove my claims above in practice. :-)
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 16:38:27 UTC, Jacob Carlborg wrote: On 2012-07-05 18:32, Roman D. Boiko wrote: My vote would be for Pegged, I guess. Aren't you voting on your own project :) Well, I'm going to base parsing on Pegged, after tweaking it to my needs.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote: It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until its done and in Phobos. - Jonathan M Davis Resume: everybody is welcome to join effort of translating DMD front end, and improving Pegged. Also I would like to invite those interested in DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote: Original post: http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141 OK, fairly complete. Let it be the starting point. Why not put it on some wiki, then add some more, discuss, vote, etc.? Meanwhile create a matrix of features implemented in each of interesting standardization candidates. And pick up one of them as "standard" or "reference" parser. My vote would be for Pegged, I guess.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote: On 2012-07-05 15:08, Roman D. Boiko wrote: Anyway I propose to enumerate major use cases first. Haven't we already done that a couple of times. Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 15:42:22 UTC, Caligo wrote: Is the actual grammar available somewhere? The online language reference is all we got I guess? DMD doesn't seem to be using yacc/bison either. In PEG format, yes (not necessarily correct): https://github.com/roman-d-boiko/Pegged/blob/dct/pegged/examples/dgrammar.d I don't know about any others, though.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 13:05:41 UTC, Dmitry Olshansky wrote: On 05-Jul-12 17:04, Roman D. Boiko wrote: Well why not. But first I'll need to deliver some stuff on my GSOC project. I bet that after you finish with GSOC optimizing Pegged will not be less relevant than it is now :) And as for DCT, it will take another half a year (at least) until it will become usable for any real needs (except the most trivial).
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 12:53:02 UTC, Denis Shelomovskij wrote: 05.07.2012 16:30, Roman D. Boiko пишет: Forgot to add DDMD, which also has been forked and redesigned recently by someone (thus two more different compiler frontends). But overall I doubt that any project could become standard very soon. Why? Even were they all almost equal we can select any one. The point is to select something (anything) to guide a coder who want to do something with this task to a good/up-to-date/supported parser. Well, I guess we'd like to select one and not change it later, don't we? I'm not sure we can decide which is the best currently and will stay the best in the future for all major use cases. Anyway I propose to enumerate major use cases first.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote: Then do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;) Would you help to make Pegged faster?
Re: Let's stop parser Hell
Forgot to add DDMD, which also has been forked and redesigned recently by someone (thus two more different compiler frontends). But overall I doubt that any project could become standard very soon.
Re: Let's stop parser Hell
On Thursday, 5 July 2012 at 12:11:33 UTC, Denis Shelomovskij wrote: There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster. If so, it can be (easily?) rewritten in D. We also already have some other parsers (in C# from Mono-D etc.). What about to get one and call it standard? Visual-D is not Boost-licensed (I think this would be possible to change) Mono-D is written in C#, as you mentioned Pegged may eventually become standard, if it will be performance optimized and a bit more customizable Dscanner(https://github.com/Hackerpilot/Dscanner) from Brian Schott is pretty good, too. SDC is another nice option DIL (http://code.google.com/p/dil/) is very nice but GPL I plan to try using Pegged inside my DCT project. Probably that will require huge modifications though... Some more links from Pegged readme: Hisayuki Mima's CTPG(https://github.com/youkei/ctpg), very similar, also done in D. Have a look! Nick Sabalausky's Goldie (http://www.dsource.org/projects/goldie). Benjamin Shropshire's dparser (http://dsource.org/projects/scrapple/browser/trunk/dparser). Martin Nowak put these gists on the D newsgroup: https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexer
Re: Proposal: takeFront and takeBack
On Thursday, 5 July 2012 at 08:34:29 UTC, Roman D. Boiko wrote: I make no conclusions, because have not run any benchmarks to estimate how complex the code should be to take those 30 ms. Such benchmarks would be valuable for discussion. That has been written before I saw all benchmarks, so my post expired before being created.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 22:02:28 UTC, deadalnix wrote: Le 04/07/2012 21:45, Jonathan M Davis a écrit : On Wednesday, July 04, 2012 12:19:15 Jonathan M Davis wrote: But at present, I'm seeing a performance improvement of approximately 70 - 80% in iterating over strings with consumeFront rather than front and popFront (depending on the compiler flags and strings used). I should clarify that. It's taking 70 - 80% as long to use consumeFront to iterate over a string than it is to iterate using popFront and getting front on every iteration. The way I worded that could imply that it was taking 20 - 30% as much time, which would be a _much_ larger improvement than I'm actually seeing. - Jonathan M Davis The win is significant indeed. I'm not sure. Let's speculate a bit with some almost random numbers. E.g., suppose that some test without user code in a loop takes 70 (or 80) ms (per iteration) instead of 100. Or, assuming that some optimization of front / popFront would make it proportionally twice faster, 35 instead of 50. It may be significant, if user code inside the loop is relatively fast, but that is not often the case. Let's assume that it takes 30 ms (that looks like pretty fast). So overall it would be (70 + 30) vs (100 + 30) before front / popFront optimization, and (35 + 30) vs (50 + 30) after (huge optimization). The latter is 65 vs 80, which is about 81 vs 100 (instead of 70 vs 100 before optimization). If user code was slower, impact of front / popFront optimization would be relatively larger, and vice verse. I make no conclusions, because have not run any benchmarks to estimate how complex the code should be to take those 30 ms. Such benchmarks would be valuable for discussion.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 12:24:14 UTC, trav...@phare.normalesup.org (Christophe Travert) wrote: I am not sure what would be worse, in the long run, between asking developpers to make front remain valid after popFront until next call to front, or having two different standard ways to iterate an input range (front/popFront and consumeFront). Just to clarify (although I gave up with that idea): my suggestion was not to require that value returned from front was valid until next call to front. It was the following: in those ranges where value returned from front is invalidated after a call to popFront, define consumeFront in a such way that an equivalent of popFront is deferred until the next call to any of [front, popFront, consumeFront], and make empty take into account information whether popFront is pending. Currently I think that idea was very bad, given that it would be difficult to find all ranges which invalidate previous front value on popFront.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 12:40:41 UTC, Tobias Pankrath wrote: You have to maintain 2 version of you algorithm. This is more work to test, to maintain, and it is likely to introduce more bugs. More code == more bugs. More NPATH = more bugs and harder to test and to maintain. In addition, this will result in practice in less generic code, because one may not need the version of the algorithm without consume primitives in his/her own code and not code it at all. Since the performance problem seems mainly to materialize for string processing code an alternative to this proposal would be to add those functions only for strings. As far as I heared, most of phobos is already special cased for string so the maintenance overhead wouldn't be too big. That seems to be reasonable. Drawback is that Phobos algorithms would not ever take advantage of consumeFront for anything else (even if that was defined).
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 10:54:41 UTC, deadalnix wrote: I do agree with that. However, the current proposal have the same drawback but on the code that use the range. Both solutions are messy IMO. What is error-prone in current client code? If particular algorithm wants to take advantage of a potential speed-up, it may decide to check whether hasConsume!Range, and call consumeFront instead of front + popFront. Nobody is obliged to do so. The only problem I can see is potential code duplication, which is lower priority in performance-critical part of code, IMO.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 08:40:46 UTC, Jonathan M Davis wrote: It depends on the API. All that the word read indicates is that data was read. It doesn't indicate anything about what was read from or what happened to it. In some cases, read doesn't alter what's read at all. In others it does. There's an iterator type that I deal with at work which uses read to indicate that a value was read but the iterator wasn't moved, and it uses consume to indicate that not only was the value read but that the iterator was moved. - Jonathna M Davis If we have both front and readFront, it would be natural to expect readFront to have side-effect on range. But I'm fine with both options, just personally I would select read.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 08:31:05 UTC, Jonathan M Davis wrote: But if the default is that there is no consumeFront, then it'll only be there when the programmer determines that they need it and defines it. So, the default is safe, but the option of efficiency is there if the programmer codes for it. OK, you sold it to me :) If some developer decides to use consumeFront for efficiency reasons, it should not be a problem to statically check hasConsume. What I was trying to avoid was code bloat at clients, assuming that they used consumeFront to have less code (one call instead of two). That would be a very different use case, and currently proposed design doesn't fit that use case. But it is really minor comparing to potential risk.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 08:14:54 UTC, Jonathan M Davis wrote: On Wednesday, July 04, 2012 09:55:57 Roman D. Boiko wrote: Still would be nice to get your (or community) feedback on my previous proposals to localize custom logic inside ranges which invalidate value returned from front. It sounds messy and error-prone, to be honest. And any range that needed to implement it and didn't would be broken. It would be broken only in cases when consumeFront is used, and that would likely be easily detectable. Still I don't have any idea how many such ranges exist. But overall I agree.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 08:20:59 UTC, Mehrdad wrote: I propose we just allow (but not require) popFront() to return ElementType!(R) instead of void? That way, people who need the performance can check to see the return type, and use it without front() if needed. That would be almost perfect. But has drawbacks: * if element is big and may be not needed in all calls to popFront (when the user wants to ignore elements), there would be some overhead; * it might be impossible to change some ranges which would work with current design; I'm not aware of such cases, though * something else?
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 08:25:19 UTC, Mehrdad wrote: Where have you last ever seen a read() method whose default behavior was NOT to modify the collection/range/iterator/stream/whatever? +1
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 08:13:10 UTC, Mehrdad wrote: What's wrong with moveFront()? It has different semantics. For example, it is only supported for ranges which have movable elements. After moveFront, the element which was placed in the front position is replaced by E.init (E is element type).
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 04:33:14 UTC, Jonathan M Davis wrote: If no one has any major objections to this scheme or can provide a better proposal, I'll create a pull request from what I have. No objections. Still would be nice to get your (or community) feedback on my previous proposals to localize custom logic inside ranges which invalidate value returned from front.
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 07:53:32 UTC, Roman D. Boiko wrote: On Wednesday, 4 July 2012 at 07:40:55 UTC, monarch_dodra wrote: I'll propose "spendFront", which is pretty much the same thing, but is less loaded. I'm not entirely sure if it carries the meaning as well? I think it is a worth discussion. I'm not a native speaker, but I feel that "spend" would be a wrong choice. fetchFront was proposed by Dmitry, which I like more than spendFront, but slightly less than consumeFront. Another option might be "readFront". That's somewhere in the middle amond "take", "consume" and "fetch"...
Re: Proposal: takeFront and takeBack
On Wednesday, 4 July 2012 at 07:40:55 UTC, monarch_dodra wrote: I'll propose "spendFront", which is pretty much the same thing, but is less loaded. I'm not entirely sure if it carries the meaning as well? I think it is a worth discussion. I'm not a native speaker, but I feel that "spend" would be a wrong choice. fetchFront was proposed by Dmitry, which I like more than spendFront, but slightly less than consumeFront. By the way, I don't have any associations from the above. Consume is widely used in everyday life, and it should not be a problem that there are various uses. As for introducing a new concept, it looks like that would be necessary any way, because existing concepts correspond to something different :) For example, that was my original motivation to vote against takeFront.
Re: Proposal: takeFront and takeBack
Still, ranges which invalidate front on popFront could defer that (store a flag that popFront or consumeFront has been called and invalidate value previously returned from front in the next call to front). That would be intrusive with respect to such ranges, and a breaking change. But it would simplify client code significantly. (This doesn't mean that I'm insisting, or strongly prefer some of my two ideas, just wanted to clearly restate the second one.)
Re: Proposal: takeFront and takeBack
On Tuesday, 3 July 2012 at 18:12:54 UTC, trav...@phare.normalesup.org (Christophe Travert) wrote: Range have been designed with the idea that front is valid until next call to popFront. If popFront was to be called right after front, then it would just be a popFront that returns a value, and maybe a justPop or something if you don't want to copy the value. It's delicate to come now and decide that if a previously returned front value is no longer valid after a call to popFront, it should be documented by an enum. Who is going to review all libraries (written and to come) to make sure the enum is placed when it should be? The reverse should be done instead: if you want you range to be optimized by calling takeFront, define something (for example... takeFront). Then use algorithms that are specialized for takeFront. takeFront -> consumeFront hasConsume has been proposed by Jonathan already. But not having to modify all existing ranges which invalidate value returned by front might be a good argument against my hack also. However I'm not sure there are really many of such ranges. Those must return something with reference semantics in order to be able to invalidate it. Most common case would be when front value is always stored in the same memory (a buffer). The benefit of my hacking idea is that clients would need no special casing.