On Tue, 22 Mar 2011 18:27:51 -0400, Ilya Pupatenko
<pupate...@gmail.com> wrote:
Hi,
First of all, I want to be polite so I have to introduce myself (you
can skip
this paragraph if you feel tired of newcomer-students’ posts). My name
is
Ilya, I’m a Master student of IT department of Novosibirsk State
University
(Novosibirsk, Russia). In Soviet period Novosibirsk became on of the
most
important science center in the country and now there are very close
relations between University and Academy of Science. That’s why it’s
difficult and very interesting to study here. But I’m not planning to
study
or work this summer, so I’ll be able to work (nearly) full time on GSoC
project. My primary specialization is seismic tomography inverse
problems,
but I’m also interested in programming language implementation and
compilation theory. I have good knowledge of C++ and C# languages and
“intermediate” knowledge of D language, knowledge of compilation
theory, some
experience in implementing lexers, parsers and translators, basic
knowledge
of lex/yacc/antlr and some knowledge of Boost.Spirit library. I’m not
an
expert in D now, but I willing to learn and to solve difficult tasks,
that’s
why I decided to apply on the GSoC.
I’m still working on my proposal (on task “Lexing and Parsing”), but I
want
to write some general ideas and ask some questions.
1. It is said that “it is possible to write a highly-integrated
lexer/perser
generator in D without resorting to additional tools”. As I
understand, the
library should allow programmer to write grammar directly in D
(ideally, the
syntax should be somehow similar to EBNF) and the resulting parser
will be
generated by D compiler while compiling the program. This method allows
integration of parsing in D code; it can make code simpler and even
sometimes
more efficient.
There is a library for C++ (named Boost.Spirit) that follows the same
idea.
It provide (probably not ideal but very nice) “EBNF-like” syntax to
write a
grammar, it’s quite powerful, fast and flexible. There are three parts
in
this library (actually there are 4 parts but we’re not interested in
Spirit.Classic now):
• Spirit.Qi (parser library that allows to build recursive descent
parsers);
• Spirit.Karma (generator library);
• Spirit.Lex (library usable to create tokenizers).
The Spirit library uses “C++ template black magic” heavily (for
example, via
Boost.Fusion). But D has greater metaprogramming abilities, so it is
possible
to implement the same functionality in easier and “clean” way.
So, the question is: is it a good idea if at least parser library
architecture will be somewhat similar to Spirit one? Of course it is
not
about “blind” copying; but creating architecture for such a big system
completely from scratch is quite difficult indeed. If to be exact, I
like an
idea of parser attributes, I like the way semantic actions are
described, and
the “auto-rules” seems really useful.
I'm not qualified to speak on Spirits internal architecture; I've only
used it
once for something very simple and ran into a one-liner bug which
remains
unfixed 7+ years later. But the basic API of Spirit would be wrong for
D. “it
is possible to write a highly-integrated lexer/perser generator in D
without
resorting to additional tools” does not mean "the library should allow
programmer to write grammar directly in D (ideally, the syntax should be
somehow similar to EBNF)" it means that the library should allow you to
write a
grammar in EBNF and then through a combination of templates, string
mixins and
compile-time function evaluation generate the appropriate (hopefully
optimal)
parser. D's compile-time programming abilities are strong enough to do
the code
generation job usually left to separate tools. Ultimately a user of the
library
should be able to declare a parser something like this:
// Declare a parser for Wikipedia's EBNF sample language
Parser!`
(* a simple program syntax in EBNF − Wikipedia *)
program = 'PROGRAM' , white space , identifier , white space ,
'BEGIN' , white space ,
{ assignment , ";" , white space } ,
'END.' ;
identifier = alphabetic character , { alphabetic character | digit } ;
number = [ "-" ] , digit , { digit } ;
string = '"' , { all characters − '"' } , '"' ;
assignment = identifier , ":=" , ( number | identifier | string ) ;
alphabetic character = "A" | "B" | "C" | "D" | "E" | "F" | "G"
| "H" | "I" | "J" | "K" | "L" | "M" | "N"
| "O" | "P" | "Q" | "R" | "S" | "T" | "U"
| "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
white space = ? white space characters ? ;
all characters = ? all visible characters ? ;
` wikiLangParser;