Very interesting! I'm not sure his "hard parts" are really the hardest parts; but I don't know enough about the MW parser to be sure.
I do hope the parser can be replaced by a C parser!

   Asaf

On Thu, Sep 23, 2010 at 2:44 PM, Manuel Schneider <[email protected]> wrote:

> Someone created a MediaWiki parser written in C - please see the mail
> below.
>
> Greetings from Linux-Kongress in Nürnberg,
>
> /Manuel
>
> Sent via mobile phone.
>
> -- Original message --
> Subject: [Wikitech-l] Parser implementaton for MediaWiki syntax
> From: Andreas Jonsson <[email protected]>
> Date: 23.09.2010 11:28
>
> Hi,
>
> I have written a parser for MediaWiki syntax and have set up a test
> site for it here:
>
> http://libmwparser.kreablo.se/index.php/Libmwparsertest
>
> and the source code is available here:
>
> http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
>
> A preprocessor will take care of parser functions, magic words,
> comment removal, and transclusion. But as it wasn't possible to
> cleanly separate these functions from the existing preprocessor, some
> preprocessing is disabled at the test site. It should be
> straightforward to write a new preprocessor that provides only the
> required functionality, however.
>
> The parser is not feature complete, but the hard parts are solved. I
> consider "the hard parts" to be:
>
> * parsing apostrophes
> * parsing HTML mixed with wikitext
> * parsing headings and links
> * parsing image links
>
> And when I say "solved" I mean producing the same or equivalent output
> as the original parser, as long as the behavior of the original parser
> is well defined and produces valid HTML.
>
> Here is a schematic overview of the design:
>
> +-----------------------+
> |                       |                Wikitext
> |  client application   +---------------------------------------+
> |                       |                                       |
> +-----------------------+                                       |
>            ^                                                    |
>            | Event stream                                       |
> +----------+------------+        +-------------------------+    |
> |                       |        |                         |    |
> |    parser context     |<------>|         Parser          |    |
> |                       |        |                         |    |
> +-----------------------+        +-------------------------+    |
>                                               ^                 |
>                                               | Token stream    |
> +-----------------------+        +------------+------------+    |
> |                       |        |                         |    |
> |     lexer context     |<------>|          Lexer          |<---+
> |                       |        |                         |
> +-----------------------+        +-------------------------+
>
>
> The design is described in more detail in a series of posts on the
> wikitext-l mailing list. The most important "trick" is to make sure
> that the lexer never produces a spurious token. An end token for a
> production will not appear unless the corresponding begin token
> has already been produced, and the lexer maintains a block context to
> only produce tokens that make sense in the current block.
>
> I have used Antlr for generating both the parser and the lexer, as
> Antlr supports semantic predicates that can be used for context
> sensitive parsing. Also I am using a slightly patched version of
> Antlr's C runtime environment, because the lexer needs to support
> speculative execution in order to do context sensitive lookahead.
>
> A Swig-generated interface is used for providing the PHP API. The
> parser processes the buffer of the PHP string directly, and writes its
> output to an array of PHP strings. Only UTF-8 is supported at the
> moment.
>
> The performance seems to be about the same as for the original parser
> on plain text. But with an increasing amount of markup, the original
> parser runs slower. This new parser implementation maintains roughly
> the same performance regardless of input.
>
> I think that this demonstrates the feasibility of replacing the
> MediaWiki parser.
> There is still a lot of work to do in order to turn
> it into a full replacement, however.
>
> Best regards,
>
> Andreas
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> _______________________________________________
> dev-l mailing list
> [email protected]
> https://intern.openzim.org/mailman/listinfo/dev-l
>

-- 
Asaf Bartov <[email protected]>
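
For readers who want a concrete picture of the "never produce a spurious token" rule from the quoted mail, here is a rough C sketch. It is not taken from libmwparser, whose lexer is generated by Antlr with semantic predicates. Every name in it (lexer_ctx, try_begin, try_end, the PROD_ constants) is made up for illustration. It also assumes, as a simplification, that an end token is only allowed for the innermost open production.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical production kinds; not the ones used by libmwparser. */
enum production { PROD_BOLD, PROD_ITALIC, PROD_LINK };

#define MAX_DEPTH 32

/* Lexer context: a stack of the productions that are currently open. */
struct lexer_ctx {
    enum production open[MAX_DEPTH];
    size_t depth;
};

static void lexer_ctx_init(struct lexer_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->depth = 0;
}

/* True if a begin token for p has been produced and not yet closed. */
static bool is_open(const struct lexer_ctx *ctx, enum production p)
{
    for (size_t i = 0; i < ctx->depth; i++)
        if (ctx->open[i] == p)
            return true;
    return false;
}

/*
 * Ask permission to emit a begin token. Refused if the production is
 * already open or the stack is full; the caller would then treat the
 * input as plain text instead.
 */
static bool try_begin(struct lexer_ctx *ctx, enum production p)
{
    if (ctx->depth == MAX_DEPTH || is_open(ctx, p))
        return false;
    ctx->open[ctx->depth++] = p;
    return true;
}

/*
 * Ask permission to emit an end token. Refused unless p is the
 * innermost open production (a simplifying assumption), so an end
 * token can never appear without its begin token.
 */
static bool try_end(struct lexer_ctx *ctx, enum production p)
{
    if (ctx->depth == 0 || ctx->open[ctx->depth - 1] != p)
        return false;
    ctx->depth--;
    return true;
}
```

With a check like this in the lexer, the parser only ever sees balanced begin/end pairs, which is presumably what lets the grammar itself stay simple.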
