Very interesting!

I'm not sure his "hard parts" are really the hardest parts, but I don't
know enough about the MW parser to be sure.

I do hope the parser can be replaced by a C parser!

   Asaf

On Thu, Sep 23, 2010 at 2:44 PM, Manuel Schneider
<[email protected]> wrote:

> Someone created a MediaWiki parser written in C - please see the mail
> below.
>
> Greetings from Linux-Kongress in Nürnberg,
>
> /Manuel
>
> Sent via mobile phone.
>
> -- Original message --
> Subject: [Wikitech-l] Parser implementation for MediaWiki syntax
> From: Andreas Jonsson <[email protected]>
> Date: 23.09.2010 11:28
>
> Hi,
>
> I have written a parser for MediaWiki syntax and have set up a test
> site for it here:
>
> http://libmwparser.kreablo.se/index.php/Libmwparsertest
>
> and the source code is available here:
>
> http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
>
> A preprocessor will take care of parser functions, magic words,
> comment removal, and transclusion.  But as it wasn't possible to
> cleanly separate these functions from the existing preprocessor, some
> preprocessing is disabled at the test site.  It should be
> straightforward to write a new preprocessor that provides only the required
> functionality, however.
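>
> As a rough sketch, a standalone preprocessor could expose an interface
> along these lines (the names here are hypothetical, not the actual
> libmwparser API):
>
>     /* Expand transclusions, parser functions and magic words, and
>      * strip comments, returning a malloc'd buffer that the caller
>      * must free.  fetch_template resolves a template title to its
>      * wikitext source. */
>     typedef char *(*mw_fetch_template_fn)(const char *title, void *data);
>
>     char *mw_preprocess(const char *wikitext,
>                         mw_fetch_template_fn fetch_template,
>                         void *data);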
>
> The parser is not feature complete, but the hard parts are solved.  I
> consider "the hard parts" to be:
>
> * parsing apostrophes
> * parsing html mixed with wikitext
> * parsing headings and links
> * parsing image links
>
> And when I say "solved" I mean that it produces the same or equivalent
> output as the original parser, as long as the original parser's
> behavior is well defined and its output is valid HTML.
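>
> To illustrate why apostrophe parsing in particular is hard, consider
> input like this (made-up examples, not taken from the test site):
>
>     ''italic'', '''bold''', '''''bold italic'''''
>     It can't be '''bold'' -- which apostrophe is text, which is markup?
>
> The original parser resolves such ambiguous runs with heuristics, so a
> replacement has to reproduce those heuristics to yield equivalent
> output.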
>
> Here is a schematic overview of the design:
>
> +-----------------------+
> |                       |              Wikitext
> |  client application   +---------------------------------------+
> |                       |                                       |
> +-----------------------+                                       |
>            ^                                                    |
>            | Event stream                                       |
> +----------+------------+        +-------------------------+    |
> |                       |        |                         |    |
> |    parser context     |<------>|         Parser          |    |
> |                       |        |                         |    |
> +-----------------------+        +-------------------------+    |
>                                               ^                 |
>                                               | Token stream    |
> +-----------------------+        +------------+------------+    |
> |                       |        |                         |    |
> |    lexer context      |<------>|         Lexer           |<---+
> |                       |        |                         |
> +-----------------------+        +-------------------------+
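>
> In code, the event stream in this picture could correspond to a
> callback interface roughly like the following (an illustrative sketch;
> the real libmwparser API may differ):
>
>     #include <stddef.h>
>
>     /* Events delivered to the client application.  The parser
>      * guarantees that every begin_* event is matched by a
>      * corresponding end_* event. */
>     typedef struct mw_listener {
>         void *data;
>         void (*on_text)(void *data, const char *text, size_t len);
>         void (*on_begin_heading)(void *data, int level);
>         void (*on_end_heading)(void *data, int level);
>         void (*on_begin_italic)(void *data);
>         void (*on_end_italic)(void *data);
>     } mw_listener;
>
>     void mw_parse(const char *utf8_wikitext, size_t len,
>                   const mw_listener *listener);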
>
>
> The design is described in more detail in a series of posts on the
> wikitext-l mailing list.  The most important "trick" is to make sure
> that the lexer never produces a spurious token.  An end token for a
> production will not appear unless the corresponding begin token has
> already been produced, and the lexer maintains a block context so that
> it only produces tokens that make sense in the current block.
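>
> Concretely, the lexer context can record which productions are open
> and refuse to emit an end token that has no matching begin token.  A
> minimal sketch of the idea (hypothetical names, italics only):
>
>     #include <stdbool.h>
>
>     typedef enum { TOKEN_BEGIN_ITALIC, TOKEN_END_ITALIC } token;
>
>     typedef struct {
>         bool italic_open;   /* '' seen and not yet closed */
>     } lexer_context;
>
>     /* END_ITALIC is only emitted when a BEGIN_ITALIC is pending, so
>      * the parser downstream always sees balanced tokens. */
>     static token classify_italic_apostrophes(lexer_context *ctx)
>     {
>         if (ctx->italic_open) {
>             ctx->italic_open = false;
>             return TOKEN_END_ITALIC;
>         }
>         ctx->italic_open = true;
>         return TOKEN_BEGIN_ITALIC;
>     }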
>
> I have used Antlr for generating both the parser and the lexer, as
> Antlr supports semantic predicates that can be used for
> context-sensitive parsing.  I am also using a slightly patched version
> of Antlr's C runtime environment, because the lexer needs to support
> speculative execution in order to do context-sensitive lookahead.
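>
> A semantic predicate is a boolean expression that gates a grammar
> rule.  On the C side, the generated lexer ends up calling an ordinary
> predicate function, conceptually like this (an illustrative sketch,
> not generated code):
>
>     #include <stdbool.h>
>
>     typedef enum { BLOCK_PARAGRAPH, BLOCK_LIST, BLOCK_TABLE } block_type;
>
>     /* The block the lexer is currently inside. */
>     typedef struct { block_type block; } block_context;
>
>     /* Consulted before matching a table-cell token, so that '|' is
>      * only tokenized as a cell separator inside a table block. */
>     static bool at_table_cell(const block_context *ctx)
>     {
>         return ctx->block == BLOCK_TABLE;
>     }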
>
> A Swig-generated interface provides the PHP API.  The parser processes
> the buffer of the PHP string directly and writes its output to an
> array of PHP strings.  Only UTF-8 is supported at the moment.
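>
> The C entry point that Swig wraps could look roughly like this (again
> a sketch, not the exact libmwparser signature):
>
>     #include <stddef.h>
>
>     /* Parse a UTF-8 buffer (in practice the buffer of a PHP string)
>      * and hand each output fragment to emit(); the PHP binding
>      * collects the fragments into an array of strings. */
>     int mwparser_parse(const char *utf8, size_t len,
>                        void (*emit)(void *sink,
>                                     const char *fragment, size_t n),
>                        void *sink);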
>
> The performance seems to be about the same as for the original parser
> on plain text.  But with an increasing amount of markup, the original
> parser runs slower.  This new parser implementation maintains roughly
> the same performance regardless of input.
>
> I think that this demonstrates the feasibility of replacing the
> MediaWiki parser.  There is still a lot of work to do to turn it into
> a full replacement, however.
>
> Best regards,
>
> Andreas
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Asaf Bartov <[email protected]>
_______________________________________________
dev-l mailing list
[email protected]
https://intern.openzim.org/mailman/listinfo/dev-l
