Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
I've spent a lot of time profiling and optimising the parser in the past. It's a complex process. You can't just look at one number for a large amount of very complex text and conclude that you've found an optimisation target. ...unless it is {{cite}}. Cheers, Domas ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 06/05/11 17:13, Andreas Jonsson wrote: I had not analyzed the parts of the core parser that I consider preprocessing, and it came as a surprise to me that it was as slow as the Barack Obama benchmark shows. But integrating template expansion with the parser would solve this performance problem, and is therefore in itself a strong argument for working towards replacing it. I will write about this on wikitext-l. That benchmark didn't have any templates in it; I expanded them with Special:ExpandTemplates before I started. So it's unlikely that a significant amount of the time was spent in the preprocessor. It was a really quick benchmark, with no profiling or further testing whatsoever. It took a few minutes to do. You shouldn't base architecture decisions on it, it might be totally invalid. It might not be a parser benchmark at all. I might have made some configuration error, causing it to test an unrelated region of the code. All I know is, I sent in wikitext, the CPU usage went to 100% for a while, then HTML came back. I've spent a lot of time profiling and optimising the parser in the past. It's a complex process. You can't just look at one number for a large amount of very complex text and conclude that you've found an optimisation target. -- Tim Starling ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
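The kind of rough end-to-end timing described above can be reproduced against any MediaWiki install through the web API. A minimal sketch, assuming a local wiki at http://localhost/w/api.php and a file of pre-expanded wikitext (both names are placeholders); note that it measures the whole HTTP request, not just parser CPU time:

  <?php
  // Rough, unscientific timing of a single parse via the MediaWiki web API.
  // The API URL and input file are placeholders; adjust for your own setup.
  $api = 'http://localhost/w/api.php';
  $wikitext = file_get_contents( 'obama-expanded.txt' ); // pre-expanded wikitext

  $postdata = http_build_query( array(
      'action' => 'parse',
      'text'   => $wikitext,
      'format' => 'json',
  ) );
  $context = stream_context_create( array( 'http' => array(
      'method'  => 'POST',
      'header'  => 'Content-Type: application/x-www-form-urlencoded',
      'content' => $postdata,
  ) ) );

  $start = microtime( true );
  $result = file_get_contents( $api, false, $context );
  $elapsed = microtime( true ) - $start;

  printf( "Parsed in %.2f seconds, %d bytes returned\n", $elapsed, strlen( $result ) );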
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
2011-05-06 03:27, Andrew Garrett skrev: On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson andreas.jons...@kreablo.se wrote: 2011-05-04 08:13, Tim Starling skrev: On 04/05/11 15:52, Andreas Jonsson wrote: The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected. PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread. http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text; they keep adding references and links and things. So we don't really care about the parser's top speed. We are talking about different things. I don't consider callbacks made when processing magic words or parser functions to be part of the actual parsing. The reference case of no-markup input is interesting to me as it marks the maximum throughput of the MediaWiki parser, and is what you would compare alternative implementations to. But, obviously, if the Barack Obama article takes 22 seconds to render, there are more severe problems than parser performance at the moment. It's a little more complicated than that, and obviously you haven't spent a lot of time looking at profiling output from parsing the Barack Obama article if you say that — what, if not the parser, is slowing down the processing of that article? Consider the following: 1. Many things that you would exclude from parsing, like reference tags and what-not, call the parser themselves. 2. Regardless of whether you include the actual callback in your measurements of parser run time, you need to consider them. Identifying structures that require callbacks, as well as structures that don't (such as links, templates, images, and what not), takes time. While you might reasonably exclude ifexist calls and so on from parser run time, you most certainly cannot reasonably exclude template calls, link processing, nor the extra time taken by the preprocessor to identify such structures. As Domas says, real world data is king. As far as I know, in the case of 'a a a a', even if you repeat it for a few MB, virtually no PHP code is run, because the preprocessor uses strcspn to identify structures requiring preprocessing. That's implemented in C — in fact, for 'a a a' repeated for a few MB, it's my (probably totally wrong) understanding that the PHP code runs in more or less constant time. It's the structures that appear in real articles that make the parser slow. I'm sorry, I misunderstood the original statement that HipHop would make _parsing_ significantly faster and questioned that on false premises, because I'm thinking of the parser and the preprocessor as distinctly different components. Let me explain: as I see it, the first step in formalizing wikitext syntax is to analyze and write a parser that can be used as a drop-in replacement after preprocessing. The stuff that is preprocessed cannot be integrated with the parser without sacrificing compatibility. Preprocessing is problematic.
It breaks the one-to-one relationship between the wikitext and the syntax tree (i.e., it is impossible to serialize a syntax tree back to the same wikitext that generated it). Therefore, in a second step, it should be analyzed how the preprocessed constructions can be integrated with the parser and how to minimize the damage from this change. I had not analyzed the parts of the core parser that I consider preprocessing, and it came as a surprise to me that it was as slow as the Barack Obama benchmark shows. But integrating template expansion with the parser would solve this performance problem, and is therefore in itself a strong argument for working towards replacing it. I will write about this on wikitext-l. Best Regards, Andreas Jonsson ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson andreas.jons...@kreablo.se wrote: 2011-05-04 08:13, Tim Starling skrev: On 04/05/11 15:52, Andreas Jonsson wrote: The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected. PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread. http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text; they keep adding references and links and things. So we don't really care about the parser's top speed. We are talking about different things. I don't consider callbacks made when processing magic words or parser functions to be part of the actual parsing. The reference case of no-markup input is interesting to me as it marks the maximum throughput of the MediaWiki parser, and is what you would compare alternative implementations to. But, obviously, if the Barack Obama article takes 22 seconds to render, there are more severe problems than parser performance at the moment. It's a little more complicated than that, and obviously you haven't spent a lot of time looking at profiling output from parsing the Barack Obama article if you say that — what, if not the parser, is slowing down the processing of that article? Consider the following: 1. Many things that you would exclude from parsing, like reference tags and what-not, call the parser themselves. 2. Regardless of whether you include the actual callback in your measurements of parser run time, you need to consider them. Identifying structures that require callbacks, as well as structures that don't (such as links, templates, images, and what not), takes time. While you might reasonably exclude ifexist calls and so on from parser run time, you most certainly cannot reasonably exclude template calls, link processing, nor the extra time taken by the preprocessor to identify such structures. As Domas says, real world data is king. As far as I know, in the case of 'a a a a', even if you repeat it for a few MB, virtually no PHP code is run, because the preprocessor uses strcspn to identify structures requiring preprocessing. That's implemented in C — in fact, for 'a a a' repeated for a few MB, it's my (probably totally wrong) understanding that the PHP code runs in more or less constant time. It's the structures that appear in real articles that make the parser slow. —Andrew -- Andrew Garrett Wikimedia Foundation agarr...@wikimedia.org ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
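To make the strcspn point concrete, a simplified sketch of the idea (this is not the actual MediaWiki preprocessor code, and the set of special characters below is only a rough approximation): plain text is skipped in a single native call, and PHP-level work happens only at characters that might open a construct.

  <?php
  // Simplified illustration of strcspn-based scanning: plain text is skipped
  // in one native C call, and PHP code only runs at "interesting" characters.
  function scanSpecialChars( $text ) {
      $special = "{}<[\n";   // rough guess at characters that can open a construct
      $pos = 0;
      $len = strlen( $text );
      $hits = array();
      while ( $pos < $len ) {
          // strcspn() is implemented in C: it returns the length of the run of
          // characters, starting at $pos, containing none of $special.
          $run = strcspn( $text, $special, $pos );
          $pos += $run;
          if ( $pos >= $len ) {
              break;
          }
          $hits[] = $pos;   // PHP-level work happens here, once per special character
          $pos++;
      }
      return $hits;
  }

  // A few MB of 'a a a a ' contains no special characters, so the loop body
  // above runs essentially zero times and the cost is one strcspn() call.
  $plain = str_repeat( 'a a a a ', 500000 );
  var_dump( count( scanSpecialChars( $plain ) ) );  // int(0)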
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 04/05/11 15:52, Andreas Jonsson wrote: The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected. PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread. http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text; they keep adding references and links and things. So we don't really care about the parser's top speed. -- Tim Starling ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 05/03/2011 07:45 PM, Ryan Lane wrote: It's slightly more difficult, but it definitely isn't any easier. The point here is that only having one implementation of the parser, which can change at any time, which also defines the spec (and I use the word spec here really loosely), is something that inhibits the ability to share knowledge. I was wondering whether it would be possible to have two-tier parsing? Define what is valid wikitext, express it in BNF, write a parser in C and use it as a PHP extension. If the parser encounters invalid wikitext, enter quirks mode, a.k.a. the current PHP parser. I assume that 90% of wikis' contents would be valid wikitext, and so the speedup should be significant. And if someone needs to reuse the content outside of Wikipedia, they can use 90% of the content very easily, and the rest no harder than right now. The only disadvantage that I see is that every addition to wikitext would have to be implemented in both parsers. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
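A minimal sketch of how the dispatch for such a two-tier scheme could look; strict_parse() and quirks_mode_parse() are invented names for this illustration, not real functions:

  <?php
  // Hypothetical two-tier dispatch: try a strict, grammar-based parser first
  // (imagined here as a C extension exposing strict_parse()), and fall back
  // to the existing quirks-mode PHP parser when the strict one rejects the
  // input. Both function names are placeholders for this sketch.
  function parseTwoTier( $wikitext ) {
      $html = strict_parse( $wikitext );        // hypothetical fast C parser
      if ( $html !== false ) {
          return $html;                         // valid wikitext: the fast path
      }
      return quirks_mode_parse( $wikitext );    // whatever the current parser does
  }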
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
Ohi, The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. Well, you did an edge case - a long line. Actually, try replacing spaces with newlines, and you will get 25x cost difference ;-) But the top speed of the parser (in bytes/seconds) will be largely unaffected. Damn! Domas ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
2011-05-04 08:41, Domas Mituzas skrev: Ohi, The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. Well, you did an edge case - a long line. Actually, try replacing spaces with newlines, and you will get 25x cost difference ;-) A single long line containing no markup is indeed an edge case, but it is a good reference case since it is the input where the parser will run at its fastest. Replacing the spaces with newlines will cause a tenfold increase in the execution time. Sure, in relative terms less time is spent executing regexps, but in absolute terms, more time is spent there. /Andreas

samples  %       app name           symbol name
283      8.6044  libphp5.so         zend_hash_quick_find
188      5.7160  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
177      5.3816  libphp5.so         zend_parse_va_args
165      5.0167  libphp5.so         zend_do_fcall_common_helper_SPEC
160      4.8647  libphp5.so         __i686.get_pc_thunk.bx
131      3.9830  libphp5.so         zend_hash_find
127      3.8614  libphp5.so         _zval_ptr_dtor
87       2.6452  libc-2.11.2.so     memcpy
82       2.4932  libphp5.so         _zend_mm_alloc_canary_int
79       2.4019  libphp5.so         zend_get_hash_value
72       2.1891  libphp5.so         _zend_mm_free_canary_int
59       1.7939  libphp5.so         zend_std_read_property
55       1.6722  libphp5.so         execute
52       1.5810  libphp5.so         suhosin_get_config
51       1.5506  libphp5.so         zend_fetch_property_address_read_helper_SPEC_UNUSED_CONST
48       1.4594  libphp5.so         zendparse

But the top speed of the parser (in bytes/second) will be largely unaffected. Damn! Domas ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
Hi! A single long line containing no markup is indeed an edge case, but it is a good reference case since it is the input where the parser will run at its fastest. Bubblesort will also have O(N) complexity sometimes :-) Replacing the spaces with newlines will cause a tenfold increase in the execution time. Sure, in relative terms less time is spent executing regexps, but in absolute terms, more time is spent there. Well, this is not fair - you should sum up all Zend symbols if you compare that way - there are no debugging symbols for libpcre, so you get an aggregated view. That's the same as saying that 10 is a smaller number than 7, just because you can factorize it ;-) Comparing apples and oranges doesn't always help; that kind of hand waving may impress others, but some have spent more time looking at that data than just for ranting in a single mailing list thread ;-) Cheers, Domas ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
- Original Message - From: Nikola Smolenski smole...@eunet.rs I was thinking whether it would be possible to have two-tier parsing? Define what is valid wikitext, express it in BNF, write a parser in C and use it as a PHP extension. If the parser encounters invalid wikitext, enter the quirks mode AKA the current PHP parser. I assume that 90% of wikis' contents would be valid wikitext, and so the speedup should be significant. And if someone needs to reuse the content outside of Wikipedia, they can use 90% of the content very easily, and the rest not harder than right now. Yeah, I made this suggestion, oh, 2 or 3 years ago... and I was never able to get the acceptable percentage down below 100.0%. Cheers, -- jra ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
* Daniel Friesen li...@nadir-seen-fire.com [Tue, 03 May 2011 21:07:07 -0700]: Naturally of course if it's a C library you can build at least an extension/plugin for a number of languages. You would of course have to install the ext/plug though so it's not a shared-hosting thing. ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name] The latest generation of browser JavaScript implementations is generally faster than Zend PHP; not sure about HipHop. Maybe client-side parsing could reduce server load as well. JavaScript is also an extremely popular and widespread language, and it has good prospects on the server side as well. Dmitriy ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
2011-05-03 02:38, Chad skrev: On Mon, May 2, 2011 at 8:28 PM, Tim Starling tstarl...@wikimedia.org wrote: I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face. Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? People want to write their own parsers because they don't want to use PHP. They want to parse in C, Java, Ruby, Python, Perl, Assembly and every other language other than the one that it wasn't written in. There's this, IMHO, misplaced belief that standardizing the parser or markup would put us in a world of unicorns and rainbows where people can write their own parsers on a whim, just because they can. Other than making it easier to integrate with my project, I don't see a need for them either (and tbh, the endless discussions grow tedious). My motivation for attacking the task of creating a wikitext parser is, aside from it being an interesting problem, a genuine concern for the fact that such a large body of data is encoded in such a vaguely specified format. I don't see any problem with keeping the parser in PHP, and as you point out, with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate. But most of the parser's work consists of running regexp pattern matching over the article text, doesn't it? Regexp pattern matching is implemented by native functions. Does the Zend engine have a slow regexp implementation? I would have guessed that the main reason that the parser is slow is the algorithm, not its implementation. Best Regards, Andreas Jonsson ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
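A crude way to get a feel for the question is to compare one native regexp pass against equivalent work done in PHP userland. The sketch below is illustrative only (it is not the real parser, and the absolute numbers will vary by machine and PHP version):

  <?php
  // Crude comparison of native regexp matching versus doing equivalent work
  // in a PHP-level loop, to get a feel for where time goes. Not the real parser.
  $text = str_repeat( "some plain text with [[a link]] and more text\n", 20000 );

  // 1. Let PCRE (C code) find the links.
  $start = microtime( true );
  preg_match_all( '/\[\[([^]]*)\]\]/', $text, $matches );
  $regexTime = microtime( true ) - $start;

  // 2. Find the same links by scanning character by character in PHP userland.
  $start = microtime( true );
  $links = array();
  $len = strlen( $text );
  for ( $i = 0; $i < $len - 1; $i++ ) {
      if ( $text[$i] === '[' && $text[$i + 1] === '[' ) {
          $close = strpos( $text, ']]', $i + 2 );
          if ( $close === false ) {
              break;
          }
          $links[] = substr( $text, $i + 2, $close - $i - 2 );
          $i = $close + 1;
      }
  }
  $loopTime = microtime( true ) - $start;

  printf( "regexp: %.4fs (%d links), PHP loop: %.4fs (%d links)\n",
      $regexTime, count( $matches[1] ), $loopTime, count( $links ) );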
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 11-05-03 03:40 AM, Andreas Jonsson wrote: 2011-05-03 02:38, Chad skrev: [...] I don't see any problem with keeping the parser in PHP, and as you point out, with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate. But most of the parser's work consists of running regexp pattern matching over the article text, doesn't it? Regexp pattern matching is implemented by native functions. Does the Zend engine have a slow regexp implementation? I would have guessed that the main reason that the parser is slow is the algorithm, not its implementation. Best Regards, Andreas Jonsson Regexps might be fast, but when you have to run hundreds of them all over the place and do stuff in-language, the language becomes the bottleneck. -- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name] ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
- Original Message - From: Andreas Jonsson andreas.jons...@kreablo.se Subject: Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?) My motivation for attacking the task of creating a wikitext parser is, aside from it being an interesting problem, a genuine concern for the fact that such a large body of data is encoded in such a vaguely specified format. Correct: Until you have (at least) two independently written parsers, both of which pass a test suite 100%, you don't have a *spec*. Or more to the point, it's unclear whether the spec or the code rules, which can get nasty. Cheers, -- jra ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
Tim Starling wrote: Another goal beyond editing itself is normalizing the world of 'alternate parsers'. There've been several announced recently, and we've got such a large array now of them available, all a little different. We even use mwlib ourselves in the PDF/ODF export deployment, and while we don't maintain that engine we need to coordinate a little with the people who do so that new extensions and structures get handled. I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face. Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. It's unambiguously a fundamental goal that content on Wikimedia wikis be able to be easily redistributed, shared, and spread. A wikisyntax that's impossible to adequately parse in other environments (or in Wikimedia's environment, for that matter) is a critical and serious inhibitor to this goal. MZMcBride ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote: I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. And how is using the parser with HipHop going to be any more difficult than using it with Zend? -Chad ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
Chad wrote: On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote: I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. And how is using the parser with HipHop going to be any more difficult than using it with Zend? The point is that the wikitext and its parsing should be completely separate from MediaWiki/PHP/HipHop/Zend. I think some of the bigger picture is getting lost here. Wikimedia produces XML dumps that contain wikitext. For most people, this is the only way to obtain and reuse large amounts of content from Wikimedia wikis (especially as the HTML dumps haven't been re-created since 2008). There needs to be a way for others to be able to very easily deal with this content. Many people have suggested (with good reason) that this means that wikitext parsing needs to be reproducible in other programming languages. While HipHop may be the best thing since sliced bread, I've yet to see anyone put forward a compelling reason that the current state of affairs is acceptable. Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't overcome the legitimate issues that re-users have (such as programming in a language other than PHP, banish the thought). For me, the idea that all that's needed is a faster parser in PHP is a complete non-starter. MZMcBride ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Tue, May 3, 2011 at 10:25 AM, Chad innocentkil...@gmail.com wrote: On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote: I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. And how is using the parser with HipHop going to be any more difficult than using it with Zend? It's slightly more difficult, but it definitely isn't any easier. The point here is that only having one implementation of the parser, which can change at any time, which also defines the spec (and I use the word spec here really loosely), is something that inhibits the ability to share knowledge. Requiring people use our PHP implementation, whether or not it is compiled to C is ludicrous. - Ryan ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
It's slightly more difficult, but it definitely isn't any easier It is much easier to embed it in other languages, once you get shared object with Parser methods exposed ;-) Domas ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
It is much easier to embed it in other languages, once you get shared object with Parser methods exposed ;-) Which would also require the linking application to be GPL licensed, which is less than ideal. We shouldn't limit the licensing of applications that want to write wikitext. An alternative implementation can be licensed in any way the author sees fit. - Ryan ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Tue, May 3, 2011 at 10:48 AM, Domas Mituzas midom.li...@gmail.com wrote: It's slightly more difficult, but it definitely isn't any easier It is much easier to embed it in other languages, once you get shared object with Parser methods exposed ;-) Building it with HipHop will be harder -- but that's something that can be packaged. However, I strongly agree that having only a poorly-specified single-implementation markup language for all of Wikipedia Wikimedia's redistributable data is **not where we want to be** long term. And even if the PHP-based parser is callable from elsewhere, it's not going to be a good, convenient fit for every potential user. It's still worthwhile to hammer out clearer, more consistent document formats for the future, so that other people doing other things that we aren't even thinking of have the flexibility to do those things however they'll need to. -- brion ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 03/05/11 19:44, MZMcBride wrote: Chad wrote: On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote: I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. And how is using the parser with HipHop going to be any more difficult than using it with Zend? The point is that the wikitext and its parsing should be completely separate from MediaWiki/PHP/HipHop/Zend. I think some of the bigger picture is getting lost here. Wikimedia produces XML dumps that contain wikitext. For most people, this is the only way to obtain and reuse large amounts of content from Wikimedia wikis (especially as the HTML dumps haven't been re-created since 2008). There needs to be a way for others to be able to very easily deal with this content. Many people have suggested (with good reason) that this means that wikitext parsing needs to be reproducible in other programming languages. While HipHop may be the best thing since sliced bread, I've yet to see anyone put forward a compelling reason that the current state of affairs is acceptable. Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't overcome the legitimate issues that re-users have (such as programming in a language other than PHP, banish the thought). For me, the idea that all that's needed is a faster parser in PHP is a complete non-starter. MZMcBride I agree completely. I think it cannot be emphasized enough that what's valuable about Wikipedia and other similar wikis is the hard-won _content_, not the software used to write and display it at any given time, which is merely a means to that end. Fashions in programming languages and data formats come and go, but the person-centuries of writing effort already embodied in MediaWiki's wikitext format needs to have a much longer lifespan: having a well-defined syntax for its current wikitext format will allow the content itself to continue to be maintained for the long term, beyond the restrictions of its current software or encoding format. -- Neil ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 05/03/2011 08:28 PM, Neil Harris wrote: On 03/05/11 19:44, MZMcBride wrote: ... The point is that the wikitext and its parsing should be completely separate from MediaWiki/PHP/HipHop/Zend. I think some of the bigger picture is getting lost here. Wikimedia produces XML dumps that contain wikitext. For most people, this is the only way to obtain and reuse large amounts of content from Wikimedia wikis (especially as the HTML dumps haven't been re-created since 2008). There needs to be a way for others to be able to very easily deal with this content. Many people have suggested (with good reason) that this means that wikitext parsing needs to be reproducible in other programming languages. While HipHop may be the best thing since sliced bread, I've yet to see anyone put forward a compelling reason that the current state of affairs is acceptable. Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't overcome the legitimate issues that re-users have (such as programming in a language other than PHP, banish the thought). For me, the idea that all that's needed is a faster parser in PHP is a complete non-starter. MZMcBride I agree completely. I think it cannot be emphasized enough that what's valuable about Wikipedia and other similar wikis is the hard-won _content_, not the software used to write and display it at any given time, which is merely a means to that end. Fashions in programming languages and data formats come and go, but the person-centuries of writing effort already embodied in MediaWiki's wikitext format needs to have a much longer lifespan: having a well-defined syntax for its current wikitext format will allow the content itself to continue to be maintained for the long term, beyond the restrictions of its current software or encoding format. -- Neil +1 to both MZMcBride and Neil. So relieved to see things put so eloquently. Dirk -- Website: http://dirkriehle.com - Twitter: @dirkriehle Ph (DE): +49-157-8153-4150 - Ph (US): +1-650-450-8550 ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 5/2/11 5:28 PM, Tim Starling wrote: How many wikitext parsers does the world really need? That's a tricky question. What MediaWiki calls parsing, the rest of the world calls 1. Parsing 2. Expansion (i.e. templates, magic) 3. Applying local state, preferences, context (i.e. $n, prefs) 4. Emitting And phases 2 and 3 depend heavily upon the state of the local wiki at the time the parse is requested. If you've ever tried to set up a test wiki that works like Wikipedia or Wikimedia Commons you'll know what I'm talking about. As for whether the rest of the world needs another wikitext parser: well, they keep writing them, so there must be some reason why this keeps happening. It's true that language chauvinism plays a part, but the inflexibility of the current approach is probably a big factor as well. The current system mashes parsing and emitting to HTML together, very intimately, and a lot of people would like those to be separate. - if they're doing research or stats, and want a more pure, more normalized form than HTML or Wikitext. - if they're Google, and they want to get all the city infobox data and reuse it (this is a real request we've gotten) - if they're OpenStreetMaps, and the same thing; - if they're emitting to a different format (PDF, LaTeX, books); - if they're emitting to HTML but with different needs (like mobile); And then there's the stuff which you didn't know you wanted, but which becomes easy once you have a more flexible parser. A couple of months ago I wrote a mini PEG-based wikitext parser in JavaScript, that Special:UploadWizard is using, today, live on Commons. http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/resources/mediawiki.language.parser.js?view=markup http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/resources/mediawiki.language.parser.peg?view=markup While it was a bit of a heavy download (7K compressed) this gave me the ability to do pluralizations in the frontend (e.g. 3 out of 5 uploads complete) even for difficult languages like Arabic. Great! But the unexpected benefit was that it also made it a snap to add very complicated interface behaviour to our message strings. Actually, right now, with this library + the ingenious way that wikitext does i18n, we may have one of the best libraries out there for internationalized user interfaces. I'm considering splitting it off; it could be useful for any project that used translatewiki. But I don't actually want to use JavaScript for anything but the final rendering stages (I'd rather move most of this parser to PHP) so stay tuned. Anyway, I think it's obviously possible for us to do some RTE, and some of this stuff, with the current parser. But I'm optimistic that a new parsing strategy will be a huge benefit to our community, and our partners, and partners we didn't even know we could have. Imagine doing RTE with an implementation in a JS frontend, that is generated from some of the same sources that the PHP backend uses. For what it's worth: whenever I meet with Wikia employees the topic is always about what MediaWiki and the WMF can do to make their RTE hacks obsolete. That doesn't mean that their RTE isn't the right way forward, but the people who wrote it don't seem to be very strong advocates for it. But I don't want to put words in their mouth; maybe one of them can add more to this thread? -- Neil Kandalgaonkar ne...@wikimedia.org ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
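For readers unfamiliar with the message syntax being parsed here: interface messages carry {{PLURAL:...}} switches that are resolved against their numeric parameters. The library described above is JavaScript; the toy below sketches the same idea in PHP, with English-only plural rules and an invented function name, purely to show the shape of the data:

  <?php
  // Toy illustration of the {{PLURAL:...}} convention used in MediaWiki
  // interface messages. Real plural rules are per-language and far richer
  // than this English-only example; the function name is invented.
  function expandPluralToy( $message, array $params ) {
      // Substitute $1, $2, ... first.
      foreach ( $params as $i => $value ) {
          $message = str_replace( '$' . ( $i + 1 ), $value, $message );
      }
      // Then resolve {{PLURAL:count|singular|plural}}.
      return preg_replace_callback(
          '/\{\{PLURAL:(\d+)\|([^|}]*)\|([^|}]*)\}\}/',
          function ( $m ) {
              return (int)$m[1] === 1 ? $m[2] : $m[3];
          },
          $message
      );
  }

  // Prints "3 out of 5 uploads complete"
  echo expandPluralToy( '$1 out of $2 {{PLURAL:$1|upload|uploads}} complete', array( 3, 5 ) );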
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
- Original Message - From: MZMcBride z...@mzmcbride.com Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. I'm fairly certain that his intention was If the parser is HipHop compliant, then the performance improvements that it will realize for those who need them will obviate the need to rewrite the parser in anything, while those who run small enough wikiae not to care, won't need to care. That does *not*, of course, answer the if you don't have more than one compliant parser, then the code is part of your formal spec, and you *will* get bitten eventually problem. Of course, MediaWiki's parser has *three* specs: whatever formal one has been ginned up, finally; the code; *and* 8 or 9 GB of MWtext on the Wikipedias. Cheers, -- jra ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 11-05-03 08:46 PM, Jay Ashworth wrote: - Original Message - From: MZMcBride z...@mzmcbride.com Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so. I'm fairly certain that his intention was If the parser is HipHop compliant, then the performance improvements that will realize for those who need them will obviate the need to rewrite the parser in anything, while those who run small enough wikiae not to care, won't need to care. That does *not*, of course, answer the if you don't have more than one compliant parser, then the code is part of your formal spec, and you *will* get bitten eventually problem. Of course, Mediawiki's parser has *three* specs: whatever formal one has been ginned up, finally; the code; *and* 8 or 9 GB of MWtext on the Wikipedias. Cheers, -- jra I'm fairly certain myself that his intention was With HipHop support since the C that HipHop compiles PHP to can be extracted and re-used we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and what not out of the php version of the parser. And because HipHop has better performance we will no longer have to worry about parser abstractions slowing down the parser and as a result increasing the load on large websites like Wikipedia where they are noticeable. So that won't be in the way of adding those abstractions anymore. Naturally of course if it's a C library you can build at least an extension/plugin for a number of languages. You would of course have to install the ext/plug though so it's not a shared-hosting thing. ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name] ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
- Original Message - From: Daniel Friesen li...@nadir-seen-fire.com I'm fairly certain myself that his intention was With HipHop support since the C that HipHop compiles PHP to can be extracted and re-used we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and what not out of the php version of the parser. And because HipHop has better performance we will no longer have to worry about parser abstractions slowing down the parser and as a result increasing the load on large websites like Wikipedia where they are noticeable. So that won't be in the way of adding those abstractions anymore. What I get for not paying any attention to Facebook Engineering. *That's* what HipHop does? Naturally of course if it's a C library you can build at least an extension/plugin for a number of languages. You would of course have to install the ext/plug though so it's not a shared-hosting thing. True. But that's still a derivative work. And from experience, I can tell you that you *don't* want to work with the *output* of a code generator/cross-compiler. Cheers, -- jra ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
- Original Message - From: Tim Starling tstarl...@wikimedia.org I wasn't saying that the current MediaWiki parser is suitable for reuse, I was saying that it may be possible to develop the MediaWiki parser into something which is reusable. Aren't there a couple of parsers already which claim 99% compliance or better? Did anything ever come of trying to assemble a validation suite, All Those Years Ago? Or, alternatively, deciding how many pages it's acceptable to break in the definition of a formal spec? Cheers, -- jra ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 04/05/11 14:07, Daniel Friesen wrote: I'm fairly certain myself that his intention was With HipHop support since the C that HipHop compiles PHP to can be extracted and re-used we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and what not out of the php version of the parser. And because HipHop has better performance we will no longer have to worry about parser abstractions slowing down the parser and as a result increasing the load on large websites like Wikipedia where they are noticeable. So that won't be in the way of adding those abstractions anymore. Yes that's right, more or less. HipHop generates C++ rather than C though. Basically you would split the parser into several objects: * A parser in the traditional sense. * An output callback object, which would handle generation of HTML or PDF or syntax trees or whatever. * A wiki environment interface object, which would handle link existence checks, template fetching, etc. Then you would use HipHop to compile: * The new parser class. * A few useful output classes, such as HTML. * A stub environment class which has no dependencies on the rest of MediaWiki. Then to top it off, you would add: * A HipHop extension which provides output and environment classes which pass their calls through to C-style function pointers. * A stable C ABI interface to the C++ library. * Interfaces between various high level languages and the new C library, such as Python, Ruby and Zend PHP. Doing this would leverage the MediaWiki development community and the existing PHP codebase to provide a well-maintained, reusable reference parser for MediaWiki wikitext. -- Tim Starling ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
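A rough sketch of the separation being described, with invented interface names; this is an illustration of the shape of the split, not proposed code:

  <?php
  // Invented interface names, purely to illustrate the proposed separation
  // between parsing, output generation and the wiki environment.
  interface WikiEnvironment {
      function pageExists( $title );        // link existence checks
      function fetchTemplate( $title );     // returns wikitext or null
  }

  interface ParserOutputSink {
      function beginDocument();
      function heading( $level, $text );
      function paragraph( $text );
      function internalLink( $target, $label );
      function endDocument();
      function getResult();                 // HTML string, PDF blob, syntax tree, ...
  }

  class StandaloneEnvironment implements WikiEnvironment {
      // Stub with no MediaWiki dependencies: every page "exists",
      // and no templates are available.
      function pageExists( $title ) { return true; }
      function fetchTemplate( $title ) { return null; }
  }

  // The parser itself would then be constructed against the two interfaces:
  //   $parser = new WikitextParser( new StandaloneEnvironment(), new HtmlSink() );
  //   $html = $parser->parse( $wikitext );
  // A HipHop-compiled build could route both interfaces through C-style
  // function pointers, as described above.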
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 4 May 2011 15:16, Tim Starling tstarl...@wikimedia.org wrote: On 04/05/11 14:07, Daniel Friesen wrote: I'm fairly certain myself that his intention was With HipHop support since the C that HipHop compiles PHP to can be extracted and re-used we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and what not out of the php version of the parser. And because HipHop has better performance we will no longer have to worry about parser abstractions slowing down the parser and as a result increasing the load on large websites like Wikipedia where they are noticeable. So that won't be in the way of adding those abstractions anymore. Yes that's right, more or less. HipHop generates C++ rather than C though. Basically you would split the parser into several objects: * A parser in the traditional sense. * An output callback object, which would handle generation of HTML or PDF or syntax trees or whatever. * A wiki environment interface object, which would handle link existence checks, template fetching, etc. Then you would use HipHop to compile: * The new parser class. * A few useful output classes, such as HTML. * A stub environment class which has no dependencies on the rest of MediaWiki. Then to top it off, you would add: * A HipHop extension which provides output and environment classes which pass their calls through to C-style function pointers. * A stable C ABI interface to the C++ library. * Interfaces between various high level languages and the new C library, such as Python, Ruby and Zend PHP. Doing this would leverage the MediaWiki development community and the existing PHP codebase to provide a well-maintained, reusable reference parser for MediaWiki wikitext. +1 This is the single most exciting news on the MediaWiki front since I started contributing to Wiktionary nine years ago (-: Andrew Dunbar (hippietrail) -- Tim Starling ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
2011-05-03 13:25, Daniel Friesen skrev: On 11-05-03 03:40 AM, Andreas Jonsson wrote: 2011-05-03 02:38, Chad skrev: [...] I don't see any problem with keeping the parser in PHP, and as you point out, with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate. But most of the parser's work consists of running regexp pattern matching over the article text, doesn't it? Regexp pattern matching is implemented by native functions. Does the Zend engine have a slow regexp implementation? I would have guessed that the main reason that the parser is slow is the algorithm, not its implementation. Best Regards, Andreas Jonsson Regexps might be fast, but when you have to run hundreds of them all over the place and do stuff in-language, the language becomes the bottleneck. The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected. /Andreas ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it or what else do we need? We want something simpler and easier to use. That is not what Wikia has. I could hardly stand trying it out for a few minutes. Fred ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
Magnus Manske wrote: So, why not use my WYSIFTW approach? It will only parse the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough). Magnus Crazy idea: What if it was an /extensible/ editor? You could add later a module for enable lists, or enable graphic ref, but also instruct it on how to present to the user some crazy template with a dozen parameters... ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 05/02/11 15:30, wikitech-l-requ...@lists.wikimedia.org wrote: Date: Tue, 03 May 2011 00:29:51 +0200 From: Platonides platoni...@gmail.com Subject: Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?) Magnus Manske wrote: So, why not use my WYSIFTW approach? It will only parse the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough). Magnus Crazy idea: What if it was an /extensible/ editor? You could add later a module for enable lists, or enable graphic ref, but also instruct it on how to present to the user some crazy template with a dozen parameters... Seems like it will need to be extensible, to allow authors of MW extensions to add support for cases where they've changed the parser's behavior? ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Mon, May 2, 2011 at 3:29 PM, Platonides platoni...@gmail.com wrote: Magnus Manske wrote: So, why not use my WYSIFTW approach? It will only parse the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough). Magnus Crazy idea: What if it was an /extensible/ editor? You could add later a module for enable lists, or enable graphic ref, but also instruct it on how to present to the user some crazy template with a dozen parameters... Generically a nice idea. Specific to Wikipedia / WMF projects - all the extensions you might consider adding are pretty much required for our internal uptake of the tool, as our pages are the biggest / oldest / crustiest ones likely to have to be managed... -- -george william herbert george.herb...@gmail.com ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 03/05/11 04:25, Brion Vibber wrote: The most fundamental problem with Wikia's editor remains its fallback behavior when some structure is unsupported: "Source mode required: Rich text editing has been disabled because the page contains complex code." I don't think that's a fundamental problem; I think it's a quick hack added to reduce the development time devoted to rare wikitext constructs, while maintaining round-trip safety. Like you said further down in your post, it can be handled more elegantly by replacing the complex code with a placeholder. Why not just do that? CKEditor makes adding such placeholders really easy. The RTE source has a long list of such client-side modules, added to support various Wikia extensions. Here's an example of unsupported code, the presence of which makes a page permanently uneditable by the rich editor until it's removed: <table> <tr><td>a</td></tr> </table> You can try this out now at http://communitytest.wikia.com/ Works for me. http://communitytest.wikia.com/wiki/Brion%27s_table Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it or what else do we need? I wrote up some notes a few weeks ago, which need some more collation and updating from the preliminary experiments I'm doing, and I would strongly appreciate more feedback from you, Tim, and from everyone else who's been poking about in parser editing land: http://www.mediawiki.org/wiki/Wikitext.next Some people in this thread have expressed concerns about the tiny breakages in wikitext backwards compatibility introduced by RTE, despite the fact that RTE has aimed for, and largely achieved, precise backwards compatibility with legacy wikitext. I find it hard to believe that those people would be comfortable with a project which has as its goal a broad reform of wikitext syntax. Perhaps there are good arguments for wikitext syntax reform, but I have trouble believing that WYSIWYG support is one of them, since the problem appears to have been solved already by RTE, without any reform. Another goal beyond editing itself is normalizing the world of 'alternate parsers'. There've been several announced recently, and we've got such a large array now of them available, all a little different. We even use mwlib ourselves in the PDF/ODF export deployment, and while we don't maintain that engine we need to coordinate a little with the people who do so that new extensions and structures get handled. I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face. Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? -- Tim Starling ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
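A sketch of the placeholder approach mentioned above (illustrative PHP, not Wikia's RTE code; the marker format is invented): unsupported spans are stashed and swapped for opaque markers before editing, then restored byte for byte on save, so the round trip cannot damage them.

  <?php
  // Illustration of round-trip-safe placeholders for constructs the rich
  // editor does not understand. Not Wikia's code; the marker format is invented.
  function extractUnsupported( $wikitext, array $patterns, array &$stash ) {
      foreach ( $patterns as $pattern ) {
          $wikitext = preg_replace_callback( $pattern,
              function ( $m ) use ( &$stash ) {
                  $id = count( $stash );
                  $stash[$id] = $m[0];                // keep the original bytes
                  return "\x7fPLACEHOLDER-$id\x7f";   // opaque marker for the editor
              },
              $wikitext
          );
      }
      return $wikitext;
  }

  function restoreUnsupported( $wikitext, array $stash ) {
      return preg_replace_callback( '/\x7fPLACEHOLDER-(\d+)\x7f/',
          function ( $m ) use ( $stash ) {
              return $stash[(int)$m[1]];              // original span, byte for byte
          },
          $wikitext
      );
  }

  $stash = array();
  $patterns = array( '/<table\b.*?<\/table>/s' );     // e.g. raw HTML tables
  $editable = extractUnsupported( "before\n<table>\n<tr><td>a</td></tr>\n</table>\nafter", $patterns, $stash );
  // ... rich editing happens on $editable; the markers are left untouched ...
  $saved = restoreUnsupported( $editable, $stash );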
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Mon, May 2, 2011 at 8:28 PM, Tim Starling tstarl...@wikimedia.org wrote: I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face. Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? People want to write their own parsers because they don't want to use PHP. They want to parse in C, Java, Ruby, Python, Perl, Assembly and every other language other than the one that it wasn't written in. There's this, IMHO, misplaced belief that standardizing the parser or markup would put us in a world of unicorns and rainbows where people can write their own parsers on a whim, just because they can. Other than making it easier to integrate with my project, I don't see a need for them either (and tbh, the endless discussions grow tedious). I don't see any problem with keeping the parser in PHP, and as you point out with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate. -Chad ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Mon, May 2, 2011 at 8:38 PM, Chad innocentkil...@gmail.com wrote: People want to write their own parsers because they don't want to use PHP. They want to parse in C, Java, Ruby, Python, Perl, Assembly and every other language other than the one that it wasn't written in. s/wasn't/was/ -Chad ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Mon, May 2, 2011 at 5:28 PM, Tim Starling tstarl...@wikimedia.org wrote: On 03/05/11 04:25, Brion Vibber wrote: The most fundamental problem with Wikia's editor remains its fallback behavior when some structure is unsupported: "Source mode required. Rich text editing has been disabled because the page contains complex code." I don't think that's a fundamental problem, I think it's a quick hack added to reduce the development time devoted to rare wikitext constructs, while maintaining round-trip safety. Like you said further down in your post, it can be handled more elegantly by replacing the complex code with a placeholder. Why not just do that? Excellent question -- how hard would it be to change that? I'm fairly sure that's easier to do with an abstract parse tree generated from source (don't recognize it? stash it in a dedicated blob); I worry it may be harder trying to stash that into the middle of a multi-level HTML translation engine that wasn't meant to be reversible in the first place (do we even know if there's an opportunity to recognize the problem component within the annotated HTML or not? Is it seeing things it doesn't recognize in the HTML, or is it seeing certain structures in the source and aborting before it even gets there?). Like many such things, this might be better resolved by trying it and seeing what happens -- I don't want us to lock into a strategy too early when a lot of ideas are still unresolved. I'm very interested in making experimentation easy; for my pre-exploratory work I'm stashing things into a gadget which adds render/parse tree/inspector modes to the editing page: http://www.mediawiki.org/wiki/File:Parser_Playground_demo.png (screenshot links) I've got this set up as a gadget on mediawiki.org now and as a user script on en.wikipedia.org (loaded on User:Brion_VIBBER/vector.js) just for tossing random pages in and getting a better sense of how things break down. Currently parser variant choices are: * the actual MediaWiki parser via API (parse tree shows the preprocessor XML; side-by-side mode doesn't have a working inspector mode though) * a really crappy FakeParser class I threw together, able to handle only a few constructs. Generates a JSON parse tree, and the inspector mode can match up nodes in side-by-side view of the tree HTML. * PegParser using the peg.js parser generator to build the source-tree parser, and the same tree-html and tree-source round-trip functions as FakeParser. The peg source can be edited and rerun to regen the new parse tree. It's fun! These are a long way off from the level of experimental support we're going to want, but I think people are going to benefit from trying a few different things and getting a better feel for how source, parse trees, and resulting HTML really will look. (Template expansion isn't yet presented in this system, and that's going to be where the real fun is. ;) Some people in this thread have expressed concerns about the tiny breakages in wikitext backwards compatibility introduced by RTE, despite the fact that RTE has aimed for, and largely achieved, precise backwards compatibility with legacy wikitext. I find it hard to believe that those people would be comfortable with a project which has as its goal a broad reform of wikitext syntax. Perhaps there are good arguments for wikitext syntax reform, but I have trouble believing that WYSIWYG support is one of them, since the problem appears to have been solved already by RTE, without any reform. 
Well, Wikia's RTE still doesn't work on high-profile Wikipedia article pages, so that remains unproven... That said, an RTE that doesn't require changing core parser behavior, yet *WILL BE A HUGE BENEFIT* in getting it into use sooner, still leaves future reform efforts open. I'm *VERY OPEN* to the notion of doing the RTE using either a supplementary source-level parser (which doesn't have to render all structures 100% the same as the core parser, but *needs* to always create sensible structures that are useful for editors and can round-trip cleanly) or an alternate version of the core parser with annotations and limited transformations (e.g. how we don't strip comments out when producing editable source, so we need to keep them in the output in some way if it's going to be fed into an HTML-ish editing view). A supplementary parser that deals with all your editing fun, but doesn't play super nice with open...close templates, is probably just fine for a huge number of purposes. Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need? I'm not convinced that a giant blob of MediaWiki is suitable as a reusable library, but would love to see it tried. -- brion ___
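A small sketch of the comment-preservation point: an editing-oriented parse can keep comments (and exact source text generally) as explicit tree nodes instead of stripping them, so tree-to-source is lossless while the rendered HTML still hides them. This is PHP purely for illustration with an invented node shape; it is not the FakeParser/PegParser code from the gadget mentioned above, which is JavaScript.

    <?php
    // Sketch only: an editing-oriented parse that keeps HTML comments as explicit
    // nodes. The node shape ('type'/'src') is invented for this example.

    function sourceToTree( $src ) {
        $parts = preg_split( '/(<!--.*?-->)/s', $src, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
        $tree = array();
        foreach ( $parts as $part ) {
            $tree[] = ( substr( $part, 0, 4 ) === '<!--' )
                ? array( 'type' => 'comment', 'src' => $part )
                : array( 'type' => 'text',    'src' => $part );
        }
        return $tree;
    }

    // Rendering view: comment nodes produce no visible output here, but an editing
    // view could emit a marker element for them instead of dropping them entirely.
    function treeToHtml( array $tree ) {
        $html = '';
        foreach ( $tree as $node ) {
            if ( $node['type'] === 'text' ) {
                $html .= nl2br( htmlspecialchars( $node['src'] ) );
            }
        }
        return $html;
    }

    // Because every node keeps its exact source, serialising back is lossless.
    function treeToSource( array $tree ) {
        $out = '';
        foreach ( $tree as $node ) {
            $out .= $node['src'];
        }
        return $out;
    }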
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On Mon, May 2, 2011 at 5:55 PM, Brion Vibber br...@pobox.com wrote: On Mon, May 2, 2011 at 5:28 PM, Tim Starling tstarl...@wikimedia.org wrote: I don't think that's a fundamental problem, I think it's a quick hack added to reduce the development time devoted to rare wikitext constructs, while maintaining round-trip safety. Like you said further down in your post, it can be handled more elegantly by replacing the complex code with a placeholder. Why not just do that? Excellent question -- how hard would it be to change that? I'm fairly sure that's easier to do with an abstract parse tree generated from source (don't recognize it? stash it in a dedicated blob); I worry it may be harder trying to stash that into the middle of a multi-level HTML translation engine that wasn't meant to be reversible in the first place (do we even know if there's an opportunity to recognize the problem component within the annotated HTML or not? Is it seeing things it doesn't recognize in the HTML, or is it seeing certain structures in the source and aborting before it even gets there?). Like many such things, this might be better resolved by trying it and seeing what happens -- I don't want us to lock into a strategy too early when a lot of ideas are still unresolved. Had a quick chat with Tim in IRC -- we're definitely going to try poking at the current state of the Wikia RTE a bit more. I'll start merging it to our extensions SVN so we've got a stable clone of it that can be run on stock trunk. Little changes should be mergeable back to Wikia's SVN, and we'll have something available for stock distributions that's more stable than the old FCK extension, and that we can start experimenting with along with other stuff. Another good thing in this code is the client-side editor plugins; once one gets past the raw 'shove stuff in/out of the markup format' step, most of the hard work and value of an editor actually comes in the helpers for working with links, images, tables, galleries, etc -- dialogs, wizards, helpers for dragging things around. That's all stuff that we can examine and improve or build from. -- brion ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l