Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-09 Thread Domas Mituzas
 I've spent a lot of time profiling and optimising the parser in the
 past. It's a complex process. You can't just look at one number for a
 large amount of very complex text and conclude that you've found an
 optimisation target.

unless it is {{cite}}

Cheers,
Domas



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-08 Thread Tim Starling
On 06/05/11 17:13, Andreas Jonsson wrote:
 I had not analyzed the parts of the core parser that I consider
 preprocessing, and it came as a surprise to me that it was as slow
 as the Barack Obama benchmark shows.  But integrating template
 expansion with the parser would solve this performance problem, and is
 therefore in itself a strong argument for working towards replacing
 it.  I will write about this on wikitext-l.

That benchmark didn't have any templates in it; I expanded them with
Special:ExpandTemplates before I started. So it's unlikely that a
significant amount of the time was spent in the preprocessor.

It was a really quick benchmark, with no profiling or further testing
whatsoever. It took a few minutes to do. You shouldn't base
architecture decisions on it, it might be totally invalid. It might
not be a parser benchmark at all. I might have made some configuration
error, causing it to test an unrelated region of the code.

All I know is, I sent in wikitext, the CPU usage went to 100% for a
while, then HTML came back.

I've spent a lot of time profiling and optimising the parser in the
past. It's a complex process. You can't just look at one number for a
large amount of very complex text and conclude that you've found an
optimisation target.

-- Tim Starling




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-06 Thread Andreas Jonsson
2011-05-06 03:27, Andrew Garrett wrote:
 On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson
 andreas.jons...@kreablo.se wrote:
 2011-05-04 08:13, Tim Starling wrote:
 On 04/05/11 15:52, Andreas Jonsson wrote:
 The time it takes to execute the code that glues together the regexps
 will be insignificant compared to actually executing the regexps for any
 article larger than a few hundred bytes.  This is at least the case for
 the articles that are the easiest for the core parser, which are articles
 that contain no markup.  The more markup, the slower it will run.  It is
 possible that this slowdown will be lessened if compiled with HipHop.
 But the top speed of the parser (in bytes/seconds) will be largely
 unaffected.

 PHP execution dominates for real test cases, and HipHop provides a
 massive speedup. See the previous HipHop thread.

 http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html

 Unfortunately, users refuse to write articles consisting only of
 hundreds of kilobytes of plain text, they keep adding references and
 links and things. So we don't really care about the parser's top speed.

 We are talking about different things.  I don't consider callbacks made
 when processing magic words or parser functions being part of the
 actual parsing.  The reference case of no markup input is interesting to
 me as it marks the maximum throughput of the MediaWiki parser, and is
 what you would compare alternative implementations to.  But, obviously,
 if the Barack Obama article takes 22 seconds to render, there are more
 severe problems than parser performance at the moment.
 
 It's a little more complicated than that, and obviously you haven't
 spent a lot of time looking at profiling output from parsing the
 Barack Obama article if you say that — what, if not the parser, is
 slowing down the processing of that article?
 
 Consider the following:
 
 1. Many things that you would exclude from parsing like reference
 tags and what-not call the parser themselves.
 2. Regardless of whether you include the actual callback in your
 measurements of parser run time, you need to consider them.
 Identifying structures that require callbacks, as well as structures
 that don't (such as links, templates, images, and what not) takes
 time. While you might reasonably exclude ifexist calls and so on from
 parser run time, you most certainly cannot reasonably exclude template
 calls, link processing, nor the extra time taken by the preprocessor
 to identify such structures.
 
 As Domas says, real world data is king. As far as I know, in the case
 of 'a a a a', even if you repeat it for a few MB, virtually no PHP
 code is run, because the preprocessor uses strcspn to identify
 structures requiring preprocessing. That's implemented in C — in fact,
 for 'a a a' repeated for a few MB, it's my (probably totally wrong)
 understanding that the PHP code runs in more or less constant time.
 It's the structures that appear in real articles that make the parser
 slow.

I'm sorry, I misunderstood the original statement that HipHop would
make _parsing_ significantly faster and questioned that on false
premises, because I'm thinking of the parser and the preprocessor as
distinctly different components.

Let me explain: as I see it, the first step in formalizing wikitext
syntax is to analyze and write a parser that can be used as a drop-in
replacement after preprocessing.  The stuff that is preprocessed
cannot be integrated with the parser without sacrificing compatibility.
Preprocessing is problematic.  It breaks the one-to-one relationship
between the wikitext and the syntax tree (i.e., it is impossible to
serialize a syntax tree back to the same wikitext that generated it).
Therefore, in a second step, it should be analyzed how the
preprocessed constructions can be integrated with the parser and how
to minimize the damage from this change.
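
(Stated as a check, the round-trip property appealed to here looks like the
sketch below; parseWikitext() and serializeTree() are hypothetical stand-ins
for a replacement parser, not existing MediaWiki APIs.)

    // Hypothetical round-trip check: a parser whose syntax tree keeps a
    // one-to-one relationship with the source can always serialize back
    // to the exact wikitext it was built from.
    function assertRoundTrip( $wikitext ) {
        $tree = parseWikitext( $wikitext );   // wikitext -> syntax tree (hypothetical)
        $out  = serializeTree( $tree );       // syntax tree -> wikitext (hypothetical)
        if ( $out !== $wikitext ) {
            throw new Exception( "Round-trip failed for:\n$wikitext" );
        }
    }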

I had not analyzed the parts of the core parser that I consider
preprocessing, and it came as a surprise to me that it was as slow
as the Barack Obama benchmark shows.  But integrating template
expansion with the parser would solve this performance problem, and is
therefore in itself a strong argument for working towards replacing
it.  I will write about this on wikitext-l.

Best Regards,

Andreas Jonsson



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-05 Thread Andrew Garrett
On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson
andreas.jons...@kreablo.se wrote:
 2011-05-04 08:13, Tim Starling wrote:
 On 04/05/11 15:52, Andreas Jonsson wrote:
 The time it takes to execute the code that glues together the regexps
 will be insignificant compared to actually executing the regexps for any
 article larger than a few hundred bytes.  This is at least the case for
 the articles that are the easiest for the core parser, which are articles
 that contain no markup.  The more markup, the slower it will run.  It is
 possible that this slowdown will be lessened if compiled with HipHop.
 But the top speed of the parser (in bytes/seconds) will be largely
 unaffected.

 PHP execution dominates for real test cases, and HipHop provides a
 massive speedup. See the previous HipHop thread.

 http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html

 Unfortunately, users refuse to write articles consisting only of
 hundreds of kilobytes of plain text, they keep adding references and
 links and things. So we don't really care about the parser's top speed.

 We are talking about different things.  I don't consider callbacks made
 when processing magic words or parser functions being part of the
 actual parsing.  The reference case of no markup input is interesting to
 me as it marks the maximum throughput of the MediaWiki parser, and is
 what you would compare alternative implementations to.  But, obviously,
 if the Barack Obama article takes 22 seconds to render, there are more
 severe problems than parser performance at the moment.

It's a little more complicated than that, and obviously you haven't
spent a lot of time looking at profiling output from parsing the
Barack Obama article if you say that — what, if not the parser, is
slowing down the processing of that article?

Consider the following:

1. Many things that you would exclude from parsing, like reference
tags and what-not, call the parser themselves.
2. Regardless of whether you include the actual callback in your
measurements of parser run time, you need to consider them.
Identifying structures that require callbacks, as well as structures
that don't (such as links, templates, images, and what not) takes
time. While you might reasonably exclude ifexist calls and so on from
parser run time, you most certainly cannot reasonably exclude template
calls, link processing, nor the extra time taken by the preprocessor
to identify such structures.

As Domas says, real world data is king. As far as I know, in the case
of 'a a a a', even if you repeat it for a few MB, virtually no PHP
code is run, because the preprocessor uses strcspn to identify
structures requiring preprocessing. That's implemented in C — in fact,
for 'a a a' repeated for a few MB, it's my (probably totally wrong)
understanding that the PHP code runs in more or less constant time.
It's the structures that appear in real articles that make the parser
slow.
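
(Concretely, the strcspn() trick means the scanner only drops into
per-character PHP logic at characters that could open a construct; everything
else is skipped in one native call. A rough sketch of the idea follows; it is
not the actual Preprocessor code, and handlePossibleConstructAt() is an
invented helper.)

    // Rough sketch of strcspn-based scanning (not the real Preprocessor code).
    function scanForConstructs( $text ) {
        $interesting = "{[<\n";   // chars that may open {{...}}, [[...]], <ref>, line-start syntax
        $len = strlen( $text );
        $pos = 0;
        while ( $pos < $len ) {
            // Native skip over everything that cannot start a construct.
            $pos += strcspn( $text, $interesting, $pos );
            if ( $pos >= $len ) {
                break;
            }
            // Only here does per-character PHP logic run.
            handlePossibleConstructAt( $text, $pos );   // invented helper
            $pos++;
        }
    }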

—Andrew

--
Andrew Garrett
Wikimedia Foundation
agarr...@wikimedia.org



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Tim Starling
On 04/05/11 15:52, Andreas Jonsson wrote:
 The time it takes to execute the code that glues together the regexps
 will be insignificant compared to actually executing the regexps for any
 article larger than a few hundred bytes.  This is at least the case for
 the articles that are the easiest for the core parser, which are articles
 that contain no markup.  The more markup, the slower it will run.  It is
 possible that this slowdown will be lessened if compiled with HipHop.
 But the top speed of the parser (in bytes/seconds) will be largely
 unaffected.

PHP execution dominates for real test cases, and HipHop provides a
massive speedup. See the previous HipHop thread.

http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html

Unfortunately, users refuse to write articles consisting only of
hundreds of kilobytes of plain text, they keep adding references and
links and things. So we don't really care about the parser's top speed.

-- Tim Starling




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Nikola Smolenski
On 05/03/2011 07:45 PM, Ryan Lane wrote:
 It's slightly more difficult, but it definitely isn't any easier. The
 point here is that only having one implementation of the parser, which
 can change at any time, which also defines the spec (and I use the
 word spec here really loosely), is something that inhibits the ability
 to share knowledge.

I was wondering whether it would be possible to have two-tier parsing.
Define what is valid wikitext, express it in BNF, write a parser in C 
and use it as a PHP extension. If the parser encounters invalid 
wikitext, enter the quirks mode AKA the current PHP parser.

I assume that 90% of wikis' contents would be valid wikitext, and so 
the speedup should be significant. And if someone needs to reuse the 
content outside of Wikipedia, they can use 90% of the content very 
easily, and the rest no harder than right now.

The only disadvantage that I see is that every addition to wikitext 
would have to be implemented in both parsers.
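
(A sketch of the dispatch being proposed, with all class and method names
invented for illustration: the strict tier would be the BNF-based parser
exposed as a PHP extension, the quirks tier the current parser.)

    // Two-tier dispatch sketch (names invented for illustration).
    function parseTwoTier( $wikitext, $strictParser, $legacyParser ) {
        try {
            // Strict tier: hypothetical parser for the formal (BNF) grammar.
            return $strictParser->parse( $wikitext );
        } catch ( Exception $e ) {
            // Quirks tier: fall back to whatever the current PHP parser does.
            return $legacyParser->parse( $wikitext );
        }
    }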



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Domas Mituzas
Ohi,

 The time it takes to execute the code that glues together the regexps
 will be insignificant compared to actually executing the regexps for any
 article larger than a few hundred bytes.  

Well, you did an edge case - a long line. Actually, try replacing spaces with
newlines, and you will get a 25x cost difference ;-)

 But the top speed of the parser (in bytes/seconds) will be largely unaffected.

Damn!

Domas


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Andreas Jonsson
2011-05-04 08:41, Domas Mituzas wrote:
 Ohi,
 
 The time it takes to execute the code that glues together the regexps
 will be insignificant compared to actually executing the regexps for any
 article larger than a few hundred bytes.  
 
 Well, you did an edge case - a long line. Actually, try replacing spaces with 
 newlines, and you will get 25x cost difference ;-) 

A single long line containing no markup is indeed an edge case, but it
is a good reference case since it is the input where the parser will run
at its fastest.

Replacing the spaces with newlines will cause a tenfold increase in the
execution time.  Sure, in relative numbers less time is spent executing
regexps, but in absolute numbers, more time is spent there.

/Andreas

samples  %       app name           symbol name
283      8.6044  libphp5.so         zend_hash_quick_find
188      5.7160  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
177      5.3816  libphp5.so         zend_parse_va_args
165      5.0167  libphp5.so         zend_do_fcall_common_helper_SPEC
160      4.8647  libphp5.so         __i686.get_pc_thunk.bx
131      3.9830  libphp5.so         zend_hash_find
127      3.8614  libphp5.so         _zval_ptr_dtor
87       2.6452  libc-2.11.2.so     memcpy
82       2.4932  libphp5.so         _zend_mm_alloc_canary_int
79       2.4019  libphp5.so         zend_get_hash_value
72       2.1891  libphp5.so         _zend_mm_free_canary_int
59       1.7939  libphp5.so         zend_std_read_property
55       1.6722  libphp5.so         execute
52       1.5810  libphp5.so         suhosin_get_config
51       1.5506  libphp5.so         zend_fetch_property_address_read_helper_SPEC_UNUSED_CONST
48       1.4594  libphp5.so         zendparse
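
(The spaces-vs-newlines comparison is easy to reproduce without oprofile; a
crude timing sketch, assuming a MediaWiki 1.x eval.php or maintenance-script
context where $wgParser, Title and ParserOptions are available:)

    // Crude throughput comparison sketch (run from eval.php / a maintenance script).
    $inputs = array(
        'spaces'   => str_repeat( 'a ',  500000 ),   // ~1 MB, one long line
        'newlines' => str_repeat( "a\n", 500000 ),   // same bytes, one word per line
    );
    foreach ( $inputs as $label => $text ) {
        $start = microtime( true );
        $wgParser->parse( $text, Title::newFromText( 'Benchmark' ), new ParserOptions() );
        $elapsed = microtime( true ) - $start;
        printf( "%-8s %.2f s (%.0f KB/s)\n",
            $label, $elapsed, strlen( $text ) / 1024 / $elapsed );
    }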


 But the top speed of the parser (in bytes/seconds) will be largely 
 unaffected.
 
 Damn!
 
 Domas




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Domas Mituzas
Hi!

 A single long line containing no markup is indeed an edge case, but it
 is a good reference case since it is the input where the parser will run
 at its fastest.

Bubblesort will also have O(N) complexity sometimes :-)

 Replacing the spaces with newlines will cause a tenfold increase in the
 execution time.  Sure, in relative numbers less is time spent executing
 regexps, but in absolute numbers, more time is spent there.


Well, this is not fair - you should sum up all the Zend symbols if you compare
that way - there are no debugging symbols for libpcre, so you get an aggregated
view. That's the same as saying that 10 is a smaller number than 7, just because
you can factorize it ;-)

Comparing apples and oranges doesn't always help; that kind of hand waving may
impress others, but some of us have spent more time looking at that data than
just ranting in a single mailing list thread ;-)

Cheers,
Domas


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Jay Ashworth
- Original Message -
 From: Nikola Smolenski smole...@eunet.rs

 I was thinking whether it would be possible to have two-tier parsing?
 Define what is valid wikitext, express it in BNF, write a parser in C
 and use it as a PHP extension. If the parser encounters invalid
 wikitext, enter the quirks mode AKA the current PHP parser.
 
 I assume that 90% of wikis' contents would be valid wikitext, and so
 the speedup should be significant. And if someone needs to reuse the
 content outside of Wikipedia, they can use 90% of the content very
 easily, and the rest not harder than right now.

Yeah, I made this suggestion, oh, 2 or 3 years ago... and I was never
able to get the acceptable percentage down below 100.0%.

Cheers,
-- jra



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Dmitriy Sintsov
* Daniel Friesen li...@nadir-seen-fire.com [Tue, 03 May 2011 21:07:07 
-0700]:
 Naturally of course if it's a C library you can build at least an
 extension/plugin for a number of languages. You would of course have 
to
 install the ext/plug though so it's not a shared-hosting thing.

 ~Daniel Friesen (Dantman, Nadir-Seen-Fire) 
[http://daniel.friesen.name]

The latest generation of browser JavaScript implementations is generally
faster than Zend PHP (not sure about HipHop). Maybe client-side parsing
could also reduce server load. JavaScript is also an extremely popular
and widespread language, and it has good prospects on the server side as
well.
Dmitriy



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Andreas Jonsson
2011-05-03 02:38, Chad wrote:
 On Mon, May 2, 2011 at 8:28 PM, Tim Starling tstarl...@wikimedia.org wrote:
 I know that there is a camp of data reusers who like to write their
 own parsers. I think there are more people who have written a wikitext
 parser from scratch than have contributed even a small change to the
 MediaWiki core parser. They have a lot of influence, because they go
 to conferences and ask for things face-to-face.

 Now that we have HipHop support, we have the ability to turn
 MediaWiki's core parser into a fast, reusable library. The performance
 reasons for limiting the amount of abstraction in the core parser will
 disappear. How many wikitext parsers does the world really need?

 
 People want to write their own parsers because they don't want to use PHP.
 They want to parse in C, Java, Ruby, Python, Perl, Assembly and every
 other language other than the one that it wasn't written in. There's this, 
 IMHO,
 misplaced belief that standardizing the parser or markup would put us in a
 world of unicorns and rainbows where people can write their own parsers on
 a whim, just because they can. Other than making it easier to integrate with
 my project, I don't see a need for them either (and tbh, the endless
 discussions grow tedious).

My motivation for attacking the task of creating a wikitext parser is,
aside from it being an interesting problem, a genuine concern for the
fact that such a large body of data is encoded in such a vaguely
specified format.

 I don't see any problem with keeping the parser in PHP, and as you point out
 with HipHop support on the not-too-distant horizon the complaints about
 performance with Zend will largely evaporate.

But most of the parser's work consists of running regexp pattern
matching over the article text, doesn't it?  Regexp pattern matching is
implemented by native functions.  Does the Zend engine have a slow
regexp implementation?  I would have guessed that the main reason that
the parser is slow is the algorithm, not its implementation.
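
(The PCRE engine itself is native; what tends to dominate is the number of
passes and the PHP glue around them. A crude way to see the difference, in
plain PHP with no MediaWiki involved and made-up input:)

    // One native pass over a large string vs. many small PHP-glued calls.
    $text = str_repeat( 'just some plain article text ', 40000 );   // ~1 MB

    $t = microtime( true );
    preg_match_all( '/\[\[[^]]*\]\]/', $text, $m );    // single pass, all work inside PCRE
    printf( "single pass: %.4f s\n", microtime( true ) - $t );

    $t = microtime( true );
    foreach ( explode( ' ', $text ) as $word ) {        // per-token loop, glue cost in PHP
        preg_match( '/^\[\[/', $word );
    }
    printf( "per-token:   %.4f s\n", microtime( true ) - $t );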

Best Regards,

Andreas Jonsson



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Daniel Friesen
On 11-05-03 03:40 AM, Andreas Jonsson wrote:
 2011-05-03 02:38, Chad wrote:
 [...]
 I don't see any problem with keeping the parser in PHP, and as you point out
 with HipHop support on the not-too-distant horizon the complaints about
 performance with Zend will largely evaporate.
 But most of the parser's work consists of running regexp pattern
 matching over the article text, doesn't it?  Regexp pattern matching are
 implemented by native functions.  Does the Zend engine have a slow
 regexp implementation?  I would have guessed that the main reason that
 the parser is slow is the algorithm, not its implementation.

 Best Regards,

 Andreas Jonsson
Regexps might be fast, but when you have to run hundreds of them all
over the place and do stuff in-language, then the language becomes the
bottleneck.

-- 
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Jay Ashworth
- Original Message -
 From: Andreas Jonsson andreas.jons...@kreablo.se

 Subject: Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with 
 Wikia's WYSIWYG?)

 My motivation for attacking the task of creating a wikitext parser is,
 aside from it being an interesting problem, a genuine concern for the
 fact that such a large body of data is encoded in such a vaguely
 specified format.

Correct: Until you have (at least) two independently written parsers, both
of which pass a test suite 100%, you don't have a *spec*. 

Or more to the point, it's unclear whether the spec or the code rules, which
can get nasty.
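
(In practice that means one shared suite of input/expected pairs run against
both implementations; a skeletal harness, with the parser interfaces left as
assumptions rather than any real API:)

    // Skeletal conformance harness; toHtml() on both parsers is an assumed interface.
    function runConformanceSuite( array $cases, $parserA, $parserB ) {
        $failures = 0;
        foreach ( $cases as $name => $case ) {
            $a = $parserA->toHtml( $case['input'] );
            $b = $parserB->toHtml( $case['input'] );
            if ( $a !== $case['expected'] || $b !== $case['expected'] ) {
                echo "FAIL: $name\n";
                $failures++;
            }
        }
        return $failures === 0;   // only then do you have something spec-like
    }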

Cheers,
-- jra



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread MZMcBride
Tim Starling wrote:
 Another goal beyond editing itself is normalizing the world of 'alternate
 parsers'. There've been several announced recently, and we've got such a
 large array now of them available, all a little different. We even use mwlib
 ourselves in the PDF/ODF export deployment, and while we don't maintain that
 engine we need to coordinate a little with the people who do so that new
 extensions and structures get handled.
 
 I know that there is a camp of data reusers who like to write their
 own parsers. I think there are more people who have written a wikitext
 parser from scratch than have contributed even a small change to the
 MediaWiki core parser. They have a lot of influence, because they go
 to conferences and ask for things face-to-face.
 
 Now that we have HipHop support, we have the ability to turn
 MediaWiki's core parser into a fast, reusable library. The performance
 reasons for limiting the amount of abstraction in the core parser will
 disappear. How many wikitext parsers does the world really need?

I realize you have a dry wit, but I imagine this joke was lost on nearly
everyone. You're not really suggesting that everyone who wants to parse
MediaWiki wikitext compile and run HipHop PHP in order to do so.

It's unambiguously a fundamental goal that content on Wikimedia wikis be
able to be easily redistributed, shared, and spread. A wikisyntax that's
impossible to adequately parse in other environments (or in Wikimedia's
environment, for that matter) is a critical and serious inhibitor to this
goal.

MZMcBride





Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Chad
On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote:
 I realize you have a dry wit, but I imagine this joke was lost on nearly
 everyone. You're not really suggesting that everyone who wants to parse
 MediaWiki wikitext compile and run HipHop PHP in order to do so.


And how is using the parser with HipHop going to be any more
difficult than using it with Zend?

-Chad



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread MZMcBride
Chad wrote:
 On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote:
 I realize you have a dry wit, but I imagine this joke was lost on nearly
 everyone. You're not really suggesting that everyone who wants to parse
 MediaWiki wikitext compile and run HipHop PHP in order to do so.
 
 And how is using the parser with HipHop going to be any more
 difficult than using it with Zend?

The point is that the wikitext and its parsing should be completely separate
from MediaWiki/PHP/HipHop/Zend.

I think some of the bigger picture is getting lost here. Wikimedia produces
XML dumps that contain wikitext. For most people, this is the only way to
obtain and reuse large amounts of content from Wikimedia wikis (especially
as the HTML dumps haven't been re-created since 2008). There needs to be a
way for others to be able to very easily deal with this content.

Many people have suggested (with good reason) that this means that wikitext
parsing needs to be reproducible in other programming languages. While
HipHop may be the best thing since sliced bread, I've yet to see anyone put
forward a compelling reason that the current state of affairs is acceptable.
Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't
overcome the legitimate issues that re-users have (such as programming in a
language other than PHP, banish the thought).

For me, the idea that all that's needed is a faster parser in PHP is a
complete non-starter.

MZMcBride





Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Ryan Lane
On Tue, May 3, 2011 at 10:25 AM, Chad innocentkil...@gmail.com wrote:
 On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote:
 I realize you have a dry wit, but I imagine this joke was lost on nearly
 everyone. You're not really suggesting that everyone who wants to parse
 MediaWiki wikitext compile and run HipHop PHP in order to do so.


 And how is using the parser with HipHop going to be any more
 difficult than using it with Zend?


It's slightly more difficult, but it definitely isn't any easier. The
point here is that only having one implementation of the parser, which
can change at any time, which also defines the spec (and I use the
word spec here really loosely), is something that inhibits the ability
to share knowledge.

Requiring people to use our PHP implementation, whether or not it is
compiled to C, is ludicrous.

- Ryan



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Domas Mituzas
 It's slightly more difficult, but it definitely isn't any easier

It is much easier to embed it in other languages, once you get a shared object
with Parser methods exposed ;-)

Domas


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Ryan Lane
 It is much easier to embed it in other languages, once you get shared object 
 with Parser methods exposed ;-)


Which would also require the linking application to be GPL licensed,
which is less than ideal. We shouldn't limit the licensing of
applications that want to write wikitext. An alternative
implementation can be licensed in any way the author sees fit.

- Ryan



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Brion Vibber
On Tue, May 3, 2011 at 10:48 AM, Domas Mituzas midom.li...@gmail.com wrote:

  It's slightly more difficult, but it definitely isn't any easier

 It is much easier to embed it in other languages, once you get shared
 object with Parser methods exposed ;-)


Building it with HipHop will be harder -- but that's something that can be
packaged.


However, I strongly agree that having only a poorly-specified
single-implementation markup language for all of Wikipedia & Wikimedia's
redistributable data is **not where we want to be** long term.

And even if the PHP-based parser is callable from elsewhere, it's not going
to be a good convenient fit for every potential user. It's still worthwhile
to hammer out clearer, more consistent document formats for the future, so
that other people doing other things that we aren't even thinking of have
the flexibility to do those things however they'll need to.

-- brion


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Neil Harris
On 03/05/11 19:44, MZMcBride wrote:
 Chad wrote:
 On Tue, May 3, 2011 at 2:15 PM, MZMcBride z...@mzmcbride.com wrote:
 I realize you have a dry wit, but I imagine this joke was lost on nearly
 everyone. You're not really suggesting that everyone who wants to parse
 MediaWiki wikitext compile and run HipHop PHP in order to do so.
 And how is using the parser with HipHop going to be any more
 difficult than using it with Zend?
 The point is that the wikitext and its parsing should be completely separate
 from MediaWiki/PHP/HipHop/Zend.

 I think some of the bigger picture is getting lost here. Wikimedia produces
 XML dumps that contain wikitext. For most people, this is the only way to
 obtain and reuse large amounts of content from Wikimedia wikis (especially
 as the HTML dumps haven't been re-created since 2008). There needs to be a
 way for others to be able to very easily deal with this content.

 Many people have suggested (with good reason) that this means that wikitext
 parsing needs to be reproducible in other programming languages. While
 HipHop may be the best thing since sliced bread, I've yet to see anyone put
 forward a compelling reason that the current state of affairs is acceptable.
 Saying well, it'll soon be much faster for MediaWiki to parse doesn't
 overcome the legitimate issues that re-users have (such as programming in a
 language other than PHP, banish the thought).

 For me, the idea that all that's needed is a faster parser in PHP is a
 complete non-starter.

 MZMcBride


I agree completely.

I think it cannot be emphasized enough that what's valuable about 
Wikipedia and other similar wikis is the hard-won _content_, not the 
software used to write and display it at any given time, which is merely a 
means to that end.

Fashions in programming languages and data formats come and go, but the 
person-centuries of writing effort already embodied in Mediawiki's 
wikitext format needs to have a much longer lifespan: having a 
well-defined syntax for its current wikitext format will allow the 
content itself to continue to be maintained for the long term, beyond 
the restrictions of its current software or encoding format.

-- Neil




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Dirk Riehle


On 05/03/2011 08:28 PM, Neil Harris wrote:
 On 03/05/11 19:44, MZMcBride wrote:
...
 The point is that the wikitext and its parsing should be completely separate
 from MediaWiki/PHP/HipHop/Zend.

 I think some of the bigger picture is getting lost here. Wikimedia produces
 XML dumps that contain wikitext. For most people, this is the only way to
 obtain and reuse large amounts of content from Wikimedia wikis (especially
 as the HTML dumps haven't been re-created since 2008). There needs to be a
 way for others to be able to very easily deal with this content.

 Many people have suggested (with good reason) that this means that wikitext
 parsing needs to be reproducible in other programming languages. While
 HipHop may be the best thing since sliced bread, I've yet to see anyone put
 forward a compelling reason that the current state of affairs is acceptable.
 Saying well, it'll soon be much faster for MediaWiki to parse doesn't
 overcome the legitimate issues that re-users have (such as programming in a
 language other than PHP, banish the thought).

 For me, the idea that all that's needed is a faster parser in PHP is a
 complete non-starter.

 MZMcBride


 I agree completely.

 I think it cannot be emphasized enough that what's valuable about
 Wikipedia and other similar wikis is the hard-won _content_, not the
 software used to write and display it at any given time, which is merely a
 means to that end.

 Fashions in programming languages and data formats come and go, but the
 person-centuries of writing effort already embodied in Mediawiki's
 wikitext format needs to have a much longer lifespan: having a
 well-defined syntax for its current wikitext format will allow the
 content itself to continue to be maintained for the long term, beyond
 the restrictions of its current software or encoding format.

 -- Neil

+1 to both MZMcBride and Neil.

So relieved to see things put so eloquently.

Dirk


-- 
Website: http://dirkriehle.com - Twitter: @dirkriehle
Ph (DE): +49-157-8153-4150 - Ph (US): +1-650-450-8550




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Neil Kandalgaonkar
On 5/2/11 5:28 PM, Tim Starling wrote:
How many wikitext parsers does the world really need?

That's a tricky question. What MediaWiki calls parsing, the rest of the 
world calls

1. Parsing
2. Expansion (i.e. templates, magic)
3. Applying local state, preferences, context (i.e. $n, prefs)
4. Emitting

And phases 2 and 3 depend heavily upon the state of the local wiki at 
the time the parse is requested. If you've ever tried to set up a test 
wiki that works like Wikipedia or Wikimedia Commons you'll know what I'm 
talking about.

As for whether the rest of the world needs another wikitext parser: 
well, they keep writing them, so there must be some reason why this 
keeps happening. It's true that language chauvinism plays a part, but 
the inflexibility of the current approach is probably a big factor as 
well. The current system mashes parsing and emitting to HTML together, 
very intimately, and a lot of people would like those to be separate.

   - if they're doing research or stats, and want a more pure, more 
normalized form than HTML or Wikitext.

   - if they're Google, and they want to get all the city infobox data 
and reuse it (this is a real request we've gotten)

   - if they're OpenStreetMaps, and the same thing;

   - if they're emitting to a different format (PDF, LaTeX, books);

   - if they're emitting to HTML but with different needs (like mobile);

And then there's the stuff which you didn't know you wanted, but which 
becomes easy once you have a more flexible parser.

A couple of months ago I wrote a mini PEG-based wikitext parser in 
JavaScript, that Special:UploadWizard is using, today, live on Commons.

 
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/resources/mediawiki.language.parser.js?view=markup

 
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/resources/mediawiki.language.parser.peg?view=markup

While it was a bit of a heavy download (7K compressed), this gave me the 
ability to do pluralizations in the frontend (e.g. "3 out of 5 uploads 
complete") even for difficult languages like Arabic. Great!
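
(The pluralization case is a concrete example: "3 out of 5 uploads complete"
comes from a message along the lines of
"$1 out of $2 {{PLURAL:$2|upload|uploads}} complete", and the parser has to
pick the right form per language. A stripped-down PHP sketch with English-only
rules; the real thing needs per-language plural rules, since Arabic has six
forms.)

    // Stripped-down {{PLURAL:}} expansion, English rules only (illustration, not the real parser).
    function expandPlural( $msg, array $args ) {
        $msg = str_replace( array( '$1', '$2' ), $args, $msg );
        return preg_replace_callback(
            '/\{\{PLURAL:(\d+)\|([^|}]*)\|([^|}]*)\}\}/',
            function ( $m ) {
                return $m[1] == 1 ? $m[2] : $m[3];
            },
            $msg
        );
    }

    echo expandPlural( '$1 out of $2 {{PLURAL:$2|upload|uploads}} complete', array( 3, 5 ) ), "\n";
    // "3 out of 5 uploads complete"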

But the unexpected benefit was that it also made it a snap to add very 
complicated interface behaviour to our message strings. Actually, right 
now, with this library + the ingenious way that wikitext does i18n, we 
may have one of the best libraries out there for internationalized user 
interfaces. I'm considering splitting it off; it could be useful for any 
project that used translatewiki.

But I don't actually want to use JavaScript for anything but the final 
rendering stages (I'd rather move most of this parser to PHP) so stay tuned.

Anyway, I think it's obviously possible for us to do some RTE, and some 
of this stuff, with the current parser. But I'm optimistic that a new 
parsing strategy will be a huge benefit to our community, and our 
partners, and partners we didn't even know we could have. Imagine doing 
RTE with an implementation in a JS frontend, that is generated from some 
of the same sources that the PHP backend uses.

For what it's worth: whenever I meet with Wikia employees the topic is 
always about what MediaWiki and the WMF can do to make their RTE hacks 
obsolete. That doesn't mean that their RTE isn't the right way forward, 
but the people who wrote it don't seem to be very strong advocates for 
it. But I don't want to put words in their mouth; maybe one of them can 
add more to this thread?

-- 
Neil Kandalgaonkar ne...@wikimedia.org



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Jay Ashworth
- Original Message -
 From: MZMcBride z...@mzmcbride.com

  Now that we have HipHop support, we have the ability to turn
  MediaWiki's core parser into a fast, reusable library. The performance
  reasons for limiting the amount of abstraction in the core parser will
  disappear. How many wikitext parsers does the world really need?
 
 I realize you have a dry wit, but I imagine this joke was lost on
 nearly everyone. You're not really suggesting that everyone who wants to
 parse MediaWiki wikitext compile and run HipHop PHP in order to do so.

I'm fairly certain that his intention was "If the parser is HipHop compliant,
then the performance improvements that it will realize for those who need them
will obviate the need to rewrite the parser in anything, while those who
run small enough wikiae not to care won't need to care."

That does *not*, of course, answer the if you don't have more than one
compliant parser, then the code is part of your formal spec, and you
*will* get bitten eventually problem.

Of course, Mediawiki's parser has *three* specs: whatever formal one 
has been ginned up, finally; the code; *and* 8 or 9 GB of MWtext on the
Wikipedias.

Cheers,
-- jra



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Daniel Friesen
On 11-05-03 08:46 PM, Jay Ashworth wrote:
 - Original Message -
 From: MZMcBride z...@mzmcbride.com
 Now that we have HipHop support, we have the ability to turn
 MediaWiki's core parser into a fast, reusable library. The performance
 reasons for limiting the amount of abstraction in the core parser will
 disappear. How many wikitext parsers does the world really need?
 I realize you have a dry wit, but I imagine this joke was lost on
 nearly everyone. You're not really suggesting that everyone who wants to
 parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
 I'm fairly certain that his intention was If the parser is HipHop compliant,
 then the performance improvements that will realize for those who need them
 will obviate the need to rewrite the parser in anything, while those who
 run small enough wikiae not to care, won't need to care.

 That does *not*, of course, answer the if you don't have more than one
 compliant parser, then the code is part of your formal spec, and you
 *will* get bitten eventually problem.

 Of course, Mediawiki's parser has *three* specs: whatever formal one 
 has been ginned up, finally; the code; *and* 8 or 9 GB of MWtext on the
 Wikipedias.

 Cheers,
 -- jra
I'm fairly certain myself that his intention was "With HipHop support,
since the C that HipHop compiles PHP to can be extracted and re-used, we
can turn that compiled C into a C library that can be used anywhere by
abstracting the database calls and what not out of the PHP version of
the parser. And because HipHop has better performance we will no longer
have to worry about parser abstractions slowing down the parser and as a
result increasing the load on large websites like Wikipedia where they
are noticeable. So that won't be in the way of adding those abstractions
anymore."

Naturally of course if it's a C library you can build at least an
extension/plugin for a number of languages. You would of course have to
install the ext/plug though so it's not a shared-hosting thing.

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Jay Ashworth
- Original Message -
 From: Daniel Friesen li...@nadir-seen-fire.com


 I'm fairly certain myself that his intention was With HipHop support
 since the C that HipHop compiles PHP to can be extracted and re-used
 we can turn that compiled C into a C library that can be used anywhere by
 abstracting the database calls and what not out of the php version of
 the parser. And because HipHop has better performance we will no
 longer have to worry about parser abstractions slowing down the parser and as
 a result increasing the load on large websites like Wikipedia where they
 are noticeable. So that won't be in the way of adding those
 abstractions anymore.

What I get for not paying any attention to Facebook Engineering.

*That's* what HipHop does?  

 Naturally of course if it's a C library you can build at least an
 extension/plugin for a number of languages. You would of course have
 to install the ext/plug though so it's not a shared-hosting thing.

True.

But that's still a derivative work.

And from experience, I can tell you that you *don't* want to work
with the *output* of a code generator/cross-compiler.

Cheers,
-- jra



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Jay Ashworth
- Original Message -
 From: Tim Starling tstarl...@wikimedia.org

 I wasn't saying that the current MediaWiki parser is suitable for
 reuse, I was saying that it may be possible to develop the MediaWiki
 parser into something which is reusable.

Aren't there a couple of parsers already which claim 99% compliance or better?

Did anything ever come of trying to assemble a validation suite, All Those
Years Ago?   Or, alternatively, deciding how many pages it's acceptable to
break in the definition of a formal spec?

Cheers,
-- jra



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Tim Starling
On 04/05/11 14:07, Daniel Friesen wrote:
 I'm fairly certain myself that his intention was With HipHop support
 since the C that HipHop compiles PHP to can be extracted and re-used we
 can turn that compiled C into a C library that can be used anywhere by
 abstracting the database calls and what not out of the php version of
 the parser. And because HipHop has better performance we will no longer
 have to worry about parser abstractions slowing down the parser and as a
 result increasing the load on large websites like Wikipedia where they
 are noticeable. So that won't be in the way of adding those abstractions
 anymore.

Yes that's right, more or less. HipHop generates C++ rather than C
though.

Basically you would split the parser into several objects:

* A parser in the traditional sense.
* An output callback object, which would handle generation of HTML or
PDF or syntax trees or whatever.
* A wiki environment interface object, which would handle link
existence checks, template fetching, etc.

Then you would use HipHop to compile:

* The new parser class.
* A few useful output classes, such as HTML.
* A stub environment class which has no dependencies on the rest of
MediaWiki.

Then to top it off, you would add:

* A HipHop extension which provides output and environment classes
which pass their calls through to C-style function pointers.
* A stable C ABI interface to the C++ library.
* Interfaces between various high level languages and the new C
library, such as Python, Ruby and Zend PHP.

Doing this would leverage the MediaWiki development community and the
existing PHP codebase to provide a well-maintained, reusable reference
parser for MediaWiki wikitext.
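
(Spelled out as PHP interfaces, the split described above might look roughly
like the sketch below; the names are illustrative only, not an actual API
proposal.)

    // Illustrative shape of the proposed split (names made up).
    interface ParserOutputSink {                 // HTML, PDF, syntax trees, ...
        function startElement( $type, array $attribs );
        function endElement( $type );
        function text( $text );
    }

    interface WikiEnvironment {                  // wiki-specific lookups
        function pageExists( $title );
        function fetchTemplate( $title );
    }

    class StandaloneParser {
        function parse( $wikitext, ParserOutputSink $out, WikiEnvironment $env ) {
            // ...tokenize $wikitext, asking $env about links and templates and
            // emitting events into $out, with no other MediaWiki dependency...
        }
    }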

-- Tim Starling




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Andrew Dunbar
On 4 May 2011 15:16, Tim Starling tstarl...@wikimedia.org wrote:
 On 04/05/11 14:07, Daniel Friesen wrote:
 I'm fairly certain myself that his intention was With HipHop support
 since the C that HipHop compiles PHP to can be extracted and re-used we
 can turn that compiled C into a C library that can be used anywhere by
 abstracting the database calls and what not out of the php version of
 the parser. And because HipHop has better performance we will no longer
 have to worry about parser abstractions slowing down the parser and as a
 result increasing the load on large websites like Wikipedia where they
 are noticeable. So that won't be in the way of adding those abstractions
 anymore.

 Yes that's right, more or less. HipHop generates C++ rather than C
 though.

 Basically you would split the parser into several objects:

 * A parser in the traditional sense.
 * An output callback object, which would handle generation of HTML or
 PDF or syntax trees or whatever.
 * A wiki environment interface object, which would handle link
 existence checks, template fetching, etc.

 Then you would use HipHop to compile:

 * The new parser class.
 * A few useful output classes, such as HTML.
 * A stub environment class which has no dependencies on the rest of
 MediaWiki.

 Then to top it off, you would add:

 * A HipHop extension which provides output and environment classes
 which pass their calls through to C-style function pointers.
 * A stable C ABI interface to the C++ library.
 * Interfaces between various high level languages and the new C
 library, such as Python, Ruby and Zend PHP.

 Doing this would leverage the MediaWiki development community and the
 existing PHP codebase to provide a well-maintained, reusable reference
 parser for MediaWiki wikitext.

+1

This is the single most exciting news on the MediaWiki front since I started
contributing to Wiktionary nine years ago (-:

Andrew Dunbar (hippietrail)

 -- Tim Starling






Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Andreas Jonsson
2011-05-03 13:25, Daniel Friesen wrote:
 On 11-05-03 03:40 AM, Andreas Jonsson wrote:
 2011-05-03 02:38, Chad wrote:
 [...]
 I don't see any problem with keeping the parser in PHP, and as you point out
 with HipHop support on the not-too-distant horizon the complaints about
 performance with Zend will largely evaporate.
 But most of the parser's work consists of running regexp pattern
 matching over the article text, doesn't it?  Regexp pattern matching are
 implemented by native functions.  Does the Zend engine have a slow
 regexp implementation?  I would have guessed that the main reason that
 the parser is slow is the algorithm, not its implementation.

 Best Regards,

 Andreas Jonsson
 regexps might be fast, but when you have to run hundreds of them all
 over the place and do stuff in-language then the language becomes the
 bottleneck.
 

The time it takes to execute the code that glues together the regexps
will be insignificant compared to actually executing the regexps for any
article larger than a few hundred bytes.  This is at least the case for
the articles that are the easiest for the core parser, which are articles
that contain no markup.  The more markup, the slower it will run.  It is
possible that this slowdown will be lessened if compiled with HipHop.
But the top speed of the parser (in bytes/seconds) will be largely
unaffected.

/Andreas



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Fred Bauder

 Beyond that let's flip the question the other way -- what do we *want*
 out
 of WYSIWYG editing, and can that tool provide it or what else do we need?

We want something simpler and easier to use. That is not what Wikia has.
I could hardly stand trying it out for a few minutes.

Fred




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Platonides
Magnus Manske wrote:
 
 So, why not use my WYSIFTW approach? It will only parse the parts of
 the wikitext that it can turn back, edited or unedited, into wikitext,
 unaltered (including whitespace) if not manually changed. Some parts
 may therefore stay as wikitext, but it's very rare (except lists,
 which I didn't implement yet, but they look intuitive enough).
 
 Magnus

Crazy idea: What if it was an /extensible/ editor? You could add later a
module for enable lists, or enable graphic ref, but also instruct it
on how to present to the user some crazy template with a dozen parameters...





Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Lee Worden
On Tue, 03 May 2011 00:29:51 +0200, Platonides platoni...@gmail.com wrote:

 Magnus Manske wrote:
 
   So, why not use my WYSIFTW approach? It will only parse the parts of
   the wikitext that it can turn back, edited or unedited, into wikitext,
   unaltered (including whitespace) if not manually changed. Some parts
   may therefore stay as wikitext, but it's very rare (except lists,
   which I didn't implement yet, but they look intuitive enough).
 
   Magnus
  Crazy idea: What if it was an /extensible/ editor? You could add later a
  module for enable lists, or enable graphic ref, but also instruct it
 on how to present to the user some crazy template with a dozen parameters...

Seems like it will need to be extensible, to allow authors of MW 
extensions to add support for cases where they've changed the parser's 
behavior?



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread George Herbert
On Mon, May 2, 2011 at 3:29 PM, Platonides platoni...@gmail.com wrote:
 Magnus Manske wrote:

 So, why not use my WYSIFTW approach? It will only parse the parts of
 the wikitext that it can turn back, edited or unedited, into wikitext,
 unaltered (including whitespace) if not manually changed. Some parts
 may therefore stay as wikitext, but it's very rare (except lists,
 which I didn't implement yet, but they look intuitive enough).

 Magnus

 Crazy idea: What if it was an /extensible/ editor? You could add later a
 module for enable lists, or enable graphic ref, but also instruct it
 on how to present to the user some crazy template with a dozen parameters...

Generically a nice idea.

Specific to Wikipedia / WMF projects - all the extensions you might
consider adding are pretty much required for our internal uptake of
the tool, as our pages are the biggest / oldest / crustiest ones
likely to have to be managed...


-- 
-george william herbert
george.herb...@gmail.com



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Tim Starling
On 03/05/11 04:25, Brion Vibber wrote:
 The most fundamental problem with Wikia's editor remains its fallback
 behavior when some structure is unsupported:
 
   Source mode required
 
   Rich text editing has been disabled because the page contains complex
 code.

I don't think that's a fundamental problem, I think it's a quick hack
added to reduce the development time devoted to rare wikitext
constructs, while maintaining round-trip safety. Like you said further
down in your post, it can be handled more elegantly by replacing the
complex code with a placeholder. Why not just do that?

CKEditor makes adding such placeholders really easy. The RTE source
has a long list of such client-side modules, added to support various
Wikia extensions.
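
(In outline, the placeholder approach is a reversible substitution: cut the
unsupported span out before handing the page to the rich editor, and splice
the original text back in on save. A rough PHP sketch with invented names;
the real RTE does this with client-side CKEditor modules and is considerably
more involved.)

    // Rough sketch of placeholdering unsupported constructs (names invented).
    function protectUnsupported( $wikitext, array &$stash ) {
        return preg_replace_callback(
            '/\{\|.*?\|\}/s',                    // e.g. raw wikitable markup
            function ( $m ) use ( &$stash ) {
                $key = "\x7fUNSUPPORTED-" . count( $stash ) . "\x7f";
                $stash[$key] = $m[0];            // keep the original span verbatim
                return $key;                     // opaque token shown as a placeholder
            },
            $wikitext
        );
    }

    function restoreUnsupported( $wikitext, array $stash ) {
        return strtr( $wikitext, $stash );       // splice the original spans back in
    }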

 Here's an example of unsupported code, the presence of which makes a page
 permanently uneditable by the rich editor until it's removed:
 
   <table>
   <tr><td>a</td></tr>
   </table>
 
 You can try this out now at http://communitytest.wikia.com/

Works for me.

http://communitytest.wikia.com/wiki/Brion%27s_table

 Beyond that let's flip the question the other way -- what do we *want* out
 of WYSIWYG editing, and can that tool provide it or what else do we need?
 I've written up some notes a few weeks ago, which need some more collation &
 updating from the preliminary experiments I'm doing, and I would strongly
 appreciate more feedback from you Tim and from everyone else who's been
 poking about in parser & editing land:
 
   http://www.mediawiki.org/wiki/Wikitext.next

Some people in this thread have expressed concerns about the tiny
breakages in wikitext backwards compatibility introduced by RTE,
despite the fact that RTE has aimed for, and largely achieved, precise
backwards compatibility with legacy wikitext.

I find it hard to believe that those people would be comfortable with
a project which has as its goal a broad reform of wikitext syntax.

Perhaps there are good arguments for wikitext syntax reform, but I
have trouble believing that WYSIWYG support is one of them, since the
problem appears to have been solved already by RTE, without any reform.

 Another goal beyond editing itself is normalizing the world of 'alternate
 parsers'. There've been several announced recently, and we've got such a
 large array now of them available, all a little different. We even use mwlib
 ourselves in the PDF/ODF export deployment, and while we don't maintain that
 engine we need to coordinate a little with the people who do so that new
 extensions and structures get handled.

I know that there is a camp of data reusers who like to write their
own parsers. I think there are more people who have written a wikitext
parser from scratch than have contributed even a small change to the
MediaWiki core parser. They have a lot of influence, because they go
to conferences and ask for things face-to-face.

Now that we have HipHop support, we have the ability to turn
MediaWiki's core parser into a fast, reusable library. The performance
reasons for limiting the amount of abstraction in the core parser will
disappear. How many wikitext parsers does the world really need?

-- Tim Starling


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Chad
On Mon, May 2, 2011 at 8:28 PM, Tim Starling tstarl...@wikimedia.org wrote:
 I know that there is a camp of data reusers who like to write their
 own parsers. I think there are more people who have written a wikitext
 parser from scratch than have contributed even a small change to the
 MediaWiki core parser. They have a lot of influence, because they go
 to conferences and ask for things face-to-face.

 Now that we have HipHop support, we have the ability to turn
 MediaWiki's core parser into a fast, reusable library. The performance
 reasons for limiting the amount of abstraction in the core parser will
 disappear. How many wikitext parsers does the world really need?


People want to write their own parsers because they don't want to use PHP.
They want to parse in C, Java, Ruby, Python, Perl, Assembly and every
other language other than the one that it wasn't written in. There's this, IMHO,
misplaced belief that standardizing the parser or markup would put us in a
world of unicorns and rainbows where people can write their own parsers on
a whim, just because they can. Other than making it easier to integrate with
my project, I don't see a need for them either (and tbh, the endless
discussions grow tedious).

I don't see any problem with keeping the parser in PHP, and as you point out
with HipHop support on the not-too-distant horizon the complaints about
performance with Zend will largely evaporate.

-Chad

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Chad
On Mon, May 2, 2011 at 8:38 PM, Chad innocentkil...@gmail.com wrote:
 People want to write their own parsers because they don't want to use PHP.
 They want to parse in C, Java, Ruby, Python, Perl, Assembly and every
 other language other than the one that it wasn't written in.

s/wasn't/was/

-Chad

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Brion Vibber
On Mon, May 2, 2011 at 5:28 PM, Tim Starling tstarl...@wikimedia.org wrote:

 On 03/05/11 04:25, Brion Vibber wrote:
  The most fundamental problem with Wikia's editor remains its fallback
  behavior when some structure is unsupported:
 
Source mode required
 
Rich text editing has been disabled because the page contains complex
  code.

 I don't think that's a fundamental problem, I think it's a quick hack
 added to reduce the development time devoted to rare wikitext
 constructs, while maintaining round-trip safety. Like you said further
 down in your post, it can be handled more elegantly by replacing the
 complex code with a placeholder. Why not just do that?


Excellent question -- how hard would it be to change that?

I'm fairly sure that's easier to do with an abstract parse tree generated
from source (don't recognize it? stash it in a dedicated blob); I worry it
may be harder trying to stash that into the middle of a multi-level HTML
translation engine that wasn't meant to be reversible in the first place (do
we even know if there's an opportunity to recognize the problem component
within the annotated HTML or not? Is it seeing things it doesn't recognize
in the HTML, or is it seeing certain structures in the source and aborting
before it even gets there?).

Like many such things, this might be better resolved by trying it and seeing
what happens -- I don't want us to lock into a strategy too early when a lot
of ideas are still unresolved.


I'm very interested in making experimentation easy; for my pre-exploratory
work I'm stashing things into a gadget which adds render/parse
tree/inspector modes to the editing page:

http://www.mediawiki.org/wiki/File:Parser_Playground_demo.png (screenshot &
links)

I've got this set up as a gadget on mediawiki.org now and as a user script
on en.wikipedia.org (loaded on User:Brion_VIBBER/vector.js) just for tossing
random pages in and getting a better sense of how things break down.
Currently parser variant choices are:

* the actual MediaWiki parser via API (parse tree shows the preprocessor
XML; side-by-side mode doesn't have a working inspector mode though)
* a really crappy FakeParser class I threw together, able to handle only a
few constructs. Generates a JSON parse tree, and the inspector mode can
match up nodes in side-by-side view of the tree & HTML (a toy sketch of such
a tree follows after this list).
* PegParser using the peg.js parser generator to build the source->tree
parser, and the same tree->html and tree->source round-trip functions as
FakeParser. The peg source can be edited and rerun to regen the new parse
tree. It's fun!
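
To make that concrete, the tree these toy parsers work over is just nested
nodes that each know how to turn back into source; a purely illustrative
PHP-flavoured sketch (the gadget itself is JavaScript, and these node types
are invented here):

<?php
// Toy illustration only -- not the gadget's actual FakeParser.
$tree = array(
    'type'     => 'document',
    'children' => array(
        array( 'type' => 'heading',   'level' => 2, 'text' => 'Example' ),
        array( 'type' => 'paragraph', 'text'  => 'Hello world.' ),
        // Anything the parser doesn't understand keeps its raw source,
        // so it survives an edit byte-for-byte.
        array( 'type' => 'unknown',   'src'   => "{|\n| cell\n|}" ),
    ),
);
// json_encode( $tree ) is the JSON form such a gadget would display.

function treeToSource( array $node ) {
    switch ( $node['type'] ) {
        case 'document':
            return implode( "\n", array_map( 'treeToSource', $node['children'] ) );
        case 'heading':
            $eq = str_repeat( '=', $node['level'] );
            return "$eq {$node['text']} $eq";
        case 'paragraph':
            return $node['text'];
        default: // 'unknown' and friends: emit the original source verbatim
            return $node['src'];
    }
}

echo treeToSource( $tree ); // regenerates editable wikitext from the tree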

These are a long way off from the level of experimental support we're going
to want, but I think people are going to benefit from trying a few different
things and getting a better feel for how source, parse trees, and resulting
HTML really will look.

(Template expansion isn't yet presented in this system, and that's going to
be where the real fun is. ;)


 Some people in this thread have expressed concerns about the tiny
 breakages in wikitext backwards compatibility introduced by RTE,
 despite the fact that RTE has aimed for, and largely achieved, precise
 backwards compatibility with legacy wikitext.

 I find it hard to believe that those people would be comfortable with
 a project which has as its goal a broad reform of wikitext syntax.

 Perhaps there are good arguments for wikitext syntax reform, but I
 have trouble believing that WYSIWYG support is one of them, since the
 problem appears to have been solved already by RTE, without any reform.


Well, Wikia's RTE still doesn't work on high-profile Wikipedia article
pages, so that remains unproven...

That said, an RTE that doesn't require changing core parser behavior yet
*WILL BE A HUGE BENEFIT* to getting it into use sooner, and still leaves
future reform efforts open.

I'm *VERY OPEN* to the notion of doing the RTE using either a supplementary
source-level parser (which doesn't have to render all structures 100% the
same as the core parser, but *needs* to always create sensible structures
that are useful for editors and can round-trip cleanly) or an alternate
version of the core parser with annotations and limited transformations (eg
like how we don't strip comments out when producing editable source, so we
need to keep them in the output in some way if it's going to be fed into an
HTML-ish editing view).
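
The comment case is a decent litmus test for what those annotations would
mean in practice; a sketch of the shape of it (the function names and the
data attribute are invented here for illustration):

<?php
// Sketch only: carry wikitext comments into the editable HTML as inert
// annotated spans, then turn them back into comments on the way out.
function commentsToEditableHtml( $wikitext ) {
    return preg_replace_callback( '/<!--(.*?)-->/s', function ( $m ) {
        return '<span class="mw-editor-comment" data-mw-comment="'
            . htmlspecialchars( $m[1], ENT_QUOTES ) . '"></span>';
    }, $wikitext );
}

function editableHtmlToComments( $html ) {
    return preg_replace_callback(
        '/<span class="mw-editor-comment" data-mw-comment="(.*?)"><\/span>/s',
        function ( $m ) {
            return '<!--' . htmlspecialchars_decode( $m[1], ENT_QUOTES ) . '-->';
        },
        $html
    );
}

A real editor will of course mangle markup in ways this ignores, but the
transformation itself stays that small.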

A supplementary parser that deals with all your editing fun, but doesn't
play super nice with open...close templates is probably just fine for a huge
number of purposes.

Now that we have HipHop support, we have the ability to turn
 MediaWiki's core parser into a fast, reusable library. The performance
 reasons for limiting the amount of abstraction in the core parser will
 disappear. How many wikitext parsers does the world really need?


I'm not convinced that a giant blob of MediaWiki is suitable as a reusable
library, but would love to see it tried.

-- brion
___

Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-02 Thread Brion Vibber
On Mon, May 2, 2011 at 5:55 PM, Brion Vibber br...@pobox.com wrote:

 On Mon, May 2, 2011 at 5:28 PM, Tim Starling tstarl...@wikimedia.org wrote:

 I don't think that's a fundamental problem, I think it's a quick hack
 added to reduce the development time devoted to rare wikitext
 constructs, while maintaining round-trip safety. Like you said further
 down in your post, it can be handled more elegantly by replacing the
 complex code with a placeholder. Why not just do that?


 Excellent question -- how hard would it be to change that?

 I'm fairly sure that's easier to do with an abstract parse tree generated
 from source (don't recognize it? stash it in a dedicated blob); I worry it
 may be harder trying to stash that into the middle of a multi-level HTML
 translation engine that wasn't meant to be reversible in the first place (do
 we even know if there's an opportunity to recognize the problem component
 within the annotated HTML or not? Is it seeing things it doesn't recognize
 in the HTML, or is it seeing certain structures in the source and aborting
 before it even gets there?).

 Like many such things, this might be better resolved by trying it and
 seeing what happens -- I don't want us to lock into a strategy too early
 when a lot of ideas are still unresolved.


Had a quick chat with Tim in IRC -- we're definitely going to try poking at
the current state of the Wikia RTE a bit more.

I'll start merging it to our extensions SVN so we've got a stable clone of
it that can be run on stock trunk. Little changes should be mergable back to
Wikia's SVN, and we'll have something available for stock distributions
that's more stable than the old FCK extension, and that we can start
experimenting with along with other stuff.

Another good thing in this code is the client-side editor plugins; once one
gets past the raw shoving of stuff in and out of the markup format, most of
the hard work and value of an editor actually comes in the helpers for
working with links, images, tables, galleries, etc -- dialogs, wizards,
helpers for dragging things around. That's all stuff that we can examine and
improve or build from.

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l