It would be interesting to determine whether the 'Barack Obama' article is an outlier. It could be that the Simple English Wikipedia has a larger ratio of text to markup, and thus the Parsoid output is comparatively chunkier. "Barack Obama" may already contain a large amount of wikitext markup, so the Parsoid output isn't as (comparatively) large. If I get some free time I'll try to do some experiments to determine what's going on. --scott
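Something like the following would make that comparison easy to run across wikis. It's only a rough sketch in Python: the Parsoid URL below is a placeholder for whichever instance is handy (and its path layout will depend on how it's configured), while the action=parse call fetches the PHP parser's HTML.

import gzip
import json
import urllib.parse
import urllib.request

PARSOID_URL = "http://localhost:8000"  # hypothetical local Parsoid service

def fetch(url):
    # Plain stdlib fetch; a User-Agent keeps the API happy.
    req = urllib.request.Request(url, headers={"User-Agent": "size-compare-sketch"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def compare(wiki, title):
    quoted = urllib.parse.quote(title, safe="")
    # Parsoid HTML from the (placeholder) Parsoid service.
    parsoid_html = fetch(f"{PARSOID_URL}/{wiki}/{quoted}")
    # PHP parser HTML via the standard action=parse API.
    api = (f"https://{wiki}.org/w/api.php?action=parse&page={quoted}"
           "&prop=text&formatversion=2&format=json")
    php_html = json.loads(fetch(api))["parse"]["text"].encode("utf-8")
    p, q = gzip.compress(parsoid_html), gzip.compress(php_html)
    print(f"{wiki}/{title}: parsoid {len(p)} gz, php {len(q)} gz, "
          f"ratio {len(p) / len(q):.2f}")

compare("en.wikipedia", "Barack Obama")
compare("simple.wikipedia", "Barack Obama")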
On Tue, Feb 19, 2013 at 7:35 PM, Gabriel Wicke <[email protected]> wrote:
> On 02/19/2013 03:52 PM, C. Scott Ananian wrote:
> > So there's currently a 10x expansion in the uncompressed size, but only
> > 3-4x expansion with compression.
>
> My last test after https://gerrit.wikimedia.org/r/#/c/49185/ was merged
> showed a gzip-compressed factor of about 2 for a large article:
>
> 259K obama-parsoid-old.html.gz
> 255K obama-parsoid-adaptive-attribute-quoting.html.gz
> 135K obama-PHP.html.gz
>
> We currently store all round-trip information (plus some debug info) in
> the DOM, but plan to move most of this information out of it. The
> information is private in any case, so there is no reason to send it out
> along with the DOM. We might keep some UID attributes to aid node
> identification, but there is also the possibility to use subtree hashes
> as in the XyDiff algorithm to help with that.
>
> In the end, the resulting DOM will likely still be slightly larger than
> the PHP parser's output as it contains more information, in particular
> about templates.
>
> Gabriel
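For anyone curious what the XyDiff-style subtree-hash idea looks like, here's a minimal sketch (plain Python and ElementTree, not Parsoid's actual code): hash each node bottom-up over its tag, attributes, text, and children's hashes, so unchanged subtrees in two versions of a document get the same digest and can be matched without stored UIDs.

import hashlib
import xml.etree.ElementTree as ET

def subtree_hash(node, table):
    # Digest covers the node itself plus its children's digests,
    # so identical subtrees always hash identically.
    h = hashlib.sha1()
    h.update(node.tag.encode())
    for key in sorted(node.attrib):
        h.update(f"{key}={node.attrib[key]}".encode())
    h.update((node.text or "").encode())
    for child in node:
        h.update(subtree_hash(child, table).encode())
        h.update((child.tail or "").encode())
    digest = h.hexdigest()
    table.setdefault(digest, []).append(node)
    return digest

# Usage: hash both versions, then subtrees sharing a digest are
# candidates for "same node" when diffing old and new DOMs.
old = ET.fromstring("<body><p id='x'>Hello</p><p>World</p></body>")
new = ET.fromstring("<body><p id='x'>Hello</p><p>World!</p></body>")
old_table, new_table = {}, {}
subtree_hash(old, old_table)
subtree_hash(new, new_table)
shared = set(old_table) & set(new_table)
print(f"{len(shared)} shared subtree hashes")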
