It would be interesting to determine whether the 'Barack Obama' article is
an outlier.  It could be that the Simple English Wikipedia has a larger
ratio of text to markup, and thus the Parsoid output is comparatively
chunkier.  "Barack Obama" may already carry a large amount of wikitext
markup, so the Parsoid output isn't as (comparatively) large.  If I get
some free time I'll run some experiments to determine what's going on.
 --scott
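[A minimal sketch of the experiment described above, comparing raw and
gzip-compressed size ratios of the two renderings; the sample strings and
the `data-parsoid` attribute content are illustrative, not real output:]

```python
import gzip

def expansion_factors(parsoid_html: bytes, php_html: bytes) -> tuple[float, float]:
    """Return (raw, gzip-compressed) size ratios of Parsoid vs. PHP output."""
    raw = len(parsoid_html) / len(php_html)
    compressed = (len(gzip.compress(parsoid_html))
                  / len(gzip.compress(php_html)))
    return raw, compressed

# Toy illustration: Parsoid output carries extra round-trip attributes,
# which are repetitive and compress well, so the compressed ratio tends
# to be smaller than the raw one.
php = b"<p>Some rendered article text.</p>" * 100
parsoid = b'<p data-parsoid=\'{"dsr":[0,34]}\'>Some rendered article text.</p>' * 100
print(expansion_factors(parsoid, php))
```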

On Tue, Feb 19, 2013 at 7:35 PM, Gabriel Wicke <[email protected]> wrote:

> On 02/19/2013 03:52 PM, C. Scott Ananian wrote:
> > So there's currently a 10x expansion in the uncompressed size, but only
> > 3-4x expansion with compression.
>
> My last test after https://gerrit.wikimedia.org/r/#/c/49185/ was merged
> showed a gzip-compressed factor of about 2 for a large article:
>
> 259K obama-parsoid-old.html.gz
> 255K obama-parsoid-adaptive-attribute-quoting.html.gz
> 135K obama-PHP.html.gz
>
> We currently store all round-trip information (plus some debug info) in
> the DOM, but plan to move most of this information out of it. The
> information is private in any case, so there is no reason to send it out
> along with the DOM. We might keep some UID attributes to aid node
> identification, but we could also use subtree hashes, as in the XyDiff
> algorithm, to help with that.
>
> In the end, the resulting DOM will likely still be slightly larger than
> the PHP parser's output as it contains more information, in particular
> about templates.
>
> Gabriel
>
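[A minimal sketch of the XyDiff-style subtree hashing mentioned above, using
hypothetical (tag, text, children) tuples rather than Parsoid's actual DOM
representation; the idea is that identical subtrees hash equal, so nodes can
be matched across versions without persistent UID attributes:]

```python
import hashlib

def subtree_hash(node) -> str:
    """A node's hash covers its tag, its text, and its children's hashes,
    so structurally identical subtrees produce identical hashes."""
    tag, text, children = node  # hypothetical (tag, text, children) tuples
    h = hashlib.sha1()
    h.update(tag.encode())
    h.update(text.encode())
    for child in children:
        h.update(subtree_hash(child).encode())
    return h.hexdigest()

# Two structurally identical subtrees match without any UID attributes:
a = ("p", "Hello ", [("b", "world", [])])
b = ("p", "Hello ", [("b", "world", [])])
print(subtree_hash(a) == subtree_hash(b))
```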
_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
