Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this):

I wanted to know whether the "length" of an article, in terms of how much readable material (excluding pictures) is presented to the reader in the front-end, is correlated with the byte size of the wikisyntax that can be obtained from the DB or API, since people often define the "length" of an article by its size in bytes.

TL;DR: Size in bytes turns out to be a really, really bad indicator of the actual, readable content of a Wikipedia article, even worse than I thought.

We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation pages, no redirects) between 5800 and 6000 bytes (around 5900 bytes is the overall en.wiki average for these articles). That gave 41981 articles.

Results for size in characters (including whitespace) after cleaning out the HTML:

Min = 95
Max = 49441
Mean = 4794.41
Std. deviation = 1712.748

The gap between Min and Max was especially interesting, but templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School" and "Martin Callanan", although for the latter you could argue that expandable template listings are not really main "reading" content.)

Effectively, the correlation between readable character size and byte size is 0.04 (i.e. none) in the sample.

If someone has already done this or a similar analysis, I'd appreciate pointers.

Best,
Fabian

--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: [email protected]
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
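
For anyone who wants to try a similar measurement, here is a minimal sketch of the two steps described in the message above: reducing front-end HTML to its readable characters, and correlating that count with the wikisyntax byte size. It is not the script used for the numbers in this mail. The names `TextExtractor`, `readable_length` and `pearson` are hypothetical, and skipping script/style content and collapsing whitespace runs are assumptions about what "cleaning the HTML out" means.

```python
from html.parser import HTMLParser
from math import sqrt


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def readable_length(html):
    """Character count (single spaces included) after removing the markup.

    Assumption: runs of whitespace are collapsed to one space, so that
    indentation in the HTML source does not inflate the count.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return len(text)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)
```

Given a list of rendered pages and the matching list of wikisyntax byte sizes, `pearson([readable_length(h) for h in pages], byte_sizes)` would give the kind of coefficient reported above. Fetching the pages and byte sizes is left out on purpose, since how that was done for the sample is not described beyond "curl".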
