Hi Aaron,

I tend to agree with your conclusion, and personally I have little interest in the relationship between actual size and readable size. But from a technical point of view, I guess you should plot your scatter plot on a log-log scale and also calculate the correlation between the logarithms of the variables. The sizes are not normally distributed but log-normally distributed [1], and linear statistics on heavy-tailed distributions are usually spurious.

[1] http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0038869.g018&representation=PNG_M
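Something along these lines would do it in R (just a minimal sketch, assuming a data frame named "pages" with the page_len and content_length columns from Aaron's regression below):

# Correlate the logarithms instead of the raw, heavy-tailed sizes
# (assumes both sizes are strictly positive).
cor(log(pages$page_len), log(pages$content_length))

# A rank correlation makes no distributional assumption at all.
cor(pages$page_len, pages$content_length, method = "spearman")

# Scatter plot on a log-log scale.
plot(pages$page_len, pages$content_length, log = "xy",
     xlab = "wikitext size (bytes)", ylab = "readable content (characters)")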
Take care,
Taha

On 15 Mar 2014 18:21, "Aaron Halfaker" <aaron.halfa...@gmail.com> wrote:
> Hi Fabian,
>
> I think that the primary reason that articles with smaller byte counts show less consistency is templates. A lot of stubs and starts are created with a collection of templates that consume few bytes of wikitext but balloon into lots of HTML/content. Regardless, there doesn't seem to be much cause for concern, so I saw the issue as resolved.
>
> FWIW, I originally showed up in this conversation because I was skeptical of your initial conclusion: "size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article". Now that we've worked out the strong correlation between wikitext length and readable content length for nearly all articles, I have little interest in looking into the data further.
>
> -Aaron
>
> On Sat, Mar 15, 2014 at 12:47 PM, Floeck, Fabian (AIFB) <fabian.flo...@kit.edu> wrote:
>> Aaron,
>>
>> this seems kind of redundant, as I already agreed that there is an overall high correlation and you posted this (almost) identical analysis 7 months ago. I don't know if you missed my later emails on the topic, but I already wrote that this "mistake", as you repeatedly put it, was a result of the selective sampling between 5000 and 6000 bytes. Hence, as I already said, my initial observations cannot be transferred to the general population of articles.
>>
>> Not surprisingly, and congruent with Aaron's results, I also get a high linear correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000 byte sample, even if I filter out disambiguation articles.
>>
>> But, as I also explained, there seem to be some indicators that in smaller articles this correlation is not as strong.
>>
>> I split the random sample of 5000 articles I posted last time at the median (3709 bytes) into two parts of 2500 articles each.
>> For the "higher byte size" part (>3709 bytes) the correlation is 0.964.
>> For the "lesser byte size" part (<3710 bytes) the correlation is only 0.295.
>>
>> You will of course not see that in your example if you just take all the data (of all article sizes) and draw a straight regression line through it. The "blob" on the bottom left might need some further investigation. Maybe you could look at only articles under 5000, 3000, or 1000 bytes and see if the correlation changes somehow. My guess is it will be less strong.
>>
>> BTW: did you try to fit nonlinear models? I did not, and one reason for the bad fit in the smaller articles could also be that there is a high correlation, but not a linear one.
>>
>> Best,
>>
>> Fabian
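Fabian's split-at-the-median check above is easy to replicate, by the way. A minimal sketch in R (not Fabian's actual code, and again assuming the "pages" data frame from Aaron's regression below):

# Split at the median byte size and correlate within each half;
# 3709 bytes is the median Fabian reports for his sample.
med <- median(pages$page_len)
lower <- subset(pages, page_len <= med)
upper <- subset(pages, page_len > med)
cor(lower$page_len, lower$content_length)
cor(upper$page_len, upper$content_length)

# Correlation within increasingly small articles, as Fabian suggests.
sapply(c(5000, 3000, 1000), function(cutoff) {
  small <- subset(pages, page_len < cutoff)
  cor(small$page_len, small$content_length)
})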
>> On 04.08.2013, at 11:43, Aaron Halfaker <aaron.halfa...@gmail.com> wrote:
>>
>> I just replicated this analysis. I think you might have made some mistakes.
>>
>> I took a random sample of non-redirect articles from English Wikipedia and compared the byte_length (from the database) to the content_length (from the API, tags and comments stripped).
>>
>> I get a Pearson correlation coefficient of *0.9514766*.
>>
>> See the attached scatter plot including a linear regression line. See also the regression output below.
>>
>> Call:
>> lm(formula = page_len ~ content_length, data = pages)
>>
>> Residuals:
>>    Min     1Q Median     3Q    Max
>> -38263   -419     82    592  37605
>>
>> Coefficients:
>>                  Estimate Std. Error t value Pr(>|t|)
>> (Intercept)     -97.40412   72.46523  -1.344    0.179
>> content_length    1.14991    0.00832 138.210   <2e-16 ***
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Residual standard error: 2722 on 1998 degrees of freedom
>> Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
>> F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16
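A rough sketch of how such a byte-size vs. readable-length comparison could be scripted in R follows. This is not Aaron's actual pipeline; it assumes a "pages" data frame with title and page_len columns (page_len as stored in the database), and it uses the public MediaWiki parse API via the jsonlite package:

library(jsonlite)

# Length of the rendered, readable text of one article.
readable_length <- function(title) {
  url <- paste0("https://en.wikipedia.org/w/api.php",
                "?action=parse&prop=text&format=json&page=",
                URLencode(title, reserved = TRUE))
  html <- fromJSON(url)$parse$text[["*"]]
  txt <- gsub("(?s)<!--.*?-->", "", html, perl = TRUE)  # drop HTML comments
  txt <- gsub("<[^>]+>", "", txt)                       # drop tags
  nchar(txt)
}

pages$content_length <- vapply(pages$title, readable_length, numeric(1))
cor(pages$page_len, pages$content_length)             # Pearson, as above
summary(lm(page_len ~ content_length, data = pages))  # regression as above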
>> On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB) <fabian.flo...@kit.edu> wrote:
>>> Hi,
>>> to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this):
>>>
>>> I wanted to know whether a "long" or "short" article, in terms of how much readable material (excluding pictures) is presented to the reader in the front end, is correlated with the byte size of the wikisyntax that can be obtained from the DB or API, as people often define the "length" of an article by its length in bytes.
>>>
>>> TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought.
>>>
>>> We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average article size) = 41981 articles.
>>> Results for size in characters (with whitespace) after cleaning out the HTML:
>>> Min = 95, Max = 49441, Mean = 4794.41, Std. Deviation = 1712.748
>>>
>>> Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School" and "Martin Callanan" -- although for the latter you could argue that expandable template listings are not really main "reading" content.)
>>>
>>> Effectively, the correlation of readable character size with byte size is 0.04 (i.e. none) in the sample.
>>>
>>> If someone already did this or a similar analysis, I'd appreciate pointers.
>>>
>>> Best,
>>>
>>> Fabian
>>>
>>> --
>>> Karlsruhe Institute of Technology (KIT)
>>> Institute of Applied Informatics and Formal Description Methods
>>>
>>> Dipl.-Medwiss. Fabian Flöck
>>> Research Associate
>>>
>>> Building 11.40, Room 222
>>> KIT-Campus South
>>> D-76128 Karlsruhe
>>>
>>> Phone: +49 721 608 4 6584
>>> Fax: +49 721 608 4 6580
>>> Skype: f.floeck_work
>>> E-Mail: fabian.flo...@kit.edu
>>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>>
>>> KIT - University of the State of Baden-Wuerttemberg and
>>> National Research Center of the Helmholtz Association
>>
>> <bytes.content_length.scatter.png>
>>
>> --
>> Dipl.-Medwiss. Fabian Flöck
>> Research Associate
>>
>> Karlsruhe Institute of Technology (KIT)
>> Institute of Applied Informatics and Formal Description Methods
>>
>> Building 11.40, Room 222
>> KIT-Campus South
>> D-76128 Karlsruhe
>>
>> Phone: +49 721 608 4 6584
>> Fax: +49 721 608 4 6580
>> Skype: f.floeck_work
>> E-Mail: flo...@kit.edu
>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>
>> KIT - University of the State of Baden-Wuerttemberg and
>> National Research Center of the Helmholtz Association
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l