Hi Aaron,

I tend to agree with your conclusion, and personally I have little interest in the relationship between actual size and readable size. But from a technical point of view, I guess you should plot your scatter plot on a log-log scale and also calculate the correlation between the logarithms of the variables. The sizes are not normally distributed but log-normally distributed [1], and linear statistics on heavy-tailed distributions are usually spurious.

[1] http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0038869.g018&representation=PNG_M
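Something along these lines would do it in R (just a minimal sketch, assuming a data frame named "pages" with the page_len and content_length columns from Aaron's regression below):

# Correlate the logarithms instead of the raw, heavy-tailed sizes
# (assumes both sizes are strictly positive).
cor(log(pages$page_len), log(pages$content_length))

# A rank correlation makes no distributional assumption at all.
cor(pages$page_len, pages$content_length, method = "spearman")

# Scatter plot on a log-log scale.
plot(pages$page_len, pages$content_length, log = "xy",
     xlab = "wikitext size (bytes)", ylab = "readable content (characters)")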
Take care,
Taha

On 15 Mar 2014 18:21, "Aaron Halfaker" <aaron.halfa...@gmail.com> wrote:
> Hi Fabian,
>
> I think that the primary reason that articles with smaller byte counts show less consistency is templates. A lot of stubs and starts are created with a collection of templates that consume few bytes of wikitext but balloon into lots of HTML/content. Regardless, there doesn't seem to be much cause for concern, so I saw the issue as resolved.
>
> FWIW, I originally showed up in this conversation because I was skeptical of your initial conclusion: "size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article". Now that we've worked out the strong correlation between wikitext length and readable content length for nearly all articles, I have little interest in looking into the data further.
>
> -Aaron
>
> On Sat, Mar 15, 2014 at 12:47 PM, Floeck, Fabian (AIFB) <fabian.flo...@kit.edu> wrote:
>> Aaron,
>>
>> this seems kind of redundant, as I already agreed that there is an overall high correlation and you posted this (almost) identical analysis 7 months ago. I don't know if you missed my later emails on the topic, but I already wrote that this "mistake", as you repeatedly put it, was a result of the selective sampling between 5000 and 6000 bytes. Hence, as I already said, my initial observations cannot be transferred to the general population of articles.
>>
>> Not surprisingly, and congruent with Aaron's results, I also get a high linear correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000 byte sample, even if I filter out disambiguation articles.
>>
>> But, as I also explained, there seem to be some indicators that in smaller articles this correlation is not as strong.
>>
>> I split the random sample of 5000 articles I posted last time at the median (3709 bytes) into two parts of 2500 articles each.
>> For the "higher byte size" part (>3709 bytes) the correlation is 0.964.
>> For the "lesser byte size" part (<3710 bytes) the correlation is only 0.295.
>>
>> You will of course not see that in your example if you just take all the data (of all article sizes) and draw a straight regression line through it. The "blob" on the bottom left might need some further investigation. Maybe you could look at only articles under 5000, 3000, or 1000 bytes and see if the correlation changes somehow. My guess is it will be less strong.
>>
>> BTW: did you try to fit nonlinear models? I did not, and one reason for the bad fit in the smaller articles could also be that there is a high correlation, but not a linear one.
>>
>> Best,
>>
>> Fabian
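Fabian's split-at-the-median check above is easy to replicate, by the way. A minimal sketch in R (not Fabian's actual code, and again assuming the "pages" data frame from Aaron's regression below):

# Split at the median byte size and correlate within each half;
# 3709 bytes is the median Fabian reports for his sample.
med <- median(pages$page_len)
lower <- subset(pages, page_len <= med)
upper <- subset(pages, page_len > med)
cor(lower$page_len, lower$content_length)
cor(upper$page_len, upper$content_length)

# Correlation within increasingly small articles, as Fabian suggests.
sapply(c(5000, 3000, 1000), function(cutoff) {
  small <- subset(pages, page_len < cutoff)
  cor(small$page_len, small$content_length)
})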
>> On 04.08.2013, at 11:43, Aaron Halfaker <aaron.halfa...@gmail.com> wrote:
>>
>> I just replicated this analysis. I think you might have made some mistakes.
>>
>> I took a random sample of non-redirect articles from English Wikipedia and compared the byte_length (from the database) to the content_length (from the API, tags and comments stripped).
>>
>> I get a Pearson correlation coefficient of *0.9514766*.
>>
>> See the attached scatter plot including a linear regression line. See also the regression output below.
>>
>> Call:
>> lm(formula = page_len ~ content_length, data = pages)
>>
>> Residuals:
>>    Min     1Q Median     3Q    Max
>> -38263   -419     82    592  37605
>>
>> Coefficients:
>>                  Estimate Std. Error t value Pr(>|t|)
>> (Intercept)     -97.40412   72.46523  -1.344    0.179
>> content_length    1.14991    0.00832 138.210   <2e-16 ***
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Residual standard error: 2722 on 1998 degrees of freedom
>> Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
>> F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16
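A rough sketch of how such a byte-size vs. readable-length comparison could be scripted in R follows. This is not Aaron's actual pipeline; it assumes a "pages" data frame with title and page_len columns (page_len as stored in the database), and it uses the public MediaWiki parse API via the jsonlite package:

library(jsonlite)

# Length of the rendered, readable text of one article.
readable_length <- function(title) {
  url <- paste0("https://en.wikipedia.org/w/api.php",
                "?action=parse&prop=text&format=json&page=",
                URLencode(title, reserved = TRUE))
  html <- fromJSON(url)$parse$text[["*"]]
  txt <- gsub("(?s)<!--.*?-->", "", html, perl = TRUE)  # drop HTML comments
  txt <- gsub("<[^>]+>", "", txt)                       # drop tags
  nchar(txt)
}

pages$content_length <- vapply(pages$title, readable_length, numeric(1))
cor(pages$page_len, pages$content_length)             # Pearson, as above
summary(lm(page_len ~ content_length, data = pages))  # regression as above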
>> On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB) <fabian.flo...@kit.edu> wrote:
>>> Hi,
>>> to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this):
>>>
>>> I wanted to know whether a "long" or "short" article, in terms of how much readable material (excluding pictures) is presented to the reader in the front end, is correlated with the byte size of the wikisyntax that can be obtained from the DB or API, as people often define the "length" of an article by its length in bytes.
>>>
>>> TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought.
>>>
>>> We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average article size) = 41981 articles.
>>> Results for size in characters (with whitespace) after cleaning out the HTML:
>>> Min = 95, Max = 49441, Mean = 4794.41, Std. Deviation = 1712.748
>>>
>>> Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School" and "Martin Callanan" -- although for the latter you could argue that expandable template listings are not really main "reading" content.)
>>>
>>> Effectively, the correlation of readable character size with byte size is 0.04 (i.e. none) in the sample.
>>>
>>> If someone already did this or a similar analysis, I'd appreciate pointers.
>>>
>>> Best,
>>>
>>> Fabian
>>>
>>> --
>>> Karlsruhe Institute of Technology (KIT)
>>> Institute of Applied Informatics and Formal Description Methods
>>>
>>> Dipl.-Medwiss. Fabian Flöck
>>> Research Associate
>>>
>>> Building 11.40, Room 222
>>> KIT-Campus South
>>> D-76128 Karlsruhe
>>>
>>> Phone: +49 721 608 4 6584
>>> Fax: +49 721 608 4 6580
>>> Skype: f.floeck_work
>>> E-Mail: fabian.flo...@kit.edu
>>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>>
>>> KIT - University of the State of Baden-Wuerttemberg and
>>> National Research Center of the Helmholtz Association
>>
>> <bytes.content_length.scatter.png>
>>
>> --
>> Dipl.-Medwiss. Fabian Flöck
>> Research Associate
>>
>> Karlsruhe Institute of Technology (KIT)
>> Institute of Applied Informatics and Formal Description Methods
>>
>> Building 11.40, Room 222
>> KIT-Campus South
>> D-76128 Karlsruhe
>>
>> Phone: +49 721 608 4 6584
>> Fax: +49 721 608 4 6580
>> Skype: f.floeck_work
>> E-Mail: flo...@kit.edu
>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>
>> KIT - University of the State of Baden-Wuerttemberg and
>> National Research Center of the Helmholtz Association
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l