A few weeks back, I was playing around with some numbers.  It is true that
the number of effective speakers isn't a very good predictor, but it is a
place to start.

Most of our mature editor communities have about 20 active editors per 1
million effective speakers, give or take a factor of 4.  In other
words, among communities with at least 100 active editors most range from
5 to 80 active editors per million effective speakers.  Admittedly, that is
not a very precise range.  English, for example, is right at 20 on this
metric.  There are also some important outliers, such as the Chinese and
Arabic communities (both less than 2 active editors per million speakers),
which probably have yet to reach parity with the other active languages.
There are also a few major languages (e.g. Hindi, Bengali, and Malay) that
arguably haven't even begun.  Those have fewer than 100 active editors and
less than 0.5 editors per 1 million speakers, despite hundreds of millions
of speakers.
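
To make the arithmetic above concrete, here is a minimal sketch of the
metric and the "factor of 4" band.  The function names are my own, and the
example language at the bottom uses made-up numbers purely for
illustration; it is not real data for any community.

```python
# Baseline and spread taken from the paragraph above.
BASELINE_PER_MILLION = 20.0   # active editors per 1 million effective speakers
SPREAD_FACTOR = 4.0           # "give or take a factor of 4"


def editors_per_million(active_editors, effective_speakers):
    """Active editors per 1 million effective speakers."""
    return active_editors / (effective_speakers / 1_000_000)


def typical_range(baseline=BASELINE_PER_MILLION, factor=SPREAD_FACTOR):
    """Return the (low, high) band implied by a multiplicative spread."""
    return baseline / factor, baseline * factor


def classify(rate):
    """Label a rate as below, within, or above the typical band."""
    low, high = typical_range()
    if rate < low:
        return "below"
    if rate > high:
        return "above"
    return "within"


low, high = typical_range()
print(f"typical band: {low:g} to {high:g} editors per million")

# Hypothetical language: 300 million speakers, 90 active editors.
rate = editors_per_million(90, 300_000_000)
print(f"hypothetical language: {rate:.2f} per million -> {classify(rate)}")
```

The band is multiplicative rather than additive, which is why 20 give or
take a factor of 4 comes out as 5 to 80 and not as a symmetric interval.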

I suspect that if one could start adjusting for other factors, e.g.
speakers with internet access, one might be able to narrow that predicted
range.  Economic and cultural factors are also probably important, as well
as the penetration of secondary languages like English.

Structurally, it seems like this kind of data analysis problem would be
fairly amenable to various kinds of regression analysis.  The main
difficulty would be gathering the right data, e.g. number of effective
speakers (which probably needs to be subdivided by country in order to compare
to other data sets), internet penetration, economic indicators, access to
education, etc.  Anyone happen to know where there is comprehensive
language data broken down by country?
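
As a rough illustration of the kind of regression I have in mind, here is a
minimal single-predictor sketch: an ordinary least-squares fit of
log(active editors) against log(speakers with internet access).  The helper
name and the input numbers are assumptions of mine; the data are synthetic,
generated from exactly 20 editors per million, just to show the mechanics.
A real analysis would need several predictors and real per-country data.

```python
import math


def fit_log_linear(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic inputs (NOT real data): speakers with internet access, and
# editor counts generated from exactly 20 editors per million of them.
online_speakers = [1e6, 1e7, 1e8, 1e9]
active_editors = [20 * s / 1e6 for s in online_speakers]

intercept, slope = fit_log_linear(online_speakers, active_editors)

# A slope near 1 means editors scale proportionally with the predictor;
# the intercept then encodes the per-million rate.
implied_rate = (10 ** intercept) * 1e6
print(f"slope = {slope:.3f}, implied rate = {implied_rate:.1f} per million")
```

Fitting in log space seems natural here because the scatter is
multiplicative (a factor of 4 either way), so the residuals are roughly
symmetric on a log scale.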

As others have suggested, I would emphasize community participation or
readership metrics rather than article metrics, since article counts are
skewed by bot-created content.

Anyway, if one uses 20 active editors per 1 million speakers as a rough
guide, one can estimate which languages have the most natural potential for
growth.  The top 15 on that list would be, in order: Chinese, Hindi, Arabic,
Malay, Spanish, Indonesian, Bengali, Portuguese, Russian, Punjabi, Marathi,
Tagalog, Javanese, Wu, and Telugu.  Those would collectively account for 70%
of the "missing" editors if we assume that we should expect roughly 20
editors per 1 million speakers.  In terms of feature development for under-utilized
languages, those are probably a reasonable set to be thinking about.
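
The "missing" editors estimate is simple enough to sketch.  The function
names below are my own, and the three languages are placeholders with
invented numbers, so the output says nothing about any real community; it
only shows how the ranking and the share of the total gap are computed.

```python
def missing_editors(effective_speakers, active_editors, baseline=20.0):
    """Expected editors at the baseline rate minus actual, floored at zero."""
    expected = baseline * effective_speakers / 1_000_000
    return max(0.0, expected - active_editors)


def rank_by_gap(languages, baseline=20.0):
    """Return (name, gap, share_of_total_gap) tuples, largest gap first."""
    gaps = {
        name: missing_editors(speakers, editors, baseline)
        for name, (speakers, editors) in languages.items()
    }
    total = sum(gaps.values())
    ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
    return [(name, gap, gap / total if total else 0.0) for name, gap in ranked]


# Placeholder data (NOT real): name -> (effective speakers, active editors)
example = {
    "lang_a": (500_000_000, 1000),
    "lang_b": (100_000_000, 1500),
    "lang_c": (10_000_000, 400),   # already above the baseline rate
}

ranking = rank_by_gap(example)
for name, gap, share in ranking:
    print(f"{name}: {gap:.0f} missing editors ({share:.0%} of total gap)")
```

Communities that already exceed the baseline are floored at zero rather
than counted as a negative gap, so they do not offset the shortfall
elsewhere.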

Most of the list is from Asian countries, and many of the languages (e.g.
Chinese, Hindi, Arabic, Bengali, and Russian) use non-Latin scripts.
So support for other scripts is obviously important.  On the other hand, it
is also possible that many of these languages are "missing" in part because
the computer literate among the populations who speak these languages
actually prefer to edit in some other language (e.g. English).

Anyway, just a few thoughts.

-Robert Rohde

On Sun, Jan 25, 2015 at 5:57 PM, Amir E. Aharoni <
amir.ahar...@mail.huji.ac.il> wrote:

> Hi,
>
> It is well-known that the size of a Wikipedia in a given language is not
> proportional to the number of people who speak that language. By "size" I
> mean the article count and the active editor count.
>
> This begs the question: Is it proportional to anything else?
>
> I can think of a bunch of possible things (to most items you can add "...
> in the countries where this language is spoken"):
>
> * Penetration of Internet access
> * Quality of education
> * Number of people who know other major languages, such as English, French,
> Russian, Spanish, etc.
> * Number of people who *don't* know other major languages
> * Gross domestic product
> * Human Development Index
> * The level of usage of this language in the education system (in some
> countries schools function in foreign languages)
> * Amount of published literature in that language
> * Level of censorship and press freedom
> * [[Language planning]] policies (think Catalonia, Ukraine, Quebec, Israel)
>
> It is quite possible that the size of a Wikipedia is proportional not to
> one of these things, but to a combination of them. It is also possible that
> it is not proportional to any of the above, or to anything at all.
>
> Did anybody ever try to research this?
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
> _______________________________________________
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>
