Thanks, Collin, for your kind words!
I just uploaded the rulebook to Google Spreadsheet at http://goo.gl/zKslcu.
It can be used in whatever way, but do read the paper
(http://goo.gl/RnMvG1) first to avoid misinterpretation.
Best,
On Tue, Oct 1, 2013 at 12:45 PM, Collin Anderson
col...@averysmallbird.com wrote:
Congratulations, this is impressive work. I am also completely jealous --
a colleague and I will be releasing a similar report for Iran in the
next two weeks. It is intended as part of a broader global project on
Wikipedia censorship ({{Citation Filtered}}) that I hope might merge well
with what you are doing.
On Mon, Sep 30, 2013 at 7:26 PM, 夏楚 summer.ag...@gmail.com wrote:
To all,
I just finished writing up my research on GFW (Great Firewall of China)
blacklist for Wikipedia. Some of you might find it interesting.
The paper can be found at goo.gl/RnMvG1 (tweeted here:
https://twitter.com/SummerAgony/status/384820318402920448).
Here I paste excerpts from the Abstract and Conclusions below.
*Abstract*
In this report, we detail the *complete* and *exact* rulebook that the
Great Firewall of China (GFW) applies to Wikipedia. We call it a "rulebook"
(instead of the common term "blacklist") because we identify not only the
blacklisted terms, but also the exact string matching rules deployed by
GFW. An efficient probing methodology makes this possible.
...
Wikipedia contains millions of pages, e.g. more than 700,000 articles for
the Chinese version, and more than 4,240,000 articles for the English
version. It seems a daunting and infeasible task to test these pages
exhaustively, hence there has been no well-known attempt to gather the
complete blacklist.
While a small sample of the blacklist is useful, the complete picture
can be much more powerful in revealing the inner workings of GFW and
its operators. In this study, we devised a methodology which efficiently
examines the entire Wikipedia corpus, hence exposing to the world the
complete GFW rulebook for Wikipedia for the first time. In total, there are
919 rules (excluding URL terms) which are applicable to Wikipedia, affecting
5336 pages in Chinese Wikipedia and 67 pages in English Wikipedia.
The revealed rulebook also demonstrates that the GFW operation is
haphazard and ill-maintained. At the same time, the Chinese
censorship bureaucracy *intends* to be thorough and extensive.
To be precise, the findings in this report are based on two Wikipedia
snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
English version.
*Concluding Remarks*
In this study, we examined the entire Wikipedia corpus (Chinese version
and English version) and revealed the complete and exact GFW rulebook for
Wikipedia (with caveats described in Section 6).
A sample of notable findings:
- There are 78 terms for which GFW blocks a non-standard variant but
not the canonical path. These are cases where the censors intend to block
a page but the block does not actually take effect, suggesting the censors
have a poor understanding of Wikipedia's serving system.
- Many obscure non-article pages are blocked, which raises the suspicion
that these pages were provided to the censorship bureaucrats by Wikipedia
editors who are very familiar with the content (e.g. those who participated
in the edit wars and/or discussions regarding self-censorship proposals).
- GFW string matching rules have a hard size limit of 64 bytes.
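To make the 64-byte figure concrete, here is a rough sketch (not taken from the paper) of how little room such a limit leaves for Chinese titles. It assumes, purely for illustration, that a rule is a plain substring matched against the percent-encoded request path; the helper names encoded_path, fits_in_rule and rule_matches are hypothetical.

```python
from urllib.parse import quote

RULE_LIMIT_BYTES = 64  # the hard limit reported in the findings above


def encoded_path(title: str) -> str:
    """Percent-encode a page title as it would appear in a /wiki/ URL path."""
    # safe="" so that every reserved character is encoded as well.
    return "/wiki/" + quote(title, safe="")


def fits_in_rule(pattern: str, limit: int = RULE_LIMIT_BYTES) -> bool:
    """Return True if the pattern fits within the rule size limit."""
    # Percent-encoded paths are pure ASCII, so bytes == characters here.
    return len(pattern.encode("ascii")) <= limit


def rule_matches(rule: str, request_path: str) -> bool:
    """Assumption: a rule is a plain substring of the request path."""
    return rule in request_path


if __name__ == "__main__":
    short = encoded_path("维基百科")      # 6 + 4 * 9 = 42 bytes
    long_ = encoded_path("维基百科" * 2)  # 6 + 8 * 9 = 78 bytes
    for path in (short, long_):
        print(len(path), fits_in_rule(path))
```

Each CJK character takes 3 bytes in UTF-8 and therefore 9 bytes once percent-encoded, so under these assumptions a 64-byte rule that includes the "/wiki/" prefix can spell out at most six such characters in full.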
The biggest lesson from this study, in my opinion, is that GFW operation
is haphazard and ill-maintained. Also, there are many indications that the
GFW operators are somewhat disconnected from the censorship bureaucrats.
We hope these findings will be of interest to internet censorship watchers,
Wikipedia researchers, China observers, and ordinary Chinese citizens.
--
Xia Chu (Twitter: @summer.agony)
--
Liberationtech is public & archives are searchable on Google. Violations
of list guidelines will get you moderated:
https://mailman.stanford.edu/mailman/listinfo/liberationtech.
Unsubscribe, change to digest, or change password by emailing moderator at
compa...@stanford.edu.
--
*Collin David Anderson*
averysmallbird.com | @cda | Washington, D.C.
--
Xia Chu (Twitter: @summer.agony; Google+: gplus.to/summer.agony)