On 10/21/2010 12:06 PM, Martin Langhoff wrote:
> Unfortunately, there is a clear need to organise a facility to
> audit/edit the wikipedia snapshots we have and "repack" the archive.
> 
> Do we have any easy way to do this?

I'm the wrong person to answer this question, but the activity's archive
production system does already have support for an article blacklist (and
indeed many articles were excluded from the current bundles).  I don't
know who is in possession of this list, or exactly who took responsibility
for producing the most recent version.  Nonetheless, excluding articles is
"easy".
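
For illustration only, here is a minimal sketch of what title-based
exclusion could look like.  It is not the activity's actual production
code: the function names, the one-title-per-line blacklist format with
"#" comments, and the underscore/space normalization are all assumptions
made for the example.

```python
def normalize(title):
    # Assumption: "Some_Title" and "Some Title" name the same article.
    return title.strip().replace("_", " ")


def load_blacklist(lines):
    """Build a set of normalized titles, skipping blanks and '#' comments."""
    titles = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            titles.add(normalize(line))
    return titles


def filter_articles(articles, blacklist):
    """Yield (title, text) pairs whose title is not on the blacklist."""
    for title, text in articles:
        if normalize(title) not in blacklist:
            yield title, text
```

A set keeps each lookup cheap, which matters when the filter runs over
millions of titles.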

Actually editing article text is not something we have attempted AFAIK.
Ideally, I think, we would fix textual problems upstream as they are
discovered.  The most recent available snapshots for English and Spanish
are 10-14 days old, so this strategy does create a delay, during which
time things can continue to change.

In general, I believe that auditing Wikipedia is a fool's errand.  There
are 3.5 million articles in English Wikipedia, growing by over a thousand
a day.  Spanish Wikipedia has >650,000 articles.  If people want to
create snapshots containing only whitelisted articles, that's fine, but
many of the links will be broken and the amount of information will be
much reduced.

--Ben

_______________________________________________
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel
