Re: [Wiki-research-l] Looking for mirrors for Data dumps

2013-02-27 Thread Anthony
On Wed, Feb 27, 2013 at 8:14 AM, Mariya Nedelcheva Miteva wrote: > Anthony, what do you mean what's wrong with archive.org? Why aren't the dumps being uploaded to archive.org? (Maybe the answer is that they are, and I just didn't kno

Re: [Wiki-research-l] Looking for mirrors for Data dumps

2013-02-25 Thread Anthony
What's wrong with using archive.org? On Mon, Feb 25, 2013 at 8:35 AM, Maria Miteva wrote: > Hi everyone, > > As you can see on top of https://meta.wikimedia.org/wiki/Data_dumps, WMF is > actively looking for help archiving and distributing data dumps. It would > be great if you could check with

Re: [Wiki-research-l] Old Wikipedia backups discovered

2010-12-21 Thread Anthony
On Tue, Dec 21, 2010 at 8:13 PM, Anthony wrote: > Have you tried escaping them? By which I mean, using character references.
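
A minimal Python sketch of the suggestion above, i.e. replacing raw control characters with numeric character references; the helper name and regex are illustrative, not from the thread, and XML 1.0 rejects these references anyway, which is what the rest of the thread turns on:

    import re

    # Control characters that XML 1.0 forbids even as &#x...; references:
    # everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
    _CONTROLS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

    def escape_controls(text):
        """Replace raw control characters with numeric character references."""
        return _CONTROLS.sub(lambda m: "&#x%02X;" % ord(m.group()), text)

    print(escape_controls("page\x08title"))  # -> page&#x08;title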

Re: [Wiki-research-l] Old Wikipedia backups discovered

2010-12-21 Thread Anthony
On Tue, Dec 21, 2010 at 7:51 PM, Tim Starling wrote: > In XML 1.1: > > "Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | > [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character, > excluding the surrogate blocks, FFFE, and FFFF. */" Where are you reading that? At http://www.w3.or
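
For reference, the quoted production is easy to turn into a validity check; a small Python sketch (the function name is mine, not from the spec or the thread):

    def is_xml_char(cp):
        """True if code point cp matches the Char production quoted above:
        #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
        return (cp in (0x09, 0x0A, 0x0D)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    assert not is_xml_char(0x08)    # backspace: excluded
    assert is_xml_char(0x10400)     # supplementary-plane character: allowed
    assert not is_xml_char(0xFFFE)  # noncharacter: excluded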

Re: [Wiki-research-l] Old Wikipedia backups discovered

2010-12-21 Thread Anthony
On Tue, Dec 21, 2010 at 10:02 AM, Tim Starling wrote: > I've uploaded my latest attempt at converting the backup to XML: > > http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z > > The archive contains an invalid XML file, with control characters > preserved, and a valid XML file, with co
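
The truncated preview doesn't say how the valid variant was produced; one plausible approach, sketched in Python under the assumption that disallowed characters are simply dropped (the file names here are placeholders, not the files in the archive):

    import re

    # Anything outside the XML 1.0 Char production.
    _INVALID = re.compile(
        "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")

    with open("wikipedia-2001-08-invalid.xml", encoding="utf-8",
              errors="replace") as src, \
         open("wikipedia-2001-08-valid.xml", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(_INVALID.sub("", line))  # drop disallowed characters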

Re: [Wiki-research-l] Access to HTTP access logs for Wikipedia articles?

2010-04-26 Thread Anthony
On Mon, Apr 26, 2010 at 5:52 PM, Platonides wrote: > Anthony wrote: > > What kind of space needs are we talking about? > > 100k requests per second. > Assuming that an url is 50 bytes on average, that's 432 GB per day (the > usual apache log line is about 1.5 times
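
The arithmetic behind that figure, as a quick Python check (the 50-byte URL and the 1.5x log-line factor are the assumptions stated in the message, not measurements):

    requests_per_second = 100_000
    bytes_per_url = 50
    seconds_per_day = 86_400

    gb_per_day = requests_per_second * bytes_per_url * seconds_per_day / 1e9
    print(gb_per_day)        # 432.0 GB/day of URLs alone

    # A full Apache log line is roughly 1.5x the URL (per the message),
    # so raw logs would run on the order of 650 GB/day.
    print(gb_per_day * 1.5)  # 648.0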

Re: [Wiki-research-l] Access to HTTP access logs for Wikipedia articles?

2010-04-23 Thread Anthony
On Thu, Apr 22, 2010 at 6:31 PM, Platonides wrote: > S. Nunes wrote: > > Hi all, > > > > I presume that Wikipedia keeps data about HTTP accesses to all articles. > > Can anybody inform me if this data is available for research purposes? > > No. With the amount of traffic it has, space needs would

Re: [Wiki-research-l] Help to solve three doubts on Wikipedia research data

2010-04-11 Thread Anthony
On Sun, Apr 11, 2010 at 6:27 PM, Luca de Alfaro wrote: > I guess that Wiki(pedia|media) could very well gather statistics on > > (revision_id, clicked_link) > > pairs without compromising the anonymity of the visitors. It would be very > useful to have indications on which hyperlinks are most us
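
A hypothetical Python sketch of the aggregation being proposed: count (revision_id, clicked_link) pairs without retaining anything visitor-specific. This illustrates the idea only; it is not something Wikimedia runs, and the revision id below is made up:

    from collections import Counter

    click_counts = Counter()

    def record_click(revision_id, clicked_link):
        """Increment an aggregate counter; no IP, cookie or user agent is kept."""
        click_counts[(revision_id, clicked_link)] += 1

    record_click(348913, "Computer_science")
    record_click(348913, "Computer_science")
    record_click(348913, "Alan_Turing")
    print(click_counts.most_common(1))
    # [((348913, 'Computer_science'), 2)]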

Re: [Wiki-research-l] Access to older wikipedia dumps

2009-11-20 Thread Anthony
On Fri, Nov 20, 2009 at 10:57 AM, Anthony wrote: > The main thing that would be missing, and that can't be reconstructed > from the newer dumps, would be deleted articles.  0.1%, weighted by > number of revisions?  I have absolutely no idea. By the way, depending on what you

Re: [Wiki-research-l] Access to older wikipedia dumps

2009-11-20 Thread Anthony
On Fri, Nov 20, 2009 at 10:42 AM, Denny Vrandecic wrote: > > On Nov 20, 2009, at 16:38, Anthony wrote: > >> On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic >> wrote: >>> The newer dump should include almost all material from the older dumps, so >>> the

Re: [Wiki-research-l] Access to older wikipedia dumps

2009-11-20 Thread Anthony
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic wrote: > The newer dump should include almost all material from the older dumps, so > the older dumps are redundant. Almost redundant :). > You can just get the fresh dumps and query appropriately. Except for the one that you can't get. ___

[Wiki-research-l] Building a (fast) Wikipedia offline reader

2007-08-14 Thread Anthony
I'm sure many of you have seen the slashdot story with the same title as this thread. It pointed to http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html which is the description of a simple offline Wikipedia reader which runs off the bzipped dumps. I for one thought the use of bzip2recov
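
The trick described on that page rests on the fact that bzip2recover splits a .bz2 file into one standalone .bz2 file per compressed block, so any block can be decompressed on its own. A rough Python sketch of that idea, assuming a local dump file (the path and glob pattern are placeholders):

    import bz2
    import subprocess
    from pathlib import Path

    dump = Path("enwiki-pages-articles.xml.bz2")   # placeholder path

    # bzip2recover writes one standalone .bz2 file per compressed block
    # (rec00001<name>, rec00002<name>, ...), each decompressible on its own.
    subprocess.run(["bzip2recover", dump.name], check=True, cwd=dump.parent)

    # Decompress just one block instead of the whole multi-gigabyte dump.
    block = sorted(dump.parent.glob("rec*" + dump.name))[0]
    with bz2.open(block, "rt", encoding="utf-8", errors="replace") as f:
        print(f.read()[:200])   # block boundaries can fall mid-article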