On Wed, Feb 27, 2013 at 8:14 AM, Mariya Nedelcheva Miteva wrote:
> Anthony, what do you mean, what's wrong with archive.org?
Why aren't the dumps being uploaded to archive.org?
(Maybe the answer is that they are, and I just didn't know.)
What's wrong with using archive.org?
On Mon, Feb 25, 2013 at 8:35 AM, Maria Miteva wrote:
> Hi everyone,
>
> As you can see on top of https://meta.wikimedia.org/wiki/Data_dumps, WMF is
> actively looking for help archiving and distributing data dumps. It would
> be great if you could check with
On Tue, Dec 21, 2010 at 8:13 PM, Anthony wrote:
> Have you tried escaping them?
By which I mean, using character references.
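A minimal sketch of what that escaping could look like (my illustration in Python, not code from the thread; the catch is that XML 1.0 forbids most control characters even as character references, so this only helps if the consumer accepts XML 1.1):

import re

# Raw control characters that XML 1.0 forbids outright. XML 1.1
# allows all of them except NUL, but only as character references
# such as &#x8; -- never as raw bytes.
CONTROL_CHARS = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F]")

def escape_control_chars(text):
    """Replace raw control characters with numeric character references."""
    return CONTROL_CHARS.sub(lambda m: "&#x%X;" % ord(m.group(0)), text)

print(escape_control_chars("bell\x07"))  # -> bell&#x7;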
On Tue, Dec 21, 2010 at 7:51 PM, Tim Starling wrote:
> In XML 1.1:
>
> "Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
> [#xE000-#xFFFD] | [#x1-#x10] /* any Unicode character,
> excluding the surrogate blocks, FFFE, and . */"
Where are you reading that? At http://www.w3.or
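As a side note, the Char production actually printed at http://www.w3.org/TR/xml11/ is Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF], which translates into a simple scan for offending characters. A sketch in Python (mine, not from the thread):

import re

# The XML 1.1 Char production from w3.org/TR/xml11/:
#   Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything outside these ranges (NUL, lone surrogates, U+FFFE, U+FFFF)
# makes the document ill-formed.
INVALID_XML11 = re.compile("[^\u0001-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")

def find_invalid_chars(text):
    """Return (offset, codepoint) pairs for characters XML 1.1 cannot contain."""
    return [(m.start(), hex(ord(m.group(0)))) for m in INVALID_XML11.finditer(text)]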
On Tue, Dec 21, 2010 at 10:02 AM, Tim Starling wrote:
> I've uploaded my latest attempt at converting the backup to XML:
>
> http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
>
> The archive contains an invalid XML file, with control characters
> preserved, and a valid XML file, with co
On Mon, Apr 26, 2010 at 5:52 PM, Platonides wrote:
> Anthony wrote:
> > What kind of space needs are we talking about?
>
> 100k requests per second.
> Assuming that a URL is 50 bytes on average, that's 432 GB per day (the
> usual apache log line is about 1.5 times
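Spelling out the arithmetic behind that figure (decimal units):

# 100k requests/s at ~50 bytes of URL each, over one day.
requests_per_sec = 100_000
bytes_per_url = 50
seconds_per_day = 86_400
gb_per_day = requests_per_sec * bytes_per_url * seconds_per_day / 1e9
print(gb_per_day)  # 432.0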
On Thu, Apr 22, 2010 at 6:31 PM, Platonides wrote:
> S. Nunes wrote:
> > Hi all,
> >
> > I presume that Wikipedia keeps data about HTTP accesses to all articles.
> > Can anybody inform me if this data is available for research purposes?
>
> No. With the amount of traffic it has, space needs would
On Sun, Apr 11, 2010 at 6:27 PM, Luca de Alfaro wrote:
> I guess that Wiki(pedia|media) could very well gather statistics on
>
> (revision_id, clicked_link)
>
> pairs without compromising the anonymity of the visitors. It would be very
> useful to have indications on which hyperlinks are most used.
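A toy sketch of that kind of aggregation in Python (the names and example values are mine): clicks are tallied per (revision_id, clicked_link) pair, and no IP, session, or user field is ever stored, so individual visitors cannot be recovered from the counts.

from collections import Counter

# Tallies keyed by (revision_id, clicked_link); nothing user-identifying.
click_counts = Counter()

def record_click(revision_id, clicked_link):
    click_counts[(revision_id, clicked_link)] += 1

# Arbitrary example values.
record_click(341235, "/wiki/Anonymity")
record_click(341235, "/wiki/Anonymity")
print(click_counts.most_common(1))  # [((341235, '/wiki/Anonymity'), 2)]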
On Fri, Nov 20, 2009 at 10:57 AM, Anthony wrote:
> The main thing that would be missing, and that can't be reconstructed
> from the newer dumps, would be deleted articles. 0.1%, weighted by
> number of revisions? I have absolutely no idea.
By the way, depending on what you
On Fri, Nov 20, 2009 at 10:42 AM, Denny Vrandecic wrote:
>
> On Nov 20, 2009, at 16:38, Anthony wrote:
>
>> On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic wrote:
>>> The newer dump should include almost all material from the older dumps, so
>>> the older dumps are redundant.
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic wrote:
> The newer dump should include almost all material from the older dumps, so
> the older dumps are redundant.
Almost redundant :).
> You can just get the fresh dumps and query appropriately.
Except for the one that you can't get.
I'm sure many of you have seen the slashdot story with the same title
as this thread. It pointed to
http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html which is
the description of a simple offline Wikipedia reader which runs off
the bzipped dumps.
I for one thought the use of bzip2recover
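For anyone who skipped the article: the trick, as I read it, is that bzip2 compresses in independent ~900 KB blocks, and bzip2recover will split a dump into one standalone .bz2 file per block, so a reader only decompresses the block holding the article it wants. A rough Python sketch of the idea (not the article's actual code):

import bz2
import glob
import subprocess

def split_into_blocks(dump_path):
    """Write each compressed block as its own .bz2 (rec00001*.bz2, ...)."""
    subprocess.run(["bzip2recover", dump_path], check=True)
    return sorted(glob.glob("rec*.bz2"))

def read_block(block_path):
    """Decompress one ~900 KB block independently of the rest."""
    with open(block_path, "rb") as f:
        return bz2.decompress(f.read())

An offline reader would make one full pass to index which block each article title starts in, then serve lookups by decompressing only that block (plus its neighbour for articles that span a boundary).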