Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-28 Thread emijrp
Can you share your script with us?

2011/6/27 Platonides platoni...@gmail.com

 emijrp wrote:

 Hi SJ;

 You know that that is an old item in our TODO list ; )

 I heard that Platonides developed a script for that task a long time ago.

 Platonides, are you there?

 Regards,
 emijrp


 Yes, I am. :)




Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-28 Thread emijrp
Hi;

@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia Foundation
either. They can't and/or they don't want to provide image dumps (which is
worse?). The community donates images to Commons, the community donates money
every year, and now the community needs to develop software to extract all the
images, pack them, and, of course, host them in a permanent way. Crazy,
right?

@Milos: Instead of splitting the image dump by the first letter of the filenames,
I thought about splitting it by upload date (YYYY-MM-DD). So the first
chunks (2005-01-01) will be tiny, and the recent ones several GB (a single
day).
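
A minimal sketch of that date-chunk idea (my own illustration, not emijrp's
script): list every Commons upload for one day through the standard MediaWiki
API (list=allimages sorted by timestamp). The requests dependency, User-Agent
string and chosen day are assumptions.

    # Sketch only: enumerate all files uploaded to Commons on one day,
    # which would become one chunk of a date-partitioned image dump.
    import requests

    API = 'https://commons.wikimedia.org/w/api.php'
    HEADERS = {'User-Agent': 'date-chunk-sketch'}  # be a polite API client

    def files_uploaded_on(day):
        """Yield (name, url, size) for every upload on day 'YYYY-MM-DD'."""
        params = {
            'action': 'query', 'format': 'json',
            'list': 'allimages', 'aisort': 'timestamp',
            'aistart': day + 'T00:00:00Z',
            'aiend': day + 'T23:59:59Z',
            'aiprop': 'url|size', 'ailimit': 500,
        }
        while True:
            data = requests.get(API, params=params, headers=HEADERS).json()
            for img in data['query']['allimages']:
                yield img['name'], img['url'], img['size']
            if 'continue' not in data:
                break
            params.update(data['continue'])  # follow API continuation

    # The 2005-01-01 chunk is tiny; a recent day runs to several GB of originals.
    total = sum(size for _, _, size in files_uploaded_on('2005-01-01'))
    print('2005-01-01 chunk: %.1f MB' % (total / 1e6))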

Regards,
emijrp

2011/6/28 Derrick Coetzee dcoet...@eecs.berkeley.edu

 As a Commons admin I've thought a lot about the problem of
 distributing Commons dumps. As for distribution, I believe BitTorrent
 is absolutely the way to go, but the Torrent will require a small
 network of dedicated permaseeds (servers that seed indefinitely).
 These can easily be set up at low cost on Amazon EC2 small instances
 - the disk storage for the archives is free, since small instances
 include a large (~120 GB) ephemeral storage volume at no additional
 cost, and the cost of bandwidth can be controlled by configuring the
 BitTorrent client with either a bandwidth throttle or a transfer cap
 (or both). In fact, I think all Wikimedia dumps should be available
 through such a distribution solution, just as all Linux installation
 media are today.
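
As a rough, untested sketch of such a throttled permaseed (not Derrick's actual
setup), the Python bindings of libtorrent can seed a dump torrent with an
upload rate cap; the torrent file name, save path and rate are placeholders,
and a monthly transfer cap would need extra bookkeeping on top of this.

    # Hedged sketch: seed a (hypothetical) Commons dump torrent indefinitely
    # with a bandwidth throttle, roughly what an EC2 permaseed would run.
    import time
    import libtorrent as lt

    ses = lt.session()
    ses.apply_settings({'upload_rate_limit': 2 * 1024 * 1024})  # ~2 MB/s cap

    handle = ses.add_torrent({
        'ti': lt.torrent_info('commons-dump-2011-06.torrent'),  # placeholder name
        'save_path': '/data/dumps',
    })

    while True:  # a permaseed never stops seeding
        st = handle.status()
        print('peers %d, up %.0f kB/s' % (st.num_peers, st.upload_rate / 1000))
        time.sleep(60)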

 Additionally, it will be necessary to construct (and maintain) useful
 subsets of Commons media that are of particular interest to certain
 content reusers, such as all media used on the English Wikipedia or
 thumbnails of all images on Wikimedia Commons, since the full set is
 far too large to be of interest to most reusers. It's on this latter
 point that I want your input: what useful subsets of Wikimedia
 Commons does the research community want? Thanks for your feedback.

 --
 Derrick Coetzee
 User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
 http://www.eecs.berkeley.edu/~dcoetzee/




Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-28 Thread Milos Rancic
On 06/28/2011 07:21 PM, emijrp wrote:
 @Milos: Instead of splitting the image dump by the first letter of the filenames,
 I thought about splitting it by upload date (YYYY-MM-DD). So the first
 chunks (2005-01-01) will be tiny, and the recent ones several GB (a single
 day).

That would be better, indeed! And you could create a wiki page where
people like me could coordinate the backups: we should cover everything
once, then we could create more copies. For example, I can cover every
Nth day of the month.



Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-28 Thread Platonides
emijrp wrote:
 Hi;

 @Derrick: I don't trust Amazon.

I disagree. Note that we only need them to keep a redundant copy of a 
file. If they tried to tamper with the file, we could detect it with the 
hashes (which should be properly secured; that's no problem).

I'd like to have hashes of the XML dump content instead of the 
compressed file, though, so it could be stored with better 
compression without weakening the integrity check.
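
What that would look like in practice, as a small sketch of my own (the file
name is illustrative): hash the decompressed XML stream, so recompressing the
dump later, say from bz2 to 7z, does not invalidate the published checksum.

    # Sketch: checksum the uncompressed XML content of a bz2 dump, streaming,
    # so the hash stays valid even if the file is later recompressed.
    import bz2
    import hashlib

    def content_sha1(path, chunk_size=1 << 20):
        h = hashlib.sha1()
        with bz2.BZ2File(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    # Illustrative file name; any pages-meta-history bz2 dump works the same way.
    print(content_sha1('eswiki-20110607-pages-meta-history.xml.bz2'))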

 Really, I don't trust the Wikimedia
 Foundation either. They can't and/or they don't want to provide image
 dumps (which is worse?).

Wikimedia Foundation has provided image dumps several times in the past, 
and also rsync3 access to some individuals so that they could clone it.
It's like the enwiki history dump. An image dump is complex, and even 
less useful.


 The community donates images to Commons, the community
 donates money every year, and now the community needs to develop software
 to extract all the images and pack them,

There's no *need* for that. In fact, such a script would be trivial from 
the toolserver.

 and of course, host them in a permanent way. Crazy, right?

WMF also tries hard not to lose images. We want to provide some 
redundancy on our own. That's perfectly fine, but it's not a 
requirement. Consider that WMF could be automatically deleting page 
history older than a month, or images not used on any article. *That* 
would be a real problem.


 @Milos: Instead of splitting the image dump by the first letter of the
 filenames, I thought about splitting it by upload date (YYYY-MM-DD).
 So the first chunks (2005-01-01) will be tiny, and the recent ones several
 GB (a single day).

 Regards,
 emijrp

I like that idea since it means the dumps are static. They could be 
placed on tape inside a safe and would not need to be taken out unless 
data loss arises.



Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-28 Thread emijrp
2011/6/28 Platonides platoni...@gmail.com

 emijrp wrote:

 Hi;

 @Derrick: I don't trust Amazon.


 I disagree. Note that we only need them to keep a redundant copy of a file.
 If they tried to tamper with the file, we could detect it with the hashes (which
 should be properly secured; that's no problem).


I didn't mean security problems. I meant files simply being deleted under weird
terms of service. Commons hosts a lot of images which can be problematic, like
nudes or materials that are copyrighted in some jurisdictions. They can delete
whatever they want and close any account they want, and we will lose the backups.
Period.

And we don't only need to keep a copy of every file. We need several copies
everywhere, not only in the Amazon coolcloud.


 I'd like to have hashes of the XML dump content instead of the
 compressed file, though, so it could be stored with better compression
 without weakening the integrity check.


  Really, I don't trust the Wikimedia
 Foundation either. They can't and/or they don't want to provide image
 dumps (which is worse?).


 Wikimedia Foundation has provided image dumps several times in the past,
 and also rsync3 access to some individuals so that they could clone it.


Ah, OK, that is enough (?). Then you are OK with old-and-broken XML dumps,
because people can slurp all the pages using an API scraper.


 It's like the enwiki history dump. An image dump is complex, and even less
 useful.


It is not complex, just resource-consuming. If they need to buy another 10
TB of space and more CPU, they can. $16M was donated last year. They just
need to put resources into the relevant stuff. WMF always says we host the
5th largest website in the world; I say that they need to act like it.

Less useful? I hope they never need such a useless dump for recovering
images, as they did in the past.



  The community donates images to Commons, the community
 donates money every year, and now the community needs to develop software
 to extract all the images and pack them,


 There's no *need* for that. In fact, such a script would be trivial from the
 toolserver.


Ah, OK, so only people with a toolserver account may have access to an image
dump. And you say it is trivial from the Toolserver but very complex from the
Wikimedia main servers.

 and of course, host them in a permanent way. Crazy, right?


 WMF also tries hard not to lose images.


I hope so, but we remember a case of lost images.


 We want to provide some redundancy on our own. That's perfectly fine, but
 it's not a requirement.


That _is_ a requirement. We can't trust the Wikimedia Foundation. They lost
images. They have problems generating the English Wikipedia dumps and image
dumps. They had a hardware failure some months ago in the RAID which hosts
the XML dumps, and they didn't offer those dumps for months while trying
to fix the crash.


 Consider that WMF could be automatically deleting page history older than a
 month,

 or images not used on any article. *That* would be a real problem.


You just don't understand how dangerous the current situation is (and it was
worse in the past).



  @Milos: Instead of splitting the image dump by the first letter of the
 filenames, I thought about splitting it by upload date (YYYY-MM-DD).
 So the first chunks (2005-01-01) will be tiny, and the recent ones several
 GB (a single day).

 Regards,
 emijrp


 I like that idea since it means the dumps are static. They could be placed
 on tape inside a safe and would not need to be taken out unless data loss
 arises.



Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-28 Thread Platonides
emijrp wrote:
 I didn't mean security problems. I meant files simply being deleted under weird
 terms of service. Commons hosts a lot of images which can be
 problematic, like nudes or materials that are copyrighted in some jurisdictions.
 They can delete whatever they want and close any account they want, and
 we will lose the backups. Period.

Good point.


 And we don't only need to keep a copy of every file. We need several
 copies everywhere, not only in the Amazon coolcloud.

Sure. Relying *just* on Amazon would be very bad.



 Wikimedia Foundation has provided image dumps several times in the
 past, and also rsync3 access to some individuals so that they could
 clone it.


 Ah, OK, that is enough (?). Then you are OK with old-and-broken XML
 dumps, because people can slurp all the pages using an API scraper.

If all the people who want it can get it, then it's enough. Maybe not in 
a timely manner, but that could be fixed. I'm quite confident 
that if RedIRIS rang me tomorrow offering 20 TB for hosting Commons 
image dumps, it could be managed without too many problems.


 It's like the enwiki history dump. An image dump is complex, and
 even less useful.

 It is not complex, just resource-consuming. If they need to buy another
 10 TB of space and more CPU, they can. $16M was donated last year. They
 just need to put resources into the relevant stuff. WMF always says we host
 the 5th largest website in the world; I say that they need to act like it.

 Less useful? I hope they never need such a useless dump for recovering
 images, as they did in the past.

Yes, that seems sensible. You just need to convince them :)
But note that they are already building another datacenter and developing 
a system with which they would keep a copy of every upload in both of 
them. They are not so mean.


 The community donates images to Commons, the community donates money
 every year, and now the community needs to develop software to extract
 all the images and pack them,


 There's no *need* for that. In fact, such a script would be trivial
 from the toolserver.

 Ah, OK, so only people with a toolserver account may have access to an image
 dump. And you say it is trivial from the Toolserver but very complex from the
 Wikimedia main servers.

Come on. Making a script to download all the images is trivial from the 
toolserver; it's just not so easy using the API.
The complexity is in making a dump that *anyone* can download. And that's 
just a matter of resources, not a technical one.
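
To make the contrast concrete, a hedged sketch of the API route being described,
where every file has to be looked up and fetched over HTTP one request at a time
(the example title, User-Agent string and requests dependency are my own
illustration):

    # Sketch of downloading one file via the public API and verifying it against
    # the SHA-1 the API reports; doing this for all of Commons is the slow part.
    import hashlib
    import requests

    API = 'https://commons.wikimedia.org/w/api.php'

    def fetch_and_verify(title):
        info = requests.get(API, params={
            'action': 'query', 'format': 'json',
            'titles': title, 'prop': 'imageinfo', 'iiprop': 'url|sha1',
        }).json()
        page = next(iter(info['query']['pages'].values()))
        ii = page['imageinfo'][0]
        blob = requests.get(ii['url'],
                            headers={'User-Agent': 'dump-mirror-sketch'}).content
        if hashlib.sha1(blob).hexdigest() != ii['sha1']:
            raise ValueError('hash mismatch for ' + title)
        return blob

    print(len(fetch_and_verify('File:Example.jpg')), 'bytes verified')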



 and of course, host them in a permanent way. Crazy, right?
 WMF also tries hard not to lose images.
 I hope so, but we remember a case of lost images.

Yes. That's a reason for making copies, and I support that. But there's 
a difference between "failures happen" and "WMF is not trying to keep 
copies".


 We want to provide some redundancy on our own. That's perfectly
 fine, but it's not a requirement.

 That _is_ a requirement. We can't trust the Wikimedia Foundation. They lost
 images. They have problems generating the English Wikipedia dumps and image
 dumps. They had a hardware failure some months ago in the RAID which
 hosts the XML dumps, and they didn't offer those dumps for months
 while trying to fix the crash.


 You just don't understand how dangerous the current situation is (and it
 was worse in the past).

The big problem is its huge size. If it were 2 MB, everyone and his 
grandmother would keep a copy.



Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-27 Thread Samuel Klein
Thank you, Emijrp!

What about the dump of Commons images? [for those with 10 TB to spare]

SJ

On Sun, Jun 26, 2011 at 8:53 AM, emijrp emi...@gmail.com wrote:
 Hi all;

 Can you imagine a day when Wikipedia is added to this list?[1]

 WikiTeam has developed a script[2] to download all the Wikipedia dumps (and
 those of its sister projects) from dumps.wikimedia.org. It sorts them into
 folders and checks their md5sums. It only works on Linux (it uses wget).

 You will need about 100 GB to download all the 7z files.

 Save our memory.

 Regards,
 emijrp

 [1] http://en.wikipedia.org/wiki/Destruction_of_libraries
 [2]
 http://code.google.com/p/wikiteam/source/browse/trunk/wikipediadownloader.py
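
A small sketch of the md5 check the downloader performs, for anyone who wants to
verify a dump by hand (file names here are illustrative examples, not exact dump
names):

    # Sketch: verify a downloaded 7z dump against the md5sums file published
    # alongside each dump on dumps.wikimedia.org (names here are examples).
    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    expected = {}
    with open('eswiki-20110607-md5sums.txt') as f:
        for line in f:
            digest, name = line.split()
            expected[name] = digest

    target = 'eswiki-20110607-pages-meta-history.xml.7z'
    if md5sum(target) != expected[target]:
        raise SystemExit('corrupted download: ' + target)
    print(target, 'OK')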






-- 
Samuel Klein          identi.ca:sj           w:user:sj          +1 617 529 4266



Re: [Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

2011-06-27 Thread Platonides
emijrp wrote:
 Hi SJ;

 You know that that is an old item in our TODO list ; )

 I heard that Platonides developed a script for that task a long time ago.

 Platonides, are you there?

 Regards,
 emijrp

Yes, I am. :)

