I figured I would decompress the .bz2 and .gz files so that subsequent 
downloads of dumps would only store the changes, and discard the compressed 
.bz2, .gz, and .7z files.
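
Roughly, the decompression step I had in mind is a small Python script like 
this (the file name and the /mnt/dedup path are just examples):

    import bz2
    import shutil

    # Stream-decompress a dump so the dedup filesystem sees plain XML.
    # File name and destination path are just examples.
    src = "enwikinews-20200701-pages-articles.xml.bz2"
    dst = "/mnt/dedup/enwikinews-20200701-pages-articles.xml"

    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)  # 1 MiB at a time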

My purpose is just experimenting and learning (I’m a first-year comp-sci 
student), and I really like the idea of downloading multiple dumps without 
them taking up much more space.

My plan was to download a few dumps of enwikinews as a test, and then go for 
enwiki once that works.

I’ve just been doing this locally so far, but I was planning on using cloud 
virtual machines on AWS, and then moving copies of the large volumes to 
Glacier for long-term storage.

I’ve tried following the guides for using MediaWiki to reproduce the dumps, 
but the import runs into errors after only a few thousand pages. The idea was 
to reproduce each dump locally, then scrape the local wiki for .html files and 
store those; images would be a bonus.
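
The scraping part would be roughly along these lines, assuming a local 
MediaWiki install with the standard api.php endpoint (the URL and page titles 
here are placeholders):

    import json
    import urllib.parse
    import urllib.request

    # Fetch the rendered HTML of each page from a local MediaWiki via
    # action=parse. Endpoint and titles are placeholders for my setup.
    API = "http://localhost/w/api.php"
    titles = ["Main Page", "Wikinews:Water cooler"]

    for title in titles:
        query = urllib.parse.urlencode(
            {"action": "parse", "page": title, "prop": "text", "format": "json"}
        )
        with urllib.request.urlopen(API + "?" + query) as resp:
            parsed = json.load(resp)
        html = parsed["parse"]["text"]["*"]
        with open(title.replace("/", "_") + ".html", "w", encoding="utf-8") as out:
            out.write(html)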

Then, every month I want to run a script that would do all of this 
automatically, storing to a dedup volume.
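
The monthly script would then just grab the newest dump, decompress it the 
same way, and write it onto the dedup volume. A rough sketch (the "latest" 
directory layout on dumps.wikimedia.org and the paths are assumptions/examples):

    import bz2
    import datetime
    import shutil
    import urllib.request

    # Monthly job: download the newest enwikinews dump and store it
    # uncompressed on the dedup volume under a dated name.
    # The "latest" URL layout and /mnt/dedup are examples/assumptions.
    wiki = "enwikinews"
    stamp = datetime.date.today().strftime("%Y%m")
    src = f"{wiki}-latest-pages-articles.xml.bz2"
    url = f"https://dumps.wikimedia.org/{wiki}/latest/{src}"

    urllib.request.urlretrieve(url, src)
    with bz2.open(src, "rb") as fin, \
            open(f"/mnt/dedup/{wiki}-{stamp}-pages-articles.xml", "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)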

That’s my plan, anyway.

From: Ariel Glenn WMF <ar...@wikimedia.org>
Sent: Wednesday, 29 July 2020 4:49 PM
To: Count Count <countvoncount123...@gmail.com>
Cc: griffin tucker <gtucker4....@hotmail.com>; 
xmldatadumps-l@lists.wikimedia.org
Subject: Re: [Xmldatadumps-l] Has anyone had success with data deduplication?

The basic problem is that the page content dumps are ordered by revision number 
within each page, which makes good sense for dumps users but means that the 
addition of a single revision to a page will shift all of the remaining data, 
resulting in different compressed blocks. That's going to be true regardless 
of the compression type.
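
As a toy illustration of the effect on fixed-size block dedup (random bytes 
standing in for dump content, not our actual dump code):

    import hashlib
    import os

    # Insert one small "revision" near the start of a byte stream and
    # compare fixed-size 128 KiB block hashes before and after. Every
    # block after the insertion point shifts and stops matching.
    BLOCK = 128 * 1024

    def block_hashes(data):
        return [hashlib.sha256(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)]

    old = os.urandom(8 * BLOCK)  # stand-in for a page content dump
    new = old[:1000] + b"<revision>...</revision>" + old[1000:]

    matches = sum(a == b for a, b in zip(block_hashes(old), block_hashes(new)))
    print(matches, "of", len(block_hashes(old)), "blocks unchanged")  # prints 0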

In the not-too-distant future we might switch over to multi-stream output files 
for all page content, fixing the page id range per stream for bz2 files. This 
might let a user check the current list of page ids against the previous one 
and fetch only the streams with the pages they want, in the brave new 
Hadoop-backed object store of my dreams. 7z files are another matter 
altogether, and I don't see how we can do better there without rethinking them 
entirely.
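
For what it's worth, the "get only the streams with the pages you want" part 
can already be sketched against the existing multistream index files, where 
each line is offset:pageid:title (the file name and page ids below are just 
examples):

    import bz2
    from collections import defaultdict

    # Group page ids by stream offset using the multistream index, so a
    # client can fetch and decompress only the streams it actually needs.
    # The index file name and page ids are just examples.
    wanted = {12345, 67890}
    streams = defaultdict(set)

    with bz2.open("enwiki-20200701-pages-articles-multistream-index.txt.bz2",
                  "rt", encoding="utf-8") as index:
        for line in index:
            offset, pageid, _title = line.split(":", 2)
            streams[int(offset)].add(int(pageid))

    needed = sorted(off for off, ids in streams.items() if ids & wanted)
    print(needed)  # byte offsets of the bz2 streams worth fetching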

Can you describe which dump files you are keeping and why having them in 
sequence is useful? Maybe we can find a workaround that will let you get what 
you need without keeping a bunch of older files.

Ariel

On Tue, Jul 28, 2020 at 8:48 AM Count Count 
<countvoncount123...@gmail.com> wrote:
Hi!

The underlying filesystem (ZFS) deduplicates at the block level, so each unique 
128 KiB chunk (the default record size) is stored only once. The 128 KiB chunks 
making up successive dumps are almost all unique, though, because the data 
isn't aligned on block boundaries, so deduplication will not help as far as I 
can see.
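
You can estimate the effect yourself by hashing fixed 128 KiB blocks of two 
consecutive (uncompressed) dumps and counting how many blocks appear in both; 
a quick sketch:

    import hashlib
    import sys

    # Rough estimate of what ZFS block dedup could save across two files:
    # hash fixed-size 128 KiB blocks and count how many occur in both.
    RECORDSIZE = 128 * 1024

    def block_set(path):
        hashes = set()
        with open(path, "rb") as f:
            while block := f.read(RECORDSIZE):
                hashes.add(hashlib.sha256(block).digest())
        return hashes

    a, b = block_set(sys.argv[1]), block_set(sys.argv[2])
    print(len(a & b), "of", len(a | b), "unique blocks shared")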

Best regards,

Count Count

On Tue, Jul 28, 2020 at 3:51 AM griffin tucker 
<gtucker4....@hotmail.com> wrote:
I’ve tried using FreeNAS/TrueNAS with a data-deduplication volume to store 
multiple sequential dumps, but it doesn’t seem to save much space at all. I was 
hoping someone could point me in the right direction so that I can download 
multiple dumps without them taking up so much room (uncompressed).

Has anyone tried anything similar and had success with data deduplication?

Is there a guide?
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
