Are you looking for the most current version of each page? Do you want
articles or also talk pages, user pages and the rest?

In any case, there are a couple of projects that might be of interest to
you.
One is the so-called adds/changes dumps, available here:
https://dumps.wikimedia.org/other/incr/
The other is work being done on producing HTML dumps; you may follow the
progress of that on Phabricator: https://phabricator.wikimedia.org/T254275
Note that parsing wikitext to generate HTML is quite resource-intensive; you
might look at the Kiwix project for more information about how they do it:
https://github.com/openzim/mwoffliner
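
For the adds/changes dumps mentioned above, here is a rough, untested
Python sketch of how one might pick up the latest files for a wiki; the
directory layout and file names are assumptions, so check the listing at
the URL above first:

import re
import urllib.request

BASE = "https://dumps.wikimedia.org/other/incr"
WIKI = "enwikinews"  # example wiki

def list_links(url):
    """Return href targets from a plain directory-listing page."""
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    return re.findall(r'href="([^"]+)"', html)

# Dated subdirectories show up as e.g. 20200728/ in the listing.
dates = sorted(d.rstrip("/") for d in list_links(f"{BASE}/{WIKI}/")
               if re.fullmatch(r"\d{8}/", d))
latest = dates[-1]

# Fetch whatever .bz2 files that day's run produced.
for name in list_links(f"{BASE}/{WIKI}/{latest}/"):
    if name.endswith(".bz2"):
        print("fetching", name)
        urllib.request.urlretrieve(f"{BASE}/{WIKI}/{latest}/{name}", name)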

Ariel

On Wed, Jul 29, 2020 at 1:48 PM griffin tucker <gtucker4....@hotmail.com>
wrote:

> I figured I would decompress the .bz2 and .gz files and that subsequent
> downloads of dumps would only store the changes, disregarding the
> compressed .bz2, .gz, and .7z files.
>
>
>
> My purposes are just experimenting/learning (I’m a first year comp-sci
> student) and I really like the idea of downloading multiple dumps and it
> not taking up much more space.
>
>
>
> My plan was to download a few dumps of enwikinews as a test, and then go
> for enwikipedia when it’s tested successfully.
>
>
>
> I’ve just been doing this locally, however I was planning on using cloud
> virtual machines like AWS, and then moving them to Glacier for long-term
> storage (copies of the massive volumes).
>
>
>
> I’ve tried following the guides for using MediaWiki to reproduce the
> dumps, but it runs into errors after only a few thousand pages. I was going
> to reproduce each dump and then scrape locally for .html files and store
> those. Images would be a bonus.
>
>
>
> Then, every month I want to run a script that would do all of this
> automatically, storing to a dedup volume.
>
>
>
> That’s my plan, anyway.
>
>
>
> *From:* Ariel Glenn WMF <ar...@wikimedia.org>
> *Sent:* Wednesday, 29 July 2020 4:49 PM
> *To:* Count Count <countvoncount123...@gmail.com>
> *Cc:* griffin tucker <gtucker4....@hotmail.com>;
> xmldatadumps-l@lists.wikimedia.org
> *Subject:* Re: [Xmldatadumps-l] Has anyone had success with data
> deduplication?
>
>
>
> The basic problem is that the page content dumps are ordered by revision
> number within each page, which makes good sense for dumps users but means
> that the addition of a single revision to a page will shift all of the
> remaining data, resulting in different compressed blocks. That's going to
> be true regardless of the compression type.
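>
> Here is a toy, untested Python sketch of the effect: add one revision near
> the start of a synthetic page history, then count how many 128 KiB blocks
> of the raw and the bz2-compressed output still line up between the two
> versions (the revision markup and sizes here are made up):
>
> import bz2
> import hashlib
>
> BLOCK = 128 * 1024  # ZFS default recordsize, for illustration
>
> def fake_rev(i):
>     # Made-up revision markup, just to get plausible-looking bytes.
>     return ("<revision><id>%d</id><text>" % i
>             + hashlib.sha256(str(i).encode()).hexdigest() * 10
>             + "</text></revision>")
>
> revs = [fake_rev(i) for i in range(50_000)]
> old = "".join(revs).encode()
> new = "".join(revs[:100] + [fake_rev(999_999)] + revs[100:]).encode()
>
> def block_hashes(data):
>     return [hashlib.sha1(data[i:i + BLOCK]).digest()
>             for i in range(0, len(data), BLOCK)]
>
> for label, a, b in [("raw xml", old, new),
>                     ("bz2", bz2.compress(old), bz2.compress(new))]:
>     ha, hb = block_hashes(a), block_hashes(b)
>     shared = sum(x == y for x, y in zip(ha, hb))
>     print("%s: %d of %d aligned blocks identical"
>           % (label, shared, len(ha)))
>
> Nearly every block after the insertion point differs, and in real dumps
> the edits are scattered across the whole file, so in practice almost
> nothing lines up.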
>
>
>
> In the not too distant future we might switch over to multi-stream output
> files for all page content, fixing the page id range per stream for bz2
> files. This might let a user check the current list of page ids against the
> previous one and only get the streams with the pages they want, in the
> brave new Hadoop-backed object store of my dreams. 7z files are another
> matter altogether and I don't see how we can do better there without
> rethinking them altogether.
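>
> Something in that spirit can already be tried against the existing
> pages-articles-multistream files: the companion index maps
> "offset:page_id:title", and each bz2 stream can be pulled out with an
> HTTP range request. A rough, untested sketch (file names and the page id
> are just examples, and it assumes the page is not in the final stream):
>
> import bz2
> import urllib.request
>
> DUMP = ("https://dumps.wikimedia.org/enwiki/latest/"
>         "enwiki-latest-pages-articles-multistream.xml.bz2")
> INDEX = ("https://dumps.wikimedia.org/enwiki/latest/"
>          "enwiki-latest-pages-articles-multistream-index.txt.bz2")
> WANTED_PAGE_ID = 12345  # hypothetical page id
>
> # Walk the index, remembering each stream's start offset and the offset
> # of the stream that holds the wanted page.
> offsets, start = [], None
> with bz2.open(urllib.request.urlopen(INDEX)) as idx:
>     for line in idx:
>         off, page_id, _title = line.decode("utf-8").split(":", 2)
>         off, page_id = int(off), int(page_id)
>         if not offsets or offsets[-1] != off:
>             offsets.append(off)
>         if page_id == WANTED_PAGE_ID:
>             start = off
>
> # The stream ends where the next one begins; fetch just those bytes.
> end = offsets[offsets.index(start) + 1] - 1
> req = urllib.request.Request(
>     DUMP, headers={"Range": "bytes=%d-%d" % (start, end)})
> stream = urllib.request.urlopen(req).read()
> print(bz2.decompress(stream)[:500].decode("utf-8", "replace"))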
>
>
>
> Can you describe which dump files you are keeping and why having them in
> sequence is useful? Maybe we can find a workaround that will let you get
> what you need without keeping a bunch of older files.
>
>
>
> Ariel
>
>
>
> On Tue, Jul 28, 2020 at 8:48 AM Count Count <countvoncount123...@gmail.com>
> wrote:
>
> Hi!
>
>
>
> The underlying filesystem (ZFS) uses block-level deduplication, so unique
> chunks of 128 KiB (the default record size) are only stored once. The
> 128 KiB chunks making up the dumps are mostly unique, since there is no
> alignment, so deduplication will not help as far as I can see.
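>
> A quick way to check that on real data before committing disk space: hash
> the 128 KiB blocks of two consecutive uncompressed dumps and count how
> many blocks of the newer file already exist anywhere in the older one
> (roughly what ZFS dedup could share). Untested sketch, file names are
> placeholders:
>
> import hashlib
> import sys
>
> BLOCK = 128 * 1024
>
> def blocks(path):
>     with open(path, "rb") as f:
>         while chunk := f.read(BLOCK):
>             yield hashlib.sha1(chunk).digest()
>
> old_file, new_file = sys.argv[1], sys.argv[2]  # two uncompressed dumps
> seen = set(blocks(old_file))
> total = dup = 0
> for h in blocks(new_file):
>     total += 1
>     dup += h in seen
> print("%d of %d blocks (%.1f%%) already stored"
>       % (dup, total, 100.0 * dup / max(total, 1)))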
>
>
>
> Best regards,
>
>
>
> Count Count
>
>
>
> On Tue, Jul 28, 2020 at 3:51 AM griffin tucker <gtucker4....@hotmail.com>
> wrote:
>
> I’ve tried using FreeNAS/TrueNAS with a data deduplication volume to store
> multiple sequential dumps, however it doesn’t seem to save much space at
> all – I was hoping someone could point me in the right direction so that I
> can download multiple dumps and not have it take up so much room
> (uncompressed).
>
>
>
> Has anyone tried anything similar and had success with data deduplication?
>
>
>
> Is there a guide?
>
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
