Are you looking for the most current version of each page? Do you want articles or also talk pages, user pages and the rest?
In any case, there are a couple of projects that might be of interest to you. One is the so-called adds/changes dumps, available here: https://dumps.wikimedia.org/other/incr/ The other is work being done on producing HTML dumps; you may follow the progress of that on Phabricator: https://phabricator.wikimedia.org/T254275 Note that parsing wikitext to generate HTML is quite intensive; you might look at the Kiwix project for more information about how they do it: https://github.com/openzim/mwoffliner

Ariel

On Wed, Jul 29, 2020 at 1:48 PM griffin tucker <gtucker4....@hotmail.com> wrote:

> I figured I would decompress the .bz2 and .gz files and that subsequent downloads of dumps would only store the changes, disregarding the compressed .bz2, .gz, and .7z files.
>
> My purposes are just experimenting/learning (I’m a first year comp-sci student) and I really like the idea of downloading multiple dumps and it not taking up much more space.
>
> My plan was to download a few dumps of enwikinews as a test, and then go for enwikipedia when it’s tested successfully.
>
> I’ve just been doing this locally, however I was planning on using cloud virtual machines like AWS, and then moving them to Glacier for long-term storage (copies of the massive volumes).
>
> I’ve tried following the guides for using MediaWiki to reproduce the dumps, but it runs into errors after only a few thousand pages. I was going to reproduce each dump and then scrape locally for .html files and store those. Images would be a bonus.
>
> Then, every month I want to run a script that would do all of this automatically, storing to a dedup volume.
>
> That’s my plan, anyway.
>
> *From:* Ariel Glenn WMF <ar...@wikimedia.org>
> *Sent:* Wednesday, 29 July 2020 4:49 PM
> *To:* Count Count <countvoncount123...@gmail.com>
> *Cc:* griffin tucker <gtucker4....@hotmail.com>; xmldatadumps-l@lists.wikimedia.org
> *Subject:* Re: [Xmldatadumps-l] Has anyone had success with data deduplication?
>
> The basic problem is that the page content dumps are ordered by revision number within each page, which makes good sense for dumps users but means that the addition of a single revision to a page will shift all of the remaining data, resulting in different compressed blocks. That's going to be true regardless of the compression type.
>
> In the not too distant future we might switch over to multi-stream output files for all page content, fixing the page id range per stream for bz2 files. This might let a user check the current list of page ids against the previous one and only get the streams with the pages they want, in the brave new Hadoop-backed object store of my dreams. 7z files are another matter altogether and I don't see how we can do better there without rethinking them altogether.
>
> Can you describe which dump files you are keeping and why having them in sequence is useful? Maybe we can find a workaround that will let you get what you need without keeping a bunch of older files.
>
> Ariel
>
> On Tue, Jul 28, 2020 at 8:48 AM Count Count <countvoncount123...@gmail.com> wrote:
>
> Hi!
>
> The underlying filesystem (ZFS) uses block-level deduplication, so unique chunks of 128KiB (default value) are only stored once. The 128KiB chunks making up dumps are mostly unique since there is no alignment, so deduplication will not help as far as I can see.
> Best regards,
>
> Count Count
>
> On Tue, Jul 28, 2020 at 3:51 AM griffin tucker <gtucker4....@hotmail.com> wrote:
>
> I’ve tried using FreeNAS/TrueNAS with a data deduplication volume to store multiple sequential dumps, however it doesn’t seem to save much space at all – I was hoping someone could point me in the right direction so that I can download multiple dumps and not have it take up so much room (uncompressed).
>
> Has anyone tried anything similar and had success with data deduplication?
>
> Is there a guide?
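To make the block-alignment problem above concrete: with fixed-size deduplication, a single revision inserted early in an uncompressed dump shifts every byte that follows it, so almost none of the 128KiB blocks of the new snapshot hash the same as blocks of the old one. A minimal sketch that estimates the overlap, assuming you point it at two uncompressed dump snapshots of your choice (the script itself is hypothetical, not part of any dumps tooling):

#!/usr/bin/env python3
# Rough estimate of how many fixed-size blocks two dump snapshots share.
# Usage: python3 block_overlap.py old-dump.xml new-dump.xml
import hashlib
import sys

BLOCK_SIZE = 128 * 1024  # matches the ZFS default recordsize mentioned above

def block_hashes(path):
    """Return the set of SHA-256 digests of fixed-size blocks in a file."""
    hashes = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            hashes.add(hashlib.sha256(block).digest())
    return hashes

if __name__ == "__main__":
    old_file, new_file = sys.argv[1], sys.argv[2]
    old_hashes = block_hashes(old_file)
    new_hashes = block_hashes(new_file)
    shared = len(old_hashes & new_hashes)
    print(f"{shared} of {len(new_hashes)} blocks in {new_file} also occur in {old_file}")
    # Expect a number close to zero: one inserted revision near the start of
    # the file moves every later block boundary, so the hashes no longer match.

For comparison, backup tools that use content-defined chunking (borg, restic and friends) cut blocks at positions derived from the data itself rather than at fixed offsets, so shifted-but-otherwise-identical data can still deduplicate.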
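And on the multi-stream idea Ariel describes: the existing pages-articles-multistream dumps already come with an index file (one offset:page_id:title line per page) that lets you seek to and decompress only the bz2 stream, typically around 100 pages, that contains a given page, instead of unpacking the whole archive. Once the page id range per stream is fixed, as Ariel proposes, comparing last month's index against this month's would tell you which streams you actually need to refetch. A rough sketch against the current format (the file names are only examples, substitute whatever dump you downloaded):

#!/usr/bin/env python3
# Extract the single bz2 stream holding a given page id from a
# pages-articles-multistream dump, using its index file.
# Usage: python3 fetch_stream.py <page_id>
import bz2
import sys

DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"         # example name
INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"  # example name

def find_stream_offsets(page_id):
    """Return (start, end) byte offsets of the stream holding page_id.

    Each index line is 'offset:page_id:title'; pages in the same stream
    share the same offset. end is None if the page is in the last stream.
    """
    start = None
    offsets = []
    with bz2.open(INDEX, "rt", encoding="utf-8") as idx:
        for line in idx:
            offset, pid, _title = line.split(":", 2)
            offset, pid = int(offset), int(pid)
            if not offsets or offsets[-1] != offset:
                offsets.append(offset)
            if pid == page_id:
                start = offset
    if start is None:
        raise KeyError(f"page id {page_id} not found in index")
    later = [o for o in offsets if o > start]
    return start, (min(later) if later else None)

def read_stream(start, end):
    """Decompress one bz2 stream out of the multistream dump file."""
    with open(DUMP, "rb") as f:
        f.seek(start)
        data = f.read(end - start) if end is not None else f.read()
    return bz2.BZ2Decompressor().decompress(data).decode("utf-8")

if __name__ == "__main__":
    start, end = find_stream_offsets(int(sys.argv[1]))
    xml_fragment = read_stream(start, end)
    print(xml_fragment[:2000])  # a run of <page> elements from that stream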
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l