Re: [Wikitech-l] changes coming to large dumps
It seems that the pagecounts-ez sets disappeared from dumps.wikimedia.org starting this date. Is that a coincidence? Is it https://phabricator.wikimedia.org/T189283 perhaps?

DJ

On Thu, Mar 29, 2018 at 2:42 PM, Ariel Glenn WMF wrote:
> Here it comes:
>
> For the April 1st run and all following runs, the Wikidata dumps of
> pages-meta-current.bz2 will be produced only as separate downloadable
> files; no recombined single file will be produced.
>
> No other dump jobs will be impacted.
>
> A reminder that each of the single downloadable pieces has the siteinfo
> header and the mediawiki footer, so they may all be processed separately by
> whatever tools you use to grab data out of the combined file. If your
> workflow supports it, they may even be processed in parallel.
>
> I am still looking into what the best approach is for the pages-articles
> dumps.
>
> Please forward wherever you deem appropriate. For further updates, don't
> forget to check the Phab ticket! https://phabricator.wikimedia.org/T179059

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] changes coming to large dumps
Here it comes:

For the April 1st run and all following runs, the Wikidata dumps of pages-meta-current.bz2 will be produced only as separate downloadable files; no recombined single file will be produced.

No other dump jobs will be impacted.

A reminder that each of the single downloadable pieces has the siteinfo header and the mediawiki footer, so they may all be processed separately by whatever tools you use to grab data out of the combined file. If your workflow supports it, they may even be processed in parallel.

I am still looking into what the best approach is for the pages-articles dumps.

Please forward wherever you deem appropriate. For further updates, don't forget to check the Phab ticket! https://phabricator.wikimedia.org/T179059

On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF wrote:
> A reprieve! Code's not ready and I need to do some timing tests, so the
> March 20th run will do the standard recombining.
>
> For updates, don't forget to check the Phab ticket!
> https://phabricator.wikimedia.org/T179059
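[Editor's note: since every downloadable piece is a complete XML document (its own siteinfo header and closing mediawiki footer), the pieces can be streamed independently. A minimal Python sketch of parallel per-piece processing; the file-name pattern and the page-counting job are illustrative placeholders, not part of the dumps tooling:]

```python
import bz2
import glob
import xml.etree.ElementTree as ET
from multiprocessing import Pool

def count_pages(path):
    """Stream one dump piece and count its <page> elements.

    This works on a single piece because each piece is a complete,
    well-formed XML document, per the announcement above.
    """
    count = 0
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            # strip any XML namespace before comparing tag names
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "page":
                count += 1
                elem.clear()  # release parsed content as we stream
    return count

def count_all(pattern="wikidatawiki-*-pages-meta-current*.bz2"):
    """Process all pieces in parallel, one worker per file."""
    files = sorted(glob.glob(pattern))
    with Pool() as pool:
        return dict(zip(files, pool.map(count_pages, files)))
```

The same pattern applies to any per-piece job (extraction, filtering, reformatting): because no piece depends on another, a plain process pool is enough.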
Re: [Wikitech-l] changes coming to large dumps
A reprieve! Code's not ready and I need to do some timing tests, so the March 20th run will do the standard recombining.

For updates, don't forget to check the Phab ticket! https://phabricator.wikimedia.org/T179059

On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF wrote:
> Please forward wherever you think appropriate.
>
> For some time we have provided multiple numbered pages-articles bz2 files
> for large wikis, as well as a single file with all of the contents combined
> into one. This is consuming enough time for Wikidata that it is no longer
> sustainable. For wikis where the size of the files to recombine is "too
> large", we will skip this recombine step. This means that downloader
> scripts relying on this file will need to check its existence and, if it's
> not there, fall back to downloading the multiple numbered files.
>
> I expect to get this done and deployed by the March 20th dumps run. You
> can follow along here: https://phabricator.wikimedia.org/T179059
>
> Thanks!
>
> Ariel
Re: [Wikitech-l] changes coming to large dumps
That's really big. :-)

2018-03-05 13:10 GMT+01:00 Ariel Glenn WMF:
> We'll probably start at 20GB, which means that Wikidata will be the only
> wiki affected for now.
>
> Ariel

--
Bináris
Re: [Wikitech-l] changes coming to large dumps
We'll probably start at 20GB, which means that Wikidata will be the only wiki affected for now.

Ariel

On Mon, Mar 5, 2018 at 1:40 PM, Bináris wrote:
> Could you please translate "too large" to megabytes?
Re: [Wikitech-l] changes coming to large dumps
Could you please translate "too large" to megabytes?

2018-03-05 12:10 GMT+01:00 Ariel Glenn WMF:
> Please forward wherever you think appropriate.
>
> For some time we have provided multiple numbered pages-articles bz2 files
> for large wikis, as well as a single file with all of the contents combined
> into one. This is consuming enough time for Wikidata that it is no longer
> sustainable. For wikis where the size of the files to recombine is "too
> large", we will skip this recombine step. This means that downloader
> scripts relying on this file will need to check its existence and, if it's
> not there, fall back to downloading the multiple numbered files.
>
> I expect to get this done and deployed by the March 20th dumps run. You
> can follow along here: https://phabricator.wikimedia.org/T179059
>
> Thanks!
>
> Ariel

--
Bináris
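[Editor's note: the fallback the announcement asks of downloader scripts — try the recombined file, otherwise fetch the numbered pieces — can be sketched as below. This is a minimal illustration; the function names are hypothetical and the URLs are whatever your script already uses for dumps.wikimedia.org:]

```python
import urllib.error
import urllib.request

def url_exists(url, timeout=30):
    """HEAD request: True if the server says the file is published."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        # e.g. 404 when the recombined file was skipped for this wiki
        return False

def files_to_fetch(combined_url, piece_urls, exists=url_exists):
    """Prefer the single recombined file; if it is absent, fall back
    to the list of numbered piece files."""
    if exists(combined_url):
        return [combined_url]
    return list(piece_urls)
```

Taking the existence check as a parameter keeps the policy testable without network access; a real script would then download whichever list is returned.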