Perhaps dump it all out to SQL, compress it with xz, then restore it later
if someone really wants it?

Or split up the SQL dump and process it on a number of machines?
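
A minimal sketch of the archive-and-restore idea, assuming the dump has
already been written to a plain .sql file (for example by mysqldump). The
function names and file paths are made up for illustration; only Python's
standard lzma and shutil modules are used.

```python
import lzma
import shutil


def compress_dump(sql_path, xz_path):
    """Stream a plain SQL dump into an .xz archive without loading it all into memory."""
    with open(sql_path, "rb") as src, lzma.open(xz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def restore_dump(xz_path, sql_path):
    """Stream an .xz archive back out to a plain SQL file, ready to be re-imported."""
    with lzma.open(xz_path, "rb") as src, open(sql_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Streaming matters here because a dump of this size will not fit in memory.
Splitting the work across machines would need a dump per table or per date
range first, which this sketch does not attempt.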

Dean

On 2019-10-18 04:33, Doug Bell wrote:

> The latest outage lasted 5 days. We gave up trying to negotiate with the down 
> server and got someone to physically reboot it. Because we still have data in 
> MyISAM tables, this comes with a potential for a few issues, not the least of 
> which is it can take days to rebuild the MyISAM indexes after a hard reboot 
> (luckily that did not happen, and we seem to be back online). 
> 
> When I joined the project, one of the initial goals was to move away from
> MyISAM onto InnoDB (or, possibly, another DB entirely). My efforts to do
> that continually run into problems:
> 
> * Some parts of the data _will not_ convert to InnoDB as-is due to 
> differences between MyISAM and InnoDB. 
> * The program I wrote to modify that data to a different format which can 
> exist in InnoDB will take months to complete. 
> * Relatedly, I have no reason to suspect moving all that data to a different 
> database would take any less time. 
> * The only reason we need these two servers specifically and solely dedicated
> to the database is the database's size.
> 
> These issues all have a common root: There is a lot of data. I might say too 
> much data. 
> 
> CPAN Testers has accepted 100+ million test reports since it came online. 
> Some of these reports are for distributions no longer available on CPAN. 
> Reports are still being submitted for abandoned modules not updated in 
> decades for out-of-support Perl versions. Every development release of the 
> Perl interpreter gets tested against some (most? all?) of CPAN on multiple 
> platforms. This adds up to thousands of reports per day, and if the database
> were up I could check what percentage of them are ever seen by human eyes
> (my guess is 5-10%).
> 
> Even if the data is not seen by humans, it's useful in the aggregate:
> Regression analysis requires as much data as possible to make its hypotheses
> and suggestions. Nor does being old make the data useless: Old versions of
> modules can still be installable from CPAN, and folks are still running old
> versions of Perl.
> 
> That said, timely data is more useful than untimely data. Do we need reports 
> submitted in 2006? Data for modules only available on BackPAN isn't 
> actionable, so do we need to keep that information? 
> 
> In the end, irrelevant data is worse than useless: It is actively detrimental
> to the site's stability (as I mentioned above). For that reason, I propose to
> implement the following data retention policies:
> 
> 1. Full text reports will be kept a maximum of 5 years.
> 2. Report summaries will be kept for all distributions installable from CPAN
>    or, if no longer installable from CPAN, for 5 years.
>    * This means that someone will still know if a distribution passes/fails,
>      but if an author wants to know why, they'll have to reproduce it
>      themselves.
> 3. Along with (2), release summaries for distributions not installable from
>    CPAN and older than 5 years will be removed.
>    * This ensures that the release summaries can be rebuilt from the report
>      summaries, and that there isn't a strange difference in numbers between
>      the CPAN Testers website and consumers of the release data.
> 
> So, this means that for all distributions available on CPAN, we will still 
> know pass/fail/na/unknown and which Perls and platforms. For the first five 
> years after the report's submission, one can view the entire text of the 
> report. If the distribution is still on CPAN, the full text report will be 
> deleted 5 years after it was submitted, but the summary information will 
> remain. If the distribution is removed from CPAN, all reports and all summary 
> information older than 5 years will be deleted. 
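
The rules above can be sketched as a single decision function. This is only
an illustration under stated assumptions: the function and field names are
invented, "5 years" is approximated as 5 * 365 days, and whether a
distribution is still installable from CPAN is taken as a given input
rather than looked up anywhere.

```python
from datetime import datetime, timedelta

# "5 years" from the proposal, approximated as 5 * 365 days (leap days ignored).
RETENTION = timedelta(days=5 * 365)


def retention_decision(submitted, on_cpan, now):
    """Return which parts of one report survive under the proposed policy.

    submitted: datetime the report was submitted
    on_cpan:   True if the distribution is still installable from CPAN
    now:       datetime the purge is evaluated at
    """
    expired = now - submitted > RETENTION
    return {
        # Rule 1: full text is kept for at most 5 years, regardless of CPAN status.
        "full_text": not expired,
        # Rule 2: summaries are kept while the dist is on CPAN, otherwise for 5 years.
        "summary": on_cpan or not expired,
    }
```

For example, a report from 2006 for a distribution still on CPAN keeps its
summary but loses its full text, while the same report for a BackPAN-only
distribution loses both.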
> 
> Purging report text older than 5 years will reduce the database by about 
> half. For the 1TB database we have now, that reduces it to a svelte 500GB. If 
> we purge more, we gain more, though report submissions have been increasing 
> over the years: 
> 
> +-----------+----------+----------+----------+----------+---------+ 
> | total     | 5y       | 4y       | 3y       | 2y       | 1y      | 
> +-----------+----------+----------+----------+----------+---------+ 
> | 107822513 | 62514949 | 48597230 | 35256342 | 21516482 | 9889931 | 
> +-----------+----------+----------+----------+----------+---------+ 
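
A small sketch of the arithmetic behind the table, assuming each "Ny" column
counts the reports submitted within the last N years. The message does not
say so explicitly, so treat that reading as an assumption. The figures are
copied from the table; the names are made up for illustration.

```python
# Counts copied from the table above.
TOTAL = 107822513
# Assumed meaning: reports submitted within the last N years.
WITHIN_YEARS = {5: 62514949, 4: 48597230, 3: 35256342, 2: 21516482, 1: 9889931}


def purged(cutoff_years):
    """Reports dropped if only the last `cutoff_years` years are kept."""
    return TOTAL - WITHIN_YEARS[cutoff_years]


for years in sorted(WITHIN_YEARS, reverse=True):
    dropped = purged(years)
    print(f"keep {years}y: purge {dropped} reports ({dropped / TOTAL:.1%} of total)")
```

Report counts are not the same thing as bytes on disk, so this only
approximates how much the database itself would shrink.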
> 
> So, questions for those affected: 
> 
> * Do you look at text reports older than 5 years? 3 years? 1 year? 
> * Are test summaries useful to you without the full text of the report? 
> * Are pass/fail counts older than 5 years useful to you? 3 years? 1 year? 
> 
> I'd like to implement this sooner rather than later so I can build some 
> faster recovery systems, but I'll leave discussion open at least a week while 
> I develop the tools I need to do this anyway. 
> 
> Doug Bell 
> d...@preaction.me
