On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverr...@gmail.com> wrote:
> On Thu, May 17, 2012 at 1:52 AM, Anthony <wikim...@inbox.org> wrote:
>> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverr...@gmail.com> wrote:
>> > Anthony, the process is linear: you have a PHP script inserting X number
>> > of rows per Y time frame.
>>
>> Amazing. I need to switch all my databases to MySQL. It can insert X
>> rows per Y time frame, regardless of whether the database is 20
>> gigabytes or 20 terabytes in size, regardless of whether the average
>> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
>> RAID array or a cluster of servers, etc.
>
> When referring to X over Y time, it's an average of, say, 1000 revisions
> per minute; any X over Y period must be considered with averages in mind,
> or getting a count wouldn't be possible.
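Averages only mean something if the rate actually stays flat as the table grows. If you wanted to check that instead of asserting it, something like the rough sketch below would do it: log the achieved insert rate in one-minute windows while the table fills up. The connection details and table layout here are made up for illustration, and it assumes the MySQLdb driver.

    import time
    import MySQLdb  # assumes the MySQLdb (mysqlclient) driver is installed

    # Rough sketch: insert dummy rows in batches and report the achieved rate
    # in one-minute windows, to see whether "X rows per Y time frame" actually
    # stays constant as the table grows. The table and column names are
    # illustrative, not the real MediaWiki schema.

    db = MySQLdb.connect(host="localhost", user="wiki", passwd="secret", db="wikidb")
    cur = db.cursor()

    BATCH = 1000    # rows per INSERT batch
    WINDOW = 60     # seconds per reporting window

    def fake_batch(size=BATCH):
        """Stand-in for revisions parsed out of a dump file."""
        return [("some revision text", "utf-8")] * size

    window_start = time.time()
    rows_in_window = 0
    while True:
        cur.executemany(
            "INSERT INTO revision_text (old_text, old_flags) VALUES (%s, %s)",
            fake_batch())
        db.commit()
        rows_in_window += BATCH
        elapsed = time.time() - window_start
        if elapsed >= WINDOW:
            print("%.0f rows/minute" % (rows_in_window * 60.0 / elapsed))
            window_start, rows_in_window = time.time(), 0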
The *average* en.wikipedia revision is more than twice the size of the
*average* simple.wikipedia revision. The *average* performance of a 20 gig
database is faster than the *average* performance of a 20 terabyte database.
The *average* performance of your laptop's thumb drive is different from the
*average* performance of a(n array of) drive(s) which can handle 20 terabytes
of data.

> If you set up your server/hardware correctly it will compress the text
> information during insertion into the database

Is this how you set up your simple.wikipedia test? How long does it take to
import the data if you're using the same compression mechanism as WMF (which
you didn't answer, but I assume is concatenation and compression)? How
exactly does this work "during insertion" anyway? Does it intelligently group
sets of revisions together to avoid decompressing and recompressing the same
revision several times? I suppose it's possible, but that would introduce
quite a lot of complication into the import script, slowing things down
dramatically. What about the answers to my other questions?

>> If you want to put your money where your mouth is, import
>> en.wikipedia. It'll only take 5 days, right?
>
> If I actually had a server or the disc space to do it I would, just to
> prove your smartass comments as stupid as they actually are. However, given
> my current resource limitations (fairly crappy internet connection, older
> laptops, and lack of HDD) I tried to select something that could give
> reliable benchmarks. If you're willing to foot the bill for the new
> hardware, I'll gladly prove my point.

What you seem to be saying is that you're *not* putting your money where your
mouth is.

Anyway, if you want, I'll make a deal with you. A neutral third party rents
the hardware at Amazon Web Services (AWS). We import the simple.wikipedia
full history (concatenating and compressing during import). We take the ratio
of the number of revisions in en.wikipedia to the number of revisions in
simple.wikipedia. We import the en.wikipedia full history (concatenating and
compressing during import). If the ratio of the time it takes to import
en.wikipedia vs. simple.wikipedia is greater than or equal to twice the
revision ratio, then you reimburse the third party. If the ratio of import
times is less than twice the revision ratio (you claim it is linear,
therefore it'll be the same ratio), then I reimburse the third party.

Either way, we save the new dump, with the processing already done, and send
it to archive.org (and to WMF, if they're willing to host it). So we actually
get a useful result out of this; it's not just for the purpose of settling an
argument.

Either of us can concede defeat at any point and stop the experiment. At that
point, if the neutral third party wishes to pay to continue the job, s/he
would be responsible for the additional costs.

Shouldn't be too expensive. If you concede defeat after 5 days, then your
CPU-time costs are $54 (assuming an Extra Large High Memory Instance).
Including 4 terabytes of EBS (which should be enough if you compress on the
fly) for 5 days should be less than $100. I'm tempted to do it even if you
don't take the bet.
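In case the terms above aren't clear, the settlement rule and the $54 figure boil down to this rough sketch. The $0.45/hour rate is my assumption (it's simply the number that makes 120 hours come out to $54), and the function and names are purely illustrative:

    # Illustrative only: the hourly rate and names are assumptions, not AWS's
    # published terms; the $54 figure above is what 120 hours at $0.45/hour gives.

    HOURS = 5 * 24          # 5 days of wall-clock time
    HOURLY_RATE = 0.45      # USD/hour, assumed instance rate
    print("CPU cost after 5 days: $%.0f" % (HOURS * HOURLY_RATE))  # -> $54

    def who_reimburses(time_en, time_simple, revisions_en, revisions_simple):
        """Settlement rule: compare the import-time ratio (en vs. simple)
        against twice the revision-count ratio (en vs. simple)."""
        time_ratio = time_en / time_simple
        revision_ratio = revisions_en / revisions_simple
        # If imports really scale linearly, time_ratio ~= revision_ratio,
        # well under 2 * revision_ratio, and I pay; otherwise John does.
        return "John" if time_ratio >= 2 * revision_ratio else "Anthony"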