I'll second Johan's comments. "Count the disk hits!"
One minor change: Don't store averages in the summary table; instead store the SUM(). That lets you get the mathematically correct AVERAGE over any time range via SUM(sum_foo) / SUM(count_foo). (See the first sketch below.)

Switching between MySQL and Mongo requires rewriting _all_ of the relevant code. <opinion> NoSQL will be no better than MySQL for 150GB. </opinion>

"Count the disk hits!"

I recently built a system that topped out at 350GB (90 days' data). It involved hourly ingestion of a few GB of data and a variety of "reports". The prototype showed that most reports would take about an hour to run. Not good. The final product, with summary tables, lets the reports be run on demand and online, and each takes only a few seconds.

By careful use of MEMORY tables, LOAD DATA, etc., the ingestion takes 5 minutes (each hour) for the raw data and 2 minutes (total) for the 7 summary tables.

PARTITIONing was vital for the design. Once an hour a new partition is populated; once a day, 24 hourly partitions are rolled into a new daily partition and the 90-day-old partition is DROPped. (Rough sketches of the hourly load and the partition juggling follow.)
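To make the SUM() point concrete, here is the kind of summary table I mean. This is only a sketch; summary_foo, raw_events and every column name are invented, not anyone's actual schema:

    CREATE TABLE summary_foo (
        hr        DATETIME NOT NULL,    -- timestamp truncated to the hour
        count_foo BIGINT   NOT NULL,
        sum_foo   DOUBLE   NOT NULL,
        PRIMARY KEY (hr)
    ) ENGINE=InnoDB;

    -- Hourly roll-up of the raw rows just loaded:
    INSERT INTO summary_foo (hr, count_foo, sum_foo)
        SELECT DATE_FORMAT(dt, '%Y-%m-%d %H:00:00'), COUNT(*), SUM(foo)
            FROM raw_events
            WHERE dt >= '2012-06-13 02:00:00' AND dt < '2012-06-13 03:00:00'
            GROUP BY 1;

    -- The mathematically correct average over any range, straight from the summary:
    SELECT SUM(sum_foo) / SUM(count_foo) AS avg_foo
        FROM summary_foo
        WHERE hr >= '2012-06-01' AND hr < '2012-06-08';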
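The hourly load had roughly this shape. Again just a sketch: the file path, table and column names are invented, and the real job did more massaging and error checking in the staging table:

    CREATE TABLE stage_events (
        dt  DATETIME NOT NULL,
        foo INT      NOT NULL
    ) ENGINE=MEMORY;

    -- Pull the hour's feed file into the MEMORY staging table:
    LOAD DATA INFILE '/data/feed/2012061302.csv'
        INTO TABLE stage_events
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n'
        (dt, foo);

    -- Clean up in staging, then copy into the partitioned raw table;
    -- the summary tables are built from the same hour's rows:
    INSERT INTO raw_events (dt, foo)
        SELECT dt, foo FROM stage_events;

    TRUNCATE TABLE stage_events;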
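And the partition juggling. Sketch only: RANGE COLUMNS needs 5.5+, and every partition name and timestamp here is made up. The point is that the hourly and daily maintenance is a handful of ALTERs plus one DROP PARTITION for the expired day:

    CREATE TABLE raw_events (
        dt  DATETIME NOT NULL,
        foo INT      NOT NULL
    ) ENGINE=InnoDB
    PARTITION BY RANGE COLUMNS (dt) (
        PARTITION p2012061300 VALUES LESS THAN ('2012-06-13 01:00:00'),
        PARTITION p2012061301 VALUES LESS THAN ('2012-06-13 02:00:00'),
        PARTITION pfuture     VALUES LESS THAN (MAXVALUE)
    );

    -- Once an hour: carve the next hour's partition out of the catch-all.
    ALTER TABLE raw_events REORGANIZE PARTITION pfuture INTO (
        PARTITION p2012061302 VALUES LESS THAN ('2012-06-13 03:00:00'),
        PARTITION pfuture     VALUES LESS THAN (MAXVALUE)
    );

    -- Once a day: roll yesterday's 24 hourly partitions into one daily partition
    -- (assumes p2012061200 .. p2012061223 exist from the previous day's hourly runs):
    ALTER TABLE raw_events REORGANIZE PARTITION
        p2012061200, p2012061201, p2012061202, p2012061203, p2012061204, p2012061205,
        p2012061206, p2012061207, p2012061208, p2012061209, p2012061210, p2012061211,
        p2012061212, p2012061213, p2012061214, p2012061215, p2012061216, p2012061217,
        p2012061218, p2012061219, p2012061220, p2012061221, p2012061222, p2012061223
    INTO (PARTITION p20120612 VALUES LESS THAN ('2012-06-13 00:00:00'));

    -- ... and drop the partition that just crossed the 90-day line.
    ALTER TABLE raw_events DROP PARTITION p20120314;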
> -----Original Message-----
> From: Johan De Meersman [mailto:vegiv...@tuxera.be]
> Sent: Wednesday, June 13, 2012 6:20 AM
> To: Manivannan S.
> Cc: mysql@lists.mysql.com
> Subject: Re: NoSQL help
>
>
> ----- Original Message -----
> > From: "Manivannan S." <manivanna...@spanservices.com>
> >
> > Hi all,
> >
> > [lots of data]
> > [slow reports]
> > [wooo NoSQL magic]
>
> Not that I want to discourage you, but my standard first question is
> "why do you think NoSQL (let alone any specific product) is the right
> solution?" :-)
>
> Don't get me wrong, it might be; but from what little I now know about
> your environment, it sounds like applying some data warehousing
> techniques might suffice - and being the cynical dinosaur that I am, I
> have a healthy reluctance about welding new technology onto a stable
> environment.
>
> To speed up reporting (and note that these techniques are often applied
> even when implementing NoSQL solutions, too) it is usually a good first
> step to set up a process of data summarization.
>
> Basically, you pre-calculate averages, medians, groupings, whatever you
> need for your reports; and your job also saves the last record IDs it's
> processed; then on the next run, you only read the new records and
> update your summary tables to incorporate the new data.
>
> Suppose I have a table like this:
>
> ID | Val
> --------
>  1    1
>  2    7
>  3    5
>  4   13
>
> I want to report the average on a daily basis, and calculating that
> over those rows is unbearably slow because I'm running the process on a
> wristwatch from 1860 :-)
>
> So I get a summary table, calculate (1+7+5+13)/4 = 6.5 and that then
> gets a record saying this:
>
> Avg | elementCount | lastSeen
> -----------------------------
> 6.5        4            4
>
> Now, over the course of the day, the elements 4, 17 and 2 get added
> with sequential row numbers. Instead of calculating
> (1+7+5+13+4+17+2)/7, which would be slow; I can substitute the already
> summarized data by Avg*elementCount. Thus, I calculate (6.5*4 +
> 4+17+2)/7 = 7, which is a lot faster, and my summary table now looks
> like this:
>
> Avg | elementCount | lastSeen
> -----------------------------
>  7         7            7
>
> This is of course a stupid example, but it saves you a lot of time if
> you already have the summary of several thousand elements and only need
> to update it for a handful. Similar tricks are possible for a lot of
> typical reporting stuff - you don't need to re-calculate data for past
> months over and over again, for instance - and that's what makes your
> reports run fast.
>
>
> Just my 2 cents :-)
> /johan
>
> --
> Bier met grenadyn
> Is als mosterd by den wyn
> Sy die't drinkt, is eene kwezel
> Hy die't drinkt, is ras een ezel
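To tie Johan's worked example back to my SUM() comment: kept as a sum plus a count instead of an average, his incremental refresh is just this (toy names only, and the 4 in the WHERE clause is the lastSeen value the job read before the run):

    CREATE TABLE t      (id INT PRIMARY KEY, val INT);
    CREATE TABLE t_summ (sum_val BIGINT, cnt BIGINT, last_seen INT);

    INSERT INTO t VALUES (1,1),(2,7),(3,5),(4,13);
    INSERT INTO t_summ VALUES (26, 4, 4);          -- 1+7+5+13 = 26, 4 rows, last id seen = 4

    INSERT INTO t VALUES (5,4),(6,17),(7,2);       -- the day's new rows

    -- Incremental refresh: only the rows past last_seen get read.
    UPDATE t_summ s
    JOIN ( SELECT SUM(val) AS sv, COUNT(*) AS c, MAX(id) AS m
               FROM t
               WHERE id > 4 ) n                    -- 4 = last_seen, read before the run
    SET s.sum_val   = s.sum_val + n.sv,
        s.cnt       = s.cnt     + n.c,
        s.last_seen = n.m;

    SELECT sum_val / cnt AS avg_val FROM t_summ;   -- (26 + 23) / 7 = 7, same as Johan's answer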