I'll second Johan's comments.

"Count the disk hits!"

One minor change:  Don't store averages in the summary table; instead store the 
SUM().  That lets you get the mathematically correct AVERAGE over any time 
range via
SUM(sum_foo) / SUM(count_foo)
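
Something like this (a minimal sketch; table and column names are 
hypothetical):

    CREATE TABLE foo_summary (
        day        DATE   NOT NULL,
        sum_foo    BIGINT NOT NULL,   -- SUM(foo) for the day
        count_foo  INT    NOT NULL,   -- COUNT(foo) for the day
        PRIMARY KEY (day)
    );

    -- Mathematically correct average over any date range:
    SELECT SUM(sum_foo) / SUM(count_foo) AS avg_foo
      FROM foo_summary
     WHERE day >= '2012-03-01' AND day < '2012-06-01';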

Switching between MySQL and Mongo requires rewriting _all_ of the relevant code.

<opinion> NoSQL will be no better than MySQL for 150GB. </opinion>  "Count the 
disk hits!"

I recently built a system that topped out at 350GB (90 days' data).  It 
involved hourly ingestion of a few GB of data and a variety of "reports".  The 
prototype showed that most reports would take about an hour to run.  Not good.  
The final product, with summary tables, lets the reports run on demand, 
online, and each takes only a few seconds.  By careful use of MEMORY tables, 
LOAD DATA, etc., the ingestion takes 5 minutes (each hour) for the raw data and 
2 minutes (total) for the 7 summary tables.  PARTITIONing was vital to the 
design: once an hour a new partition is populated; once a day, 24 hourly 
partitions are rolled into a new daily partition and the 90-day-old partition 
is DROPped.
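
The roll/drop is ordinary partition maintenance.  A minimal sketch of the 
pattern (table, column, and partition names are made up, and only daily 
granularity is shown):

    CREATE TABLE raw_data (
        dt  DATETIME NOT NULL,
        val INT      NOT NULL
    )
    PARTITION BY RANGE (TO_DAYS(dt)) (
        PARTITION p20120314 VALUES LESS THAN (TO_DAYS('2012-03-15')),
        PARTITION p20120612 VALUES LESS THAN (TO_DAYS('2012-06-13')),
        PARTITION pFuture   VALUES LESS THAN MAXVALUE
    );

    -- Drop the 90-day-old partition: near-instant, unlike DELETEing
    -- millions of rows.
    ALTER TABLE raw_data DROP PARTITION p20120314;

    -- Split pFuture to open tomorrow's partition; the same REORGANIZE
    -- mechanism merges 24 hourly partitions into one daily partition.
    ALTER TABLE raw_data REORGANIZE PARTITION pFuture INTO (
        PARTITION p20120613 VALUES LESS THAN (TO_DAYS('2012-06-14')),
        PARTITION pFuture   VALUES LESS THAN MAXVALUE
    );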


> -----Original Message-----
> From: Johan De Meersman [mailto:vegiv...@tuxera.be]
> Sent: Wednesday, June 13, 2012 6:20 AM
> To: Manivannan S.
> Cc: mysql@lists.mysql.com
> Subject: Re: NoSQL help
> 
> 
> ----- Original Message -----
> > From: "Manivannan S." <manivanna...@spanservices.com>
> >
> > Hi all,
> >
> > [lots of data]
> > [slow reports]
> > [wooo NoSQL magic]
> 
> Not that I want to discourage you, but my standard first question is
> "why do you think NoSQL (let alone any specific product) is the right
> solution?" :-)
> 
> Don't get me wrong, it might be; but from what little I now know about
> your environment, it sounds like applying some data warehousing
> techniques might suffice - and being the cynical dinosaur that I am, I
> have a healthy reluctance about welding new technology onto a stable
> environment.
> 
> To speed up reporting (and note that these techniques are often applied
> even when implementing NoSQL solutions), it is usually a good first
> step to set up a process of data summarization.
> 
> Basically, you pre-calculate averages, medians, groupings, whatever you
> need for your reports, and your job also saves the last record ID it
> has processed; then on the next run, you only read the new records and
> update your summary tables to incorporate the new data (see the SQL
> sketch after the worked example below).
> 
> Suppose I have a table like this:
> 
> ID | Val
> --------
>  1     1
>  2     7
>  3     5
>  4    13
> 
> I want to report the average on a daily basis, and calculating that
> over those rows is unbearably slow because I'm running the process on a
> wristwatch from 1860 :-)
> 
> So I create a summary table, calculate (1+7+5+13)/4 = 6.5, and insert a
> record saying this:
> 
> Avg | elementCount | lastSeen
> -----------------------------
> 6.5              4          4
> 
> Now, over the course of the day, the elements 4, 17 and 2 get added
> with sequential row numbers. Instead of calculating
> (1+7+5+13+4+17+2)/7, which would be slow, I can substitute
> Avg*elementCount for the already-summarized data. Thus, I calculate
> (6.5*4 + 4+17+2)/7 = 7, which is a lot faster, and my summary table now
> looks like this:
> 
> Avg | elementCount | lastSeen
> -----------------------------
>   7              7          7
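> 
> In SQL, one such incremental run might look like this (a rough sketch,
> using the ID/Val and Avg/elementCount/lastSeen columns from the example;
> the table names are invented):
> 
>     -- Fold all rows past the saved high-water mark into the summary.
>     -- Reading `summary` inside the derived table is OK here: MySQL
>     -- materializes the derived table before running the UPDATE.
>     UPDATE summary
>     JOIN ( SELECT SUM(d.Val) AS sumNew,
>                   COUNT(*)   AS cntNew,
>                   MAX(d.ID)  AS maxID
>              FROM data d, summary s
>             WHERE d.ID > s.lastSeen ) AS new
>       SET summary.Avg          = (summary.Avg * summary.elementCount
>                                    + new.sumNew)
>                                  / (summary.elementCount + new.cntNew),
>           summary.elementCount = summary.elementCount + new.cntNew,
>           summary.lastSeen     = new.maxID
>     WHERE new.cntNew > 0;   -- no new rows, no update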
> 
> This is of course a stupid example, but it saves you a lot of time if
> you already have the summary of several thousand elements and only need
> to update it for a handful. Similar tricks are possible for a lot of
> typical reporting stuff - you don't need to re-calculate data for past
> months over and over again, for instance - and that's what makes your
> reports run fast.
> 
> 
> Just my 2 cents :-)
> /johan
> 
> --
> Beer with grenadine
> Is like mustard with wine
> She who drinks it is a prude
> He who drinks it is soon an ass
> 
