You are on the right track. Don't forget to use the EXPLAIN command to 
help you tune your indexes.
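
For example, against the query you posted (a minimal sketch; I am 
assuming the table and column names from your message):

EXPLAIN
SELECT vd.id, vd.session
FROM vardata vd
LEFT JOIN vars vn ON vn.id = vd.nameid
WHERE vd.campaignId = 17;

Watch the "key" and "rows" columns in the output: if the engine reports 
no usable key on vardata.campaignId or vars.id, it is scanning entire 
tables for every combination it tries.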

One other thing that may help is to make extracts of some of your 
tables as temp tables and use them in the final JOIN query that pulls it 
all together. For instance, you could make a temp table of vardata that 
only has the IDs for campaign 17, and an extract of vars where name IN 
('content','browser','bandwidth'). Then build your report from those 
extracts instead of the full tables. When you build JOINs whose cross 
products approach hundreds of millions of records, you can easily start to 
see slowdowns. One way around that is to minimize the amount of data you 
actually JOIN together. 

If you have 4 tables of 1000 rows each all joined together in a single 
query, that's a cross product of 1000x1000x1000x1000 or 1.0e12 (1 
trillion) possible record combinations that the engine must evaluate 
before it can finish building the internal JOIN table. If you can 
pre-query two of those tables down to 20 rows apiece, so that you are 
only joining 20x20x1000x1000, that's just 4.0e8 (400 million) 
combinations: you have just eliminated 999,600,000,000 (nearly a 
trillion) row combinations from the evaluation phase. How much faster 
do you think that will be, eh?

You may also want to look at other threads in this list that discuss 
something called a "crosstab query" or "pivot table".  You are actually 
building one and perhaps the techniques discussed for flipping a single 
table can help you optimize this query, too.
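
In its simplest form, a crosstab over a single name-value table looks 
like this (a sketch only; the table "attrs" and its columns are made up 
for illustration):

SELECT session,
        MAX(IF(name='browser',   value, NULL)) AS browser,
        MAX(IF(name='bandwidth', value, NULL)) AS bandwidth
FROM attrs
GROUP BY session;

The GROUP BY collapses all of a session's name-value rows into one row, 
and each MAX(IF(...)) plucks out the single non-NULL value for its 
column. That is exactly the pattern in the query below.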

Off the top of my head, here is an attempt to help speed things up:

CREATE TEMPORARY TABLE tmpVardata (KEY (id, session), KEY (nameid))
SELECT id, session, nameid
FROM vardata
WHERE campaignId = 17;

CREATE TEMPORARY TABLE tmpVars (KEY (id))
SELECT id, name, value
FROM vars
WHERE name IN ('content','browser','bandwidth');

SELECT vd.id AS id,
        vd.session AS session,
        MAX(IF(tv.name='content',   tv.value, NULL)) AS content,
        MAX(IF(tv.name='browser',   tv.value, NULL)) AS browser,
        MAX(IF(tv.name='bandwidth', tv.value, NULL)) AS bandwidth
FROM tmpVardata vd
LEFT JOIN data d
        ON d.session = vd.session
LEFT JOIN tmpVars tv
        ON tv.id = vd.nameid
        OR tv.id = d.nameid
GROUP BY 1, 2;

DROP TEMPORARY TABLE tmpVardata, tmpVars;

(I used a shorthand in the GROUP BY clause to group by column positions 
instead of names.) Can you see what I am trying to do? First I shrink a 
couple of tables, then I eliminate the repeated per-variable joins. It 
may be close but I doubt I got it just right.
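
Two practical notes on the temp tables: they are only visible to the 
connection that created them, so build them and run the report in the 
same session. And if you want to gauge the payoff before running the 
big SELECT, check how small the extracts actually came out:

SELECT COUNT(*) FROM tmpVardata;
SELECT COUNT(*) FROM tmpVars;

If those counts are still in the millions, the extracts are not buying 
you much and the WHERE clauses need to be tighter.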

HTH,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



Mathew Ray <[EMAIL PROTECTED]> wrote on 03/24/2005 12:47:20 PM:

> Hi Shawn,
> 
> I think I may have found a solution, but the query takes quite a while 
> to run...here is what I have now paraphrased:
> 
> 
> SELECT
> vd.id AS id,
> vd.session AS session,
> vl.value AS content,
> vd2.varvalue AS browser,
> vl3.value AS bandwidth
> 
> FROM vardata AS vd
> 
> LEFT OUTER JOIN vars AS vn ON ( vn.name = 'content' AND vn.id = vd.nameid )
> LEFT OUTER JOIN valuelookup AS vl ON ( vl.id = vd.valueid )
> 
> LEFT OUTER JOIN data AS vd2 ON ( vd2.session = vd.session )
> LEFT OUTER JOIN vars AS vn2 ON ( vn2.id = vd2.nameid )
> 
> /* and so forth for each variable... */
> WHERE vd.campaignId = 17
> AND vn.name = 'content'
> AND vn2.name = 'browser'
> AND vn3.name = 'bandwidth'
> 
> 
> This would result in the following 'flattened' structure:
> 
> [id]   [session]   [content]   [browser]      [bandwidth]
>  22     55          intro       MSIE 6.0       hi
>  23     55          form1       MSIE 6.0       hi
>  23     56          intro       Firefox 1.0    lo
> 
> Only problem is that the status of the thread stays in 'statistics' 
> forever... 10 mins last time I checked before cancelling it...
> 
> Seems like the time is increasing quite drastically as I pile on more 
> variable comparisons... I am guessing because I am increasing that 
> number of comparisons that must be done...
> 
> Thanks,
> Mathew
> 
> [EMAIL PROTECTED] wrote:
> > Mathew Ray <[EMAIL PROTECTED]> wrote on 03/24/2005 10:42:51 AM:
> > 
> > 
> >>Been searching for a while and can't seem to come up with any good 
> >>answers to this - I have a normalized structure of about 5 tables that I 
> >>need to denormalize into one big representation of the entire structure. 
> >>Anyone know of a good tool or resource to 'flatten' my tables easily?
> >>
> >>I've got a query that removes all foreign key info and creates a column 
> >>that holds the linked value, but my problem is that I have some columns 
> >>that change with every row and others that should only be logged once 
> >>per row (responses to polls for example).
> >>
> >>Here is a simplified structure:
> >>[data]
> >>   id (int)
> >>   nameid (int)
> >>   valueid (int)
> >>
> >>[vars]
> >>   id (int)
> >>   name (varchar 255)
> >>
> >>[valuelookup]
> >>   id (int)
> >>   value (varchar 255)
> >>   varid  (int)
> >>
> >>Some variables, like 'browsertype', are consistent for every item, 
> >>others like 'content' change for every row. The 'poll' variable contains 
> >>multiple choice responses, but shouldn't be double counted when the 
> >>table is de-normalized. I looked at some open source E-T-L tools, but 
> >>they look pretty complicated, and I feel there is a way of doing this 
> >>with sql somehow. Any idea on how to approach this?
> >>-- 
> >>Thanks,
> >>Mathew
> >>
> > 
> > 
> > Show us some concrete examples of how you want your denormalized data to 
> > look and I am sure we can help.
> > 
> > Shawn Green
> > Database Administrator
> > Unimin Corporation - Spruce Pine
> 
> 
