sam wun <[EMAIL PROTECTED]> wrote on 01/20/2005 11:45:40 AM:

> 
> [EMAIL PROTECTED] wrote:
> 
> >
> >
> > SELECT DISTINCT i.basename
> > FROM inventory i
> > INNER JOIN transaction t
> >         ON i.prodcode = t.prodcode
> >         AND t.date >= '2004-01-01'
> >         AND t.date <= '2004-01-31'
> > INNER JOIN transaction tt
> >         ON i.prodcode = tt.prodcode
> >         AND tt.date >= '2005-01-01'
> >         AND tt.date <= '2005-01-31'
> > INNER JOIN customer c
> >         ON c.custcode = t.custcode
> >         AND c.custcode = tt.custcode
> > WHERE i.basename is not NULL
> >         and i.prodname is not NULL
> > order by i.basename
> >
> > This should give you a list of inventory basenames for all current 
> > customers (their names are still in the customer table) that "had 
> > transactions" (ordered products?) during both JAN 2004 and JAN 2005. 
> > This list will show only the products that were ordered during BOTH 
> > time periods BY THE SAME CUSTOMER at least once (at least one repeat 
> > sale, year to year, in JAN). Is this what you were after or was there 
> > a different question you were trying to answer?
> >
> Hi, while this is much faster than the subquery, I found there is a 
> "clumsy" way that is even faster than this series of inner joins.
> I discovered that if I create two different temporary tables with 
> "create view as select ..." for the 2 different periods, the join 
> between these temp tables is also much faster than the series-of-inner-
> joins approach.  With 160000 records in the transaction table, the 
> series of inner joins takes a very long time (more than 15 minutes) to 
> give out a result.
> 
> Any comment?
> Thanks
> Sam
> 
> > Shawn Green
> > Database Administrator
> > Unimin Corporation - Spruce Pine 
> 
> 

Once you start getting into larger sets of data (millions of rows), taking 
snapshots (as you did with your CREATE TABLE ... SELECT statements) can 
provide some significant performance boosts.
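
To make that concrete, here is a rough sketch of the snapshot approach 
applied to the tables from the query above. (The temporary table names 
jan2004_txn and jan2005_txn are made up for illustration, and the column 
list is only a guess at what the join actually needs.)

-- snapshot of the January 2004 transactions
CREATE TEMPORARY TABLE jan2004_txn
SELECT prodcode, custcode
FROM transaction
WHERE date >= '2004-01-01' AND date <= '2004-01-31';

-- snapshot of the January 2005 transactions
CREATE TEMPORARY TABLE jan2005_txn
SELECT prodcode, custcode
FROM transaction
WHERE date >= '2005-01-01' AND date <= '2005-01-31';

-- the final join now only works through the two small snapshots
SELECT DISTINCT i.basename
FROM inventory i
INNER JOIN jan2004_txn t
        ON i.prodcode = t.prodcode
INNER JOIN jan2005_txn tt
        ON i.prodcode = tt.prodcode
INNER JOIN customer c
        ON c.custcode = t.custcode
        AND c.custcode = tt.custcode
WHERE i.basename IS NOT NULL
        AND i.prodname IS NOT NULL
ORDER BY i.basename;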

Here is a numerical analysis of the two situations:

Assume we have a table with 1000000 (1 million = 1e6) rows in it that needs 
to be joined to itself for a query (much as in the query above). Computing 
the JOIN of two tables that each contain 1e6 rows could result in a 
virtual table containing up to 1e12 rows. The actual number of rows in the 
JOIN will be smaller than the pure Cartesian product because it will be 
limited to only those rows that match your ON conditions. However, the 
number 1e12 represents how many combinations of data the engine must 
decide between in order to calculate the results of the JOIN.

Let's imagine that instead of needing to join the entire table to itself, 
we only need to match 10000 (1e4) rows of the table's data against 
another set of 10000 rows (two nice round figures that could stand in for 
the actual number of "transaction" records during January of each year in 
the original query). A JOIN between those two sets of rows would result in 
a maximum Cartesian product of only 1e8 rows.

If it takes a fixed length of time (not a variable length of time) for the 
query engine to decide whether any one row of a JOIN's Cartesian product 
belongs to the final JOIN results (based on evaluating the ON conditions 
against each row combination), then the subset JOIN will reduce the time 
it takes to compute the virtual table by a factor of 10000 (1e12/1e8 = 
1e4). That means the query engine makes 999,900,000,000 fewer comparisons 
(1e12 - 1e8) to build the virtual table from the JOINed sub-tables than it 
would need to build the same table from the whole table joined to itself.
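
One way to sanity-check numbers like these on real data: EXPLAIN reports 
an estimated "rows" value for each table in a SELECT, and the product of 
those estimates gives a rough idea of how many row combinations the engine 
may have to examine. For example, prefixing the snapshot join from the 
sketch above (with its made-up jan2004_txn/jan2005_txn names) with EXPLAIN 
and comparing it against the same EXPLAIN run on the original three-way 
join should show the difference:

-- compare the product of the "rows" estimates from this plan with the
-- product from the original query's plan
EXPLAIN
SELECT DISTINCT i.basename
FROM inventory i
INNER JOIN jan2004_txn t
        ON i.prodcode = t.prodcode
INNER JOIN jan2005_txn tt
        ON i.prodcode = tt.prodcode
INNER JOIN customer c
        ON c.custcode = t.custcode
        AND c.custcode = tt.custcode
WHERE i.basename IS NOT NULL
        AND i.prodname IS NOT NULL;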

Your results seem to support this analysis. The time you spent creating 
the subsets of data (time to select rows + time to write them into a 
table, twice) was much, MUCH less than the time it would take to sort 
through all of the other possible combinations of data in your original 
table that didn't need to participate in your analysis.

As I said at the beginning, this is a query optimization technique for 
certain queries against "larger" datasets. For queries that deal with 
tables on the order of a few hundred thousand rows or less, it may not 
apply.  Your mileage will vary. Only actual testing will reveal the 
performance of one technique relative to the other.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
