James Black <[EMAIL PROTECTED]> wrote on 04/27/2005 10:36:20 AM:

> I am curious if there is a simple way to do this. I was just asked to
> give some data that requires me to group by day of week and also group
> by hour in day (two different pages).
> 
> Thanx for any help.
> 

The simple answer (not good for frequent analysis) is to use the date and 
time functions to extract what you need from your data and group on those 
values (http://dev.mysql.com/doc/mysql/en/date-and-time-functions.html):

SELECT DAYOFWEEK(name_of_datetime_column), COUNT(1)
FROM tablename
GROUP BY 1;

(GROUP BY 1 is a MySQL shorthand way of saying "group on the values in 
the first column")
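
The hour-in-day grouping for your second page works the same way; a 
minimal sketch, reusing the placeholder names from above:

SELECT HOUR(name_of_datetime_column), COUNT(1)
FROM tablename
GROUP BY 1;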



If you are going to need to do this time-based analysis frequently on 
larger datasets, it will save you a lot of time to precompute certain 
values and add them as additional columns to the table. This is better 
than computing everything on the fly and may be sufficient for your 
needs, though it still isn't the best-performing solution. For instance, 
if you wanted to evaluate web traffic by month, week, day of month, day 
of year, day of week, and hour, then adding those columns to your data 
table will seriously speed up several common types of queries:

SELECT dayofyear, hour, SUM(bytes_transferred)
FROM data_table
WHERE year = 2005
GROUP BY dayofyear, hour;

This will give you the data you need to plot your hourly traffic (in 
bytes) for every day this year. Or you could say:

SELECT month, dayofmonth, hour, SUM(bytes_transferred)
FROM data_table
WHERE year = 2005
GROUP BY month, dayofmonth, hour;

You get the same information, but broken down by month, too. However, 
you still have to scan every row of the table (unless you allocate some 
serious index space), so while this is faster, it is still not the 
fastest way of doing this kind of analysis.
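
As a sketch of what that precomputation could look like: assuming your 
raw table is data_table and its datetime column is called request_time 
(a name I am making up here), you could add and backfill the columns 
like this:

ALTER TABLE data_table
        ADD COLUMN year INT,
        ADD COLUMN month INT,
        ADD COLUMN week INT,
        ADD COLUMN dayofyear INT,
        ADD COLUMN dayofmonth INT,
        ADD COLUMN dayofweek INT,
        ADD COLUMN hour INT;

-- backfill the new columns from the existing datetime values
UPDATE data_table
SET year       = YEAR(request_time),
    month      = MONTH(request_time),
    week       = WEEK(request_time),
    dayofyear  = DAYOFYEAR(request_time),
    dayofmonth = DAYOFMONTH(request_time),
    dayofweek  = DAYOFWEEK(request_time),
    hour       = HOUR(request_time);

You will also need to compute the same values for each new row as it is 
loaded, or the new columns will drift out of date.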

This general type of dimensional analysis is what the theory and 
practice of OLAP (online analytical processing) covers. Generally, to 
get faster results for statistical research over values or ranges of 
values, you precompute some of your source data's statistics and 
associate those values with various dimensions (in this example we are 
dealing with SUM(bytes_transferred) over periods of time). One way to do 
this is with a table that looks like:

CREATE TABLE olapBytesTransferred (
        -- time dimensions
        year int,
        month int,
        week int,
        dayofyear int,
        dayofmonth int,
        dayofweek int,
        hour int,
        -- precomputed statistics for each hour
        countBytes int,
        sumBytes float,
        avgBytes float,
        stdBytes float
)

and populate it periodically (say once a day or once an hour) with the 
numbers for the raw data you have collected since the last time you 
checked. Then, when it comes time to slice and dice your statistics, you 
don't have to go through the hundreds of millions of rows of raw log 
data because you already have it broken down and added up by the hour. 
To collect statistics by the day, you only need to add up 24 hours' 
worth of data for each day, which is much faster than slogging through, 
sorting, and analyzing 100,000 events per day from the raw logs. If you 
need finer control, add more time-division columns (our dimensions) to 
your table. Indexing these little monsters can be tricky because you 
need to balance the number of indexes (more indexes slow you down during 
INSERTs), index size (each index takes up space on disk, and indexes 
that fit into memory are faster than those that require swap space to 
load), and the performance you want (more indexes can mean faster 
SELECTs).
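
A sketch of that periodic load, assuming the raw table carries the 
precomputed time columns from above plus bytes_transferred (and picking 
an arbitrary just-finished hour for the WHERE clause):

INSERT INTO olapBytesTransferred
SELECT year, month, week, dayofyear, dayofmonth, dayofweek, hour,
        COUNT(1), SUM(bytes_transferred), AVG(bytes_transferred),
        STD(bytes_transferred)
FROM data_table
WHERE year = 2005 AND dayofyear = 117 AND hour = 10
GROUP BY year, month, week, dayofyear, dayofmonth, dayofweek, hour;

The day-level report then becomes a sum over at most 24 small rows per 
day instead of a scan of the raw logs:

SELECT dayofyear, SUM(sumBytes)
FROM olapBytesTransferred
WHERE year = 2005
GROUP BY dayofyear;

(You can safely add up the sums and counts this way, but you cannot 
average the averages; recompute the average at coarser granularities as 
SUM(sumBytes) / SUM(countBytes) if you need it.)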

You could even create breakdowns based on users per day (sticking with 
the web log example) or IP addresses per hour or ... you see where this 
is going. By extracting and extrapolating information from your raw logs 
and creating a table structure that is DESIGNED to support reports and 
statistical analysis, you can make that kind of information much easier 
to get to. Yes, it's much faster, but it's not as simple.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



