OLAP Proposal for MySQL

Philip Stoev Mon, 25 Aug 2003 14:14:27 +0000

Hi all,

Please tell me if any of this makes sense. Any pointers to relevant
projects/articles will be much appreciated.


Philip Stoev
http://www.stoev.org/pivot/manifest.htm

===================================

OLAP PROPOSAL FOR MYSQL

The goal is to create an OLAP engine coupled with a presentation layer that
will be easy enough for normal people to use, with no MDX experience
required. While it is probably a fact that Wal-Mart has 70 GB of data, this
does not mean that all people have such data sets, so the goal is reasonable
performance for reasonably-sized datasets. Most people do not join 30 tables
together either. Also, while Wal-Mart presumably engages in extra-complex
calculations to determine business strategies, most people are often
content to know "How much did I sell yesterday?".

I. OLAP ENGINE AND CACHING

The OLAP "engine" takes a standard SQL query with a GROUP BY clause and
aggregate functions, executes it, and saves the entire resulting dataset in
the cache. A cache index entry is then created, noting the source tables,
the GROUP BY columns, the aggregate functions and the WHERE conditions that
were used.

Upon execution of further queries, the OLAP engine checks whether the cache
holds a dataset that can be used to answer the query immediately. This
covers any combination of the following cases:

1. The query's GROUP BY columns are equal to, or a subset of, those of the
cached query. So, a query like:
            SELECT salesman, state, SUM(sales) FROM company.sales
            GROUP BY salesman, state
provides the answer for
            SELECT salesman, SUM(sales) FROM company.sales GROUP BY salesman
2. The query's WHERE clause is equal to, or more restrictive than, the WHERE
clause of a cached query, and the additional conditions use only columns
that were GROUP BY-ed.
A query like:
            SELECT date, salesman, SUM(sales) FROM company.sales
            WHERE date > '2003-01-01' GROUP BY date, salesman
provides the answer for:
            SELECT date, salesman, SUM(sales) FROM company.sales
            WHERE date > '2003-01-01' AND date > '2003-06-01'
            GROUP BY date, salesman
Obviously, a human will not write a query with such a WHERE clause;
however, a graphical Pivot tool may be explicitly designed to create such a
query when drilling down so that a cache hit is scored.

3. The query's source tables are equal to, or a subset of, the cached
query's source tables.
So, the query:
            SELECT salesman, gender, SUM(sales) FROM company.sales
            INNER JOIN salesman USING (salesman_id) GROUP BY salesman, gender
or even something very complex with 10 joined tables, can be used to answer:
            SELECT salesman, SUM(sales) FROM company.sales GROUP BY salesman
or even something more complex with 5 joined tables. This assumes that the
extra joins neither drop nor duplicate rows of the sales table.
4. The query's aggregate functions are equal to, or a subset of, the cached
query's. Certain aggregate functions, like COUNT(DISTINCT), cannot be served
from the cache, and others require special care (AVG(value) must be
translated to SUM(value)/COUNT(value)).
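
The four rules above can be sketched as a single check in Python. This is
only an illustration, not a definitive design: the QueryShape class, its
field names and the can_answer function are hypothetical, aggregates are
compared as plain strings, and WHERE clauses are simplified to a set of
AND-ed (column, condition) pairs.

```python
from dataclasses import dataclass


# Hypothetical description of a query, as a cache index entry might record it.
@dataclass(frozen=True)
class QueryShape:
    tables: frozenset      # source tables
    group_by: frozenset    # GROUP BY columns
    aggregates: frozenset  # aggregate expressions, e.g. "SUM(sales)"
    where: frozenset       # AND-ed conditions as (column, condition) pairs


def normalize(aggregates):
    """Rewrite AVG(x) as SUM(x) plus COUNT(x); return None for COUNT(DISTINCT)."""
    out = set()
    for agg in aggregates:
        upper = agg.upper()
        if upper.startswith("COUNT(DISTINCT"):
            return None  # cannot be rebuilt from cached partial results
        if upper.startswith("AVG("):
            arg = agg[4:-1]  # text between "AVG(" and ")"
            out.update({f"SUM({arg})", f"COUNT({arg})"})
        else:
            out.add(agg)
    return frozenset(out)


def can_answer(cached, query):
    """Return True if the cached dataset can answer the query (rules 1 to 4)."""
    wanted = normalize(query.aggregates)
    have = normalize(cached.aggregates)
    if wanted is None or have is None:
        return False
    extra_conditions = query.where - cached.where
    return (
        query.group_by <= cached.group_by          # rule 1
        and cached.where <= query.where            # rule 2: query only adds conditions
        and all(col in cached.group_by for col, _ in extra_conditions)
        and query.tables <= cached.tables          # rule 3
        and wanted <= have                         # rule 4
    )


cached = QueryShape(
    tables=frozenset({"company.sales"}),
    group_by=frozenset({"date", "salesman"}),
    aggregates=frozenset({"SUM(sales)", "COUNT(sales)"}),
    where=frozenset({("date", "> '2003-01-01'")}),
)
query = QueryShape(
    tables=frozenset({"company.sales"}),
    group_by=frozenset({"salesman"}),
    aggregates=frozenset({"AVG(sales)"}),
    where=frozenset({("date", "> '2003-01-01'"), ("date", "> '2003-06-01'")}),
)
print(can_answer(cached, query))
```

A real implementation would need a proper SQL parser and a way to compare
conditions that are equivalent but written differently; the string
comparison here is deliberately naive.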

The benefit of such a cache implementation is that it is data-independent.
You do not have to describe your data prior to executing your queries. It
also does not rely on a custom cache structure or a custom cache index: a
few tables can hold the cache index, and they can themselves be queried
with SQL to determine a hit.
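
As a rough sketch of a cache index kept in ordinary tables, the Python
snippet below stores one row per cached dataset and one row per GROUP BY
column, then finds hits with a plain SQL query. The table names
(cache_entry, cache_group_col), the column names and the find_hits helper
are hypothetical, only rule 1 (GROUP BY columns) is checked, and SQLite
stands in for MySQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical cache index: one row per cached dataset ...
    CREATE TABLE cache_entry (entry_id INTEGER PRIMARY KEY, result_table TEXT);
    -- ... and one row per GROUP BY column of that dataset.
    CREATE TABLE cache_group_col (entry_id INTEGER, col TEXT);

    INSERT INTO cache_entry VALUES (1, '1234567'), (2, '7654321');
    INSERT INTO cache_group_col VALUES
        (1, 'salesman'), (1, 'state'), (2, 'date');
""")


def find_hits(conn, group_cols):
    """Return cached result tables whose GROUP BY columns cover group_cols.

    Assumes group_cols contains at least one column.
    """
    cols = sorted(set(group_cols))
    placeholders = ", ".join("?" for _ in cols)
    sql = f"""
        SELECT e.result_table
        FROM cache_entry e
        JOIN cache_group_col g ON g.entry_id = e.entry_id
        WHERE g.col IN ({placeholders})
        GROUP BY e.entry_id, e.result_table
        HAVING COUNT(DISTINCT g.col) = ?
        ORDER BY e.entry_id
    """
    return [row[0] for row in conn.execute(sql, (*cols, len(cols)))]


print(find_hits(conn, ["state"]))
print(find_hits(conn, ["date", "state"]))
```

Similar side tables for source tables, aggregate functions and WHERE
conditions would let the other rules be checked in the same way.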

If an interactive Pivoting tool is executing those queries, the cache should
(hopefully) soon fill with entries that allow most, if not all, of the
queries resulting from interactive browsing to be served from the cache.
Additionally, the tool can pre-fetch relevant data by
drilling down a bit more than the user has requested, resulting in a cache
hit when the user indeed drills deeper. Also, the tool does not have to
cache data in order to sort it on its own, since queries that differ only in
their ORDER BY can be served from the same cached dataset. An additional
enhancement would be the ability to serve a hit from the cache using more
than one cached table.

Example:

A. No cache hit, so we just populate the cache
Initial query:
            SELECT salesman, state, COUNT(*) FROM sales
            GROUP BY salesman, state
The server does:
            CREATE TABLE `1234567` SELECT salesman, state, COUNT(*) FROM sales
            GROUP BY salesman, state
            SELECT * FROM `1234567`

B. A cache hit
Initial query:
            SELECT state, COUNT(*) FROM sales GROUP BY state
The server does:
            SELECT state, SUM(`COUNT(*)`) AS `COUNT(*)` FROM `1234567`
            GROUP BY state
[`COUNT(*)` being a valid column name for table `1234567`]
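
The example can be tried end to end with a few lines of Python. This is a
sketch under assumptions: SQLite stands in for MySQL (so identifiers are
quoted with double quotes and CREATE TABLE needs an explicit AS), and the
sample rows and the amount column are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (salesman TEXT, state TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('ann', 'NY', 10), ('ann', 'NY', 20), ('ann', 'CA', 5),
        ('bob', 'CA', 7),  ('bob', 'CA', 3);

    -- Step A: no cache hit, so populate the cache table.
    CREATE TABLE "1234567" AS
        SELECT salesman, state, COUNT(*) AS "COUNT(*)"
        FROM sales GROUP BY salesman, state;
""")

# What the user asked for, run against the base table.
direct = conn.execute(
    "SELECT state, COUNT(*) FROM sales GROUP BY state ORDER BY state"
).fetchall()

# Step B: the same answer, re-aggregated from the cached dataset.
from_cache = conn.execute(
    'SELECT state, SUM("COUNT(*)") AS "COUNT(*)" FROM "1234567" '
    "GROUP BY state ORDER BY state"
).fetchall()

print(direct == from_cache)
```

The key point is that a cached COUNT(*) is re-aggregated with SUM, not with
COUNT, when the cached rows are rolled up to a coarser grouping.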

II. DATA DESCRIPTION AND MANIPULATION

1. In my humble opinion, people do not think in MDX. Instead, they think in
terms of GROUP BY. So, for most uses, it should be sufficient to allow the
user to construct his own GROUP BY statement and specify the aggregate
functions that he is interested in, rather than asking him to create a cube,
an axis, a view, a measure, etc, etc.

2. People also think in terms of everyday phrases, like "last 7 days" or
"all Mondays". A pre-compiled dictionary of such phrases will be immensely
useful, as well as the ability to define new phrases. People also like to
be able to do "call duration in 5-minute intervals", which is not available
in Microsoft Excel when working with columns of type "time".

3. Normal people do not expect all of their columns to be available for
analysis, and they do not want their report to have either 2 or 2000 rows.

For example, if you have a date column and you do a Microsoft Excel
PivotTable, you will first have to select that column from a list that
contains a bunch of other fields, then wait for the table to be generated
with a row for each date, and then group or sort the dates somehow to arrive
at the numbers that interest you. Other tools (at least in their example
scenarios) facing a date column will start with the data grouped by year,
and you then have to expand to month (the months often being shown as
numbers), and from there on to weeks and days, and the table has to refresh
and recalculate a dozen times for your convenience.

Instead, a person should have a list of phrases that she can use as rows and
columns, like "last 7 days per day", "all months since January by week",
etc. She will then be able to arrive precisely at the data that she wants to
see. Only one SQL query will be required.
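
One possible shape for such a phrase dictionary is sketched below in Python.
Everything here is an assumption made for illustration: the PHRASES mapping,
the build_query helper, the reading of "last 7 days" as the seven days
ending today, and the use of MySQL's YEARWEEK() function for weekly
grouping. The table and column names follow the earlier examples.

```python
from datetime import date, timedelta

# Hypothetical phrase dictionary: each phrase maps "today" to a date range
# and to the SQL expression used for grouping.
PHRASES = {
    "last 7 days per day": lambda today: (
        today - timedelta(days=6), today, "date"),
    "all months since January by week": lambda today: (
        date(today.year, 1, 1), today, "YEARWEEK(date)"),
}


def build_query(phrase, today):
    """Turn a phrase into the single SQL query that answers it."""
    start, end, bucket = PHRASES[phrase](today)
    return (
        f"SELECT {bucket} AS period, SUM(sales) FROM company.sales "
        f"WHERE date BETWEEN '{start}' AND '{end}' "
        f"GROUP BY {bucket}"
    )


print(build_query("last 7 days per day", date(2003, 8, 25)))
```

Because the phrase resolves to fixed dates before the query is built, two
users who pick the same phrase on the same day produce identical SQL, which
also helps the cache described in Section I.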

4. Data is not always perfect

If you store your data as 1 and 0, and your boss wants to see "yes" and
"no", this should be possible. If sales > $5000 means a pro salesman, then
the user should not have to display the raw sales number in a column, then
group on figures below $5000 and figures above $5000, and then separately
handle the salesmen that were hired too recently to be able to score.
Months and days of the week have names. Times of the day may be morning,
afternoon and evening, not (0..23:0..59:0..59). Times that are messed up due
to time zones can be adjusted on the fly without jeopardizing the work of
company software that relies on the data being messed up.
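
Relabeling of this kind can be done in the query itself with CASE
expressions, as the Python sketch below suggests. The calls table, its
columns, the sample rows and the 12:00 and 18:00 boundaries are all
hypothetical, and SQLite's strftime() is used to extract the hour where
MySQL would use HOUR().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (answered INTEGER, started TEXT);
    INSERT INTO calls VALUES
        (1, '2003-08-25 09:15:00'),
        (0, '2003-08-25 14:30:00'),
        (1, '2003-08-25 20:05:00'),
        (1, '2003-08-25 10:45:00');
""")

# Stored values stay untouched; only the report sees the friendly labels.
rows = conn.execute("""
    SELECT
        CASE answered WHEN 1 THEN 'yes' ELSE 'no' END AS answered_label,
        CASE
            WHEN CAST(strftime('%H', started) AS INTEGER) < 12 THEN 'morning'
            WHEN CAST(strftime('%H', started) AS INTEGER) < 18 THEN 'afternoon'
            ELSE 'evening'
        END AS part_of_day,
        COUNT(*)
    FROM calls
    GROUP BY answered_label, part_of_day
    ORDER BY answered_label, part_of_day
""").fetchall()

for row in rows:
    print(row)
```

Since the labels are computed at query time, other software that reads the
same table keeps seeing the original 1 and 0 values and raw timestamps.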

III. PRESENTATION

A mod_perl GUI is envisioned that will allow you to view and rotate your data
as you see fit. In particular, the following goals have been set:
            1. Fully bookmarkable URLs that people can mail around to others
so that they too can see the same report;
            2. Usage of phrases described in Section II to make access to
the most relevant portions of the report easier;
            3. Sorting, drilling up and down, expanding, contracting,
hiding, showing, axis-swapping, grouping and ungrouping, coloring, etc.,
etc.
            4. Tabs instead of drop-down lists, e.g. a tab for January, a
tab for February, etc.
            5. Access control, full logging, etc. etc.;
            6. Speed, speed, speed. Anything that is slower than Microsoft
Excel for comparable datasets should be optimized. Data may be queried (and
retrieved) in portions to provide concurrency and instant feedback to the
user. For example, if we have a table keyed by date, we can always retrieve
January, show it to the user, and then proceed to retrieve the other months
and keep displaying them as they arrive (which, as a side effect, may allow
other queries to slip in between, providing faster performance for everyone,
at least perceptually). Any queries that are known to run long (based on
timing previous invocations) should have a progress bar.
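
The month-by-month retrieval from goal 6 could look roughly like the Python
generator below. It is a sketch only: the fetch_by_month function, the
sales table layout and the sample rows are assumptions, dates are stored as
ISO strings, and SQLite again stands in for MySQL.

```python
import sqlite3


def fetch_by_month(conn, year):
    """Yield (month, rows) one month at a time so the caller can display early."""
    for month in range(1, 13):
        start = f"{year:04d}-{month:02d}-01"
        # First day of the following month, used as an exclusive upper bound.
        if month == 12:
            end = f"{year + 1:04d}-01-01"
        else:
            end = f"{year:04d}-{month + 1:02d}-01"
        rows = conn.execute(
            "SELECT salesman, SUM(amount) FROM sales "
            "WHERE date >= ? AND date < ? "
            "GROUP BY salesman ORDER BY salesman",
            (start, end),
        ).fetchall()
        yield month, rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (date TEXT, salesman TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('2003-01-05', 'ann', 10),
        ('2003-01-20', 'ann', 20),
        ('2003-02-03', 'bob', 7);
""")

chunks = []
for month, rows in fetch_by_month(conn, 2003):
    chunks.append((month, rows))  # a GUI would render each month here
print(chunks[0])
```

Each month is a separate short query, so the first rows can be shown before
the rest are fetched, and the number of months already delivered gives a
natural value for a progress bar.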


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
