Re: Data Warehouse on MySQL questions

2008-04-07 Thread Dre

Hey Baron,

Your blog post was quite informative; your suggestion to use a 
combination of merged MyISAM tables and InnoDB for the live partition 
made a lot of sense, and it sounds like the path I'll need to follow.


I appreciate the information!


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]



Re: Data Warehouse on MySQL questions

2008-04-06 Thread Baron Schwartz
Hi,

I'll just address the things others didn't answer.

On Thu, Apr 3, 2008 at 2:28 PM, Dre <[EMAIL PROTECTED]> wrote:
>  1) Several sources seem to suggest MyISAM is a good choice for data
> warehousing, but due to my lack of experience in a transaction-less world,

One approach you might consider is using MyISAM tables and MERGE
tables together.  This makes it much easier to remove old data, if
that is a concern, and keeps each table a manageable size.  For
example, put each day's data in a new MyISAM table and then create a
MERGE over them.
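
As a concrete illustration of the daily-table-plus-MERGE layout (the
table and column names here are made up for the example):

```sql
-- One MyISAM table per day; names and columns are illustrative.
CREATE TABLE sales_20080401 (
    sale_date  DATE NOT NULL,
    product_id INT NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,
    KEY (sale_date)
) ENGINE=MyISAM;

CREATE TABLE sales_20080402 LIKE sales_20080401;

-- A MERGE table presents the daily tables as one logical table.
CREATE TABLE sales (
    sale_date  DATE NOT NULL,
    product_id INT NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,
    KEY (sale_date)
) ENGINE=MERGE UNION=(sales_20080401, sales_20080402) INSERT_METHOD=LAST;

-- Removing old data is cheap: take the table out of the UNION, then drop it.
ALTER TABLE sales UNION=(sales_20080402);
DROP TABLE sales_20080401;
```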

You might also look into MySQL 5.1's partitioning.
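
A rough sketch of what that can look like in 5.1, partitioning a fact
table by day (names are illustrative):

```sql
-- Range-partitioned fact table, one partition per day (illustrative).
CREATE TABLE fact_sales (
    sale_date  DATE NOT NULL,
    product_id INT NOT NULL,
    amount     DECIMAL(10,2) NOT NULL
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(sale_date)) (
    PARTITION p20080401 VALUES LESS THAN (TO_DAYS('2008-04-02')),
    PARTITION p20080402 VALUES LESS THAN (TO_DAYS('2008-04-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Expiring a day's data is a metadata operation:
ALTER TABLE fact_sales DROP PARTITION p20080401;
```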

Brian Miezejevski gave a great session on this at last year's MySQL
Conference.  The slides don't seem to be available online, but I wrote
up my notes on the session:
http://www.xaprb.com/blog/2007/04/27/mysql-conference-and-expo-2007-day-4/
 You can probably email him and get the slides as well.

Regards
Baron

-- 

Baron Schwartz, Senior Consultant, Percona Inc.
Tel: +1 888 401 3401 ext 507
24/7 Emergency Line +1 888 401 3401 ext 911
Our Services: http://www.percona.com/services.html
Our Blog: http://www.mysqlperformanceblog.com/




Re: Data Warehouse on MySQL questions

2008-04-03 Thread Perrin Harkins
On Thu, Apr 3, 2008 at 2:28 PM, Dre <[EMAIL PROTECTED]> wrote:
>  1) Several sources seem to suggest MyISAM is a good choice for data
> warehousing, but due to my lack of experience in a transaction-less world,
> this makes me a little nervous.

MyISAM has the advantage of very fast loading, much faster than
InnoDB.  However, InnoDB gave better query performance on the
warehouse data I worked with.

> How do you handle data inconsistency
> problems when ETL jobs fail?

Usually, with a warehouse system, the answer is that you rebuild it.
I have built an incrementally loading ETL system, but my impression is
that that's not the norm.

> I
> can see doing something like manually cleaning out the necessary tables
> before you re-run, but that seems a bit messy to me.  Anyone figure out a
> better approach?

I think it's pretty common to use a Y-loading approach: load into an
empty copy of the schema, then do an atomic RENAME at the end to swap
the new tables into place.  When I used MyISAM tables, I did it that way.
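
A sketch of the swap step, assuming a fact table named sales (the
names are made up):

```sql
-- Build the new table alongside the live one, then swap atomically.
CREATE TABLE sales_new LIKE sales;

-- ... ETL loads sales_new here ...

RENAME TABLE sales     TO sales_old,
             sales_new TO sales;
DROP TABLE sales_old;
```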

>  2) Isn't the lack of bitmap indexes a problem in the warehouse? Most FKs in
> the fact tables will be low cardinality columns; queries that didn't use
> date would be very slow on large fact tables (MS SQL had this problem).  Has
> anyone run into this with MySQL?

You can bundle up low-cardinality columns by using a "junk dimension."
It made a big difference for me.  You make a table with all of the
combinations of the low-cardinality fields and assign a key to each
combination.
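
For example, a junk dimension bundling three hypothetical
low-cardinality flags might look like this (all names illustrative):

```sql
-- One row per combination of the low-cardinality attributes.
CREATE TABLE junk_dim (
    junk_key     SMALLINT UNSIGNED NOT NULL PRIMARY KEY,
    is_return    ENUM('Y','N') NOT NULL,
    payment_type ENUM('cash','card','check') NOT NULL,
    channel      ENUM('web','store','phone') NOT NULL
) ENGINE=MyISAM;

-- The fact table stores a single junk_key instead of several tiny FKs,
-- and queries filter through the (small) junk dimension:
SELECT SUM(f.amount)
FROM fact_sales AS f
JOIN junk_dim AS j ON j.junk_key = f.junk_key
WHERE j.payment_type = 'card' AND j.channel = 'web';
```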

There are some good presentations on data warehousing available from
the website for the MySQL Users Conference.

- Perrin




Re: Data Warehouse on MySQL questions

2008-04-03 Thread Gary Richardson
I've built several data marts using Perl and MySQL. The largest ones
have been up to about 30GB, so I'm not quite at your scale.

For #1, I have an etl_id in the fact table so I can trace back any
particular ETL job. I typically make it a dimension and include date,
time, software version, etc. That doesn't help as much if you're
messing up your dimension tables, but I haven't typically run into
that problem with the designs I've used.

For #2, I haven't built anything big enough for it to be a concern yet.

Also, LOAD DATA INFILE is your friend :)
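
For example (the file path, table, and column names here are
hypothetical):

```sql
-- Bulk-load one day's extract into its staging/daily table.
LOAD DATA INFILE '/tmp/sales_20080403.csv'
INTO TABLE sales_20080403
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(sale_date, product_id, amount);
```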

On Thu, Apr 3, 2008 at 11:28 AM, Dre <[EMAIL PROTECTED]> wrote:
> Hey folks,
>
>  I'm currently deciding whether to build a decent sized (around 300-500GB,
> although honestly, I've got little to base that on at the moment) data
> warehouse in postgreSQL or MySQL.  I've developed several in MS SQL and
> postgreSQL, but the client is comfortable with MySQL, and I'd prefer to use
> that as the platform since it will be less painful for them to manage when
> I'm gone.  I'm hoping that someone with experience building a warehouse on
> MySQL will be able to answer two outstanding questions I have:
>
>  1) Several sources seem to suggest MyISAM is a good choice for data
> warehousing, but due to my lack of experience in a transaction-less world,
> this makes me a little nervous.  How do you handle data inconsistency
> problems when ETL jobs fail?  (For the record, I don't use a separate tool
> for the ETL; I usually use perl/shell scripts to interact with the file
> system, and pl/pgsql or transact-sql once the data is loaded into the
> staging database.  For each file that is loaded, I'll identify steps that
> must be posted together, and wrap them in a transaction in the ETL job.)  I
> can see doing something like manually cleaning out the necessary tables
> before you re-run, but that seems a bit messy to me.  Anyone figure out a
> better approach?
>
>  2) Isn't the lack of bitmap indexes a problem in the warehouse? Most FKs in
> the fact tables will be low cardinality columns; queries that didn't use
> date would be very slow on large fact tables (MS SQL had this problem).  Has
> anyone run into this with MySQL?
>
>  Many thanks in advance!
>
