Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-15 Thread Mischa Sandberg
Simon Riggs wrote:
Jim C. Nasby
On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
PostgreSQL's functionality is in many ways similar to Oracle
Partitioning.
Loading up your data in many similar tables, then creating a view like:
CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...
will allow the PostgreSQL optimizer to eliminate partitions
from the query
when you run queries which include a predicate on the
partitioning_col, e.g.
select count(*) from bigtable where idate >= 200409131000
The "partitions" are just tables, so no need for other management tools.
Oracle treats the partitions as sub-tables, so you need a range of commands
to add, swap etc the partitions of the main table.
A few years ago I wrote a federated query engine (wrapped as an ODBC 
driver) that had to handle thousands of contributors (partitions) to a 
pseudotable / VIEWofUNIONs. Joins did require some special handling in 
the optimizer, because of the huge number of crossproducts between 
different tables. It was definitely worth the effort at the time, 
because you need different strategies for: joining a partition to 
another partition on the same subserver; joining two large partitions on 
different servers; and joining a large partition on one server to a 
small one on another.

The differences may not be so great for a solitary server;
but they're still there, because of disparity in subtable sizes. The 
simplistic query plans tend to let you down, when you're dealing with 
honking warehouses.

I'm guessing that Oracle keeps per-subtable AND cross-all-subtables 
statistics, rather than building the latter from scratch in the course 
of evaluating the query plan. That's the one limitation I see in 
emulating their partitioned tables with Views.



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables?

2004-09-15 Thread Simon Riggs

Chris Browne <[EMAIL PROTECTED]> wrote on 15.09.2004, 04:34:53:
> [EMAIL PROTECTED] ("Simon Riggs") writes:
> > Well, its fairly straightforward to auto-generate the UNION ALL view, and
> > important as well, since it needs to be re-specified each time a new
> > partition is loaded or an old one is cleared down. The main point is that
> > the constant placed in front of each table must in some way relate to the
> > data, to make it useful in querying. If it is just a unique constant, chosen
> > at random, it won't do much for partition elimination. So, that tends to
> > make the creation of the UNION ALL view an application/data specific thing.
>
> Ah, that's probably a good thought.
>
> When we used big "UNION ALL" views, it was with logging tables, where
> there wasn't really any meaningful distinction between partitions.
>
> So you say that if the VIEW contains, within it, meaningful constraint
> information, that can get applied to chop out irrelevant bits?
>
> That suggests a way of resurrecting the idea...
>
> Might we set up the view as:
>
> create view combination_of_logs as
>   select * from table_1 where txn_date between 'this' and 'that'
>    union all
>   select * from table_2 where txn_date between 'this2' and 'that2'
>    union all
>   select * from table_3 where txn_date between 'this3' and 'that3'
>    union all
>   select * from table_4 where txn_date between 'this4' and 'that4'
>    union all
>    ... ad infinitum
>    union all
>   select * from table_n where txn_date > 'start_of_partition_n';
>
> and expect that to help, as long as the query that hooks up to this
> has date constraints?
>

That way of phrasing the view can give you the right answer to the
query, but does not exclude partitions.

With the above way of doing things, you end up with predicate phrases of
the form ((PARTLIMITLO < partcol) AND (PARTLIMITHI > partcol) AND
(partcol > QUERYLIMITLO) AND (partcol < QUERYLIMITHI)).
If the values in capitals are constants, then this should evaluate to
a true or false value for each partition table. The optimizer doesn't
yet do this.

If you specify the view the way I showed, then the predicate query
becomes a comparison of constants, which DOES get evaluated prior to
full execution... you will see this as a "one time test: false" in the
EXPLAIN.

The way you've phrased the view is the more obvious way to phrase it,
and I'd spent a few days trying to work out how to solve the algebra
above in code... but that was wasted effort.

Anyway, if you use constants you can still specify ranges and betweens
and have them work... hence my example showed date-like integers - but
I don't think it just applies to one datatype.
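
To make that concrete, here is a minimal sketch (table and column names are
invented, not from this thread). With the constant-per-partition form, a
predicate on idate combines with each branch's leading constant into a
constant-only test that the planner can evaluate before execution:

create view bigtable (idate, col1) as
select 200409130800, col1 from table200409130800
union all
select 200409131000, col1 from table200409131000;

select count(*) from bigtable where idate >= 200409131000;
-- for the first branch this reduces to (200409130800 >= 200409131000),
-- a comparison of constants, so that partition's scan can be skipped;
-- the BETWEEN-in-the-view phrasing instead leaves a predicate on a real
-- column, which is not folded away like this.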

Best Regards, Simon Riggs




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Steve Atkins
On Tue, Sep 14, 2004 at 05:33:33PM -0500, Jim C. Nasby wrote:
> On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
> > PostgreSQL's functionality is in many ways similar to Oracle Partitioning.
> > 
> > Loading up your data in many similar tables, then creating a view like:
> > 
> > CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
> > SELECT 200409130800, col1, col2, col3... FROM table200409130800
> > UNION ALL
> > SELECT 200409131000, col1, col2, col3... FROM table200409131000
> > UNION ALL
> > SELECT 200409131200, col1, col2, col3... FROM table200409131200
> > ...etc...
[...]
> 
> Is there by any chance a set of functions to manage adding and removing
> partitions? Certainly this can be done by hand, but having a set of
> tools would make life much easier. I just looked but didn't see anything
> on GBorg.

I've done a similar thing with time-segregated data by inheriting
all the partition tables from an (empty) parent table.

Adding a new partition is just a "create table tablefoo () inherits(bigtable)"
and removing a partition just "drop table tablefoo".
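
For reference, a minimal sketch of that layout (names here are illustrative
only):

create table bigtable (
idate bigint,
col1  integer,
col2  text
);

create table table200409130800 () inherits (bigtable);
create table table200409131000 () inherits (bigtable);

-- selects against bigtable automatically include the children;
-- removing a partition is just:
drop table table200409130800;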

Cheers,
  Steve




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Chris Browne
[EMAIL PROTECTED] ("Simon Riggs") writes:
> Well, its fairly straightforward to auto-generate the UNION ALL view, and
> important as well, since it needs to be re-specified each time a new
> partition is loaded or an old one is cleared down. The main point is that
> the constant placed in front of each table must in some way relate to the
> data, to make it useful in querying. If it is just a unique constant, chosen
> at random, it won't do much for partition elimination. So, that tends to
> make the creation of the UNION ALL view an application/data specific thing.

Ah, that's probably a good thought.

When we used big "UNION ALL" views, it was with logging tables, where
there wasn't really any meaningful distinction between partitions.

So you say that if the VIEW contains, within it, meaningful constraint
information, that can get applied to chop out irrelevant bits?  

That suggests a way of resurrecting the idea...

Might we set up the view as:

create view combination_of_logs as
  select * from table_1 where txn_date between 'this' and 'that' 
   union all
  select * from table_2 where txn_date between 'this2' and 'that2' 
   union all
  select * from table_3 where txn_date between 'this3' and 'that3' 
   union all
  select * from table_4 where txn_date between 'this4' and 'that4' 
   union all
   ... ad infinitum
   union all
  select * from table_n where txn_date > 'start_of_partition_n';

and expect that to help, as long as the query that hooks up to this
has date constraints?

We'd have to regenerate the view with new fixed constants each time we
set up the tables, but that sounds like it could work...
-- 
"cbbrowne","@","acm.org"
http://www3.sympatico.ca/cbbrowne/x.html
But  what can  you  do with  it?   -- ubiquitous  cry from  Linux-user
partner.  -- Andy Pearce, <[EMAIL PROTECTED]>



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Christopher Browne
[EMAIL PROTECTED] ("Simon Riggs") wrote:
> The main point is that the constant placed in front of each table
> must in some way relate to the data, to make it useful in
> querying. If it is just a unique constant, chosen at random, it
> won't do much for partition elimination.

It just struck me - this is much the same notion as that of "cutting
planes" used in Integer Programming.

The approach, there, is that you take a linear program, which can give
fractional results, and throw on as many additional constraints as you
need in order to force the likelihood of particular variable falling
on integer values.  The constraints may appear redundant, but
declaring them allows the answers to be pushed in the right
directions.

In this particular case, the (arguably redundant) constraints let the
query optimizer have criteria for throwing out unnecessary tables.
Thanks for pointing this out; it may turn a fowl into a feature, when
I can get some "round tuits" :-).  That should allow me to turn an
81-way evil join into something that's 4-way at the worst.

Cheers!
-- 
"cbbrowne","@","linuxfinances.info"
http://linuxfinances.info/info/nonrdbms.html
Implementing systems is 95% boredom and 5% sheer terror.



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Michael Glaesemann
On Sep 15, 2004, at 8:32 AM, Simon Riggs wrote:
The "partitions" are just tables, so no need for other management 
tools.
Oracle treats the partitions as sub-tables, so you need a range of 
commands
to add, swap etc the partitions of the main table.

I guess a set of tools that emulates that functionality would be 
generically
a good thing, if you can see a way to do that.

Oracle partitions were restricted in only allowing a single load 
statement
into a single partition at any time, whereas multiple COPY statements 
can
access a single partition table on PostgreSQL.
How does this compare to DB2 partitioning?
Michael Glaesemann
grzm myrealbox com


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Simon Riggs
> Jim C. Nasby
> On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
> > PostgreSQL's functionality is in many ways similar to Oracle Partitioning.
> >
> > Loading up your data in many similar tables, then creating a view like:
> >
> > CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
> > SELECT 200409130800, col1, col2, col3... FROM table200409130800
> > UNION ALL
> > SELECT 200409131000, col1, col2, col3... FROM table200409131000
> > UNION ALL
> > SELECT 200409131200, col1, col2, col3... FROM table200409131200
> > ...etc...
> >
> > will allow the PostgreSQL optimizer to eliminate partitions from the query
> > when you run queries which include a predicate on the partitioning_col, e.g.
> >
> > select count(*) from bigtable where idate >= 200409131000
> >
> > will scan the last two partitions only...
> >
> > There are a few other ways of creating the view that return the same answer,
> > but only using constants in that way will allow the partitions to be
> > eliminated from the query, and so run for much longer.
>
> Is there by any chance a set of functions to manage adding and removing
> partitions? Certainly this can be done by hand, but having a set of
> tools would make life much easier. I just looked but didn't see anything
> on GBorg.

Well, it's fairly straightforward to auto-generate the UNION ALL view, and
important as well, since it needs to be re-specified each time a new
partition is loaded or an old one is cleared down. The main point is that
the constant placed in front of each table must in some way relate to the
data, to make it useful in querying. If it is just a unique constant, chosen
at random, it won't do much for partition elimination. So, that tends to
make the creation of the UNION ALL view an application/data specific thing.
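
As an illustration only, one rough way that auto-generation could look (this
sketch assumes partitions named table<YYYYMMDDHHMM> that all share columns
col1..col3, and uses dollar quoting, so adapt it to your server version and
naming scheme):

create or replace function rebuild_bigtable_view() returns void as $$
declare
    r   record;
    sql text := '';
begin
    for r in
        select tablename
        from pg_tables
        where tablename like 'table2004%'
        order by tablename
    loop
        if sql <> '' then
            sql := sql || ' union all ';
        end if;
        -- the leading constant is taken from the table name itself,
        -- so it relates to the data and is usable for partition elimination
        sql := sql || 'select ' || substring(r.tablename from 6)
                   || ' as idate, col1, col2, col3 from ' || r.tablename;
    end loop;
    execute 'create or replace view bigtable as ' || sql;
end;
$$ language plpgsql;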

The "partitions" are just tables, so no need for other management tools.
Oracle treats the partitions as sub-tables, so you need a range of commands
to add, swap etc the partitions of the main table.

I guess a set of tools that emulates that functionality would be generically
a good thing, if you can see a way to do that.

Oracle partitions were restricted in only allowing a single load statement
into a single partition at any time, whereas multiple COPY statements can
access a single partition table on PostgreSQL.

BTW, multi-dimensional partitioning is also possible using the same general
scheme.
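
A sketch of the multi-dimensional case (names invented): add one leading
constant per dimension, e.g. load date and region, and constrain both in the
query:

create view bigtable (idate, region, col1) as
select 200409130800, 'east', col1 from table200409130800_east
union all
select 200409130800, 'west', col1 from table200409130800_west
union all
select 200409131000, 'east', col1 from table200409131000_east
union all
select 200409131000, 'west', col1 from table200409131000_west;

select count(*) from bigtable
where idate >= 200409131000 and region = 'east';
-- can eliminate partitions along both dimensions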

Best Regards, Simon Riggs




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Jim C. Nasby
On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
> PostgreSQL's functionality is in many ways similar to Oracle Partitioning.
> 
> Loading up your data in many similar tables, then creating a view like:
> 
> CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
> SELECT 200409130800, col1, col2, col3... FROM table200409130800
> UNION ALL
> SELECT 200409131000, col1, col2, col3... FROM table200409131000
> UNION ALL
> SELECT 200409131200, col1, col2, col3... FROM table200409131200
> ...etc...
> 
> will allow the PostgreSQL optimizer to eliminate partitions from the query
> when you run queries which include a predicate on the partitioning_col, e.g.
> 
> select count(*) from bigtable where idate >= 200409131000
> 
> will scan the last two partitions only...
> 
> There are a few other ways of creating the view that return the same answer,
> but only using constants in that way will allow the partitions to be
> eliminated from the query, and so run for much longer.

Is there by any chance a set of functions to manage adding and removing
partitions? Certainly this can be done by hand, but having a set of
tools would make life much easier. I just looked but didn't see anything
on GBorg.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread aaron werman
From: "Harald Lau (Sector-X)" <[EMAIL PROTECTED]>
...
> From: "Mischa Sandberg" <[EMAIL PROTECTED]>
>
> > If your company is currently happy with MySQL, there probably are
> > other (nontechnical) reasons to stick with it. I'm impressed that
> > you'd consider reconsidering PG.
>
> I'd like to second Mischa on that issue.
Though both of you are right from my point of view, I don't think
it's very useful to discuss this item here.
It is kinda windy for the list, but the point is that a big part of 
performance is developer expectation and user expectation. I'd hope to lower 
expectations before we see an article in eWeek. Perhaps this thread should 
move to the advocacy list until the migration needs specific advice.




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Vivek Khera
> "MC" == Mark Cotner <[EMAIL PROTECTED]> writes:

MC> I've finished porting the schema and am importing the
MC> data now.  My estimates for just two-thirds(60 of the
MC> 90 days) of one of our 30 cable systems(MySQL dbs) is
MC> estimated to take about 16 hours.  This may seem like
MC> a lot, but I'm satisfied with the performance.  I've

be sure to load your data without indexes defined for your initial
import.

check your logs to see if increasing checkpoint_segments is
recommended.  I found that bumping it up to 50 helped speed up my
data loads (restore from dump) significantly.
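
For what it's worth, the relevant postgresql.conf entry looked roughly like
this for my loads (the value is illustrative, not a recommendation; watch the
server log's checkpoint warnings and tune from there):

# postgresql.conf
checkpoint_segments = 50    # default is 3; more WAL segments between
                            # checkpoints means less checkpoint churn
                            # during bulk loads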

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: [EMAIL PROTECTED]   Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Harald Lau (Sector-X)

> From: "Mischa Sandberg" <[EMAIL PROTECTED]>
> 
> > If your company is currently happy with MySQL, there probably are
> > other (nontechnical) reasons to stick with it. I'm impressed that
> > you'd consider reconsidering PG.
> 
> I'd like to second Mischa on that issue.

Though both of you are right from my point of view, I don't think
it's very useful to discuss this item here.

Having once migrated a MySQL DB to PG I can confirm that, in fact,
chances are good you will be unhappy if you adopt the MySQL
data model and the SQL 1:1.
Also, PG has to be configured and optimized much more than MySQL.
And the client application will likely need to be modified to a
certain extent, particularly if you want to move some - or some
more - business logic from the client into the database.

But, from what Mark stated so far I'm sure he is not going to migrate
his app just for fun, or without having considered this.

> NEVER reimplement an existing system unless the project includes
> substantial functional improvement.

... or monetary issues.
I know one big database that was migrated from Oracle to PG and
another from SQL Server to PG because of licence costs. There are
definitely some more.
That applies to MySQL, too; its licence policy is somewhat obscure to me,
but under certain circumstances you have to pay.

regards Harald





Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Aaron Werman
> Mark Cotner wrote:

> > The time has come to reevaluate/rearchitect an
> > application which I built about 3 years ago.  There
> > are no performance concerns with MySQL, but it would
> > benefit greatly from stored procedures, views, etc.
>

From: "Mischa Sandberg" <[EMAIL PROTECTED]>

> If your company is currently happy with MySQL, there probably are other
> (nontechnical) reasons to stick with it. I'm impressed that you'd
> consider reconsidering PG.

I'd like to second Mischa on that issue. In general, if you migrate an
*existing* application from one RDBMS to another, you should expect
performance to decrease significantly. This is always true in a well
performing system even if the replacement technology is more sophisticated.
This is because of several factors.

Even if you try to develop in totally agnostic generic SQL, you are always
customizing to a feature set, namely the ones in the current system. Any
existing application has had substantial tuning and tweaking, and the new
one is at a disadvantage. Moreover, an existing system is a Skinnerian
reward/punishment system to the developers and DBAs, rewarding or punishing
them for very environment specific choices - resulting in an application,
dbms, OS, and platform that are both explicitly and unconsciously customized
to work together in a particular manner.

The net effect is a rule of thumb that I use:

NEVER reimplement an existing system unless the project includes substantial
functional improvement.

Every time I've broken that rule, I've found that users' expectations, based
on the application they are used to, are locked in. Any place where the new
system is slower, the users are dissatisfied; where it exceeds expectations,
it isn't appreciated: the users are used to the old system's quirks, and the
improvements only leave them uncomfortable since the system "acts
differently". (I've broken the rule on occasion for standardization
conversions.)

My expectation is that pg will not get a fair shake here. If you do it - I'd
like to see the results anyway.

/Aaron



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Pierre-Frédéric Caillaud

Performance hint:

For static data, do not normalize too much. For instance, if you have a row
which can be linked to several other rows, you can do this:

create table parents (
id serial primary key,
values... );

create table children (
id serial primary key,
parent_id integer references parents(id),
slave_value integer );

Or you can do this, using an array:

create table everything (
id serial primary key,
children_values integer[],
values... );

Pros:
No joins. Getting the list of children_values from table everything is
just a select.
On an application with several million rows, a query lasting 150 ms with
a join takes 30 ms with an array.
You can build the arrays from normalized tables by using an aggregate
function (a sketch follows below).
You can index the array elements with a GiST index...

Cons:
No joins, so your queries are a little bit limited; problems arise if the
array is too long.
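
A sketch of building the arrays from the normalized tables above (for brevity
this uses an ARRAY(...) subselect rather than a custom aggregate; either
works):

insert into everything (id, children_values)
select p.id,
       array(select c.slave_value from children c where c.parent_id = p.id)
from parents p;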





Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Mark Cotner
You all have been so very helpful so far and I really
appreciate it.

The data in these tables is thankfully static since
they are logging tables and an analyze only takes
about 4 minutes for the largest of them.

I've finished porting the schema and am importing the
data now.  My estimates for just two-thirds(60 of the
90 days) of one of our 30 cable systems(MySQL dbs) is
estimated to take about 16 hours.  This may seem like
a lot, but I'm satisfied with the performance.  I've
created a slightly normalized version and some stored
procedures to help me normalize the data.  When this
finishes I'm going to query the data as is with the
views as you suggested, and I'm going to create views
for the normalized version to test that as well.  This
will then be contrasted to the MySQL query results and
I plan to write a white paper of my findings.

I don't have any concerns that Postgres will do fine,
but if I run into any performance problems I'll be
sure and post them here first.

It should be noted that our development life cycle is
currently severely hindered by lack of features in
MySQL like views and stored procedures.  Frankly I've
implemented some pretty ugly SQL using as many as 5
temp tables to generate a result set with MySQL. 
Having stored procedures and views is going to help us
tremendously.  This performance evaluation is to
verify that Postgres can handle what we're going to
throw at it, not to find out if it's faster in
milliseconds than MySQL.  We love the speed and ease
of maintenance with MySQL, but have simply outgrown
it.  This will be reflected in the white paper.

I have already imported our customer tables, which
aren't too small(2.4M rows x 3 tables), and stuck a
view in front of it.  The view queried faster than
MySQL would query a pre-joined flat table.

Getting carried away . . . needless to say I'm really
excited about the possiblity of Postgres, but I won't
bore you with the details just yet.  I'll send the
link out to the white paper so you all can review it
before I send it anywhere else.  If anything could
have been optimized more please let me know and I'll
see that it gets updated before it's widely published.

Thanks again for all the great feedback!

'njoy,
Mark

--- Christopher Browne <[EMAIL PROTECTED]> wrote:

> A long time ago, in a galaxy far, far away,
> [EMAIL PROTECTED] (Mark Cotner) wrote:
> > Agreed, I did some preliminary testing today and
> am very impressed.
> > I wasn't used to running analyze after a data
> load, but once I did
> > that everything was snappy.
> 
> Something worth observing is that this is true for
> _any_ of the
> database systems supporting a "cost-based"
> optimization system,
> including Oracle and DB2.
> 
> When working with SAP R/3 Payroll, on one project,
> we found that when
> the system was empty of data, the first few employee
> creates were
> quick enough, but it almost immediately got
> excruciatingly slow.  One
> of the DBAs told the Oracle instance underneath to
> collect statistics
> on the main table, and things _immediately_ got
> snappy again.  But it
> didn't get snappy until the conversion folk had run
> the conversion
> process for several minutes, to the point to which
> it would get
> painfully slow :-(.  There, with MILLIONS of dollars
> worth of license
> fees being paid, across the various vendors, it
> still took a fair bit
> of manual fiddling.
> 
> MySQL(tm) is just starting to get into cost-based
> optimization; in
> that area, they're moving from where the "big DBs"
> were about 10 years
> ago.  It was either version 7 or 8 where Oracle
> started moving to
> cost-based optimization, and (as with the anecdote
> above) it took a
> release or two for people to get accustomed to the
> need to 'feed' the
> optimizer with statistics.  This is a "growing pain"
> that bites users
> with any database where this optimization gets
> introduced.  It's
> worthwhile, but is certainly not costless.
> 
> I expect some forseeable surprises will be
> forthcoming for MySQL AB's
> customers in this regard...
> 
> > My best results from MySQL bulk inserts was around
> 36k rows per
> > second on a fairly wide table.  Today I got 42k
> using the COPY
> > command, but with the analyze post insert the
> results were similar.
> > These are excellent numbers.  It basically means
> we could have our
> > cake(great features) and eat it too(performance
> that's good enough
> > to run the app).
> 
> In the end, performance for inserts is always
> fundamentally based on
> how much disk I/O there is, and so it should come as
> no shock that
> when roughly the same amount of data is getting laid
> down on disk,
> performance won't differ much on these sorts of
> essentials.
> 
> There are a few places where there's some need for
> cleverness; if you
> see particular queries running unusually slowly,
> it's worth doing an
> EXPLAIN or EXPLAIN ANALYZE on them, to see how the
> query plans are
> being generated.  There's some collected wisdom out
> here on how to encourage the right plans.

Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mischa Sandberg
Mark Cotner wrote:
Hi all,
I had a difficult time deciding which list to post
this to, so please forgive me if this list doesn't
perfectly match my questions.  My decision will not
solely be based on performance, but it is the primary
concern.  I would be very appreciative if you all
could comment on my test plan.  Real world examples of
a Postgres implementation of >=600G with a web
front-end would be great, or any data warehouse with
some size to it.
I'm only in the 30GB range of database, in case that's a consideration 
for my comments that follow.

At this time, I'm working out the best ROLAP options for our PG 
transaction store. The transaction store is highly volatile -- longest a 
txn stays in it is 15 days ... so you imagine the need for historic 
summaries :-)

I've also implemented multiple data servers, including
a federated server that had to make the best of existing engines
(like MySQL, PG and everything from MSJet to RedBrick in the commercial 
world).

The time has come to reevaluate/rearchitect an
application which I built about 3 years ago.  There
are no performance concerns with MySQL, but it would
benefit greatly from stored procedures, views, etc. 
If your company is currently happy with MySQL, there probably are other 
(nontechnical) reasons to stick with it. I'm impressed that you'd 
consider reconsidering PG.

Some of the mining that we do could benefit from
stored procedures as well.  MySQL may have these in
the works, but we won't be able to move to a version
of MySQL that supports stored procs for another year
or two.
And PG lets you back-end with some powerful pattern- and 
aggregate-handling languages, like Perl. This was definitely a plus for 
data mining of web traffic, for example. The power of server-side 
extensibility for bailing you out of a design dead-end is not 
inconsequential.

PG doesn't have PIVOT operators (qv Oracle and MSSQL), but it makes the 
translation from data to column fairly painless otherwise.

Requirements:
Merge table definition equivalent.  We use these
extensively.
Looked all over mysql.com etc, and afaics merge table
is indeed exactly a view of a union-all. Is that right?
PG supports views, of course, as well (now) as tablespaces, allowing you 
to split tables/tablesets across multiple disk systems.
PG is also pretty efficient in query plans on such views, where (say) 
you make one column a constant (identifier, sort of) per input table.

Merge table equivalent with all tables containing over
100M rows(and about 40 columns, some quite wide) will
need to do index scans in at least 5 seconds(MySQL
currently does 2, but we can live with 5) and return
~200 rows.
PG has TOAST for handling REALLY BIG columns, and the generic TEXT type 
is as efficient as any size-specific VARCHAR() type ... should make 
things easier for you.

Um, gonna sound silly, but the web interface has to
remain "snappy" under load.  I don't see this as a
major concern since you don't require table locking.
Agreed. It's more in your warehouse design, and intelligent bounding of 
queries. I'd say PG's query analyzer is a few years ahead of MySQL for 
large and complex queries.

If business logic is moved to the database(likely with
Postgres) performance for inserting with light logic
on each insert has to keep up with the 4.5M inserts
per 2 hours(which MySQL completes in ~35min
currently).  Acceptable numbers for this aggregation
would be 45-55min using stored procedures.
Again, it's a matter of pipeline design. The tools for creating an 
efficient pipeline are at least as good in PG as MySQL.

If you try to insert and postprocess information one row at a time,
procedures or no, there's no offhand way to guarantee your performance 
without a test/prototype.

On the other hand, if you do warehouse-style loading (Insert, or PG 
COPY, into a temp table; and then 'upsert' into the perm table), I can 
guarantee 2500 inserts/sec is no problem.
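
For illustration, a bare-bones sketch of that load pattern (table and column
names are invented; COPY from a server-side file needs superuser, so use
psql's \copy or COPY FROM STDIN if that's a problem):

begin;

create temp table stage (like perm_table);
copy stage from '/tmp/load.dat';

-- the 'upsert': update rows that already exist, then insert the rest
update perm_table
set    metric = s.metric
from   stage s
where  perm_table.id = s.id;

insert into perm_table (id, metric)
select s.id, s.metric
from   stage s
where  not exists (select 1 from perm_table p where p.id = s.id);

commit;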

Here's our case study if you're interested . . . 
http://www.mysql.com/customers/customer.php?id=16


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Simon Riggs

Mark,

I thought some additional comments on top of Christopher's excellent notes
might help you.

> Christopher Browne
> The world rejoiced as Mischa Sandberg
> <[EMAIL PROTECTED]> wrote:
> > Mark Cotner wrote:
> >> Requirements:
> >> Merge table definition equivalent.  We use these
> >> extensively.
>
> > Looked all over mysql.com etc, and afaics merge table is indeed
> > exactly a view of a union-all. Is that right?
>

PostgreSQL's functionality is in many ways similar to Oracle Partitioning.

Loading up your data in many similar tables, then creating a view like:

CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...

will allow the PostgreSQL optimizer to eliminate partitions from the query
when you run queries which include a predicate on the partitioning_col, e.g.

select count(*) from bigtable where idate >= 200409131000

will scan the last two partitions only...

There are a few other ways of creating the view that return the same answer,
but only using constants in that way will allow the partitions to be
eliminated from the query; the other forms will scan every partition and so
run for much longer.

So you can give different VIEWS to different user groups, have different
indexes on different tables etc.

However, I haven't managed to get this technique to work when performing a
star join to a TIME dimension table, since the partition elimination relies
on comparison of constant expressions. You'll need to check out each main
join type to make sure it works for you in your environment.

> > PG supports views, of course, as well (now) as tablespaces, allowing
> > you to split tables/tablesets across multiple disk systems.  PG is
> > also pretty efficient in query plans on such views, where (say) you
> > make one column a constant (identifier, sort of) per input table.
>
> The thing that _doesn't_ work well with these sorts of UNION views are
> when you do self-joins.  Supposing you have 10 members, a self-join
> leads to a 100-way join, which is not particularly pretty.
>

Well, that only happens when you forget to include the partitioning constant
in the self join.

e.g. select count(*) from bigtable a, bigtable b where a.idate =
b.idate;  -- works just fine

The optimizer really is smart enough to handle that too, but I'm sure such
large self-joins aren't common for you anyhow.

> I'm quite curious as to how MySQL(tm) copes with this, although it may
> not be able to take place; they may not support that...
>

It doesn't, AFAIK.

> Christopher Browne wrote
> A long time ago, in a galaxy far, far away, [EMAIL PROTECTED]
> (Mark Cotner) wrote:
> > Agreed, I did some preliminary testing today and am very impressed.
> > I wasn't used to running analyze after a data load, but once I did
> > that everything was snappy.
>
> Something worth observing is that this is true for _any_ of the
> database systems supporting a "cost-based" optimization system,
> including Oracle and DB2.

Agreed. You can reduce the time for the ANALYZE by ignoring some of the
(measure) columns not used in WHERE clauses.
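
For example (a sketch, with invented names), restricting the ANALYZE to the
columns that actually appear in WHERE clauses:

analyze table200409131000 (col1, col2);
-- skips the wide measure columns that never appear in WHERE clauses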

Also, if you're sure that each load is very similar to the last, you might
even consider directly updating pg_statistic rows with the statistical
values produced from an earlier ANALYZE...scary, but it can work.

To create a set of tables of > 600Gb, you will benefit from creating each
table WITHOUT OIDS.
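
For example (sketch only; the columns are invented):

create table table200409131400 (
col1 integer,
col2 text,
col3 numeric
) without oids;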

Hope some of that helps you...

Best Regards, Simon Riggs




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Christopher Browne
A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] (Mark Cotner) wrote:
> Agreed, I did some preliminary testing today and am very impressed.
> I wasn't used to running analyze after a data load, but once I did
> that everything was snappy.

Something worth observing is that this is true for _any_ of the
database systems supporting a "cost-based" optimization system,
including Oracle and DB2.

When working with SAP R/3 Payroll, on one project, we found that when
the system was empty of data, the first few employee creates were
quick enough, but it almost immediately got excruciatingly slow.  One
of the DBAs told the Oracle instance underneath to collect statistics
on the main table, and things _immediately_ got snappy again.  But it
didn't get snappy until the conversion folk had run the conversion
process for several minutes, to the point to which it would get
painfully slow :-(.  There, with MILLIONS of dollars worth of license
fees being paid, across the various vendors, it still took a fair bit
of manual fiddling.

MySQL(tm) is just starting to get into cost-based optimization; in
that area, they're moving from where the "big DBs" were about 10 years
ago.  It was either version 7 or 8 where Oracle started moving to
cost-based optimization, and (as with the anecdote above) it took a
release or two for people to get accustomed to the need to 'feed' the
optimizer with statistics.  This is a "growing pain" that bites users
with any database where this optimization gets introduced.  It's
worthwhile, but is certainly not costless.

I expect some foreseeable surprises will be forthcoming for MySQL AB's
customers in this regard...

> My best results from MySQL bulk inserts was around 36k rows per
> second on a fairly wide table.  Today I got 42k using the COPY
> command, but with the analyze post insert the results were similar.
> These are excellent numbers.  It basically means we could have our
> cake(great features) and eat it too(performance that's good enough
> to run the app).

In the end, performance for inserts is always fundamentally based on
how much disk I/O there is, and so it should come as no shock that
when roughly the same amount of data is getting laid down on disk,
performance won't differ much on these sorts of essentials.

There are a few places where there's some need for cleverness; if you
see particular queries running unusually slowly, it's worth doing an
EXPLAIN or EXPLAIN ANALYZE on them, to see how the query plans are
being generated.  There's some collected wisdom out here on how to
encourage the right plans.

There are also unexpected results that are OK.  We did a system
upgrade a few days ago that led to one of the tables starting out
totally empty.  A summary report that looks at that table wound up
with a pretty wacky looking query plan (compared to what's usual)
because the postmaster knew that the query would be reading in
essentially the entire table.  You'd normally expect an index scan,
looking for data for particular dates.  In this case, it did a "scan
the whole table; filter out a few irrelevant entries" plan.  

It looked wacky, compared to what's usual, but it ran in about 2
seconds, which was way FASTER than what's usual.  So the plan was
exactly the right one.

Telling the difference between the right plan and a poor one is a bit
of an art; we quite regularly take a look at query plans on this list
to figure out what might be not quite right.  If you find slow ones,
make sure you have run ANALYZE on the tables recently, to be sure that
the plans are sane, and you may want to consider posting some of them
to see if others can point to improvements that can be made.
-- 
If this was helpful,  rate me
http://linuxfinances.info/info/linuxdistributions.html
"I can't believe my room doesn't have Ethernet!  Why wasn't it wired
when the house was built?"
"The house was built in 1576." 
-- Alex Kamilewicz on the Oxford breed of `conference American.'



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mark Cotner
See comments . . . thanks for the feedback.

'njoy,
Mark

--- Christopher Browne <[EMAIL PROTECTED]> wrote:

> The world rejoiced as Mischa Sandberg
> <[EMAIL PROTECTED]> wrote:
> > Mark Cotner wrote:
> >> Requirements:
> >> Merge table definition equivalent.  We use these
> >> extensively.
> 
> > Looked all over mysql.com etc, and afaics merge
> table is indeed
> > exactly a view of a union-all. Is that right?
> 
> > PG supports views, of course, as well (now) as
> tablespaces, allowing
> > you to split tables/tablesets across multiple disk
> systems.  PG is
> > also pretty efficient in query plans on such
> views, where (say) you
> > make one column a constant (identifier, sort of)
> per input table.
> 
> The thing that _doesn't_ work well with these sorts
> of UNION views are
> when you do self-joins.  Supposing you have 10
> members, a self-join
> leads to a 100-way join, which is not particularly
> pretty.
> 
> I'm quite curious as to how MySQL(tm) copes with
> this, although it may
> not be able to take place; they may not support
> that...
> 
> >> Um, gonna sound silly, but the web interface has
> to remain "snappy"
> >> under load.  I don't see this as a major concern
> since you don't
> >> require table locking.
> 
> > Agreed. It's more in your warehouse design, and
> intelligent bounding
> > of queries. I'd say PG's query analyzer is a few
> years ahead of
> > MySQL for large and complex queries.
> 
> The challenge comes in if the application has had
> enormous amounts of
> effort put into it to attune it exactly to
> MySQL(tm)'s feature set.
> 
> The guys working on RT/3 have found this a
> challenge; they had rather
> a lot of dependancies on its case-insensitive string
> comparisons,
> causing considerable grief.
>

Not so much, I've tried to be as agnostic as possible.
 Much of the more advanced mining that I've written is
kinda MySQL specific, but needs to be rewritten as
stored procedures anyway.
 
> > On the other hand, if you do warehouse-style
> loading (Insert, or PG
> > COPY, into a temp table; and then 'upsert' into
> the perm table), I
> > can guarantee 2500 inserts/sec is no problem.
> 
> The big wins are thus:
> 
>  1.  Group plenty of INSERTs into a single
> transaction.
> 
>  2.  Better still, use COPY to cut parsing costs
> plenty more.
> 
>  3.  Adding indexes _after_ the COPY are a further
> win.
> 
> Another possibility is to do clever things with
> stored procs; load
> incoming data using the above optimizations, and
> then run stored
> procedures to use some more or less fancy logic to
> put the data where
> it's ultimately supposed to be.  Having the logic
> running inside the
> engine is the big optimization.

Agreed, I did some preliminary testing today and am
very impressed.  I wasn't used to running analyze
after a data load, but once I did that everything was
snappy.

My best results from MySQL bulk inserts was around 36k
rows per second on a fairly wide table.  Today I got
42k using the COPY command, but with the analyze post
insert the results were similar.  These are excellent
numbers.  It basically means we could have our
cake(great features) and eat it too(performance that's
good enough to run the app).

Queries from my test views were equally pleasing.  I
won't bore you with the details just yet, but
PostgreSQL is doing great.   Not that you all are
surprised.  ;)


> -- 
> wm(X,Y):-write(X),write('@'),write(Y).
> wm('cbbrowne','linuxfinances.info').
> http://linuxfinances.info/info/spreadsheets.html
> Rules  of  the  Evil  Overlord   #198.   "I  will 
> remember  that  any
> vulnerabilities I have  are to be revealed strictly 
> on a need-to-know
> basis.  I will also remember that no one needs to
> know."
> 
> 
> 




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-12 Thread Christopher Browne
The world rejoiced as Mischa Sandberg <[EMAIL PROTECTED]> wrote:
> Mark Cotner wrote:
>> Requirements:
>> Merge table definition equivalent.  We use these
>> extensively.

> Looked all over mysql.com etc, and afaics merge table is indeed
> exactly a view of a union-all. Is that right?

> PG supports views, of course, as well (now) as tablespaces, allowing
> you to split tables/tablesets across multiple disk systems.  PG is
> also pretty efficient in query plans on such views, where (say) you
> make one column a constant (identifier, sort of) per input table.

The thing that _doesn't_ work well with these sorts of UNION views is
when you do self-joins.  Supposing you have 10 members, a self-join
leads to a 100-way join, which is not particularly pretty.

I'm quite curious as to how MySQL(tm) copes with this, although it may
not be able to take place; they may not support that...

>> Um, gonna sound silly, but the web interface has to remain "snappy"
>> under load.  I don't see this as a major concern since you don't
>> require table locking.

> Agreed. It's more in your warehouse design, and intelligent bounding
> of queries. I'd say PG's query analyzer is a few years ahead of
> MySQL for large and complex queries.

The challenge comes in if the application has had enormous amounts of
effort put into it to attune it exactly to MySQL(tm)'s feature set.

The guys working on RT/3 have found this a challenge; they had rather
a lot of dependencies on its case-insensitive string comparisons,
causing considerable grief.

> On the other hand, if you do warehouse-style loading (Insert, or PG
> COPY, into a temp table; and then 'upsert' into the perm table), I
> can guarantee 2500 inserts/sec is no problem.

The big wins are thus:

 1.  Group plenty of INSERTs into a single transaction.

 2.  Better still, use COPY to cut parsing costs plenty more.

 3.  Adding indexes _after_ the COPY is a further win.

Another possibility is to do clever things with stored procs; load
incoming data using the above optimizations, and then run stored
procedures to use some more or less fancy logic to put the data where
it's ultimately supposed to be.  Having the logic running inside the
engine is the big optimization.
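
A compact sketch of that ordering (names invented; adjust paths and
permissions for your setup):

begin;
copy facts (col1, col2, col3) from '/tmp/facts.dat';
commit;

-- build indexes after the bulk load, then collect statistics
create index facts_col1_idx on facts (col1);
analyze facts;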
-- 
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','linuxfinances.info').
http://linuxfinances.info/info/spreadsheets.html
Rules  of  the  Evil  Overlord   #198.   "I  will  remember  that  any
vulnerabilities I have  are to be revealed strictly  on a need-to-know
basis.  I will also remember that no one needs to know."

