Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Leeuw van der, Tim
Hi,

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Michael Kleiser
Sent: Tuesday, September 14, 2004 4:23 PM
To: Leeuw van der, Tim
Cc: Steinar H. Gunderson; PostgreSQL Performance List
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --


 What MySQL table type did you use?
 Was it MyISAM, which doesn't support transactions?
 Yes, I've read that bulk inserts with this table type are very fast.
 In a data warehouse one often doesn't need transactions.

Although totally beyond the scope of this thread, we used InnoDB tables with MySQL 
because of the transaction support.

regards,

--Tim

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Christopher Browne
In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote:
 That's exactly what we're doing, but using inherited tables instead of
 a union view. With inheritance, there is no need to rebuild the view
 each time a table is added or removed. Basically, in our application,
 tables are partitioned by either month or week, depending on the type
 of data involved, and queries are normally date qualified.

Sounds interesting, and possibly usable.

Where does the constraint come in that'll allow most of the data to be
excluded?

Or is this just that the entries are all part of bigtable so that
the self join is only 2-way?
-- 
let name=cbbrowne and tld=acm.org in name ^ @ ^ tld;;
http://linuxfinances.info/info/advocacy.html
Be humble.   A lot happened before  you were born.   - Life's Little
Instruction Book

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread simon

Joe Conway [EMAIL PROTECTED] wrote on 15.09.2004, 06:30:24:
 Chris Browne wrote:
  Might we set up the view as:
  
  create view combination_of_logs as
select * from table_1 where txn_date between 'this' and 'that' 
 union all
select * from table_2 where txn_date between 'this2' and 'that2' 
 union all
select * from table_3 where txn_date between 'this3' and 'that3' 
 union all
select * from table_4 where txn_date between 'this4' and 'that4' 
 union all
 ... ad infinitum
 union all
   select * from table_n where txn_date > 'start_of_partition_n';
  
  and expect that to help, as long as the query that hooks up to this
  has date constraints?
  
  We'd have to regenerate the view with new fixed constants each time we
  set up the tables, but that sounds like it could work...
 
 That's exactly what we're doing, but using inherited tables instead of a 
 union view. With inheritance, there is no need to rebuild the view each 
 time a table is added or removed. Basically, in our application, tables 
 are partitioned by either month or week, depending on the type of data 
 involved, and queries are normally date qualified.
 
 We're not completely done with our data conversion (from a commercial 
 RDBMSi), but so far the results have been excellent. Similar to what 
 others have said in this thread, the conversion involved restructuring 
 the data to better suit Postgres, and the application (data 
 analysis/mining vs. the source system which is operational). As a result 
 we've compressed a > 1TB database down to ~0.4TB, and seen at least one 
 typical query reduced from ~9 minutes down to ~40 seconds.

Sounds interesting.

The performance gain comes from partition elimination of the inherited
tables under the root?

I take it the compression comes from use of arrays, avoiding the need
for additional rows and key overhead?

Best Regards, Simon Riggs

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Markus Schaber
Hi,

On Tue, 14 Sep 2004 22:10:04 -0700
Steve Atkins [EMAIL PROTECTED] wrote:

  Is there by any chance a set of functions to manage adding and removing
  partitions? Certainly this can be done by hand, but having a set of
  tools would make life much easier. I just looked but didn't see anything
  on GBorg.
 
 I've done a similar thing with time-segregated data by inheriting
 all the partition tables from an (empty) parent table.
 
 Adding a new partition is just a create table tablefoo () inherits(bigtable)
 and removing a partition just drop table tablefoo.

But you have to add table constraints restricting the time after adding
the partition?

Thanks,
Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
[EMAIL PROTECTED] wrote:
 Joe Conway [EMAIL PROTECTED] wrote on 15.09.2004, 06:30:24:
  We're not completely done with our data conversion (from a commercial
  RDBMSi), but so far the results have been excellent. Similar to what
  others have said in this thread, the conversion involved restructuring
  the data to better suit Postgres, and the application (data
  analysis/mining vs. the source system which is operational). As a result
  we've compressed a > 1TB database down to ~0.4TB, and seen at least one
  typical query reduced from ~9 minutes down to ~40 seconds.
 Sounds interesting.
 The performance gain comes from partition elimination of the inherited
 tables under the root?
 I take it the compression comes from use of arrays, avoiding the need
 for additional rows and key overhead?
Sorry, in trying to be concise I was not very clear. I'm using the term 
compression very generally here. I'll try to give a bit more background.

The original data source is a database schema designed for use by an 
operational application that my company sells to provide enhanced 
management of equipment that we also sell. The application needs to be 
very flexible in exactly what data it stores in order to be useful 
across a wide variety of equipment models and versions. In order to do 
that there is a very large central transaction table that stores 
name-value pairs in varchar columns. The name-value pairs come from 
parsed output of the equipment, and as such there is a fair amount of 
redundancy and unneeded data that ends up getting stored. At each 
installation in the field this table can get very large (> 1 billion 
rows). Additionally the application prematerializes a variety of 
summaries for use by the operators using the GUI.

We collect the data exported from each of the systems in the field and 
accumulate it in a single central database for data mining and analysis. 
This is the database that is actually being converted. By compression I 
really mean that unneeded and redundant data is being stripped out, and 
data known to be of a certain datatype is stored in that type instead of 
varchar (e.g. values known to be int are stored as int). Also the 
summaries are not being converted (although we do some post processing 
to create new materialized summaries).
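
To illustrate the idea with invented names (this is not the real schema,
just the shape of the transformation):

-- operational layout: parsed equipment output stored as name/value
-- pairs, everything varchar
create table txn_raw (
    equip_id   varchar(32),
    txn_time   varchar(32),
    name       varchar(64),
    value      varchar(255)
);

-- converted layout: known values stored in native types, one row per
-- reading instead of one row per name/value pair
create table txn_2004_02 (
    equip_id     int,
    txn_time     timestamp,
    error_count  int,     -- was name = 'error_count', value varchar
    temperature  float8   -- was name = 'temperature', value varchar
);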

My points in telling this were:
 - the use of inherited tables to partition this huge number of rows and
   yet allow simple query access to it seems to work well, at least in
   early validation tests
 - had we simply taken the original database and slammed it into
   Postgres with no further thought, we would not have seen the big
   improvements, and thus the project might have been seen as a failure
   (even though it saves substantial $)
Hope that's a bit more clear. I'm hoping to write up a more detailed 
case study once we've cut the Postgres system into production and the 
dust settles a bit.

Joe
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Josh Berkus
Joe,

   - the use of inherited tables to partition this huge number of rows and
 yet allow simple query access to it seems to work well, at least in
 early validation tests
   - had we simply taken the original database and slammed it into
 Postgres with no further thought, we would not have seen the big
 improvements, and thus the project might have been seen as a failure
 (even though it saves substantial $)

Any further thoughts on developing this into true table partitioning?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Josh Berkus wrote:
  - the use of inherited tables to partition this huge number of rows and
    yet allow simple query access to it seems to work well, at least in
    early validation tests
  - had we simply taken the original database and slammed it into
    Postgres with no further thought, we would not have seen the big
    improvements, and thus the project might have been seen as a failure
    (even though it saves substantial $)

 Any further thoughts on developing this into true table partitioning?
Just that I'd love to see it happen ;-)
Maybe someday I'll be able to find the time to work on it myself, but 
for the moment I'm satisfied with the workarounds we've made.

Joe
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] [GENERAL] disk performance benchmarks

2004-09-15 Thread Ron St-Pierre
Jeffrey W. Baker wrote:
 All these replies are really interesting, but the point is not that my
 RAIDs are too slow, or that my CPUs are too slow.  My point is that, for
 long stretches of time, my database doesn't come anywhere near using the
 capacity of the hardware.  And I think that's odd and would like to
 fix it.
 

What motherboard are you using, and what distro? Earlier you mentioned
that you're on Linux 2.6.7 and a 64-bit Opteron 244 machine with 8GB main
memory, with two 4-disk RAID5 arrays (one for the database, one for xlogs).

Also, did you have a chance to test performance before you implemented RAID?
Ron

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables?

2004-09-15 Thread Simon Riggs

Chris Browne [EMAIL PROTECTED] wrote on 15.09.2004, 04:34:53:
 [EMAIL PROTECTED] (Simon Riggs) writes:
  Well, it's fairly straightforward to auto-generate the UNION ALL view,
  and important as well, since it needs to be re-specified each time a
  new partition is loaded or an old one is cleared down. The main point
  is that the constant placed in front of each table must in some way
  relate to the data, to make it useful in querying. If it is just a
  unique constant, chosen at random, it won't do much for partition
  elimination. So, that tends to make the creation of the UNION ALL view
  an application/data specific thing.

 Ah, that's probably a good thought.

 When we used big UNION ALL views, it was with logging tables, where
 there wasn't really any meaningful distinction between partitions.

 So you say that if the VIEW contains, within it, meaningful constraint
 information, that can get applied to chop out irrelevant bits?

 That suggests a way of resurrecting the idea...

 Might we set up the view as:

 create view combination_of_logs as
   select * from table_1 where txn_date between 'this' and 'that'
union all
   select * from table_2 where txn_date between 'this2' and 'that2'
union all
   select * from table_3 where txn_date between 'this3' and 'that3'
union all
   select * from table_4 where txn_date between 'this4' and 'that4'
union all
... ad infinitum
union all
   select * from table_n where txn_date > 'start_of_partition_n';

 and expect that to help, as long as the query that hooks up to this
 has date constraints?


That way of phrasing the view can give you the right answer to the
query, but does not exclude partitions.

With the above way of doing things, you end up with predicate phrases of
the form ((PARTLIMITLO < partcol) AND (PARTLIMITHI > partcol) AND
(partcol > QUERYLIMITLO) AND (partcol < QUERYLIMITHI)).
...if the values in capitals are constants, then this should evaluate to
a true or false value for each partition table. The optimizer doesn't
yet do this.

If you specify the view the way I showed, then the predicate query
becomes a comparison of constants, which DOES get evaluated prior to
full execution... you will see this as a "One-Time Filter: false" in the
EXPLAIN output.
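
A minimal sketch of the difference, with illustrative names (the same
shape as the example quoted later in this thread):

create view bigtable (idate, col1) as
select 200409130800, col1 from table200409130800
union all
select 200409131000, col1 from table200409131000;

select count(*) from bigtable where idate = 200409131000;
-- in the first arm the predicate reduces to 200409130800 = 200409131000,
-- a constant-vs-constant comparison, which EXPLAIN reports as a one-time
-- filter evaluating to false, so table200409130800 is never scanned.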

The way you've phrased the view is the more obvious way to phrase it,
and I'd spent a few days trying to work out how to solve the algebra
above in code... but that was wasted effort.

Anyway, if you use constants you can still specify ranges and betweens
and have them work... hence my example showed date-like integers - but
I don't think it just applies to one datatype.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Simon Riggs

Joe,

Your application is very interesting. I've just read your OSCON paper. I'd
like to talk more about that. Very similar to Kalido.

...but back to partitioning momentarily: Does the performance gain come from
partition elimination of the inherited tables under the root?

Best Regards, Simon Riggs




---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Simon Riggs wrote:
 Joe,
 Your application is very interesting. I've just read your OSCON paper. I'd
 like to talk more about that. Very similar to Kalido.
 ...but back to partitioning momentarily: Does the performance gain come from
 partition elimination of the inherited tables under the root?
I think the major part of the performance gain comes from the fact that 
the source database has different needs in terms of partitioning 
criteria because of its different purpose. The data is basically 
partitioned by customer installation instead of by date. Our converted 
scheme partitions by date, which is in line with the analytical queries 
run at the corporate office. Again, this is an argument in favor of not 
simply porting what you're handed.

We might get similar query performance with a single large table and 
multiple partial indexes (e.g. one per month), but there would be one 
tradeoff and one disadvantage to that:
1) The indexes would need to be generated periodically -- this is a 
tradeoff since we currently need to create inherited tables at the same 
periodicity
2) It would be much more difficult to roll off a month's worth of data 
when needed. The general idea is that each month we create a new monthly 
table, then archive and drop the oldest monthly table. If all the data 
were in one big table we would have to delete many millions of rows from 
a (possibly) multibillion row table, and then vacuum that table -- no 
thanks ;-)
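
In sketch form, with invented table names (archiving here is just a COPY
to a file), the monthly roll looks like:

-- add next month's partition
create table foo_2004_10() inherits (foo);
create index foo_2004_10_idx1 on foo_2004_10(f2);

-- archive and drop the oldest month
copy foo_2003_10 to '/archive/foo_2003_10.dat';
drop table foo_2003_10;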

Joe
---(end of broadcast)---
TIP 8: explain analyze is your friend


[PERFORM] Partitioning

2004-09-15 Thread J. Andrew Rogers
On Tue, 2004-09-14 at 21:30, Joe Conway wrote:
 That's exactly what we're doing, but using inherited tables instead of a 
 union view. With inheritance, there is no need to rebuild the view each 
 time a table is added or removed. Basically, in our application, tables 
 are partitioned by either month or week, depending on the type of data 
 involved, and queries are normally date qualified.



We do something very similar, also using table inheritance and a lot of 
triggers to automatically generate partitions and so forth.  It works
pretty well, but it is a custom job every time I want to implement a
partitioned table.  You can save a lot on speed and space if you use it
to break up large tables with composite indexes, since you can drop
columns from the table depending on how you use it.  A big part of
performance gain is that the resulting partitions end up being more
well-ordered than the non-partitioned version, since inserts are hashed
to different partitions according to the key and hash function.  It is
kind of like a cheap and dirty real-time CLUSTER operation.  It also
lets you truncate, lock, and generally be heavy-handed with subsets of
the table without affecting the rest of the table.


I think generic table partitioning could pretty much be built on top of
existing capabilities with a small number of tweaks.

The main difference would be the ability to associate a partitioning
hash function with a table (probably defined inline at CREATE TABLE
time).  Something with syntax like:

...PARTITION ON 'date_trunc(''hour'',ts)'...

There would also probably need to be some type of metadata table to
associate specific hashes with partition table names.  Other than that,
the capabilities largely already exist, and managing the partition
hashing and association is the ugly part when rolling your own. 
Intercepting DML when necessary and making it behave correctly is
already pretty easy, but could probably be streamlined.
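
A guess at what that metadata might look like (names and layout invented
purely for illustration):

-- one row per existing partition of a partitioned parent table
create table partition_map (
    parent_table  name,
    hash_value    text,   -- e.g. the result of date_trunc('hour', ts)
    child_table   name,
    primary key (parent_table, hash_value)
);

-- e.g. ('events', '2004-09-15 14:00:00', 'events_2004091514')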


j. andrew rogers




---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Partitioning

2004-09-15 Thread Greg Stark

J. Andrew Rogers [EMAIL PROTECTED] writes:

 We do something very similar, also using table inheritance

I have a suspicion postgres's table inheritance will end up serving as a good
base for a partitioned table feature. Is it currently possible to query which
subtable a record came from though?

 A big part of performance gain is that the resulting partitions end up being
 more well-ordered than the non-partitioned version, since inserts are hashed
 to different partition according to the key and hash function. It is kind of
 like a cheap and dirty real-time CLUSTER operation. 

There is also one particular performance gain that cannot be obtained via
other means: A query that accesses a large percentage of a single partition
can use a sequential table scan of just that partition. This can be several
times faster than using an index scan which is the only option if all the data
is stored in a single large table.

This isn't an uncommon occurrence. Consider an accounting table partitioned by
accounting period. Any aggregate reports for a single accounting period fall
into this category. If you define your partitions well, that can often be
most or all of your reports.

Of course this applies equally if the query is accessing a small number of
partitions. A further refinement is to leverage the partitioning in GROUP BY
or ORDER BY clauses. If you're grouping by the partition key you can avoid a
large sort without having to resort to an index scan or even a hash. And of
course it's tempting to think about parallelization of such queries,
especially if the partitions are stored in separate table spaces on different
drives.

 It also lets you truncate, lock, and generally be heavy-handed with subsets
 of the table without affecting the rest of the table.

The biggest benefit by far is this management ability of being able to swap in
and out partitions in a single atomic transaction that doesn't require
extensive i/o.

In the application where we used them, on Oracle 8i, they were an absolute
life-saver.
They took a huge batch job that took several days to run in off-peak hours and
turned it into a single quick cron job that could run at peak hours. We were
able to cut the delay for our OLTP data showing up in the data warehouse from
about a week after extensive manual work to hours after a daily cron job.

   ...PARTITION ON 'date_trunc(''hour'',ts)'...
 
 There would also probably need to be some type of metadata table to
 associate specific hashes with partition table names.  Other than that,
 the capabilities largely already exist, and managing the partition
 hashing and association is the ugly part when rolling your own. 
 Intercepting DML when necessary and making it behave correctly is
 already pretty easy, but could probably be streamlined.

I would suggest you look at the Oracle syntax to handle this. They've already
gone through several iterations of implementations. The original Oracle 7
implementation was much as people have been describing where you had to define
a big UNION ALL view and enable an option to have the optimizer look for such
views and attempt to eliminate partitions.

In Oracle 8i they introduced first class partitions with commands to define
and manipulate them. You defined a high bound for each partition.

In Oracle 9 (or thereabouts, sometime after 8i at any rate) they introduced a
new form where you specify a specific constant value for each partition. This
seems to be more akin to how you're thinking about things.

The optimizer has several plan nodes specific for partitioned tables. It can
select a single known partition based on information present in the query. It
can also detect cases where it can be sure the query will only access a single
partition but won't be able to determine which until execution time based on
placeholder parameters, for cases like "WHERE partition_key = ?". It can also
detect cases like "WHERE partition_key BETWEEN ? AND ?" and "WHERE
partition_key IN (?,?,?)", or join clauses on partitions. It can also do some
magic things with GROUP BY partition_key and ORDER BY partition_key.

The work in the optimizer will be the most challenging part. In an ideal world
if the optimizer is very solid it will be possible to bring some partitions to
slow or even near-line storage media. As long as no live queries accidentally
access the wrong partitions the rest of the database need never know that the
data isn't readily available.


-- 
greg


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Iain wrote:
  That's exactly what we're doing, but using inherited tables instead of a
  union view. With inheritance, there is no need to rebuild the view each
  time a table is added or removed. Basically, in our application, tables
  are partitioned by either month or week, depending on the type of data
  involved, and queries are normally date qualified.
 That sounds interesting. I have to admit that I haven't touched inheritance
 in pg at all yet, so I find it hard to imagine how this would work. If you
 have a chance, would you mind elaborating on it just a little?
OK, see below:
=
create table foo(f1 int, f2 date, f3 float8);
create table foo_2004_01() inherits (foo);
create table foo_2004_02() inherits (foo);
create table foo_2004_03() inherits (foo);
create index foo_2004_01_idx1 on foo_2004_01(f2);
create index foo_2004_02_idx1 on foo_2004_02(f2);
create index foo_2004_03_idx1 on foo_2004_03(f2);
insert into foo_2004_02 values(1,'2004-feb-15',3.14);
 -- needed just for illustration since these are toy tables
set enable_seqscan to false;
explain analyze select * from foo where f2 = '2004-feb-15';
QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=100000000.00..100000061.32 rows=16 width=16) (actual 
time=0.224..0.310 rows=1 loops=1)
   ->  Append  (cost=100000000.00..100000061.32 rows=16 width=16) 
(actual time=0.214..0.294 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=100000000.00..100000022.50 rows=5 
width=16) (actual time=0.004..0.004 rows=0 loops=1)
               Filter: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.101..0.101 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo 
(cost=0.00..4.68 rows=1 width=16) (actual time=0.095..0.101 rows=1 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.066..0.066 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.582 ms
(11 rows)

create table foo_2004_04() inherits (foo);
create index foo_2004_04_idx1 on foo_2004_04(f2);
explain analyze select * from foo where f2 = '2004-feb-15';
QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=100000000.00..100000078.38 rows=21 width=16) (actual 
time=0.052..0.176 rows=1 loops=1)
   ->  Append  (cost=100000000.00..100000078.38 rows=21 width=16) 
(actual time=0.041..0.159 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=100000000.00..100000022.50 rows=5 
width=16) (actual time=0.004..0.004 rows=0 loops=1)
               Filter: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.012..0.012 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo 
(cost=0.00..4.68 rows=1 width=16) (actual time=0.016..0.022 rows=1 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.008..0.008 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_04_idx1 on foo_2004_04 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.095..0.095 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.443 ms
(13 rows)

For loading data, we COPY into foo, and have a trigger that redirects 
the rows to the appropriate partition.
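
The trigger itself isn't shown above; a minimal sketch for these toy
tables might look like the following (partition bounds hard-coded for
brevity):

create function foo_redirect() returns trigger as '
begin
    -- route each row to the child table covering its date
    if new.f2 >= date ''2004-02-01'' and new.f2 < date ''2004-03-01'' then
        insert into foo_2004_02 values (new.f1, new.f2, new.f3);
    elsif new.f2 >= date ''2004-03-01'' and new.f2 < date ''2004-04-01'' then
        insert into foo_2004_03 values (new.f1, new.f2, new.f3);
    else
        raise exception ''no partition for date %'', new.f2;
    end if;
    return null;  -- suppress the insert into the parent table itself
end;
' language plpgsql;

create trigger foo_redirect before insert on foo
    for each row execute procedure foo_redirect();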

Notice that the partitions which do not contain any data of interest are 
still probed for data, but since they have none it is very quick. In a 
real life example I got the following results just this afternoon:

 - aggregate row count = 471,849,665
 - total number of inherited tables = 216
   (many are future dated and therefore contain no data)
 - select one month's worth of data for one piece of equipment by serial
   number (49,257 rows) = 526.015 ms
Not too bad -- quick enough for my needs. BTW, this is using NFS mounted 
storage (NetApp NAS).

Joe
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Steve Atkins
On Wed, Sep 15, 2004 at 11:16:44AM +0200, Markus Schaber wrote:
 Hi,
 
 On Tue, 14 Sep 2004 22:10:04 -0700
 Steve Atkins [EMAIL PROTECTED] wrote:
 
   Is there by any chance a set of functions to manage adding and removing
   partitions? Certainly this can be done by hand, but having a set of
   tools would make life much easier. I just looked but didn't see anything
   on GBorg.
  
  I've done a similar thing with time-segregated data by inheriting
  all the partition tables from an (empty) parent table.
  
  Adding a new partition is just a create table tablefoo () inherits(bigtable)
  and removing a partition just drop table tablefoo.
 
 But you have to add table constraints restricting the time after adding
 the partition?

Uhm... unless I'm confused that's not a meaningful thing in this context.
There's no rule that's putting insertions into an inherited table - the
decision of which inherited table to insert into is made at application
level.

As I was using it to segregate data based on creation timestamp the
application just inserts into the 'newest' inherited table until it's
full, then creates a new inherited table.

I've no doubt you could set up rules to scatter inserted data across
a number of tables, but that's not something that's been applicable
to the problems I tend to work with, so I've not looked at it.
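
An untested sketch of what such a rule might look like, reusing the toy
tables from earlier in the thread (note that rules, unlike triggers, are
not applied to rows loaded via COPY):

create rule foo_feb as on insert to foo
    where new.f2 >= date '2004-02-01' and new.f2 < date '2004-03-01'
    do instead insert into foo_2004_02 values (new.f1, new.f2, new.f3);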

Cheers,
  Steve


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-15 Thread Mischa Sandberg
Simon Riggs wrote:
Jim C. Nasby
On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
PostgreSQL's functionality is in many ways similar to Oracle
Partitioning.
Loading up your data in many similar tables, then creating a view like:
CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...
will allow the PostgreSQL optimizer to eliminate partitions
from the query
when you run queries which include a predicate on the
partitioning_col, e.g.
select count(*) from bigtable where idate = 200409131000
The partitions are just tables, so no need for other management tools.
Oracle treats the partitions as sub-tables, so you need a range of commands
to add, swap etc the partitions of the main table.
A few years ago I wrote a federated query engine (wrapped as an ODBC 
driver) that had to handle thousands of contributors (partitions) to a 
pseudotable / VIEWofUNIONs. Joins did require some special handling in 
the optimizer, because of the huge number of crossproducts between 
different tables. It was definitely worth the effort at the time, 
because you need different strategies for: joining a partition to 
another partition on the same subserver; joining two large partitions on 
different servers; and joining a large partition on one server to a 
small one on another.

The differences may not be so great for a solitary server;
but they're still there, because of disparity in subtable sizes. The 
simplistic query plans tend to let you down, when you're dealing with 
honking warehouses.

I'm guessing that Oracle keeps per-subtable AND cross-all-subtables 
statistics, rather than building the latter from scratch in the course 
of evaluating the query plan. That's the one limitation I see in 
emulating their partitioned tables with Views.

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Mischa Sandberg
Googling 'upsert' (an Oraclism, I believe) will get you hits on Oracle's
and DB2's implementations of MERGE, which amounts to what is described
below (one mass UPDATE...FROM, one mass INSERT...WHERE NOT EXISTS).
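
For reference, a MERGE of that shape looks roughly like this in the
databases that support it (Oracle/DB2-style syntax, not valid in
Postgres; table names taken from Stephen's example below):

merge into X b
using Y a on (b.keyA = a.keyA and b.keyB = a.keyB)
when matched then
    update set colA = a.colA, colB = a.colB
when not matched then
    insert (keyA, keyB, colA, colB)
    values (a.keyA, a.keyB, a.colA, a.colB);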

No, you shouldn't iterate row-by-row through the temp table.
Whenever possible, try to do updates in one single (mass) operation.
Doing it that way gives the optimizer the best chance at amortizing
fixed costs, and batching operations.
-
In any database other than Postgres, I would recommend doing the
INSERT /followed by/ the UPDATE. That order looks wonky --- your update
ends up pointlessly operating on the rows just INSERTED. The trick is, 
UPDATE acquires and holds write locks (the rows were previously visible 
to other processes), while INSERT's write locks refer to rows that no 
other process could try to lock.

Stephen Frost wrote:
* Markus Schaber ([EMAIL PROTECTED]) wrote:
Generally, what is the fastest way for doing bulk processing of 
update-if-primary-key-matches-and-insert-otherwise operations?

This is a very good question, and I haven't seen much of an answer to it
yet.  I'm curious about the answer myself, actually.  In the more recent
SQL specs, from what I understand, this is essentially what the 'MERGE'
command is for.  This was recently added and unfortunately is not yet
supported in Postgres.  Hopefully it will be added soon.
Otherwise, what I've done is basically an update followed by an insert
using outer joins.  If there's something better, I'd love to hear about
it.  The statements looks something like:
update X
    set colA = a.colA,
        colB = a.colB
    from Y a
    where keyA = a.keyA and
          keyB = a.keyB;

insert into X
    select a.keyA,
           a.keyB,
           a.colA,
           a.colB
    from Y a left join X b
        using (keyA, keyB)
    where b.keyA is NULL and
          b.keyB is NULL;
With the appropriate indexes, this is pretty fast but I think a merge
would be much faster.
Thanks,
			Stephen
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Christopher Kings-Lynne
 insert into X
     select a.keyA,
            a.keyB,
            a.colA,
            a.colB
     from Y a left join X b
         using (keyA, keyB)
     where b.keyA is NULL and
           b.keyB is NULL;
 With the appropriate indexes, this is pretty fast but I think a merge
 would be much faster.
Problem is it's subject to race conditions if another process is 
inserting stuff at the same time...

Chris
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Iain
Hi Joe,

You went to quite a bit of effort; thanks, I have the picture now.

Using inheritance seems to be a useful refinement on top of the earlier
outlined approach using the UNION ALL view with appropriate predicates on the
condition used to do the partitioning. Having the individual partitions
derived from a parent table makes a lot of sense.

regards
Iain



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Christopher Browne wrote:
 In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote:
  That's exactly what we're doing, but using inherited tables instead of
  a union view. With inheritance, there is no need to rebuild the view
  each time a table is added or removed. Basically, in our application,
  tables are partitioned by either month or week, depending on the type
  of data involved, and queries are normally date qualified.

 Where does the constraint come in that'll allow most of the data to be
 excluded?

Not sure I follow this.

 Or is this just that the entries are all part of bigtable so that
 the self join is only 2-way?
We don't have a need for self-joins in our application. We do use a 
crosstab function to materialize some transposed views of the data, 
however. That allows us to avoid self-joins in the cases where we might 
otherwise need them.

Joe
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])