Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Hi, -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Michael Kleiser Sent: Tuesday, September 14, 2004 4:23 PM To: Leeuw van der, Tim Cc: Steinar H. Gunderson; PostgreSQL Performance List Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- What MySQL table type did you use? Was it MyISAM, which doesn't support transactions? I read that bulk inserts with this table type are very fast, and in a data warehouse one often doesn't need transactions. Although totally beyond the scope of this thread, we used InnoDB tables with MySQL because of the transaction support. regards, --Tim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote: That's exactly what we're doing, but using inherited tables instead of a union view. With inheritance, there is no need to rebuild the view each time a table is added or removed. Basically, in our application, tables are partitioned by either month or week, depending on the type of data involved, and queries are normally date qualified. Sounds interesting, and possibly usable. Where does the constraint come in that'll allow most of the data to be excluded? Or is this just that the entries are all part of bigtable so that the self join is only 2-way? -- let name=cbbrowne and tld=acm.org in name ^ @ ^ tld;; http://linuxfinances.info/info/advocacy.html Be humble. A lot happened before you were born. - Life's Little Instruction Book ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Joe Conway [EMAIL PROTECTED] wrote on 15.09.2004, 06:30:24: Chris Browne wrote: Might we set up the view as: create view combination_of_logs as select * from table_1 where txn_date between 'this' and 'that' union all select * from table_2 where txn_date between 'this2' and 'that2' union all select * from table_3 where txn_date between 'this3' and 'that3' union all select * from table_4 where txn_date between 'this4' and 'that4' union all ... ad infinitum union all select * from table_n where txn_date > 'start_of_partition_n'; and expect that to help, as long as the query that hooks up to this has date constraints? We'd have to regenerate the view with new fixed constants each time we set up the tables, but that sounds like it could work... That's exactly what we're doing, but using inherited tables instead of a union view. With inheritance, there is no need to rebuild the view each time a table is added or removed. Basically, in our application, tables are partitioned by either month or week, depending on the type of data involved, and queries are normally date qualified. We're not completely done with our data conversion (from a commercial RDBMSi), but so far the results have been excellent. Similar to what others have said in this thread, the conversion involved restructuring the data to better suit Postgres, and the application (data analysis/mining vs. the source system which is operational). As a result we've compressed a 1TB database down to ~0.4TB, and seen at least one typical query reduced from ~9 minutes down to ~40 seconds. Sounds interesting. The performance gain comes from partition elimination of the inherited tables under the root? I take it the compression comes from use of arrays, avoiding the need for additional rows and key overhead? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Hi, On Tue, 14 Sep 2004 22:10:04 -0700 Steve Atkins [EMAIL PROTECTED] wrote: Is there by any chance a set of functions to manage adding and removing partitions? Certainly this can be done by hand, but having a set of tools would make life much easier. I just looked but didn't see anything on GBorg. I've done a similar thing with time-segregated data by inheriting all the partition tables from an (empty) parent table. Adding a new partition is just a create table tablefoo () inherits(bigtable) and removing a partition just drop table tablefoo. But you have to add table constraints restricting the time after adding the partition? Thanks, Markus -- markus schaber | dipl. informatiker logi-track ag | rennweg 14-16 | ch 8001 zürich phone +41-43-888 62 52 | fax +41-43-888 62 53 mailto:[EMAIL PROTECTED] | www.logi-track.com ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
[EMAIL PROTECTED] wrote: Joe Conway [EMAIL PROTECTED] wrote on 15.09.2004, 06:30:24: We're not completely done with our data conversion (from a commercial RDBMSi), but so far the results have been excellent. Similar to what others have said in this thread, the conversion involved restructuring the data to better suit Postgres, and the application (data analysis/mining vs. the source system, which is operational). As a result we've compressed a 1TB database down to ~0.4TB, and seen at least one typical query reduced from ~9 minutes down to ~40 seconds. Sounds interesting. The performance gain comes from partition elimination of the inherited tables under the root? I take it the compression comes from use of arrays, avoiding the need for additional rows and key overhead? Sorry, in trying to be concise I was not very clear. I'm using the term compression very generally here. I'll try to give a bit more background. The original data source is a database schema designed for use by an operational application that my company sells to provide enhanced management of equipment that we also sell. The application needs to be very flexible in exactly what data it stores in order to be useful across a wide variety of equipment models and versions. In order to do that, there is a very large central transaction table that stores name-value pairs in varchar columns. The name-value pairs come from parsed output of the equipment, and as such there is a fair amount of redundancy and unneeded data that ends up getting stored. At each installation in the field this table can get very large (over a billion rows). Additionally, the application prematerializes a variety of summaries for use by the operators using the GUI. We collect the data exported from each of the systems in the field and accumulate it in a single central database for data mining and analysis. This is the database that is actually being converted.
By compression I really mean that unneeded and redundant data is being stripped out, and data known to be of a certain datatype is stored in that type instead of varchar (e.g. values known to be int are stored as int). Also the summaries are not being converted (although we do some post processing to create new materialized summaries). My points in telling this were:
- the use of inherited tables to partition this huge number of rows and yet allow simple query access to it seems to work well, at least in early validation tests
- had we simply taken the original database and slammed it into Postgres with no further thought, we would not have seen the big improvements, and thus the project might have been seen as a failure (even though it saves substantial $)
Hope that's a bit clearer. I'm hoping to write up a more detailed case study once we've cut the Postgres system into production and the dust settles a bit. Joe ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
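The "compression" Joe describes above is essentially schema cleanup: drop the redundant name-value pairs and store values whose type is known in a native column type instead of varchar. A toy Python model of that conversion (all field names here are hypothetical illustrations, not the actual schema):

```python
# Toy model of the conversion described above: the operational system stores
# everything as name/value varchar pairs; the warehouse conversion drops
# unneeded names and coerces known-typed values to native types.

# Raw name/value rows, as the operational system might export them.
raw_rows = [
    {"name": "error_count", "value": "42"},     # known to be an int
    {"name": "temperature", "value": "71.5"},   # known to be a float
    {"name": "raw_dump", "value": "AA55AA55"},  # redundant, not needed
    {"name": "serial_no", "value": "SN-1001"},  # genuinely a string
]

# Which names to keep, and how to store them (the "known datatype" map).
KEEP = {"error_count": int, "temperature": float, "serial_no": str}

def convert(rows):
    """Strip unneeded names and coerce kept values to their known types."""
    out = {}
    for row in rows:
        caster = KEEP.get(row["name"])
        if caster is not None:
            out[row["name"]] = caster(row["value"])
    return out

record = convert(raw_rows)
```

The space savings come both from the dropped rows and from typed storage (an int is far smaller than its varchar spelling plus the repeated name).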
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Joe, - the use of inherited tables to partition this huge number of rows and yet allow simple query access to it seems to work well, at least in early validation tests - had we simply taken the original database and slammed it into Postgres with no further thought, we would not have seen the big improvements, and thus the project might have been seen as a failure (even though it saves substantial $) Any further thoughts on developing this into true table partitioning? -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Josh Berkus wrote: - the use of inherited tables to partition this huge number of rows and yet allow simple query access to it seems to work well, at least in early validation tests - had we simply taken the original database and slammed it into Postgres with no further thought, we would not have seen the big improvements, and thus the project might have been seen as a failure (even though it saves substantial $) Any further thoughts on developing this into true table partitioning? Just that I'd love to see it happen ;-) Maybe someday I'll be able to find the time to work on it myself, but for the moment I'm satisfied with the workarounds we've made. Joe ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] [GENERAL] disk performance benchmarks
Jeffrey W. Baker wrote: All these replies are really interesting, but the point is not that my RAIDs are too slow, or that my CPUs are too slow. My point is that, for long stretches of time, my database doesn't come anywhere near using the capacity of the hardware. And I think that's odd and would like to fix it. What motherboard are you using, and what distro? Earlier you mentioned that you're on Linux 2.6.7 and a 64-bit Opteron 244 machine with 8GB main memory, and two 4-disk RAID5 arrays (one for the database, one for the xlogs). Also, did you have a chance to test performance before you implemented RAID? Ron ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables?
Chris Browne [EMAIL PROTECTED] wrote on 15.09.2004, 04:34:53: [EMAIL PROTECTED] (Simon Riggs) writes: Well, it's fairly straightforward to auto-generate the UNION ALL view, and important as well, since it needs to be re-specified each time a new partition is loaded or an old one is cleared down. The main point is that the constant placed in front of each table must in some way relate to the data, to make it useful in querying. If it is just a unique constant, chosen at random, it won't do much for partition elimination. So, that tends to make the creation of the UNION ALL view an application/data specific thing. Ah, that's probably a good thought. When we used big UNION ALL views, it was with logging tables, where there wasn't really any meaningful distinction between partitions. So you say that if the VIEW contains, within it, meaningful constraint information, that can get applied to chop out irrelevant bits? That suggests a way of resurrecting the idea... Might we set up the view as: create view combination_of_logs as select * from table_1 where txn_date between 'this' and 'that' union all select * from table_2 where txn_date between 'this2' and 'that2' union all select * from table_3 where txn_date between 'this3' and 'that3' union all select * from table_4 where txn_date between 'this4' and 'that4' union all ... ad infinitum union all select * from table_n where txn_date > 'start_of_partition_n'; and expect that to help, as long as the query that hooks up to this has date constraints? That way of phrasing the view can give you the right answer to the query, but does not exclude partitions. With the above way of doing things, you end up with predicate phrases of the form ((PARTLIMITLO < partcol) AND (PARTLIMITHI > partcol) AND (partcol > QUERYLIMITLO) AND (partcol < QUERYLIMITHI)) ...if the values in capitals are constants, then this should evaluate to a true or false value for each partition table.
The optimizer doesn't yet do this. If you specify the view the way I showed, then the predicate becomes a comparison of constants, which DOES get evaluated prior to full execution: you will see this as a One-Time Filter: false in the EXPLAIN. The way you've phrased the view is the more obvious way to phrase it, and I'd spent a few days trying to work out how to solve the algebra above in code, but that was wasted effort. Anyway, if you use constants you can still specify ranges and betweens and have them work... hence my example showed date-like integers - but I don't think it just applies to one datatype. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
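The constant-folding Simon describes can be modeled outside SQL: each UNION ALL arm carries fixed partition bounds, the query supplies its own constant bounds, and since the combined per-arm predicate involves only constants, it reduces to true or false before any rows are read. A small Python sketch (partition names and date-like integer bounds are hypothetical):

```python
# Model of partition elimination via constant predicates: an arm of the
# UNION ALL view can return rows only if its fixed bound range overlaps the
# query's constant bound range. Everything here folds at "plan time".

def arm_survives(part_lo, part_hi, query_lo, query_hi):
    """Fold (part_lo <= x <= part_hi) AND (query_lo <= x <= query_hi)
    over constants: true iff the two ranges overlap at all."""
    return part_lo <= query_hi and query_lo <= part_hi

# Hypothetical weekly partitions keyed by date-like integers (YYYYMMDD).
partitions = {
    "table_1": (20040901, 20040907),
    "table_2": (20040908, 20040914),
    "table_3": (20040915, 20040921),
}

# Query: WHERE txn_date BETWEEN 20040910 AND 20040912
live = [name for name, (lo, hi) in partitions.items()
        if arm_survives(lo, hi, 20040910, 20040912)]
```

Arms that fold to false are the ones Postgres reports as a one-time filter that never executes; only `live` arms are actually scanned.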
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Joe, Your application is very interesting. I've just read your OSCON paper. I'd like to talk more about that. Very similar to Kalido. ...but back to partitioning momentarily: Does the performance gain come from partition elimination of the inherited tables under the root? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Simon Riggs wrote: Joe, Your application is very interesting. I've just read your OSCON paper. I'd like to talk more about that. Very similar to Kalido. ...but back to partitioning momentarily: Does the performance gain come from partition elimination of the inherited tables under the root? I think the major part of the performance gain comes from the fact that the source database has different needs in terms of partitioning criteria because of its different purpose. The data is basically partitioned by customer installation instead of by date. Our converted scheme partitions by date, which is in line with the analytical queries run at the corporate office. Again, this is an argument in favor of not simply porting what you're handed. We might get similar query performance with a single large table and multiple partial indexes (e.g. one per month), but there would be one tradeoff and one disadvantage to that:
1) The indexes would need to be generated periodically -- this is a tradeoff, since we currently need to create inherited tables at the same periodicity
2) It would be much more difficult to roll off a month's worth of data when needed. The general idea is that each month we create a new monthly table, then archive and drop the oldest monthly table. If all the data were in one big table we would have to delete many millions of rows from a (possibly) multibillion row table, and then vacuum that table -- no thanks ;-)
Joe ---(end of broadcast)--- TIP 8: explain analyze is your friend
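The monthly roll-off Joe describes is why partitioning beats one big table for data retention: removing a month is a cheap DROP TABLE rather than a multimillion-row DELETE followed by a VACUUM. A toy Python model of the rotation (partition names are hypothetical, matching the foo_YYYY_MM style used later in the thread):

```python
# Toy model of monthly partition rotation: each month a new child table is
# created and the oldest is dropped. Dropping a partition is O(1) here,
# just as DROP TABLE is cheap compared to a mass DELETE + VACUUM.
from collections import OrderedDict

partitions = OrderedDict()  # partition name -> list of rows, oldest first

def add_month(name):
    """Analogous to CREATE TABLE name () INHERITS (foo)."""
    partitions[name] = []

def roll_off():
    """Drop the oldest monthly partition (analogous to DROP TABLE)."""
    oldest = next(iter(partitions))
    del partitions[oldest]
    return oldest

for m in ("foo_2004_01", "foo_2004_02", "foo_2004_03"):
    add_month(m)
add_month("foo_2004_04")   # new month arrives...
dropped = roll_off()       # ...oldest month is archived and dropped
```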
[PERFORM] Partitioning
On Tue, 2004-09-14 at 21:30, Joe Conway wrote: That's exactly what we're doing, but using inherited tables instead of a union view. With inheritance, there is no need to rebuild the view each time a table is added or removed. Basically, in our application, tables are partitioned by either month or week, depending on the type of data involved, and queries are normally date qualified. We do something very similar, also using table inheritance and a lot of triggers to automatically generate partitions and so forth. It works pretty well, but it is a custom job every time I want to implement a partitioned table. You can save a lot on speed and space if you use it to break up large tables with composite indexes, since you can drop columns from the table depending on how you use it. A big part of the performance gain is that the resulting partitions end up being better ordered than the non-partitioned version, since inserts are hashed to different partitions according to the key and hash function. It is kind of like a cheap and dirty real-time CLUSTER operation. It also lets you truncate, lock, and generally be heavy-handed with subsets of the table without affecting the rest of the table. I think generic table partitioning could pretty much be built on top of existing capabilities with a small number of tweaks. The main difference would be the ability to associate a partitioning hash function with a table (probably defined inline at CREATE TABLE time). Something with syntax like: ...PARTITION ON 'date_trunc(''hour'',ts)'... There would also probably need to be some type of metadata table to associate specific hashes with partition table names. Other than that, the capabilities largely already exist, and managing the partition hashing and association is the ugly part when rolling your own. Intercepting DML when necessary and making it behave correctly is already pretty easy, but could probably be streamlined. J. Andrew Rogers ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
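The partition-hash idea above (a function like date_trunc('hour', ts) routing each insert to a partition, so rows land pre-clustered by the key) can be sketched in Python. The hash function and payloads are hypothetical illustrations:

```python
# Sketch of hash-routed partitioning: a partitioning function (here,
# truncating a timestamp to the hour, like date_trunc('hour', ts)) maps
# each inserted row to a partition. Rows end up grouped by the key, which
# is the "cheap and dirty real-time CLUSTER" effect described above.
from datetime import datetime

partitions = {}  # partition key -> list of (ts, payload) rows

def part_key(ts):
    """The partitioning hash: truncate to the hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

def insert(ts, payload):
    """Route the row to the partition its key hashes to."""
    partitions.setdefault(part_key(ts), []).append((ts, payload))

insert(datetime(2004, 9, 15, 10, 5), "a")
insert(datetime(2004, 9, 15, 10, 55), "b")
insert(datetime(2004, 9, 15, 11, 1), "c")
```

A metadata table, as Rogers suggests, would simply persist the mapping from each partition key to the physical child-table name.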
Re: [PERFORM] Partitioning
J. Andrew Rogers [EMAIL PROTECTED] writes: We do something very similar, also using table inheritance I have a suspicion postgres's table inheritance will end up serving as a good base for a partitioned table feature. Is it currently possible to query which subtable a record came from, though? A big part of the performance gain is that the resulting partitions end up being better ordered than the non-partitioned version, since inserts are hashed to different partitions according to the key and hash function. It is kind of like a cheap and dirty real-time CLUSTER operation. There is also one particular performance gain that cannot be obtained via other means: A query that accesses a large percentage of a single partition can use a sequential table scan of just that partition. This can be several times faster than using an index scan, which is the only option if all the data is stored in a single large table. This isn't an uncommon occurrence. Consider an accounting table partitioned by accounting period. Any aggregate reports for a single accounting period fall into this category. If you define your partitions well, that can often be most or all of your reports. Of course this applies equally if the query is accessing a small number of partitions. A further refinement is to leverage the partitioning in GROUP BY or ORDER BY clauses. If you're grouping by the partition key you can avoid a large sort without having to resort to an index scan or even a hash. And of course it's tempting to think about parallelization of such queries, especially if the partitions are stored in separate table spaces on different drives. It also lets you truncate, lock, and generally be heavy-handed with subsets of the table without affecting the rest of the table. The biggest benefit by far is this management ability of being able to swap in and out partitions in a single atomic transaction that doesn't require extensive i/o.
In the application where we used them, on Oracle 8i, they were an absolute life-saver. They took a huge batch job that took several days to run in off-peak hours and turned it into a single quick cron job that could run at peak hours. We were able to cut the delay for our OLTP data showing up in the data warehouse from about a week, after extensive manual work, to hours, after a daily cron job. ...PARTITION ON 'date_trunc(''hour'',ts)'... There would also probably need to be some type of metadata table to associate specific hashes with partition table names. Other than that, the capabilities largely already exist, and managing the partition hashing and association is the ugly part when rolling your own. Intercepting DML when necessary and making it behave correctly is already pretty easy, but could probably be streamlined. I would suggest you look at the Oracle syntax to handle this. They've already gone through several iterations of implementations. The original Oracle 7 implementation was much as people have been describing, where you had to define a big UNION ALL view and enable an option to have the optimizer look for such views and attempt to eliminate partitions. In Oracle 8i they introduced first-class partitions, with commands to define and manipulate them. You defined a high bound for each partition. In Oracle 9 (or thereabouts, sometime after 8i at any rate) they introduced a new form where you specify a specific constant value for each partition. This seems to be more akin to how you're thinking about things. The optimizer has several plan nodes specific to partitioned tables. It can select a single known partition based on information present in the query. It can also detect cases where it can be sure the query will only access a single partition but won't be able to determine which until execution time, based on placeholder parameters for cases like WHERE partition_key = ?. It can also detect cases like WHERE partition_key between ? and ? and WHERE partition_key IN (?,?,?), or join clauses on partitions. It can also do some magic things with GROUP BY partition_key and ORDER BY partition_key. The work in the optimizer will be the most challenging part. In an ideal world, if the optimizer is very solid it will be possible to bring some partitions to slow or even near-line storage media. As long as no live queries accidentally access the wrong partitions, the rest of the database need never know that the data isn't readily available. -- greg ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Iain wrote: That's exactly what we're doing, but using inherited tables instead of a union view. With inheritance, there is no need to rebuild the view each time a table is added or removed. Basically, in our application, tables are partitioned by either month or week, depending on the type of data involved, and queries are normally date qualified. That sounds interesting. I have to admit that I haven't touched inheritance in pg at all yet so I find it hard to imagine how this would work. If you have a chance, would you mind elaborating on it just a little? OK, see below:
=
create table foo(f1 int, f2 date, f3 float8);
create table foo_2004_01() inherits (foo);
create table foo_2004_02() inherits (foo);
create table foo_2004_03() inherits (foo);
create index foo_2004_01_idx1 on foo_2004_01(f2);
create index foo_2004_02_idx1 on foo_2004_02(f2);
create index foo_2004_03_idx1 on foo_2004_03(f2);
insert into foo_2004_02 values(1,'2004-feb-15',3.14);
-- needed just for illustration since these are toy tables
set enable_seqscan to false;
explain analyze select * from foo where f2 = '2004-feb-15';
QUERY PLAN
 Result  (cost=1.00..10061.32 rows=16 width=16) (actual time=0.224..0.310 rows=1 loops=1)
   ->  Append  (cost=1.00..10061.32 rows=16 width=16) (actual time=0.214..0.294 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=1.00..10022.50 rows=5 width=16) (actual time=0.004..0.004 rows=0 loops=1)
               Filter: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.101..0.101 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo  (cost=0.00..4.68 rows=1 width=16) (actual time=0.095..0.101 rows=1 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.066..0.066 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.582 ms
(11 rows)
create table foo_2004_04() inherits (foo);
create index foo_2004_04_idx1 on foo_2004_04(f2);
explain analyze select * from foo where f2 = '2004-feb-15';
QUERY PLAN
 Result  (cost=1.00..10078.38 rows=21 width=16) (actual time=0.052..0.176 rows=1 loops=1)
   ->  Append  (cost=1.00..10078.38 rows=21 width=16) (actual time=0.041..0.159 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=1.00..10022.50 rows=5 width=16) (actual time=0.004..0.004 rows=0 loops=1)
               Filter: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.012..0.012 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo  (cost=0.00..4.68 rows=1 width=16) (actual time=0.016..0.022 rows=1 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.008..0.008 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_04_idx1 on foo_2004_04 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.095..0.095 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.443 ms
(13 rows)
For loading data, we COPY into foo, and have a trigger that redirects the rows to the appropriate partition. Notice that the partitions which do not contain any data of interest are still probed for data, but since they have none it is very quick. In a real life example I got the following results just this afternoon:
- aggregate row count = 471,849,665
- total number of inherited tables = 216 (many are future dated and therefore contain no data)
- select one month's worth of data for one piece of equipment by serial number (49,257 rows) = 526.015 ms
Not too bad -- quick enough for my needs. BTW, this is using NFS mounted storage (NetApp NAS). Joe ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
On Wed, Sep 15, 2004 at 11:16:44AM +0200, Markus Schaber wrote: Hi, On Tue, 14 Sep 2004 22:10:04 -0700 Steve Atkins [EMAIL PROTECTED] wrote: Is there by any chance a set of functions to manage adding and removing partitions? Certainly this can be done by hand, but having a set of tools would make life much easier. I just looked but didn't see anything on GBorg. I've done a similar thing with time-segregated data by inheriting all the partition tables from an (empty) parent table. Adding a new partition is just a create table tablefoo () inherits(bigtable) and removing a partition just drop table tablefoo. But you have to add table constraints restricting the time after adding the partition? Uhm... unless I'm confused, that's not a meaningful thing in this context. There's no rule that's putting insertions into an inherited table - the decision of which inherited table to insert into is made at application level. As I was using it to segregate data based on creation timestamp, the application just inserts into the 'newest' inherited table until it's full, then creates a new inherited table. I've no doubt you could set up rules to scatter inserted data across a number of tables, but that's not something that's been applicable for the problems I tend to work with, so I've not looked at it. Cheers, Steve ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
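Steve's application-level scheme (insert into the newest inherited table until it is full, then create a fresh one) can be sketched in a few lines. The size cap is a hypothetical stand-in for whatever "full" means in his setup:

```python
# Sketch of the application-level partition scheme above: all inserts go
# to the newest partition; when it hits a cap, the application "creates"
# a new inherited table (here, just a new list) and switches to it.
CAP = 3          # rows per partition; tiny, purely for illustration
tables = [[]]    # list of partitions; the last one is the "newest"

def insert(row):
    if len(tables[-1]) >= CAP:   # current partition is full:
        tables.append([])        # CREATE TABLE ... INHERITS (bigtable)
    tables[-1].append(row)

for i in range(7):
    insert(i)
```

Note there is no trigger or rule anywhere: the routing decision lives entirely in the application, which is exactly why (as Steve says) no CHECK constraint is needed just to make inserts land in the right place.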
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables
Simon Riggs wrote: Jim C. Nasby On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote: PostgreSQL's functionality is in many ways similar to Oracle Partitioning. Loading up your data in many similar tables, then creating a view like: CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS SELECT 200409130800, col1, col2, col3... FROM table200409130800 UNION ALL SELECT 200409131000, col1, col2, col3... FROM table200409131000 UNION ALL SELECT 200409131200, col1, col2, col3... FROM table200409131200 ...etc... will allow the PostgreSQL optimizer to eliminate partitions from the query when you run queries which include a predicate on the partitioning_col, e.g. select count(*) from bigtable where idate = 200409131000 The partitions are just tables, so no need for other management tools. Oracle treats the partitions as sub-tables, so you need a range of commands to add, swap etc the partitions of the main table. A few years ago I wrote a federated query engine (wrapped as an ODBC driver) that had to handle thousands of contributors (partitions) to a pseudotable / VIEWofUNIONs. Joins did require some special handling in the optimizer, because of the huge number of crossproducts between different tables. It was definitely worth the effort at the time, because you need different strategies for: joining a partition to another partition on the same subserver; joining two large partitions on different servers; and joining a large partition on one server to a small one on another. The differences may not be so great for a solitary server; but they're still there, because of disparity in subtable sizes. The simplistic query plans tend to let you down, when you're dealing with honking warehouses. I'm guessing that Oracle keeps per-subtable AND cross-all-subtables statistics, rather than building the latter from scratch in the course of evaluating the query plan. That's the one limitation I see in emulating their partitioned tables with Views. 
---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Googling 'upsert' (an Oraclism, I believe) will get you hits on Oracle and DB2's implementation of MERGE, which does what amounts to what is described below (one mass UPDATE...FROM, one mass INSERT...WHERE NOT EXISTS). No, you shouldn't iterate row-by-row through the temp table. Whenever possible, try to do updates in one single (mass) operation. Doing it that way gives the optimizer the best chance at amortizing fixed costs and batching operations. - In any database other than Postgres, I would recommend doing the INSERT /followed by/ the UPDATE. That order looks wonky --- your update ends up pointlessly operating on the rows just INSERTED. The trick is, UPDATE acquires and holds write locks (the rows were previously visible to other processes), while INSERT's write locks refer to rows that no other process could try to lock. Stephen Frost wrote: * Markus Schaber ([EMAIL PROTECTED]) wrote: Generally, what is the fastest way for doing bulk processing of update-if-primary-key-matches-and-insert-otherwise operations? This is a very good question, and I haven't seen much of an answer to it yet. I'm curious about the answer myself, actually. In the more recent SQL specs, from what I understand, this is essentially what the 'MERGE' command is for. This was recently added and unfortunately is not yet supported in Postgres. Hopefully it will be added soon. Otherwise, what I've done is basically an update followed by an insert using outer joins. If there's something better, I'd love to hear about it. The statements look something like: update X set colA = a.colA, colB = a.colB from Y a where keyA = a.keyA and keyB = a.keyB; insert into X select a.keyA, a.keyB, a.colA, a.colB from Y a left join X b using (keyA, keyB) where b.keyA is NULL and b.keyB is NULL; With the appropriate indexes, this is pretty fast, but I think a merge would be much faster. Thanks, Stephen ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
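The two-statement upsert Stephen shows (mass UPDATE from the staging table, then mass INSERT of the rows that found no match) can be modeled with plain dicts to make the set-based logic explicit. This mirrors only the logic, not Postgres locking or concurrency behavior, and the table/column names are hypothetical:

```python
# Model of the update-then-insert upsert above, applied to dicts:
# target stands in for table X (keyed by (keyA, keyB)), staging for Y.
target = {("a", 1): {"colA": "old"}}
staging = {("a", 1): {"colA": "new"},     # matches an existing key
           ("b", 2): {"colA": "fresh"}}   # no match: must be inserted

# Step 1: UPDATE X SET ... FROM Y WHERE the keys match.
for key, cols in staging.items():
    if key in target:
        target[key].update(cols)

# Step 2: INSERT INTO X SELECT ... FROM Y LEFT JOIN X ... WHERE X.key IS NULL
# (i.e., only the staging rows that found no partner in step 1).
for key, cols in staging.items():
    if key not in target:
        target[key] = dict(cols)
```

Doing the INSERT first would make step 1 pointlessly re-update the freshly inserted rows, which is the ordering point raised at the top of this message.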
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
insert into X select a.keyA, a.keyB, a.colA, a.colB from Y a left join X b using (keyA, keyB) where b.keyA is NULL and b.keyB is NULL; With the appropriate indexes, this is pretty fast but I think a merge would be much faster. Problem is it's subject to race conditions if another process is inserting stuff at the same time... Chris ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Hi Joe, You went to quite a bit of effort, thanks I have the picture now. Using inheritence seems to be a useful refinement on top of the earlier outlined aproach using the UNION ALL view with appropriate predicates on the condition used to do the partitioning. Having the individual partitions derived from a parent table makes a lot of sense. regards Iain - Original Message - From: Joe Conway [EMAIL PROTECTED] To: Iain [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Thursday, September 16, 2004 1:07 PM Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- Iain wrote: That's exactly what we're doing, but using inherited tables instead of a union view. With inheritance, there is no need to rebuild the view each time a table is added or removed. Basically, in our application, tables are partitioned by either month or week, depending on the type of data involved, and queries are normally date qualified. That sounds interesting. I have to admit that I havn't touched iheritance in pg at all yet so I find it hard to imagine how this would work. If you have a chance, would you mind elaborating on it just a little? 
OK, see below:
=================

create table foo(f1 int, f2 date, f3 float8);
create table foo_2004_01() inherits (foo);
create table foo_2004_02() inherits (foo);
create table foo_2004_03() inherits (foo);

create index foo_2004_01_idx1 on foo_2004_01(f2);
create index foo_2004_02_idx1 on foo_2004_02(f2);
create index foo_2004_03_idx1 on foo_2004_03(f2);

insert into foo_2004_02 values(1,'2004-feb-15',3.14);

-- needed just for illustration since these are toy tables
set enable_seqscan to false;

explain analyze select * from foo where f2 = '2004-feb-15';
                                    QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=1.00..10061.32 rows=16 width=16) (actual time=0.224..0.310 rows=1 loops=1)
   ->  Append  (cost=1.00..10061.32 rows=16 width=16) (actual time=0.214..0.294 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=1.00..10022.50 rows=5 width=16) (actual time=0.004..0.004 rows=0 loops=1)
               Filter: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.101..0.101 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo  (cost=0.00..4.68 rows=1 width=16) (actual time=0.095..0.101 rows=1 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.066..0.066 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.582 ms
(11 rows)

create table foo_2004_04() inherits (foo);
create index foo_2004_04_idx1 on foo_2004_04(f2);

explain analyze select * from foo where f2 = '2004-feb-15';
                                    QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=1.00..10078.38 rows=21 width=16) (actual time=0.052..0.176 rows=1 loops=1)
   ->  Append  (cost=1.00..10078.38 rows=21 width=16) (actual time=0.041..0.159 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=1.00..10022.50 rows=5 width=16) (actual time=0.004..0.004 rows=0 loops=1)
               Filter: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.012..0.012 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo  (cost=0.00..4.68 rows=1 width=16) (actual time=0.016..0.022 rows=1 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.008..0.008 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
         ->  Index Scan using foo_2004_04_idx1 on foo_2004_04 foo  (cost=0.00..17.07 rows=5 width=16) (actual time=0.095..0.095 rows=0 loops=1)
               Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.443 ms
(13 rows)

For loading data, we COPY into foo, and have a trigger that redirects the rows to the appropriate partition. Notice that the partitions which do not contain any data of interest are still probed, but since they have none it is very quick. In a real-life example I got the following results just this afternoon:

- aggregate row count = 471,849,665
- total number of inherited tables = 216 (many are future dated and therefore contain no data)
- select one month's worth of data for
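The COPY-redirection trigger mentioned above isn't shown in the message. A minimal sketch of what such a trigger might look like for the toy foo tables, assuming a foo_YYYY_MM partition-naming scheme (the routing logic and function name here are my illustration, not the poster's actual code):

```sql
create or replace function foo_insert_redirect() returns trigger as $$
declare
    part text;
begin
    -- derive the partition name from the row's date, e.g. foo_2004_02
    part := 'foo_' || to_char(new.f2, 'YYYY_MM');
    execute 'insert into ' || part || ' values ('
            || coalesce(new.f1::text, 'NULL') || ', '
            || quote_literal(new.f2) || ', '
            || coalesce(new.f3::text, 'NULL') || ')';
    return NULL;  -- suppress the insert into the parent table itself
end;
$$ language plpgsql;

create trigger foo_redirect before insert on foo
    for each row execute procedure foo_insert_redirect();
```

COPY into foo fires the row-level trigger for each incoming row, so the parent table stays empty while each row lands in its dated partition. The per-row EXECUTE is not free, though, so for the biggest loads COPYing directly into the target partition avoids the trigger overhead entirely.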
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Christopher Browne wrote:
In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote:
That's exactly what we're doing, but using inherited tables instead of a union view. With inheritance, there is no need to rebuild the view each time a table is added or removed. Basically, in our application, tables are partitioned by either month or week, depending on the type of data involved, and queries are normally date qualified.

Where does the constraint come in that'll allow most of the data to be excluded?

Not sure I follow this.

Or is this just that the entries are all part of bigtable so that the self join is only 2-way?

We don't have a need for self-joins in our application. We do use a crosstab function to materialize some transposed views of the data, however, which allows us to avoid self-joins in the cases where we might otherwise need them.

Joe