[PERFORM] Auto performance tuning?
I have to say I've been really impressed with the quality and diversity of tools here to increase performance for PostgreSQL. But I keep seeing a lot of the same basic things repeated again and again. Has anyone looked into a "smart" or auto-adjusting resource manager for postgres?

Consider, for instance, that you set it to aggressively use system resources. It would then do things like notice that it needs more work mem after profiling a few thousand queries and raise it for you, or that a specific index or table should be moved to a different spindle and do it in the background, or that query plans keep screwing up on a particular table so it knows to up the amount of statistics it keeps on that table.

Is this a crazy idea or something someone's already working on?

Orion
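For context, here is roughly what such a tool would be automating — the manual knobs as they stand today (a sketch with made-up names and values; tablespaces require 8.0 or later):

-- raise per-sort memory for a session (the setting is sort_mem
-- before 8.0 and work_mem from 8.0 on; the value here is in KB)
SET sort_mem = 16384;

-- keep more statistics on a table whose plans keep going wrong,
-- then re-gather them
ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 200;
ANALYZE some_table;

-- move a table to a different spindle (8.0+ tablespaces)
CREATE TABLESPACE fastdisk LOCATION '/mnt/disk2/pgdata';
ALTER TABLE some_table SET TABLESPACE fastdisk;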
Re: [PERFORM] Large Database Design Help
I just wanted to thank everyone for your input on my question. You've given me a lot of tools to solve my problem here.

Orion
[PERFORM] Large Database Design Help
Hello All,

I've inherited a postgresql database that I would like to refactor. It was originally designed for Postgres 7.0 on a PIII 500MHz, and some design decisions were made that don't make sense any more. Here's the problem:

1) The database is very large; the largest table has 40 million tuples.

2) The database needs to import tens of thousands of tuples each night quickly. The current method is VERY slow.

3) I can't import new records with a COPY or drop my indexes, because some of them are new records (INSERTS) and some are altered records (UPDATES), and the only way I can think of to identify these records is to perform a select for each record.

Here is how the database is currently laid out, and you'll see why I have a problem with it:

1) The data is easily partitionable by client ID. In an attempt to keep the indexes small and the inserts fast, one table was made per client ID. Thus the primary table in the database (the one with 40 million tuples) is really 133 tables, each ending with a three digit suffix. The largest of these client tables has 8 million of the 40 million tuples. The system started with around a half dozen clients and is now a huge pain to manage with so many tables. I was hoping new hardware and new postgres features would allow for this data to be merged safely into a single table.

2) The imports are not done inside of transactions. I'm assuming the system designers excluded this for a reason. Will I run into problems performing tens of thousands of inserts and updates inside a single transaction?

3) The current code that bulk loads data into the database is a loop that looks like this:

$result = exe("INSERT INTO $table ($name_str) SELECT $val_str
               WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
if ($result == 0) {
    $result = exe("UPDATE $table SET $non_keys WHERE $keys");
}

Is there a faster way to bulk load data when it's not known ahead of time if it's a new record or an updated record? (See the staging-table sketch below.)

What I would LIKE to do, but am afraid I will hit a serious performance wall (or am missing an obvious / better way to do it):

1) Merge all 133 client tables into a single new table, add a client_id column, and do the data partitioning on the indexes, not the tables, as seen here:

CREATE INDEX actioninfo_order_number_XXX_idx
    ON actioninfo ( order_number ) WHERE client_id = XXX;
CREATE INDEX actioninfo_trans_date_XXX_idx
    ON actioninfo ( transaction_date ) WHERE client_id = XXX;

(Aside question: if I were to find a way to use COPY and I were loading data on a single client_id, would dropping just the indexes for that client_id accelerate the load?)

2) Find some way to make the bulk loads faster or more efficient (help!)

3) Wrap each load into a transaction (tens of thousands of records per load)

Is this a good plan? Is there a better way? Am I walking into a trap? Should I leave well enough alone and not try and fix something that's not broken?
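As a sketch of one common alternative to the per-row loop above (my own illustration, not from the original post; the column names come from the actioninfo table described below, I'm assuming the three not-null columns form the key, and CREATE TABLE ... (LIKE ...) needs 8.0+): COPY the nightly batch into a staging table, then apply it with one set-based UPDATE and one set-based INSERT:

BEGIN;

-- staging table with the same shape as the target
CREATE TEMP TABLE actioninfo_load (LIKE actioninfo);

-- bulk-load the nightly file without touching the target's indexes
COPY actioninfo_load FROM '/tmp/nightly_batch.dat';

-- apply changes to rows that already exist
UPDATE actioninfo
   SET quantity  = l.quantity,
       extension = l.extension
  FROM actioninfo_load l
 WHERE actioninfo.order_number      = l.order_number
   AND actioninfo.order_line_number = l.order_line_number
   AND actioninfo.action_number     = l.action_number;

-- insert the rows that don't exist yet
INSERT INTO actioninfo
SELECT l.*
  FROM actioninfo_load l
 WHERE NOT EXISTS (SELECT 1 FROM actioninfo a
                    WHERE a.order_number      = l.order_number
                      AND a.order_line_number = l.order_line_number
                      AND a.action_number     = l.action_number);

COMMIT;

Two set-based statements instead of one round trip per row is usually the big win; whether it beats the loop here depends on how many of the batch rows are updates.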
FWIW here's the hardware and the profile of the current uber table:

      Column       |  Type   | Modifiers
-------------------+---------+-----------
 order_number      | integer | not null
 order_line_number | integer | not null
 action_number     | integer | not null
 transaction_date  | date    |
 code              | text    |
 trans_group_code  | text    |
 quantity          | integer |
 extension         | money   |
 sales_tax         | money   |
 shipping          | money   |
 discount          | money   |

Dual Opteron 246, 4 disk SCSI RAID5, 4GB of RAM

# du -sh /var/lib/postgres/data/
16G     /var/lib/postgres/data/

( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and when I do this refactoring )
( the current OS is Debian Unstable but I intend to be running RHEL 4.0 if and when I do this refactoring )
Re: [PERFORM] 7.3 vs 7.4 performance
On Wed, 2004-02-04 at 21:27, Josh Berkus wrote:
> Orion,
>
> > I've done some testing of 7.3.4 vs 7.4.1 and found 7.4.1 to be 20%-30%
> > slower than 7.3.4. Is this common knowledge or am I just unlucky with
> > my query/data selection?
>
> No, it's not common knowledge. It should be the other way around. Perhaps
> it's the queries you picked? Even so ... feel free to post individual
> EXPLAIN ANALYZEs to the list.

Thank you... Here's one good example of 7.3 beating 7.4 soundly. Again, this could be some compile option, since I built the 7.4 RPM from source and I got the 7.3 from Fedora, or something to do with the Opteron architecture. (Yes, the compiled postgres is 64 bit.)

SELECT cid, media_name, media_type, count(*) AS count,
       sum(a_amount) AS a,
       sum(CASE WHEN b_amount > 0 THEN b_amount ELSE 0 END) AS b,
       sum(CASE WHEN b_amount < 0 THEN b_amount ELSE 0 END) AS c
  FROM transdata JOIN media_info ON (media = media_type)
 WHERE cid = 140100
   AND demo IS NOT NULL
   AND trans_date BETWEEN date '2004-01-01'
       AND date_trunc('month', date '2004-01-01' + interval '32 days')
 GROUP BY cid, media_name, media_type;

Here's 7.3's time and explain:

real    0m34.260s
user    0m0.010s
sys     0m0.000s

 Aggregate  (cost=7411.88..7415.32 rows=17 width=25)
   ->  Group  (cost=7411.88..7413.60 rows=172 width=25)
         ->  Sort  (cost=7411.88..7412.31 rows=172 width=25)
               Sort Key: transdata.cid, media_info.media_name, transdata.media_type
               ->  Hash Join  (cost=1.22..7405.50 rows=172 width=25)
                     Hash Cond: ("outer".media_type = "inner".media)
                     ->  Index Scan using transdata_date_index on transdata  (cost=0.00..7401.27 rows=172 width=14)
                           Index Cond: ((trans_date >= ('2004-01-01'::date)::timestamp with time zone) AND (trans_date <= ('2004-02-01 00:00:00'::timestamp without time zone)::timestamp with time zone))
                           Filter: ((cid = 140100) AND (demo IS NOT NULL))
                     ->  Hash  (cost=1.18..1.18 rows=18 width=11)
                           ->  Seq Scan on media_info  (cost=0.00..1.18 rows=18 width=11)

Here's 7.4's time and explain:

real    0m43.052s
user    0m0.000s
sys     0m0.020s

 HashAggregate  (cost=8098.26..8098.29 rows=2 width=23)
   ->  Hash Join  (cost=1.22..8095.48 rows=159 width=23)
         Hash Cond: ("outer".media_type = "inner".media)
         ->  Index Scan using transdata_date_index on transdata  (cost=0.00..8091.87 rows=159 width=14)
               Index Cond: ((trans_date >= ('2004-01-01'::date)::timestamp with time zone) AND (trans_date <= ('2004-02-01 00:00:00'::timestamp without time zone)::timestamp with time zone))
               Filter: ((cid = 140100) AND (demo IS NOT NULL))
         ->  Hash  (cost=1.18..1.18 rows=18 width=11)
               ->  Seq Scan on media_info  (cost=0.00..1.18 rows=18 width=11)
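Worth noting as a follow-up: the plans above are plain EXPLAIN output, so they show only the planner's estimates. Since Josh asked for EXPLAIN ANALYZE, the same query can be prefixed like this to get actual row counts and per-node runtimes (note it really executes the query; the aggregate columns are trimmed here for brevity):

EXPLAIN ANALYZE
SELECT cid, media_name, media_type, count(*) AS count
  FROM transdata JOIN media_info ON (media = media_type)
 WHERE cid = 140100
   AND demo IS NOT NULL
   AND trans_date BETWEEN date '2004-01-01'
       AND date_trunc('month', date '2004-01-01' + interval '32 days')
 GROUP BY cid, media_name, media_type;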
Re: [PERFORM] 7.3 vs 7.4 performance
On Fri, 2004-02-06 at 02:44, Hannu Krosing wrote:
> Christopher Browne wrote on Thu, 05.02.2004 at 07:32:
> > Oops! [EMAIL PROTECTED] (Orion Henry) was seen spray-painting on a wall:
> > > Oh... as a side note I'm happy to announce that the 2.6 Linux kernel
> > > has more than DOUBLED the speed of all my Postgres queries over the
> > > 2.4. =)
> >
> > I did some heavy-transaction-oriented tests recently on somewhat
> > heftier quad-Xeon hardware, and found little difference between 2.4
> > and 2.6, and a small-but-quite-repeatable advantage with FreeBSD 4.9.
> > Now, I'm quite sure my load was rather different from yours, but I
> > find the claim of doubling of speed rather surprising.
>
> perhaps you were just IO-bound while he was not ?
>
> or starving on some locks ?

The queries were across almost 4 gigs of data on a machine with 512 MB of ram. I personally was assuming it was the anticipatory disk scheduler... but alas I don't know why it affected me so much.
Re: [PERFORM] 7.3 vs 7.4 performance
On Fri, 2004-02-06 at 02:43, Hannu Krosing wrote:
> Orion Henry wrote on Thu, 05.02.2004 at 07:16:
> > I've done some testing of 7.3.4 vs 7.4.1 and found 7.4.1 to be 20%-30%
> > slower than 7.3.4. Is this common knowledge or am I just unlucky with
> > my query/data selection?
> >
> > Things of note that might matter: the machine is a dual Opteron 1.4GHz
> > running Fedora Core 1 Test 1 for X86_64. The 7.3.4 was from the Fedora
> > distro and the 7.4.1 was the PGDG package.
>
> Are you sure that it is not the case that 7.3.4 is 64 bit and the PGDG
> package is 32?

Yes, I'm sure... I don't know if they were compiled with differing optimizations or compilers though...

> > The database is 3.5 Gigs with 10 million rows and the machine had 1 Gig
> > of ram.
> >
> > Oh... as a side note I'm happy to announce that the 2.6 Linux kernel has
> > more than DOUBLED the speed of all my Postgres queries over the 2.4. =)
>
> Is this on this same hardware ?

No. I haven't gotten the 2.6 kernel working on the Opteron yet. The 2x speedup was on a dual Athlon 2GHz.
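An aside on verifying this kind of thing (my suggestion, not from the thread): the version string a running server reports includes the architecture it was built for, so each install can be checked directly:

SELECT version();
-- e.g.  PostgreSQL 7.4.1 on x86_64-unknown-linux-gnu, compiled by GCC 3.3.2
-- an x86_64 build target means a 64-bit binary; i386/i686 would mean 32-bit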
Re: [PERFORM] 7.3 vs 7.4 performance
> One question though... It sounds like your 7.3 binaries are 64-bit and
> your 7.4 binaries are 32-bit. Have you tried grabbing the SRPM for 7.4
> and recompiling it for X86_64?

No, they were all 64 bit. I'm going to run explains on all my queries and see if I can find anything of interest...
[PERFORM] 7.3 vs 7.4 performance
I've done some testing of 7.3.4 vs 7.4.1 and found 7.4.1 to be 20%-30% slower than 7.3.4. Is this common knowledge or am I just unlucky with my query/data selection?

Things of note that might matter: the machine is a dual Opteron 1.4GHz running Fedora Core 1 Test 1 for X86_64. The 7.3.4 was from the Fedora distro and the 7.4.1 was the PGDG package. The database is 3.5 Gigs with 10 million rows and the machine had 1 Gig of ram.

Oh... as a side note I'm happy to announce that the 2.6 Linux kernel has more than DOUBLED the speed of all my Postgres queries over the 2.4. =)

-- Orion Henry <[EMAIL PROTECTED]>
[PERFORM] pg_stat_activity
Quick question: the full query listed in pg_stat_activity is getting truncated. Does anyone know how I can see the full query in progress?

-- Orion Henry <[EMAIL PROTECTED]>
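For later readers (an assumption about version, since the post doesn't give one): in old releases the cutoff was a compile-time constant, but from 8.4 on it is the track_activity_query_size setting, which defaults to 1024 bytes. A sketch:

-- check the current limit (this GUC exists in 8.4 and later)
SHOW track_activity_query_size;

-- to raise it, set it in postgresql.conf and restart the server:
--   track_activity_query_size = 16384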
Re: [PERFORM] help with dual indexing
Thanks Tom! You're a life-saver.

On Fri, 2004-01-23 at 17:08, Tom Lane wrote:
> Orion Henry <[EMAIL PROTECTED]> writes:
> > The queries usually are in the form of, where "user_id = something and
> > event_time between something and something".
>
> > Half of my queries index off of the user_id and half index off the
> > event_time. I was thinking this would be a perfect opportunity to use a
> > dual index of (user_id,event_time) but I'm confused as to whether this
> > will help
>
> Probably. Put the user_id as the first column of the index --- if you
> think about the sort ordering of a multicolumn index, you will see why.
> With user_id first, a constraint as above describes a contiguous
> subrange of the index; with event_time first it does not.
>
> regards, tom lane

-- Orion Henry <[EMAIL PROTECTED]>
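To make the ordering point concrete (the table name and values are made up; the columns come from the original question): with user_id leading, every row for one user sits in one contiguous stretch of the index, so the date-range constraint scans a single run of index entries:

CREATE INDEX events_user_id_event_time_idx ON events (user_id, event_time);

-- this WHERE clause maps to exactly one contiguous subrange of the index
SELECT count(*)
  FROM events
 WHERE user_id = 42
   AND event_time BETWEEN '2003-01-01' AND '2003-12-31';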
[PERFORM] help with dual indexing
I've got a table with about 10 million events in it. Each has a user_id (about 1000 users) and an event_time timestamp covering a 4 year period, with about 50% of the events being in the last year. Some users have only dozens of events. A few have hundreds of thousands. The queries usually are in the form of, where "user_id = something and event_time between something and something".

Half of my queries index off of the user_id and half index off the event_time. I was thinking this would be a perfect opportunity to use a dual index of (user_id,event_time), but I'm confused as to whether this will help, considering the size of this index and given that there are very few tuples that have the exact same timestamp as another. I'm also not sure which order to put the user_id/event_time in, as I don't know what is meant when people on this list ask which is more selective.

Also, would it make sense for me to raise my ANALYZE value, and how would I go about doing this? Thanks for the help.

-- Orion Henry <[EMAIL PROTECTED]>
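On the ANALYZE-value question at the end, here is roughly what raising it looks like (a sketch; events stands in for the unnamed table, and the default statistics target in the 7.x/8.0 releases is 10):

-- keep ten times the default statistics for the skewed column
ALTER TABLE events ALTER COLUMN user_id SET STATISTICS 100;

-- re-gather statistics so the new target takes effect
ANALYZE events;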