[PERFORM] Auto performance tuning?

2006-03-20 Thread Orion Henry


I have to say I've been really impressed with the quality and diversity 
of tools here to increase performance for PostgreSQL.  But I keep seeing 
the same basic advice repeated again and again.  Has anyone looked into 
a "smart" or auto-adjusting resource manager for Postgres?

Consider, for instance, setting it to use system resources aggressively. 
It would then do things like notice, after profiling a few thousand 
queries, that it needs more work_mem and add it for you; or notice that a 
specific index or table should be moved to a different spindle and do it 
in the background; or notice that query plans keep going wrong on a 
particular table, so it knows to up the amount of statistics it keeps on 
that table.
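For reference, each of those adjustments already has a manual equivalent; here's a rough sketch of what such a tuner would be automating (the table, column, and tablespace names are made up, and SET TABLESPACE needs 8.0 or later):

    SET work_mem = 65536;            -- more memory for sorts/hashes (in KB; called sort_mem before 8.0)
    ALTER TABLE orders SET TABLESPACE fast_disks;               -- move a table to another spindle
    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;  -- keep more planner statistics
    ANALYZE orders;                  -- rebuild the statistics at the new target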


Is this a crazy idea or something someone's already working on?

Orion




Re: [PERFORM] Large Database Design Help

2006-02-22 Thread Orion
I just wanted to thank everyone for your input on my question.  You've
given me a lot of tools to solve my problem here.

Orion



[PERFORM] Large Database Design Help

2006-02-09 Thread Orion Henry


Hello All,

I've inherited a PostgreSQL database that I would like to refactor.  It 
was originally designed for Postgres 7.0 on a PIII 500MHz, and some 
design decisions were made that don't make sense any more.  Here's the 
problem:


1) The database is very large; the largest table has 40 million tuples.

2) The database needs to import tens of thousands of tuples each night, 
quickly.  The current method is VERY slow.


3) I can't import the new records with a COPY or drop my indexes first, 
because some of them are new records (INSERTs) and some are altered 
records (UPDATEs), and the only way I can think of to tell them apart is 
to perform a SELECT for each record.


Here is how the database is currently laid out, and you'll see why I 
have a problem with it:


1) The data is easily partitionable by client ID.  In an attempt to keep 
the indexes small and the inserts fast, one table was made per client 
ID.  Thus the primary table in the database (the one with 40 million 
tuples) is really 133 tables, each ending with a three-digit suffix. 
The largest of these client tables has 8 million of the 40 million 
tuples.  The system started with around half a dozen clients and is now 
a huge pain to manage with so many tables.  I was hoping new hardware 
and new Postgres features would allow this data to be merged safely 
into a single table.


2) The imports are not done inside of transactions.  I'm assuming the 
system designers excluded this for a reason.  Will I run into problems 
performing tens of thousands of inserts and updates inside a single 
transaction?


3) The current code that bulk loads data into the database is a loop 
that looks like this:


    # Try the insert first; it only succeeds when no matching row exists.
    $result = exe("INSERT INTO $table ($name_str)
                   SELECT $val_str
                   WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");

    # Zero rows inserted means the record is already there, so update it.
    if ($result == 0)
    {
        $result = exe("UPDATE $table SET $non_keys WHERE $keys");
    }

Is there a faster way to bulk load data when it's not known ahead of 
time if it's a new record or an updated record?
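One common alternative is to COPY the whole batch into a staging table and then resolve the updates and inserts with two set-based statements, inside one transaction.  A rough sketch only, assuming the actioninfo table described below and a three-column key (the staging table, key columns, and file path are guesses, not your schema):

    BEGIN;

    -- COPY into an empty staging table is the fast path for getting data in.
    CREATE TEMP TABLE staging (LIKE actioninfo) ON COMMIT DROP;
    COPY staging FROM '/tmp/batch.dat';

    -- First update the rows that already exist ...
    UPDATE actioninfo a
       SET quantity = s.quantity                -- ... plus the other non-key columns
      FROM staging s
     WHERE a.order_number = s.order_number
       AND a.order_line_number = s.order_line_number
       AND a.action_number = s.action_number;

    -- ... then insert the ones that don't.
    INSERT INTO actioninfo
    SELECT s.*
      FROM staging s
     WHERE NOT EXISTS (SELECT 1 FROM actioninfo a
                        WHERE a.order_number = s.order_number
                          AND a.order_line_number = s.order_line_number
                          AND a.action_number = s.action_number);

    COMMIT;

This also bears on question 2 above: tens of thousands of inserts and updates in a single transaction is normally fine in Postgres, since there are no rollback segments to overflow as in some other databases.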


What I would LIKE to do, but am afraid I will hit a serious performance 
wall with (or am missing an obvious / better way to do it):


1) Merge all 133 client tables into a single new table, add a client_id 
column, and do the partitioning on the indexes rather than the tables, as 
seen here:


    CREATE INDEX actioninfo_order_number_XXX_idx
        ON actioninfo ( order_number ) WHERE client_id = XXX;
    CREATE INDEX actioninfo_trans_date_XXX_idx
        ON actioninfo ( transaction_date ) WHERE client_id = XXX;


   (Aside question: if I were to find a way to use COPY and I were 
loading data for a single client_id, would dropping just the indexes for 
that client_id accelerate the load?)
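(With partial indexes, that experiment is at least cheap to express.  A sketch, using a hypothetical client 42:)

    -- Drop only client 42's partial indexes before the load ...
    DROP INDEX actioninfo_order_number_042_idx;
    DROP INDEX actioninfo_trans_date_042_idx;

    -- ... bulk load client 42's rows here ...

    -- ... and recreate the indexes afterwards.
    CREATE INDEX actioninfo_order_number_042_idx
        ON actioninfo ( order_number ) WHERE client_id = 42;
    CREATE INDEX actioninfo_trans_date_042_idx
        ON actioninfo ( transaction_date ) WHERE client_id = 42;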


2) Find some way to make the bulk loads faster or more efficient (help!)

3) Wrap each load into a transaction (tens of thousands of records per 
load); the staging-table sketch above already does this.


Is this a good plan?  Is there a better way?  Am I walking into a trap?  
Should I leave well enough alone and not try and fix something that's 
not broken?


FWIW, here's the hardware and the profile of the current uber-table:

      Column       |  Type   | Modifiers
-------------------+---------+-----------
 order_number      | integer | not null
 order_line_number | integer | not null
 action_number     | integer | not null
 transaction_date  | date    |
 code              | text    |
 trans_group_code  | text    |
 quantity          | integer |
 extension         | money   |
 sales_tax         | money   |
 shipping          | money   |
 discount          | money   |
Dual Opteron 246, 4-disk SCSI RAID 5, 4GB of RAM

# du -sh /var/lib/postgres/data/
16G /var/lib/postgres/data/  

( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and 
when I do this refactoring )
( the current OS is Debian Unstable but I intend to be running RHEL 4.0 
if and when I do this refactoring )





Re: [PERFORM] 7.3 vs 7.4 performance

2004-02-06 Thread Orion Henry




On Wed, 2004-02-04 at 21:27, Josh Berkus wrote:

> Orion,
>
> > I've done some testing of 7.3.4 vs 7.4.1 and found 7.4.1 to be 20%-30%
> > slower than 7.3.4.  Is this common knowledge or am I just unlucky with
> > my query/data selection?
>
> No, it's not common knowledge.  It should be the other way around.  Perhaps
> it's the queries you picked?  Even so ... feel free to post individual
> EXPLAIN ANALYZEs to the list.


Thank you...

Here's one good example of 7.3 beating 7.4 soundly.
Again, this could be some compile option, since I built the 7.4 RPM
from source and got the 7.3 from Fedora, or something to
do with the Opteron architecture.  (Yes, the compiled postgres
is 64-bit.)

SELECT cid, media_name, media_type, count(*) AS count,
       sum(a_amount) AS a,
       sum(CASE WHEN b_amount > 0 THEN b_amount ELSE 0 END) AS b,
       sum(CASE WHEN b_amount < 0 THEN b_amount ELSE 0 END) AS c
  FROM transdata JOIN media_info ON (media = media_type)
 WHERE cid = 140100 AND demo IS NOT NULL
   AND trans_date BETWEEN date '2004-01-01'
       AND date_trunc('month', date '2004-01-01' + interval '32 days')
 GROUP BY cid, media_name, media_type;
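(Note: the plans below are plain EXPLAIN output, i.e. planner estimates only.  Prefixing the same query with EXPLAIN ANALYZE, as Josh suggested, would also show actual row counts and per-node timings, which is what's needed to see where the extra ~9 seconds go:)

    EXPLAIN ANALYZE
    SELECT cid, media_name, media_type, count(*) AS count, ...   -- same query as above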

Here's 7.3's time and explain

real    0m34.260s
user    0m0.010s
sys     0m0.000s

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=7411.88..7415.32 rows=17 width=25)
   ->  Group  (cost=7411.88..7413.60 rows=172 width=25)
         ->  Sort  (cost=7411.88..7412.31 rows=172 width=25)
               Sort Key: transdata.cid, media_info.media_name, transdata.media_type
               ->  Hash Join  (cost=1.22..7405.50 rows=172 width=25)
                     Hash Cond: ("outer".media_type = "inner".media)
                     ->  Index Scan using transdata_date_index on transdata  (cost=0.00..7401.27 rows=172 width=14)
                           Index Cond: ((trans_date >= ('2004-01-01'::date)::timestamp with time zone) AND (trans_date <= ('2004-02-01 00:00:00'::timestamp without time zone)::timestamp with time zone))
                           Filter: ((cid = 140100) AND (demo IS NOT NULL))
                     ->  Hash  (cost=1.18..1.18 rows=18 width=11)
                           ->  Seq Scan on media_info  (cost=0.00..1.18 rows=18 width=11)


Here's 7.4's time and explain

real    0m43.052s
user    0m0.000s
sys     0m0.020s

                                  QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=8098.26..8098.29 rows=2 width=23)
   ->  Hash Join  (cost=1.22..8095.48 rows=159 width=23)
         Hash Cond: ("outer".media_type = "inner".media)
         ->  Index Scan using transdata_date_index on transdata  (cost=0.00..8091.87 rows=159 width=14)
               Index Cond: ((trans_date >= ('2004-01-01'::date)::timestamp with time zone) AND (trans_date <= ('2004-02-01 00:00:00'::timestamp without time zone)::timestamp with time zone))
               Filter: ((cid = 140100) AND (demo IS NOT NULL))
         ->  Hash  (cost=1.18..1.18 rows=18 width=11)
               ->  Seq Scan on media_info  (cost=0.00..1.18 rows=18 width=11)










Re: [PERFORM] 7.3 vs 7.4 performance

2004-02-06 Thread Orion Henry
On Fri, 2004-02-06 at 02:44, Hannu Krosing wrote:
> Christopher Browne wrote on Thu, 05.02.2004 at 07:32:
> > Oops! [EMAIL PROTECTED] (Orion Henry) was seen spray-painting on a wall:
> > > Oh... as a side note I'm happy to announce that the 2.6 Linux kernel
> > > has more than DOUBLED the speed of all my Postgres queries over the
> > > 2.4. =)
> > 
> > I did some heavy-transaction-oriented tests recently on somewhat
> > heftier quad-Xeon hardware, and found little difference between 2.4
> > and 2.6, and a small-but-quite-repeatable advantage with FreeBSD 4.9.
> > Now, I'm quite sure my load was rather different from yours, but I
> > find the claim of doubling of speed rather surprising.
> 
> perhaps you were just IO-bound while he was not ?
> 
> or starving on some locks ?

The queries were across almost 4 gigs of data on a machine with 512 MB of RAM.
I personally was assuming it was the anticipatory disk scheduler... but alas, I
don't know why it affected me so much.





Re: [PERFORM] 7.3 vs 7.4 performance

2004-02-06 Thread Orion Henry
On Fri, 2004-02-06 at 02:43, Hannu Krosing wrote:
> Orion Henry kirjutas N, 05.02.2004 kell 07:16:
> > I've done some testing of 7.3.4 vs 7.4.1 and found 7.4.1 to be 20%-30%
> > slower than 7.3.4.  Is this common knowledge or am I just unlucky with
> > my query/data selection?
> > 
> > Things of note that might matter: the machine is a dual Opteron 1.4GHz
> > running Fedora Core 1 Test 1 for X86_64.  The 7.3.4 was from the Fedora
> > distro and the 7.4.1 was the PGDG package.
> 
> Are you sure that it is not the case that 7.3.4 is 64-bit and the PGDG
> package is 32?

Yes, I'm sure... I don't know if they were compiled with differing
optimizations or compilers, though...

> > The database is 3.5 Gigs with 10 million rows and the machine had 1 Gig of RAM.
> > 
> > Oh... as a side note I'm happy to announce that the 2.6 Linux kernel has
> > more than DOUBLED the speed of all my Postgres queries over the 2.4. =)
> 
> Is this on this same hardware ?

No.  I haven't gotten the 2.6 kernel working on the Opteron yet.  The 2x speedup
was on a dual Athlon 2GHz.





Re: [PERFORM] 7.3 vs 7.4 performance

2004-02-05 Thread Orion Henry

> 
> One question though...  It sounds like your 7.3 binaries are 64-bit and
> your 7.4 binaries are 32-bit.  Have you tried grabbing the SRPM for 7.4
> and recompiling it for X86_64?

No, they were all 64 bit.

I'm going to run explains on all my queries and see if I can find 
anything of interest...





[PERFORM] 7.3 vs 7.4 performance

2004-02-04 Thread Orion Henry
I've done some testing of 7.3.4 vs 7.4.1 and found 7.4.1 to be 20%-30%
slower than 7.3.4.  Is this common knowledge or am I just unlucky with
my query/data selection?

Things of note that might matter: the machine is a dual Opteron 1.4GHz
running Fedora Core 1 Test 1 for X86_64.  The 7.3.4 was from the Fedora
distro and the 7.4.1 was the PGDG package.  The database is 3.5 Gigs
with 10 million rows and the machine had 1 Gig of RAM.

Oh... as a side note I'm happy to announce that the 2.6 Linux kernel has
more than DOUBLED the speed of all my Postgres queries over the 2.4. =)



-- 
Orion Henry <[EMAIL PROTECTED]>




[PERFORM] pg_stat_activity

2004-02-02 Thread Orion Henry
Quick Question,

The full query listed in pg_stat_activity is getting truncated.  Does
anyone know how I can see the full query in progress?
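(Note for later readers: in the 7.4 era the displayed query text was cut off at a compile-time limit.  On 8.4 and later servers the cutoff is the track_activity_query_size setting; that doesn't help the installation in this thread, but for reference:)

    SHOW track_activity_query_size;   -- bytes of query text kept per backend (default 1024)
    -- Raising it means editing postgresql.conf and restarting the server:
    --   track_activity_query_size = 4096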

-- 
Orion Henry <[EMAIL PROTECTED]>




Re: [PERFORM] help with dual indexing

2004-01-26 Thread Orion Henry
Thanks Tom!  You're a life-saver.

On Fri, 2004-01-23 at 17:08, Tom Lane wrote:
> Orion Henry <[EMAIL PROTECTED]> writes:
> > The queries are usually of the form "user_id = something AND
> > event_time BETWEEN something AND something".
> 
> > Half of my queries index off the user_id and half index off the
> > event_time.  I was thinking this would be a perfect opportunity to use a
> > dual index of (user_id, event_time) but I'm confused as to whether this
> > will help
> 
> Probably.  Put the user_id as the first column of the index --- if you
> think about the sort ordering of a multicolumn index, you will see why.
> With user_id first, a constraint as above describes a contiguous
> subrange of the index; with event_time first it does not.
> 
>   regards, tom lane
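Concretely, the suggested index would look like this (the table name events is a stand-in; the original post never names it):

    -- With user_id first, all of one user's events are adjacent in the index,
    -- so "user_id = X AND event_time BETWEEN a AND b" scans one contiguous range.
    CREATE INDEX events_user_id_event_time_idx ON events ( user_id, event_time );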
-- 
Orion Henry <[EMAIL PROTECTED]>




[PERFORM] help with dual indexing

2004-01-23 Thread Orion Henry
I've got a table with about 10 million events in it.

Each has a user_id (about 1000 users) and a event_time timestamp
covering a 4 year period with about 50% of the events being in the last
year.  Some users have only dozens of events.  A few have hundreds of
thousands.

The queries are usually of the form "user_id = something AND
event_time BETWEEN something AND something".

Half of my queries index off the user_id and half index off the
event_time.  I was thinking this would be a perfect opportunity to use a
dual index of (user_id, event_time), but I'm confused as to whether this
will help, considering the size of the index and the fact that very few
tuples share the exact same timestamp.  I'm also not sure which order to
put user_id/event_time in, as I don't know what is meant when people on
this list ask which is more selective.

Also, would it make sense for me to raise my ANALYZE value and how would
I go about doing this?
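(Note on that last question: the usual knob is the per-column statistics target.  A sketch, again using events as a stand-in table name; the default target was 10 in this era:)

    ALTER TABLE events ALTER COLUMN user_id SET STATISTICS 100;
    ALTER TABLE events ALTER COLUMN event_time SET STATISTICS 100;
    ANALYZE events;   -- recollect statistics at the new target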

Thanks for the help.


-- 
Orion Henry <[EMAIL PROTECTED]>

