Re: [PERFORM] dbt-2 tuning results with postgresql-8.3.5

2008-12-23 Thread Alvaro Herrera
Mark Wong wrote:

 Hrm, tracking just the launcher process certainly doesn't help.  Are
 the spawned processes short-lived?  I take a snapshot of
 /proc/<pid>/io data every 60 seconds.

The worker processes can be short-lived, but if they are, obviously they
are not vacuuming the large tables.  If you want to track all autovacuum
actions, set log_autovacuum_min_duration to 0.
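
A minimal sketch of what that looks like (parameter name as in 8.3; the
reload call is just one way to apply the change):

  # postgresql.conf -- log every autovacuum action, regardless of duration
  log_autovacuum_min_duration = 0

  -- then, from a superuser session, reload the server configuration:
  SELECT pg_reload_conf();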

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread Kynn Jones
Hi everyone!
I have a very large 2-column table (about 500M records) from which I want to
remove duplicate records.

I have tried many approaches, but they all take forever.

The table's definition consists of two short TEXT columns.  It is a
temporary table generated from a query:

CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;

Initially I tried

CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;

but after waiting for nearly an hour I aborted the query, and repeated it
after getting rid of the DISTINCT clause.

Everything takes forever with this monster!  It's uncanny.  Even printing it
out to a file takes forever, let alone creating an index for it.

Any words of wisdom on how to speed this up would be appreciated.

TIA!

Kynn


Re: [PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread Scott Marlowe
On Tue, Dec 23, 2008 at 10:25 AM, Kynn Jones kyn...@gmail.com wrote:
 Hi everyone!
 I have a very large 2-column table (about 500M records) from which I want to
 remove duplicate records.
 I have tried many approaches, but they all take forever.
 The table's definition consists of two short TEXT columns.  It is a
 temporary table generated from a query:

 CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
 Initially I tried
 CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
 but after waiting for nearly an hour I aborted the query, and repeated it
 after getting rid of the DISTINCT clause.
 Everything takes forever with this monster!  It's uncanny.  Even printing it
 out to a file takes forever, let alone creating an index for it.
 Any words of wisdom on how to speed this up would be appreciated.

Did you try cranking up work_mem to something that's a large
percentage (25 to 50%) of total memory?
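
For example, a per-session sketch (the value is purely illustrative and
assumes the machine has RAM to spare; keep in mind work_mem applies to each
individual sort or hash operation):

  SET work_mem = '1GB';     -- session-local setting; illustrative value only
  CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
  RESET work_mem;           -- return to the server default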



Re: [PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread D'Arcy J.M. Cain
On Tue, 23 Dec 2008 12:25:48 -0500
Kynn Jones kyn...@gmail.com wrote:
 Hi everyone!
 I have a very large 2-column table (about 500M records) from which I want to
 remove duplicate records.
 
 I have tried many approaches, but they all take forever.
 
 The table's definition consists of two short TEXT columns.  It is a
 temporary table generated from a query:
 
 CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
 
 Initially I tried
 
  CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
 
 but after waiting for nearly an hour I aborted the query, and repeated it

Do you have an index on x and y?  Also, does this work better?

CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... GROUP BY x, y;

What does EXPLAIN ANALYZE have to say?
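
For instance (a sketch; note that EXPLAIN ANALYZE actually executes the
statement, so on a 500M-row source it takes as long as the query itself,
while plain EXPLAIN only shows the chosen plan):

  EXPLAIN ANALYZE SELECT x, y FROM ... GROUP BY x, y;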

-- 
D'Arcy J.M. Cain da...@druid.net   |  Democracy is three wolves
http://www.druid.net/darcy/        |  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.



Re: [PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread George Pavlov
You don't say what PG version you are on, but just for kicks you may try
using GROUP BY instead of DISTINCT. Yes, the two should perform the
same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
was faster (admittedly this happened with more complex queries). So, try
this:

  CREATE TEMP TABLE huge_table AS SELECT x, y FROM foo GROUP BY 1, 2;

Note that you may be tempted to add an index on foo(x,y), but I don't
think that helps (or at least I have not been able to hit the index in
similar situations).


 -Original Message-
 From: pgsql-performance-ow...@postgresql.org
 [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Kynn Jones
 Sent: Tuesday, December 23, 2008 9:26 AM
 To: pgsql-performance@postgresql.org
 Subject: [PERFORM] How to unique-ify HUGE table?
 
 Hi everyone!
 
 I have a very large 2-column table (about 500M records) from which I
 want to remove duplicate records.
 
 I have tried many approaches, but they all take forever.
 
 The table's definition consists of two short TEXT columns.  It is a
 temporary table generated from a query:
 
 
 CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
 
 Initially I tried
 
 CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
 
 but after waiting for nearly an hour I aborted the query, and repeated
 it after getting rid of the DISTINCT clause.
 
 Everything takes forever with this monster!  It's uncanny.  Even
 printing it out to a file takes forever, let alone creating an index
 for it.
 
 Any words of wisdom on how to speed this up would be appreciated.
 
 TIA!
 
 Kynn
 
 




Re: [PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread Scott Marlowe
On Tue, Dec 23, 2008 at 11:14 AM, George Pavlov gpav...@mynewplace.com wrote:
 You don't say what PG version you are on, but just for kicks you may try
 using GROUP BY instead of DISTINCT. Yes, the two should perform the
 same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
 was faster (admittedly this happened with more complex queries). So, try
 this:

Even in 8.3 it looks like group by is faster.  Tested it on a decent
sized table and group by used a hash agg and ran in ~600 ms, while
distinct used a sort and ran in 1300 ms.  That was on 500k rows.  On a
much larger table, one with about 10M rows, a similar statement runs
in 1500 ms with group by and in 2390 ms when run with distinct.



Re: [PERFORM] How to unique-ify HUGE table?

2008-12-23 Thread Stefan Kaltenbrunner

Scott Marlowe wrote:

On Tue, Dec 23, 2008 at 11:14 AM, George Pavlov gpav...@mynewplace.com wrote:

You don't say what PG version you are on, but just for kicks you may try
using GROUP BY instead of DISTINCT. Yes, the two should perform the
same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
was faster (admittedly this happened with more complex queries). So, try
this:


Even in 8.3 it looks like group by is faster.  Tested it on a decent
sized table and group by used a hash agg and ran in ~600 ms, while
distinct used a sort and ran in 1300 ms.  That was on 500k rows.  On a
much larger table, one with about 10M rows, a similar statement runs
in 1500 ms with group by and in 2390 ms when run with distinct.


Not surprising - this is a known limitation in all released versions of
PostgreSQL (GROUP BY can use both hashing and sorting, while DISTINCT can
only sort). 8.4 is going to improve that, though.
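
An illustrative way to see the difference on a pre-8.4 server (plan node
names as reported by EXPLAIN; the table name is just the one from earlier
in the thread):

  -- GROUP BY may be planned as a HashAggregate:
  EXPLAIN SELECT x, y FROM huge_table GROUP BY x, y;
  -- DISTINCT before 8.4 is planned as a Sort followed by Unique:
  EXPLAIN SELECT DISTINCT x, y FROM huge_table;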



Stefan
