[PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Joost Kraaijeveld
Hi,

I must convert an old table into a new table. The conversion runs at
~100 records per second. Given that I must convert 40 million records,
it takes too much time: many more hours than the 48-hour weekend I have
for the conversion ;-).

The tables are rather simple: both tables only have a primary key
constraint (of type text) and no other indexes. I only copy 3 columns. I
use Java for the conversion. For the exact code see below.

During the conversion my processor load is almost nonexistent. The
hard disk throughput is ~6 megabytes/second max (measured with iostat).

My platform is Debian Sarge AMD64. My hardware is a Tyan Thunder K8W
2885 motherboard, 2 Opteron 248 processors, 2 GB RAM, a SATA bootdisk
with / and swap, and a 3Ware 9500S-8 RAID-5 controller with 5 attached
SATA disks with /home and /var. /var contains *all* PostgreSQL log and
database files (default Debian installation).

Output of hdparm -Tt /dev/sdb (sdb is the RAID partition):

/dev/sdb:
 Timing cached reads:   1696 MB in  2.00 seconds = 846.86 MB/sec
 Timing buffered disk reads:  246 MB in  3.01 seconds =  81.79 MB/sec


I want to determine the cause of my performance problem (if it is one).

1. Is this the performance I should expect?
2. If not, how can I determine the cause?
3. Can I improve the performance somehow without replacing my hardware,
e.g. by tweaking the software?
4. Is there a Linux (Debian) tool that I can use to benchmark write
performance?



-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: [EMAIL PROTECTED]
web: www.askesis.nl 


The Java code I use for the conversion (the archive stripped the string
quotes; restored here):

ResultSet resultSet = selectStatement.executeQuery(
    "select ordernummer, orderdatum, klantnummer from odbc.orders " +
    "order by ordernummer");

connection.setAutoCommit(false);

PreparedStatement ordersInsertStatement =
    connection.prepareStatement("insert into prototype.orders " +
        "(objectid,ordernumber,orderdate,customernumber) values (?,?,?,?)");

int record = 0;

while (resultSet.next())
{
    if ((++record % 100) == 0) {
        System.err.println("handling record: " + record);
    }

    // the next line can do > 1.000.000 objectIds/sec
    String orderObjectId = ObjectIdGenerator.newObjectId();
    ordersInsertStatement.setString(1, orderObjectId);
    ordersInsertStatement.setInt(2, resultSet.getInt("ordernummer"));
    ordersInsertStatement.setDate(3, resultSet.getDate("orderdatum"));
    ordersInsertStatement.setInt(4, resultSet.getInt("klantnummer"));

    ordersInsertStatement.execute();
}

connection.commit();

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Ulrich Wisser

Hi Joost,

why do you convert programmatically? I would do something like

create sequence s_objectid;

insert into
prototype.orders (objectid,ordernumber,orderdate,customernumber)

select nextval('s_objectid'), ordernummer, orderdatum, klantnummer from
odbc.orders;


Sounds a lot faster to me.


/Ulrich



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Michael Stone

On Fri, Sep 23, 2005 at 08:49:27AM +0200, Joost Kraaijeveld wrote:

> 3. Can I improve the performance somehow without replacing my hardware,
> e.g. by tweaking the software?


It's not clear what your object id generator does. If it's just a
sequence, it's not clear that you need this program at all--just use a
SELECT INTO and make the object id a SERIAL.

If you do need to control the object id or do some other processing
before putting the data into the new table, rewrite to use a COPY
instead of an INSERT.
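The COPY suggestion can be sketched in Java. The escaping below follows
PostgreSQL's COPY text format (tab-separated fields, NULL as \N); the
CopyManager wiring in the trailing comment is an assumption about the
JDBC driver's API (org.postgresql.copy.CopyManager exists in later
driver releases; whether a 2005-era driver ships it is uncertain):

```java
// Sketch only: formats rows for PostgreSQL's COPY text format.
public class CopyFormat {
    // Escape one field for COPY's default text format: backslash, tab,
    // newline and carriage return must be escaped; null becomes \N.
    static String escape(String field) {
        if (field == null) {
            return "\\N";
        }
        return field.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    // Join escaped fields with tabs into one COPY input line.
    static String copyRow(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    /*
     * Feeding the rows to the server would then look roughly like this
     * (hypothetical wiring, not verified against the driver in use):
     *
     * CopyManager cm = new CopyManager((BaseConnection) connection);
     * cm.copyIn("COPY prototype.orders " +
     *           "(objectid,ordernumber,orderdate,customernumber) FROM STDIN",
     *           new StringReader(allRowsJoinedByNewlines));
     */
}
```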

Mike Stone



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Joost Kraaijeveld
On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
> It's not clear what your object id generator does. If it's just a
> sequence, it's not clear that you need this program at all--just use a
> SELECT INTO and make the object id a SERIAL.
It generates a GUID (and no, I do not want to turn this into a discussion
about GUIDs). As the Java code comment says: generating the GUID is not
the problem (I can generate millions of them per second).

> If you do need to control the object id or do some other processing
> before putting the data into the new table, rewrite to use a COPY
> instead of an INSERT.
It is actually the shortest piece of code that gives me poor
performance. The conversion problem is much, much larger and much, much
more complicated.

I suspect that either my hardware is too slow (but then again, see the
specs), or my Debian is too slow, or my PostgreSQL settings are wrong.

But I have no clue where to begin determining the bottleneck (it may
even be normal performance for all I know: I have no experience with
converting such a large database).

Any suggestions?







Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Michael Stone

On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:

> On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
>> It's not clear what your object id generator does. If it's just a
>> sequence, it's not clear that you need this program at all--just use a
>> SELECT INTO and make the object id a SERIAL.
>
> It generates a GUID (and no, I do not want to turn this into a
> discussion about GUIDs). As the Java code comment says: generating the
> GUID is not the problem (I can generate millions of them per second).

I didn't say it was, did I? If you use a SELECT INTO instead of
SELECTing each record and then reINSERTing it, you avoid a round-trip
latency for each row. There's a reason I said "if it's just a sequence".


>> If you do need to control the object id or do some other processing
>> before putting the data into the new table, rewrite to use a COPY
>> instead of an INSERT.
>
> It is actually the shortest piece of code that gives me poor
> performance. The conversion problem is much, much larger and much, much
> more complicated.

Ok, that's great, but you didn't respond to the suggestion of using COPY
INTO instead of INSERT.


> But I have no clue where to begin determining the bottleneck (it may
> even be normal performance for all I know: I have no experience with
> converting such a large database).
>
> Any suggestions?

Respond to the first suggestion?

Mike Stone



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Arjen van der Meijden

On 23-9-2005 13:05, Michael Stone wrote:
> On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
>
> Ok, that's great, but you didn't respond to the suggestion of using
> COPY INTO instead of INSERT.
>
>> But I have no clue where to begin determining the bottleneck (it may
>> even be normal performance for all I know: I have no experience with
>> converting such a large database).
>>
>> Any suggestions?
>
> Respond to the first suggestion?


Another suggestion:
How many indexes and constraints are on the new table?
Drop all of them and recreate them once the table is filled. Of course
that only works if you know your data will be OK (which is normal for
imports of already-conforming data, like database dumps of existing
tables). This will give major performance improvements if you have
indexes and such on the new table.


Best regards,

Arjen



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Joost Kraaijeveld
On Fri, 2005-09-23 at 13:19 +0200, Arjen van der Meijden wrote:
> Another suggestion:
> How many indexes and constraints are on the new table?
As mentioned in the first mail: these tables have only primary key
constraints, no other indexes or constraints.

> Drop all of them and recreate them once the table is filled. Of course
> that only works if you know your data will be OK (which is normal for
> imports of already-conforming data, like database dumps of existing
> tables). This will give major performance improvements if you have
> indexes and such on the new table.
I will test this for a performance improvement, but still, I wonder
whether ~100 inserts/second is reasonable performance for my
software/hardware combination.






Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Joost Kraaijeveld
On Fri, 2005-09-23 at 07:05 -0400, Michael Stone wrote:
> I didn't say it was, did I?
No, you did not. But only last week someone's head was (luckily for him
only virtually) almost chopped off for suggesting the usage of GUIDs ;-)

> Ok, that's great, but you didn't respond to the suggestion of using
> COPY INTO instead of INSERT.
Part of the code I left out does some data conversions (e.g. from
path-to-file to blob, from text to date (not castable because of the
homebrew original format)). I don't believe I can do these in a SQL
statement, can I (my knowledge of SQL as a language is not that good)?
However, I will investigate whether I can do the conversion in two steps
and check if that is faster.

But still, I wonder whether ~100 inserts/second is reasonable
performance for my software/hardware combination.







Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Arjen van der Meijden

On 23-9-2005 15:35, Joost Kraaijeveld wrote:
> On Fri, 2005-09-23 at 13:19 +0200, Arjen van der Meijden wrote:
>> Drop all of them and recreate them once the table is filled. Of course
>> that only works if you know your data will be OK (which is normal for
>> imports of already-conforming data, like database dumps of existing
>> tables). This will give major performance improvements if you have
>> indexes and such on the new table.
>
> I will test this for a performance improvement, but still, I wonder
> whether ~100 inserts/second is reasonable performance for my
> software/hardware combination.


For the hardware: no, I don't think it is, for such a simple table and
small record size.
I did a few batch inserts with indexes on tables and was very
disappointed by the time it took. But with no indexes and constraints
left, it flew, and the table of 7 million records (of 3 ints and 2
bigints) was imported in 75 seconds, on somewhat simpler hardware. That
was done using a simple pg_dump-built SQL file which was then fed to
psql as input. And of course that used the local Unix socket, not the
local network interface (I don't know which one JDBC uses).
But generating a single transaction (as you do) with inserts shouldn't
be that much slower.


So I don't think it's your hardware, nor your PostgreSQL, although a bit
of extra maintenance_work_mem may help, if you haven't touched that.
That leaves the queries, the application and the driver. I don't have
that much experience with JDBC and PostgreSQL performance, but in PHP I
wouldn't select all 40M records at once: the result set would be in the
client's memory, and that may actually cause trouble. I don't know how
that is implemented in JDBC; it may of course use cursors, and then it
would be less of a problem.
You could try writing the inserts to a file and see how long that takes,
to eliminate the possibility of your application being slow on parts
other than the inserting of data. If that is fast enough, a last resort
may be to write a CSV file from Java and use that with a copy statement
in psql ;)
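That last-resort idea (write a file from Java, load it with COPY) might
look roughly like the sketch below. The file name is hypothetical and
the \copy invocation in the comment is illustrative, not taken from the
thread; real data would still need COPY-format escaping of tabs,
newlines and backslashes.

```java
import java.io.*;
import java.util.*;

// Sketch: dump the converted rows from Java into a tab-separated file
// and load it afterwards with psql's \copy. Table and column names
// match the conversion code in the thread.
public class CopyFileWriter {
    // Write rows as tab-separated lines, COPY's default text format.
    static void writeRows(Writer out, List<String[]> rows) throws IOException {
        for (String[] row : rows) {
            out.write(String.join("\t", row));
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] { "guid-1", "1001", "2005-09-23", "42" });
        try (Writer w = new FileWriter("orders.copy")) { // hypothetical path
            writeRows(w, rows);
        }
        // Then, in psql:
        //   \copy prototype.orders (objectid,ordernumber,orderdate,customernumber) from 'orders.copy'
    }
}
```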


Best regards,

Arjen



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Tom Lane
Joost Kraaijeveld [EMAIL PROTECTED] writes:
> I will test this for a performance improvement, but still, I wonder
> whether ~100 inserts/second is reasonable performance for my
> software/hardware combination.

Is the client code running on the same machine as the database server?
If not, what's the network delay and latency between them?

The major problem you're going to have here is at least one network
round trip per row inserted --- possibly more, if the jdbc driver is
doing helpful stuff behind your back like starting/committing
transactions.
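One way to cut the per-row round trips described above is standard JDBC
batching (addBatch/executeBatch). This is a sketch under assumptions,
not the thread's actual code: UUID.randomUUID() merely stands in for
Joost's ObjectIdGenerator, and the batch size is an arbitrary choice.

```java
import java.sql.*;
import java.util.UUID;

// Sketch: batch the inserts so the driver can send many rows per
// network round trip instead of one row at a time.
public class BatchedInsert {
    static final int BATCH_SIZE = 1000;

    // Flush after every BATCH_SIZE-th row (rows counted from 1).
    static boolean shouldFlush(int rowsAdded) {
        return rowsAdded > 0 && rowsAdded % BATCH_SIZE == 0;
    }

    static void convert(Connection connection, ResultSet resultSet)
            throws SQLException {
        connection.setAutoCommit(false); // one transaction, no per-row commit
        PreparedStatement insert = connection.prepareStatement(
            "insert into prototype.orders " +
            "(objectid,ordernumber,orderdate,customernumber) values (?,?,?,?)");
        int rows = 0;
        while (resultSet.next()) {
            // UUID stands in for the thread's GUID generator
            insert.setString(1, UUID.randomUUID().toString());
            insert.setInt(2, resultSet.getInt("ordernummer"));
            insert.setDate(3, resultSet.getDate("orderdatum"));
            insert.setInt(4, resultSet.getInt("klantnummer"));
            insert.addBatch();
            if (shouldFlush(++rows)) {
                insert.executeBatch(); // one round trip per BATCH_SIZE rows
            }
        }
        insert.executeBatch(); // flush the remainder
        connection.commit();
    }
}
```

Whether the 2005-era PostgreSQL JDBC driver actually coalesces a batch
into fewer network round trips is worth verifying; modern drivers do.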

regards, tom lane



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Joost Kraaijeveld
On Fri, 2005-09-23 at 10:33 -0400, Tom Lane wrote:
> Is the client code running on the same machine as the database server?
> If not, what's the network delay and latency between them?
Yes, it is running on the same machine.

> The major problem you're going to have here is at least one network
> round trip per row inserted --- possibly more, if the jdbc driver is
> doing helpful stuff behind your back like starting/committing
> transactions.
OK, I will look into that.

But do you know of a pointer to info, or tools that can measure, what my
machine is doing during all the time it appears to be doing nothing?
Something like the Performance Monitor in Windows, but for Linux?






Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Tom Lane
Joost Kraaijeveld [EMAIL PROTECTED] writes:
> But do you know of a pointer to info, or tools that can measure, what
> my machine is doing during all the time it appears to be doing nothing?
> Something like the Performance Monitor in Windows, but for Linux?

top, vmstat, iostat, sar, strace, oprofile, ...

regards, tom lane



Re: [PERFORM] How to determine cause of performance problem?

2005-09-23 Thread Michael Stone

On Fri, Sep 23, 2005 at 03:49:25PM +0200, Joost Kraaijeveld wrote:
> On Fri, 2005-09-23 at 07:05 -0400, Michael Stone wrote:
>> Ok, that's great, but you didn't respond to the suggestion of using
>> COPY INTO instead of INSERT.
>
> Part of the code I left out does some data conversions (e.g. from
> path-to-file to blob, from text to date (not castable because of the
> homebrew original format)). I don't believe I can do these in a SQL
> statement, can I (my knowledge of SQL as a language is not that good)?
> However, I will investigate whether I can do the conversion in two
> steps and check if that is faster.


I'm not sure what you're trying to say.

You're currently putting rows into the table by calling INSERT INTO for
each row. The sample code you sent could be rewritten to use COPY INTO
instead. For bulk inserts like you're doing, the copy approach will be a
lot faster. Instead of inserting one row, waiting for a reply, and
inserting the next row, you just cram data down a pipe to the server.


See:
http://www.postgresql.org/docs/8.0/interactive/sql-copy.html
http://www.faqs.org/docs/ppbook/x5504.htm

Mike Stone
