[PERFORM] Load experimentation

2009-12-07 Thread Ben Brehmer

Hello All,

I'm in the process of loading a massive amount of data (500 GB). After 
some initial timings, I'm looking at 260 hours to load the entire 500GB. 
10 days seems like an awfully long time so I'm searching for ways to 
speed this up. The load is happening in the Amazon cloud (EC2), on a 
m1.large instance:

-7.5 GB memory
-4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
-64-bit platform


So far I have modified my postgresql.conf  file (PostgreSQL 8.1.3). The 
modifications I have made are as follows:


shared_buffers = 786432
work_mem = 10240
maintenance_work_mem = 6291456
max_fsm_pages = 300
wal_buffers = 2048
checkpoint_segments = 200
checkpoint_timeout = 300
checkpoint_warning = 30
autovacuum = off


There are a variety of instance types available in the Amazon cloud 
(http://aws.amazon.com/ec2/instance-types/), including high memory and 
high CPU. High memory instance types come with 34GB or 68GB of memory. 
High CPU instance types have a lot less memory (7GB max) but up to 8 
virtual cores. I am more than willing to change to any of the other 
instance types.


Also, there is nothing else happening on the loading server. It is 
completely dedicated to the load.


Any advice would be greatly appreciated.

Thanks,

Ben

--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Load experimentation

2009-12-07 Thread Kevin Grittner
Ben Brehmer  wrote:
 
> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>each)
> -64-bit platform
 
What OS?
 
> (PostgreSQL 8.1.3)
 
Why use such an antiquated, buggy version?  Newer versions are
faster.
 
-Kevin



Re: [PERFORM] Load experimentation

2009-12-07 Thread Scott Mead
 On Mon, Dec 7, 2009 at 1:12 PM, Ben Brehmer  wrote:

> Hello All,
>
> I'm in the process of loading a massive amount of data (500 GB). After some
> initial timings, I'm looking at 260 hours to load the entire 500GB. 10 days
> seems like an awfully long time so I'm searching for ways to speed this up.
> The load is happening in the Amazon cloud (EC2), on a m1.large instance:
> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
> -64-bit platform
>
>
> So far I have modified my postgresql.conf  file (PostgreSQL 8.1.3). The
> modifications I have made are as follows:
>

Can you go with PG 8.4?  That's a start :-)

>
> shared_buffers = 786432
> work_mem = 10240
> maintenance_work_mem = 6291456
> max_fsm_pages = 300
> wal_buffers = 2048
> checkpoint_segments = 200
> checkpoint_timeout = 300
> checkpoint_warning = 30
> autovacuum = off
>

   I'd set fsync=off for the load; I'd also make sure that you're using the
COPY command (on the server side) to do the load.


Re: [PERFORM] Load experimentation

2009-12-07 Thread Thom Brown
2009/12/7 Kevin Grittner 

> Ben Brehmer  wrote:
>
> > -7.5 GB memory
> > -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
> >each)
> > -64-bit platform
>
> What OS?
>
> > (PostgreSQL 8.1.3)
>
> Why use such an antiquated, buggy version?  Newer versions are
> faster.
>
> -Kevin
>


I'd agree with trying to use the latest version you can.

How are you loading this data?  I'd make sure you haven't got any indices,
primary keys, triggers or constraints on your tables before you begin the
initial load; just add them afterwards.  Also use either the COPY command for
loading, or prepared statements.  Individual INSERT commands will just
take way too long.
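
A minimal sketch of scripting that approach (index names and file paths here are illustrative, and it assumes the CREATE INDEX statements are known up front):

```python
def wrap_load_script(index_ddl, data_file):
    """Emit a psql script that drops the indexes, loads the data, then
    recreates the indexes afterwards.  index_ddl maps index name ->
    CREATE INDEX statement; names and the file path are hypothetical."""
    lines = ["DROP INDEX IF EXISTS %s;" % name for name in index_ddl]
    lines.append("\\i %s" % data_file)  # psql's include directive
    lines.extend(index_ddl.values())
    return "\n".join(lines)

script = wrap_load_script(
    {"nx_mytable_col": "CREATE INDEX nx_mytable_col ON mytable (col);"},
    "sql_file.sql")
```

The emitted script can then be fed to psql in place of the raw dump file.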

Regards

Thom


Re: [PERFORM] Load experimentation

2009-12-07 Thread Ben Brehmer

Kevin,

This is running on x86_64-unknown-linux-gnu, compiled by GCC gcc 
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)


Ben



Re: [PERFORM] Load experimentation

2009-12-07 Thread Craig James

Ben Brehmer wrote:

Hello All,

I'm in the process of loading a massive amount of data (500 GB). After 
some initial timings, I'm looking at 260 hours to load the entire 500GB.


You don't say how you are loading the data, so there's not much to go on.  But 
generally, there are two primary ways to speed things up:

1. Group MANY inserts into a single transaction.  If you're inserting a row at a time, it 
will be very slow.  The "sweet spot" seems to be somewhere between 100 and 1000 
inserts in a single transaction.  Below 100 you're still slowing things down; above 
1000, it probably won't make much difference.

2. Use the COPY command.  This requires you to format your data into the form 
that COPY uses.  But it's VERY fast.
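
For illustration, a rough sketch of that formatting step (a hypothetical converter, not from the thread): COPY's text format is one tab-separated line per row, with NULL written as \N and embedded backslashes, tabs and newlines escaped:

```python
def to_copy_line(row):
    """Format one row for PostgreSQL COPY ... FROM STDIN (text format).

    NULLs become \\N; backslash, tab, newline and carriage return are
    escaped so they can't be mistaken for delimiters."""
    def fmt(value):
        if value is None:
            return r"\N"
        s = str(value)
        return (s.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))
    return "\t".join(fmt(v) for v in row) + "\n"

# Example: rows as produced by some upstream job
rows = [(1, "foo", None), (2, "has\ttab", "line\nbreak")]
copy_text = "".join(to_copy_line(r) for r in rows)
```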

Craig



Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Scott Carey



On 12/1/09 6:08 PM, "Karl Denninger"  wrote:

> Scott Carey wrote:
>>  
>> On 11/24/09 11:13 AM, "Scott Marlowe" 
>>   wrote:
>> 
>> 
>>   
>>  
>>>  
>>> They get good reviews as well.  Both manufacturers have their "star"
>>> performers, and their "utility" or work group class controllers.  For
>>> what you're doing the areca 12xx or 3ware 95xx series should do fine.
>>> 
>>>  
>>  
>> 
>> -1 to 3ware's SATA solutions
>> 
>> 3ware 95xx and 96xx had performance somewhere between PERC 5 (horrid) and
>> PERC 6 (mediocre) when I tested them with large SATA drives with RAID 10.
>> Haven't tried raid 6 or 5.  Haven't tried the "SA" model that supports SAS.
>> When a competing card (Areca or Adaptec) gets 3x the sequential throughput
>> on an 8 disk RAID 10 and only catches up to be 60% the speed after heavy
>> tuning of readahead value, there's something wrong.
>> Random access throughput doesn't suffer like that however -- but it's nice
>> when the I/O can sequential-scan faster than postgres can read the tuples.
>>   
> What operating system?
> 
> I am running under FreeBSD with 96xx series and am getting EXCELLENT
> performance.  Under Postgres 8.4.x on identical hardware except for the disk
> controller, I am pulling a literal 3x the iops on the same disks that I do
> with the Adaptec (!)
> 
> I DID note that under Linux the same hardware was a slug.
> 
> Hm...
> 

Linux, Centos 5.3.  Drivers/OS can certainly make a big difference.

> 
> -- Karl
> 




Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Scott Carey



On 12/1/09 6:49 PM, "Greg Smith"  wrote:

> Scott Carey wrote:
>> 3ware 95xx and 96xx had performance somewhere between PERC 5 (horrid) and
>> PERC 6 (mediocre) when I tested them with large SATA drives with RAID 10.
>> Haven't tried raid 6 or 5.  Haven't tried the "SA" model that supports SAS
> The only models I've tested and recommended lately are exactly those
> though.  The 9690SA is the earliest 3ware card I've mentioned as seeming
> to have reasonable performance.  The 95XX cards are certainly much
> slower than similar models from, say, Areca.  I've never had one of the
> earlier 96XX models to test.  Now you've got me wondering what the
> difference between the earlier and current 96XX models really is.

9650 was made by 3Ware, essentially a PCIe version of the 9550. The 9690SA
was from some sort of acquisition/merger. They are not the same product line
at all.
3Ware, IIRC, has its roots in ATA and SATA RAID.


I gave up on them after the 9650 and 9550 experiences (on Linux) though.

> 
> --
> Greg Smith    2ndQuadrant    Baltimore, MD
> PostgreSQL Training, Services and Support
> [email protected]   www.2ndQuadrant.com
> 
> 




Re: [PERFORM] Load experimentation

2009-12-07 Thread Ben Brehmer
Thanks for the quick responses. I will respond to all questions in one 
email:


By "loading data" I mean: "psql -U postgres -d somedatabase -f 
sql_file.sql".  The sql_file.sql contains table creates and insert 
statements. There are no indexes present, nor any created during the load.


OS: x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 
(Red Hat 4.1.2-44)


PostgreSQL: I will try upgrading to latest version.

COPY command: Unfortunately I'm stuck with INSERTs due to the way this 
data was generated (Hadoop/MapReduce).


Transactions: I have started a second load process with chunks of 1000 
inserts wrapped in a transaction. It's dropped the load time for 1000 
inserts from 1 hour to 7 minutes :)
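
That kind of batching can be sketched roughly as follows, assuming one complete INSERT statement per line of the dump file (the helper name is illustrative):

```python
def batch_inserts(lines, batch_size=1000):
    """Wrap consecutive INSERT statements in BEGIN/COMMIT blocks so each
    batch commits as one transaction.  Assumes one complete statement
    per input line, as is typical for generated dump files."""
    count = 0
    for line in lines:
        if count == 0:
            yield "BEGIN;"
        yield line.rstrip("\n")
        count += 1
        if count == batch_size:
            yield "COMMIT;"
            count = 0
    if count:  # close a partially filled final batch
        yield "COMMIT;"

statements = ["INSERT INTO t VALUES (%d);" % i for i in range(5)]
batched = list(batch_inserts(statements, batch_size=2))
```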


Disk Setup: Using a single disk Amazon image for the destination 
(database). Source is coming from an EBS volume. I didn't think there 
were any disk options in Amazon?



Thanks!

Ben







Re: [PERFORM] Load experimentation

2009-12-07 Thread Craig James

Ben Brehmer wrote:
Thanks for the quick responses. I will respond to all questions in one 
email:


By "Loading data" I am implying: "psql -U postgres -d somedatabase -f 
sql_file.sql".  The sql_file.sql contains table creates and insert 
statements. There are no indexes present nor created during the load.


Although transactions of over 1000 INSERT statements don't speed things up much, they don't hurt 
either, especially on a new system that nobody is using yet.  Since you're loading from big SQL 
files using psql, just put a "begin;" at the top of the file and a "commit;" at 
the bottom.  Unlike Oracle, Postgres even allows CREATE and such to be done inside a transaction.

And BTW, don't forget to ANALYZE when you're all done.

Craig





Re: [PERFORM] Load experimentation

2009-12-07 Thread Alan Hodgson
On Monday 07 December 2009, Ben Brehmer  wrote:
> Disk Setup: Using a single disk Amazon image for the destination
> (database). Source is coming from an EBS volume. I didn't think there
> were any disk options in Amazon?

I don't think any Amazon cloud service is particularly well suited to a 
database. Combine virtualized hosts with terrible I/O, and it's actually 
hard to envision a worse place to run a database.

-- 
"No animals were harmed in the recording of this episode. We tried but that 
damn monkey was just too fast."



Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Karl Denninger
Scott Carey wrote:
> On 12/1/09 6:49 PM, "Greg Smith"  wrote:
>
>   
>> Scott Carey wrote:
>> 
>>> 3ware 95xx and 96xx had performance somewhere between PERC 5 (horrid) and
>>> PERC 6 (mediocre) when I tested them with large SATA drives with RAID 10.
>>> Haven't tried raid 6 or 5.  Haven't tried the "SA" model that supports SAS
>>>   
>> The only models I've tested and recommended lately are exactly those
>> though.  The 9690SA is the earliest 3ware card I've mentioned as seeming
>> to have reasonable performance.  The 95XX cards are certainly much
>> slower than similar models from, say, Areca.  I've never had one of the
>> earlier 96XX models to test.  Now you've got me wondering what the
>> difference between the earlier and current 96XX models really is.
>> 
>
> 9650 was made by 3Ware, essentially a PCIe version of the 9550. The 9690SA
> was from some sort of acquisition/merger. They are not the same product line
> at all.
> 3Ware, IIRC, has its roots in ATA and SATA RAID.
>
>
> I gave up on them after the 9650 and 9550 experiences (on Linux) though.
>   
My experience under FreeBSD:

1. The Adaptecs suck: 1/3rd to 1/2 the performance of
2. the 9650 3ware boards, which under FreeBSD are quite fast.
3. However, the Areca 1680-IX is UNBELIEVABLY fast.  Ridiculously so, in
fact.

I have a number of 9650s in service and have been happy with them under
FreeBSD.  Under Linux, however, they bite in comparison.

The Areca 1680 is not cheap.  However, it comes with out-of-band
management (IP-KVM, direct SMTP and SNMP connectivity, etc) which is
EXTREMELY nice, especially for colocated machines where you need a way
in if things go horribly wrong.

One warning: I have had problems with the Areca under FreeBSD if you set
up a passthrough (e.g. JBOD) disc, delete it from the config while
running and then either accidentally touch the device nodes OR try to
use FreeBSD's "camcontrol" utility to tell it to pick up driver
changes.  Either is a great way to panic the machine. 

As such for RAID it's fine but use care if you need to be able to swap
NON-RAID disks while the machine is operating (e.g. for backup purposes
- run a dump then dismount it and pull the carrier) - it is dangerous to
attempt this (the 3Ware card does NOT have this problem.)  I am trying
to figure out exactly what provokes this and if I can get around it at
this point (in the lab of course!)

No experience with the 9690 3Wares as of yet.

-- Karl



Re: [PERFORM] Load experimentation

2009-12-07 Thread Greg Smith

Ben Brehmer wrote:
By "Loading data" I am implying: "psql -U postgres -d somedatabase -f 
sql_file.sql".  The sql_file.sql contains table creates and insert 
statements. There are no indexes present nor created during the load.
COPY command: Unfortunately I'm stuck with INSERTS due to the nature 
this data was generated (Hadoop/MapReduce).
Your basic options here are to batch the INSERTs into bigger chunks, 
and/or to split your data file up so that it can be loaded by more than 
one process at a time.  There are some comments and links to more guidance 
at http://wiki.postgresql.org/wiki/Bulk_Loading_and_Restores
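
A minimal sketch of the splitting idea, assuming the INSERT statements are independent so round-robin distribution across parallel psql sessions is safe (helper name is illustrative):

```python
def split_round_robin(lines, n_chunks):
    """Distribute statements across n_chunks lists, round-robin, so
    each chunk can be fed to its own psql session in parallel.
    Assumes no cross-row ordering requirements between statements."""
    chunks = [[] for _ in range(n_chunks)]
    for i, line in enumerate(lines):
        chunks[i % n_chunks].append(line)
    return chunks

# Each chunk would then be written to its own file and loaded
# concurrently, e.g. one "psql -f chunkN.sql" per CPU core.
chunks = split_round_robin(["INSERT ...;"] * 10, 4)
```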


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]   www.2ndQuadrant.com



Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Greg Smith

Scott Carey wrote:

9650 was made by 3Ware, essentially a PCIe version of the 9550. The 9690SA
was from some sort of acquisition/merger. They are not the same product line
at all.
  
3ware became a division of AMCC, which was then bought by LSI.  The 
9690SA came out while they were a part of AMCC.


I was under the impression that the differences between the 9650 and the 
9690SA were mainly related to adding SAS support, which was sort of a 
bridge addition rather than a fundamental change in the design of the 
card.  You'll often see people refer to "9650/9690" as if they're the 
same card; they may never run the same firmware.  They certainly always 
get firmware updates at the same time, and as part of the same download 
package.


Another possibility for the difference between Scott's experience and 
mine is that I've only evaluated those particular cards recently, and 
there seems to be evidence that 3ware did some major firmware overhauls 
in late 2008, i.e. 
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/performance/2008-10/msg5.html


Let me try to summarize where things are at a little more clearly, with 
the data accumulated during this long thread:


-Areca:  Usually the fastest around.  Management tools are limited 
enough that you really want the version with the on-board management 
NIC.  May require some testing to find a good driver version.


-3ware:  Performance on current models not as good as Areca, but with a 
great set of management tools (unless you're using SAS) and driver 
reliability.  Exact magnitude of the performance gap with Areca is 
somewhat controversial and may depend on OS--FreeBSD performance might 
be better than Linux in particular.  Older 3ware cards were really slow.


One of these days I need to wrangle up enough development cash to buy 
current Areca and 3ware cards, an Intel SSD, and disappear into the lab 
(already plenty of drives here) until I've sorted this all out to my 
satisfaction.


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]   www.2ndQuadrant.com




[PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-07 Thread Schmitz, David
Hello everybody,

we have severe performance penalty between Postgresql 8.3.8 and 8.4.1

Consider the following tables:

CREATE TABLE xdf.xdf_admin_hierarchy
(
  admin_place_id integer NOT NULL,
  admin_order smallint NOT NULL,
  iso_country_code character(3) NOT NULL,
  country_id integer NOT NULL,
  order1_id integer,
  order2_id integer,
  order8_id integer,
  builtup_id integer,
  num_links integer,
  CONSTRAINT pk_xdf_admin_hierarchy PRIMARY KEY (admin_place_id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE xdf.xdf_admin_hierarchy OWNER TO frog;

CREATE TABLE xdf.xdf_link_admin
(
  admin_place_id integer NOT NULL,
  link_id integer NOT NULL,
  side character(1) NOT NULL,
  CONSTRAINT pk_xdf_link_admin PRIMARY KEY (link_id, side)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE xdf.xdf_link_admin OWNER TO frog;

CREATE INDEX nx_xdflinkadmin_adminplaceid
  ON xdf.xdf_link_admin
  USING btree
  (admin_place_id);

CREATE INDEX nx_xdflinkadmin_linkid
  ON xdf.xdf_link_admin
  USING btree
  (link_id);

CREATE TABLE xdf.xdf_road_link
(
  road_link_id integer NOT NULL,
  road_name_id integer,
  left_address_range_id integer NOT NULL,
  right_address_range_id integer NOT NULL,
  address_type smallint NOT NULL,
  is_exit_name character(1) NOT NULL,
  explicatable character(1) NOT NULL,
  is_junction_name character(1) NOT NULL,
  is_name_on_roadsign character(1) NOT NULL,
  is_postal_name character(1) NOT NULL,
  is_stale_name character(1) NOT NULL,
  is_vanity_name character(1) NOT NULL,
  is_scenic_name character(1) NOT NULL,
  link_id integer NOT NULL,
  CONSTRAINT pk_xdf_road_link PRIMARY KEY (road_link_id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE xdf.xdf_road_link OWNER TO frog;

CREATE INDEX nx_xdfroadlink_leftaddressrangeid
  ON xdf.xdf_road_link
  USING btree
  (left_address_range_id);

CREATE INDEX nx_xdfroadlink_linkid
  ON xdf.xdf_road_link
  USING btree
  (link_id);

CREATE INDEX nx_xdfroadlink_rightaddressrangeid
  ON xdf.xdf_road_link
  USING btree
  (right_address_range_id);

CREATE INDEX nx_xdfroadlink_roadnameid
  ON xdf.xdf_road_link
  USING btree
  (road_name_id);

CREATE TABLE xdf.xdf_road_name
(
  road_name_id integer NOT NULL,
  route_type smallint NOT NULL,
  attached_to_base character(1) NOT NULL,
  precedes_base character(1) NOT NULL,
  prefix character varying(10),
  street_type character varying(30),
  suffix character varying(2),
  base_name character varying(60) NOT NULL,
  language_code character(3) NOT NULL,
  is_exonym character(1) NOT NULL,
  name_type character(1) NOT NULL,
  direction_on_sign character(1) NOT NULL,
  street_name character varying(60) NOT NULL,
  CONSTRAINT pk_xdf_road_name PRIMARY KEY (road_name_id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE xdf.xdf_road_name OWNER TO frog;

CREATE INDEX nx_xdfroadname_languagecode
  ON xdf.xdf_road_name
  USING btree
  (language_code);

If one executes a query of the following structure:

SELECT AH.ORDER8_ID, AH.BUILTUP_ID, RL.LINK_ID, LA.SIDE,
RL.ROAD_NAME_ID, RL.LEFT_ADDRESS_RANGE_ID, RL.RIGHT_ADDRESS_RANGE_ID,
RL.IS_EXIT_NAME, RL.EXPLICATABLE, RL.IS_JUNCTION_NAME,
RL.IS_NAME_ON_ROADSIGN, RL.IS_POSTAL_NAME, RL.IS_STALE_NAME,
RL.IS_VANITY_NAME, RL.ROAD_LINK_ID, RN.STREET_NAME,
RN.ROUTE_TYPE
FROM xdf.xdf_ADMIN_HIERARCHY AH, xdf.xdf_LINK_ADMIN LA,
xdf.xdf_ROAD_LINK RL, xdf.xdf_ROAD_NAME RN
WHERE AH.ADMIN_PLACE_ID = LA.ADMIN_PLACE_ID
AND LA.LINK_ID = RL.LINK_ID
AND RL.ROAD_NAME_ID = RN.ROAD_NAME_ID
AND RL.IS_EXIT_NAME = 'N'
AND RL.IS_JUNCTION_NAME = 'N'
AND RN.ROAD_NAME_ID BETWEEN 158348561 AND 158348660
ORDER BY RL.ROAD_NAME_ID, AH.ORDER8_ID, AH.BUILTUP_ID, RL.LINK_ID;
   
It is carried out with poor performance on PostgreSQL 8.4.1; however, PostgreSQL 
8.3.8 performs just fine.
If you take a closer look at the query with EXPLAIN, it becomes obvious that 
PostgreSQL 8.4 does not
consider the primary key at level 3 and instead generates a hash join:

Postgresql 8.4.1:

Sort  (cost=129346.71..129498.64 rows=60772 width=61)
  Sort Key: rl.road_name_id, ah.order8_id, ah.builtup_id, rl.link_id
  ->  Hash Join  (cost=2603.57..124518.03 rows=60772 width=61)
Hash Cond: (la.admin_place_id = ah.admin_place_id)
->  Nested Loop  (cost=6.82..120781.81 rows=60772 width=57)
  ->  Nested Loop  (cost=6.82..72383.98 rows=21451 width=51)
->  Index Scan using pk_rdf_road_name on rdf_road_name rn  
(cost=0.00..11.24 rows=97 width=21)
  Index Cond: ((road_name_id >= 158348561) AND 
(road_name_id <= 158348660))
->  Bitmap Heap Scan on rdf_road_link rl  
(cost=6.82..743.34 rows=222 width=34)
  Recheck Cond: (rl.road_name_id = rn.road_name_id)
  Filter: ((rl.is_exit_name = 'N'::bpchar) AND 
(rl.is_junction_name = 'N'::bpchar))
  ->  Bitmap Index Scan on nx_rdfroadlink_roadnameid  
(cost=0.00..6.76 rows=222 width=0)
Index Cond: (rl.road_name_id = rn.road_name_id)
  -> 

Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-07 Thread Kevin Grittner
"Schmitz, David"  wrote:
 
> It is carried out with poor performance on postgresql 8.4.1
> However postgresql 8.3.8 performs just fine.
> If you take a closer look at the query with EXPLAIN, it becomes
> obvious, that postgresql 8.4 does not consider the primary key at
> level 3 and instead generates a hash join:
 
> Postgresql 8.4.1:
> 
> Sort  (cost=129346.71..129498.64 rows=60772 width=61)
 
> Postgresql 8.3.8:
> 
> Sort  (cost=3792.75..3792.95 rows=81 width=61)
 
It determines the plan based on available statistics, which in this
case seem to indicate rather different data.  Do the two databases
have identical data?  Have they both been recently analyzed?  What
is the default_statistics_target on each?  Do any columns in these
tables have overrides?
 
-Kevin



Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Craig James

Greg Smith wrote:
Let me try to summarize where things are at a little more clearly, with 
the data accumulated during this long thread:


-Areca:  Usually the fastest around.  Management tools are limited 
enough that you really want the version with the on-board management 
NIC.  May require some testing to find a good driver version.


-3ware:  Performance on current models not as good as Areca, but with a 
great set of management tools (unless you're using SAS) and driver 
reliability.  Exact magnitude of the performance gap with Areca is 
somewhat controversial and may depend on OS--FreeBSD performance might 
be better than Linux in particular.  Older 3ware cards were really slow.


One of these days I need to wrangle up enough development cash to buy 
current Areca and 3ware cards, an Intel SSD, and disappear into the lab 
(already plenty of drives here) until I've sorted this all out to my 
satisfaction.


... and do I hear you saying that no other vendor is worth considering?  Just 
how far off are they?

Thanks,
Craig




Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Karl Denninger
Greg Smith wrote:
> Scott Carey wrote:
>> 9650 was made by 3Ware, essentially a PCIe version of the 9550. The
>> 9690SA
>> was from some sort of acquisition/merger. They are not the same
>> product line
>> at all.
>>   
> 3ware became a division of AMCC, which was then bought by LSI.  The
> 9590SA came out while they were a part of AMCC.
>
> I was under the impression that the differences between the 9650 and
> the 9690SA were mainly related to adding SAS support, which was sort
> of a bridge addition rather than a fundamental change in the design of
> the card.  You'll often see people refer to "9650/9690" as if they're
> the same card; they may never run the same firmware.  They certainly
> always get firmware updates at the same time, and as part of the same
> download package.
>
> Another possibility for the difference between Scott's experience and
> mine is that I've only evaluated those particular cards recently, and
> there seems to be evidence that 3ware did some major firmware
> overhauls in late 2008, i.e.
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/performance/2008-10/msg5.html
>
>
> Let me try to summarize where things are at a little more clearly,
> with the data accumulated during this long thread:
>
> -Areca:  Usually the fastest around.  Management tools are limited
> enough that you really want the version with the on-board management
> NIC.  May require some testing to find a good driver version.
>
> -3ware:  Performance on current models not as good as Areca, but with
> a great set of management tools (unless you're using SAS) and driver
> reliability.  Exact magnitude of the performance gap with Areca is
> somewhat controversial and may depend on OS--FreeBSD performance might
> be better than Linux in particular.  Older 3ware cards were really slow.
>
> One of these days I need to wrangle up enough development cash to buy
> current Areca and 3ware cards, an Intel SSD, and disappear into the
> lab (already plenty of drives here) until I've sorted this all out to
> my satisfaction.
Most common SSDs will NOT come up on the 3ware cards at present.  Not
sure why as of yet - I've tried several.

Not had the time to screw with them on the ARECA cards yet.

-- Karl


Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Greg Smith

Craig James wrote:
... and do I hear you saying that no other vendor is worth 
considering?  Just how far off are they?
I wasn't trying to summarize every possibility, just the 
complicated ones there's some debate over.


What else is OK besides Areca and 3ware?  HP's P800 is good, albeit not 
so easy to buy unless you're getting an HP system.  The LSI Megaraid 
stuff and its close relative the Dell PERC6 are OK for some apps too; my 
intense hatred of Dell usually results in my forgetting about them.  (As 
an example, 
http://en.wikipedia.org/wiki/ATX#Issues_with_Dell_power_supplies 
documents what I consider the worst design decision ever made by a PC 
manufacturer)


I don't think any of the other vendors on the market are viable for a 
Linux system due to driver issues and general low quality, which 
includes  Adaptec, Promise, Highpoint, and all the motherboard Fake RAID 
stuff from Silicon Image, Intel, Via, etc.  I don't feel there's any 
justification for using those products instead of using a simple SATA 
controller and Linux software RAID in a PostgreSQL context.


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]   www.2ndQuadrant.com




Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Greg Smith

Karl Denninger wrote:

Most common SSDs will NOT come up on the 3ware cards at present.  Not
sure why as of yet - I've tried several.
  
Right, and they're being rather weaselly at 
http://www.3ware.com/kb/Article.aspx?id=15470 talking about it too.

Not had the time to screw with them on the ARECA cards yet.
  

I know the situation there is much better, like:
http://hothardware.com/News/24-Samsung-SSDs-Linked-Together-for-2GBSec/

Somebody at Newegg has said they got their Areca 1680 working with one 
of the Intel X-25 drives, but they weren't impressed by the write
performance of the result.  Makes me wonder if the Areca card is messing 
with the write cache of the drive.


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]  www.2ndQuadrant.com




Re: [PERFORM] RAID card recommendation

2009-12-07 Thread Karl Denninger
Greg Smith wrote:
> Craig James wrote:
>> ... and do I hear you saying that no other vendor is worth
>> considering?  Just how far off are they?
> I wasn't trying to summarize every possibility, just the
> complicated ones there's some debate over.
>
> What else is OK besides Areca and 3ware?  HP's P800 is good, albeit
> not so easy to buy unless you're getting an HP system.  The LSI
> Megaraid stuff and its close relative the Dell PERC6 are OK for some
> apps too; my intense hatred of Dell usually results in my forgetting
> about them.  (As an example,
> http://en.wikipedia.org/wiki/ATX#Issues_with_Dell_power_supplies
> documents what I consider the worst design decision ever made by a PC
> manufacturer)
>
> I don't think any of the other vendors on the market are viable for a
> Linux system due to driver issues and general low quality, which
> includes  Adaptec, Promise, Highpoint, and all the motherboard Fake
> RAID stuff from Silicon Image, Intel, Via, etc.  I don't feel there's
> any justification for using those products instead of using a simple
> SATA controller and Linux software RAID in a PostgreSQL context.
The LSI Megaraid (and Intel's repackaging of it, among others) is
reasonably good under FreeBSD.

Performance is slightly worse than the 3ware 95xx series boards, but not
materially so.

Their CLI interface is "interesting" (it drops a log file in the working
directory BY DEFAULT unless you tell it otherwise, among other things.)

-- Karl


Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-07 Thread Schmitz, David
-Original Message-
From:    Schmitz, David
Sent:    Tue 08.12.2009 00:14
To:      Kevin Grittner
Cc: 
Subject: RE: [PERFORM] performance penalty between Postgresql 8.3.8 and 
8.4.1




-Original Message-
From:    Kevin Grittner [mailto:[email protected]]
Sent:    Mon 07.12.2009 23:19
To:      Schmitz, David; [email protected]
Cc: 
Subject: Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 
8.4.1

"Schmitz, David"  wrote:
 
> It is carried out with poor performance on postgresql 8.4.1
> However postgresql 8.3.8 performs just fine.
> If you take a closer look at the query with EXPLAIN, it becomes
> obvious, that postgresql 8.4 does not consider the primary key at
> level 3 and instead generates a hash join:
 
> Postgresql 8.4.1:
> 
> Sort  (cost=129346.71..129498.64 rows=60772 width=61)
 
> Postgresql 8.3.8:
> 
> Sort  (cost=3792.75..3792.95 rows=81 width=61)
 
It determines the plan based on available statistics, which in this
case seem to indicate rather different data.  Do the two databases
have identical data?  Have they both been recently analyzed?  What
is the default_statistics_target on each?  Do any columns in these
tables have overrides?
 
-Kevin


Hello Kevin,

Both databases have identical data and run on the same hardware. On PostgreSQL 
8.3.8 the default statistics target is 10 and on PostgreSQL 8.4.1 it is 100, but I 
have been experimenting in both directions with Postgres 8.4.1: 10, 100, 1000 or 
1 makes no difference, performance remains bad. ANALYZE has been run recently on 
both databases (even an explicit ANALYZE right before the query makes no difference). 
Autovacuum and analyze are set quite aggressively at 0.01 (vacuum) and 0.02 (analyze), 
and Postgres 8.3.8 still outperforms 8.4.1.

Regards

dave 
 
***
innovative systems GmbH Navigation-Multimedia
Geschaeftsfuehrung: Edwin Summers - Michael Juergen Mauser
Sitz der Gesellschaft: Hamburg - Registergericht: Hamburg HRB 59980 
 



Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-07 Thread Andres Freund
Hi David,

On Monday 07 December 2009 23:05:14 Schmitz, David wrote:
> With our data it is a performance difference from 1h16min (8.3.8) to
>  2h43min (8.4.1)
Can you afford an EXPLAIN ANALYZE run overnight or so for both?

Andres



[PERFORM] Dynamlically updating the estimated cost of a transaction

2009-12-07 Thread Hasini Gunasinghe
Hi,

I am looking for a way to let the user know the estimated time for the
current transaction he requested and, while the transaction is in progress,
how much time has elapsed as a fraction of the total estimated time at a
particular instant, by dynamically re-estimating the time for the
transaction at that instant.

I have read about how PostgreSQL estimates the cost of a particular operation.
Does the estimated cost mean the estimated time to evaluate an operation in the
context of PostgreSQL? And may I know if there is any way to achieve the
requirement I mentioned above with PostgreSQL?

I would be very much thankful for your suggestions as I am doing a research
project to implement a mechanism to achieve the above mentioned task.

Thank you.

regards,
Hasini.


Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-07 Thread Robert Haas
On Mon, Dec 7, 2009 at 5:19 PM, Kevin Grittner
 wrote:
> "Schmitz, David"  wrote:
>
>> It is carried out with poor performance on postgresql 8.4.1
>> However postgresql 8.3.8 performs just fine.
>> If you take a closer look at the query with EXPLAIN, it becomes
>> obvious, that postgresql 8.4 does not consider the primary key at
>> level 3 and instead generates a hash join:
>
>> Postgresql 8.4.1:
>>
>> Sort  (cost=129346.71..129498.64 rows=60772 width=61)
>
>> Postgresql 8.3.8:
>>
>> Sort  (cost=3792.75..3792.95 rows=81 width=61)
>
> It determines the plan based on available statistics, which in this
> case seem to indicate rather different data.  Do the two databases
> have identical data?  Have they both been recently analyzed?  What
> is the default_statistics_target on each?  Do any columns in these
> tables have overrides?

I think Tom made some changes to the join selectivity code which might
be relevant here, though I'm not sure exactly what's going on.  Can we
see, on the 8.4.1 database:

SELECT SUM(1) FROM rdf_admin_hierarchy;
SELECT s.stadistinct, s.stanullfrac, s.stawidth,
array_upper(s.stanumbers1, 1) FROM pg_statistic s WHERE s.starelid =
'rdf_admin_hierarchy'::regclass AND s.staattnum = (SELECT a.attnum
FROM pg_attribute a WHERE a.attname = 'admin_place_id' AND a.attrelid
= 'rdf_admin_hierarchy'::regclass);

...Robert



[PERFORM] error occured in dbt2 against with postgresql

2009-12-07 Thread Niu Yan
$ dbt2-run-workload -a pgsql -d 10 -w 1 -c 1 -o
/home/store/tmp/testdbt2/out-1.o
postmaster starting
DBT-2 test for pgsql started...

DATABASE SYSTEM: localhost
DATABASE NAME: dbt2
DATABASE CONNECTIONS: 1
TERMINAL THREADS: 10
TERMINALS PER WAREHOUSE: 10
WAREHOUSES PER THREAD/CLIENT PAIR: 500
SCALE FACTOR (WAREHOUSES): 1
DURATION OF TEST (in sec): 10
1 client stared every 1000 millisecond(s)

Stage 1. Starting up client...
Sleeping 501 seconds
collecting database statistics...

Stage 2. Starting up driver...
1000 threads started per millisecond
estimated rampup time: Sleeping 5010 seconds

estimated rampup time has elapsed
estimated steady state time: Sleeping 10 seconds

Stage 3. Processing of results...
Killing client...
/usr/local/bin/dbt2-run-workload: line 518: 24055 Terminated
dbt2-client ${CLIENT_COMMAND_ARGS} -p ${PORT} -o ${CDIR} >
${CLIENT_OUTPUT_DIR}/`hostname`/client-${SEG}.out 2>&1
waiting for postmaster to shut down done
postmaster stopped
Can't use an undefined value as an ARRAY reference at
/usr/lib/perl5/site_perl/5.8.8/Test/Parser/Dbt2.pm line 521.
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is b6:da:1d:d0:28:d7:ed:06:08:72:44:de:02:f1:b9:52.
Are you sure you want to continue connecting (yes/no)?
Host key verification failed.
Test completed.
Results are in: /home/store/tmp/testdbt2/out-1.o

                              Response Time (s)
 Transaction      %    Average :    90th %        Total    Rollbacks      %
 -----------  -----  ---------------------  -----------  -----------  -----


-- 
Best regards,
Niu Yan


Re: [PERFORM] Dynamlically updating the estimated cost of a transaction

2009-12-07 Thread Greg Smith

Hasini Gunasinghe wrote:


I am looking for a way to let the user know the estimated time 
for the current transaction he requested and, while the transaction is 
in progress, how much time has elapsed as a fraction of the total 
estimated time at a particular instant, by dynamically re-estimating 
the time for the transaction at that instant.
I think this one needs to get added to the FAQ.  To re-use from when I 
answered this last month:  this just isn't exposed in PostgreSQL yet. 
Clients ask for queries to be run, eventually they get rows of results 
back, but there's no notion of how many they're going to get in advance 
or how far along they are in executing the query's execution plan. 
There's a couple of academic projects that have started exposing more of 
the query internals, but I'm not aware of anyone who's even started 
moving in the direction of what you'd need to produce a progress bar or 
estimate a total run-time.  It's a hard problem--you could easily spend 
several years of your life on this alone and still not have even a 
mediocre way to predict how much time is left to execute a generic query.


In practice, people tend to save query log files showing historical 
information about how long queries took to run, and then use that to 
predict future response times.  That's a much easier way to get 
something useful for a lot of applications than expecting you can ever 
estimate just based on an EXPLAIN plan.
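The log-file approach described above can be sketched in a few lines. This is a hypothetical illustration, not part of the original thread: it assumes log lines in the shape produced when log_min_duration_statement is enabled, and grouping by the statement's first keyword is a deliberate simplification of how one would key historical timings.

```python
import re
from collections import defaultdict

# Typical line produced with log_min_duration_statement enabled:
#   LOG:  duration: 132.564 ms  statement: SELECT * FROM orders WHERE ...
DURATION_RE = re.compile(r"duration: ([\d.]+) ms\s+(?:statement|execute [^:]+): (.*)")

def average_durations(log_lines):
    """Group logged durations by the first keyword of each statement
    and return the average runtime (ms) per statement type."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in log_lines:
        m = DURATION_RE.search(line)
        if not m:
            continue
        ms, statement = float(m.group(1)), m.group(2)
        key = statement.split(None, 1)[0].upper()  # e.g. SELECT, INSERT
        totals[key][0] += ms
        totals[key][1] += 1
    return {k: total / count for k, (total, count) in totals.items()}

sample = [
    "LOG:  duration: 120.0 ms  statement: SELECT * FROM t",
    "LOG:  duration: 80.0 ms  statement: SELECT count(*) FROM t",
    "LOG:  duration: 30.0 ms  statement: INSERT INTO t VALUES (1)",
]
print(average_durations(sample))  # {'SELECT': 100.0, 'INSERT': 30.0}
```

A real deployment would key on normalized query text rather than the leading keyword, but the averaging idea is the same.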


I have read about how PostgreSQL estimates the cost of a particular 
operation. Does the estimated cost mean the estimated time to evaluate 
an operation in the context of PostgreSQL? And may I know if there is 
any way to achieve the requirement I mentioned above with PostgreSQL?
Estimated costs are not ever used to predict an estimated time.  An 
interesting research project would be trying to tie the two together 
more tightly, by collecting a bunch of data measuring real EXPLAIN 
ANALYZE execution times with their respective cost estimates.


Even after you collected it, actually using the data from such research 
is quite tricky.  For example, people have tried to tie some of the 
individual cost components to the real world--for example, measuring the 
true amount of time it takes to do a sequential read vs. a seek and 
adjusting random_page_cost accordingly.  But if you then set 
random_page_cost to its real-world value based on that estimate, you get 
a value well outside what seems to work for people in practice.  This 
suggests the underlying cost estimate doesn't reflect the real-world 
value it intends to that closely.  But improving on that situation 
without going backwards in the quality of the plans the query optimizer 
produces is a tricky problem.
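As a hypothetical illustration of the measurement idea above (not from the original thread): a toy benchmark comparing sequential and random 8 kB reads. On a small file served from a warm OS cache the ratio it reports will bear little resemblance to the planner's default random_page_cost of 4.0, which is exactly the calibration gap being described; a serious measurement needs a file far larger than RAM and cache-bypassing I/O.

```python
import os, random, tempfile, time

BLOCK = 8192        # PostgreSQL's default page size
NBLOCKS = 2000      # ~16 MB scratch file; a real test needs far more

# Build a scratch file to read back in two access patterns.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NBLOCKS))
    path = f.name

def time_sequential(path):
    """Read the whole file front to back, one page at a time."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass
    return time.perf_counter() - start

def time_random(path, reads=NBLOCKS):
    """Read the same number of pages, but from random offsets."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(reads):
            f.seek(random.randrange(NBLOCKS) * BLOCK)
            f.read(BLOCK)
    return time.perf_counter() - start

seq, rnd = time_sequential(path), time_random(path)
print("measured random/sequential ratio: %.1f" % (rnd / seq))
os.remove(path)
```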


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]  www.2ndQuadrant.com




Re: [PERFORM] Load experimentation

2009-12-07 Thread Ben Brehmer

Thanks for all the responses. I have one more thought:

Since my input data is split into about 200 files (3GB each), I could 
potentially spawn one load command for each file. What would be the 
maximum number of input connections Postgres can handle without bogging 
down? When I say 'input connection' I mean "psql -U postgres -d dbname 
-f one_of_many_sql_files".


Thanks,
Ben



On 07/12/2009 12:59 PM, Greg Smith wrote:

Ben Brehmer wrote:
By "Loading data" I am implying: "psql -U postgres -d somedatabase -f 
sql_file.sql".  The sql_file.sql contains table creates and insert 
statements. There are no indexes present nor created during the load.
COPY command: Unfortunately I'm stuck with INSERTS due to the nature 
this data was generated (Hadoop/MapReduce).
Your basic options here are to batch the INSERTs into bigger chunks, 
and/or to split your data file up so that it can be loaded by more 
than one process at a time.  There's some comments and links to more 
guidance here at http://wiki.postgresql.org/wiki/Bulk_Loading_and_Restores
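One hedged sketch of the batching idea (an illustration, not code from the thread): collapse runs of single-row INSERTs into multi-row VALUES lists, which PostgreSQL 8.2 and later accept and which cut per-statement parse overhead. The regex-based parsing is deliberately naive and would break on semicolons or parentheses inside string literals.

```python
import itertools
import re

# Matches the simple single-row form: INSERT INTO t VALUES (...);
INSERT_RE = re.compile(r"INSERT INTO (\w+) VALUES (\(.*\));\s*$", re.IGNORECASE)

def batch_inserts(lines, batch_size=1000):
    """Collapse consecutive single-row INSERTs into the same table
    into multi-row INSERT ... VALUES (...), (...); statements."""
    out = []
    matches = filter(None, (INSERT_RE.match(l) for l in lines))
    for table, group in itertools.groupby(matches, key=lambda m: m.group(1)):
        rows = [m.group(2) for m in group]
        for i in range(0, len(rows), batch_size):
            out.append("INSERT INTO %s VALUES %s;" %
                       (table, ",\n  ".join(rows[i:i + batch_size])))
    return out

lines = [
    "INSERT INTO t VALUES (1,'a');",
    "INSERT INTO t VALUES (2,'b');",
    "INSERT INTO u VALUES (9);",
]
for stmt in batch_inserts(lines):
    print(stmt)
```

Run over the generated SQL files, this turns thousands of statements into a handful; wrapping each batch in a single transaction helps further.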


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]    www.2ndQuadrant.com


Re: [PERFORM] Load experimentation

2009-12-07 Thread Greg Smith

Ben Brehmer wrote:
Since my input data is split into about 200 files (3GB each), I could 
potentially spawn one load command for each file. What would be the 
maximum number of input connections Postgres can handle without 
bogging down? 
You can expect to easily get one loader process per real CPU going.  
Beyond that, it depends on how CPU intensive they all are and what the 
resulting I/O rate out of the combination is.  You're probably going to 
run out of CPU on a loading job long before you hit any of the other 
limits in this area, and potentially you could run out of disk 
throughput on a cloud system before that.  PostgreSQL isn't going to bog 
down on a connection basis until you've reached several hundred of them; 
your loader will be lucky to hit 10 active processes before it grinds to 
a halt on some physical resources unrelated to general database scaling.


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
[email protected]  www.2ndQuadrant.com



Re: [PERFORM] Load experimentation

2009-12-07 Thread Scott Marlowe
On Tue, Dec 8, 2009 at 12:22 AM, Ben Brehmer  wrote:
> Thanks for all the responses. I have one more thought;
>
> Since my input data is split into about 200 files (3GB each), I could
> potentially spawn one load command for each file. What would be the maximum
> number of input connections Postgres can handle without bogging down? When I
> say 'input connection' I mean "psql -U postgres -d dbname -f
> one_of_many_sql_files".

This is VERY dependent on your IO capacity and number of cores.  My
experience is that unless you're running on a decent number of disks,
you'll run out of IO capacity first in most machines.  n pairs of
mirrors in a RAID-10 can handle x input threads where x has some near
linear relation to n.  Have 100 disks in a RAID-10 array?  You can
surely handle dozens of load threads with no IO wait.  Have 4 disks in
a RAID-10?  Maybe two to four load threads will max you out.  Once
you're IO bound, adding more threads and more CPUs won't help; it'll
hurt.  The only way to really know is to benchmark it, but I'd guess
that about half as many import threads as mirror pairs in a RAID-10
(or just drives if you're using RAID-0) would be a good place to start
and work from there.
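A minimal sketch of a parallel loader along these lines (file names and database parameters are assumptions for illustration, not from the thread); the runner is injectable so the scheduling logic can be exercised without a database:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def load_file(sql_file, dbname="dbname", user="postgres"):
    """Feed one SQL file to the server; psql does the real work."""
    return subprocess.run(
        ["psql", "-U", user, "-d", dbname, "-f", sql_file],
        capture_output=True, text=True).returncode

def parallel_load(sql_files, workers, runner=load_file):
    """Run up to `workers` load jobs at once, returning one result
    per input file in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, sql_files))

# Starting point per the rule of thumb above: roughly half as many
# import threads as mirror pairs in the RAID-10.
mirror_pairs = 4
workers = max(1, mirror_pairs // 2)
# parallel_load(["file_001.sql", "file_002.sql"], workers)
```

Benchmark with a few worker counts around that starting point and keep whichever saturates the disks without thrashing them.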
