Re: [PERFORM] Big question on insert performance/using COPY FROM

2005-09-01 Thread Richard Huxton

Morgan Kita wrote:

Hi,

I am currently trying to speed up the insertion of bulk loads into my
database. I have fiddled with all of the parameters that I have seen
suggested (i.e. checkpoint_segments, checkpoint_timeout,
maintenance_work_mem, and shared_buffers) with no success. I even
turned off fsync with no effect, so I am pretty sure the biggest
problem is that the DB is CPU limited at the moment because of the
rather weak machine that postmaster is running on (Athlon XP 2400+
with 512MB of RAM).
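
For reference, the kind of COPY-based bulk load the subject line refers to
might look like the sketch below; the table, file, and index names are
invented for illustration:

-- All names here are hypothetical; this only illustrates the pattern.
BEGIN;
DROP INDEX parts_name_idx;                    -- drop indexes before the load
COPY parts FROM '/tmp/parts.txt';             -- server-side, tab-delimited file
CREATE INDEX parts_name_idx ON parts (name);  -- recreate indexes afterwards
COMMIT;
ANALYZE parts;                                -- refresh planner statistics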


Don't be pretty sure, be absolutely sure. What do your various 
system-load figures show? Windows has a system performance monitoring 
tool that can show CPU/memory/disk IO, and *nix has tools such as vmstat and iostat.


--
  Richard Huxton
  Archonet Ltd



[PERFORM] Poor performance on HP Package Cluster

2005-09-01 Thread Ernst Einstein
Hi!

I've set up a Package Cluster (Fail-Over Cluster) on our two HP DL380
G4s with MSA Storage G2 (Xeon 3.4GHz, 6GB RAM, 2x [EMAIL PROTECTED] RAID1).
The system is running under SUSE Linux Enterprise Server.

My problem is that the performance is very low. On our old server
(Celeron 2GHz with 2GB of RAM) an import of our data takes about 10
minutes (1.1GB of data).
On one of the DL380s it takes more than 90 minutes...
Select response times have also increased: Celeron 3 sec, Xeon
30-40 sec.

I've been trying to fix the problem for two days now and have googled a
lot, but I don't know what to do.

Top says my CPU spends ~50% of its time in I/O wait.

top - 14:07:34 up 22 min,  3 users,  load average: 1.09, 1.04, 0.78
Tasks:  74 total,   3 running,  71 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.0% us,  5.0% sy,  0.0% ni,  0.0% id, 45.0% wa,  0.0% hi,  0.0% si
Mem:   6050356k total,   982004k used,  5068352k free,    60300k buffers
Swap:  2097136k total,0k used,  2097136k free,   786200k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9939 postgres  18   0  254m 143m 140m R 49.3  2.4  8:35.43  postgres: postgres plate [local] INSERT
 9938 postgres  16   0 13720 1440 1120 S  4.9  0.0  0:59.08  psql -d plate -f dump.sql
10738 root      15   0  3988 1120  840 R  4.9  0.0  0:00.05  top -d 0.2
    1 root      16   0   640  264  216 S  0.0  0.0  0:05.03  init [3]
    2 root      34  19     0    0    0 S  0.0  0.0  0:00.00  [ksoftirqd/0]

vmstat 1:

ClusterNode2 root $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff   cache  si  so   bi    bo    in    cs  us sy id wa
 1  0      0 5032012  60888 821008   0   0  216  6938  1952  5049  40  8 15 37
 0  1      0 5031392  60892 821632   0   0    0  8152  2126  5725  45  6  0 49
 0  1      0 5030896  60900 822144   0   0    0  8124  2052  5731  46  6  0 47
 0  1      0 5030400  60908 822768   0   0    0  8144  2124  5717  44  7  0 50
 1  0      0 5029904  60924 823272   0   0    0  8304  2062  5763  43  7  0 49

I've read (2004) that Xeons may have problems with context switching -
does the problem still exist? Can I do something to minimize the
problem?


postgresql.conf:

shared_buffers = 28672
effective_cache_size = 40   
random_page_cost = 2


shmall and shmmax are set to 268435456

hdparm:

ClusterNode2 root $ hdparm -tT /dev/cciss/c0d0p1

/dev/cciss/c0d0p1:
Timing buffer-cache reads: 3772 MB in 2.00 seconds = 1885.34 MB/sec
Timing buffered disk reads: 150 MB in 2.06 seconds = 72.72 MB/sec

greetings Ernst







[PERFORM] Need for speed 3

2005-09-01 Thread Ulrich Wisser

Hi again,

first I want to say ***THANK YOU*** to everyone who kindly shared their 
thoughts on my hardware problems. I really appreciate it. I started to 
look for a new server and I am quite sure we'll get a serious hardware 
update. As suggested by some people, I would now like to look closer at 
possible algorithmic improvements.


My application basically imports Apache log files into a Postgres 
database. Every row in the log file gets imported in one of three (raw 
data) tables. My columns are exactly as in the log file. The import is 
run approx. every five minutes. We import about two million rows a month.


Between 30 and 50 users are using the reporting at the same time.

Because reporting became so slow, I created a reporting table. In 
that table data is aggregated by dropping time (the date is preserved), ip, 
referer and user-agent. And although it breaks normalization, some data from 
a master table is copied so that no joins are needed anymore.


After every import the data from the current day is deleted from the 
reporting table and recalculated from the raw data table.
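
(As a purely illustrative sketch of that step - the table and column names
below are invented, since the real schema isn't posted:)

BEGIN;
-- report_day, raw_log and their columns are invented names
DELETE FROM report_day WHERE log_date = current_date;
INSERT INTO report_day (log_date, site, page, hits)
    SELECT log_date, site, page, count(*)
      FROM raw_log
     WHERE log_date = current_date
     GROUP BY log_date, site, page;
COMMIT;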



Is this description understandable? If so:

What do you think of this approach? Are there better ways to do it? Is 
there some literature you recommend reading?


TIA

Ulrich




Re: [PERFORM] Poor performance on HP Package Cluster

2005-09-01 Thread Ron
Your HD raw IO rate seems fine, so the problem is not likely to be 
with the HDs.


That consistent ~10x increase in how long it takes to do an import or 
a select is noteworthy.


This smells like an interconnect problem.  Was the Celeron locally 
connected to the HDs while the new Xeons are network 
connected?  Getting tens or even hundreds of MBps of throughput out of 
local storage is much easier than doing so over a network.  1GbE 
is required if you want HDs to push 72.72MBps over a network, and not 
even one 10GbE line will allow you to match local buffered IO of 
1885.34MBps.  What size are those network connects (Server A - 
storage, Server B - storage, Server A - Server B)?


Ron Peacetree









Re: [PERFORM] Need for speed 3

2005-09-01 Thread Merlin Moncure

Ulrich wrote:
 Hi again,
 
 first I want to say ***THANK YOU*** to everyone who kindly shared their
 thoughts on my hardware problems. I really appreciate it. I started to
 look for a new server and I am quite sure we'll get a serious hardware
 update. As suggested by some people, I would now like to look closer at
 possible algorithmic improvements.
 
 My application basically imports Apache log files into a Postgres
 database. Every row in the log file gets imported in one of three (raw
 data) tables. My columns are exactly as in the log file. The import is
 run approx. every five minutes. We import about two million rows a month.
 
 Between 30 and 50 users are using the reporting at the same time.
 
 Because reporting became so slow, I created a reporting table. In
 that table data is aggregated by dropping time (the date is preserved), ip,
 referer and user-agent. And although it breaks normalization, some data from
 a master table is copied so that no joins are needed anymore.
 
 After every import the data from the current day is deleted from the
 reporting table and recalculated from the raw data table.
 

schemas would be helpful.  You may be able to tweak the import table a
bit and how it moves over to the data tables.

Just a thought: have you considered having apache logs write to a
process that immediately makes insert query(s) to postgresql? 

You could write a small C program which makes advanced query interface
calls to the server.

Merlin



Re: [PERFORM] Need for speed 3

2005-09-01 Thread Ulrich Wisser

Hi Merlin,

schemas would be helpful.  


Right now I would like to know whether my approach to the problem makes 
sense, or whether I should rework the whole import-and-aggregate procedure.



Just a thought: have you considered having apache logs write to a
process that immediately makes insert query(s) to postgresql? 


Yes, we have considered that, but dismissed the idea very soon. We need 
Apache to be as responsive as possible. It's a two-server setup with 
load balancer and failover, serving about one thousand domains and 
counting. It needs to be as failsafe as possible, and under no 
circumstances can any request be lost. (The click counting is core 
business and relates directly to our income.)
That said, it seemed quite safe to let Apache write logfiles and import 
them later. That way a database downtime wouldn't be mission critical.




You could write a small C program which makes advanced query interface
calls to the server.


How would that improve performance?

Ulrich



Re: [PERFORM] Need for speed 3

2005-09-01 Thread Merlin Moncure
 Hi Merlin,
  Just a thought: have you considered having apache logs write to a
  process that immediately makes insert query(s) to postgresql?
 
 Yes, we have considered that, but dismissed the idea very soon. We need
 Apache to be as responsive as possible. It's a two-server setup with
 load balancer and failover, serving about one thousand domains and
 counting. It needs to be as failsafe as possible, and under no
 circumstances can any request be lost. (The click counting is core
 business and relates directly to our income.)
 That said, it seemed quite safe to let Apache write logfiles and import
 them later. That way a database downtime wouldn't be mission critical.

Hm.  Well, it may be possible to do this in a fast and safe way; I
understand your reservations here, but I'm going to spout off my opinion
anyway :).

If you are not doing this, the following point is moot.  But consider that
you could set a very low transaction timeout (like .25 seconds) and siphon
log entries off to a text file if your database server gets in trouble.
2 million hits a month is not very high even if your traffic is bursty
(there are approx 2.5 million seconds in a month).

With a directly linked log you always get up-to-date stats and spare
yourself the dump/load song and dance, which is always a headache :(.
Also, however you are doing your billing, it will be easier to manage
if everything is extracted from pg and not some conglomeration of log
files, *if* you can put 100% faith in your database.  When it comes to
pg now, I'm a believer.

  You could write a small C program which makes advanced query interface
  calls to the server.
 
 How would that improve performance?

The functions I'm talking about are PQexecParams and PQexecPrepared.
The parameters do not need to be escaped and the query string is not
re-parsed on every call, which is very light on server resources and very
low latency.  Using them you could probably get 5000 inserts/sec on a cheap
server if you have some type of write caching in place, with low cpu load.
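
At the SQL level, the server-side prepared statements behind those libpq
calls look roughly like this (a sketch; the statement name, table and
columns are invented):

-- Invented statement/table/column names, shown only for the shape of it.
PREPARE log_insert (timestamptz, text, text) AS
    INSERT INTO raw_log (log_time, url, user_agent) VALUES ($1, $2, $3);

EXECUTE log_insert (now(), '/index.html', 'Mozilla/5.0');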

Merlin





Re: [PERFORM] Need for speed 3

2005-09-01 Thread Luke Lonergan
Ulrich,

On 9/1/05 6:25 AM, Ulrich Wisser [EMAIL PROTECTED] wrote:

 My application basically imports Apache log files into a Postgres
 database. Every row in the log file gets imported in one of three (raw
 data) tables. My columns are exactly as in the log file. The import is
 run approx. every five minutes. We import about two million rows a month.

Bizgres Clickstream does this job using an ETL (extract transform and load)
process to transform the weblogs into an optimized schema for reporting.
 
 After every import the data from the current day is deleted from the
 reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does well.
 
 What do you think of this approach? Are there better ways to do it? Is
 there some literature you recommend reading?

I recommend the Bizgres Clickstream docs; you can get them from Bizgres CVS,
and there will shortly be a live HTML link on the website.

Bizgres is free - it also improves COPY performance by almost 2x, among
other enhancements.

- Luke 





[PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
Hi,

I'm having performance issues with a table consisting of 2,043,133 rows. The
schema is:

\d address
                     Table "public.address"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 postcode_top | character varying(2)   | not null
 postcode_middle  | character varying(4)   | not null
 postcode_bottom  | character varying(7)   | not null
 postcode | character varying(10)  | not null
 property_type| character varying(15)  | not null
 sale_type| character varying(10)  | not null
 flat_extra   | character varying(100) | not null
 number   | character varying(100) | not null
 street   | character varying(100) | not null
 locality_1   | character varying(100) | not null
 locality_2   | character varying(100) | not null
 city | character varying(100) | not null
 county   | character varying(100) | not null
Indexes:
address_city_index btree (city)
address_county_index btree (county)
address_locality_1_index btree (locality_1)
address_locality_2_index btree (locality_2)
address_pc_bottom_index btree (postcode_bottom)
address_pc_middle_index btree (postcode_middle)
address_pc_top_index btree (postcode_top)
address_pc_top_middle_bottom_index btree (postcode_top,
 postcode_middle, postcode_bottom)
address_pc_top_middle_index btree (postcode_top, postcode_middle)
address_postcode_index btree (postcode)
address_property_type_index btree (property_type)
address_street_index btree (street)
street_prefix btree (lower(substring((street)::text, 1, 1)))

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
SATA harddrive.

Queries such as:

select locality_2 from address where locality_2 = 'Manchester';

are taking 14 seconds to complete, and this is only 2 years worth of
data - we will have up to 15 years (so over 15 million rows).

Interestingly, doing:
explain select locality_2 from address where locality_2 = 'Manchester';
gives
   QUERY PLAN   

 Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
   Filter: ((locality_2)::text = 'Manchester'::text)

but:
explain select locality_1 from address where locality_1 = 'Manchester';
gives
   QUERY PLAN   
 

 Index Scan using address_locality_1_index on address
(cost=0.00..69882.18 rows=17708 width=13)
   Index Cond: ((locality_1)::text = 'Manchester'::text)

Sadly, using the index makes things worse, the query taking 17 seconds.

locality_1 has 16650 distinct values and locality_2 has 1156 distinct
values.

Whilst the locality_2 query is in progress, both the disk and the CPU
are maxed out with the disk constantly reading at 60MB/s and the CPU
rarely dropping under 100% load.

With the locality_1 query in progress, the CPU is maxed out but the disk
is reading at just 3MB/s.

Obviously, to me, this is a problem, I need these queries to be under a
second to complete. Is this unreasonable? What can I do to make this go
faster? I've considered normalising the table but I can't work out
whether the slowness is in dereferencing the pointers from the index
into the table or in scanning the index in the first place. And
normalising the table is going to cause much pain when inserting values
and I'm not entirely sure if I see why normalising it should cause a
massive performance improvement.

I need to get to the stage where I can run queries such as:
select street, locality_1, locality_2, city from address 
where (city = 'Nottingham' or locality_2 = 'Nottingham'
   or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A' 
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

and have the results very quickly.
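
For what it's worth, wrapping that query in EXPLAIN ANALYZE (which runs it
and reports per-node row counts and timings) is the usual way to see where
the time actually goes:

EXPLAIN ANALYZE
select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
   or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0;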

Any help most gratefully received (even if it's to say that I should be
posting to a different mailing list!).

Many thanks,

Matthew




Re: [PERFORM] Massive performance issues

2005-09-01 Thread Merlin Moncure
 I'm having performance issues with a table consisting of 2,043,133 rows.
 The schema is:

 locality_1 has 16650 distinct values and locality_2 has 1156 distinct
 values.

Just so you know: I have a 2GHz P4 workstation with a table of similar size
(2M rows) and several keys, and I can find and fetch 2k rows based on a
20k-distinct-value key in about 60 ms (.06 seconds).

Merlin



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Merlin Moncure
   Table public.address
 Column|  Type  | Modifiers
 --++---
  postcode_top | character varying(2)   | not null
  postcode_middle  | character varying(4)   | not null
  postcode_bottom  | character varying(7)   | not null


consider making above fields char(x) not varchar(x) for small but
important savings.

  postcode | character varying(10)  | not null
  property_type| character varying(15)  | not null
  sale_type| character varying(10)  | not null
  flat_extra   | character varying(100) | not null
  number   | character varying(100) | not null
  street   | character varying(100) | not null
  locality_1   | character varying(100) | not null
  locality_2   | character varying(100) | not null
  city | character varying(100) | not null
  county   | character varying(100) | not null
 Indexes:
 address_city_index btree (city)
 address_county_index btree (county)
 address_locality_1_index btree (locality_1)
 address_locality_2_index btree (locality_2)
 address_pc_bottom_index btree (postcode_bottom)
 address_pc_middle_index btree (postcode_middle)
 address_pc_top_index btree (postcode_top)
 address_pc_top_middle_bottom_index btree (postcode_top,
  postcode_middle, postcode_bottom)
 address_pc_top_middle_index btree (postcode_top,
postcode_middle)
 address_postcode_index btree (postcode)
 address_property_type_index btree (property_type)
 address_street_index btree (street)
 street_prefix btree (lower(substring((street)::text, 1, 1)))
 
 Obviously, to me, this is a problem, I need these queries to be under a
 second to complete. Is this unreasonable? What can I do to make this go
 faster? I've considered normalising the table but I can't work out
 whether the slowness is in dereferencing the pointers from the index
 into the table or in scanning the index in the first place. And
 normalising the table is going to cause much pain when inserting values
 and I'm not entirely sure if I see why normalising it should cause a
 massive performance improvement.

http://www.dbdebunk.com :)

 I need to get to the stage where I can run queries such as:
 select street, locality_1, locality_2, city from address
 where (city = 'Nottingham' or locality_2 = 'Nottingham'
or locality_1 = 'Nottingham')
   and upper(substring(street from 1 for 1)) = 'A'
 group by street, locality_1, locality_2, city
 order by street
 limit 20 offset 0
 
 and have the results very quickly.
 
 Any help most gratefully received (even if it's to say that I should be
 posting to a different mailing list!).

This is the correct list.  Did you run vacuum/analyze, etc.?
Please post vacuum analyze times.
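
(For example, assuming the address table from the original post, the
timings can be captured directly in psql:)

\timing
VACUUM VERBOSE ANALYZE address;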

Merlin



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 02:47:06PM -0400, Tom Lane wrote:
 Matthew Sackman [EMAIL PROTECTED] writes:
  Obviously, to me, this is a problem, I need these queries to be under a
  second to complete. Is this unreasonable?
 
 Yes.  Pulling twenty thousand rows at random from a table isn't free.

I appreciate that. But I'm surprised by how un-free it seems to be.
And it seems others here have performance I need on similar hardware.

 You were pretty vague about your disk hardware, which makes me think
 you didn't spend a lot of money on it ... and on low-ball hardware,
 that sort of random access speed just isn't gonna happen.

Well, this is a development box. But the live box wouldn't be much more
than RAID 1 on SCSI 10ks so that should only be a halving of seek time,
not the 1000 times reduction I'm after!

In fact, now I think about it, I have been testing on a 2.4 kernel on a
dual HT 3GHz Xeon with SCSI RAID array and the performance is only
marginally better.

 If the queries you need are very consistent, you might be able to get
 some mileage out of CLUSTERing by the relevant index ... but the number
 of indexes you've created makes me think that's not so ...

No, the queries, whilst in just three distinct forms, will effectively
be for fairly random values.

Matthew



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote:
  Any help most gratefully received (even if it's to say that I should be
  posting to a different mailing list!).
 
 This is the correct list.  Did you run vacuum/analyze, etc.?
 Please post vacuum analyze times.

2005-09-01 19:47:08 LOG:  statement: vacuum full analyze address;
2005-09-01 19:48:44 LOG:  duration: 96182.777 ms

2005-09-01 19:50:20 LOG:  statement: vacuum analyze address;
2005-09-01 19:51:48 LOG:  duration: 87675.268 ms

I run them regularly, pretty much after every bulk import.

Matthew



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Tom Lane
Matthew Sackman [EMAIL PROTECTED] writes:
 Obviously, to me, this is a problem, I need these queries to be under a
 second to complete. Is this unreasonable?

Yes.  Pulling twenty thousand rows at random from a table isn't free.
You were pretty vague about your disk hardware, which makes me think
you didn't spend a lot of money on it ... and on low-ball hardware,
that sort of random access speed just isn't gonna happen.

If the queries you need are very consistent, you might be able to get
some mileage out of CLUSTERing by the relevant index ... but the number
of indexes you've created makes me think that's not so ...
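
(For illustration only, using an index name from the schema posted earlier
and the 7.4/8.0 syntax; CLUSTER physically reorders the table and should be
followed by an ANALYZE:)

CLUSTER address_locality_2_index ON address;
ANALYZE address;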

regards, tom lane



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Joel Fradkin
Any chance it's a vacuum thing?
Or configuration (out of the box it needs adjusting)?

Joel Fradkin
 
 





Re: [PERFORM] Massive performance issues

2005-09-01 Thread Sebastian Hennebrueder

Just an idea: if you do not want to adapt your application to use a
normalized database, you could push the data into normalized tables using
triggers.
Example:
Add a table city with columns id and name,
and add a column city_id to your main table.
In this case you have redundant data in your main table (locality_1 and
city_id), but you could query the city table when searching for
'Man%'.
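
A rough, untested sketch of that idea, assuming plpgsql is installed - all
object names below are invented, and the trigger only shows the shape of
the approach:

CREATE TABLE city (
    id   serial PRIMARY KEY,
    name varchar(100) NOT NULL UNIQUE
);

ALTER TABLE address ADD COLUMN city_id integer;

CREATE OR REPLACE FUNCTION address_set_city_id() RETURNS trigger AS '
BEGIN
    -- look up (or create) the normalized city row and store its id
    SELECT id INTO NEW.city_id FROM city WHERE name = NEW.city;
    IF NOT FOUND THEN
        INSERT INTO city (name) VALUES (NEW.city);
        SELECT id INTO NEW.city_id FROM city WHERE name = NEW.city;
    END IF;
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER address_set_city_id_trg
    BEFORE INSERT OR UPDATE ON address
    FOR EACH ROW EXECUTE PROCEDURE address_set_city_id();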

-- 
Best Regards / Viele Grüße

Sebastian Hennebrueder



http://www.laliluna.de

Tutorials for JSP, JavaServer Faces, Struts, Hibernate and EJB 

Get support, education and consulting for these technologies.




Re: [PERFORM] Massive performance issues

2005-09-01 Thread Alvaro Herrera
On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote:
Table public.address
  Column|  Type  | Modifiers
  --++---
   postcode_top | character varying(2)   | not null
   postcode_middle  | character varying(4)   | not null
   postcode_bottom  | character varying(7)   | not null
 
 consider making above fields char(x) not varchar(x) for small but
 important savings.

Huh, hang on -- AFAIK there's no saving at all by doing that.  Quite the
opposite really, because with char(x) you store the padding blanks,
which are omitted with varchar(x), so varchar means less I/O (not
necessarily a measurable amount, mind you, maybe even zero because of
padding issues).

-- 
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
You liked Linux a lot when he was just the gawky kid from down the block
mowing your lawn or shoveling the snow. But now that he wants to date
your daughter, you're not so sure he measures up. (Larry Greenemeier)



Re: [PERFORM] Poor performance on HP Package Cluster

2005-09-01 Thread Luke Lonergan
Are you using the built-in HP SmartArray RAID/SCSI controllers?  If so, that
could be your problem; they are known to have terrible and variable
performance with Linux.

The only good fix is to add a simple SCSI controller to your system (HP
sells them) and stay away from hardware RAID.

- Luke  







Re: [PERFORM] Massive performance issues

2005-09-01 Thread Arjen van der Meijden

On 1-9-2005 19:42, Matthew Sackman wrote:

Obviously, to me, this is a problem, I need these queries to be under a
second to complete. Is this unreasonable? What can I do to make this go
faster? I've considered normalising the table but I can't work out
whether the slowness is in dereferencing the pointers from the index
into the table or in scanning the index in the first place. And
normalising the table is going to cause much pain when inserting values
and I'm not entirely sure if I see why normalising it should cause a
massive performance improvement.


In this case, I think normalising will give a major decrease in on-disk 
table-size of this large table and the indexes you have. If that's the 
case, that in itself will speed-up all i/o-bound queries quite a bit.


locality_1, _2, city and county can probably be normalised away without 
much problem, but going from varchars to integers will probably save 
you quite a bit of (disk) space.


But it won't change the selectivity of the indexes, so you won't get 
more index scans instead of sequential scans, I suppose.
I think it's not that hard to create a normalized set of tables from this 
data set (using insert into tablename select distinct ... from address 
and such, then insert into address_new (..., city) select ..., (select cityid 
from cities where city = address.city) from address).
So it's at least relatively easy to figure out the performance 
improvement from normalizing the dataset a bit.
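
Spelled out as a sketch (the cities and address_new tables are invented,
and only some of the address columns are carried over here):

CREATE TABLE cities (
    cityid serial PRIMARY KEY,
    city   varchar(100) NOT NULL UNIQUE
);

INSERT INTO cities (city) SELECT DISTINCT city FROM address;

CREATE TABLE address_new AS
    SELECT a.postcode, a.property_type, a.sale_type, a.street,
           a.county, c.cityid
      FROM address a JOIN cities c ON c.city = a.city;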


If you want to improve your hardware, have a look at the Western Digital 
Raptor-series SATA disks; they are fast, SCSI-like SATA drives. You might 
also look at the amount of memory available, to allow caching of 
this (entire) table.


Best regards,

Arjen



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Tom Lane
Ron [EMAIL PROTECTED] writes:
 ...  Your target is to have each row take <= 512B.

Ron, are you assuming that the varchar fields are blank-padded or
something?  I think it's highly unlikely that he's got more than a
couple hundred bytes per row right now --- at least if the data is
what it sounds like.

The upthread comment about strcoll() set off some alarm bells in my head.
If the database wasn't initdb'd in C locale already, try making it so.
Also, use a single-byte encoding if you can (LatinX is fine, Unicode not).
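
(A quick way to check the current settings - just a sketch; actually
changing the locale means re-initdb'ing the cluster and reloading from a
dump:)

SHOW lc_collate;    -- ideally 'C', to avoid strcoll() overhead on comparisons
\l                  -- psql: lists databases with their encodings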

 Upgrade pg to 8.0.3 and make sure you have enough RAM for your real 
 day to day load.

Newer PG definitely better.  Some attention to the configuration
parameters might also be called for.  I fear though that these things
are probably just chipping at the margins ...

regards, tom lane



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Ron

This should be able to run _very_ fast.

At 01:42 PM 9/1/2005, Matthew Sackman wrote:

Hi,

I'm having performance issues with a table consisting of 2,043,133 rows. The
schema is:

\d address
  Table public.address
Column|  Type  | Modifiers
--++---
 postcode_top | character varying(2)   | not null
 postcode_middle  | character varying(4)   | not null
 postcode_bottom  | character varying(7)   | not null
 postcode | character varying(10)  | not null
 property_type| character varying(15)  | not null
 sale_type| character varying(10)  | not null
 flat_extra   | character varying(100) | not null
 number   | character varying(100) | not null
 street   | character varying(100) | not null
 locality_1   | character varying(100) | not null
 locality_2   | character varying(100) | not null
 city | character varying(100) | not null
 county   | character varying(100) | not null
Indexes:
address_city_index btree (city)
address_county_index btree (county)
address_locality_1_index btree (locality_1)
address_locality_2_index btree (locality_2)
address_pc_bottom_index btree (postcode_bottom)
address_pc_middle_index btree (postcode_middle)
address_pc_top_index btree (postcode_top)
address_pc_top_middle_bottom_index btree (postcode_top,
 postcode_middle, postcode_bottom)
address_pc_top_middle_index btree (postcode_top, postcode_middle)
address_postcode_index btree (postcode)
address_property_type_index btree (property_type)
address_street_index btree (street)
street_prefix btree (lower(substring((street)::text, 1, 1)))


IOW, each row takes ~1KB on HD.  First suggestion: format your HD to 
use 8KB pages with 1KB segments.  That'll put each row down on HD as 
an atomic unit.  8KB pages also play nice with pg.


At 1KB per row, this table takes up ~2.1GB and should fit into RAM 
fairly easily on a decently configured DB server (my _laptop_ has 2GB 
of RAM after all...)


Since you are using ~2.1GB for 2 years worth of data, 15 years worth 
should take no more than 2.1GB*7.5= 15.75GB.


If you replace some of those 100 char fields with integers for code 
numbers and have an auxiliary table for each of those fields mapping 
the code numbers to the associated 100 char string, you should be 
able to shrink a row considerably.  Your target is to have each row 
take <= 512B.  Once a row fits into one 512B sector on HD, there's 
no point in making it smaller unless you can shrink it enough to fit 
2 rows into one sector (<= 256B).  Once two rows fit into one sector, 
there's no point shrinking a row unless you can make 3 rows fit into 
a sector.  Etc.


Assuming each 100 char (eg 100B) field can be replaced with a 4B int, 
each row could be as small as 76B.  That makes 85B per row the goal 
as it would allow you to fit 6 rows per 512B HD sector.  So in the 
best case your table will be 12x smaller in terms of real HD space.


Fitting one (or more) row(s) into one sector will cut down the real 
space used on HD for the table to ~7.88GB (or 1.32GB in the best 
case).  Any such streamlining will make it faster to load, make the 
working set that needs to be RAM for best performance smaller, etc, etc.



This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 
and a SATA harddrive.


Upgrade pg to 8.0.3 and make sure you have enough RAM for your real 
day to day load.  Frankly, RAM is so cheap ($75-$150/GB), I'd just 
upgrade the machine to 4GB as a matter of course.  P4's have PAE, so 
if your mainboard can hold it, put more than 4GB of RAM in if you 
find you need it.


Since you are describing your workload as being predominantly reads, 
you can get away with far less HD capability as long as you crank up 
RAM high enough to hold the working set of the DB.  The indications 
from the OP are that you may very well be able to hold the entire DB 
in RAM.  That's a big win whenever you can achieve it.


After these steps, there may still be performance issues that need 
attention, but the DBMS should be _much_ faster.


Ron Peacetree





Re: [PERFORM] Massive performance issues

2005-09-01 Thread Ron

At 04:25 PM 9/1/2005, Tom Lane wrote:

Ron [EMAIL PROTECTED] writes:
 ...  Your target is to have each row take <= 512B.

Ron, are you assuming that the varchar fields are blank-padded or 
something?  I think it's highly unlikely that he's got more than a 
couple hundred bytes per row right now --- at least if the data is 
what it sounds like.


As it stands, each row will take 55B - 748B and each field is 
variable in size up to the maximums given in the OP's schema.  Since 
pg uses an underlying OS FS, and not a native one, there will be 
extra FS overhead no matter what we do, particularly to accommodate 
such flexibility...  The goal is to minimize overhead and maximize 
regularity in layout.  The recipe I know for HD IO speed is in 
keeping the data small, regular, and as simple as possible.


Even better, if the table(s) can be made RAM resident, then searches, 
even random ones, can be very fast.  He wants a 1000x performance 
improvement.  Going from disk resident to RAM resident should help 
greatly in attaining that goal.


In addition, by replacing as many variable sized text strings as 
possible with ints, the actual compare functions he used as examples 
should run faster as well.



The upthread comment about strcoll() set off some alarm bells in my 
head.  If the database wasn't initdb'd in C locale already, try 
making it so.  Also, use a single-byte encoding if you can (LatinX 
is fine, Unicode not).


Good thoughts I hadn't had.



 Upgrade pg to 8.0.3 and make sure you have enough RAM for your real
 day to day load.

Newer PG definitely better.  Some attention to the configuration 
parameters might also be called for.  I fear though that these 
things are probably just chipping at the margins ...


I don't expect 8.0.3 to be a major performance improvement.  I do 
expect it to be a major _maintenance_ improvement for both him and 
those of us trying to help him ;-)


The performance difference between not having the working set of the 
DB fit into RAM during ordinary operation vs having it be so (or 
better, having the whole DB fit into RAM during ordinary operation) 
has been considerably more effective than chipping at the margins 
IME.  Especially so if the HD IO subsystem is wimpy.


Ron Peacetree





Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote:
  address_city_index btree (city)
  address_county_index btree (county)
  address_locality_1_index btree (locality_1)
  address_locality_2_index btree (locality_2)
  address_pc_bottom_index btree (postcode_bottom)
  address_pc_middle_index btree (postcode_middle)
  address_pc_top_index btree (postcode_top)
  address_pc_top_middle_bottom_index btree (postcode_top,
   postcode_middle, postcode_bottom)
  address_pc_top_middle_index btree (postcode_top, postcode_middle)
  address_postcode_index btree (postcode)
  address_property_type_index btree (property_type)
  address_street_index btree (street)
  street_prefix btree (lower(substring((street)::text, 1, 1)))
 
 Wow, that's quite a lof of indexes... but your problem isn't reported as
 being in insert/update/delete.

Hah, well now that you mention it. Basically, 100,000 rows come in in a
bulk import every month and the only way I can get it to complete in any
sane time frame at all is to drop the indexes, do the import and then
recreate the indexes. But that's something that I'm OK with - the
imports don't have to be that fast and whilst important, it's not *the*
critical path. Selection from the database is, hence the indexes.

  This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
  SATA harddrive.
 
 8.0 or 8.1 might help you some -- better (and more!) disks will probably help
 a _lot_.

Ok, I did try 8.0 when I started this and found that the server bind
parameters (both via DBD::Pg (with pg_prepare_server = 1) and via JDBC
(various versions I tried)) failed - the parameters were clearly not
being substituted. This was Postgresql 8.0 from Debian unstable. That
was a couple of weeks ago and I've not been back to check whether it's
been fixed. Anyway, because of these problems I dropped back to 7.4.

  Queries such as:
  
  select locality_2 from address where locality_2 = 'Manchester';
  
  are taking 14 seconds to complete, and this is only 2 years worth of
  data - we will have up to 15 years (so over 15 million rows).
 
 As Tom pointed out; you're effectively doing random searches here, and using
 CLUSTER might help. Normalizing your data to get smaller rows (and avoid
 possibly costly string comparisons if your strcoll() is slow) will probably
 also help.

Ok, so you're saying that joining the address table into an address_city
table (the obvious normalization) will help here?

The locale settings in postgresql.conf all have en_GB and a \l shows
encoding of LATIN1. So I don't think I've set anything to UTF8 or such
like.

  I need to get to the stage where I can run queries such as:
  select street, locality_1, locality_2, city from address 
  where (city = 'Nottingham' or locality_2 = 'Nottingham'
 or locality_1 = 'Nottingham')
and upper(substring(street from 1 for 1)) = 'A' 
  group by street, locality_1, locality_2, city
  order by street
  limit 20 offset 0
 
 This might be a lot quicker than pulling all the records like in your example
 queries...

Yes, that certainly does seem to be the case - around 4 seconds. But I
need it to be 10 times faster (or thereabouts) otherwise I have big
problems!

Many thanks for all the advice so far.

Matthew




Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 10:54:45PM +0200, Arjen van der Meijden wrote:
 On 1-9-2005 19:42, Matthew Sackman wrote:
 Obviously, to me, this is a problem, I need these queries to be under a
 second to complete. Is this unreasonable? What can I do to make this go
 faster? I've considered normalising the table but I can't work out
 whether the slowness is in dereferencing the pointers from the index
 into the table or in scanning the index in the first place. And
 normalising the table is going to cause much pain when inserting values
 and I'm not entirely sure if I see why normalising it should cause a
 massive performance improvement.
 
 In this case, I think normalising will give a major decrease in on-disk 
 table-size of this large table and the indexes you have. If that's the 
 case, that in itself will speed-up all i/o-bound queries quite a bit.

Well that's the thing - on the queries where it decides to use the index
it only reads at around 3MB/s and the CPU is maxed out, whereas when it
doesn't use the index, the disk is being read at 60MB/s. So when it
decides to use an index, I don't seem to be IO bound at all. Or at least
that's the way it seems to me.

 locality_1, _2, city and county can probably be normalised away without 
 much problem, but going from varchars to integers will probably save 
 you quite a bit of (disk) space.

Sure, that's what I've been considering today.

 But since it won't change the selectivity of indexes, so you won't get 
 more index-scans instead of sequential scans, I suppose.
 I think its not that hard to create a normalized set of tables from this 
  data-set (using insert into tablename select distinct ... from address 
 and such, insert into address_new (..., city) select ... (select cityid 
 from cities where city = address.city) from address)
 So its at least relatively easy to figure out the performance 
 improvement from normalizing the dataset a bit.

Yeah, the initial creation isn't too painful but when adding rows into
the address table it gets more painful. However, as I've said elsewhere,
the import isn't the critical path so I can cope with that pain,
possibly coding around it in a stored procedure and triggers as
suggested.

 If you want to improve your hardware, have a look at the Western Digital 
 Raptor-series SATA disks, they are fast scsi-like SATA drives. You may 
 also have a look at the amount of memory available, to allow caching 
 this (entire) table.

Well I've got 1GB of RAM, but from analysis of its use, a fair amount
isn't being used. About 50% is actually in use by applications and about
half of the rest is cache and the rest isn't being used. Has this to do
with the max_fsm_pages and max_fsm_relations settings? I've pretty much
not touched the configuration and it's the standard Debian package.

Matthew



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Jeff Frost

Well I've got 1GB of RAM, but from analysis of its use, a fair amount
isn't being used. About 50% is actually in use by applications and about
half of the rest is cache and the rest isn't being used. Has this to do
with the max_fsm_pages and max_fsm_relations settings? I've pretty much
not touched the configuration and it's the standard Debian package.


Matt, have a look at the annotated postgresql.conf for 7.x here:

http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

If you have the default settings, you're likely hampering yourself quite a 
bit.  You probably care about shared_buffers, sort_mem, 
vacuum_mem, max_fsm_pages and effective_cache_size.


Also, you may want to read the  PostgreSQL 8.0 Performance Checklist.  Even 
though it's for 8.0, it'll give you good ideas on what to change in 7.4.  You 
can find it here: http://www.powerpostgresql.com/PerfList/


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Steinar H. Gunderson
On Thu, Sep 01, 2005 at 10:13:59PM +0100, Matthew Sackman wrote:
 Well that's the thing - on the queries where it decides to use the index
 it only reads at around 3MB/s and the CPU is maxed out, whereas when it
 doesn't use the index, the disk is being read at 60MB/s. So when it
 decides to use an index, I don't seem to be IO bound at all. Or at least
 that's the way it seems to me.

You are I/O bound; your disk is doing lots and lots of seeks. The SATA
interface is not the bottleneck; the disk's ability to rotate and move its
heads is.

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 02:26:47PM -0700, Jeff Frost wrote:
 Well I've got 1GB of RAM, but from analysis of its use, a fair amount
 isn't being used. About 50% is actually in use by applications and about
 half of the rest is cache and the rest isn't being used. Has this to do
 with the max_fsm_pages and max_fsm_relations settings? I've pretty much
 not touched the configuration and it's the standard Debian package.
 
 Matt, have a look at the annotated postgresql.conf for 7.x here:
 
 http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html
 
 If you have the default settings, you're likely hampering yourself quite a 
 bit.  You probably care about shared_buffers, sort_mem, 
 vacuum_mem, max_fsm_pages, effective_cache_size

That's a useful resource, thanks for the pointer. I'll work through that
tomorrow.

 Also, you may want to read the  PostgreSQL 8.0 Performance Checklist.  Even 
 though it's for 8.0, it'll give you good ideas on what to change in 7.4.  
 You can find it here: http://www.powerpostgresql.com/PerfList/

Thanks, another good resource. I'll work through that too.

Matthew



Re: [PERFORM] Massive performance issues

2005-09-01 Thread Ron

At 05:06 PM 9/1/2005, Matthew Sackman wrote:

On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote:
  address_city_index btree (city)
  address_county_index btree (county)
  address_locality_1_index btree (locality_1)
  address_locality_2_index btree (locality_2)
  address_pc_bottom_index btree (postcode_bottom)
  address_pc_middle_index btree (postcode_middle)
  address_pc_top_index btree (postcode_top)
  address_pc_top_middle_bottom_index btree (postcode_top,
   postcode_middle, postcode_bottom)
  address_pc_top_middle_index btree (postcode_top, postcode_middle)
  address_postcode_index btree (postcode)
  address_property_type_index btree (property_type)
  address_street_index btree (street)
  street_prefix btree (lower(substring((street)::text, 1, 1)))

 Wow, that's quite a lof of indexes... but your problem isn't reported as
 being in insert/update/delete.

Hah, well now that you mention it. Basically, 100,000 rows come in in a
bulk import every month and the only way I can get it to complete in any
sane time frame at all is to drop the indexes, do the import and then
recreate the indexes. But that's something that I'm OK with -


FTR, this "drop the indexes, do foo, recreate the indexes" approach is 
Industry Standard Practice for bulk inserts/updates/deletes, regardless 
of the DB product used.



 - the imports don't have to be that fast and whilst important, it's not
 *the* critical path.  Selection from the database is, hence the indexes.


A DB _without_ indexes that fits into RAM during ordinary operation 
may actually be faster than a DB _with_ indexes that does 
not.  Fitting the entire DB into RAM during ordinary operation if at 
all possible should be the first priority with a small data mine-like 
application such as you've described.


Also normalization is _not_ always a good thing for data mining like 
apps.  Having most or everything you need in one place in a compact 
and regular format is usually more effective for data mines than Nth 
Order Normal Form optimization to the degree usually found in 
textbooks using OLTP-like examples.


Indexes are a complication used as a performance enhancing technique 
because without them the DB is not performing well enough.  IME, it's 
usually better to get as much performance as one can from other 
aspects of design and _then_ start adding complications.  Including 
indexes.  Even if you fit the whole DB in RAM, you are very likely to 
need some indexes; but profile your performance first and then add 
indexes as needed rather than just adding them willy nilly early in 
the design process.


You said you had 1GB of RAM on the machine now.  That clearly is 
inadequate to your desired performance given what you said about the 
DB.  Crank that box to 4GB and tighten up your data structures.  Then 
see where you are.




  This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
  SATA harddrive.

 8.0 or 8.1 might help you some -- better (and more!) disks will 
probably help

 a _lot_.

Ok, I did try 8.0 when I started this and found that the server bind
parameters (both via DBD::Pg (with pg_prepare_server = 1) and via JDBC
(various versions I tried)) failed - the parameters were clearly not
being substituted. This was Postgresql 8.0 from Debian unstable. That
was a couple of weeks ago and I've not been back to check whether it's
been fixed. Anyway, because of these problems I dropped back to 7.4.


Since I assume you are not going to run anything with the string 
unstable in its name in production (?!), why not try a decent 
production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a 
OS more representative of what you are likely (or at least what is 
safe...) to run in production?




  Queries such as:
 
  select locality_2 from address where locality_2 = 'Manchester';
 
  are taking 14 seconds to complete, and this is only 2 years worth of
  data - we will have up to 15 years (so over 15 million rows).

 As Tom pointed out; you're effectively doing random searches here, and
 using CLUSTER might help. Normalizing your data to get smaller rows (and
 avoid possibly costly string comparisons if your strcoll() is slow) will
 probably also help.

Ok, so you're saying that joining the address table into an address_city
table (the obvious normalization) will help here?
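
i.e. something along these lines (just a sketch -- the city_id column and 
index names are invented):

  CREATE TABLE address_city (
      id   serial PRIMARY KEY,
      name character varying(100) NOT NULL UNIQUE
  );
  ALTER TABLE address ADD COLUMN city_id integer;
  -- populate address_city and address.city_id from address.city, then:
  -- ALTER TABLE address ADD FOREIGN KEY (city_id) REFERENCES address_city (id);
  -- CREATE INDEX address_city_id_index ON address (city_id);
  -- and finally drop the old varchar column and its index.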

The locale settings in postgresql.conf all have en_GB and a \l shows
encoding of LATIN1. So I don't think I've set anything to UTF8 or such
like.

  I need to get to the stage where I can run queries such as:
  select street, locality_1, locality_2, city from address
  where (city = 'Nottingham' or locality_2 = 'Nottingham'
 or locality_1 = 'Nottingham')
and upper(substring(street from 1 for 1)) = 'A'
  group by street, locality_1, locality_2, city
  order by street
  limit 20 offset 0

 This might be a lot quicker than pulling all the records like in your example queries...

 

Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 11:52:45PM +0200, Steinar H. Gunderson wrote:
 On Thu, Sep 01, 2005 at 10:13:59PM +0100, Matthew Sackman wrote:
  Well that's the thing - on the queries where it decides to use the index
  it only reads at around 3MB/s and the CPU is maxed out, whereas when it
  doesn't use the index, the disk is being read at 60MB/s. So when it
  decides to use an index, I don't seem to be IO bound at all. Or at least
  that's the way it seems to me.
 
 You are I/O bound; your disk is doing lots and lots of seeks. The SATA
 interface is not the bottleneck; the disk's ability to rotate and move its
 heads is.

Ahh of course (/me hits head against wall). Because I've /seen/ it read
at 60MB/s I was assuming that if it wasn't reading that fast then I'm
not IO bound but of course, it's not reading sequentially. That all
makes sense. Been a long day etc... ;-)

Matthew

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Massive performance issues

2005-09-01 Thread Matthew Sackman
On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote:
  Selection from the database is, hence the indexes.
 
 A DB _without_ indexes that fits into RAM during ordinary operation 
 may actually be faster than a DB _with_ indexes that does 
 not.  Fitting the entire DB into RAM during ordinary operation if at 
 all possible should be the first priority with a small data mine-like 
 application such as you've described.

That makes sense.

 Also normalization is _not_ always a good thing for data mining like 
 apps.  Having most or everything you need in one place in a compact 
 and regular format is usually more effective for data mines than Nth 
 Order Normal Form optimization to the degree usually found in 
 textbooks using OLTP-like examples.

Sure.

 Ok, I did try 8.0 when I started this and found that the server bind
 parameters (both via DBD::Pg (with pg_prepare_server = 1) and via JDBC
 (various versions I tried)) failed - the parameters were clearly not
 being substituted. This was Postgresql 8.0 from Debian unstable. That
 was a couple of weeks ago and I've not been back to check whether it's
 been fixed. Anyway, because of these problems I dropped back to 7.4.
 
 Since I assume you are not going to run anything with the string 
 unstable in its name in production (?!), why not try a decent 
 production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a 
 OS more representative of what you are likely (or at least what is 
 safe...) to run in production?

Well, you see, as ever, it's a bit complicated. The company I'm doing
the development for has been subcontracted to do it and the contractor was
contracted by the actual client. So there are two companies involved
in addition to the client. Sadly, the client actually has dictated
things like "it will be deployed on FreeBSD and thou shall not argue".
At this point in time, I actually have very little information about the
specification of the boxen that'll be running this application. This is
something I'm hoping to solve very soon. The worst part of it is that
I'm not going to have direct (ssh) access to the box and all configuration
changes will most likely have to be relayed through techies at the
client so fine tuning this is going to be a veritable nightmare.

   I need to get to the stage where I can run queries such as:
   select street, locality_1, locality_2, city from address
   where (city = 'Nottingham' or locality_2 = 'Nottingham'
  or locality_1 = 'Nottingham')
 and upper(substring(street from 1 for 1)) = 'A'
   group by street, locality_1, locality_2, city
   order by street
   limit 20 offset 0
 
  This might be a lot quicker than pulling all the records like in your
  example queries...
 
 Yes, that certainly does seem to be the case - around 4 seconds. But I
 need it to be 10 times faster (or thereabouts) otherwise I have big
 problems!
 
 *beats drum* Get it in RAM, Get it in RAM, ...

Ok, but I currently have 2 million rows. When this launches in a couple
of weeks, it'll launch with 5 million+ and then gain  a million a year.
I think the upshot of all this is that 4GB RAM as a minimum, plus judicious
use of normalization to avoid more expensive string comparisons and reduce
table size, is my immediate plan (along with proper configuration of pg).

Matthew

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Massive performance issues

2005-09-01 Thread Michael Fuhr
On Thu, Sep 01, 2005 at 06:42:31PM +0100, Matthew Sackman wrote:

 address_pc_top_index btree (postcode_top)
 address_pc_top_middle_bottom_index btree (postcode_top,
  postcode_middle, postcode_bottom)
 address_pc_top_middle_index btree (postcode_top, postcode_middle)

This doesn't address the query performance problem, but isn't only
one of these indexes necessary?  The second one, on all three
columns, because searches involving only postcode_top or only
postcode_top and postcode_middle could use it, making the indexes
on only those columns superfluous.  Or am I missing something?
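
In other words (sketch only -- check which indexes the planner actually 
uses before dropping anything):

  -- a btree on (postcode_top, postcode_middle, postcode_bottom) can be
  -- used for searches on postcode_top alone, or on the first two columns,
  -- so these two look redundant:
  DROP INDEX address_pc_top_index;
  DROP INDEX address_pc_top_middle_index;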

-- 
Michael Fuhr

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Poor performance on HP Package Cluster

2005-09-01 Thread Dan Harris
Do you have any sources for that information?  I am running dual  
SmartArray 6402's in my DL585 and haven't noticed anything poor about  
their performance.


On Sep 1, 2005, at 2:24 PM, Luke Lonergan wrote:

Are you using the built-in HP SmartArray RAID/SCSI controllers?  If so, 
that could be your problem; they are known to have terrible and variable
performance with Linux.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Massive performance issues

2005-09-01 Thread Ron

At 06:22 PM 9/1/2005, Matthew Sackman wrote:

On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote:

 Since I assume you are not going to run anything with the string
 unstable in its name in production (?!), why not try a decent
 production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a
 OS more representative of what you are likely (or at least what is
 safe...) to run in production?

Well, you see, as ever, it's a bit complicated. The company I'm doing
the development for has been subcontracted to do it and the contractor was
contracted by the actual client. So there are two companies involved
in addition to the client. Sadly, the client actually has dictated
things like "it will be deployed on FreeBSD and thou shall not argue".


At least get them to promise they will use a release the BSD folks 
mark stable!




At this point in time, I actually have very little information about the
specification of the boxen that'll be running this application. This is
something I'm hoping to solve very soon. The worst part of it is that
I'm not going to have direct (ssh) access to the box and all configuration
changes will most likely have to be relayed through techies at the
client so fine tuning this is going to be a veritable nightmare.


IME, what you have actually just said is "It will not be possible to 
safely fine-tune the DB unless or until I have direct access, and/or 
someone who does have direct access is correctly trained."


Ick.



   I need to get to the stage where I can run queries such as:
   select street, locality_1, locality_2, city from address
   where (city = 'Nottingham' or locality_2 = 'Nottingham'
  or locality_1 = 'Nottingham')
 and upper(substring(street from 1 for 1)) = 'A'
   group by street, locality_1, locality_2, city
   order by street
   limit 20 offset 0
 
  This might be a lot quicker than pulling all the records like in your
  example queries...
 
 Yes, that certainly does seem to be the case - around 4 seconds. But I
 need it to be 10 times faster (or thereabouts) otherwise I have big
 problems!

 *beats drum* Get it in RAM, Get it in RAM, ...

Ok, but I currently have 2 million rows. When this launches in a couple
of weeks, it'll launch with 5 million+ and then gain  a million a year.


At my previously mentioned optimum of 85B per row, 2M rows is 
170MB.  5M rows is 425MB.  Assuming the gain of 1M rows per year, 
that's +85MB per year for this table.


Up to 2GB DIMMs are currently standard, and 4GB DIMMs are just in the 
process of being introduced.  Mainboards with anything from 4 to 16 
DIMM slots are widely available.


IOW, given the description you've provided this DB should _always_ 
fit in RAM.  Size the production system such that the entire DB fits 
into RAM during ordinary operation with an extra 1GB of RAM initially 
tossed on as a safety measure and the client will be upgrading the HW 
because it's obsolete before they run out of room in RAM.




I think the upshot of all this is that 4GB RAM as a minimum, plus judicious
use of normalization to avoid more expensive string comparisons and reduce
table size, is my immediate plan (along with proper configuration of pg).


My suggestion is only slightly different.  Reduce table size(s) and 
up the RAM to the point where the whole DB fits comfortably in RAM.
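
As purely illustrative starting points for an 8.0 install on a 4GB box 
(the numbers are guesses to sanity-check against your own profiling; the 
first two settings are counted in 8KB pages):

  shared_buffers = 32768          # ~256MB
  effective_cache_size = 393216   # tell the planner about ~3GB of OS cache
  maintenance_work_mem = 262144   # 256MB (in KB), speeds up index rebuilds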


You've got the rare opportunity to build a practical Memory Resident 
Database.  It should run like a banshee when you're done.  I'd love 
to see the benches on the final product.


Ron Peacetree



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Massive performance issues

2005-09-01 Thread Steinar H. Gunderson
On Thu, Sep 01, 2005 at 06:42:31PM +0100, Matthew Sackman wrote:
  flat_extra   | character varying(100) | not null
  number   | character varying(100) | not null
  street   | character varying(100) | not null
  locality_1   | character varying(100) | not null
  locality_2   | character varying(100) | not null
  city | character varying(100) | not null
  county   | character varying(100) | not null

Having these limited to 100 characters probably won't give you any
noticeable improvement; unless there's something natural about your data
that makes 100 a hard limit, you could just as well drop the length limits.
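
(On 8.0 or later the limit can be dropped in place -- just a sketch; on 
7.4 you would have to add a new column and copy the data across instead:

  ALTER TABLE address ALTER COLUMN street TYPE text;

and similarly for the other columns.)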

 address_city_index btree (city)
 address_county_index btree (county)
 address_locality_1_index btree (locality_1)
 address_locality_2_index btree (locality_2)
 address_pc_bottom_index btree (postcode_bottom)
 address_pc_middle_index btree (postcode_middle)
 address_pc_top_index btree (postcode_top)
 address_pc_top_middle_bottom_index btree (postcode_top,
  postcode_middle, postcode_bottom)
 address_pc_top_middle_index btree (postcode_top, postcode_middle)
 address_postcode_index btree (postcode)
 address_property_type_index btree (property_type)
 address_street_index btree (street)
 street_prefix btree (lower(substring((street)::text, 1, 1)))

Wow, that's quite a lot of indexes... but your problem isn't reported as
being in insert/update/delete.

 This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
 SATA harddrive.

8.0 or 8.1 might help you some -- better (and more!) disks will probably help
a _lot_.

 Queries such as:
 
 select locality_2 from address where locality_2 = 'Manchester';
 
 are taking 14 seconds to complete, and this is only 2 years worth of
 data - we will have up to 15 years (so over 15 million rows).

As Tom pointed out; you're effectively doing random searches here, and using
CLUSTER might help. Normalizing your data to get smaller rows (and avoid
possibly costly string comparisons if your strcoll() is slow) will probably
also help.
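
For example (7.4/8.0 syntax; CLUSTER rewrites the table and takes an 
exclusive lock, and it is a one-off ordering that needs redoing after 
large data changes):

  CLUSTER address_locality_2_index ON address;
  ANALYZE address;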

 I need to get to the stage where I can run queries such as:
 select street, locality_1, locality_2, city from address 
 where (city = 'Nottingham' or locality_2 = 'Nottingham'
or locality_1 = 'Nottingham')
   and upper(substring(street from 1 for 1)) = 'A' 
 group by street, locality_1, locality_2, city
 order by street
 limit 20 offset 0

This might be a lot quicker than pulling all the records like in your example
queries...

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[PERFORM] Avoid using swap in a cluster

2005-09-01 Thread Ricardo Humphreys
Hi all.

In a cluster, is there any way to use the main memory of the other nodes instead of the swap? If I have a query with many sub-queries and a lot of data, I can easily fill all the memory in a node. The point is: is there any way to continue using the main memory from other nodes in the same query instead of the swap?


Thank you,
Ricardo.


Re: [PERFORM] Massive performance issues

2005-09-01 Thread Mark Kirkwood
It would be good to see EXPLAIN ANALYZE output for the three queries 
below (the real vs. estimated row counts being of interest).


The number of pages in your address table might be interesting to know too.
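
E.g. (relpages/reltuples are the planner's estimates, so run ANALYZE first):

  SELECT relname, relpages, reltuples
  FROM pg_class
  WHERE relname = 'address';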

regards

Mark

Matthew Sackman wrote (with a fair bit of snippage):

explain select locality_2 from address where locality_2 = 'Manchester';
gives
                           QUERY PLAN
---------------------------------------------------------------
 Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
   Filter: ((locality_2)::text = 'Manchester'::text)


explain select locality_1 from address where locality_1 = 'Manchester';
gives
                           QUERY PLAN
---------------------------------------------------------------------------------------
 Index Scan using address_locality_1_index on address  (cost=0.00..69882.18 rows=17708 width=13)
   Index Cond: ((locality_1)::text = 'Manchester'::text)



select street, locality_1, locality_2, city from address 
where (city = 'Nottingham' or locality_2 = 'Nottingham'

   or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A' 
group by street, locality_1, locality_2, city

order by street
limit 20 offset 0



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Massive performance issues

2005-09-01 Thread Mark Kirkwood

Matthew Sackman wrote:


I need to get to the stage where I can run queries such as:


select street, locality_1, locality_2, city from address 
where (city = 'Nottingham' or locality_2 = 'Nottingham'

   or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A' 
group by street, locality_1, locality_2, city

order by street
limit 20 offset 0

and have the results very quickly.



This sort of query will be handled nicely in 8.1 - it has bitmap and/or 
processing to make use of multiple indexes. Note that 8.1 is in beta now.


Cheers

Mark


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Poor performance on HP Package Cluster

2005-09-01 Thread Luke Lonergan
Dan,

On 9/1/05 4:02 PM, Dan Harris [EMAIL PROTECTED] wrote:

 Do you have any sources for that information?  I am running dual
 SmartArray 6402's in my DL585 and haven't noticed anything poor about
 their performance.

I've previously posted comprehensive results using the 5i and 6xxx series
smart arrays using software RAID, HW RAID on 3 different kernels, alongside
LSI and Adaptec SCSI controllers, and an Adaptec 24xx HW RAID adapter.
Results with bonnie++ and simple sequential read/write with dd.

I'll post them again here for reference in the next message.  Yes, the
performance of the SmartArray controllers under Linux was abysmal, even when
run by the labs at HP.

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly