[PERFORM] Samsung 32GB SATA SSD tested

2008-07-22 Thread Jeffrey W. Baker
For background, please read the thread "Fusion-io ioDrive", archived at

http://archives.postgresql.org/pgsql-performance/2008-07/msg00010.php

To recap, I tested an ioDrive versus a 6-disk RAID with pgbench on an
ordinary PC.  I now also have a 32GB Samsung SATA SSD, and I have tested
it in the same machine with the same software and configuration.  I
tested it connected to the NVIDIA CK804 SATA controller on the
motherboard, and as a pass-through disk on the Areca RAID controller,
with write-back caching enabled.

                               Service time percentile, millis
            R/W TPS   R-O TPS    50th   80th   90th   95th
RAID            182       673      18     32     42     64
Fusion          971      4792       8      9     10     11
SSD+NV          442      4399      12     18     36     43
SSD+Areca       252      5937      12     15     17     21

As you can see, there are tradeoffs.  The motherboard's SATA ports are
substantially faster on the TPC-B (read/write) workload.  This little,
cheap SSD achieves almost half the performance of the ioDrive (i.e.
similar performance to a 50-disk SAS array).  The RAID controller does a
better job on the read-only workload, surpassing the ioDrive by roughly
20%.

Strangely the RAID controller behaves badly on the TPC-B workload.  It
is faster than disk, but not by a lot, and it's much slower than the
other flash configurations.  The read/write benchmark did not vary when
changing the number of clients between 1 and 8.  I suspect this is some
kind of problem with Areca's kernel driver or firmware.

On the bright side, the Samsung+Areca configuration offers excellent
service time distribution, comparable to that achieved by the ioDrive.
Using the motherboard's SATA ports gave service times comparable to the
disk RAID.

The performance is respectable for a $400 device.  You get about half
the tps and half the capacity of the ioDrive, but for one fifth the
price and in the much more convenient SATA form factor.
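For the curious, the price/performance arithmetic can be sketched quickly.  Only the $400 SSD price is from this post; the ~$2000 ioDrive figure is my assumption ("one fifth the price") for illustration:

```python
# Rough tps-per-dollar comparison.  The Samsung price ($400) is from the
# post above; the ioDrive price (~$2000) is an assumption for illustration.
devices = {
    # name: (read/write TPS, price in USD)
    "ioDrive": (971, 2000),
    "SSD+NV":  (442, 400),
}
for name, (tps, price) in devices.items():
    print(f"{name:8s} {tps:4d} tps  ${price:5d}  {tps / price:.3f} tps/$")
```

On those assumptions the SSD delivers more than twice the transactions per dollar of the ioDrive.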

Your faithful investigator,
jwb


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Size and performance hit from using UTF8 vs. ASCII?

2006-02-08 Thread Jeffrey W. Baker
On Wed, 2006-02-08 at 09:11 -0500, Ron wrote:
> I'm specifically interested in the default C Locale; but if there's a 
> difference in the answer for other locales, I'd like to hear about 
> that as well.

The size hit will be effectively zero if your data is mainly of the
ASCII variety, since ASCII printable characters to UTF-8 is an identity
transform.  However anything involving string comparisons, including
equality, similarity (LIKE, regular expressions), or any other kind of
comparison (ORDER BY, GROUP BY) will be slower.  In my experience the
performance hit varies from zero to 100% in CPU time.  UTF-8 is never
faster than ASCII, as far as I know.
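The "identity transform" point is easy to check for yourself: printable ASCII characters encode to the same single bytes in UTF-8, so ASCII-only data is the same size in either encoding.  A quick sketch in Python:

```python
s = "SELECT count(*) FROM transactions;"    # printable ASCII only
assert s.encode("ascii") == s.encode("utf-8")   # identical bytes
assert len(s.encode("utf-8")) == len(s)         # one byte per character

# Non-ASCII text is where UTF-8 costs space: these characters take
# two bytes each, so the encoded form is longer than the string.
t = "zażółć"
assert len(t.encode("utf-8")) > len(t)
print("ok")
```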

However, if you need UTF-8 then you need it, and there's no point in
worrying about the performance hit.

You may as well just do two benchmark runs, with your database
initialized in each encoding, and see for yourself.

-jwb



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Storing Digital Video

2006-02-04 Thread Jeffrey W. Baker
On Tue, 2006-01-31 at 16:32 -0800, Rodrigo Madera wrote:
> I am concerned with performance issues involving the storage of DV on
> a database.
> 
> I though of some options, which would be the most advised for speed?
> 
> 1) Pack N frames inside a "container" and store the container to the db.
> 2) Store each frame in a separate record in the table "frames".
> 3) (type something here)

How about some more color?  _Why_, for example, would you store video in
a relational database?

-jwb



Re: [PERFORM] Index Usage using IN

2006-02-01 Thread Jeffrey W. Baker
On Wed, 2006-02-01 at 12:22 -0800, Jeffrey W. Baker wrote:
> On Thu, 2006-02-02 at 09:12 +1300, Ralph Mason wrote:
> > Hi,
> > 
> > I have 2 tables both have an index on ID (both ID columns are an oid).
> > 
> > I want to find only only rows in one and not the other.
> > 
> > Select ID from TableA where ID not IN ( Select ID from Table B)
> 
> Have you considered this:
> 
> SELECT ID from TableA EXCEPT Select ID from Table B

Alternately:

SELECT a.ID
  FROM TableA AS a
  LEFT JOIN TableB AS b
    ON a.ID = b.ID
 WHERE b.ID IS NULL
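All three formulations -- NOT IN, EXCEPT, and the anti-join above -- should return the same IDs, assuming TableB.ID contains no NULLs.  A quick sanity check, using SQLite purely for convenience here; the same SQL runs on PostgreSQL:

```python
import sqlite3

# Tiny in-memory fixture: IDs 1..4 in TableA, 2 and 4 in TableB.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableA (ID INTEGER);
    CREATE TABLE TableB (ID INTEGER);
    INSERT INTO TableA VALUES (1), (2), (3), (4);
    INSERT INTO TableB VALUES (2), (4);
""")

q_not_in = "SELECT ID FROM TableA WHERE ID NOT IN (SELECT ID FROM TableB)"
q_except = "SELECT ID FROM TableA EXCEPT SELECT ID FROM TableB"
q_join   = """SELECT a.ID FROM TableA AS a
              LEFT JOIN TableB AS b ON a.ID = b.ID
              WHERE b.ID IS NULL"""

results = [sorted(row[0] for row in con.execute(q))
           for q in (q_not_in, q_except, q_join)]
print(results)  # [[1, 3], [1, 3], [1, 3]]
```

One caveat worth knowing: if TableB.ID can be NULL, `ID NOT IN (...)` returns no rows at all, while EXCEPT and the LEFT JOIN still behave as expected.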

-jwb



Re: [PERFORM] Index Usage using IN

2006-02-01 Thread Jeffrey W. Baker
On Thu, 2006-02-02 at 09:12 +1300, Ralph Mason wrote:
> Hi,
> 
> I have 2 tables both have an index on ID (both ID columns are an oid).
> 
> I want to find only only rows in one and not the other.
> 
> Select ID from TableA where ID not IN ( Select ID from Table B)

Have you considered this:

SELECT ID from TableA EXCEPT Select ID from Table B

?

-jwb




Re: [PERFORM] Huge Data sets, simple queries

2006-02-01 Thread Jeffrey W. Baker
On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote:
> Jeffrey,
> 
> On 1/31/06 8:09 PM, "Jeffrey W. Baker" <[EMAIL PROTECTED]> wrote:
> >> ... Prove it.
> > I think I've proved my point.  Software RAID1 read balancing provides
> > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
> > respectively.  In the presence of random I/O, the results are even
> > better.
> > Anyone who thinks they have a single-threaded workload has not yet
> > encountered the autovacuum daemon.
> 
> Good data - interesting case.  I presume from your results that you had to
> make the I/Os non-overlapping (the "skip" option to dd) in order to get the
> concurrent access to work.  Why the particular choice of offset - 3.2GB in
> this case?

No particular reason.  8k x 400000 is what the last guy used upthread.
> 
> So - the bandwidth doubles in specific circumstances under concurrent
> workloads - not relevant to "Huge Data sets, simple queries", but possibly
> helpful for certain kinds of OLTP applications.

Ah, but someday Pg will be able to concurrently read from two
datastreams to complete a single query.  And that day will be glorious
and fine, and you'll want as much disk concurrency as you can get your
hands on.

-jwb




Re: [PERFORM] Huge Data sets, simple queries

2006-01-31 Thread Jeffrey W. Baker
On Tue, 2006-01-31 at 12:47 -0800, Luke Lonergan wrote:
> Jeffrey,
> 
> On 1/31/06 12:03 PM, "Jeffrey W. Baker" <[EMAIL PROTECTED]> wrote:
> > Linux does balanced reads on software
> > mirrors.  I'm not sure why you think this can't improve bandwidth.  It
> > does improve streaming bandwidth as long as the platter STR is more than
> > the bus STR.
> 
> ... Prove it.

It's clear that Linux software RAID1, and by extension RAID10, does
balanced reads, and that these balanced reads double the bandwidth.  A
quick glance at the kernel source code, and a trivial test, prove the
point.

In this test, sdf and sdg are Seagate 15k.3 disks on a single channel of
an Adaptec 39320, but the enclosure, and therefore the bus, is capable
of only Ultra160 operation.

# grep md0 /proc/mdstat 
md0 : active raid1 sdf1[0] sdg1[1]

# dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=0      &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=400000
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 48.243362 seconds (67922298 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 48.375897 seconds (67736211 bytes/sec)

That's 136MB/sec, for those following along at home.  With only two
disks in a RAID1, you can nearly max out the SCSI bus.

# dd if=/dev/sdf1 of=/dev/null bs=8k count=400000 skip=0      &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=400000 skip=400000
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 190.413286 seconds (17208883 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 192.096232 seconds (17058117 bytes/sec)

That, on the other hand, is only 34MB/sec.  With two threads, the RAID1
is 296% faster.
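The percentages quoted here follow directly from the transfer rates reported by dd (in bytes/sec); a quick check of the arithmetic:

```python
# Aggregate two-reader bandwidth, RAID1 vs. a single member disk,
# using the bytes/sec figures reported by dd above.
raid1  = 67922298 + 67736211   # two concurrent readers on /dev/md0
single = 17208883 + 17058117   # two concurrent readers on /dev/sdf1

print(f"RAID1:   {raid1  / 1e6:.0f} MB/s")   # → 136 MB/s
print(f"single:  {single / 1e6:.0f} MB/s")   # → 34 MB/s
print(f"speedup: {raid1 / single - 1:.0%}")  # → 296%
```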

# dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=0       &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=400000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=800000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=1200000 &
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 174.276585 seconds (18802296 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 181.581893 seconds (18045852 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 183.724243 seconds (17835425 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 184.209018 seconds (17788489 bytes/sec)

That's 71MB/sec with 4 threads...

# dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=0       &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=400000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=800000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=1200000 &
100000+0 records in
100000+0 records out
819200000 bytes transferred in 77.489210 seconds (10571794 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 87.628000 seconds (9348610 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 88.912989 seconds (9213502 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 90.238705 seconds (9078144 bytes/sec)

Only 36MB/sec for the single disk.  96% advantage for the RAID1.

# dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=0       &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=400000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=800000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=1200000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=1600000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2000000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2400000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2800000 &
50000+0 records in
50000+0 records out
409600000 bytes transferred in 35.289648 seconds (11606803 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 42.653475 seconds (9602969 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 43.524714 seconds (9410745 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 45.151705 seconds (9071640 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 47.741845 seconds (8579476 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 48.600533 seconds (8427891 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 48.758726 seconds (8400548 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 49.679275 seconds (8244887 bytes/sec)

66MB/s with 8 threads.

# dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=0       &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=400000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=800000

Re: [PERFORM] Huge Data sets, simple queries

2006-01-31 Thread Jeffrey W. Baker
On Tue, 2006-01-31 at 09:00 -0800, Luke Lonergan wrote:
> Jim,
> 
> On 1/30/06 12:25 PM, "Jim C. Nasby" <[EMAIL PROTECTED]> wrote:
> 
> > Why divide by 2? A good raid controller should be able to send read
> > requests to both drives out of the mirrored set to fully utilize the
> > bandwidth. Of course, that probably won't come into play unless the OS
> > decides that it's going to read-ahead fairly large chunks of the table
> > at a time...
> 
> I've not seen one that does, nor would it work in the general case IMO.  In
> RAID1 writes are duplicated and reads come from one of the copies.  You
> could alternate read service requests to minimize rotational latency, but
> you can't improve bandwidth.

Then you've not seen Linux.  Linux does balanced reads on software
mirrors.  I'm not sure why you think this can't improve bandwidth.  It
does improve streaming bandwidth as long as the platter STR is more than
the bus STR.

-jwb



Re: [PERFORM] Huge Data sets, simple queries

2006-01-29 Thread Jeffrey W. Baker
On Sun, 2006-01-29 at 13:44 -0500, Luke Lonergan wrote:
> Depesz,
> 
> > [mailto:[EMAIL PROTECTED] On Behalf Of 
> > hubert depesz lubaczewski
> > Sent: Sunday, January 29, 2006 3:25 AM
> >
> > hmm .. do i understand correctly that you're suggesting that 
> > using raid 10 and/or hardware raid adapter might hurt disc 
> > subsystem performance? could you elaborate on the reasons, 
> > please? it's not that i'm against the idea - i'm just curious 
> > as this is very "against-common-sense". and i always found it 
> > interesting when somebody states something that uncommon...

> Oh - and about RAID 10 - for large data work it's more often a waste of
> disk performance-wise compared to RAID 5 these days.  RAID5 will almost
> double the performance on a reasonable number of drives.

I think you might want to be more specific here.  I would agree with you
for data warehousing, decision support, data mining, and similar
read-mostly non-transactional loads.  For transactional loads RAID-5 is,
generally speaking, a disaster due to the read-before-write problem.
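The read-before-write problem is easy to quantify: a small random write on RAID-5 costs four disk I/Os (read old data, read old parity, write new data, write new parity), while RAID-10 costs two (one write per mirror side).  A back-of-the-envelope sketch -- the ~150 IOPS per disk is an assumed figure for a 10k/15k drive of that era, purely for illustration:

```python
# Back-of-the-envelope small-random-write throughput.  The per-disk
# IOPS figure is an assumption, not a measurement.
def random_write_iops(n_disks, disk_iops=150, level="raid10"):
    total = n_disks * disk_iops
    if level == "raid5":
        # read old data + read old parity + write data + write parity
        return total / 4
    if level == "raid10":
        # each logical write goes to both sides of a mirror
        return total / 2
    raise ValueError(level)

for level in ("raid5", "raid10"):
    print(level, random_write_iops(8, level=level))
# raid5 300.0
# raid10 600.0
```

So for small transactional writes an 8-disk RAID-10 sustains roughly twice the write rate of the same disks in RAID-5, before any cache effects.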

While we're on the topic, I just installed another one of those Areca
ARC-1130 controllers with 1GB cache.  It's ludicrously fast: 250MB/sec
burst writes, CPU-limited reads.  I can't recommend them highly enough.

-jwb

PS: Could you look into fixing your mailer?  Your messages sometimes
don't contain In-Reply-To headers, and therefore don't thread properly.




Re: [PERFORM] Huge Data sets, simple queries

2006-01-28 Thread Jeffrey W. Baker
On Sat, 2006-01-28 at 10:55 -0500, Tom Lane wrote:
> 
> Assuming that "month" means what it sounds like, the above would
> result
> in running twelve parallel sort/uniq operations, one for each month
> grouping, to eliminate duplicates before counting.  You've got sortmem
> set high enough to blow out RAM in that scenario ...

Hrmm, why is it that with a similar query I get a far simpler plan than
you describe, and relatively snappy runtime?

  select date
   , count(1) as nads
   , sum(case when premium then 1 else 0 end) as npremium
   , count(distinct(keyword)) as nwords
   , count(distinct(advertiser)) as nadvertisers 
from data 
group by date 
order by date asc

                                  QUERY PLAN
------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..14452743.09 rows=721 width=13)
   ->  Index Scan using data_date_idx on data
         (cost=0.00..9075144.27 rows=430206752 width=13)
(2 rows)

=# show server_version;
 server_version
----------------
 8.1.2
(1 row)

-jwb




Re: [PERFORM] Huge Data sets, simple queries

2006-01-27 Thread Jeffrey W. Baker
On Fri, 2006-01-27 at 20:23 -0500, Mike Biamonte wrote:
> 
> Does anyone have any experience with extremely large data sets?
> I'm mean hundreds of millions of rows.

Sure, I think more than a few of us do.  Just today I built a summary
table from a 25GB primary table with ~430 million rows.  This took about
45 minutes.

> The queries I need to run on my 200 million transactions are relatively
> simple:
> 
>select month, count(distinct(cardnum)) count(*), sum(amount) from
> transactions group by month;
> 
> This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 Kernel) with
> RAID-10 (15K drives)
> and 12 GB Ram.  I was expecting it to take about 4 hours - based on some
> experience with a
> similar dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM,
> Raid-5 10K drives)

Possibly the latter machine has a faster I/O subsystem.  How large is
the table on disk?

>   This machine is COMPLETELY devoted to running these relatively simple
> queries one at a
> time. (No multi-user support needed!)I've been tooling with the various
> performance settings:
> effective_cache at 5GB, shared_buffers at 2 GB, workmem, sortmem at 1 GB
> each.
> ( Shared buffers puzzles me a it bit - my instinct says to set it as high as
> possible,
> but everything I read says that "too high" can hurt performance.)
> 
>Any ideas for performance tweaking in this kind of application would be
> greatly appreciated.
> We've got indexes on the fields being grouped, 
> and always vacuum analzye
> after building them.

Probably vacuum makes no difference.

>It's difficult to just "try" various ideas because each attempt takes a
> full day to test.  Real
> experience is needed here!

Can you send us an EXPLAIN of the query?  I believe what you're seeing
here is probably:

Aggregate
+-Sort
  +-Sequential Scan

or perhaps:

Aggregate
+-Index Scan

I have a feeling that the latter will be much faster.  If your table has
been created over time, then it is probably naturally ordered by date,
and therefore also ordered by month.  You might expect a Sequential Scan
to be the fastest, but the Sort step will be a killer.  On the other
hand, if your table is badly disordered by date, the Index Scan could
also be very slow.

Anyway, send us the query plan and also perhaps a sample of vmstat
during the query.

For what it's worth, I have:

effective_cache_size | 700000
cpu_tuple_cost       | 0.01
cpu_index_tuple_cost | 0.001
random_page_cost     | 3
shared_buffers       | 50000
temp_buffers         | 1000
work_mem             | 1048576  <= for this query only

And here's a few lines from vmstat during the query:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache  si  so     bi  bo    in   cs  us sy id wa
 2  1     76  43476  94916 7655148   0   0  78800   0  1662  788  68 12  0 20
 1  1     76  45060  91196 7658088   0   0  78028   0  1639  712  71 11  0 19
 2  0     76  44668  87624 7662960   0   0  78924  52  1650  736  69 12  0 19
 2  0     76  45300  83672 7667432   0   0  83536  16  1688  768  71 12  0 18
 1  1     76  45744  80652 7670712   0   0  84052   0  1691  796  70 12  0 17

That's about 80MB/sec sequential input, for comparison purposes.
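The 80MB/sec figure can be read straight off vmstat's bi column (blocks in, reported in KiB/s on Linux); averaging the five samples:

```python
# The five 'bi' readings from the vmstat sample above, in KiB/s.
bi = [78800, 78028, 78924, 83536, 84052]
avg = sum(bi) / len(bi)
print(f"{avg / 1024:.1f} MiB/s")   # → 78.8 MiB/s
```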

-jwb




Re: [PERFORM] Invulnerable VACUUM process thrashing everything

2005-12-29 Thread Jeffrey W. Baker
On Thu, 2005-12-29 at 22:53 +, Russ Garrett wrote:
> In my experience a kill -9 has never resulted in any data loss in this 
> situation (it will cause postgres to detect that the process died, shut 
> down, then recover), and most of the time it only causes a 5-10sec 
> outage. I'd definitely hesitate to recommend it in a production context 
> though, especially since I think there are some known race-condition 
> bugs in 7.4.
> 
> VACUUM *will* respond to a SIGTERM, but it doesn't check very often - 
> I've often had to wait hours for it to determine that it's been killed, 
> and my tables aren't anywhere near 1TB. Maybe this is a place where 
> things could be improved...

FWIW, I murdered this process with SIGKILL, and the recovery was very
short.


> Incidentally, I have to kill -9 some of our MySQL instances quite 
> regularly because they do odd things. Not something you want to be 
> doing, especially when MySQL takes 30mins to recover.

Agreed.  After an unclean mysql shutdown with MyISAM, all tables must be
checked and usually many need to be repaired.  This takes a really long
time.

-jwb

> Russ Garrett
> Last.fm Ltd.
> [EMAIL PROTECTED]
> 
> Ron wrote:
> 
> > Ick.  Can you get users and foreign connections off that machine, lock 
> > them out for some period, and renice the VACUUM?
> >
> > Shedding load and keeping it off while VACUUM runs high priority might 
> > allow it to finish in a reasonable amount of time.
> > Or
> > Shedding load and dropping the VACUUM priority might allow a kill 
> > signal to get through.
> >
> > Hope this helps,
> > Ron
> >
> >
> > At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote:
> >
> >> A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
> >> decided to VACUUM a table which has not been updated in over a year and
> >> is more than one terabyte on the disk.  Because of the very high
> >> transaction load on this database, this VACUUM has been ruining
> >> performance for almost a month.  Unfortunately it seems invulnerable to
> >> killing by signals:
> >>
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >> # kill -HUP 15308
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >> # kill -INT 15308
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >> # kill -PIPE 15308
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >>
> >> o/~ But the cat came back, the very next day ...
> >>
> >> I assume that if I kill this with SIGKILL, that will bring down every
> >> other postgres process, so that should be avoided.  But surely there is
> >> a way to interrupt this.  If I had some reason to shut down the
> >> instance, I'd be screwed, it seems.
> >
> >
> >
> >
> >
> >
> 



[PERFORM] Invulnerable VACUUM process thrashing everything

2005-12-29 Thread Jeffrey W. Baker
A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
decided to VACUUM a table which has not been updated in over a year and
is more than one terabyte on the disk.  Because of the very high
transaction load on this database, this VACUUM has been ruining
performance for almost a month.  Unfortunately it seems invulnerable to
killing by signals:

# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -HUP 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -INT 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -PIPE 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM

o/~ But the cat came back, the very next day ...

I assume that if I kill this with SIGKILL, that will bring down every
other postgres process, so that should be avoided.  But surely there is
a way to interrupt this.  If I had some reason to shut down the
instance, I'd be screwed, it seems.

-jwb



[PERFORM] Process executing COPY opens and reads every table on the system

2005-12-29 Thread Jeffrey W. Baker
I have an instance of PG 7.4 where I would really like to execute some
schema changes, but every schema change is blocked waiting for a process
doing a COPY.  That query is:

COPY drill.trades (manager, sec_id, ticker, bridge_tkr, date, "type",
short, quantity, price, prin, net_money, factor) TO stdout;

So it's only involved with a single table in a single schema.
Unfortunately, what this process is doing is opening and reading every
table in the database:

# strace -e open,close -p 29859
Process 29859 attached - interrupt to quit
open("/var/lib/postgres/data/base/7932340/2442094542", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.1", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.3", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.5", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.5", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.3", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.6", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.9", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.3", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2414561511", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2426495316", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2426495316", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2414561511", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2426495316", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2426495316", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2414561511", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2429205386", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2429205433", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2426495316", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2414561511", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2429205441", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2414561511", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2426495316", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.3", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.10", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.16", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/2023517557.9", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/2298808676/2298808939.10", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/7932340/358185104.15", O_RDWR) = 49
close(49)   = 0
open("/var/lib/postgres/data/base/793234

Re: [PERFORM] opinion on disk speed

2005-12-08 Thread Jeffrey W. Baker
On Thu, 2005-12-08 at 11:52 -0500, Vivek Khera wrote:
> I have a choice to make on a RAID enclosure:
> 
> 14x 36GB 15kRPM ultra 320 SCSI drives
> 
> OR
> 
> 12x 72GB 10kRPM ultra 320 SCSI drives
> 
> both would be configured into RAID 10 over two SCSI channels using a  
> megaraid 320-2x card.
> 
> My goal is speed.  Either would provide more disk space than I would  
> need over the next two years.
> 
> The database does a good number of write transactions, and a decent  
> number of sequential scans over the whole DB (about 60GB including  
> indexes) for large reports.

The STR of 15k is quite a bit higher than 10k.  I'd be inclined toward
the 15k if it doesn't impact the budget.

For the write transactions, the speed and size of the DIMM on that LSI
card will matter the most.  I believe the max memory on that adapter is
512MB.  These cost so little that it wouldn't make sense to go with
anything smaller.

When comparing the two disks, don't forget to check for supported SCSI
features.  In the past I've been surprised that some 10k disks don't
support packetization, QAS, and so forth.  All 15k disks seem to support
these.

Don't forget to post some benchmarks when your vendor delivers ;)

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Jeffrey W. Baker
On Wed, 2005-10-05 at 12:14 -0400, Ron Peacetree wrote:
> I've now gotten verification from multiple working DBA's that DB2, Oracle, and
> SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in
> setups akin to Oracle RAC) when attached to a decent (not outrageous, but
> decent) HD subsystem...
> 
> I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is
> attainable.  Cache based bursts that high, yes.  ASTR, no.

I find your tone annoying.  That you do not have access to this level of
hardware proves nothing, other than pointing out that your repeated
emails on this list are based on supposition.

If you want 1GB/sec STR you need:

1) 1 or more Itanium CPUs
2) 24 or more disks
3) 2 or more SATA controllers
4) Linux

Have fun.

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Jeffrey W. Baker
On Mon, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:
> Jeff,
> 
> > > Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A
> > > Big-Name Proprietary Database doesn't get much more than that either.
> >
> > I find this claim very suspicious.  I get single-threaded reads in
> > excess of 1GB/sec with XFS and > 250MB/sec with ext3.
> 
> Database reads?  Or raw FS reads?  It's not the same thing.

Just reading files off the filesystem.  These are input rates I get with
a specialized sort implementation.  1GB/sec is not even especially
wonderful, I can get that on two controllers with 24-disk stripe set.

I guess database reads are different, but I remain unconvinced that they
are *fundamentally* different.  After all, a tab-delimited file (my sort
workload) is a kind of database.

> Also, we're talking *write speed* here, not read speed.

Ok, I did not realize.  Still you should see 250-300MB/sec
single-threaded sequential output on ext3, assuming the storage can
provide that rate.

> I also find *your* claim suspicious, since there's no way XFS is 300% faster 
> than ext3 for the *general* case.

On a single disk you wouldn't notice, but XFS scales much better when
you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
whereas ext3 stops scaling after 16 disks.  For writes both XFS and ext3
top out around 8 disks, but in this case XFS tops out at 500MB/sec while
ext3 can't break 350MB/sec.

I'm hopeful that in the future the work being done at ClusterFS will
make ext3 on-par with XFS.

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Jeffrey W. Baker
On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
> Michael,
> 
> > >Realistically, you can't do better than about 25MB/s on a
> > > single-threaded I/O on current Linux machines,
> >
> > What on earth gives you that idea? Did you drop a zero?
> 
> Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
> Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious.  I get single-threaded reads in
excess of 1GB/sec with XFS and > 250MB/sec with ext3.  

-jwb



Re: [PERFORM] Ultra-cheap NVRAM device

2005-10-03 Thread Jeffrey W. Baker
On Mon, 2005-10-03 at 11:15 -0600, Dan Harris wrote:
> On Oct 3, 2005, at 5:02 AM, Steinar H. Gunderson wrote:
> 
> > I thought this might be interesting, not the least due to the  
> > extremely low
> > price ($150 + the price of regular DIMMs):
> >
> >
> >
> 
> This has been posted before, and the main reason nobody got very  
> excited is that:
> 
> a) it only uses the PCI bus to provide power to the device, not for I/O
> b) It is limited to SATA bandwidth
> c) The benchmarks did not prove it to be noticeably faster than a  
> good single SATA drive
> 
> A few of us were really excited at first too, until seeing the  
> benchmarks..

Also, no ECC support.  You'd be crazy to use it for anything.

-jwb

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Jeffrey W. Baker
On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
> Josh,
> 
> On 9/29/05 9:54 AM, "Josh Berkus"  wrote:
> 
> > Following an index creation, we see that 95% of the time required is the
> > external sort, which averages 2mb/s.  This is with seperate drives for
> > the WAL, the pg_tmp, the table and the index.  I've confirmed that
> > increasing work_mem beyond a small minimum (around 128mb) had no benefit
> > on the overall index creation speed.
> 
> Yp!  That about sums it up - regardless of taking 1 or 2 passes through
> the heap being sorted, 1.5 - 2 MB/s is the wrong number.

Yeah, this is really bad ... approximately the speed of GNU sort.

Josh, do you happen to know how many passes are needed in the multiphase
merge on your 60GB table?

Looking through tuplesort.c, I have a couple of initial ideas.  Are we
allowed to fork here?  That would open up the possibility of using the
CPU and the I/O in parallel.  I see that tuplesort.c also suffers from
the kind of postgresql-wide disease of calling all the way up and down a
big stack of software for each tuple individually.  Perhaps it could be
changed to work on vectors.

I think the largest speedup will be to dump the multiphase merge and
merge all tapes in one pass, no matter how large M is.  Currently M is
capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
the tape.  It could be done in a single-pass heap merge with N*log(M)
comparisons and, more importantly, far less input and output.

I would also recommend using an external process to asynchronously
feed the tuples into the heap during the merge.
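As a toy illustration of the one-pass idea (Python lists standing in for
on-disk tapes; this is a sketch, not a patch against tuplesort.c):

```python
import heapq

def merge_runs(runs):
    """Single-pass M-way merge using a heap of run cursors.

    Each pop/push pair costs O(log M), so merging N total tuples takes
    about N*log(M) comparisons and reads every input exactly once --
    no repeated passes over the tapes, however large M grows.
    """
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

print(merge_runs([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

(Python's stdlib heapq.merge does the same thing lazily.)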

What's the timeframe for 8.2?

-jwb




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Sequential I/O Cost (was Re: [PERFORM] A Better External Sort?)

2005-09-28 Thread Jeffrey W. Baker
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
> >Perhaps I believe this because you can now buy as much sequential I/O
> >as you want.  Random I/O is the only real savings.
> >
> 1= No, you can not "buy as much sequential IO as you want".  Even if
> with an infinite budget, there are physical and engineering limits.  Long
> before you reach those limits, you will pay exponentially increasing costs
> for linearly increasing performance gains.  So even if you _can_ buy a
> certain level of sequential IO, it may not be the most efficient way to
> spend money.

This is just false.  You can buy sequential I/O for linear money up to
and beyond your platform's main memory bandwidth.  Even 1GB/sec will
severely tax memory bandwidth of mainstream platforms, and you can
achieve this rate for a modest cost.  

I have one array that can supply this rate and it has only 15 disks.  It
would fit on my desk.  I think your dire talk about the limits of
science and engineering may be a tad overblown.


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Jeffrey W. Baker
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
> >Sent: Sep 27, 2005 1:26 PM
> >To: Ron Peacetree <[EMAIL PROTECTED]>
> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> >
> >On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
> >
> >>That Btree can be used to generate a physical reordering of the data
> >>in one pass, but that's the weakest use for it.  The more powerful
> >>uses involve allowing the Btree to persist and using it for more
> >>efficient re-searches or combining it with other such Btrees (either as
> >>a step in task distribution across multiple CPUs or as a more efficient
> >>way to do things like joins by manipulating these Btrees rather than
> >>the actual records.)
> >
> >Maybe you could describe some concrete use cases.  I can see what
> >you are getting at, and I can imagine some advantageous uses, but
> >I'd like to know what you are thinking.
> >
> >Specifically I'd like to see some cases where this would beat sequential
> >scan.  I'm thinking that in your example of a terabyte table with a
> >column having only two values, all the queries I can think of would be
> >better served with a sequential scan.
> >
> In my original example, a sequential scan of the 1TB of 2KB or 4KB
> records, => 250M or 500M records of data, being sorted on a binary
> value key will take ~1000x more time than reading in the ~1GB Btree
> I described that used a Key+RID (plus node pointers) representation
> of the data.

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy.  A common, general-purpose
case would be the best.

We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark.  But that's not what
we're doing here.  We are designing and operating relational databases.
So please explain the application.

Your main example seems to focus on a large table where a key column has
constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass.  This will be faster than the method you propose.
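The two-pass trick, as a toy sketch on in-memory rows (illustrative only;
a real executor would stream tuples, not build lists):

```python
def two_pass_scan(rows, key, lo, hi):
    """Return rows "sorted" on a two-valued key column using two
    sequential passes: emit rows matching the low value first, then
    rows matching the high value.  No comparison sort at all -- just
    two table scans and an equality test per row."""
    first = [r for r in rows if r[key] == lo]
    second = [r for r in rows if r[key] == hi]
    return first + second

rows = [{"k": "b", "v": 1}, {"k": "a", "v": 2},
        {"k": "b", "v": 3}, {"k": "a", "v": 4}]
print([r["v"] for r in two_pass_scan(rows, "k", "a", "b")])
# [2, 4, 1, 3]
```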

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete.  If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you need not.

-jwb

PS: Whatever mailer you use doesn't understand or respect threading or
attribution.  Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-27 Thread Jeffrey W. Baker
On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

> That Btree can be used to generate a physical reordering of the data
> in one pass, but that's the weakest use for it.  The more powerful
> uses involve allowing the Btree to persist and using it for more
> efficient re-searches or combining it with other such Btrees (either as
> a step in task distribution across multiple CPUs or as a more efficient
> way to do things like joins by manipulating these Btrees rather than
> the actual records.)

Maybe you could describe some concrete use cases.  I can see what you
are getting at, and I can imagine some advantageous uses, but I'd like
to know what you are thinking.

Specifically I'd like to see some cases where this would beat sequential
scan.  I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

Perhaps I believe this because you can now buy as much sequential I/O as
you want.  Random I/O is the only real savings.

-jwb



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Battery Backed Cache for RAID

2005-09-14 Thread Jeffrey W. Baker
On Wed, 2005-09-14 at 11:25 -0700, Peter Darley wrote:
>   I'm getting a new server for our database, and I have a quick question
> about RAID controllers with a battery backed cache.  I understand that the
> cache will allow the cache to be written out if the power fails to the box,
> which allows it to report a write as committed safely when it's not actually
> committed.

Actually the cache will just hold its contents while the power is out.
When the power is restored, the RAID controller will complete the writes
to disk.  If the battery does not last through the outage, the data is
lost.

>   My question is, if the power goes off, and the drives stop, how does the
> battery backed cache save things out to the dead drives?  Is there another
> component that is implied that will provide power to the drives that I
> should be looking into as well?

A UPS would allow you to do an orderly shutdown and write contents to
disk during a power failure.  However a UPS can be an extra point of
failure.

-jwb

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Observation about db response time

2005-09-05 Thread Jeffrey W. Baker
On Tue, 2005-08-30 at 08:13 -0500, Frank Wiles wrote:
> On Tue, 30 Aug 2005 18:35:30 +0530
> "Akshay Mathur" <[EMAIL PROTECTED]> wrote:
> 
> > Hello Friends,
> >  
> > We were having a database in pgsql7.4.2 The database was responding
> > very slowly even after full vacuum analyze (select count(*) from
> > some_table_having_18000_records was taking 18 Sec).
> >  
> > We took a backup of that db and restored it back. Now the same db on
> > same PC is responding fast (same query is taking 18 ms).
> >  
> > But we can't do the same as a solution of slow response. Do anybody
> > has faced similar problem? Is this due to any internal problem of
> > pgsql? Is there any clue to fasten the database?
> 
>   This could be because you don't have max_fsm_pages and
>   max_fsm_relations setup correctly or are not doing full vacuums 
>   often enough. 
> 
>   If your database deletes a ton of data as a matter of course then
>   sometimes a full vacuum will not clear up as much space as it could.
> 
>   Try increasing those configuration values and doing vacuums more
>   often. 
> 
>   You should also explore upgrading to the latest 8.0 as you will
>   no doubt see noticeable speed improvements. 

This can also be caused by index bloat.  VACUUM does not clear out the
index.  You must use REINDEX for that.

-jwb

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Limit + group + join

2005-08-25 Thread Jeffrey W. Baker
On Thu, 2005-08-25 at 18:56 -0700, Jeffrey W. Baker wrote:
> On Fri, 2005-08-26 at 02:27 +0200, Tobias Brox wrote:
> > Consider this setup - which is a gross simplification of parts of our
> > production system ;-)
> > 
> >   create table c (id integer primary key);
> >   create table b (id integer primary key, c_id integer);
> >   create index b_on_c on b(c_id)
> > 
> >   insert into c (select ... lots of IDs ...);
> >   insert into b (select id, id from c); /* keep it simple :-) */
> >   
> > Now, I'm just interested in a few rows.  
> > 
> > All those gives good plans:
> > 
> > explain select c.id from c order by c.id limit 1;
> > explain select c.id from c group by c.id order by c.id limit 1;
> > explain select c.id from c join b on c_id=c.id order by c.id limit 1;
> > 
> > ... BUT ... combining join, group and limit makes havoc:
> > 
> > explain select c.id from c join b on c_id=c.id  group by c.id order by c.id
> > desc limit 5;
> 
> Where's b in this join clause?  It looks like a cartesian product to me.

Never mind.  I read c_id as c.id.

-jwb


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Limit + group + join

2005-08-25 Thread Jeffrey W. Baker
On Fri, 2005-08-26 at 02:27 +0200, Tobias Brox wrote:
> Consider this setup - which is a gross simplification of parts of our
> production system ;-)
> 
>   create table c (id integer primary key);
>   create table b (id integer primary key, c_id integer);
>   create index b_on_c on b(c_id)
> 
>   insert into c (select ... lots of IDs ...);
>   insert into b (select id, id from c); /* keep it simple :-) */
>   
> Now, I'm just interested in a few rows.  
> 
> All those gives good plans:
> 
> explain select c.id from c order by c.id limit 1;
> explain select c.id from c group by c.id order by c.id limit 1;
> explain select c.id from c join b on c_id=c.id order by c.id limit 1;
> 
> ... BUT ... combining join, group and limit makes havoc:
> 
> explain select c.id from c join b on c_id=c.id  group by c.id order by c.id
> desc limit 5;

Where's b in this join clause?  It looks like a cartesian product to me.

-jwb


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Wed, 2005-08-24 at 01:56 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> > On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
> >> Dont forget that already in postgres, you have a process per connection, 
> >> and
> >> all the processes take care of their own I/O.
> 
> > That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
> > coordinating the I/O of all the backends optimally.  For instance, with
> > synchronous scanning.
> 
> And why exactly are we going to do a better job of I/O scheduling than
> the OS itself can do?
...
> There are some things we could do to reduce the impedance between us and
> the OS --- for instance, the upthread criticism that a seqscan asks the
> OS for only 8K at a time is fair enough.  But that doesn't translate
> to a conclusion that we should schedule the I/O instead of the OS.

Synchronous scanning is a fairly huge and obvious win.  If you have two
processes 180 degrees out-of-phase in a linear read, neither process is
going to get anywhere near the throughput they would get from a single
scan.

I think you're being deliberately obtuse with regards to file I/O and
the operating system.  The OS isn't magical.  It has to strike a balance
between a reasonable read latency and a reasonable throughput.  As far
as the kernel is concerned, a busy postgresql server is
indistinguishable from 100 unrelated activities.  All backends will be
served equally, even if in this case "equally" means "quite badly all
around."

An I/O slave process could be a big win in Postgres for many kinds of
reads.  Instead of opening and reading files the backends would connect
to the I/O slave and request the file be read.  If a scan of that file
were already underway, the new backends would be attached.  Otherwise a
new scan would commence.  In either case, the slave process can issue
(sometimes non-dependant) reads well ahead of the needs of the backend.
You may think the OS can do this for you, but it can't.  Only postgres
knows that it needs the whole file from beginning to end.  The OS can
only guess.
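A toy model of such a slave, with a thread standing in for a separate
process and a BytesIO standing in for the heap file (illustrative only,
not a proposed patch):

```python
import io, queue, threading

def io_slave(f, q, blocksize=8192):
    """Read the whole file front to back, staying ahead of the consumer
    by up to the queue's capacity.  The slave *knows* the file is wanted
    from beginning to end; the OS could only guess that."""
    while True:
        block = f.read(blocksize)
        q.put(block)              # blocks once the readahead queue is full
        if not block:             # empty read = EOF sentinel
            return

heap_file = io.BytesIO(b"x" * 100_000)   # stand-in for a heap file
q = queue.Queue(maxsize=8)               # readahead depth: 8 blocks
threading.Thread(target=io_slave, args=(heap_file, q), daemon=True).start()

total = 0                                # the "backend" consumes blocks
while True:
    block = q.get()
    if not block:
        break
    total += len(block)
print(total)  # 100000
```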

Ask me sometime about my replacement for GNU sort.  It uses the same
sorting algorithm, but it's an order of magnitude faster due to better
I/O strategy.  Someday, in my infinite spare time, I hope to demonstrate
that kind of improvement with a patch to pg.

-jwb


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
> As for the async IO, sure you might think 'oh async IO would be so cool!!'
> and I did, once, too. But then I sat down and _thought_ about it, and
> decided well, no, actually, theres _very_ few areas it could actually help,
> and in most cases it just make it easier to drive your box into lseek()
> induced IO collapse.
> 
> Dont forget that already in postgres, you have a process per connection, and
> all the processes take care of their own I/O.

That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
coordinating the I/O of all the backends optimally.  For instance, with
synchronous scanning.

-jwb


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Tue, 2005-08-23 at 19:31 -0700, Josh Berkus wrote:
> Steve,
> 
> > I would assume that dbt2 with STP helps minimize the amount of hours
> > someone has to invest to determine performance gains with configurable
> > options?
> 
> Actually, these I/O operation issues show up mainly with DW workloads, so the 
> STP isn't much use there.   If I can ever get some of these machines back 
> from the build people, I'd like to start testing some stuff.
> 
> One issue with testing this is that currently PostgreSQL doesn't support
> block sizes above 128K.  We've already done testing on that (well, Mark has) and 
> the performance gains aren't even worth the hassle of remembering you're on a 
> different block size (like, +4%).
> 
> What the Sun people have done with other DB systems is show that substantial 
> performance gains are possible on large databases (>100G) using block sizes 
> of 1MB.   I believe that's possible (and that it probably makes more of a 
> difference on Solaris than on BSD) but we can't test it without some hackery 
> first.

To get decent I/O you need 1MB fundamental units all the way down the
stack.  You need a filesystem that can take a 1MB write well, and you
need an I/O scheduler that will keep it together, and you need a storage
controller that can eat a 1MB request at once.  Ideally you'd like an
architecture with a 1MB page (Itanium has this, and AMD64 Linux will
soon have this.)  The Lustre people have done some work in this area,
opening up the datapaths in the kernel so they can keep the hardware
really working.  They even modified the QLogic SCSI/FC driver so it
supports such large transfers.  Their work has shown that you can get
significant perf boost on Linux just by thinking in terms of larger
transfers.

Unfortunately I'm really afraid that this conversation is about trees
when the forest is the problem.  PostgreSQL doesn't even have an async
reader, which is the sort of thing that could double or triple its
performance.  You're talking about block sizes and such, but the kinds
of improvements you can get there are in the tens of percents at most.

-jwb


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Read/Write block sizes (Was: Caching by Postgres)

2005-08-23 Thread Jeffrey W. Baker
On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote:
> On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
> >Actually some of that readaheads,etc  the OS does  already if it does
> >some sort of throttling/clubbing of reads/writes.
> 
> Note that I specified the fully cached case--even with the workload in
> RAM the system still has to process a heck of a lot of read calls.
> 
> >* Introduce a multiblock or extent tunable variable where you can
> >define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk
> >and store it in the bufferpool.. (Maybe writes too) (Most devices now
> >support upto 1MB chunks for reads and writes)
> 
> Yeah. The problem with relying on OS readahead is that the OS doesn't
> know whether you're doing a sequential scan or an index scan; if you
> have the OS aggressively readahead you'll kill your seek performance.
> OTOH, if you don't do readaheads you'll kill your sequential scan
> performance. At the app level you know which makes sense for each
> operation.

This is why we have madvise(MADV_RANDOM) and madvise(MADV_SEQUENTIAL).
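For memory-mapped reads the same hint is available right from userspace.
A minimal sketch (Python's mmap exposes madvise on 3.8+ where the
platform provides madvise(2), hence the guard):

```python
import mmap, tempfile

# Hint the kernel about the access pattern before a big sequential read.
# NOTE: mmap.madvise and the MADV_* constants exist only on Python 3.8+
# and only where the platform provides madvise(2), so we guard the call.
with tempfile.TemporaryFile() as f:
    f.write(b"z" * 65536)
    f.flush()
    mm = mmap.mmap(f.fileno(), 0)
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)   # sequential scan: read ahead aggressively
    data = mm.read()                       # the sequential scan itself
    mm.close()
print(len(data))  # 65536
```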

-jwb

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] unused item pointers?

2005-08-22 Thread Jeffrey W. Baker
On Mon, 2005-08-22 at 22:51 -0400, Alan Stange wrote:
> Hello all,
> 
> what are unused item pointers and how do I get rid of them?
> 
> We have a fairly large table which is vacuumed daily and reindexed every 
> weekend.

> as you can see we have 235M unused item pointers in the main table and a 
> few 10's of millions more in other associated tables. 
> 
> Please note that the advice "vacuum more often" is a non-starter as the 
> total time here is already about 3 hours and this is just one table.  
> This is a fairly active table to which about 20M rows are added and 
> removed daily.

That may be so, but the answer is still to VACUUM more often.  Try the
autovacuum.  If it takes 3 hours with 90% wasted records, it would only
take 20 minutes when running properly.

You might be able to change your application to avoid generating so many
dead rows.  For example, test before insert so you don't make a dead
tuple on duplicate insert.

To repair this table, you can try VACUUM FULL but this is likely to take
longer than you find reasonable.  I would recommend dump and reload.

-jwb


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] (Re)-indexing on updates

2005-08-21 Thread Jeffrey W. Baker
On Sun, 2005-08-21 at 20:32 +0200, Yves Vindevogel wrote:
> 
> 
> __
> 
> Hi,
> 
> Say I have a table with column A, B, C, D
> A has a unique index on it (primary key)
> B and C have a normal index on it
> D has no index
> 
> If I perform a query like: update tbl set D = 'whatever';
> that should make no difference on the indexes on the other columns,
> right ?

What postgresql does on update is to make a new record, so there will be
two records in your table and two records in your index.  You would need
to vacuum the table to mark the space for the old record free, and you
would need to reindex the table to shrink the index.
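A toy model of that bookkeeping (hypothetical class, not the real heap
layout) makes the bloat mechanics concrete:

```python
class ToyMVCCTable:
    """Toy model (hypothetical, not postgresql's actual on-disk format):
    an UPDATE appends a new row version and dead-marks the old one; both
    versions keep their index entries until VACUUM and REINDEX."""
    def __init__(self):
        self.heap = []            # [live?, row] version slots
        self.index_entries = 0    # index pointers ever created

    def insert(self, row):
        self.heap.append([True, row])
        self.index_entries += 1

    def update(self, pos, row):
        self.heap[pos][0] = False   # old version is now dead, not gone
        self.insert(row)            # new version appended at the end

    def dead_versions(self):        # what VACUUM would reclaim
        return sum(1 for live, _ in self.heap if not live)

t = ToyMVCCTable()
t.insert({"D": "old"})
t.update(0, {"D": "whatever"})
print(len(t.heap), t.index_entries, t.dead_versions())  # 2 2 1
```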

> 
> Or is there some kind of mechanism that does create a sort of new
> record, thus makes the indexes go wild.

Yes.

-jwb


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] sustained update load of 1-2k/sec

2005-08-19 Thread Jeffrey W. Baker
On Fri, 2005-08-19 at 10:54 -0400, Ron wrote:
> Maxtor Atlas 15K II's.

> Areca's 1GB buffer RAID cards 

The former are SCSI disks and the latter is an SATA controller.  The
combination would have a transaction rate of approximately 0.

I can vouch for the Areca controllers, however.  You can certainly
achieve pgbench transaction rates in the hundreds per second even with
only 5 7200RPM disks and 128MB cache.

Don't forget to buy the battery.

-jwb

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Query plan looks OK, but slow I/O - settings advice?

2005-08-18 Thread Jeffrey W. Baker
On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
> The disks are ext3 with journalling type of ordered, but this was later 
> changed to writeback with no apparent change in speed.
> 
> They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
> 4 disks raid 10 for indexes (145GB) - sdc1
> 6 disks raid 10 for data (220GB) - sdd1
> 2 mirrored disks for logs - sdb1
> 
> stripe size is 32k
> cache policy: cached io (am told the controller has bbu)
> write policy: write-back
> read policy: readahead

I assume you are using Linux 2.6.  Have you considered booting your
machine with elevator=deadline?  You can also change this at runtime
using sysfs.

These read speeds are not too impressive.  Perhaps this is a slow
controller.  Alternately you might need bigger CPUs.

There's a lot of possibilities, obviously :)  I'd start with the
elevator, since that's easily tested.

-jwb


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Need for speed

2005-08-17 Thread Jeffrey W. Baker
On Wed, 2005-08-17 at 11:15 +0200, Ulrich Wisser wrote:
> Hello,
> 
> thanks for all your suggestions.
> 
> I can see that the Linux system is 90% waiting for disc io. At that time 
> all my queries are *very* slow. My scsi raid controller and disc are 
> already the fastest available.

What RAID controller?  Initially you said you have only 2 disks, and
since you have your xlog on a separate spindle, I assume you have 1 disk
for the xlog and 1 for the data.  Even so, if you have a RAID, I'm going
to further assume you are using RAID 1, since no sane person would use
RAID 0.  In those cases you are getting the performance of a single
disk, which is never going to be very impressive.  You need a RAID.

Please be more precise when describing your system to this list.

-jwb


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Need for speed

2005-08-16 Thread Jeffrey W. Baker
On Tue, 2005-08-16 at 17:39 +0200, Ulrich Wisser wrote:
> Hello,
> 
> one of our services is click counting for on line advertising. We do 
> this by importing Apache log files every five minutes. This results in a 
> lot of insert and delete statements. At the same time our customers 
> shall be able to do on line reporting.
> 
> We have a box with
> Linux Fedora Core 3, Postgres 7.4.2
> Intel(R) Pentium(R) 4 CPU 2.40GHz

This is not a good CPU for this workload.  Try an Opteron or Xeon.  Also
of major importance is the amount of memory.  If possible, you would
like to have memory larger than the size of your database.

> 2 scsi 76GB disks (15.000RPM, 2ms)

If you decide your application is I/O bound, here's an obvious place for
improvement.  More disks == faster.

> I did put pg_xlog on another file system on other discs.

Did that have a beneficial effect?

> Still when several users are on line the reporting gets very slow. 
> Queries can take more then 2 min.

Is this all the time or only during the insert?

> I need some ideas how to improve performance in some orders of 
> magnitude. I already thought of a box with the whole database on a ram 
> disc. So really any idea is welcome.

You don't need a RAM disk, just a lot of RAM.  Your operating system
will cache disk contents in memory if possible.  You have a very small
configuration, so more CPU, more memory, and especially more disks will
probably all yield improvements.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [SPAM?] Re: [PERFORM] PG8 Tuning

2005-08-11 Thread Jeffrey W. Baker
On Fri, 2005-08-12 at 08:47 +, Steve Poe wrote:
> Paul,
> 
> Before I say anything else, one online document which may be of
> assistance to you is:
> http://www.powerpostgresql.com/PerfList/
> 
> Some thoughts I have:
> 
> 3) Your shared RAM setting seems overkill to me. Part of the challenge
> is you're going from 1000 to 262K with no assessment in between. Each
> situation can be different, but try in the range of 10 - 50K.
> 
> 4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog,
> you're better off.

Like Mr. Stone said earlier, this is pure dogma.  In my experience,
xlogs on the same volume with data is much faster if both are on
battery-backed write-back RAID controller memory.  Moving from this
situation to xlogs on a single normal disk is going to be much slower in
most cases.

-jwb

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Performance problems on 4/8way Opteron (dualcore) HP

2005-07-29 Thread Jeffrey W. Baker
On Fri, 2005-07-29 at 10:46 -0700, Josh Berkus wrote:
> Dirk,
> 
> > does anybody have expierence with this machine (4x 875 dual core Opteron
> > CPUs)?

I'm using dual 275s without problems.

> Nope.   I suspect that you may be the first person to report in on 
> dual-cores.  There may be special compile issues with dual-cores that 
> we've not yet encountered.

Doubtful.  However you could see improvements using recent Linux kernel
code.  There have been some patches for optimizing scheduling and memory
allocations.

However, if you are running this machine in 32-bit mode, why did you
bother paying $14,000 for your CPUs?  You will get FAR better
performance in 64-bit mode.  64-bit mode will give you 30-50% better
performance on PostgreSQL loads, in my experience.  Also, if I remember
correctly, the 32-bit x86 kernel doesn't understand Opteron NUMA
topology, so you may be seeing poor memory allocation decisions.

-jwb

> > We run RHEL 3.0, 32bit and under high load it is a drag. We 
> > mostly run memory demanding queries. Context switches are pretty much
> > around 20.000 on the average, no cs spikes when we run many processes in
> > parallel. Actually we only see two processes in running state! When
> > there are only a few processes running context switches go much higher.
> > At the moment we are much slower that with a 4way XEON box (DL580).
> 
> Um, that was a bit incoherent.  Are you seeing a CS storm or aren't you?
> 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Cheap RAM disk?

2005-07-26 Thread Jeffrey W. Baker
On Tue, 2005-07-26 at 11:34 -0500, John A Meinel wrote:
> I saw a review of a relatively inexpensive RAM disk over at 
> anandtech.com, the Gigabyte i-RAM
> http://www.anandtech.com/storage/showdoc.aspx?i=2480
> 
> Basically, it is a PCI card, which takes standard DDR RAM, and has a 
> SATA port on it, so that to the system, it looks like a normal SATA drive.
> 
> The card costs about $100-150, and you fill it with your own ram, so for 
> a 4GB (max size) disk, it costs around $500. Looking for solid state 
> storage devices, the cheapest I found was around $5k for 2GB.
> 
> Gigabyte claims that the battery backup can last up to 16h, which seems 
> decent, if not really long (the $5k solution has a built-in harddrive so 
> that if the power goes out, it uses the battery power to copy the 
> ramdisk onto the harddrive for more permanent storage).
> 
> Anyway, would something like this be reasonable as a drive for storing 
> pg_xlog? With 4GB you could have as many as 256 checkpoint segments.

I haven't tried this product, but the microbenchmarks seem truly slow.
I think you would get a similar benefit by simply sticking a 1GB or 2GB
DIMM -- battery-backed, of course -- in your RAID controller.

-jwb

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Huge performance problem between 7.4.1 and 8.0.3 -

2005-07-17 Thread Jeffrey W. Baker
On Sun, 2005-07-17 at 21:34 -0600, Robert Creager wrote:
> Sigh...
> 
> I recently upgraded from 7.4.1 to 8.0.3.  The application did not change.  I'm
> now running both database concurrently (on different ports, same machine) just
> so I could verify the problem really exists.
> 
> The application is a custom test application for testing mechanical systems. 
> The runs in question (4 at a time) each generate 16 queries at a time of which
> the results are sent to the mechanical system which processes the request, 
> which
> processes them anywhere from 10 to 120 seconds.  The system is capable of
> completing between 4 and 8 jobs at once.  So, once the system is running, at
> most there will be 8 queries per run simultaneously.
> 
> The entire database fits into RAM (2Gb), as evidenced by no disk activity and
> relatively small database size.  pg_xlog is on different disks from the db.
> 
> The problem is that on version 8.0.3, once I get 3 or more concurrent runs
> going, the query times start tanking (>20 seconds).  On 7.4.1, the
> applications hum along with queries typically below .2 seconds on over 5
> concurrent runs.  Needless to say, 7.4.1 behaves as expected...  The only
> change between runs is the port connected to.  Both DBs are up at the
> same time.
> 
> For 8.0.3, pg_autovacuum is running.  On 7.4.1, I set up a cron job to vacuum
> analyze every 5 minutes.
> 
> The system is Mandrake Linux running a 2.4.22 kernel with dual Intel Xeon
> CPUs with HT enabled.  On an 8.0.3 run, the context switching is up around
> 60k.  On 7.4.1, it maxes around 23k and averages < 1k.

Did you build 8.0.3 yourself, or install it from packages?  I've seen in
the past where pg would build with the wrong kind of mutexes on some
machines, and that would send the CS through the roof.  If you did build
it yourself, check your ./configure logs.  If not, try strace.
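A minimal sketch of both checks (the config.log path and the exact configure wording are assumptions; adjust to your build tree):

```shell
# Scan a PostgreSQL build's config.log for signs that configure could not
# find native test-and-set spinlocks and fell back to SysV semaphores.
check_spinlocks() {
    if grep -qi "not using spinlocks" "$1"; then
        echo "BAD: semaphore fallback -- expect high context switching"
    else
        echo "OK: native spinlocks detected"
    fi
}

# For a packaged build with no config.log, strace a busy backend instead:
#   strace -c -p <backend_pid>    # a flood of semop() calls is the symptom
```

Usage would look like `check_spinlocks /path/to/postgresql-8.0.3/config.log`.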

-jwb



[PERFORM] more filesystem benchmarks

2005-07-16 Thread Jeffrey W. Baker
In our last installment, we saw that JFS provides higher pgbench
performance than either XFS or ext3.  Using a direct-I/O patch stolen
from 8.1, JFS achieved 105 tps with 100 clients.

To refresh, the machine in question has 5 7200RPM SATA disks, an Areca
RAID controller with 128MB cache, and 1GB of main memory.  pgbench is
being run with a scale factor of 1000 and 100,000 total transactions.

At the suggestion of Andreas Dilger of clusterfs, I tried modulating the
size of the ext3 journal, and the mount options (data=journal,
writeback, and ordered).  It turns out that you can achieve a substantial
improvement (almost 50%) by simply mounting the ext3 volume with
data=writeback instead of data=ordered (the default).  Changing the
journal size did not seem to make a difference, except that 256MB is for
some reason pathological (9% slower than the best time).  128MB, the
default for a large volume, gave the same performance as 400MB (the max)
or 32MB.

In the end, the ext3 volume mounted with -o noatime,data=writeback
yielded 88 tps with 100 clients.  This is about 16% off the performance
of JFS with default options.
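
For reference, a minimal sketch of how such a volume could be configured; the device name and mountpoint are assumptions:

```
# Journal size is fixed at mkfs time (in MB), e.g. for the 128MB default:
#   mke2fs -j -J size=128 /dev/sdb1
# /etc/fstab entry using the best-performing mount options above:
/dev/sdb1  /var/lib/pgsql  ext3  noatime,data=writeback  0 2
```

Note that data=writeback weakens crash ordering guarantees for file data; it is often argued to be acceptable under PostgreSQL because the WAL protects committed transactions, but verify against your own durability requirements.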

Andreas pointed me to experimental patches to ext3's block allocation
code and writeback strategy.  I will test these, but I expect the
database community, which seems so attached to its data, will not be
very interested in code that has not yet entered mainstream use.

Another frequent suggestion is to put the xlog on a separate device.  I
tried this, and, for a given number of disks, it appears to be
counter-productive.  A RAID5 of 5 disks holding both logs and data is
about 15% faster than a RAID5 of 3 disks with the data, and a mirror of
two disks holding the xlog.

Here are the pgbench results for each permutation of ext3:

Journal Size | Journal Mode | 1 Client | 10 Clients | 100 Clients
-------------+--------------+----------+------------+------------
     32      |  ordered     |    28    |     51     |     57
     32      |  writeback   |    34    |     70     |     88
     64      |  ordered     |    29    |     52     |     61
     64      |  writeback   |    32    |     69     |     87
    128      |  ordered     |    32    |     54     |     62
    128      |  writeback   |    34    |     70     |     88
    256      |  ordered     |    28    |     51     |     60
    256      |  writeback   |    29    |     64     |     79
    400      |  ordered     |    26    |     49     |     59
    400      |  writeback   |    32    |     70     |     87

-jwb



Re: [PERFORM] Really bad diskio

2005-07-15 Thread Jeffrey W. Baker
On Fri, 2005-07-15 at 15:29 -0600, Ron Wills wrote:
> Here's a bit of a dump of the system that should be useful.
> 
> Processors x2:
> 
> vendor_id   : AuthenticAMD
> cpu family  : 6
> model   : 8
> model name  : AMD Athlon(tm) MP 2400+
> stepping: 1
> cpu MHz : 2000.474
> cache size  : 256 KB
> 
> MemTotal:   903804 kB
> 
> Mandrake 10.0 Linux kernel 2.6.3-19mdk
> 
> The raid controller, which is using the hardware raid configuration:
> 
> 3ware 9000 Storage Controller device driver for Linux v2.26.02.001.
> scsi0 : 3ware 9000 Storage Controller
> 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xe802, IRQ: 17.
> 3w-9xxx: scsi0: Firmware FE9X 2.02.00.011, BIOS BE9X 2.02.01.037, Ports: 4.
>   Vendor: 3ware Model: Logical Disk 00   Rev: 1.00
>   Type:   Direct-Access  ANSI SCSI revision: 00
> SCSI device sda: 624955392 512-byte hdwr sectors (319977 MB)
> SCSI device sda: drive cache: write back, no read (daft)
> 
> This is also on a 3.6 reiser filesystem.
> 
> Here's the iostat for 10mins every 10secs. I've removed the stats from
> the idle drives to reduce the size of this email.
> 
> Linux 2.6.3-19mdksmp (photo_server)   07/15/2005
> 
> avg-cpu:  %user   %nice%sys %iowait   %idle
>2.851.532.15   39.52   53.95
> 
> Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda  82.49  4501.73   188.38 1818836580   76110154
> 
> avg-cpu:  %user   %nice%sys %iowait   %idle
>0.300.001.00   96.302.40
> 
> Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda  87.80  6159.20   340.00  61592   3400

These I/O numbers are not so horrible, really.  100% iowait is not
necessarily a symptom of misconfiguration.  It just means you are disk
limited.  With a database 20 times larger than main memory, this is no
surprise.

If I had to speculate about the best way to improve your performance, I
would say:

1a) Get a better RAID controller.  The 3ware hardware RAID5 is very bad.
1b) Get more disks.
2) Get a (much) newer kernel.
3) Try XFS or JFS.  Reiser3 has never looked good in my pgbench runs.

By the way, are you experiencing bad application performance, or are you
just unhappy with the iostat figures?

Regards,
jwb




Re: [PERFORM] Really bad diskio

2005-07-15 Thread Jeffrey W. Baker
On Fri, 2005-07-15 at 15:04 -0600, Ron Wills wrote:
> At Fri, 15 Jul 2005 13:45:07 -0700,
> Joshua D. Drake wrote:
> > 
> > Ron Wills wrote:
> > > Hello all
> > > 
> > >   I'm running a postgres 7.4.5, on a dual 2.4Ghz Athlon, 1Gig RAM and
> > > an 3Ware SATA raid. 
> > 
> > 2 drives?
> > 4 drives?
> > 8 drives?
> 
>   3 drives, RAID 5. I don't believe it's the raid. I've tested this by
> moving the database to the mirrored software raid where the root is
> found, and back onto the SATA raid. Neither relieved the IO problems.

Hard or soft RAID?  Which controller?  Many of the 3Ware controllers
(85xx and 95xx) have extremely bad RAID 5 performance.

Did you take any pgbench or other benchmark figures before you started
using the DB?

-jwb



Re: [PERFORM] Really bad diskio

2005-07-15 Thread Jeffrey W. Baker
On Fri, 2005-07-15 at 14:39 -0600, Ron Wills wrote:
> Hello all
> 
>   I'm running postgres 7.4.5 on a dual 2.4GHz Athlon, 1GB RAM and
> a 3Ware SATA RAID. Currently the database is only 16GB, with about 2
> tables with 50+ rows, one table with 20+ rows and a few small
> tables. The larger tables get updated about every two hours. The
> problem I'm having with this server (which is in production) is the disk
> IO. On the larger tables I'm getting disk IO wait averages of
> ~70-90%. I've been tweaking the Linux kernel as specified in the
> PostgreSQL documentation and switched to the deadline
> scheduler. Nothing seems to be fixing this. The queries are as
> optimized as I can get them. fsync is off in an attempt to help
> performance; still nothing. Are there any settings I should be
> looking at that could improve this?

Can you please characterize this a bit better?  Send the output of
vmstat or iostat over several minutes, or similar diagnostic
information.

Also please describe your hardware more.
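
To make the numbers easy to compare across samples, the %iowait column can be pulled out of each report stanza with a one-liner like this (a sketch; the here-document merely mimics iostat's output format):

```shell
# Print the %iowait value from each avg-cpu stanza of an iostat report.
# The here-document stands in for a real "iostat 10" capture.
awk '/avg-cpu/ { getline; print "iowait:", $4 }' <<'EOF'
avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.85    1.53    2.15   39.52   53.95
avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.30    0.00    1.00   96.30    2.40
EOF
```

In practice you would run something like `iostat 10 30 > iostat.log` (sysstat package assumed) and feed the log to the awk line.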

Regards,
Jeff Baker



Re: [PERFORM] What's a lot of connections?

2005-07-15 Thread Jeffrey W. Baker
On Fri, 2005-07-15 at 00:00 -0700, Karim Nassar wrote:
> I am working on a system that uses postgresql 7.4.2 (can't change that
> until 8.1 goes stable). Just figured out that there are about 285,000
> connections created over about 11 hours every day. That averages out to
> about 7.2 connections per second.
> 
> Is that a lot? I've never seen that many.

I see about 8 million connections per full day.  Connecting to postgres
is cheap.
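
As a throwaway sanity check on both figures:

```shell
# Connections per second: the poster's 285k over 11 hours,
# vs. roughly 8M over a full 24-hour day.
awk 'BEGIN { printf "%.1f %.1f\n", 285000/(11*3600), 8000000/(24*3600) }'
```

That works out to about 7.2/s for the poster versus roughly 92.6/s here.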

-jwb



[PERFORM] JFS fastest filesystem for PostgreSQL?

2005-07-14 Thread Jeffrey W. Baker
I just took delivery of a new system, and used the opportunity to
benchmark postgresql 8.0 performance on various filesystems.  The system
in question runs Linux 2.6.12, has one CPU and 1GB of system memory, and
5 7200RPM SATA disks attached to an Areca hardware RAID controller
having 128MB of cache.  The caches are all write-back.

I ran pgbench with a scale factor of 1000 and a total of 100,000
transactions per run.  I varied the number of clients between 10 and
100.  It appears from my tests that JFS is much faster than both ext3 and XFS
for this workload.  JFS and XFS were made with the mkfs defaults.  ext3
was made with -T largefile4 and -E stride=32.  The deadline scheduler
was used for all runs (anticipatory scheduler is much worse).
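
For anyone reproducing this, the runs look roughly like the following (a sketch; option spelling per the 8.0-era pgbench client, database name assumed):

```shell
# Derive the per-client -t value from the fixed 100,000-transaction total,
# and print the pgbench invocation used for each client count.
TOTAL=100000
for CLIENTS in 10 100; do
    T=$((TOTAL / CLIENTS))
    echo "pgbench -c $CLIENTS -t $T bench"
done
# One-time setup (not run here):
#   pgbench -i -s 1000 bench                       # scale factor 1000
#   echo deadline > /sys/block/sda/queue/scheduler # per-device I/O elevator
```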

Here's the result, in transactions per second.

              ext3   jfs   xfs
 10 Clients     55    81    68
100 Clients     61   100    64


-jwb



Re: [PERFORM] JFS fastest filesystem for PostgreSQL?

2005-07-14 Thread Jeffrey W. Baker
On Thu, 2005-07-14 at 10:03 +0200, Dawid Kuroczko wrote:
> On 7/14/05, Jeffrey W. Baker <[EMAIL PROTECTED]> wrote:
> > [reposted due to delivery error -jwb]
> > 
> > I just took delivery of a new system, and used the opportunity to
> > benchmark postgresql 8.0 performance on various filesystems.  The system
> > in question runs Linux 2.6.12, has one CPU and 1GB of system memory, and
> > 5 7200RPM SATA disks attached to an Areca hardware RAID controller
> > having 128MB of cache.  The caches are all write-back.
> > 
> > I ran pgbench with a scale factor of 1000 and a total of 100,000
> > transactions per run.  I varied the number of clients between 10 and
> > 100.  It appears from my test JFS is much faster than both ext3 and XFS
> > for this workload.  JFS and XFS were made with the mkfs defaults.  ext3
> > was made with -T largefile4 and -E stride=32.  The deadline scheduler
> > was used for all runs (anticipatory scheduler is much worse).
> > 
> > Here's the result, in transactions per second.
> > 
> >               ext3   jfs   xfs
> >  10 Clients     55    81    68
> > 100 Clients     61   100    64
> > 
> 
> If you still have a chance, could you do tests with other journaling
> options for ext3 (data=writeback, data=journal)?  And could you
> give figures about performance of other IO elevators?  I mean, you
> wrote that anticipatory is much worse -- how much worse? :)  Could
> you give numbers for deadline, anticipatory, cfq elevators? :)
> 
> And, additionally, would it be possible to give numbers for bonnie++
> results?  To see how pgbench relates to bonnie++?

Phew, that's a lot of permutations.  At 20-30 minutes per run, I'm
thinking 5-8 hours or so.  Still, for you dear readers, I'll somehow
accomplish this tedious feat.

As for Bonnie, JFS is a good 60-80% faster than ext3.  See my message to
ext3-users yesterday.

Using bonnie++ with a 10GB fileset, in MB/s:

           ext3   jfs   xfs
Read        112   188   141
Write        97   157   167
Rewrite      51    71    60

-jwb



[PERFORM] JFS fastest filesystem for PostgreSQL?

2005-07-13 Thread Jeffrey W. Baker
[reposted due to delivery error -jwb]

I just took delivery of a new system, and used the opportunity to
benchmark postgresql 8.0 performance on various filesystems.  The system
in question runs Linux 2.6.12, has one CPU and 1GB of system memory, and
5 7200RPM SATA disks attached to an Areca hardware RAID controller
having 128MB of cache.  The caches are all write-back.

I ran pgbench with a scale factor of 1000 and a total of 100,000
transactions per run.  I varied the number of clients between 10 and
100.  It appears from my tests that JFS is much faster than both ext3 and XFS
for this workload.  JFS and XFS were made with the mkfs defaults.  ext3
was made with -T largefile4 and -E stride=32.  The deadline scheduler
was used for all runs (anticipatory scheduler is much worse).

Here's the result, in transactions per second.

              ext3   jfs   xfs
 10 Clients     55    81    68
100 Clients     61   100    64


-jwb
