Re: [PERFORM] index structure for 114-dimension vector

2007-04-27 Thread Arjen van der Meijden

On 21-4-2007 1:42 Mark Kirkwood wrote:

I don't think that will work for the vector norm i.e:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))


I don't know if this is useful here, but I was able to rewrite that 
algorithm for a set of very sparse vectors (i.e. they had very few 
overlapping components) to something like:

|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
   + for each j where x[j] and y[j] are both non-zero:
     - (x[j]^2 + y[j]^2) + (x[j] - y[j])^2


The first two partial sums can be calculated only once. So if you have 
very little overlap, this is much more efficient (if there is no overlap 
at all you end up with sum x[j]^2 + sum y[j]^2 anyway). Besides, this 
rewritten calculation allows you to store the X and Y vectors using a 
trivial table-layout vector(x,i,value) which is only filled with 
non-zero entries and which you can trivially self-join to find the closest 
matches. With this rewrite, you don't care about the j's where either the 
x or the y value is zero anyway.
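
Something like this minimal SQL sketch of that layout and self-join (all 
table and column names are hypothetical, and the per-vector sums of 
squares are assumed to be precomputed once per vector):

  -- Sparse storage: one row per non-zero component.
  CREATE TABLE vec (
      id    integer NOT NULL,   -- vector identifier
      j     integer NOT NULL,   -- component index
      value float8  NOT NULL,   -- non-zero component value
      PRIMARY KEY (id, j)
  );

  -- Cache the constant partial sums once.
  CREATE TABLE vec_norm AS
      SELECT id, sum(value * value) AS sumsq FROM vec GROUP BY id;

  -- Distance between vectors 1 and 2: the two cached sums plus a
  -- correction term over the overlapping components only.
  SELECT sqrt(
             (SELECT sumsq FROM vec_norm WHERE id = 1)
           + (SELECT sumsq FROM vec_norm WHERE id = 2)
           + coalesce((SELECT sum((x.value - y.value)^2
                                  - x.value^2 - y.value^2)
                         FROM vec x JOIN vec y ON x.j = y.j
                        WHERE x.id = 1 AND y.id = 2), 0)) AS dist;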


I can compare over 1000 y's of on average 100 elements to two x's of 
over 1000 elements on just a single 1.8GHz AMD processor. (I use it for 
a bi-kmeans algorithm, so there are only two buckets to compare to).


So it might be possible to rewrite your algorithm to be less 
calculation-intensive. Obviously, with a dense-matrix this isn't going 
to work, but there may be other ways to circumvent parts of the 
algorithm or to cache large parts of it.
It might also help to extract only the 6 relevant columns into a 
separate temporary table, which will have much smaller records and thus 
can fit more records per page.
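
A minimal sketch of that, assuming a hypothetical source table t whose 
relevant columns are c1..c6:

  -- Narrow copy: fewer bytes per row means more rows per 8kB page.
  CREATE TEMP TABLE t_slim AS
      SELECT c1, c2, c3, c4, c5, c6 FROM t;

  ANALYZE t_slim;  -- give the planner statistics for the new table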


Best regards,

Arjen



Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)

2007-04-27 Thread Florian Weimer
* Bill Moran:

 To clarify my viewpoint:
 To my knowledge, there is no Unix filesystem that _suffers_ from
 fragmentation.  Specifically, all filesystems have some degree of
 fragmentation that occurs, but every Unix filesystem that I am aware of
 has built-in mechanisms to mitigate this and prevent it from becoming
 a performance issue.

One database engine tends to create a huge number of fragments because
the files are written with holes in them.  There is a significant
impact on sequential reads, but that doesn't matter much because the
engine doesn't implement fast, out-of-order B-tree scans anyway. 8-/

I still think that preallocating in reasonably sized chunks is
beneficial.

-- 
Florian Weimer[EMAIL PROTECTED]
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [PERFORM] Usage up to 50% CPU

2007-04-27 Thread Magnus Hagander
On Fri, Apr 27, 2007 at 04:43:06AM +, Andres Retzlaff wrote:
 Hi,
 
 I have pg 8.1.4 running in
 Windows XP Pro
 with a Pentium D
 
 and I notice that I can not use more than 50% of the cpus (Pentium D has 2 
 cpus). How can I change the settings to use 100% of it?

A single query will only use one CPU. If you have multiple parallel
clients, they will use the different CPUs.

//Magnus




Re: [PERFORM] Usage up to 50% CPU

2007-04-27 Thread Andres Retzlaff

Hi Magnus,

in this case each CPU goes up to 50%, giving me 50% total usage. I was 
expecting, as you say, 1 query using 100% of a CPU.


Any ideas?

Andrew

On Fri, Apr 27, 2007 at 04:43:06AM +, Andres Retzlaff wrote:
 Hi,

 I have pg 8.1.4 running in
 Windows XP Pro
 with a Pentium D

 and I notice that I can not use more than 50% of the cpus (Pentium D has 2
 cpus). How can I change the settings to use 100% of it?

A single query will only use one CPU. If you have multiple parallel
clients, they will use the different CPUs.

//Magnus






Re: [PERFORM] Usage up to 50% CPU

2007-04-27 Thread Magnus Hagander
On Fri, Apr 27, 2007 at 08:10:48AM +, Andres Retzlaff wrote:
 Hi Magnus,
 
 in this case each CPU goes up to 50%, giving me 50% total usage. I was 
 expecting, as you say, 1 query using 100% of a CPU.
 
 Any ideas?

No. 1 query will only use 100% of *one* CPU, which means 50% total usage.
You need at least one query per CPU to reach 100% usage of the whole system.
(Actually, you can get slightly above 50%, since the query will run on one
CPU and the OS and autovacuum and bgwriter can run on the other. But only
marginally.)

//Magnus




Re: [PERFORM] Usage up to 50% CPU

2007-04-27 Thread Shane Ambler

Magnus Hagander wrote:

On Fri, Apr 27, 2007 at 08:10:48AM +, Andres Retzlaff wrote:

Hi Magnus,

in this case each CPU goes up to 50%, giving me 50% total usage. I was 
expecting, as you say, 1 query using 100% of a CPU.


Any ideas?


No. 1 query will only use 100% of *one* CPU, which means 50% total usage.
You need at least one query per CPU to reach 100% usage of the whole system.
(Actually, you can get slightly above 50%, since the query will run on one
CPU and the OS and autovacuum and bgwriter can run on the other. But only
marginally.)

//Magnus


I would think that as you are sitting and watching the cpu usage, your 
query would seem to be taking a while to run, leading me to wonder if you 
are getting a full table scan that is causing pg to wait for disk response?


Or are you running a long list of steps that take a while?
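
A quick way to check, with a hypothetical query (substitute your own 
slow one):

  EXPLAIN ANALYZE SELECT count(*) FROM some_table;

If the plan shows a "Seq Scan on some_table" node that dominates the 
total runtime, the query is reading the whole table and is likely 
disk-bound rather than CPU-bound.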



--

Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz



Re: [PERFORM] Usage up to 50% CPU

2007-04-27 Thread Michael Stone

On Fri, Apr 27, 2007 at 07:23:41PM +0930, Shane Ambler wrote:
I would think that as you are sitting and watching the cpu usage, your 
query would seem to be taking a while to run, leading me to wonder if you 
are getting a full table scan that is causing pg to wait for disk response?


If so, you probably wouldn't be seeing that much cpu usage... (if the 
cpu is waiting for disk it's idling)


Mike Stone



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Carlos Moreno

Tom Lane wrote:

Carlos Moreno [EMAIL PROTECTED] writes:
  

...  But, wouldn't it make sense that the configure script
determines the amount of physical memory and perhaps even do a HD
speed estimate to set up defaults that are closer to a
performance-optimized configuration?



No.  Most copies of Postgres these days are executed on machines very
far away from where the code was built.  It's a little bit safer to
try to tune things at initdb time ... as indeed we already do. 


D'oh!  Yes, that makes more sense, of course.


 But
the fundamental problem remains that we don't know that much about
how the installation will be used. 


Notice that the second part of my suggestion covers this --- have
additional switches to initdb so that the user can tell it about
estimates on how the DB will be used: estimated size of the DB,
estimated percentage of activity that will involve writing, estimated
percentage of activity that will be transactions, percentage that will
use indexes, percentage of queries that will be complex, etc. etc.

Wouldn't initdb be able to do a better job at coming up with sensible
defaults if it could count on this information?  Of course, all these
parameters would have their own defaults --- the user won't necessarily
know or have an accurate estimate for each and every one of them.


Also, there is an extremely good reason why Postgres will never be set
up to try to take over the whole machine by default: most of the
developers run multiple postmasters on their machines.
  

Wouldn't this be covered by the above suggestion?  One of the switches
for the command initdb could allow the user to specify how many
instances will be run (I assume you're talking about having different
instances listening on different ports for increased
concurrency-related benefits?)

Does my suggestion make more sense now?  Or is it still too unrealistic to
make it work properly/safely?

Carlos
--




Re: [PERFORM] What`s wrong with JFS configuration?

2007-04-27 Thread Jim Nasby

Adding -performance back in so others can learn.

On Apr 26, 2007, at 9:40 AM, Paweł Gruszczyński wrote:


Jim Nasby wrote:

On Apr 25, 2007, at 8:51 AM, Paweł Gruszczyński wrote:
where u6 stores Fedora Core 6 operating system, and u0 stores 3  
partitions with ext2, ext3 and jfs filesystem.


Keep in mind that drives have a faster data transfer rate at the  
outer-edge than they do at the inner edge, so if you've got all 3  
filesystems sitting on that array at the same time it's not a fair  
test. I heard numbers on the impact of this a *long* time ago and  
I think it was in the 10% range, but I could be remembering wrong.


You'll need to drop each filesystem and create the next one to get  
a fair comparison.


I thought about it, but my situation is not so clear, because my hard  
drive for postgresql data is a logical one, because of the RAID array  
in mode 1+0. My RAID array is divided like this:


  Device Boot  Start End  Blocks   Id  System
/dev/sda1   1  159850   163686384   83  Linux
/dev/sda2  159851  319431   163410944   83  Linux
/dev/sda3  319432  478742   163134464   83  Linux

and partitions are:

/dev/sda1 ext2   161117780   5781744 147151720   4% /fs/ext2
/dev/sda2 ext3   160846452   2147848 150528060   2% /fs/ext3
/dev/sda3  jfs   163096512   3913252 159183260   3% /fs/jfs

so if RAID 1+0 does not change anything, the JFS file system is on the  
third partition, which is at the end of the hard drive.


Yes, which means that JFS is going to be at a disadvantage to ext3,  
which will be at a disadvantage to ext2. You should really re-perform  
the tests with each filesystem in the same location.


What about an HDD with two magnetic platters? Then the speed depending  
on the partition's physical location is more difficult to calculate ;)  
Probably the first is slow, the second is fast in its first half and  
slow in its second half, and the third is the fastest one. In both  
cases my JFS partition should be at the end of the magnetic disk. Am I  
wrong?


I'm not a HDD expert, but as far as I know the number of platters  
doesn't change anything. When you have multiple platters, the drive  
essentially splits bytes across all the platters; it doesn't start  
writing one platter, then switch to another platter.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Michael Stone

On Fri, Apr 27, 2007 at 09:27:49AM -0400, Carlos Moreno wrote:
Notice that the second part of my suggestion covers this --- have
additional switches to initdb so that the user can tell it about
estimates on how the DB will be used: estimated size of the DB,
estimated percentage of activity that will involve writing, estimated
percentage of activity that will be transactions, percentage that will
use indexes, percentage of queries that will be complex, etc. etc.


If the person knows all that, why wouldn't they know to just change the 
config parameters?


Mike Stone



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Tom Lane
Carlos Moreno [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 But
 the fundamental problem remains that we don't know that much about
 how the installation will be used. 

 Notice that the second part of my suggestion covers this --- have 
 additional switches to initdb

That's been proposed and rejected before, too; the main problem being
that initdb is frequently a layer or two down from the user (eg,
executed by initscripts that can't pass extra arguments through, even
assuming they're being invoked by hand in the first place).

regards, tom lane



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Mark Lewis
Maybe he's looking for a switch for initdb that would make it
interactive and quiz you about your expected usage-- sort of a magic
auto-configurator wizard doohicky?  I could see that sort of thing being
nice for the casual user or newbie who otherwise would have a horribly
mis-tuned database.  They could instead have only a marginally mis-tuned
database :)

On Fri, 2007-04-27 at 10:30 -0400, Michael Stone wrote:
 On Fri, Apr 27, 2007 at 09:27:49AM -0400, Carlos Moreno wrote:
 Notice that the second part of my suggestion covers this --- have
 additional switches to initdb so that the user can tell it about
 estimates on how the DB will be used: estimated size of the DB,
 estimated percentage of activity that will involve writing, estimated
 percentage of activity that will be transactions, percentage that will
 use indexes, percentage of queries that will be complex, etc. etc.
 
 If the person knows all that, why wouldn't they know to just change the 
 config parameters?
 
 Mike Stone
 



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread H.J. Sanders

Hello.


Just my 2 cents, and not looking at the technical aspects:

setting up PSQL is the weakest point of PSQL, as we have experienced
ourselves; once it is running it is great.

I can imagine that a lot of people stop after their first trials, after
they have experienced the troubles and performance of a standard setup.

This is of course a lost user forever.

So anything that could be done to get an easier and BETTER setup would
strongly enhance PSQL.

My 2 cents.

Henk Sanders



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Friday, April 27, 2007 16:37
To: Carlos Moreno
CC: PostgreSQL Performance
Subject: Re: [PERFORM] Feature Request --- was: PostgreSQL Performance
Tuning


Carlos Moreno [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 But
 the fundamental problem remains that we don't know that much about
 how the installation will be used.

 Notice that the second part of my suggestion covers this --- have
 additional switches to initdb

That's been proposed and rejected before, too; the main problem being
that initdb is frequently a layer or two down from the user (eg,
executed by initscripts that can't pass extra arguments through, even
assuming they're being invoked by hand in the first place).

regards, tom lane



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Michael Stone

On Fri, Apr 27, 2007 at 07:36:52AM -0700, Mark Lewis wrote:

Maybe he's looking for a switch for initdb that would make it
interactive and quiz you about your expected usage-- sort of a magic
auto-configurator wizard doohicky?  I could see that sort of thing being
nice for the casual user or newbie who otherwise would have a horribly
mis-tuned database.  They could instead have only a marginally mis-tuned
database :)


However you implement it, anyone who can answer all of those questions 
is probably capable of reading and understanding the performance section 
in the manual. 

It's probably more practical to have a separate script that looks at the 
running system (ram, disks, pg config, db size, indices, stats, etc.) 
and makes suggestions--if someone wants to write such a thing.


Mike Stone




[PERFORM] How can fixed and variable width columns perform similarly?

2007-04-27 Thread Siddharth Anand
Hi!
I read the link below and am puzzled by or curious about something.
http://www.postgresql.org/docs/8.1/interactive/datatype-character.html

The Tip below is intriguing

Tip:  There are no performance differences between these three types,
apart from the increased storage size when using the blank-padded type.
While character(n) has performance advantages in some other database
systems, it has no such advantages in PostgreSQL. In most situations text
or character varying should be used instead.

How can a field that doesn't have a limit like text perform similarly to
char varying(128), for example? At some point, we need to write data to
disk. The more data that needs to be written, the longer the disk write
will take, especially when it requires finding free sectors to write to.

Another interesting quote from the same page is the following:

Long values are also stored in background tables so they do not interfere
with rapid access to the shorter column values. 

If the long values are stored in a separate table, on a different part of
the disk, doesn't this imply an extra disk seek? Won't it therefore take
longer?


Sid










Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Jim Nasby

On Apr 27, 2007, at 3:30 PM, Michael Stone wrote:

On Fri, Apr 27, 2007 at 09:27:49AM -0400, Carlos Moreno wrote:
Notice that the second part of my suggestion covers this --- have  
additional
switches to initdb so that the user can tell it about estimates on  
how the DB
will be used:  estimated size of the DB, estimated percentage of  
activity that
will involve writing, estimated percentage of activity that will  
be transactions,
percentage that will use indexes, percentage of queries that will  
be complex,

etc. etc.


If the person knows all that, why wouldn't they know to just change  
the config parameters?


Because knowing your expected workload is a lot easier for many  
people than knowing what every GUC does.


Personally, I think it would be a tremendous start if we just  
provided a few sample configs like MySQL does. Or if someone wanted  
to get fancy they could stick a web page somewhere that would produce  
a postgresql.conf based simply on how much available RAM you had,  
since that's one of the biggest performance-hampering issues we run  
into (ie: shared_buffers left at the default of 32MB).

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] How can fixed and variable width columns perform similarly?

2007-04-27 Thread Tom Lane
Siddharth Anand [EMAIL PROTECTED] writes:
 How can a field that doesn't have a limit like text perform similarly to
 char varying(128), for example? At some point, we need to write data to
 disk. The more data that needs to be written, the longer the disk write
 will take, especially when it requires finding free sectors to write to.

What's your point?  If you're not going to put more than 128 characters
in the field, there's no difference in the amount of data involved.

regards, tom lane



Re: [PERFORM] [Fwd: ] How

2007-04-27 Thread Richard Huxton

Siddharth Anand wrote:

Hi!
I read the link below and am puzzled by or curious about something.
http://www.postgresql.org/docs/8.1/interactive/datatype-character.html

The Tip below is intriguing

Tip:  There are no performance differences between these three types,
apart from the increased storage size when using the blank-padded type.
While character(n) has performance advantages in some other database
systems, it has no such advantages in PostgreSQL. In most situations text
or character varying should be used instead.

How can a field that doesn't have a limit like text perform similarly to
char varying(128), for example? At some point, we need to write data to
disk. The more data that needs to be written, the longer the disk write
will take, especially when it requires finding free sectors to write to.


There's no difference *for the same amount of data*. So char(128), 
varchar(128) with 128 characters and text with 128 characters in it are 
all the same. This isn't always the case with other systems.



Another interesting quote from the same page is the following:

Long values are also stored in background tables so they do not interfere
with rapid access to the shorter column values. 

If the long values are stored in a separate table, on a different part of
the disk, doesn't this imply an extra disk seek? Won't it therefore take
longer?


Yes. But you gain every time you read from the table and aren't 
interested in that column. Typically large text columns contain 
descriptive text and aren't used in joins, so it pays for itself quite 
easily.
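
As an illustrative sketch, with hypothetical table and column names:

  CREATE TABLE docs (
      id    serial PRIMARY KEY,
      title varchar(200),
      body  text   -- long values are moved out of line automatically
  );

  -- Never touches the out-of-line body values, so this stays fast:
  SELECT id, title FROM docs WHERE title LIKE 'Re:%';

  -- Only a query that actually reads body pays the extra fetch:
  SELECT body FROM docs WHERE id = 42;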


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] How can fixed and variable width columns perform similarly?

2007-04-27 Thread Dave Dutcher
I think the manual is implying that if you store a value like 'Sid' in a
field either of type varchar(128) or type text, there is no performance
difference.  The manual is not saying that you get the same performance
storing a 500kB text value as when you store the value 'Sid'.

Dave


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Siddharth Anand
Sent: Friday, April 27, 2007 10:32 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] How can fixed and variable width columns perform
similarly?

Hi!
I read the link below and am puzzled by or curious about something.
http://www.postgresql.org/docs/8.1/interactive/datatype-character.html

The Tip below is intriguing

Tip:  There are no performance differences between these three types,
apart from the increased storage size when using the blank-padded type.
While character(n) has performance advantages in some other database
systems, it has no such advantages in PostgreSQL. In most situations text
or character varying should be used instead.

How can a field that doesn't have a limit like text perform similarly to
char varying(128), for example? At some point, we need to write data to
disk. The more data that needs to be written, the longer the disk write
will take, especially when it requires finding free sectors to write to.

Another interesting quote from the same page is the following:

Long values are also stored in background tables so they do not interfere
with rapid access to the shorter column values. 

If the long values are stored in a separate table, on a different part of
the disk, doesn't this imply an extra disk seek? Won't it therefore take
longer?


Sid










Re: [PERFORM] How can fixed and variable width columns perform similarly?

2007-04-27 Thread Siddharth Anand
Hi Tom,
My question wasn't phrased clearly. Oracle exhibits a performance
degradation for very large-sized fields (CLOB types that I equate to
PostGres' text type) when compared with the performance of field types
like varchar that handle a max character limit of a few thousand bytes in
Oracle.

It sounds like PostGres doesn't exhibit this same difference. I wanted to
understand how this could be and whether there was a trade-off.

Cheers!
Sid
 Siddharth Anand [EMAIL PROTECTED] writes:
 How can a field that doesn't have a limit like text perform similarly
 to
 char varying(128), for example? At some point, we need to write data to
 disk. The more data that needs to be written, the longer the disk write
 will take, especially when it requires finding free sectors to write to.

 What's your point?  If you're not going to put more than 128 characters
 in the field, there's no difference in the amount of data involved.

   regards, tom lane







Re: [PERFORM] How can fixed and variable width columns perform similarly?

2007-04-27 Thread Tom Lane
Siddharth Anand [EMAIL PROTECTED] writes:
 My question wasn't phrased clearly. Oracle exhibits a performance
 degradation for very large-sized fields (CLOB types that I equate to
 PostGres' text type) when compared with the performance of field types
 like varchar that handle a max character limit of a few thousand bytes in
 Oracle.

 It sounds like PostGres doesn't exhibit this same difference. I wanted to
 understand how this could be and whether there was a trade-off.

Ah.  Well, the answer is that we change behavior dynamically depending
on the size of the particular field value, instead of hard-wiring it to
the declared column type.  It sounds like Oracle's CLOB might be doing
about the same thing as an out-of-line toasted field value in
Postgres.  In PG, text and varchar behave identically except that
varchar(N) adds an insert-time check on the length of the field value
--- but this is just a constraint check and doesn't have any direct
influence on how the value is stored.
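
For instance:

  CREATE TABLE t (a varchar(5), b text);
  INSERT INTO t VALUES ('hello', 'hello');   -- both fine
  INSERT INTO t VALUES ('hello!', 'x');      -- trips only the varchar check:
  -- ERROR:  value too long for type character varying(5)
  INSERT INTO t VALUES ('hi', repeat('x', 100000));  -- fine; the long value
                                                     -- is toasted out of line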

regards, tom lane



[PERFORM] Very specific server situation

2007-04-27 Thread Mauro N. Infantino
Hi,

We're facing some performance problems with the database for a web site with
very specific needs. First of all, we're using version 8.1 on a server with
1GB of RAM. I know memory normally should be more, but as our tables are not
so big (as a matter of fact, they are small) I think the solution would not
be adding more RAM.

What we basically have is a site where each user has a box with links to
other randomly selected users. Whenever a box from a user is shown, an SP is
executed: a credit is added to that user and a credit is subtracted from
the accounts of the shown links. Accounts with no credits do not have to be
listed. So, we've got lots (LOTS) of users querying and updating the same
table. Sometimes with big peaks.

Our first attempt was to split that table in two: one for the actual credits
and another one for the users. So, only the credits table gets updated on
every request, but it has a trigger that updates a flag field in the users
table saying if the user has credits. This had a good impact, but I guess
it's not enough.
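
A minimal sketch of that flag-maintenance trigger, with hypothetical table 
and column names (credits(user_id, credits) and users(id, has_credits)):

  CREATE FUNCTION sync_has_credits() RETURNS trigger AS $$
  BEGIN
      UPDATE users
         SET has_credits = (NEW.credits > 0)
       WHERE id = NEW.user_id
         -- only touch users when the flag actually flips, to avoid
         -- creating needless dead row versions:
         AND has_credits IS DISTINCT FROM (NEW.credits > 0);
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER credits_flag
      AFTER UPDATE ON credits
      FOR EACH ROW EXECUTE PROCEDURE sync_has_credits();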

For now, we only have 23,000 users, but it's going to grow. Do you have any
advice? Is this possible with postgres, or do you recommend just trying a
volatile in-memory approach for the credits?

We're using pgpool and the output from free shows only 350M of RAM being
used.

Some relevants parts of the .conf:

max_connections = 160
shared_buffers = 4
work_mem = 3096
maintenance_work_mem = 131072
max_fsm_pages = 7
fsync = false
autovacuum = on

Any help would be really appreciated.

Thanks in advance,
Mauro.




Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Dan Harris

Michael Stone wrote:

On Fri, Apr 27, 2007 at 09:27:49AM -0400, Carlos Moreno wrote:
Notice that the second part of my suggestion covers this --- have 
additional

switches to initdb

snip
If the person knows all that, why wouldn't they know to just change the 
config parameters?




Exactly..  What I think would be much more productive is to use the great amount 
of information that PG tracks internally and auto-tune the parameters based on 
it.  For instance:


Why does the user need to manually track max_fsm_pages and max_fsm_relations?  I 
bet there are many users who have never taken the time to understand what this 
means and are wondering why performance still stinks after vacuuming their 
database ( spoken from my own experience )


How about work_mem?  shared_buffers?  column statistics sizes? random_page_cost?

Couldn't some fairly simple regression tests akin to a VACUUM process spot 
potential problems?  "Hey, it looks like you need more fsm_relations.. I bumped 
that up automatically for you."  Or "These indexes look bloated, shall I 
automatically reindex them for you?"


I'm sure there are many more examples, that with some creative thinking, could 
be auto-adjusted to match the usage patterns of the database. PG does an 
excellent job of exposing the variables to the users, but mostly avoids telling 
the user what to do or doing it for them.  Instead, it is up to the user to know 
where to look, what to look for, and how to react to things to improve 
performance.  This is not all bad, but it is assuming that all users are hackers 
( which used to be true ), but certainly doesn't help when the average SQLServer 
admin tries out Postgres and then is surprised at the things they are now 
responsible for managing.  PG is certainly *not* the only database to suffer 
from this syndrome, I know..


I like to think of my systems as good employees.  I don't want to have to 
micromanage everything they do.  I want to tell them "here's what I want done," 
and assuming I made a good hiring choice, they will do it and take some liberty 
to adjust parameters where needed to achieve the spirit of the goal, rather than 
blindly do something inefficiently because I failed to explain to them the 
absolute most efficient way to accomplish the task.


Granted, there are some people who don't like the developers making any 
assumptions about their workload.  But this doesn't have to be an either/or 
proposition.  I don't think any control needs to be abandoned.  But 
self-adjusting defaults seem like an achievable goal ( I know, I know, show us 
the patch ).  I just don't know if this feeling has resonated well between new 
users and long-term developers.  I know it must be grating to have to answer the 
same questions over and over and over: "Have you analyzed?  Did you leave 
postgresql.conf at the defaults?"  Seems like a win-win for both sides, IMHO.


In closing, I am not bashing PG!  I love it and swear by it.  These comments are 
purely from an advocacy perspective.  I'd love to see the PG user base continue 
to grow.


My .02

-Dan





Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Josh Berkus
Dan,

 Exactly..  What I think would be much more productive is to use the
 great amount of information that PG tracks internally and auto-tune the
 parameters based on it.  For instance:

*Everyone* wants this.  The problem is that it's very hard code to write 
given the number of variables.  I'm working on it but progress is slow, 
due to my travel schedule.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Bill Moran
In response to Dan Harris [EMAIL PROTECTED]:

 Michael Stone wrote:
  On Fri, Apr 27, 2007 at 09:27:49AM -0400, Carlos Moreno wrote:
  Notice that the second part of my suggestion covers this --- have 
  additional
  switches to initdb
 snip
  If the person knows all that, why wouldn't they know to just change the 
  config parameters?
 
 Exactly..  What I think would be much more productive is to use the great 
 amount 
 of information that PG tracks internally and auto-tune the parameters based 
 on 
 it.  For instance:
 
 Why does the user need to manually track max_fsm_pages and max_fsm_relations?
 I bet there are many users who have never taken the time to understand what
 this means and are wondering why performance still stinks after vacuuming
 their database ( spoken from my own experience )

But there are two distinct routes that can be taken if there's not enough
fsm space: add fsm space or vacuum more frequently.  I don't want the system
to eat up a bunch of memory for fsm entries if my workload indicates that
I can easily vacuum more frequently.

 How about work_mem?  shared_buffers?  column statistics sizes? 
 random_page_cost?

The only one that seems practical (to me) is random_page_cost.  The others are
all configuration options that I (as a DBA) want to be able to decide for
myself.  For example, I have some dedicated PG servers that I pretty much
max those values out at, to let PG know that it can use everything on the
system -- but I also have some shared use machines with PG, where I carefully
constrain those values so that PG doesn't muscle other daemons out of their
share of the RAM (work_mem is probably the best example)

It would be nice to have some kind of utility that could tell me what
random_page_cost should be, as I've never felt comfortable tweaking it.
Like some utility to run that would say "based on the seek tests I just
ran, you should set random_page_cost to x".  Of course, if such a thing
existed, it could just fill in the value for you.  But I haven't figured
out how to pick a good value for that setting, so I have no idea how to
suggest to have it automatically set.

 Couldn't some fairly simple regression tests akin to a VACUUM process spot
 potential problems?  "Hey, it looks like you need more fsm_relations.. I
 bumped that up automatically for you."  Or "These indexes look bloated,
 shall I automatically reindex them for you?"

A lot of that stuff does happen.  A vacuum verbose will tell you what it
thinks you should do, but I don't _want_ it to do it automatically.  What
if I create huge temporary tables once a week for some sort of analysis that
overload the fsm space?  And if I'm dropping those tables when the analysis
is done, do I want the fsm space constantly adjusting?

Plus, some is just impossible.  shared_buffers requires a restart.  Do you
want your DB server spontaneously restarting because it thought more
buffers might be nice?

 I'm sure there are many more examples, that with some creative thinking, 
 could 
 be auto-adjusted to match the usage patterns of the database. PG does an 
 excellent job of exposing the variables to the users, but mostly avoids 
 telling 
 the user what to do or doing it for them.  Instead, it is up to the user to 
 know 
 where to look, what to look for, and how to react to things to improve 
 performance.  This is not all bad, but it is assuming that all users are 
 hackers 
 ( which used to be true ), but certainly doesn't help when the average 
 SQLServer 
 admin tries out Postgres and then is surprised at the things they are now 
 responsible for managing.  PG is certainly *not* the only database to suffer 
 from this syndrome, I know..

I expect the suffering is a result of the fact that databases are non-trivial
pieces of software, and there's no universally simple way to set them up
and make them run well.

 I like to think of my systems as good employees.  I don't want to have to 
 micromanage everything they do.  I want to tell them here's what I want 
 done, 
 and assuming I made a good hiring choice, they will do it and take some 
 liberty 
 to adjust parameters where needed to achieve the spirit of the goal, rather 
 than 
   blindly do something inefficiently because I failed to explain to them the 
 absolute most efficient way to accomplish the task.

That's silly.  No software does that.  You're asking software to behave like
humans.  If that were the case, this would be Isaac Asimov's world, not the
real one.

 Granted, there are some people who don't like the developers making any
 assumptions about their workload.  But this doesn't have to be an either/or
 proposition.  I don't think any control needs to be abandoned.  But
 self-adjusting defaults seem like an achievable goal ( I know, I know, show
 us the patch ).  I just don't know if this feeling has resonated well
 between new users and long-term developers.  I know it must be grating to
 have to answer the same questions over and 

Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Kevin Hunter

At 10:36a -0400 on 27 Apr 2007, Tom Lane wrote:

That's been proposed and rejected before, too; the main problem being
that initdb is frequently a layer or two down from the user (eg,
executed by initscripts that can't pass extra arguments through, even
assuming they're being invoked by hand in the first place).


And following after Dan Harris' response . . .

So what's the problem with having some sort of cronjob contrib module  
that utilizes the actual and current statistics to make  
recommendations?  I don't think it'd be right to simply change the  
configuration options as it sees fit (especially as it was pointed  
out that many run multiple postmasters or have other uses for the  
machines in question), but perhaps it could send a message (email?)  
along the lines of "Hey, I'm currently doing this many of X  
transactions, against this much of Y data, and working under these  
constraints.  You might get better performance (in this area ... ) if  
you altered the configuration options like so: ..."


Certainly not for the masters, but perhaps for "standard installation"  
sorts of deals, sort of like bringing up the rear . . . just a thought.


Kevin



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Ray Stell
On Fri, Apr 27, 2007 at 02:40:07PM -0400, Kevin Hunter wrote:
 out that many run multiple postmasters or have other uses for the  
 machines in question), but perhaps it could send a message (email?)  
 along the lines of Hey, I'm currently doing this many of X  
 transactions, against this much of Y data, and working under these  
 constraints.  You might get better performance (in this area ... ) if  
 you altered the the configurations options like so: ...


or storing the values in the db for later trending analysis, witness 
Oracle's Statspack.



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Josh Berkus
Bill,

 The only one that seems practical (to me) is random_page_cost.  The
 others are all configuration options that I (as a DBA) want to be able
 to decide for myself. 

Actually, random_page_cost *should* be a constant 4.0 or 3.5, which 
represents the approximate ratio of seek/scan speed which has been 
relatively constant across 6 years of HDD technology.  The only reason we 
make it a configuration variable is that there are defects in our cost model 
which cause users to want to tinker with it.

Mind you, that's gotten better in recent versions as well.  Lately I mostly 
tinker with effective_cache_size and the various cpu_* stats rather than 
modifying random_page_cost.
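
For instance, you can experiment per-session before touching 
postgresql.conf (the values and table name here are purely illustrative):

  SET effective_cache_size = 100000;  -- in 8kB pages, i.e. roughly 800MB
  SET cpu_tuple_cost = 0.005;
  EXPLAIN SELECT count(*) FROM some_table;  -- see how the plan reacts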

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Dan Harris

Bill Moran wrote:

In response to Dan Harris [EMAIL PROTECTED]:

snip
Why does the user need to manually track max_fsm_pages and max_fsm_relations?  I 
bet there are many users who have never taken the time to understand what this 
means and are wondering why performance still stinks after vacuuming their 
database ( spoken from my own experience )


But there are two distinct routes that can be taken if there's not enough
fsm space: add fsm space or vacuum more frequently.  I don't want the system
to eat up a bunch of memory for fsm entries if my workload indicates that
I can easily vacuum more frequently.


There's no magic bullet here, but heuristics should be able to tell us "you can 
easily vacuum more frequently".  And again, I said these things would be 
*optional*.  Like an item in postgresql.conf: 
i_have_read_the_manual_and_know_what_this_all_means = false  # default false 
If you change it to true, you have all the control you're used to and nothing 
will get in your way.





How about work_mem?  shared_buffers?  column statistics sizes? random_page_cost?


The only one that seems practical (to me) is random_page_cost.  The others are
all configuration options that I (as a DBA) want to be able to decide for
myself.  For example, I have some dedicated PG servers that I pretty much
max those values out at, to let PG know that it can use everything on the
system -- but I also have some shared use machines with PG, where I carefully
constrain those values so that PG doesn't muscle other daemons out of their
share of the RAM (work_mem is probably the best example)



Just because you carefully constrain it does not preclude the ability for 
program logic to maintain statistics to do what I suggested.



It would be nice to have some kind of utility that could tell me what
random_page_cost should be, as I've never felt comfortable tweaking it.
Like some utility to run that would say based on the seek tests I just
ran, you should set random_page_cost to x.  Of course, if such a thing
existed, it could just fill in the value for you.  But I haven't figured
out how to pick a good value for that setting, so I have no idea how to
suggest to have it automatically set.


Me neither, but I thought if there's a reason it's user-settable, there must be 
some demonstrable method for deciding what is best.




Couldn't some fairly simple regression tests akin to a VACUUM process spot 
potential problems?  Hey, it looks like you need more fsm_relations.. I bumped 
that up automatically for you.  Or These indexes look bloated, shall I 
automatically reindex them for you?


A lot of that stuff does happen.  A vacuum verbose will tell you what it
thinks you should do, but I don't _want_ it to do it automatically.  What
if I create huge temporary tables once a week for some sort of analysis that
overload the fsm space?  And if I'm dropping those tables when the analysis
is done, do I want the fsm space constantly adjusting?


I understand *you* don't want it done automatically.  But my suspicion is that 
there are a lot more newbie pg admins who would rather let the system do 
something sensible as a default.  Again, you sound defensive that somehow my 
ideas would take power away from you.  I'm not sure why that is, but certainly 
I'm not suggesting that.  An auto-pilot mode is not a bad idea just because a 
few pilots don't want to use it.




Plus, some is just impossible.  shared_buffers requires a restart.  Do you
want your DB server spontaneously restarting because it thought more
buffers might be nice?


Well, maybe look at the bigger picture and see if it can be fixed to *not* 
require a program restart?  Or take effect on the next pid that gets created? 
This is a current limitation, but it doesn't need to be one for eternity, does it?




I'm sure there are many more examples, that with some creative thinking, could 
be auto-adjusted to match the usage patterns of the database. PG does an 
excellent job of exposing the variables to the users, but mostly avoids telling 
the user what to do or doing it for them.  Instead, it is up to the user to know 
where to look, what to look for, and how to react to things to improve 
performance.  This is not all bad, but it is assuming that all users are hackers 
( which used to be true ), but certainly doesn't help when the average SQLServer 
admin tries out Postgres and then is surprised at the things they are now 
responsible for managing.  PG is certainly *not* the only database to suffer 
from this syndrome, I know..


I expect the suffering is a result of the fact that databases are non-trivial
pieces of software, and there's no universally simple way to set them up
and make them run well.


Speaking as a former SQL Server admin ( from day 1 of the Sybase fork up to 
version 2000 ), I can say there *is* a way to make them simple.  It's certainly 
not a perfect piece of software, but the learning curve speaks for itself.  It 
can auto-shrink your databases ( without locking 

Re: [PERFORM] Very specific server situation

2007-04-27 Thread Tom Lane
Mauro N. Infantino [EMAIL PROTECTED] writes:
 What we basically have is a site where each user has a box with links to
 other randomly selected users. Whenever a box from a user is shown, an SP is
 executed: a credit is added to that user and a credit is subtracted from
 the accounts of the shown links. Accounts with no credits do not have to be
 listed. So, we've got lots (LOTS) of users querying and updating the same table.

Have you checked to make sure the query plans are reasonable?  Have you
checked that autovacuum is running often enough?  (You might want to try
contrib/pgstattuple to see how much dead space there is in your
heavily-updated tables.)  Also, on a high-update workload it is
absolutely critical to boost checkpoint_segments far enough that you are
not doing checkpoints oftener than maybe once every five minutes.
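
For example, a quick check, assuming the heavily-updated table is called 
credits and contrib/pgstattuple is installed:

  SELECT dead_tuple_percent, free_percent
    FROM pgstattuple('credits');

A dead_tuple_percent that keeps climbing between vacuums means autovacuum 
isn't keeping up with the update rate.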

If the performance problems seem bursty then you may also need to look
at adjusting bgwriter and/or vacuum cost delay parameters to smooth out
the I/O load.

regards, tom lane



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread Josh Berkus
Dan,

 Yes, this is the classic problem.  I'm not demanding anyone pick up the
 ball and jump on this today, tomorrow, etc.. I just think it would be
 good for those who *could* make a difference to keep those goals in mind
 when they continue.  If you have the right mindset, this problem will
 fix itself over time.

Don't I wish.  Autotuning is *hard*.  It took Oracle 6 years.  It took 
Microsoft 3-4 years, and theirs still has major issues last I checked. And 
both of those DBs support fewer OSes than we do.  I think it's going to 
take more than the *right mindset* and my spare time.

 I appreciate your efforts in this regard.  Do you have a formal project
 plan for this?  If you can share it with me, I'll take a look and see if
 there is anything I can do to help out.

Nope, just some noodling around on the configurator:
www.pgfoundry.org/projects/configurator

 I am on the verge of starting a Java UI that will query a bunch of the
 pg_* tables and give the user information about wasted table space,
 index usage, table scans, slow-running queries and spoon-feed it in a
 nice attractive interface that can be a real-time system monitor tool. 
 This could be a cooperative project or might have some redundancy with
 what you're up to.

I'd be *very* interested in collaborating with you on this.  Further, we 
could feed DTrace ( systemtap?) into the interface to get data that 
PostgreSQL doesn't currently produce.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-27 Thread david

On Fri, 27 Apr 2007, Josh Berkus wrote:


Dan,


Yes, this is the classic problem.  I'm not demanding anyone pick up the
ball and jump on this today, tomorrow, etc.. I just think it would be
good for those who *could* make a difference to keep those goals in mind
when they continue.  If you have the right mindset, this problem will
fix itself over time.


Don't I wish.  Autotuning is *hard*.  It took Oracle 6 years.  It took
Microsoft 3-4 years, and theirs still has major issues last I checked. And
both of those DBs support fewer OSes than we do.  I think it's going to
take more than the *right mindset* and my spare time.


I think there are a couple different things here.

1. full autotuning

  as you say, this is very hard and needs a lot of info about your 
particular database usage.


2. getting defaults that are closer to right than the current ones.

  this is much easier. for this nobody is expecting that the values are 
right, we're just begging for some tool to get us within a couple orders 
of magnitude of what's correct.


the current defaults are appropriate for a single cpu with 10's of MB of 
ram and a single drive


nowadays you have people trying to run quick-and-dirty tests on some spare 
hardware they have laying around (waiting for another project) that's got 
4-8 CPUs with 10's of GB of ram and a couple dozen drives


these people don't know about database tuning; they can learn, but they 
want to see if postgres is even in the ballpark. if the results are close 
to acceptable they will ask questions and research the tuning, but if the 
results are orders of magnitude lower than they need to be they'll just 
say that postgres is too slow and try another database.


an autodefault script that was written assuming that postgres has the box 
to itself would be a wonderful start.


I think the next step would be to be able to tell the script 'only plan on 
using 1/2 of this box'


and beyond that would be the steps that you are thinking of, where the 
usage pattern is considered.


but when every performance question is answered with "did you change the 
defaults? they are way too low for modern hardware, raise them by 2 orders 
of magnitude and then we'll start investigating" ...


David Lang