[HACKERS] Re: [HACKERS] ExclusiveLock on PostgreSQL - Fabio Mendonça

2015-10-30 Thread Robert Haas
On Wed, Oct 28, 2015 at 5:59 PM, Fabio Oliveira De Mendonca
 wrote:
> I have a process that inserts 600,000 rows into table "A" (130 columns),
> and I'm getting the "ExclusiveLock" error message and losing some rows
> during the transaction.  The insert transaction runs every 2 minutes, and
> every 1 minute a second process reads table "A" (joined to table "C" on
> the primary key) to insert into a table "B".  Well, I thought about
> partitioning table "A", but I don't believe that would fix the problem
> ("ExclusiveLock").

This isn't really the right mailing list for this question.

You might find 
https://listas.postgresql.org.br/cgi-bin/mailman/listinfo/pgbr-geral
helpful, or you can ask at
http://www.postgresql.org/list/pgsql-general/

You should also read
https://wiki.postgresql.org/wiki/Guide_to_reporting_problems --
because this report does not contain enough information for someone to
answer your question.  In particular, including the exact text of any
commands you executed and any error or other messages the system
generated would be helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ExclusiveLock on PostgreSQL - Fabio Mendonça

2015-10-30 Thread Fabio Oliveira De Mendonca

Thanks, Robert.

I read the linked material and it helped me reach a new decision.

Thank you.
Regards,

Fabio Mendonça




From: Robert Haas <robertmh...@gmail.com>
Sent: Friday, October 30, 2015 07:49
To: Fabio Oliveira De Mendonca
Cc: k...@it.is.rice.edu; gsst...@mit.edu; pgsql-hackers@postgresql.org; 
fabio.mendonca@gmail.com
Subject: Re: [HACKERS] ExclusiveLock on PostgreSQL - Fabio Mendonça



Re: [HACKERS] ExclusiveLock on extension of relation with huge shared_buffers

2014-12-28 Thread Borodin Vladimir

On Oct 25, 2014, at 4:31, Jim Nasby jim.na...@bluetreble.com wrote:

 Please don't top-post.
 
 On 10/24/14, 3:40 AM, Borodin Vladimir wrote:
 I have taken some backtraces (they are attached to the letter) of two 
 processes with such command:
 pid=17981; while true; do date; gdb -batch -e back 
 /usr/pgsql-9.4/bin/postgres $pid; echo; echo; echo; echo; sleep 0.1; done
 
 Process 17981 was holding the lock for a long time - 
 http://pastie.org/9671931.
 And process 13886 was waiting for lock (in different time and from different 
 blocker actually but I don’t think it is really important) - 
 http://pastie.org/9671939.
 
 As I can see, 17981 is actually waiting for LWLock on BufFreelistLock in 
 StrategyGetBuffer function, freelist.c:134 while holding exclusive lock on 
 relation. I will try to increase NUM_BUFFER_PARTITIONS (on read-only load it 
 also gave us some performance boost) and write the result in this thread.
 
 BufFreelistLock becomes very contended when shared buffers are under a lot of 
 pressure.
 
 Here's what I believe is happening:
 
 If RelationGetBufferForTuple() decides it needs to extend, this happens:
   LockRelationForExtension(relation, ExclusiveLock);
   buffer = ReadBufferBI(relation, P_NEW, bistate);
 
 Assuming bistate is false (I didn't check the bulk case), ReadBufferBI() ends 
 up at ReadBuffer_common(), which calls BufferAlloc(). In the normal case, 
 BufferAlloc() won't find the necessary buffer, so it will call 
 StrategyGetBuffer(), which will end up getting the freelist lock. Currently 
 the free list is normally empty, which means we now need to run the clock 
 sweep to find a victim buffer. The clock sweep will keep running until it 
 finds a buffer that is not pinned and has usage_count = 0. If shared buffers 
 are under heavy pressure, you can have a huge number of them with usage_count 
 = 5, which for 100GB shared buffers and an 8K BLKSZ, you could have to check 
 buffers *52 million* times (assuming you finally find a buffer on the start 
 of the 5th loop) before you find a victim.
 
 Keep in mind that's all happening while you're holding both the extension 
 lock *and the freelist lock*, which basically means no one else in the entire 
 system can allocate a new buffer.

I’ll try the same workload with recent patch from Andres Freund [0].

 
 This is one reason why a large shared_buffers setting is usually 
 counter-productive. Experience with older versions is that setting it higher 
 than about 8GB is more likely to hurt than to help. Newer versions are 
 probably better, but I think you'll be hard-pressed to find a workload where 
 100GB makes sense. It might if your entire database fits in shared_buffers 
 (though, even then there's probably a number of O(n) or worse operations that 
 will hurt you), but if your database is much larger than shared_buffers you're probably in 
 trouble.
 
 I suggest cutting shared_buffers *way* down. Old-school advice for this 
 machine would be 8G (since 25% of 128G would be too big). You might be able 
 to do better than 8G, but I recommend not even trying unless you've got a 
 good way to test your performance.
 
 If you can test performance and find an optimal setting for shared_buffers, 
 please do share your test data and findings. :)

Of course, it works well with shared_buffers = 8GB. But we have seen that on a 
read-only load, when the data set fits in RAM but shared_buffers is <= 8GB, we 
hit the BufFreelistLock LWLock while moving pages between shared buffers and 
the page cache. Increasing shared_buffers to the size of the data set improves 
performance up to 2.5x on this read-only load. So we started testing a 
configuration with huge shared_buffers under a writing load, and that's why I 
started this thread.

Since StrategyGetBuffer() does not use BufFreelistLock LWLock any more [1] I’ll 
also re-run tests with read-only load and small shared_buffers.

[0] 
http://git.postgresql.org/pg/commitdiff/d72731a70450b5e7084991b9caa15cb58a2820df
[1] 
http://git.postgresql.org/pg/commitdiff/1dcfb8da09c47d2a7502d1dfab06c8be4b6cf323

 -- 
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Vladimir






Re: [HACKERS] ExclusiveLock on extension of relation with huge shared_buffers

2014-10-24 Thread Jim Nasby

Please don't top-post.

On 10/24/14, 3:40 AM, Borodin Vladimir wrote:

I have taken some backtraces (they are attached to the letter) of two processes 
with such command:
pid=17981; while true; do date; gdb -batch -e back /usr/pgsql-9.4/bin/postgres 
$pid; echo; echo; echo; echo; sleep 0.1; done

Process 17981 was holding the lock for a long time - http://pastie.org/9671931.
And process 13886 was waiting for lock (in different time and from different 
blocker actually but I don’t think it is really important) - 
http://pastie.org/9671939.

As I can see, 17981 is actually waiting for LWLock on BufFreelistLock in 
StrategyGetBuffer function, freelist.c:134 while holding exclusive lock on 
relation. I will try to increase NUM_BUFFER_PARTITIONS (on read-only load it 
also gave us some performance boost) and write the result in this thread.


BufFreelistLock becomes very contended when shared buffers are under a lot of 
pressure.

Here's what I believe is happening:

If RelationGetBufferForTuple() decides it needs to extend, this happens:
LockRelationForExtension(relation, ExclusiveLock);
buffer = ReadBufferBI(relation, P_NEW, bistate);

Assuming bistate is false (I didn't check the bulk case), ReadBufferBI() ends 
up at ReadBuffer_common(), which calls BufferAlloc(). In the normal case, 
BufferAlloc() won't find the necessary buffer, so it will call 
StrategyGetBuffer(), which will end up getting the freelist lock. Currently the 
free list is normally empty, which means we now need to run the clock sweep to 
find a victim buffer. The clock sweep will keep running until it finds a buffer 
that is not pinned and has usage_count = 0. If shared buffers are under heavy 
pressure, you can have a huge number of them with usage_count = 5, which for 
100GB shared buffers and an 8K BLKSZ, you could have to check buffers *52 
million* times (assuming you finally find a buffer on the start of the 5th 
loop) before you find a victim.
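
For scale: 100GB of shared buffers at an 8K block size is roughly 13.1 million
buffers, and if nearly all of them start at usage_count = 5 the sweep has to
decrement each of them several times, so finding a victim early in the fifth
pass works out to about 4 x 13.1M, i.e. the ~52 million checks mentioned above.
Below is a minimal, self-contained sketch of that victim search; it is only a
toy model of the loop described here, not the actual StrategyGetBuffer() code
in freelist.c, which also consults the free list, holds BufFreelistLock (in the
versions discussed here), and takes per-buffer spinlocks.

    /*
     * Toy model of the clock-sweep victim search described above.
     */
    typedef struct
    {
        int usage_count;            /* 0..5, bumped on each buffer hit */
        int refcount;               /* > 0 means the buffer is pinned */
    } ToyBufferDesc;

    static int
    toy_clock_sweep(ToyBufferDesc *buffers, int nbuffers, int *next_victim)
    {
        for (;;)
        {
            ToyBufferDesc *buf = &buffers[*next_victim];

            *next_victim = (*next_victim + 1) % nbuffers;

            if (buf->refcount == 0)
            {
                if (buf->usage_count == 0)
                    return (int) (buf - buffers);   /* found a victim */
                buf->usage_count--;                 /* give it one more lap */
            }
            /*
             * With ~13 million buffers all sitting at usage_count = 5, this
             * loop can run tens of millions of iterations before returning.
             */
        }
    }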

Keep in mind that's all happening while you're holding both the extension lock 
*and the freelist lock*, which basically means no one else in the entire system 
can allocate a new buffer.

This is one reason why a large shared_buffers setting is usually 
counter-productive. Experience with older versions is that setting it higher than 
about 8GB is more likely to hurt than to help. Newer versions are probably better, 
but I think you'll be hard-pressed to find a workload where 100GB makes sense. It 
might if your entire database fits in shared_buffers (though, even then there's 
probably a number of O(n) or worse operations that will hurt you), but if your 
database is much larger than shared_buffers you're probably in trouble.

I suggest cutting shared_buffers *way* down. Old-school advice for this machine 
would be 8G (since 25% of 128G would be too big). You might be able to do 
better than 8G, but I recommend not even trying unless you've got a good way to 
test your performance.

If you can test performance and find an optimal setting for shared_buffers, 
please do share your test data and findings. :)
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-24 Thread Bort, Paul

 From: Kenneth Marshall [mailto:[EMAIL PROTECTED]]
[snip]
 The simplest idea I had was to pre-layout the WAL logs in a 
 contiguous fashion
 on the disk. Solaris has this ability given appropriate FS 
 parameters and we
 should be able to get close on most other OSes. Once that has 
 happened, use
 something like the FSM map to show the allocated blocks. The 
 CPU can keep track
 of its current disk rotational position (approx. is okay) 
 then when we need to
 write a WAL block start writing at the next area that the 
 disk head will be
  sweeping. Give it a little leeway for latency in the system 
 and we should be
 able to get very low latency for the writes. Obviously, there 
 would be wasted
 space but you could intersperse writes to the granularity of 
 space overhead
 that you would like to see. As far as implementation, I was reading an
 interesting article that used a simple theoretical model to 
 estimate disk head
 position to avoid latency.
 


Ken, 


That's a neat idea, but I'm not sure how much good it will do. As bad as rotational latency is, seek time is worse. Pre-allocation isn't going to do much for rotational latency if the heads also have to seek back to the WAL. 

OTOH, pre-allocation could help two other performance aspects of the WAL: First, if the WAL was pre-allocated, steps could be taken (by the operator, based on their OS) to make the space allocated to the WAL contiguous. Statistics on how much WAL is needed in 24 hours would help with that sizing. This would reduce seeks involved in writing the WAL data.

The other thing it would do is reduce seeks and metadata writes involved in extending WAL files.


All of this is moot if the WAL doesn't have its own spindle(s).


This almost leads back to the old-fashioned idea of using a raw partition, to avoid the overhead of the OS and file structure. 

Or I could be thoroughly demonstrating my complete lack of understanding of PostgreSQL internals. :-)


Maybe I'll get a chance to try the flash drive WAL idea in the next couple of weeks. Need to see if the hardware guys have a spare flash drive I can abuse.

Paul





Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-24 Thread Kenneth Marshall
On Wed, Nov 24, 2004 at 11:00:30AM -0500, Bort, Paul wrote:
  From: Kenneth Marshall [mailto:[EMAIL PROTECTED]
 [snip]
  The simplest idea I had was to pre-layout the WAL logs in a 
  contiguous fashion
  on the disk. Solaris has this ability given appropriate FS 
  parameters and we
  should be able to get close on most other OSes. Once that has 
  happened, use
  something like the FSM map to show the allocated blocks. The 
  CPU can keep track
  of its current disk rotational position (approx. is okay) 
  then when we need to
  write a WAL block start writing at the next area that the 
  disk head will be
   sweeping. Give it a little leeway for latency in the system 
  and we should be
  able to get very low latency for the writes. Obviously, there 
  would be wasted
  space but you could intersperse writes to the granularity of 
  space overhead
  that you would like to see. As far as implementation, I was reading an
  interesting article that used a simple theoretical model to 
  estimate disk head
  position to avoid latency.
  
 
 Ken, 
 
 That's a neat idea, but I'm not sure how much good it will do. As bad as
 rotational latency is, seek time is worse. Pre-allocation isn't going to do
 much for rotational latency if the heads also have to seek back to the WAL. 
 
 OTOH, pre-allocation could help two other performance aspects of the WAL:
 First, if the WAL was pre-allocated, steps could be taken (by the operator,
 based on their OS) to make the space allocated to the WAL contiguous.
 Statistics on how much WAL is needed in 24 hours would help with that
 sizing. This would reduce seeks involved in writing the WAL data.
 
 The other thing it would do is reduce seeks and metadata writes involved in
 extending WAL files.
 
 All of this is moot if the WAL doesn't have its own spindle(s).
 
 This almost leads back to the old-fashioned idea of using a raw partition,
 to avoid the overhead of the OS and file structure. 
 
 Or I could be thoroughly demonstrating my complete lack of understanding of
 PostgreSQL internals. :-)
 
 Maybe I'll get a chance to try the flash drive WAL idea in the next couple
 of weeks. Need to see if the hardware guys have a spare flash drive I can
 abuse.
 
 Paul
 

Obviously, this whole process would be much more effective on systems with
separate WAL drives. But even on less busy systems, the lock-step of
write-a-WAL/wait-for-heads/write-a-WAL can dramatically decrease your
effective throughput to the drive. Consider the worst case: write one WAL
block to disk, then schedule another WAL block to be written. That second
block has to wait one full disk rotation before it can be written. On a 10k
RPM drive, that scenario limits you to about 166 TPS, assuming no
piggy-backed syncs. Now look at the case where we can use the preallocated
WAL and write immediately. Assuming a 100% sequential disk layout, if we can
start writing within 25% of a full rotation we can now support 664 TPS on the
same hardware. Now look at a typical hard drive on my desktop system with
150M sectors/4 heads/5 tracks - 3000 blocks/track, or 375 8K blocks per
track. If we can write the next block within 10 8K blocks we can perform 6225
TPS; within 5 8K blocks, 12450 TPS; within 2 8K blocks, 31125 TPS. This is
just a simple disk drive. As you can see, even small improvements can make a
tremendous difference in throughput. My analysis is very simplistic, and
whether we can model the I/O quickly enough to be useful is still to be
determined. Maybe someone on the mailing list with more experience in how
disk drives actually function can provide more definitive information.
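
A back-of-envelope check of the first two figures above, under the same
assumptions (10,000 RPM, one commit per WAL flush, no piggy-backed syncs, no
write cache). Ken's 664 is 4 x 166, i.e. the rotation rate rounded down before
multiplying; computing it directly gives ~667:

    #include <stdio.h>

    int
    main(void)
    {
        double rpm = 10000.0;
        double rotations_per_sec = rpm / 60.0;          /* ~166.7 */

        /* Worst case: every WAL flush waits one full rotation. */
        double full_rotation_tps = rotations_per_sec;   /* ~166 TPS */

        /* Preallocated, rotation-aware layout: start writing within a
         * quarter rotation of wherever the head happens to be. */
        double quarter_rotation_tps = rotations_per_sec / 0.25;  /* ~667 TPS */

        printf("wait a full rotation:    %.0f TPS\n", full_rotation_tps);
        printf("wait a quarter rotation: %.0f TPS\n", quarter_rotation_tps);
        return 0;
    }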

Ken




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-23 Thread Bort, Paul

 The impression I had was that disk drives no longer pay the slightest
 attention to interleave specs, because the logical model 
 implied by the
 concept is too far removed from modern reality (on-disk buffering,
 variable numbers of sectors per track, transparently remapped bad
 sectors, yadda yadda).
 


Entirely true. Interleave was an issue back when the controller wasn't fast enough to keep up with 3600 RPM disks, and is now completely obscured from the bus. I don't know if the ATA spec includes interleave control; I suspect it does not.

 And that's just at the hardware level ... who knows where the 
 filesystem
 is putting your data, or what the kernel I/O scheduler is doing with
 your requests :-(
 
 Basically I see the TODO item as a blue-sky research topic, not
 something we have any idea how to implement. That doesn't 
 mean it can't
 be on the TODO list ...
 


I think that if we also take into consideration various hardware and software RAID configurations, this is just too far removed from the database level to be at all practical to throw code at.

Perhaps this should be rewritten as a documentation change: recommendations about performance hardware? What we recommend for our highest volume customers (alas, on a proprietary RDBMS, and only x86) is something like this:

- Because drive capacity is so huge now, choose faster drives over larger drives. 15K RPM isn't three times faster than 5400, but there is a noticeable difference. 

- More spindles reduce delays even further. Mirroring allows reads to happen faster because they can come from either side of the mirror, and spanning reduces problems with rotational delays.

- The ideal disk configuration that we recommend is a 14 drive chassis with a split backplane. Run each backplane to a separate channel on the controller, and mirror the channels. Use the first drive on each channel for the OS and swap, the second drive for transaction logs, and the remaining drives spanned (and already mirrored) for data. With a reasonable write cache on the controller, this has proven to be a pretty fast configuration despite a less than ideal engine.

One other thought: How does static RAM compare to disk speed nowadays? A 1Gb flash drive might be reasonable for the WAL if it can keep up. 




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-23 Thread Doug McNaught
Bort, Paul [EMAIL PROTECTED] writes:

One other thought: How does static RAM compare to disk speed nowadays?
A 1Gb flash drive might be reasonable for the WAL if it can keep up.

Flash RAM wears out; it's not suitable for a continuously-updated
application like WAL.

-Doug



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-23 Thread Bort, Paul

 From: Doug McNaught [mailto:[EMAIL PROTECTED]]
 
 Bort, Paul [EMAIL PROTECTED] writes:
 
  One other thought: How does static RAM compare to disk 
 speed nowadays?
  A 1Gb flash drive might be reasonable for the WAL if it 
 can keep up.
 
 Flash RAM wears out; it's not suitable for a continuously-updated
 application like WAL.
 
 -Doug
 


But if it's even 2x faster than a disk, that might be worth wearing them out. Given that they have published write count limits, one could reasonably plan to replace the memory after half of that time and be comfortable with the lifecycle. I saw somewhere that even with continuous writes on USB 2.0, it would take about twelve years to exhaust the write life of a typical flash drive. Even an order-of-magnitude increase in throughput beyond that only calls for a new drive every year. (Or every six months if you're paranoid. If you're that paranoid, you can mirror them, too.)

Whether USB 2.0 is fast enough for the WAL is a separate discussion. 





Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-23 Thread Kenneth Marshall
On Tue, Nov 23, 2004 at 12:04:17AM +, Simon Riggs wrote:
 On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
  
   - Find a way to reduce rotational delay when repeatedly writing last WAL
   page 
   
   Currently fsync of WAL requires the disk platter to perform a full
   rotation to fsync again. One idea is to write the WAL to different
   offsets that might reduce the rotational delay. 
  
  Once upon a time when you formatted hard drives you actually gave them an
  interleave factor for a similar reason. These days you invariably use an
  interleave of 1, ie, store the blocks continuously. Whether that's because
  controllers have become fast enough to keep up with the burst rate or 
  because
  the firmware is smart enough to handle the block interleaving invisibly 
  isn't
  clear to me.
  
  I wonder if formatting the drive to have an interleave other than 1 would
  actually improve performance of the WAL log.
  
  It would depend a lot on the usage pattern though. A heavily used system
  might be able to generate enough WAL traffic to keep up with the burst rate
  of the drive. And a less used system might benefit but might lose.
  
  Probably now the less than saturated system gets close to the average
  half-rotation-time latency. This idea would only really help if you have a
  system that happens to be triggering pessimal results worse than that due to
  unfortunate timing.
 
 I was asking whether that topic should be removed, since Tom had said it
 had been rejected
 
 If you could tell me how to instrument the system to (better) show
 whether such plans as you suggest are workable, I would be greatly
 interested. Anything we do needs to be able to be monitored for
 success/failure.
 
 -- 
 Best Regards, Simon Riggs
 

The disk performance has increased so much that the reasons for having
an interleave factor other than 1 (no interleaving) have all but disappeared.
CPU speed has also increased so much relative to disk speed that using some
CPU cycles to improve I/O is a reasonable approach. I have been considering
how this might be accomplished. As Simon so aptly pointed out, we need to
show that it materially affects the performance or it is not worth doing.
The simplest idea I had was to pre-layout the WAL logs in a contiguous fashion
on the disk. Solaris has this ability given appropriate FS parameters and we
should be able to get close on most other OSes. Once that has happened, use
something like the FSM map to show the allocated blocks. The CPU can keep track
of its current disk rotational position (approx. is okay) then when we need to
write a WAL block start writing at the next area that the disk head will be
sweeping. Give it a little leeway for latency in the system and we should be
able to get very low latency for the writes. Obviously, there would be wasted
space but you could intersperse writes to the granularity of space overhead
that you would like to see. As far as implementation, I was reading an
interesting article that used a simple theoretical model to estimate disk head
position to avoid latency.
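
A toy sketch of that placement idea, purely for illustration: estimate which
pre-allocated block is currently under the head from the wall clock and the
rotation rate, then aim a few blocks ahead of it to absorb latency. Every name
and number here is invented for the example; nothing like this exists in the
backend.

    #include <stdint.h>

    /*
     * Toy model: blocks_per_track pre-allocated WAL blocks laid out
     * contiguously around one track of a drive spinning at `rpm`.
     * Estimate the block under the head from the current time and aim
     * margin_blocks ahead of it.  Illustrative only.
     */
    static unsigned
    next_wal_block(uint64_t now_usec, unsigned blocks_per_track,
                   double rpm, unsigned margin_blocks)
    {
        double   usec_per_rotation = 60.0 * 1e6 / rpm;
        uint64_t rot_usec = (uint64_t) usec_per_rotation;
        double   frac = (double) (now_usec % rot_usec) / usec_per_rotation;

        /* Block approximately under the head right now. */
        unsigned under_head = (unsigned) (frac * blocks_per_track);

        /* Aim a little ahead of the head so the write can be issued in time. */
        return (under_head + margin_blocks) % blocks_per_track;
    }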

Yours truly,
Ken Marshall



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-22 Thread Simon Riggs
On Thu, 2004-11-18 at 23:54, Tom Lane wrote:
 I don't think so; WAL is inherently a linear log.  (Awhile ago there was
 some talk of nonlinear log writing to get around the one-commit-per-
 disk-revolution syndrome, but the idea basically got rejected as
 unworkably complicated.)  

...this appears to still be on the TODO list... should it be removed?

- Find a way to reduce rotational delay when repeatedly writing last WAL
page 

Currently fsync of WAL requires the disk platter to perform a full
rotation to fsync again. One idea is to write the WAL to different
offsets that might reduce the rotational delay. 

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-22 Thread Greg Stark

Simon Riggs [EMAIL PROTECTED] writes:

 - Find a way to reduce rotational delay when repeatedly writing last WAL
 page 
 
 Currently fsync of WAL requires the disk platter to perform a full
 rotation to fsync again. One idea is to write the WAL to different
 offsets that might reduce the rotational delay. 

Once upon a time when you formatted hard drives you actually gave them an
interleave factor for a similar reason. These days you invariably use an
interleave of 1, ie, store the blocks continuously. Whether that's because
controllers have become fast enough to keep up with the burst rate or because
the firmware is smart enough to handle the block interleaving invisibly isn't
clear to me.

I wonder if formatting the drive to have an interleave other than 1 would
actually improve performance of the WAL log.

It would depend a lot on the usage pattern though. A heavily used system might
be able to generate enough WAL traffic to keep up with the burst rate of the
drive. And a less used system might benefit but might lose.

Probably now the less than saturated system gets close to the average
half-rotation-time latency. This idea would only really help if you have a
system that happens to be triggering pessimal results worse than that due to
unfortunate timing.

-- 
greg




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-22 Thread Simon Riggs
On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  - Find a way to reduce rotational delay when repeatedly writing last WAL
  page 
  
  Currently fsync of WAL requires the disk platter to perform a full
  rotation to fsync again. One idea is to write the WAL to different
  offsets that might reduce the rotational delay. 
 
 Once upon a time when you formatted hard drives you actually gave them an
 interleave factor for a similar reason. These days you invariably use an
 interleave of 1, ie, store the blocks continuously. Whether that's because
 controllers have become fast enough to keep up with the burst rate or because
 the firmware is smart enough to handle the block interleaving invisibly isn't
 clear to me.
 
 I wonder if formatting the drive to have an interleave other than 1 would
 actually improve performance of the WAL log.
 
 It would depend a lot on the usage pattern though. A heavily used system might
 be able to generate enough WAL traffic to keep up with the burst rate of the
 drive. And a less used system might benefit but might lose.
 
 Probably now the less than saturated system gets close to the average
 half-rotation-time latency. This idea would only really help if you have a
 system that happens to be triggering pessimal results worse than that due to
 unfortunate timing.

I was asking whether that topic should be removed, since Tom had said it
had been rejected

If you could tell me how to instrument the system to (better) show
whether such plans as you suggest are workable, I would be greatly
interested. Anything we do needs to be able to be monitored for
success/failure.

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-22 Thread Bruce Momjian
Simon Riggs wrote:
 On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
  
   - Find a way to reduce rotational delay when repeatedly writing last WAL
   page 
   
   Currently fsync of WAL requires the disk platter to perform a full
   rotation to fsync again. One idea is to write the WAL to different
   offsets that might reduce the rotational delay. 
  
  Once upon a time when you formatted hard drives you actually gave them an
  interleave factor for a similar reason. These days you invariably use an
  interleave of 1, ie, store the blocks continuously. Whether that's because
  controllers have become fast enough to keep up with the burst rate or 
  because
  the firmware is smart enough to handle the block interleaving invisibly 
  isn't
  clear to me.
  
  I wonder if formatting the drive to have an interleave other than 1 would
  actually improve performance of the WAL log.
  
  It would depend a lot on the usage pattern though. A heavily used system
  might be able to generate enough WAL traffic to keep up with the burst rate
  of the drive. And a less used system might benefit but might lose.
  
  Probably now the less than saturated system gets close to the average
  half-rotation-time latency. This idea would only really help if you have a
  system that happens to be triggering pessimal results worse than that due to
  unfortunate timing.
 
 I was asking whether that topic should be removed, since Tom had said it
 had been rejected

The method used to fix it was rejected, but the goal of making it better
is still a valid one.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-22 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Once upon a time when you formatted hard drives you actually gave them an
 interleave factor for a similar reason. These days you invariably use an
 interleave of 1, ie, store the blocks continuously. Whether that's because
 controllers have become fast enough to keep up with the burst rate or because
 the firmware is smart enough to handle the block interleaving invisibly isn't
 clear to me.

The impression I had was that disk drives no longer pay the slightest
attention to interleave specs, because the logical model implied by the
concept is too far removed from modern reality (on-disk buffering,
variable numbers of sectors per track, transparently remapped bad
sectors, yadda yadda).

And that's just at the hardware level ... who knows where the filesystem
is putting your data, or what the kernel I/O scheduler is doing with
your requests :-(

Basically I see the TODO item as a blue-sky research topic, not
something we have any idea how to implement.  That doesn't mean it can't
be on the TODO list ...

regards, tom lane



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-21 Thread Simon Riggs
On Sat, 2004-11-20 at 16:14, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
  If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
  would be the places showing contention delays.
 
  You say this as if we can easily check that.
 
 I think this can be done with oprofile ...

OK, well that's where this thread started.

oprofile only gives us aggregate information. It doesn't tell us how much
time is spent waiting because of contention issues; it just tells us how much
time is spent, and even that is skewed.

There really ought to be a better way to instrument things from inside,
based upon knowledge of the code.
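
One hedged illustration of "instrumenting from inside": wrap a lock
acquisition in a timer and keep a per-lock tally of waits and wait time. The
names below are invented for the example; they are not PostgreSQL's actual
lock or statistics APIs.

    #include <time.h>

    typedef struct
    {
        unsigned long waits;        /* how many acquisitions were timed */
        double        wait_secs;    /* total time spent blocked */
    } ToyLockStats;

    static void
    timed_acquire(ToyLockStats *stats, void (*acquire)(void *), void *lock)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        acquire(lock);                      /* blocks until the lock is granted */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        stats->waits++;
        stats->wait_secs += (double) (t1.tv_sec - t0.tv_sec)
                          + (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }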

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-20 Thread Simon Riggs
On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
 Josh Berkus [EMAIL PROTECTED] writes:
  The main problem on INSERTs is that it is usually the same few pages:
  the lead data block and the lead index block. There are ways of
  spreading the load out across an index, but I'm not sure what happens on
  the leading edge of the data relation, but I think it hits the same
  block each time.
 
  I actually have several test cases for this, can you give me a trace or 
  profile suggestion that would show if this is happening?
 
 If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
 would be the places showing contention delays.

You say this as if we can easily check that. My understanding is that
this would require a scripted gdb session to instrument the executable
at that point.

Is that what you mean? That isn't typically regarded as a great thing to
do on a production system. 

You've mentioned performance speculation, which I agree with, but what are
the alternatives? Compile-time changes usually can't be enabled, since many
people work from RPMs.

 It could also be that the contention is for the WALInsertLock, ie, the
 right to stuff a WAL record into the shared buffers.  This effect would
 be the same even if you were inserting into N separate tables.

...and how do we check that also.

Are we back to simulated workloads and fully rigged executables?

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-20 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
 If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
 would be the places showing contention delays.

 You say this as if we can easily check that.

I think this can be done with oprofile ...

regards, tom lane



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Josh Berkus
Tom,

 I think you are right that these reflect heap or btree-index extension
 operations.  Those do not actually take locks on the *table* however,
 but locks on a single page within it (which are completely orthogonal to
 table locks and don't conflict).  The pg_locks output leaves something
 to be desired, because you can't tell the difference between table and
 page locks.

Aside from foreign keys, though, is there any way in which INSERT page locks 
could block other inserts?  I have another system (Lyris) where that appears 
to be happening with 32 concurrent INSERT streams.  It's possible that the 
problem is somewhere else, but I'm disturbed by the possibility.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Aside from foreign keys, though, is there any way in which INSERT page locks 
 could block other inserts?

Not for longer than the time needed to physically add a tuple to a page.

regards, tom lane



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 The main problem on INSERTs is that it is usually the same few pages:
 the lead data block and the lead index block. There are ways of
 spreading the load out across an index, but I'm not sure what happens on
 the leading edge of the data relation, but I think it hits the same
 block each time.

FSM does what it can to spread the insertion load across multiple pages,
but of course this is not going to help much unless your table has lots
of embedded free space.  I think it would work pretty well on a table
with lots of update turnover, but not on an INSERT-only workload.

regards, tom lane



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Simon Riggs
On Thu, 2004-11-18 at 22:12, Tom Lane wrote:
 Josh Berkus [EMAIL PROTECTED] writes:
  Aside from foriegn keys, though, is there any way in which INSERT page 
  locks 
  could block other inserts?
 
 Not for longer than the time needed to physically add a tuple to a page.

The main problem on INSERTs is that it is usually the same few pages:
the lead data block and the lead index block. There are ways of
spreading the load out across an index, but I'm not sure what happens on
the leading edge of the data relation; I think it hits the same
block each time.

Only an issue if you have more than one CPU...

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Josh Berkus
Simon, Tom,

 The main problem on INSERTs is that it is usually the same few pages:
 the lead data block and the lead index block. There are ways of
 spreading the load out across an index, but I'm not sure what happens on
 the leading edge of the data relation, but I think it hits the same
 block each time.

I actually have several test cases for this, can you give me a trace or 
profile suggestion that would show if this is happening?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 The main problem on INSERTs is that it is usually the same few pages:
 the lead data block and the lead index block. There are ways of
 spreading the load out across an index, but I'm not sure what happens on
 the leading edge of the data relation, but I think it hits the same
 block each time.

 I actually have several test cases for this, can you give me a trace or 
 profile suggestion that would show if this is happening?

If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
would be the places showing contention delays.

It could also be that the contention is for the WALInsertLock, ie, the
right to stuff a WAL record into the shared buffers.  This effect would
be the same even if you were inserting into N separate tables.

regards, tom lane



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Simon Riggs
On Thu, 2004-11-18 at 22:51, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  The main problem on INSERTs is that it is usually the same few pages:
  the lead data block and the lead index block. There are ways of
  spreading the load out across an index, but I'm not sure what happens on
  the leading edge of the data relation, but I think it hits the same
  block each time.
 
 FSM does what it can to spread the insertion load across multiple pages,
 but of course this is not going to help much unless your table has lots
 of embedded free space.  I think it would work pretty well on a table
 with lots of update turnover, but not on an INSERT-only workload.

OK, that's what I thought.

So with a table with an INSERT-only workload, the FSM is always empty,
so there only ever is one block that gets locked. That means we can't
ever go faster than 1 CPU can go - any other CPUs will just wait for the
block lock. [In Josh's case, 32 INSERT streams won't go significantly
faster than about 4 streams, allowing for some overlap of other
operations]

Would it be possible to do the following: when a new block is allocated from
the relation file (rather than reused), we check the FSM - if it is empty, we
allocate 8 new blocks and add them all to the FSM. The next few INSERTers
will then use the FSM blocks normally.
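
A toy, self-contained model of that proposal (the real heap and FSM code is
much more involved: page initialization, WAL logging and locking are all
elided here, and none of the names below are actual PostgreSQL functions).
When the free-space map has nothing to offer, extend by a chunk of 8 blocks,
keep one block for this inserter, and publish the other 7 so concurrent
inserters don't have to queue on the extension lock.

    #define EXTEND_CHUNK 8              /* hypothetical batch size */

    typedef struct
    {
        unsigned nblocks;               /* current relation size in blocks */
        unsigned fsm[1024];             /* toy FSM: blocks with free space */
        unsigned fsm_top;
    } ToyRelation;

    static unsigned
    get_block_for_insert(ToyRelation *rel)
    {
        /* Prefer a block the FSM already knows about. */
        if (rel->fsm_top > 0)
            return rel->fsm[--rel->fsm_top];

        /*
         * FSM empty: extend by a whole chunk (extension lock elided), keep
         * the first new block for this inserter, and publish the rest so the
         * next few inserters can proceed without extending again.
         */
        {
            unsigned first = rel->nblocks;
            unsigned blk;

            rel->nblocks += EXTEND_CHUNK;
            for (blk = first + 1; blk < rel->nblocks; blk++)
                rel->fsm[rel->fsm_top++] = blk;
            return first;
        }
    }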

Doing that will definitely speed up DBT-2 and many other workloads. Many
tables have SERIAL defined, or use a monotonically increasing unique
key.

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Would it be possible to: when a new block is allocated from the relation
 file (rather than reused), we check the FSM - if it is empty, then we
 allocate 8 new blocks and add them all to the FSM. The next few
 INSERTers will then use the FSM blocks normally.

Most likely that would just shift the contention to the WALInsertLock.

regards, tom lane



Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Simon Riggs
On Thu, 2004-11-18 at 23:19, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Would it be possible to: when a new block is allocated from the relation
  file (rather than reused), we check the FSM - if it is empty, then we
  allocate 8 new blocks and add them all to the FSM. The next few
  INSERTers will then use the FSM blocks normally.
 
 Most likely that would just shift the contention to the WALInsertLock.

Well, removing any performance bottleneck shifts the bottleneck to
another place, though that is not an argument against removing it.

Can we subdivide the WALInsertLock so there are multiple entry points to
wal_buffers, based upon hashing the xid? That would allow wal to be
written sequentially by each transaction though slightly out of order
for different transactions. Commit/Abort would all go through the same
lock to guarantee serializability. 
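
A minimal sketch of just the lock-selection part of that idea, with invented
names; it deliberately ignores the hard part raised in the reply that follows,
namely that WAL still has to come out as a single, strictly ordered stream.

    #define NUM_WAL_INSERT_LOCKS 8      /* hypothetical number of partitions */

    typedef unsigned int ToyTransactionId;

    /*
     * Map a transaction to one of several WAL insert locks so that different
     * transactions could copy records into wal_buffers concurrently.  Commit
     * and abort records would still funnel through one lock for ordering.
     */
    static int
    wal_insert_lock_for_xid(ToyTransactionId xid)
    {
        return (int) (xid % NUM_WAL_INSERT_LOCKS);
    }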

-- 
Best Regards, Simon Riggs




Re: [Testperf-general] Re: [HACKERS] ExclusiveLock

2004-11-18 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Can we subdivide the WALInsertLock so there are multiple entry points to
 wal_buffers, based upon hashing the xid?

I don't think so; WAL is inherently a linear log.  (Awhile ago there was
some talk of nonlinear log writing to get around the one-commit-per-
disk-revolution syndrome, but the idea basically got rejected as
unworkably complicated.)  What's more, there are a lot of entries that
must remain time-ordered independently of transaction ownership.
Consider btree index page splits and sequence nextvals for two examples.

Certainly I'd not buy into any such project without incontrovertible
proof that it would solve a major bottleneck --- and right now we are
only speculating with no evidence.

regards, tom lane



Re: [HACKERS] ExclusiveLock

2004-11-09 Thread Simon Riggs
On Mon, 2004-11-08 at 21:37, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Recent runs of DBT-2 show very occasional ExclusiveLock (s) being held
  by transactions, sometimes waiting to be granted.
 
 I think you are right that these reflect heap or btree-index extension
 operations.  Those do not actually take locks on the *table* however,
 but locks on a single page within it (which are completely orthogonal to
 table locks and don't conflict).  The pg_locks output leaves something
 to be desired, because you can't tell the difference between table and
 page locks.

Good. Thought it was worth discussion...

 It's odd that your example does not appear to show someone else holding
 a conflicting lock.

There is. I didn't copy the whole lock table output... here it is:

    relname    |  pid  |       mode       | granted
---------------+-------+------------------+---------
 new_order     | 21735 | AccessShareLock  | t
 new_order     | 21735 | RowExclusiveLock | t
 orders        | 21715 | AccessShareLock  | t
 orders        | 21715 | RowExclusiveLock | t
 pg_class      | 23254 | AccessShareLock  | t
 order_line    | 21715 | AccessShareLock  | t
 order_line    | 21715 | RowExclusiveLock | t
 order_line    | 21735 | ExclusiveLock    | f
 new_order     | 21715 | AccessShareLock  | t
 new_order     | 21715 | RowExclusiveLock | t
 customer      | 21715 | AccessShareLock  | t
 pk_order_line | 21735 | AccessShareLock  | t
 pk_order_line | 21735 | RowExclusiveLock | t
 item          | 21715 | AccessShareLock  | t
 orders        | 21735 | AccessShareLock  | t
 orders        | 21735 | RowExclusiveLock | t
 order_line    | 21735 | AccessShareLock  | t
 order_line    | 21735 | RowExclusiveLock | t
 stock         | 21715 | AccessShareLock  | t
 stock         | 21715 | RowExclusiveLock | t
 order_line    | 21715 | ExclusiveLock    | t
 pk_order_line | 21715 | RowExclusiveLock | t
 pg_locks      | 23254 | AccessShareLock  | t
 district      | 21715 | AccessShareLock  | t
 district      | 21715 | RowShareLock     | t
 district      | 21715 | RowExclusiveLock | t
 warehouse     | 21715 | AccessShareLock  | t
 customer      | 21735 | AccessShareLock  | t
 customer      | 21735 | RowExclusiveLock | t
(29 rows)


Pids 21715 and 21735 are conflicting.

There's also another example where the lock table output is more than 1400
rows, with two lock requests pending.

The oprofile for this run looks like this: (but is not of course a
snapshot at a point in time, like the lock list)

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %app name symbol name
170746   42.7220  vmlinux-2.6.8.1-osdl2 ia64_pal_call_static
18934 4.7374  libc-2.3.2.so (no symbols)
10691 2.6750  postgres FunctionCall2
9814  2.4555  postgres hash_seq_search
8654  2.1653  postgres SearchCatCache
7389  1.8488  postgres AllocSetAlloc
6122  1.5318  postgres hash_search
5707  1.4279  postgres OpernameGetCandidates
4901  1.2263  postgres StrategyDirtyBufferList
4627  1.1577  postgres XLogInsert
4424  1.1069  postgres pglz_decompress
4371  1.0937  vmlinux-2.6.8.1-osdl2 __copy_user
3796  0.9498  vmlinux-2.6.8.1-osdl2 finish_task_switch
3483  0.8715  postgres LWLockAcquire
3458  0.8652  postgres eqjoinsel
3001  0.7509  vmlinux-2.6.8.1-osdl2 get_exec_dcookie
2824  0.7066  postgres AtEOXact_CatCache
2745  0.6868  postgres _bt_compare
2730  0.6831  postgres nocachegetattr
2715  0.6793  postgres SearchCatCacheList
2659  0.6653  postgres MemoryContextAllocZeroAligned
2604  0.6515  postgres yyparse
2553  0.6388  postgres eqsel
2127  0.5322  postgres deconstruct_array
1921  0.4806  postgres hash_any
1919  0.4801  postgres int4eq
1855  0.4641  postgres LWLockRelease
1839  0.4601  postgres StrategyBufferLookup
1777  0.4446  postgres GetSnapshotData
1729  0.4326  postgres heap_getsysattr
1595  0.3991  postgres DLMoveToFront
1586  0.3968  postgres MemoryContextAlloc
1485  0.3716  vmlinux-2.6.8.1-osdl2 try_atomic_semop
1455  0.3641  postgres anonymous symbol from section .plt
1409  0.3525  postgres lappend
1352  0.3383  postgres heap_release_fetch
1270  0.3178  postgres PinBuffer
1141  0.2855  postgres DirectFunctionCall1
1132  0.2832  postgres base_yylex
982   0.2457  postgres pgstat_initstats

Re: [HACKERS] ExclusiveLock

2004-11-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Recent runs of DBT-2 show very occasional ExclusiveLock (s) being held
 by transactions, sometimes waiting to be granted.

I think you are right that these reflect heap or btree-index extension
operations.  Those do not actually take locks on the *table* however,
but locks on a single page within it (which are completely orthogonal to
table locks and don't conflict).  The pg_locks output leaves something
to be desired, because you can't tell the difference between table and
page locks.

It's odd that your example does not appear to show someone else holding
a conflicting lock.

regards, tom lane
