Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-06 Thread Sergey Koposov

Hi,

On Mon, 4 Jun 2012, Ants Aasma wrote:

On Mon, Jun 4, 2012 at 7:44 PM, Merlin Moncure mmonc...@gmail.com wrote:
I tried to keep it simple at first to find an answer to the question
if it's even worth trying before expending large effort on it. If
anyone with a multisocket machine would chip in, I'd love to know how
this patch handles on larger machines. I think the most interesting
workloads are read-only loads with heavy buffer thrashing but inside OS
memory. Select-only pgbench with 32MB shared_buffers was within error
margin, although slightly faster on my machine (Intel i5-2500K). The
workload that I used to demonstrate gain was an aggregated index scan
to minimise other overheads.


I've quickly tested your lockfree-getbuffer.patch with the test case
you provided, and I barely see any improvement (2% at most):

https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U

Tested with 24 cores (48 HT cores, Xeon E7-4807). The tps vs. number of
threads curve looks weird, though.

Cheers,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-06 Thread Ants Aasma
On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 I've quickly tested your lockfree-getbuffer.patch with the test case
 you provided, and I barely see any improvement (2% at most):
 https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
 Tested with 24 cores (48 HT cores, Xeon E7-4807). The tps vs. number of
 threads curve looks weird, though.

Was this the range scan on the test table? (Sorry about the error in
the query; the x should really be id.) In that case the results look
really suspicious. My machine (4 cores, no HT, @ 4GHz, newer arch)
peaked at 90 tps with the stated configuration. Even when upping
shared_buffers and enabling indexonlyscan I didn't see more than about
540 tps per thread. The test is designed to exercise buffer eviction,
doing about 9800 buffer reads per transaction with 32MB of buffers.
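(For scale: with the default 8KB block size, 32MB of shared_buffers is
only 4096 buffers, so ~9800 buffer reads per transaction keep the clock
sweep running continuously.)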

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-06 Thread Sergey Koposov

On Wed, 6 Jun 2012, Ants Aasma wrote:


On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:

I've quickly tested your lockfree-getbuffer.patch with the test case
you provided, and I barely see any improvement (2% at most):
https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
Tested with 24 cores (48 HT cores, Xeon E7-4807). The tps vs. number of
threads curve looks weird, though.


Was this the range scan on the test table? (Sorry about the error in
the query; the x should really be id.) In that case the results look
really suspicious.


Yes, partially my fault: without much thought I put value instead of x
in the script. After replacing it with id, the tps numbers are much
smaller.


Here is the tps vs. nthreads data; I tested up to 10 threads on my
24-CPU system (I disabled HT, though):

https://docs.google.com/open?id=0B7koR68V2nM1Nk9OcWNJOTRrYVE

Your patch clearly improves the situation (the peak tps is ~10% higher),
but the general picture is the same: tps flattens out as nthreads grows.


Cheers,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-06 Thread Merlin Moncure
On Wed, Jun 6, 2012 at 2:53 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Wed, 6 Jun 2012, Ants Aasma wrote:

 On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov kopo...@ast.cam.ac.uk
 wrote:

 I've quickly tested your lockfree-getbuffer.patch with the test case
 you provided, and I barely see any improvement (2% at most):
 https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
 Tested with 24 cores (48 HT cores, Xeon E7-4807). The tps vs. number of
 threads curve looks weird, though.


 Was this the range scan on the test table? (Sorry about the error in
 the query; the x should really be id.) In that case the results look
 really suspicious.


 Yes, partially my fault: without much thought I put value instead of x
 in the script. After replacing it with id, the tps numbers are much
 smaller.

 Here is the tps vs. nthreads data; I tested up to 10 threads on my 24-CPU
 system (I disabled HT, though):
 https://docs.google.com/open?id=0B7koR68V2nM1Nk9OcWNJOTRrYVE

 Your patch clearly improves the situation (the peak tps is ~10% higher), but
 the general picture is the same: tps flattens out as nthreads grows.

I think this is the expected result.  In the single-user case the
spinlock never spins and only has to execute the cache-locking CPU
instructions once.  Can we see results @24 threads?

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-06 Thread Sergey Koposov

On Wed, 6 Jun 2012, Merlin Moncure wrote:

I think this is the expected result.  In the single-user case the
spinlock never spins and only has to execute the cache-locking CPU
instructions once.  Can we see results @24 threads?


Here 
https://docs.google.com/open?id=0B7koR68V2nM1NDJHLUhNSS0zbUk


*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-04 Thread Robert Haas
On Fri, Jun 1, 2012 at 9:55 PM, Ants Aasma a...@cybertec.at wrote:
 On Sat, Jun 2, 2012 at 1:48 AM, Merlin Moncure mmonc...@gmail.com wrote:
 Buffer pins aren't a cache: with a cache you are trying to mask a slow
 operation (like a disk i/o) with a faster one such that the number of
 slow operations is minimized.  Buffer pins however are very different in
 that we only care about contention on the reference count (the buffer
 itself is not locked!), which makes me suspicious that caching-type
 algorithms are the wrong place to be looking.  I think it comes down to
 picking between your relatively complex but general lock-displacement
 approach or a specific strategy based on known bottlenecks.

 I agree that pins aren't like a cache. I mentioned the caching
 algorithms because they work based on access frequency and highly
 contended locks are likely to be accessed frequently even from a
 single backend. However this only makes sense for the delayed
 unpinning method, and I also have come to the conclusion that it's not
 likely to work well. Besides delaying cleanup, the overhead for the
 common case of uncontended access is just too much.

 It seems to me that even the nailing approach will need a replacement
 algorithm. The local pins still need to be published globally and
 because shared memory size is fixed, the maximum amount of locally
 pinned nailed buffers needs to be limited as well.

 But anyway, I managed to completely misread the profile that Sergey
 gave. Somehow I missed that the time went into the retry TAS in s_lock
 instead of the inlined TAS. This shows that the issue isn't just
 cacheline ping-pong but cacheline stealing. This could be somewhat
 mitigated by making pinning lock-free. The Nb-GCLOCK paper that Robert
 posted earlier in another thread describes an approach for this. I
 have a WIP patch (attached) that makes the clock sweep lock-free in
 the common case. This patch gave a 40% performance increase for an
 extremely allocation heavy load running with 64 clients on a 4 core 1
 socket system, lesser gains across the board. Pinning has a shorter
 lock duration (and a different lock type) so the gain might be less,
 or it might be a larger problem and post a higher gain. Either way, I
 think the nailing approach should be explored further, cacheline
 ping-pong could still be a problem with higher number of processors
 and losing the spinlock also loses the ability to detect contention.

Not sure about the rest of this patch, but this part is definitely bogus:

+#if !defined(pg_atomic_fetch_and_set)
+#define pg_atomic_fetch_and_set(dst, src, value) \
+   do { S_LOCK(dummy_spinlock); \
+   dst = src; \
+   src = value; \
+   S_UNLOCK(dummy_spinlock); } while (0)
+#endif

Locking a dummy backend-local spinlock doesn't provide atomicity
across multiple processes.
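
The fallback needs a lock that every backend can see, i.e. an slock_t
living in shared memory.  A sketch of the fix (atomics_fallback_lock is
a hypothetical pointer, assumed to be initialized during shared-memory
setup):

extern volatile slock_t *atomics_fallback_lock; /* in shared memory */

#if !defined(pg_atomic_fetch_and_set)
#define pg_atomic_fetch_and_set(dst, src, value) \
   do { S_LOCK(atomics_fallback_lock); \
   (dst) = (src); \
   (src) = (value); \
   S_UNLOCK(atomics_fallback_lock); } while (0)
#endif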

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-04 Thread Ants Aasma
On Mon, Jun 4, 2012 at 5:12 PM, Robert Haas robertmh...@gmail.com wrote:
 Not sure about the rest of this patch, but this part is definitely bogus:

 +#if !defined(pg_atomic_fetch_and_set)
 +#define pg_atomic_fetch_and_set(dst, src, value) \
 +       do { S_LOCK(dummy_spinlock); \
 +       dst = src; \
 +       src = value; \
 +       S_UNLOCK(dummy_spinlock); } while (0)
 +#endif

 Locking a dummy backend-local spinlock doesn't provide atomicity
 across multiple processes.

Right, of course. I don't know why I assumed that dummy_spinlock would
be global. In any case, this is very WIP and doesn't even aspire to be
portable yet. The main point was to see if there's any significant
performance to be gained by this method.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-04 Thread Merlin Moncure
On Mon, Jun 4, 2012 at 9:20 AM, Ants Aasma a...@cybertec.at wrote:
 On Mon, Jun 4, 2012 at 5:12 PM, Robert Haas robertmh...@gmail.com wrote:
 Not sure about the rest of this patch, but this part is definitely bogus:

 +#if !defined(pg_atomic_fetch_and_set)
 +#define pg_atomic_fetch_and_set(dst, src, value) \
 +       do { S_LOCK(dummy_spinlock); \
 +       dst = src; \
 +       src = value; \
 +       S_UNLOCK(dummy_spinlock); } while (0)
 +#endif

 Locking a dummy backend-local spinlock doesn't provide atomicity
 across multiple processes.

 Right, of course. I don't know why I assumed that dummy_spinlock would
 be global. In any case, this is very WIP and doesn't even aspire to be
 portable yet. The main point was to see if there's any significant
 performance to be gained by this method.

yeah -- those are fallback routines in case the compiler primitives
don't exist.  I think I understand what Ants is doing here: he's
reducing the coverage of the free list lock to only cover actually
popping a buffer off the free list; it no longer covers the clock
sweep.  That's a massive win if it works.  In order to get away with
that he had to decompose all of the clock sweep's manipulations of the
strategy state into thread-safe atomic operations.

What happens (in the very unlikely, but possible case?) if another
backend races to the buffer you've pointed at with 'victim'?  It looks
like multiple backends share the clock sweep now, but don't you need
an extra test to ensure it's still a candidate victim buffer?

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-04 Thread Merlin Moncure
On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure mmonc...@gmail.com wrote:
 What happens (in the very unlikely, but possible case?) if another
 backend races to the buffer you've pointed at with 'victim'?  It looks
 like multiple backends share the clock sweep now, but don't you need
 an extra test to ensure it's still a candidate victim buffer?

Actually, I don't think you do: the existing check on refcount is
probably good enough.  Hm, why did you get rid of
BufferStrategyControl.lastFreeBuffer?

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-04 Thread Ants Aasma
On Mon, Jun 4, 2012 at 6:38 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure mmonc...@gmail.com wrote:
 What happens (in the very unlikely, but possible case?) if another
 backend races to the buffer you've pointed at with 'victim'?  It looks
 like multiple backends share the clock sweep now, but don't you need
 an extra test to ensure it's still a candidate victim buffer?

 Actually, I don't think you do: the existing check on refcount is
 probably good enough.  Hm, why did you get rid of
 BufferStrategyControl.lastFreeBuffer?

It was dead code as far as I could tell. That change isn't actually
relevant for this patch because free-list management is still
protected by a lock (except the initial unlocked test that is
double-checked under lock) and so doesn't need any adjustment.
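
(For reference, that pattern is the classic unlocked peek re-checked
under the lock; roughly this, as a sketch using the 9.2-era names:

    if (StrategyControl->firstFreeBuffer >= 0)      /* unlocked peek */
    {
        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
        if (StrategyControl->firstFreeBuffer >= 0)  /* re-check under lock */
        {
            /* ... actually pop the buffer off the free list ... */
        }
        LWLockRelease(BufFreelistLock);
    }

The unlocked read can be stale, but the re-check under the lock makes
that harmless.)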

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-04 Thread Merlin Moncure
On Mon, Jun 4, 2012 at 10:42 AM, Ants Aasma a...@cybertec.at wrote:
 On Mon, Jun 4, 2012 at 6:38 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure mmonc...@gmail.com wrote:
 What happens (in the very unlikely, but possible case?) if another
 backend races to the buffer you've pointed at with 'victim'?  It looks
 like multiple backends share the clock sweep now, but don't you need
 an extra test to ensure it's still a candidate victim buffer?

 Actually, I don't think you do: the existing check on refcount is
 probably good enough.  Hm, why did you get rid of
 BufferStrategyControl.lastFreeBuffer?

 It was dead code as far as I could tell. That change isn't actually
 relevant for this patch because free-list management is still
 protected by a lock (except the initial unlocked test that is
 double-checked under lock) and so doesn't need any adjustment.

I have to admit -- this is pretty cool.  The changes look pretty clean,
but I think it's going to need some benchmarking to prove it's
actually faster under all workloads, and some more review, though. If
there's heavy contention on getting victim buffers you're going to be
making a lot of atomic operations -- each one of which will lock down
the cache line.  A yielding lwlock is obviously going to have very
different behaviors under contention, so there needs to be a
performance test that really burns up the freelist.

If it's Well And Truly faster, you have to immediately start wondering
if the freelist lock can't be completely scrapped: buffer invalidation
only occurs in fairly rare events -- but when it does ISTM to be
fairly counterproductive.  In trying to avoid the sweep you end up
pushing everyone towards a locky free list.  I don't know about you,
but I'd prefer more consistent behavior (as in, not: drop a big
table and concurrency drops 50%).

Noted: this doesn't really address the OP's problem :-).  I bet it'd
supplement the nailing strategy nicely, however.

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Sergey Koposov

On Thu, 31 May 2012, Jeff Janes wrote:



No, idt_match is getting filled by multi-threaded copy() and then joined
with 4 other big tables like idt_phot. The result is then split into
partitions.


That does make things more complicated.  But could you partition
it at that level and then do the joins partition-wise?


Unfortunately, the information needed to determine which partition the
data should go into is only available from the idt_match table. So I
have to do at least one join with idt_match. But I will experiment
further with ways of performing the queries; I just stopped doing that
because I saw these scalability problems.



I don't have much experience at data partitioning (well, I do, but the
experience is with partitioning in Perl with terabytes of flat files,
not in PG :) ) but I think that once you have your partitioning keys
you want to apply them the same way up and down the data set.


I'm not sure what you mean by "the same way up and down the data set".

Cheers,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Florian Pflug
On May31, 2012, at 20:50 , Robert Haas wrote:
 Suppose we introduce two new buffer flags,
 BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
 on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
 the buffer can't be evicted until that flag is removed, and backends
 are permitted to record pins in a per-backend area protected by a
 per-backend spinlock or lwlock, rather than in the buffer header.
 When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
 point, it's no longer permissible to record new pins in the
 per-backend areas, but old ones may still exist.  So then we scan all
 the per-backend areas and transfer the pins to the buffer header, or
 else just wait until no more exist; then, we clear both BUF_NAILED and
 BUF_NAIL_REMOVAL.

A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
by queueing UnpinBuffer() requests for a while before actually updating
shared state.

I'm imagining having a small unpin queue with, say, 32 entries in
backend-local memory. When we unpin a buffer, we add the buffer at the
front of the queue. If the queue overflows, we dequeue a buffer from the
back of the queue and actually call UnpinBuffer(). If PinBuffer() is called
for a queued buffer, we simply remove the buffer from the queue.

We'd drain the unpin queue whenever we don't expect a PinBuffer() request
to happen for a while. Returning to the main loop is an obvious such place,
but there might be others. We could, for example, drain the queue every time
we block on a lock or signal, and maybe also before we go do I/O. Or, we
could have one such queue per resource owner, and drain it when we release
the resource owner.

We already avoid calling PinBuffer() multiple times for multiple overlapping
pins of a single buffer by a single backend. The strategy above would extend
that to not-quite-overlapping pins.
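
In pseudo-C, the bookkeeping would look roughly like this (a
self-contained sketch with made-up names; ReallyPinBuffer() and
ReallyUnpinBuffer() stand in for the buffer manager's real
PinBuffer()/UnpinBuffer()):

#include <string.h>

#define UNPIN_QUEUE_SIZE 32
typedef int Buffer;

/* Backend-local FIFO of deferred unpins; newest entry at index 0. */
static Buffer queue[UNPIN_QUEUE_SIZE];
static int    nqueued = 0;

extern void ReallyUnpinBuffer(Buffer buf);
extern void ReallyPinBuffer(Buffer buf);

/* UnpinBuffer(): don't release the pin yet, just remember it. */
void
DeferredUnpin(Buffer buf)
{
    if (nqueued == UNPIN_QUEUE_SIZE)
        ReallyUnpinBuffer(queue[--nqueued]);    /* overflow: unpin oldest */
    memmove(&queue[1], &queue[0], nqueued * sizeof(Buffer));
    queue[0] = buf;
    nqueued++;
}

/* PinBuffer(): a queued buffer is still pinned, so just dequeue it. */
void
DeferredPin(Buffer buf)
{
    for (int i = 0; i < nqueued; i++)
    {
        if (queue[i] == buf)
        {
            memmove(&queue[i], &queue[i + 1],
                    (nqueued - i - 1) * sizeof(Buffer));
            nqueued--;
            return;             /* reuse the pin we never gave up */
        }
    }
    ReallyPinBuffer(buf);       /* not queued: take a real pin */
}

/* Drain point: no PinBuffer() expected soon (e.g. back in the main loop). */
void
DrainUnpinQueue(void)
{
    while (nqueued > 0)
        ReallyUnpinBuffer(queue[--nqueued]);
}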

best regards,
Florian Pflug




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Merlin Moncure
On Fri, Jun 1, 2012 at 7:47 AM, Florian Pflug f...@phlo.org wrote:
 On May31, 2012, at 20:50 , Robert Haas wrote:
 Suppose we introduce two new buffer flags,
 BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
 on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
 the buffer can't be evicted until that flag is removed, and backends
 are permitted to record pins in a per-backend area protected by a
 per-backend spinlock or lwlock, rather than in the buffer header.
 When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
 point, it's no longer permissible to record new pins in the
 per-backend areas, but old ones may still exist.  So then we scan all
 the per-backend areas and transfer the pins to the buffer header, or
 else just wait until no more exist; then, we clear both BUF_NAILED and
 BUF_NAIL_REMOVAL.

 A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
 by queueing UnpinBuffer() requests for a while before actually updating
 shared state.

 I'm imagining having a small unpin queue with, say, 32 entries in
 backend-local memory. When we unpin a buffer, we add the buffer at the
 front of the queue. If the queue overflows, we dequeue a buffer from the
 back of the queue and actually call UnpinBuffer(). If PinBuffer() is called
 for a queued buffer, we simply remove the buffer from the queue.

 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others. We could, for example, drain the queue every time
 we block on a lock or signal, and maybe also before we go do I/O. Or, we
 could have one such queue per resource owner, and drain it when we release
 the resource owner.

 We already avoid calling PinBuffer() multiple times for multiple overlapping
 pins of a single buffer by a single backend. The strategy above would extend
 that to not-quite-overlapping pins.

A potential issue with this line of thinking is that your pin delay
queue could get highly pressured by outer portions of the query (as in
the OP's case)  that will get little or no benefit from the delayed
pin.  But choosing a sufficiently sized drain queue would work for
most reasonable cases, assuming 32 isn't enough.  Why not something
much larger, for example the lesser of 1024 and (NBuffers * .25) /
max_connections?  In other words, to get much benefit you have to pin
the buffer sufficiently more often than 1/N of the time among all
buffers.
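
(For example, with shared_buffers = 1GB -- NBuffers = 131072 at the
default 8KB page size -- and max_connections = 100, that formula gives
min(1024, 131072 * .25 / 100) ≈ 327 queue entries per backend.)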

Or, maybe you only put contended buffers in the delay queue, by
watching for the delay count that would be returned from s_lock.
That forces a lookup on each pin, though.

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 A potential issue with this line of thinking is that your pin delay
 queue could get highly pressured by outer portions of the query (as in
 the OP's case)  that will get little or no benefit from the delayed
 pin.  But choosing a sufficiently sized drain queue would work for
 most reasonable cases, assuming 32 isn't enough.  Why not something
 much larger, for example the lesser of 1024 and (NBuffers * .25) /
 max_connections?  In other words, to get much benefit you have to pin
 the buffer sufficiently more often than 1/N of the time among all
 buffers.

Allowing each backend to pin a large fraction of shared buffers sounds
like a seriously bad idea to me.  That's just going to increase
thrashing of what remains.

More generally, I don't believe that we have any way to know which
buffers would be good candidates to keep pinned for a long time.
Typically, we don't drop the pin in the first place if we know we're
likely to touch that buffer again soon.  btree root pages might be an
exception, but I'm not even convinced of that one.

regards, tom lane



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Florian Pflug
On Jun1, 2012, at 15:45 , Tom Lane wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 A potential issue with this line of thinking is that your pin delay
 queue could get highly pressured by outer portions of the query (as in
 the OP's case)  that will get little or no benefit from the delayed
 pin.  But choosing a sufficiently sized drain queue would work for
 most reasonable cases, assuming 32 isn't enough.  Why not something
 much larger, for example the lesser of 1024 and (NBuffers * .25) /
 max_connections?  In other words, to get much benefit you have to pin
 the buffer sufficiently more often than 1/N of the time among all
 buffers.
 
 Allowing each backend to pin a large fraction of shared buffers sounds
 like a seriously bad idea to me.  That's just going to increase
 thrashing of what remains.

Right, that was one of the motivations for suggesting the small queue.
At least that way, the number of buffers optimistically pinned by each
backend is limited.

The other was that once the outer portions plough through more than
a few pages per iteration of the sub-plan, the cost of doing that should
dominate the cost of pinning and unpinning.

 More generally, I don't believe that we have any way to know which
 buffers would be good candidates to keep pinned for a long time.

I'd think that pinning a buffer which we've only recently unpinned
is a pretty good indication that the same thing will happen again.

My proposed algorithm could be made to use exactly that criterion
by tracking a little bit more state. We'd have to tag queue entries
with a flag indicating whether they are:

  Unpinned (COLD)

  Pinned, and unpinning should be delayed (HOT)

  Waiting to be unpinned (LUKEWARM)

UnpinBuffer() would check if the buffer is HOT, and if so add it to
the queue with flag LUKEWARM. Otherwise, it'd get immediately
unpinned and flagged as COLD (adding it to the queue if necessary).
PinBuffer() would pin the buffer and mark it as HOT if it was COLD,
and just mark it as HOT if it was LUKEWARM. If the buffer isn't on
the queue already, PinBuffer() would simply pin it and be done.

This would give the following behaviour for a buffer that is pinned
repeatedly

  PinBuffer(): not on queue -> not on queue (refcount incremented)
  UnpinBuffer(): not on queue -> COLD (refcount decremented)
  ...
  PinBuffer(): COLD -> HOT (refcount incremented)
  UnpinBuffer(): HOT -> LUKEWARM (refcount *not* decremented)
  ...
  PinBuffer(): LUKEWARM -> HOT (refcount *not* incremented)
  UnpinBuffer(): HOT -> LUKEWARM (refcount *not* decremented)
  ...
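
In code, the transitions would look roughly like this (a sketch with
hypothetical types; the queue insertion/removal bookkeeping from my
earlier mail is omitted):

#include <stdbool.h>

typedef enum { COLD, HOT, LUKEWARM } PinTemp;

typedef struct
{
    int     buf;                /* buffer id */
    PinTemp temp;
} QueueEntry;

/* On PinBuffer(): e is the buffer's queue entry, or NULL if not queued.
 * Returns true if the shared refcount really needs incrementing. */
static bool
pin_transition(QueueEntry *e)
{
    if (e == NULL)
        return true;            /* not on queue: take a real pin */
    if (e->temp == COLD)
    {
        e->temp = HOT;          /* repinned soon after unpin: delay next unpin */
        return true;            /* it really was unpinned, so re-pin */
    }
    e->temp = HOT;              /* LUKEWARM: the pin was never released */
    return false;
}

/* On UnpinBuffer(): returns true if the shared refcount really needs
 * decrementing (the caller then flags the entry COLD). */
static bool
unpin_transition(QueueEntry *e)
{
    if (e != NULL && e->temp == HOT)
    {
        e->temp = LUKEWARM;     /* delay the unpin */
        return false;
    }
    return true;                /* really unpin; entry becomes COLD */
}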

 Typically, we don't drop the pin in the first place if we know we're
 likely to touch that buffer again soon. btree root pages might be an
 exception, but I'm not even convinced of that one.

But Sergey's use-case pretty convincingly shows that, more generally,
inner sides of a nested loop join are also an exception, no? At least
if the inner side is either an index scan, or a seqscan of a really
small table.

best regards,
Florian Pflug




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Merlin Moncure
On Fri, Jun 1, 2012 at 8:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 A potential issue with this line of thinking is that your pin delay
 queue could get highly pressured by outer portions of the query (as in
 the OP's case)  that will get little or no benefit from the delayed
 pin.  But choosing a sufficiently sized drain queue would work for
 most reasonable cases, assuming 32 isn't enough.  Why not something
 much larger, for example the lesser of 1024 and (NBuffers * .25) /
 max_connections?  In other words, to get much benefit you have to pin
 the buffer sufficiently more often than 1/N of the time among all
 buffers.

 Allowing each backend to pin a large fraction of shared buffers sounds
 like a seriously bad idea to me.  That's just going to increase
 thrashing of what remains.

By 'large fraction', you mean 25%?  You could always set it lower, say
5%.  But if you can be smarter about which buffers to put in, I agree:
a smaller queue is better.

 More generally, I don't believe that we have any way to know which
 buffers would be good candidates to keep pinned for a long time.
 Typically, we don't drop the pin in the first place if we know we're
 likely to touch that buffer again soon.  btree root pages might be an
 exception, but I'm not even convinced of that one.

Why not (unless Florian's warming concept is a better bet) hook it to
spinlock contention? That's what we're trying to avoid, after all.
s_lock can be modified to return whether it had to delay.  PinBuffer
could watch for that and stick the buffer in the queue.  Both Florian's
idea (AIUI) and an s_lock-based implementation require you to search
your local queue on every pin/unpin, which I think is the real cost.
Robert's doesn't, although it is a more complicated approach.
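
For example (pure sketch: today SpinLockAcquire() returns nothing, and
remember_contended() is made up):

    /* hypothetical: SpinLockAcquire() returns the number of delays */
    int delays = SpinLockAcquire(&buf->buf_hdr_lock);
    /* ... bump the local and shared pin counts as PinBuffer does ... */
    SpinLockRelease(&buf->buf_hdr_lock);

    if (delays > 0)
        remember_contended(buf);    /* candidate for the delay queue */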

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Ants Aasma
On Fri, Jun 1, 2012 at 5:57 PM, Florian Pflug f...@phlo.org wrote:
 My proposed algorithm could be made to use exactly that criterion
 by tracking a little bit more state. We'd have to tag queue entries
 with a flag indicating whether they are

  Unpinned (COLD)

  Pinned, and unpinning should be delayed (HOT)

  Waiting to be unpinned (LUKEWARM)

This sounds awfully similar to the LIRS/CLOCK-Pro cache replacement
algorithms. They manage buffers by reuse distance based on the last two
accesses. Because both algorithms demonstrate very good performance
over a wide range of cache sizes and workloads, it might be worth
exploring how they could be applicable here.

On the other hand, I agree with Merlin that searching the local queue
for every pin could be too heavy. Robert's approach amounts to
categorizing some buffers as so hot that we basically lift them out
of the regular replacement-algorithm management and don't even bother
tracking their usage, on the grounds that it will be cheaper to figure
out their usage state after the fact. This might have some interesting
interactions with replacement algorithms that naturally keep separate
tiers of buffers. I think I'm going to try to see where that train of
thought takes me.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Robert Haas
On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug f...@phlo.org wrote:
 A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
 by queueing UnpinBuffer() requests for a while before actually updating
 shared state.

So, what happens when somebody wants a cleanup lock on the buffer
you've decided to keep pinned?  We have this problem already; I'm wary
of making it worse.

 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others.

However, on a workload like pgbench -S, dropping the pin when you
return to the main loop would render the optimization useless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Jeff Janes
On Fri, Jun 1, 2012 at 10:51 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug f...@phlo.org wrote:

 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others.

 However, on a workload like pgbench -S, dropping the pin when you
 return to the main loop would render the optimization useless.

But do we need the optimization on a workload like pgbench -S?  You
did some pretty massive scalability tests on that front, and I don't
think this problem was identified in them.

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Florian Pflug
On Jun1, 2012, at 19:51 , Robert Haas wrote:
 On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug f...@phlo.org wrote:
 A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
 by queueing UnpinBuffer() requests for a while before actually updating
 shared state.
 
 So, what happens when somebody wants a cleanup lock on the buffer
 you've decided to keep pinned?  We have this problem already; I'm wary
 of making it worse.

She has to wait, or skip the cleanup. We already keep buffers pinned for
potentially a long time -- if cursors enter the game, even over multiple
commands, I think. And even without cursors, a nestloop plan can keep
buffers used by the outer sub-plan pinned while the inner plan is
executing. So giving the inner plan the chance to do the same seems not
unreasonable.

Plus, the more refined version of that idea from my reply to Tom only
delays the unpinning after the buffer was pinned twice within a short
period of time. After that, it seems to me that the probability of it
getting pinned again is high enough to warrant slightly penalizing a
potential cleanup.

 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others.
 
 However, on a workload like pgbench -S, dropping the pin when you
 return to the main loop would render the optimization useless.

Well, we could drain the queue only if the read() from the socket blocks.
But does pgbench -S really manage to cause significant spinlock contention
due to buffer pinning/unpinning? I'd expect other things to dominate there...

best regards,
Florian Pflug





Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Robert Haas
On Fri, Jun 1, 2012 at 2:06 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Jun 1, 2012 at 10:51 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug f...@phlo.org wrote:

 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others.

 However, on a workload like pgbench -S, dropping the pin when you
 return to the main loop would render the optimization useless.

 But do we need the optimization on a workload like pgbench -S?  You
 did some pretty massive scalability tests on that front, and I don't
 think this problem was identified in them.

I don't know if I mentioned it on list, but I believe there is an
effect, and I believe it will get stronger as we continue to make
improvements in other areas, and as we continue to ramp up the number
of cores.   Even now, you can get s_lock taking 1% of the CPU time,
which is not trivial, and not all of that is from buffer-pinning, but
it's in the mix, so I'm kind of reluctant to accept a solution that
involves assuming that the problem always has to occur within a
transaction rather than across transactions.  I think it's a dead-end
anyway on other grounds: we ought to be trying to find ways to hold
buffer pins for less time, not more time.

Another thought is that if the problem is limited to the root index
block, then we could consider less general solutions, like allowing
backends to cache the root index block and sending some kind of
invalidation when it gets split.  But I don't know that the root index
block is the only place where we can have this problem, so I'm a
little nervous about that angle, too.  Still, it might be worth some
research.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Robert Haas
On Fri, Jun 1, 2012 at 2:54 PM, Florian Pflug f...@phlo.org wrote:
 On Jun1, 2012, at 19:51 , Robert Haas wrote:
 On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug f...@phlo.org wrote:
 A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
 by queueing UnpinBuffer() requests for a while before actually updating
 shared state.

 So, what happens when somebody wants a cleanup lock on the buffer
 you've decided to keep pinned?  We have this problem already; I'm wary
 of making it worse.

 She has to wait, or skip the cleanup. We already keep buffers pinned for
 potentially a long time - if cursors enter the game even over multiple
 commands, I think. And even without cursors, a nestloop plan can keep
 buffers used by the outer sub-plan pinned while the inner plan is
 executing. So giving the inner plan the chance to do the same seems not
 unreasonable.

Maybe.  TBH, I'm unhappy about how easy it is to leave pins floating
around and would like to reduce that further, not increase it.

 Plus, the more refined version of that idea from my reply to Tom only
 delays the unpinning after the buffer was pinned twice within a short
 period of time. After that, it seems to me that the probability of it
 getting pinned again is high enough to warrant slightly penalizing a
 potential cleanup.

 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others.

 However, on a workload like pgbench -S, dropping the pin when you
 return to the main loop would render the optimization useless.

 Well, we could drain the queue only if the read() from the socket blocks.

I believe that it's going to block frequently on this type of
workload.  pgbench isn't fast enough to get the next query over there
by the time postgres is ready for the next query.

 But does pgbench -S really manage to cause significant spinlock contention
 due to buffer pinning/unpinning? I'd expect other things to dominate there...

See previous note.  It may not be the biggest effect, but I think it's
in the mix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Another thought is that if the problem is limited to the root index
 block, then we could consider less general solutions, like allowing
 backends to cache the root index block and sending some kind of
 invalidation when it gets split.

Possibly worth noting here that we already do exactly that for btree
metapages.  However, I believe that case is a bit different, in that
using slightly stale metapage data is nonfatal.  Also, anything like
this can be a win only if the cached data can be expected to have pretty
long lifetime; so you would not want to cache the root page until the
index had gotten large enough that root page changes were infrequent.
Not sure how that could be mechanized ...

regards, tom lane



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Florian Pflug
On Jun1, 2012, at 21:07 , Robert Haas wrote:
 On Fri, Jun 1, 2012 at 2:54 PM, Florian Pflug f...@phlo.org wrote:
 On Jun1, 2012, at 19:51 , Robert Haas wrote:
 On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug f...@phlo.org wrote:
 We'd drain the unpin queue whenever we don't expect a PinBuffer() request
 to happen for a while. Returning to the main loop is an obvious such place,
 but there might be others.
 
 However, on a workload like pgbench -S, dropping the pin when you
 return to the main loop would render the optimization useless.
 
 Well, we could drain the queue only if the read() from the socket blocks.
 
 I believe that it's going to block frequently on this type of
 workload.  pgbench isn't fast enough to get the next query over there
 by the time postgres is ready for the next query.
 
 But does pgbench -S really manage to cause significant spinlock contention
 due to buffer pinning/unpinning? I'd expect other things to dominate there...
 
 See previous note.  It may not be the biggest effect, but I think it's
 in the mix.

Ok, now you've lost me. If the read() blocks, how on earth can a few
additional pins/unpins ever account for any meaningful execution time?

It seems to me that once read() blocks we're talking about a delay on the
order of the scheduling granularity (i.e., milliseconds, in the best case),
while even in the worst case pinning a buffer shouldn't take more than
1000 cycles (for comparison, I think a cache miss across all layers costs
a few hundred cycles). So there are at the very least 3 orders of magnitude
between those two...

best regards,
Florian Pflug




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Robert Haas
On Fri, Jun 1, 2012 at 3:16 PM, Florian Pflug f...@phlo.org wrote:
 Ok, now you've lost me. If the read() blocks, how on earth can a few
 additional pins/unpins ever account for any meaningful execution time?

 It seems to me that once read() blocks we're talking about a delay on the
 order of the scheduling granularity (i.e., milliseconds, in the best case),
 while even in the worst case pinning a buffer shouldn't take more than
 1000 cycles (for comparison, I think a cache miss across all layers costs
 a few hundred cycles). So there are at the very least 3 orders of magnitude
 between those two...

I'm not sure what you want me to say here.  s_lock shows up in the
profile, and some of that is from PinBuffer.  I think any detectable
number of calls to s_lock is a sign of Bad Things (TM).  I can't
reproduce anything as severe as what the OP is seeing, but what does
that prove?  In a couple years we'll have systems with 128 cores
floating around, and things that are minor problems at 32 or even 64
cores will be crippling at 128 cores.  IME, spinlock contention has a
very sharp tipping point.  It's only a minor annoyance and then you
hit some threshold number of cores and, bam, you're spending 70-90% of
your time across all cores fighting over that one spinlock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Merlin Moncure
On Fri, Jun 1, 2012 at 3:40 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 1, 2012 at 3:16 PM, Florian Pflug f...@phlo.org wrote:
 Ok, now you've lost me. If the read() blocks, how on earth can a few
 additional pins/unpins ever account for any meaningful execution time?

 It seems to me that once read() blocks we're talking about a delay on the
 order of the scheduling granularity (i.e., milliseconds, in the best case),
 while even in the worst case pinning a buffer shouldn't take more than
 1000 cycles (for comparison, I think a cache miss across all layers costs
 a few hundred cycles). So there are at the very least 3 orders of magnitude
 between those two...

 I'm not sure what you want me to say here.  s_lock shows up in the
 profile, and some of that is from PinBuffer.  I think any detectable
 number of calls to s_lock is a sign of Bad Things (TM).  I can't
 reproduce anything as severe as what the OP is seeing, but what does
 that prove?  In a couple years we'll have systems with 128 cores
 floating around, and things that are minor problems at 32 or even 64
 cores will be crippling at 128 cores.  IME, spinlock contention has a
 very sharp tipping point.  It's only a minor annoyance and then you
 hit some threshold number of cores and, bam, you're spending 70-90% of
 your time across all cores fighting over that one spinlock.

I think your approach, nailing buffers, is really the way to go.  It
nails buffers based on detected contention, which is very desirable --
uncontended spinlocks aren't broken and don't need to be fixed.  It
also doesn't add overhead in the general case, whereas a side-by-side
backend queue does.

Another nice aspect is that you're not changing the lifetime of the
pin as the backend sees it but storing the important stuff (the
interplay with usage_count is a nice touch) on the buffer itself --
you want to keep as little as possible on the backend-private side, and
your patch does that; it's more amenable to 3rd-party intervention
(flush your buffers right now!) than extended pins.  It exploits the
fact that pins can overlap and that the reference count is useless if
the buffer is always in memory anyway.  It immediately self-corrects
when the first backend gripes, whereas a per-backend solution will
grind down as each backend independently determines it's got a problem
-- not pleasant if your case is 'walking' a set of buffers.

Buffer pins aren't a cache: with a cache you are trying to mask a slow
operation (like a disk i/o) with a faster one such that the number of
slow operations is minimized.  Buffer pins however are very different in
that we only care about contention on the reference count (the buffer
itself is not locked!), which makes me suspicious that caching-type
algorithms are the wrong place to be looking.  I think it comes down to
picking between your relatively complex but general lock-displacement
approach or a specific strategy based on known bottlenecks.

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-06-01 Thread Ants Aasma
On Sat, Jun 2, 2012 at 1:48 AM, Merlin Moncure mmonc...@gmail.com wrote:
 Buffer pins aren't a cache: with a cache you are trying to mask a slow
 operation (like a disk i/o) with a faster one such that the number of
 slow operations is minimized.  Buffer pins however are very different in
 that we only care about contention on the reference count (the buffer
 itself is not locked!), which makes me suspicious that caching-type
 algorithms are the wrong place to be looking.  I think it comes down to
 picking between your relatively complex but general lock-displacement
 approach or a specific strategy based on known bottlenecks.

I agree that pins aren't like a cache. I mentioned the caching
algorithms because they work based on access frequency and highly
contended locks are likely to be accessed frequently even from a
single backend. However this only makes sense for the delayed
unpinning method, and I also have come to the conclusion that it's not
likely to work well. Besides delaying cleanup, the overhead for the
common case of uncontended access is just too much.

It seems to me that even the nailing approach will need a replacement
algorithm. The local pins still need to be published globally and
because shared memory size is fixed, the maximum amount of locally
pinned nailed buffers needs to be limited as well.

But anyway, I managed to completely misread the profile that Sergey
gave. Somehow I missed that the time went into the retry TAS in s_lock
instead of the inlined TAS. This shows that the issue isn't just
cacheline ping-pong but cacheline stealing. This could be somewhat
mitigated by making pinning lock-free. The Nb-GCLOCK paper that Robert
posted earlier in another thread describes an approach for this. I
have a WIP patch (attached) that makes the clock sweep lock-free in
the common case. This patch gave a 40% performance increase for an
extremely allocation heavy load running with 64 clients on a 4 core 1
socket system, lesser gains across the board. Pinning has a shorter
lock duration (and a different lock type) so the gain might be less,
or it might be a larger problem and post a higher gain. Either way, I
think the nailing approach should be explored further, cacheline
ping-pong could still be a problem with higher number of processors
and losing the spinlock also loses the ability to detect contention.
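
The gist of the lock-free common case is something like this (a
self-contained sketch using C11 atomics rather than the patch's actual
code, which has to fall back to spinlocks where no primitives exist):

#include <stdatomic.h>

#define NBUFFERS 4096               /* for illustration */

typedef struct
{
    atomic_int refcount;            /* pin count */
    atomic_int usage_count;         /* GCLOCK usage counter */
} BufDesc;

static BufDesc     buffers[NBUFFERS];
static atomic_uint clock_hand;      /* shared clock sweep position */

/*
 * Pick a victim candidate without holding BufFreelistLock.  The caller
 * must still lock the buffer header and re-check before evicting,
 * since another backend may pin the buffer after we looked at it.
 */
static int
clock_sweep_candidate(void)
{
    for (;;)
    {
        unsigned pos = atomic_fetch_add(&clock_hand, 1) % NBUFFERS;
        BufDesc *buf = &buffers[pos];

        if (atomic_load(&buf->refcount) != 0)
            continue;               /* pinned: keep sweeping */

        /* Atomically decrement usage_count toward zero. */
        int usage = atomic_load(&buf->usage_count);
        while (usage > 0 &&
               !atomic_compare_exchange_weak(&buf->usage_count,
                                             &usage, usage - 1))
            ;                       /* `usage` is reloaded on failure */

        if (usage == 0)
            return (int) pos;       /* candidate victim */
        /* else: usage was decremented; move on to the next buffer */
    }
}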


Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


lockfree-getbuffer.patch
Description: Binary data



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Robert Haas
On Thu, May 31, 2012 at 7:31 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Wed, 30 May 2012, Robert Haas wrote:

 I'd really like to find out exactly where all those s_lock calls are
 coming from.  Is there any way you can get oprofile to output a
 partial stack backtrace?  If you have perf it's very easy, just 'perf
 record -g -a <command to launch your test case>' and then 'perf report
 -g'.


 I repeated my test with 8 threads (without tasksetting) and with
 shared_buffers=48g (because that seemed to trigger particularly long times,
 ~80 seconds). I attach the perf report.

Thanks.  How did you generate this perf report?  It's cool, because I
haven't figured out how to make perf generate a report that is easily
email-able, and it seems you have.

The only trouble is that there's no call stack information here for
s_lock or PinBuffer, which is what I really want.  It seems to have
spit out call stack information only for the kernel functions, and not
for user functions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Robert Haas
On Thu, May 31, 2012 at 11:23 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 31 May 2012, Robert Haas wrote:

 Thanks.  How did you generate this perf report?  It's cool, because I
 haven't figured out how to make perf generate a report that is easily
 email-able, and it seems you have.

 I did pretty much what you said, e.g.
 attached it to the running process with
 perf record -g -p PID
 and then
 perf report -g > output

Ah, interesting.  I never tried sending the output to a file.

 And postgresql was compiled with cflags=-g

 The only trouble is that there's no call stack information here for
 s_lock or PinBuffer, which is what I really want.  It seems to have
 spit out call stack information only for the kernel functions, and not
 for user functions.

Yes, I forgot to clean the old binaries when I recompiled with cflags=-g.
So now it is fixed. I attach the updated perf report (only the first
lines of it, to reduce the file size).

Oh, ho.  So from this we can see that the problem is that we're
getting huge amounts of spinlock contention when pinning and unpinning
index pages.

It would be nice to have a self-contained reproducible test case for
this, so that we could experiment with it on other systems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Merlin Moncure
On Thu, May 31, 2012 at 10:23 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 31 May 2012, Robert Haas wrote:


 Thanks.  How did you generate this perf report?  It's cool, because I
 haven't figured out how to make perf generate a report that is easily
 email-able, and it seems you have.


 I did pretty much what you said, e.g.
 attached it to the running process with
 perf record -g -p PID
 and then
 perf report -g > output

 And postgresql was compiled with cflags=-g


 The only trouble is that there's no call stack information here for
 s_lock or PinBuffer, which is what I really want.  It seems to have
 spit out call stack information only for the kernel functions, and not
 for user functions.


 Yes, I forgot to clean the old binaries when I recompiled with cflags=-g.
 So now it is fixed. I attach the updated perf report (only the first
 lines of it, to reduce the file size).

That's basically what we needed.  The sequential scans driving the
index scans are all simultaneously pointing at the same couple of
pages in the index.  They are constantly pinning and unpinning -- the
database is schizophrenically going back and forth between the key
pages in the index being and not being allowed to be candidates for
buffer eviction.  Raising shared buffers doesn't help because it's
the marking of buffers as available for eviction, not the eviction
process itself, that is the problem.  IOS doesn't help because it's
the index relation's buffers, not the heap buffers, that we are
binding up on (although if the btree were fixed it's entirely possible
the problem could head right back to the heap for non-IOS scans).

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Sergey Koposov

On Thu, 31 May 2012, Robert Haas wrote:


Oh, ho.  So from this we can see that the problem is that we're
getting huge amounts of spinlock contention when pinning and unpinning
index pages.

It would be nice to have a self-contained reproducible test case for
this, so that we could experiment with it on other systems.


I created it a few days ago:
http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php

It is still valid, and it is exactly what I'm using to test. The only thing
to change is to create a two-column index and drop the other index.

The scripts are precisely the ones I'm using now.

The problem is that in order to see a really big slowdown (10 times slower
than a single thread) I had to raise shared_buffers to 48g, but it was
slow for smaller shared_buffers settings as well.


But I'm not sure how sensitive the test is to the hardware.

Cheers,
S



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Merlin Moncure
On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 [...]

 But I'm not sure how sensitive the test is to the hardware.

It's not: high contention on spinlocks is going to suck no matter what
hardware you have.   I think the problem is pretty obvious now: any
case where multiple backends are scanning the same sequence of buffers
in a very tight loop is going to display this behavior.  It doesn't
come up that often: it takes a pretty unusual sequence of events to
get a bunch of backends hitting the same buffer like that.

Hm, I wonder if you could alleviate the symptoms by making
Pin/UnpinBuffer smarter, so that frequently pinned buffers could stay
pinned longer -- kinda as if your private ref count was hacked to be
higher in that case.   It would be a complex fix for a narrow issue
though.

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Robert Haas
On Thu, May 31, 2012 at 2:03 PM, Merlin Moncure mmonc...@gmail.com wrote:
 [...]

 It's not: high contention on spinlocks is going to suck no matter what
 hardware you have.   I think the problem is pretty obvious now: any
 case where multiple backends are scanning the same sequence of buffers
 in a very tight loop is going to display this behavior.  It doesn't
 come up that often: it takes a pretty unusual sequence of events to
 get a bunch of backends hitting the same buffer like that.

 Hm, I wonder if you could alleviate the symptoms by making
 Pin/UnpinBuffer smarter, so that frequently pinned buffers could stay
 pinned longer -- kinda as if your private ref count was hacked to be
 higher in that case.   It would be a complex fix for a narrow issue
 though.

This test case is unusual because it hits a whole series of buffers
very hard.  However, there are other cases where this happens on a
single buffer that is just very, very hot, like the root block of a
btree index, where the pin/unpin overhead hurts us.  I've been
thinking about this problem for a while, but it hasn't made it up to
the top of my priority list, because workloads where pin/unpin is the
dominant cost are still relatively uncommon.  I expect them to get
more common as we fix other problems.

Anyhow, I do have some vague thoughts on how to fix this.  Buffer pins
are a lot like weak relation locks, in that they are a type of lock
that is taken frequently, but rarely conflicts.  And the fast-path
locking in 9.2 provides a demonstration of how to handle this kind of
problem efficiently: making the weak, rarely-conflicting locks
cheaper, at the cost of some additional expense when a conflicting
lock (in this case, a buffer cleanup lock) is taken.  In particular,
each backend has its own area to record weak relation locks, and a
strong relation lock must scan all of those areas and migrate any
locks found there to the main lock table.  I don't think it would be
feasible to adopt exactly this solution for buffer pins, because page
eviction and buffer cleanup locks, while not exactly common, are
common enough that we can't require a scan of N per-backend areas
every time one of those operations occurs.

But, maybe we could have a system of this type that only applies to
the very hottest buffers.  Suppose we introduce two new buffer flags,
BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
the buffer can't be evicted until that flag is removed, and backends
are permitted to record pins in a per-backend area protected by a
per-backend spinlock or lwlock, rather than in the buffer header.
When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
point, it's no longer permissible to record new pins in the
per-backend areas, but old ones may still exist.  So then we scan all
the per-backend areas and transfer the pins to the buffer header, or
else just wait until no more exist; then, we clear both BUF_NAILED and
BUF_NAIL_REMOVAL.

So the pin algorithm looks like this:

read buffer header flags (unlocked)
if ((flags & (BUF_NAILED|BUF_NAIL_REMOVAL)) != BUF_NAILED)
{
    take buffer header spinlock
    record pin in buffer header
    release buffer header spinlock
}
else
{
    take per-backend lwlock
    record pin in per-backend area
    release per-backend lwlock
    read buffer header flags (unlocked)
    if ((flags & (BUF_NAILED|BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        take per-backend lwlock
        forget pin in per-backend area
        release per-backend lwlock
        take buffer header spinlock
        record pin in buffer header
        release buffer header spinlock
    }
}

Due to memory ordering effects, we might see the buffer as nailed when
in fact nail removal has already begun (or even, completed).  We can
prevent that if (1) the nail removal code sets the nail removal flag
before checking the per-backend areas and (2) the pin code checks the
nail removal flag AFTER recording its pin in the per-backend area.
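
To make that ordering concrete, here is a rough C sketch of the pin path
described above.  It is only an illustration of the scheme sketched in this
message -- the BufferDesc fields, MyBackendPinLock, pg_memory_barrier(), and
the RecordLocalPin()/ForgetLocalPin() helpers are hypothetical stand-ins,
not existing PostgreSQL APIs:

static void
PinBufferNailAware(BufferDesc *buf)
{
    uint32 flags = buf->flags;              /* unlocked read */

    if ((flags & (BUF_NAILED | BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        /* normal path: pin under the buffer header spinlock */
        SpinLockAcquire(&buf->header_lock);
        buf->refcount++;
        SpinLockRelease(&buf->header_lock);
        return;
    }

    /* nailed path: record the pin in a per-backend area instead */
    LWLockAcquire(MyBackendPinLock, LW_EXCLUSIVE);
    RecordLocalPin(buf);
    LWLockRelease(MyBackendPinLock);

    /* the local pin must be visible to others before the re-check */
    pg_memory_barrier();

    flags = buf->flags;                     /* unlocked re-read */
    if ((flags & (BUF_NAILED | BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        /* nail removal began concurrently: undo and fall back */
        LWLockAcquire(MyBackendPinLock, LW_EXCLUSIVE);
        ForgetLocalPin(buf);
        LWLockRelease(MyBackendPinLock);

        SpinLockAcquire(&buf->header_lock);
        buf->refcount++;
        SpinLockRelease(&buf->header_lock);
    }
}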

Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Merlin Moncure
On Thu, May 31, 2012 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote:
 [...]

 This test case is unusual because it hits a whole series of buffers
 very hard.  However, there are other cases where this happens on a
 single buffer that is just very, very hot, like the root block of a
 btree index, where the pin/unpin overhead hurts us.  I've been
 thinking about this problem for a while, but it hasn't made it up to
 the top of my priority list, because workloads where pin/unpin is the
 dominant cost are still relatively uncommon.  I expect them to get
 more common as we fix other problems.

 Anyhow, I do have some vague thoughts on how to fix this.  Buffer pins
 are a lot like weak relation locks, in that they are a type of lock
 that is taken frequently, but rarely conflicts.  And the fast-path
 locking in 9.2 provides a demonstration of how to handle this kind of
 problem efficiently: making the weak, rarely-conflicting locks
 cheaper, at the cost of some additional expense when a conflicting
 lock (in this case, a buffer cleanup lock) is taken.  In particular,
 each backend has its own area to record weak relation locks, and a
 strong relation lock must scan all of those areas and migrate any
 locks found there to the main lock table.  I don't think it would be
 feasible to adopt exactly this solution for buffer pins, because page
 eviction and buffer cleanup locks, while not exactly common, are
 common enough that we can't require a scan of N per-backend areas
 every time one of those operations occurs.

 But, maybe we could have a system of this type that only applies to
 the very hottest buffers.  Suppose we introduce two new buffer flags,
 BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
 on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
 the buffer can't be evicted until that flag is removed, and backends
 are permitted to record pins in a per-backend area protected by a
 per-backend spinlock or lwlock, rather than in the buffer header.
 When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.

Hm, a couple of questions: how do you determine if/when to un-nail a
buffer, and who makes that decision (bgwriter?)  Is there a limit to
how many buffers you are allowed to nail?  It seems like a much
stronger idea, but one downside I see vs the 'pin for longer' idea I
was kicking around is how to deal with stale nailed buffers and keep
their number from growing uncontrollably, so that you don't have to
either stop nailing or forcibly evict them.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Robert Haas
On Thu, May 31, 2012 at 3:25 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Hm, a couple of questions: how do you determine if/when to un-nail a
 buffer, and who makes that decision (bgwriter?)

Well, I think some experimentation might be required, but my first
thought is to tie it into buffer eviction.  If we record a pin in some
side array in lieu of acquiring the buffer header spinlock, then we're
not bumping the usage count.  So perhaps what we might do is - when we
first nail the buffer, we set its usage count to something much
higher than what's normally allowed, like say 50.  When it gets down
to 0 and somebody tries to evict it, we scan the per-backend arrays
and count the number of pins we find there, and set the usage count to
5 * that number.

The details could be completely wrong here.  The point is that I think
every so often we can just reconsider whether the buffer is still
worthy of being nailed.  Exactly how often we should reconsider it
seems like an empirical question.
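
As a sketch of that reconsider-at-eviction idea (again with hypothetical
names; the 50 and 5x figures are just the guesses from the paragraph above,
not tested values):

static bool
ReconsiderNailedBuffer(BufferDesc *buf)
{
    /* called when the clock sweep finds a nailed buffer at usage count 0 */
    int     npins = CountPinsInPerBackendAreas(buf);

    if (npins == 0)
    {
        /*
         * Nobody is pinning it through the fast path any more: un-nail
         * it, after which it is an ordinary eviction candidate.
         */
        BeginNailRemoval(buf);
        return true;
    }

    /*
     * Still hot: keep the nail, and grant a usage-count lease that is
     * proportional to the observed demand.
     */
    buf->usage_count = 5 * npins;
    return false;
}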

 Is there a limit to
 how many buffers you are allowed to nail?

Probably not.  The system had better be designed so that the threshold
for nailing a buffer is pretty high.  It should be exceedingly
unlikely for more than a small percentage of buffers to meet the
criteria; if it happens, then there's probably a good reason.

 It seems like a much
 stronger idea, but one downside I see vs the 'pin for longer' idea I
 was kicking around is how to deal with stale nailed buffers and keep
 their number from growing uncontrollably, so that you don't have to
 either stop nailing or forcibly evict them.

I think we're really talking about variants on the same theme.  My
idea is basically pin forever, which takes longer to its
logical extreme, but even if it were something short of forever I
think you'd still need some way to flush the stale pins.  Otherwise,
vacuum can get stuck, etc.



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Jeff Janes
On Sun, May 27, 2012 at 11:45 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Hi,

 I did another test using the same data and the same code, which I've
 provided before, and the performance of a single thread seems to be
 degrading quadratically with the number of threads.

 Here are the results:
 Nthreads Time_to_execute_one_thread
 1 8.1
 2 7.8
 3 8.1
 4 9.0
 5 10.2
 6 11.4
 7 13.3
 8 16.1
 9 19.0
 10 21.4
 11 23.8
 12 27.3
 13 30.2
 14 32.0
 15 34.1
 16 37.5

This looks more skew hyperbolic than quadratic.  In the higher region,
it doesn't seem to be increasing at an increasing rate, but rather
increasing at a constant rate of about 2.5 sec per additional backend,
i.e. converging to an asymptote with a slope of 2.5.  A pedantic
point, perhaps, but if you are trying to infer the nature of the
bottleneck from the shape of the curve, they lead to rather different
sets of hypotheses.
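
Spelling that out with successive differences from the table above:

  (27.3 - 21.4) / 2 = 2.95 s/backend   (10 -> 12 threads)
  (32.0 - 27.3) / 2 = 2.35 s/backend   (12 -> 14 threads)
  (37.5 - 32.0) / 2 = 2.75 s/backend   (14 -> 16 threads)

The increments hover around 2.4-3.0 s rather than growing, which is what a
linear asymptote T(n) ~ 2.5*n + const predicts; a genuinely quadratic curve
would show these differences themselves increasing steadily.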

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Jeff Janes
On Thu, May 31, 2012 at 9:17 AM, Robert Haas robertmh...@gmail.com wrote:

 Oh, ho.  So from this we can see that the problem is that we're
 getting huge amounts of spinlock contention when pinning and unpinning
 index pages.

 It would be nice to have a self-contained reproducible test case for
 this, so that we could experiment with it on other systems.

I just posted a patch under the subject "pgbench--new transaction type"
that introduces a pgbench -P option.

I think that that would do a good job of simulating unique-key
look-ups on the inner side of a nested loop (which is basically what
we have here) and so creating contention on index pages.  Right now I
don't have anything with more than 2 CPUs, and 2 is not high enough to
get much contention, so I can't post any meaningful numbers.  (pgbench
-P might also be of interest in the hash index investigation.)

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Jeff Janes
On Thu, May 31, 2012 at 11:50 AM, Robert Haas robertmh...@gmail.com wrote:

 This test case is unusual because it hits a whole series of buffers
 very hard.  However, there are other cases where this happens on a
 single buffer that is just very, very hot, like the root block of a
 btree index, where the pin/unpin overhead hurts us.

I think that a very, very hot page is also the problem here, not a whole
sequence of hot pages.  Most of his buffer content shared lwlocks are on
just two buffers, and most of his blocked buffer mapping lwlocks
are on just two partitions.  So I am guessing that almost all of his
spin-lock contention from Pin and Unpin is also coming from those
same two buffers.  Why there are two buffers when there is only one
index root block involved, I don't know.

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-31 Thread Jeff Janes
On Wed, May 30, 2012 at 6:10 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Wed, 30 May 2012, Jeff Janes wrote:

 But anyway, is idt_match a fairly static table?  If so, I'd partition
 that into 16 tables, and then have each one of your tasks join against
 a different one of those tables.  That should relieve the contention
 on the index root block, and might have some other benefits as well.


 No, idt_match is getting filled by multi-threaded copy() and then joined
 with 4 other big tables like idt_phot. The result is then split into
 partitions.

That does make things more complicated.  But could you partition
it at that level and then do the joins partition-wise?

I don't have much experience at data partitioning (well, I do, but the
experience is with partitioning in Perl with terabytes of flat files,
not in PG :) ) but I think that once you have your partitioning keys
you want to apply them the same way up and down the data set.

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Merlin Moncure
On Sun, May 27, 2012 at 1:45 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Hi,

 I did another test using the same data and the same code, which I've
 provided before, and the performance of a single thread seems to be
 degrading quadratically with the number of threads.

 Here are the results:
 Nthreads Time_to_execute_one_thread
 1 8.1
 2 7.8
 3 8.1
 4 9.0
 5 10.2
 6 11.4
 7 13.3
 8 16.1
 9 19.0
 10 21.4
 11 23.8
 12 27.3
 13 30.2
 14 32.0
 15 34.1
 16 37.5

Ok, I double checked offline with Sergey to make sure the strategy
wasn't helping...it isn't, and we confirmed it was being forced on
after playing with the tunables a bit.  So what is happening here?  I
have a hunch that this particular query is defeating the ring buffer
strategy code.  Here's the query:

create table _tmp0 as select * from (
    select *,
        (select healpixid from idt_match as m
         where m.transitid = o.transitid) as x
    from idt_photoobservation as o offset 0
) as y where x%16=ZZZ order by x;

(where ZZZ is some integer, 0 <= ZZZ < 16)

With the following plan:
------------------------------------------------------------------------------
 Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
   Sort Key: y.x
   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
         Filter: ((y.x % 16::bigint) = 0)
         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
               ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                             Index Cond: (transitid = o.transitid)

If I understand this properly, Sergey is scanning a large table and
looking up an integer value from a smaller table row by row, and
explicitly forcing it as such (via 'offset 0').  That integer value is
then used to filter the resultant table down to size 1/N so that
multiple workers can process the data, which is then sorted.  This
isn't a particularly great way to attack the problem, but it should
scale better than it does.  I think the problem is coming not from the
main seq scan, but from the dependent index lookup on idt_match.  Even
though the scan can be made to exhibit lock-free behavior, the index
lookups will eventually start lining up and hit the same sequences of
pages in the same order -- bam.

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

Here is the actual explain analyze of the query on the smaller dataset
which I have been using for the recent testing.

test=# explain analyze create table _tmp0 as select * from
  ( select *,
      (select healpixid from idt_match as m where m.transitid=o.transitid)
      as x from idt_photoobservation_small as o offset 0
  ) as y where x%16=0 order by x;

                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=8203.041..8252.216 rows=173696 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 182555kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=0.102..7602.947 rows=173696 loops=1)
         Filter: ((y.x % 16::bigint) = 0)
         Rows Removed by Filter: 826304
         ->  Limit  (cost=0.00..63819894.51 rows=102 width=490) (actual time=0.041..7296.401 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=102 width=490) (actual time=0.038..7094.555 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 8908.631 ms



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Merlin Moncure
On Wed, May 30, 2012 at 10:42 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Here is the actual explain analyze of the query on the smaller dataset
 which I have been using for the recent testing.

 [query and EXPLAIN ANALYZE output trimmed; quoted from the previous message]

Two things:

1. Can we see an explain analyze during a 'bogged' case?

2. Can we try to get 'index only scan' working over idt_match?  That
should be a matter of adjusting the index so that it's:
create index on idt_match (transitid, healpixid);

merlin


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Wed, 30 May 2012, Merlin Moncure wrote:



1. Can we see an explain analyze during a 'bogged' case?


Here is the one-to-one comparison; first the 'bogged' case:
**

                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=18007.500..18007.500 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=18007.454..18007.454 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..63819894.51 rows=102 width=490) (actual time=0.047..17734.570 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=102 width=490) (actual time=0.045..17543.902 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 18056.866 ms
(12 rows)

Time: 18067.929 ms
*

vs non-bogged:

**
                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=6635.133..6635.133 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=6635.091..6635.091 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..63819894.51 rows=102 width=490) (actual time=0.059..6344.683 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=102 width=490) (actual time=0.056..6149.429 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 6669.215 ms
(12 rows)

Time: 6673.991 ms
**



2. Can we try to get 'index only scan' working over idt_match?  That's
should be a matter of adjusting the index so that it's:
create index on idt_match (transitid, healpixid);


I don't think I could do that. I created the index you asked for (although
it is unclear why it would help...), tried explain, and it still used a
simple index scan. After disabling the index scan it decided to use the
bitmap scan (which isn't actually faster in the multithreaded setup), and
after disabling the bitmap scan it switched to a seq scan.


Cheers,
Sergey

PS Just for reference, the current indices on idt_match are
idt_match_idx btree (healpixid)
idt_match_transitid_healpixid_idx btree (transitid, healpixid)
idt_match_transitid_idx btree (transitid)





Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Robert Haas
On Wed, May 30, 2012 at 12:58 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Here is the one-to-one comparison of the 'bogged' vs the non-bogged case:
 [EXPLAIN ANALYZE output trimmed; quoted in full in the previous message]

Hmm.  So the total time across all loops is 3000-4000 ms in the
unbogged case, and 15000 ms in the bogged case.  The time for the
seqscan parent is 6344.683 in the unbogged case, so ~3s once you
subtract out the index scans, and 17543.902 in the bogged case, so ~3s
once you subtract out the index scans.  So this seems to support
Merlin's theory that the index scan is what's getting slow under heavy
concurrency.  Apparently, the index scan slows down 4-5x but the
seqscan is just as fast as ever.  Very interesting...
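
Spelling that subtraction out (the subplan runs once per row produced by
the seq scan, i.e. roughly a million times, per the filter counts in the
plans above):

  bogged:    ~0.015 ms/lookup  x ~1,000,000 loops ~= 15,000 ms of index scan
             17,544 ms total - 15,000 ms          ~=  2.5 s for the scan itself
  unbogged:  ~0.0035 ms/lookup x ~1,000,000 loops ~=  3,500 ms of index scan
              6,149 ms total -  3,500 ms          ~=  2.6 s for the scan itself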



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Merlin Moncure
On Wed, May 30, 2012 at 12:11 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 30, 2012 at 12:58 PM, Sergey Koposov kopo...@ast.cam.ac.uk 
 wrote:
 [EXPLAIN ANALYZE output trimmed; quoted in full upthread]

 Hmm.  So the total time across all loops is 3000-4000 ms in the
 unbogged case, and 15000 ms in the bogged case.  The time for the
 seqscan parent is 6344.683 in the unbogged case, so ~3s once you
 subtract out the index scans, and 17543.902 in the bogged case, so ~3s
 once you subtract out the index scans.  So this seems to support
 Merlin's theory that the index scan is what's getting slow under heavy
 concurrency.  Apparently, the index scan slows down 4-5x but the
 seqscan is just as fast as ever.  Very interesting...

Yeah, an 'index only scan' *would* fix the performance problem if my
theory is correct.  By skipping the heap lookup it would also skip the
buffer retagging which is binding up the multiple concurrent scans.
Sergey, just in case you don't know, an IOS works like this.  If all
the fields that are fetched from a table are present in the index and
the index points towards a tuple on a page that contains only visible
tuples according to the visibility map, you get to skip jumping back
into the heap for visibility checks.  In your case, I think that would
result in almost linear scaling.

Hm, why aren't we getting an IOS?  Just for kicks (assuming this is
test data), can we drop the index on just transitid, leaving the index
on (transitid, healpixid)?  Is enable_indexonlyscan on?  Has idt_match
been vacuumed?  What kind of plan do you get when you do:

select healpixid from idt_match where transitid = some valid transitid;

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Wed, 30 May 2012, Merlin Moncure wrote:


 Hm, why aren't we getting an IOS?  Just for kicks (assuming this is
 test data), can we drop the index on just transitid, leaving the index
 on (transitid, healpixid)?  Is enable_indexonlyscan on?  Has idt_match
 been vacuumed?  What kind of plan do you get when you do:


Okay, dropping the index on transitid solved the issue with the index-only
scan, but didn't solve the original problem. Actually the index-only scan
made the sequential queries faster but not the parallel ones.


Here is a side-by-side explain analyze again, for index-only scans

Bogged:


                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=82262274.19..82262286.69 rows=5000 width=498) (actual time=19635.318..19635.318 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..82261967.00 rows=5000 width=498) (actual time=19635.282..19635.282 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..82246967.00 rows=100 width=490) (actual time=0.076..19373.922 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..82246967.00 rows=100 width=490) (actual time=0.074..19179.336 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Only Scan using idt_match_transitid_healpixid_idx on idt_match m  (cost=0.00..82.17 rows=1 width=8) (actual time=0.016..0.017 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
                             Heap Fetches: 0
 Total runtime: 19702.727 ms
(13 rows)
***
vs non-bogged
***

                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=82262274.19..82262286.69 rows=5000 width=498) (actual time=6198.334..6198.334 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..82261967.00 rows=5000 width=498) (actual time=6198.285..6198.285 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..82246967.00 rows=100 width=490) (actual time=0.116..5932.766 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..82246967.00 rows=100 width=490) (actual time=0.113..5730.012 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Only Scan using idt_match_transitid_healpixid_idx on idt_match m  (cost=0.00..82.17 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
                             Heap Fetches: 0
 Total runtime: 6239.512 ms
(13 rows)
**

Cheers,
S

PS I didn't know that an index on two columns, together with an index-only
scan, can help queries based on just one of those columns. Very helpful!




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Merlin Moncure
On Wed, May 30, 2012 at 1:45 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 [...]


 Okay dropping the index on transitid solved the issue with indexonlyscan but
 didn't solve the original problem. Actually the indexonlyscan made the
 sequential queries faster but not the parallel ones.

hurk -- ISTM that since the IOS is masking the heap lookups, there must
be contention on the index itself?  Does this working set fit in
shared memory?  If so, what happens when you do a database restart and
repeat the IOS test?

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Wed, 30 May 2012, Merlin Moncure wrote:


hurk -- ISTM that since the IOS is masking the heap lookups, there must
be contention on the index itself?  Does this working set fit in
shared memory?  If so, what happens when you do a database restart and
repeat the IOS test?


The dataset fits well in shared buffers

Here are the data sizes:
 public.idt_photoobservation_small        | 521 MB  |  528756
 public.idt_match_idx                     | 1084 MB |  528762
 public.idt_match_transitid_healpixid_idx | 1522 MB | 1955140
 public.idt_match                         | 2906 MB |  528753

And shared buffers are 10G

If I restart the db the timings do not change significantly. There is always
some variation which I don't really understand, e.g. the parallel runs
sometimes take 18s, or 25 seconds, or 30 seconds per thread. So there is
something else affecting the runs -- I don't know, maybe it's related to
which thread starts first, or what the starting point of the seq scan is...
(there is no other activity on the machine, btw).


Cheers,
S



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Jeff Janes
On Wed, May 30, 2012 at 11:45 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 [...]

 Okay dropping the index on transitid solved the issue with indexonlyscan but
 didn't solve the original problem. Actually the indexonlyscan made the
 sequential queries faster but not the parallel ones.

How big is idt_match?  What if you drop all indexes on idt_match,
encouraging all the backends to do hash joins against it, which occur
in local memory and so don't have contention?

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Merlin Moncure
On Wed, May 30, 2012 at 3:15 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 [...]

 How big is idt_match?  What if you drop all indexes on idt_match,
 encouraging all the backends to do hash joins against it, which occur
 in local memory and so don't have contention?

You just missed his post -- it's only 3G.  Can you run your 'small'
working set against 48gb shared buffers?

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Wed, 30 May 2012, Merlin Moncure wrote:


How big is idt_match?  What if you drop all indexes on idt_match,
encouraging all the backends to do hash joins against it, which occur
in local memory and so don't have contention?


 You just missed his post -- it's only 3G.  Can you run your 'small'
 working set against 48gb shared buffers?


Just tried 3 times, and it actually got much worse: ~70-80 seconds per
query in the parallel setup (i.e. 10 times slower than the single run).


The oprofile then looks like this:

CPU: Intel Architectural Perfmon, speed 1862 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask 
of 0x00 (No unit mask) count 10
samples   %        linenr info                  symbol name
---
883523   46.3676  (no location information)   s_lock
  883523   100.000  (no location information)   s_lock [self]
---
303984   15.9532  (no location information)   PinBuffer
  303984   100.000  (no location information)   PinBuffer [self]

The problem is that, since there is that variability in times, I don't
really 100% know whether this trend of slowing down with increasing
shared memory is genuine or not.

I've also tried shared_buffers=1G just in case, and it is still very slow
(50 sec).


After that I changed the shared buffers back to 10G and the timings got 
back to 25 sec.


Weird...

I still wonder whether there is a problem with the way the locking is done
(as referenced in the recent dropping-tables slowness thread).


Cheers,
S



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Florian Pflug
On May30, 2012, at 22:07 , Sergey Koposov wrote:
 If I restart the db the timings do not change significantly. There is always
 some variation which I don't really understand, e.g. the parallel runs
 sometimes take 18s, or 25 seconds, or 30 seconds per thread. So there is
 something else affecting the runs -- I don't know, maybe it's related to
 which thread starts first, or what the starting point of the seq scan is...
 (there is no other activity on the machine, btw).

I wonder if the huge variance could be caused by non-uniform synchronization
costs across different cores. That's not all that unlikely, because at least
some cache levels (L2 and/or L3, I think) are usually shared between all cores
on a single die. Thus, a cache line bouncing between cores on the same die
might very well be faster than one bouncing between cores on different dies.

On linux, you can use the taskset command to explicitly assign processes to
cores. The easiest way to check if that makes a difference is to assign one
core to each connection to the postmaster before launching your test. Assuming
that cpu assignments are inherited by child processes, that should then spread
your backends out over exactly the cores you specify.
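
For reference, taskset does this through sched_setaffinity(2), and a worker
can also pin itself from C.  A minimal sketch (the core number is just an
example; the affinity mask is inherited across fork):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to a single core. */
static int
pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = self */
}

taskset -cp <core> <pid> does the equivalent for an already-running backend.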

best regards,
Florian Pflug




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Wed, 30 May 2012, Florian Pflug wrote:


I wonder if the huge variance could be caused by non-uniform 
synchronization costs across different cores. That's not all that 
unlikely, because at least some cache levels (L2 and/or L3, I think) are 
usually shared between all cores on a single die. Thus, a cache bouncing 
line between cores on the same die might very well be faster then it 
bouncing between cores on different dies.


On linux, you can use the taskset command to explicitly assign processes 
to cores. The easiest way to check if that makes a difference is to 
assign one core for each connection to the postmaster before launching 
your test. Assuming that cpu assignment are inherited to child 
processes, that should then spread your backends out over exactly the 
cores you specify.


Wow, thanks! This seems to be working to some extent. I've found that
distributing each thread x (0 <= x <= 7) to cpu 1+3*x
(reminder: I have HT disabled, and in total I have 4 cpus with 6
proper cores each) gives quite good results. And after a few runs, I seem
to be getting more or less stable results for the multiple threads,
with the performance of multithreaded runs going from 6 to 11 seconds for
the various threads. (Another reminder: 5-6 seconds is roughly the
timing of my queries running in a single thread.)


So to some extent one can say that the problem is partially solved (i.e.
it is probably understood). But the question now is whether there is a
*PG* problem here or not, or whether it is Intel's or Linux's problem?
Because the slowdown was still caused by locking. If there were no
locking there would not be any problems (as demonstrated a while ago by
just cat'ting the files in multiple threads).


Cheers,
S



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Florian Pflug
On May31, 2012, at 01:16 , Sergey Koposov wrote:
 [...]
 
 Wow, thanks! This seems to be working to some extent. I've found that
 distributing each thread x (0 <= x <= 7) to cpu 1+3*x
 (reminder: I have HT disabled, and in total I have 4 cpus with 6 proper
 cores each) gives quite good results. And after a few runs, I seem to be
 getting more or less stable results for the multiple threads, with the
 performance of multithreaded runs going from 6 to 11 seconds for the
 various threads. (Another reminder: 5-6 seconds is roughly the timing of my
 queries running in a single thread.)

Wait, so performance *increased* by spreading the backends out over as many 
dies as possible, not by using as few as possible? That'd
be exactly the opposite of what I'd have expected. (I'm assuming that cores on 
one die have ascending ids on linux. If you could post the contents of 
/proc/cpuinfo, we could verify that)

 So to some extent one can say that the problem is partially solved (i.e. it
 is probably understood)

Not quite, I think. We still don't really know why there's that much spinlock
contention AFAICS. But what we've learned is that the actual
spinning on a contested lock is only part of the problem. The cache-line
bouncing caused by all those lock acquisitions is the other part, and it's
pretty expensive too - otherwise, moving the backends around wouldn't have
helped.

 But the question now is whether there is a *PG* problem here or not, or
 whether it is Intel's or Linux's problem?

Neither Intel nor Linux can do much about this, I fear. Synchronization will
always be expensive, and the more so the larger the number of cores. Linux
could maybe pick a better process-to-core assignment, but it probably won't be
able to pick the optimal one for every workload. So unfortunately, this is a
postgres problem I'd say.

 Because the slowdown was still caused by locking. If there were no locking
 there would not be any problems (as demonstrated a while ago by just
 cat'ting the files in multiple threads).

Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 already 
scales much better to a large number of cores than previous versions did, but 
your test case shows that there's still room for improvement.

best regards,
Florian Pflug




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Thu, 31 May 2012, Florian Pflug wrote:

Wait, so performance *increased* by spreading the backends out over as 
many dies as possible, not by using as few as possible? That'd be 
exactly the opposite of what I'd have expected. (I'm assuming that cores 
on one die have ascending ids on linux. If you could post the contents 
of /proc/cpuinfo, we could verify that)


Yes, you are correct. And I can also confirm that the cpus in cpuinfo
are ordered by physical id, e.g. they go like

0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

I did a specific test with just 6 threads (== the number of cores per cpu)
and ran it on a single phys cpu; it took ~ 12 seconds for each thread,
and when I tried to spread it across 4 cpus it took 7-9 seconds per
thread. But all these numbers are anyway significantly better than when I
didn't use taskset. Which probably means that without it the processes were
jumping from core to core? ...


Because the slowdown was still caused by locking. If there were no locking
there would not be any problems (as demonstrated a while ago by
just cat'ting the files in multiple threads).


Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 
already scales much better to a large number of cores than previous 
versions did, but your test case shows that there's still room for 
improvement.


Yes, and unfortunately these scaling problems show up in read-only,
cpu-bound queries, where naively I wouldn't expect any.


Thanks,

Sergey



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Jeff Janes
On Wed, May 30, 2012 at 4:16 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:

 But the question now is whether there is a *PG* problem here or not, or
 whether it is Intel's or Linux's problem? Because the slowdown was still
 caused by locking. If there were no locking there would not be any problems
 (as demonstrated a while ago by just cat'ting the files in multiple threads).

You cannot have a traditional RDBMS without locking.  From your
description of the problem, I probably wouldn't be using a traditional
database system at all for this, but rather flat files and Perl.  Or
at least, I would partition the data before loading it to the DB,
rather than trying to do it after.

But anyway, is idt_match a fairly static table?  If so, I'd partition
that into 16 tables, and then have each one of your tasks join against
a different one of those tables.  That should relieve the contention
on the index root block, and might have some other benefits as well.

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Stephen Frost
Sergey, all,

* Sergey Koposov (kopo...@ast.cam.ac.uk) wrote:
 I did a specific test with just 6 threads (== number of cores per cpu)
 and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
 and when I tried to spread it across 4 cpus it took 7-9 seconds per
 thread. But all these numbers are anyway significantly better than
 when I didn't use taskset. Which probably means that without it the
 processes were jumping from core to core? ...

It sounds like the issue here really is the PG processes bouncing
between the cores, which I do think is a Linux issue, though perhaps PG
could somehow encourage Linux to be better about this..

This isn't specific to Linux, however; under Windows there's something
similar, and they encourage the same kind of manual assignment of things
like IIS processes to CPUs..
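
(For anyone wanting to try this by hand: the backend PID is easy to get
from SQL and can then be pinned from a shell; the CPU list below is just
an example.)

  SELECT pg_backend_pid();   -- run in the session you want to pin
  -- then, from a shell:
  --   taskset -cp 0-5 <pid>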

Thanks,

Stephen




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Stephen Frost
* Sergey Koposov (kopo...@ast.cam.ac.uk) wrote:
 I did a specific test with just 6 threads (== number of cores per cpu)
 and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
 and when I tried to spread it across 4 cpus it took 7-9 seconds per
 thread. But all these numbers are anyway significantly better than
 when I didn't use taskset. Which probably means that without it the
 processes were jumping from core to core? ...

Oh, and wrt why 'cat' isn't really affected by this issue- that's
because 'cat' isn't *doing* anything, CPU wise, really.  The PG backends
are actually doing real work and therefore they have both CPU state and
memory accesses which are impacted when the process is moved from one
core to another.  If this system has NUMA (where memory is associated
with a set of cores), then that can make it more painful when threads
are being moved between cores also, because suddenly the memory you were
accessing (which was 'fast') is now taking longer because you're having
to go through another CPU to get to it.

Thanks,

Stephen




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Sergey Koposov

On Wed, 30 May 2012, Jeff Janes wrote:


But the question now is whether there is a *PG* problem here or not, or is
it Intel's or Linux's problem ? Because still the slowdown was caused by
locking. If there wouldn't be locking there wouldn't be any problems (as
demonstrated a while ago by just cat'ting the files in multiple threads).


You cannot have a traditional RDBMS without locking.  From your


I understand the need for significant locking when there are concurrent 
writes, but not when there are only reads. But I'm not an RDBMS expert, so 
maybe that's a misunderstanding on my side.



description of the problem,  I probably wouldn't be using a traditional
database system at all for this, but rather flat files and Perl.


Flat files and perl for 25-50 TB of data over a few years is a bit extreme 
;)



Or
at least, I would partition the data before loading it to the DB,
rather than trying to do it after.


I intentionally did otherwise, because I thought that PG would be much 
smarter than me at juggling the data I'm ingesting (~ tens of gigs each 
day), joining the appropriate bits of data and then splitting them into 
partitions. Unfortunately I see that there are some scalability 
issues on the way, which I didn't expect. Those aren't fatal, but slightly 
disappointing.



But anyway, is idt_match a fairly static table?  If so, I'd partition
that into 16 tables, and then have each one of your tasks join against
a different one of those tables.  That should relieve the contention
on the index root block, and might have some other benefits as well.


No, idt_match is getting filled by multi-threaded copy() and then joined 
with 4 other big tables like idt_phot. The result is then split into 
partitions. And I was trying different approaches to fully utilize the 
CPUs and/or I/O and somehow parallelize the queries. That's the 
reasoning for the somewhat contrived queries in my test.


Cheers,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Robert Haas
On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 I understand the need for significant locking when there are concurrent writes,
 but not when there are only reads. But I'm not an RDBMS expert, so maybe
 that's a misunderstanding on my side.

If we knew in advance that no writes would come along during the
execution of a particular test case, then we could skip a lot of
locking on the reads.  But we don't, so we have to be prepared for the
possibility of writes at any time, which means doing things like taking
share-locks on data while it's actively being read.

Also, even in an entirely read-only workload, buffers can get evicted
from the database buffer cache, so we have to lock to make sure a
buffer doesn't get evicted while we're reading it.
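
(As an aside, the contrib/pg_buffercache module is a handy way to see
which relations are actually occupying shared_buffers while such a test
runs, e.g.:)

  CREATE EXTENSION pg_buffercache;  -- contrib module
  SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
   GROUP BY c.relname
   ORDER BY buffers DESC
   LIMIT 10;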

I'd really like to find out exactly where all those s_lock calls are
coming from.  Is there any way you can get oprofile to output a
partial stack backtrace?  If you have perf it's very easy, just 'perf
record -g -a command to launch your test case' and then 'perf report
-g'.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Florian Pflug
On May31, 2012, at 02:26 , Sergey Koposov wrote:
 On Thu, 31 May 2012, Florian Pflug wrote:
 Wait, so performance *increased* by spreading the backends out over as many 
 dies as possible, not by using as few as possible? That'd be exactly the 
 opposite of what I'd have expected. (I'm assuming that cores on one die have 
 ascending ids on linux. If you could post the contents of /proc/cpuinfo, we 
 could verify that)
 
 Yes, you are correct. And I also can confirm that the cpus in the cpuinfo are 
 ordered by physical id e.g. they go like
 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
 
 I did a specific test with just 6 threads (== number of cores per cpu)
 and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
 and when I tried to spread it across 4 cpus it took 7-9 seconds per thread. 
 But all these numbers are anyway significantly better than when I didn't use 
 taskset.

Hm. The only resource that is shared between different cores on a die is 
usually the last cache level (L2 or L3). So by spreading backends out over more 
cores, you're increasing the total cache size. Maybe that could explain the 
behavior you're seeing.

 Which probably means without it the processes were jumping from core to core 
 ? …

Seems so. I completely fail to understand why they would, though. Since you've 
got far fewer runnable processes than cores, why would the kernel ever see a 
reason to migrate a process from one core to another? I think that you can 
query the core that a process is running on from one of the files in 
/proc/<pid>. You could try gathering samples every 50 ms or so during the test 
run - maybe that could shed some additional light on this.

best regards,
Florian Pflug





Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
  I understand the need for significant locking when there are concurrent writes,
  but not when there are only reads. But I'm not an RDBMS expert, so maybe
  that's a misunderstanding on my side.
 
 If we knew in advance that no writes would come along during the
 execution of a particular test case, then we could skip a lot of
 locking on the reads.  But we don't, so we have to be prepared for the
 possibility of writes at any time, which means doing things taking
 share-locks on data while it's actively being read.

Uh, we have a read-only transaction mode, don't we?  Or does that not
help, because someone else, in another transaction, could take a
read-write lock?
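
That is, something like:

  BEGIN TRANSACTION READ ONLY;
  -- or, as a default for the whole session:
  SET default_transaction_read_only = on;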

Thanks,

Stephen




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-30 Thread Jeff Janes
On Wed, May 30, 2012 at 7:00 PM, Stephen Frost sfr...@snowman.net wrote:
 Robert,

 * Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov kopo...@ast.cam.ac.uk 
 wrote:
  I understand the need for significant locking when there are concurrent writes,
  but not when there are only reads. But I'm not an RDBMS expert, so maybe
  that's a misunderstanding on my side.

 If we knew in advance that no writes would come along during the
 execution of a particular test case, then we could skip a lot of
 locking on the reads.  But we don't, so we have to be prepared for the
 possibility of writes at any time, which means doing things taking
 share-locks on data while it's actively being read.

 Uh, we have a read-only transaction mode, don't we?  Or does that not
 help, because someone else, in another transaction, could take a
 read-write lock?

Yes, exactly.  You would have to put the entire cluster into a read
only mode, and the entire code base almost from top to bottom would
have to be rewritten to detect and support such a mode.  Even then,
either you have a unified buffer cache (which needs locking even if
the queries are read only), or each process uses its own local memory.
 In the latter case, how do you avoid oversubscribing physical memory?
 Also, there are certain house-keeping tasks that can take place
anytime, even in read-only transactions.  You would either need to
turn those off, or you would need to make a pass through the entire
system doing all the house-keeping up front during the transition from
read-write to read-only.

It would basically be like writing an entirely new database management
system, which just happens to share PG's on-disk format and PG's
flavor of SQL syntax and semantics.  And is read only.  (And since the
OP is trying to create tables based on his read-only query, such a
system wouldn't be very well suited.)

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-27 Thread Sergey Koposov

Hi,

I did another test using the same data and the same code, which I've 
provided before, and the performance of a single thread seems to be 
degrading quadratically with the number of threads.


Here are the results:
Nthreads Time_to_execute_one_thread (seconds)
1 8.1
2 7.8
3 8.1
4 9.0
5 10.2
6 11.4
7 13.3
8 16.1
9 19.0
10 21.4
11 23.8
12 27.3
13 30.2
14 32.0
15 34.1
16 37.5

Regards,
Sergey



On Sat, 26 May 2012, Stephen Frost wrote:


* Sergey Koposov (kopo...@ast.cam.ac.uk) wrote:

Turning off synch seq scans doesn't help either. 18 sec
multithreaded run vs 7 sec single threaded.


Alright, can you just time 'cat' when they're started a few seconds or
whatever apart from each other?  I can't imagine it being affected in
the same way as these, but seems like it wouldn't hurt to check.

Thanks,

Stephen



*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-26 Thread Robert Haas
On Fri, May 25, 2012 at 10:30 AM, Merlin Moncure mmonc...@gmail.com wrote:
 I think what's happening here is that the buffer partitions don't help
 (in fact, they hurt) in the presence of multiple concurrent scans that
 are operating on approximately the same data.  Sooner or later the
 scans line up on each other and start binding when reassigning lock
 tags (which take out up to two ordered exclusive lwlocks). This is on
 the read side, so the buffer access strategy is zero help (I confirmed
 this off list).

This theory is seeming fairly plausible - how can we test it?

How about trying it with synchronize_seqscans = off?  If the
synchronized-seqscan logic is causing contention on the buf mapping
locks and individual buffer locks, that should fix it.
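
It can be flipped per session, no restart needed:

  SET synchronize_seqscans = off;
  -- or globally in postgresql.conf:
  --   synchronize_seqscans = off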

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-26 Thread Sergey Koposov

On Sat, 26 May 2012, Robert Haas wrote:

This theory is seeming fairly plausible - how can we test it?

How about trying it with synchronize_seqscans = off?  If the
synchronized-seqscan logic is causing contention on the buf mapping
locks and individual buffer locks, that should fix it.


Turning off synch seq scans doesn't help either. 18 sec multithreaded run 
vs 7 sec single threaded.


S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-26 Thread Stephen Frost
* Sergey Koposov (kopo...@ast.cam.ac.uk) wrote:
 Turning off synch seq scans doesn't help either. 18 sec
 multithreaded run vs 7 sec single threaded.

Alright, can you just time 'cat' when they're started a few seconds or
whatever apart from each other?  I can't imagine it being affected in
the same way as these, but seems like it wouldn't hurt to check.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-26 Thread Sergey Koposov

On Sat, 26 May 2012, Stephen Frost wrote:


Alright, can you just time 'cat' when they're started a few seconds or
whatever apart from each other?  I can't imagine it being affected in
the same way as these, but seems like it wouldn't hurt to check.


I've tried cat'ting an 8 gig file, created in advance, on the same 
filesystem. During the multi-threaded run it takes ~4 sec, roughly the 
same as when run separately.



*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Merlin Moncure
On Thu, May 24, 2012 at 6:26 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 24 May 2012, Jeff Janes wrote:

 Add
 #define LWLOCK_STATS
 near the top of:
 src/backend/storage/lmgr/lwlock.c

 and recompile and run a reduced-size workload.  When the processes
 exits, they will dump a lot of data about LWLock usage to the logfile.
 Generally the LWLock with the most blocks on it will be the main
 culprit.


 Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
 by blk. Not sure whether that's of much use or not:

 PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
 PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
 PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
 PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
 PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
 PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
 PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
 PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
 PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
 PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
 PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
 PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
 PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
 PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
 PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
 PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
 PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
 PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
 PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
 PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
 PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
 PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
 PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
 PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
 PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
 PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
 PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
 PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
 PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
 PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
 PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
 PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
 PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
 PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
 PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
 PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
 PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
 PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
 PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
 PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
 PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
 PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
 PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
 PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
 PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
 PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
 PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
 PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
 PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
 PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
 PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
 PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
 PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
 PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
 PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
 PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
 PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
 PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
 PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
 PID 7111 lwlock 38: shacq 06 exacq 882 blk 152
 PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
 PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
 PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
 PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
 PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
 PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
 PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
 PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
 PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
 PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
 PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
 PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
 PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
 PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
 PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
 PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
 PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
 PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
 PID 7109 lwlock 38: shacq 40 exacq 322 blk 115
 PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
 PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
 PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
 PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
 PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
 PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 109
 PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
 PID 7108 lwlock 

Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Sergey Koposov

On Fri, 25 May 2012, Merlin Moncure wrote:


These are all on the buffer partition locks.  That makes sense...I was
wrong earlier: this case was in fact 'create table as', not 'insert
select' which rules out both the freelist lock and the wal insert lock
because create table as gets to use both a bulk insert strategy and
wal avoiding logic (assuming wal log level is not 'archive' or higher).
So, why aren't the lock partitions helping here?


FYI the WAL level is default, and here are the relevant changed params 
from the .conf (autovacuum was disabled but the data was vacuumed 
manually).


max_connections = 100   # (change requires restart)
shared_buffers = 10GB   # min 128kB
work_mem = 500MB# min 64kB
maintenance_work_mem = 500MB# min 1MB
synchronous_commit = off# synchronization level;
checkpoint_segments = 20# in logfile segments, min 1, 16MB each
enable_hashjoin = off
enable_mergejoin = off
effective_cache_size = 10GB
autovacuum = off# Enable autovacuum subprocess?  'on'


Regards,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Merlin Moncure
On Fri, May 25, 2012 at 8:06 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, May 24, 2012 at 6:26 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 24 May 2012, Jeff Janes wrote:

 Add
 #define LWLOCK_STATS
 near the top of:
 src/backend/storage/lmgr/lwlock.c

 and recompile and run a reduced-size workload.  When the processes
 exits, they will dump a lot of data about LWLock usage to the logfile.
 Generally the LWLock with the most blocks on it will be the main
 culprit.


  Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
  by blk. Not sure whether that's of much use or not:

 PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
 PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
 PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
 PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
 PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
 PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
 PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
 PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
 PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
 PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
 PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
 PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
 PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
 PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
 PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
 PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
 PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
 PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
 PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
 PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
 PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
 PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
 PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
 PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
 PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
 PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
 PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
 PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
 PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
 PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
 PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
 PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
 PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
 PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
 PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
 PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
 PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
 PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
 PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
 PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
 PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
 PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
 PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
 PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
 PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
 PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
 PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
 PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
 PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
 PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
 PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
 PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
 PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
 PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
 PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
 PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
 PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
 PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
 PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
 PID 7111 lwlock 38: shacq 06 exacq 882 blk 152
 PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
 PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
 PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
 PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
 PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
 PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
 PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
 PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
 PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
 PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
 PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
 PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
 PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
 PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
 PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
 PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
 PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
 PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
 PID 7109 lwlock 38: shacq 40 exacq 322 blk 115
 PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
 PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
 PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
 PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
 PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
 PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 

Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 Hm, what if BufTableHashPartition() was pseudo randomized so that
 different backends would not get the same buffer partition for a
 particular tag?

Huh?  You have to make sure that different backends will find the same
buffer for the same page, so I don't see how that can possibly work.

regards, tom lane



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Merlin Moncure
On Fri, May 25, 2012 at 10:22 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 Hm, what if BufTableHashPartition() was pseudo randomized so that
 different backends would not get the same buffer partition for a
 particular tag?

 Huh?  You have to make sure that different backends will find the same
 buffer for the same page, so I don't see how that can possibly work.

Right -- duh.  Well, hm.  Is this worth fixing?  ISTM there's a bit of
'optimizing for pgbench-itis' in the buffer partitions -- they seem
optimized to lever the mostly random access behavior of pgbench.  But
how likely is it to see multiple simultaneous scans in the real world?
 Interleaving scans like that is not a very effective optimization --
if it was me, I'd be trying to organize something around a
partitioned tid scan for parallel sequential access.

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Stephen Frost
* Merlin Moncure (mmonc...@gmail.com) wrote:
 Right -- duh.  Well, hm.  Is this worth fixing?  ISTM there's a bit of
 'optimizing for pgbench-itis' in the buffer partitions -- they seem
 optimized to lever the mostly random access behavior of pgbench.  But
 how likely is it to see multiple simultaneous scans in the real world?
  Interleaving scans like that is not a very effective optimization --
 if it was me, it'd be trying to organize something around a
 partitioned tid scan for parallel sequential access.

Didn't we implement a system whereby this is exactly what we intend to
happen on the read side - that is, everyone doing a SeqScan gangs up on
one ring buffer and follows it, which we felt was going to dramatically
improve performance in some cases?

Or is this completely different from that..?

Thanks,

Stephen




Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Jeff Janes
On Thu, May 24, 2012 at 4:26 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 24 May 2012, Jeff Janes wrote:

 Add
 #define LWLOCK_STATS
 near the top of:
 src/backend/storage/lmgr/lwlock.c

 and recompile and run a reduced-size workload.  When the processes
 exits, they will dump a lot of data about LWLock usage to the logfile.
 Generally the LWLock with the most blocks on it will be the main
 culprit.


 Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
 by blk. Not sure whether that's of much use or not:

 PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
 PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
 PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
 PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
 PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
 PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
 PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
 PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
 PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
 PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
 PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
 PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
 PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
 PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466

That is not as informative as I thought it would be (except to show
that WAL was not the issue).

I'm guessing that 44 and 48 are the buffer mapping partitions which
cover the root block of some highly used index.

But just because those things are at the top of the list doesn't mean
they are a problem.  Something has to be at the top, and they don't
dominate the total amount of blocking the way I would expect them to
if they were truly a substantial bottleneck.

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Merlin Moncure
On Fri, May 25, 2012 at 11:17 AM, Stephen Frost sfr...@snowman.net wrote:
 * Merlin Moncure (mmonc...@gmail.com) wrote:
 Right -- duh.  Well, hm.  Is this worth fixing?  ISTM there's a bit of
 'optimizing for pgbench-itis' in the buffer partitions -- they seem
 optimized to lever the mostly random access behavior of pgbench.  But
 how likely is it to see multiple simultaneous scans in the real world?
  Interleaving scans like that is not a very effective optimization --
  if it was me, I'd be trying to organize something around a
  partitioned tid scan for parallel sequential access.

 Didn't we implement a system whereby this is exactly what we intend to
 happen on the read side- that is, everyone doing a SeqScan gangs up on
 one ring buffer and follows it, which we felt was going to dramatically
 improve performance in some cases?

yeah:

/*
 * If the table is large relative to NBuffers, use a bulk-read access
 * strategy and enable synchronized scanning (see syncscan.c).  Although
 * the thresholds for these features could be different, we make them the
 * same so that there are only two behaviors to tune rather than four.
 * (However, some callers need to be able to disable one or both of these
 * behaviors, independently of the size of the table; also there is a GUC
 * variable that can disable synchronized scanning.)
 *
 * During a rescan, don't make a new strategy object if we don't have to.
 */
if (!RelationUsesLocalBuffers(scan->rs_rd) &&
    scan->rs_nblocks > NBuffers / 4)
{
    allow_strat = scan->rs_allow_strat;
    allow_sync = scan->rs_allow_sync;
}
else
    allow_strat = allow_sync = false;

if (allow_strat)
{
    if (scan->rs_strategy == NULL)
        scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
}


I wonder if the logic here is just being too strict...
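
A quick way to check which side of that threshold a given table falls on
(pg_settings reports shared_buffers in 8kB pages, matching the default
block size; the table name here is the one from this thread):

  SELECT pg_relation_size('idt_photoobservation') / 8192  AS table_pages,
         (SELECT setting::bigint FROM pg_settings
           WHERE name = 'shared_buffers') / 4             AS threshold_pages;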

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Fri, May 25, 2012 at 11:17 AM, Stephen Frost sfr...@snowman.net wrote:
 Didn't we implement a system whereby this is exactly what we intend to
 happen on the read side- that is, everyone doing a SeqScan gangs up on
 one ring buffer and follows it, which we felt was going to dramatically
 improve performance in some cases?

 yeah:
 ...
 I wonder if the logic here is just being too strict...

I don't recall how much evidence there is behind the NBuffers/4 threshold.
Maybe that needs some tuning?

regards, tom lane



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Sergey Koposov

On Fri, 25 May 2012, Merlin Moncure wrote:

how likely is it to see multiple simultaneous scans in the real world?
Interleaving scans like that is not a very effective optimization --
if it was me, I'd be trying to organize something around a
partitioned tid scan for parallel sequential access.


Regarding the real-world use: I can say for myself that the reason I'm
running queries like the ones in my test is that I have a large inflow of
data every day, tens of GB, which has to be processed. Since PG doesn't have
any parallelization of queries built in, and running the processing in a
single thread would take too long, I'm trying to work around that by
manually running multiple threads in PG and splitting the work among them.
So not having scalability here is going to hurt us (but I admit that it's 
probably not a very popular use-case for PG).


Regards,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Robert Haas
On Thu, May 24, 2012 at 7:26 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
 by blk. Not sure whether that's of much use or not:

What are the top dozen or so entries if you sort by shacq?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Sergey Koposov

On Fri, 25 May 2012, Robert Haas wrote:


On Thu, May 24, 2012 at 7:26 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:

Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
by blk. Not sure whether that's of much use or not:


What are the top dozen or so entries if you sort by shacq?


Here it is:

PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
PID 7115 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7115 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7114 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7114 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7113 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7113 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7112 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7112 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7111 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7111 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7110 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7110 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7109 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7109 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7108 lwlock 1020: shacq 100 exacq 0 blk 0
PID 7108 lwlock 1018: shacq 100 exacq 0 blk 0
PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
PID 7109 lwlock 38: shacq 40 exacq 322 blk 115
PID 7111 lwlock 38: shacq 06 exacq 882 blk 152
PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
PID 7108 lwlock 36: shacq 205287 exacq 1592 blk 78
PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86
PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83
PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
PID 7108 lwlock 35: shacq 136067 exacq 1537 blk 56
PID 7115 lwlock 35: shacq 135532 exacq 494 blk 47
PID 7111 lwlock 35: shacq 135404 exacq 820 blk 76
PID 7109 lwlock 35: shacq 135362 exacq 326 blk 56
PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102
PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88
PID 7113 lwlock 35: shacq 135318 exacq 829 blk 75
PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98
PID 7108 lwlock 41: shacq 134338 exacq 1540 blk 55
PID 7115 lwlock 41: shacq 133929 exacq 542 blk 50
PID 7109 lwlock 41: shacq 133781 exacq 360 blk 49
PID 7111 

Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Merlin Moncure
On Fri, May 25, 2012 at 11:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 On Fri, May 25, 2012 at 11:17 AM, Stephen Frost sfr...@snowman.net wrote:
 Didn't we implement a system whereby this is exactly what we intend to
 happen on the read side- that is, everyone doing a SeqScan gangs up on
 one ring buffer and follows it, which we felt was going to dramatically
 improve performance in some cases?

 yeah:
 ...
 I wonder if the logic here is just being too strict...

 I don't recall how much evidence there is behind the NBuffers/4 threshold.
 Maybe that needs some tuning?


probably.  Sergey is going to get some numbers back from a hacked
version that forces the scan to use a strategy.

If that turns out to be the problem, I bet it'd make sense to hook the
decision to the ring size, which is currently internal to freelist.
In other words, since freelist is making the call to determine the
size of the ring, it should also probably be making the call in terms
of a particular input size being useful to apply a strategy to.

Maybe something like
IsAccessStrategyUseful(BufferAccessStrategyType btype, BlockNumber nblocks) ...

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-25 Thread Sergey Koposov

On Fri, 25 May 2012, Merlin Moncure wrote:

can you hack this in heapam.c and see if it helps?
line 131-ish:

if (!RelationUsesLocalBuffers(scan->rs_rd) &&
    scan->rs_nblocks > NBuffers / 4)
becomes
if (!RelationUsesLocalBuffers(scan->rs_rd))

(also you can set the partition count back).


The first few runs of my test with the suggested change made the 
multithreaded queries even slower by a factor of two.

E.g. 44 seconds for multithreaded vs ~ 7 sec single threaded.
(with the default 9.2beta I get ~ 14-22 sec for multithreaded runs)
But now after another few reruns of my test, I see again this variability 
in times for multithreaded runs. It went down to 18-22 seconds. Then again 
to 44 sec. So overall the change you suggested either doesn't improve 
the situation or makes it worse.


Regards,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



[HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Sergey Koposov

Hi,

I've been running some tests on pg 9.2beta1 and in particular a set
of queries like

create table _tmp0 as select * from (
select *, (select healpixid from idt_match as m where
m.transitid=o.transitid)
as x from idt_photoobservation as o offset 0
  ) as y where x%16=ZZZ order by x;

(where ZZZ is some integer number, 0 <= ZZZ < 16)

With the following plan:
--
 Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
   Sort Key: y.x
   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
         Filter: ((y.x % 16::bigint) = 0)
         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
               ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                             Index Cond: (transitid = o.transitid)

The schema of the tables are:

e2es2= \d idt_match
 Table "public.idt_match"
  Column   |   Type   | Modifiers
---+--+---
 sourceid  | bigint   |
 transitid | bigint   |
 healpixid | bigint   |
 flagsxm   | smallint |
Indexes:
    "idt_match_idx" btree (healpixid)
    "idt_match_transitid_idx" btree (transitid)

Table "public.idt_photoobservation"
  Column   |   Type   | Modifiers
---+--+---
 transitid | bigint   |
 fluxbp0   | real |

more columns

Indexes:
    "idt_photoobservation_idx" btree (transitid)

And I noticed that when I run a query like the one shown above in parallel
(in multiple connections for ZZZ=0...8) the performance of each query 
drops significantly (factor of 2) (despite the fact that during the 
execution of the query postgres is mostly CPU bound).


So I tried to oprofile it and strace it, and that's what I saw:

The strace -c of each process  shows something like that
#
Process 18660 detached
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 76.250.001342   0268987   semop
 23.750.000418   0 61694   read
  0.000.00   0   138   lseek
  0.000.00   0   355   select
  0.000.00   0 3   kill
-- --- --- - - 
100.000.001760331177   total
###

And the oprofile shows this on top:
---
2863981  25.7117  ReleasePredicateLocks
  2863981  100.000  ReleasePredicateLocks [self]
---
1246629  11.1917  LocalBufferAlloc
  1246629  100.000  LocalBufferAlloc [self]
---
1135393  10.1931  CheckForSerializableConflictIn
  1135393  100.000  CheckForSerializableConflictIn [self]


So there is a lot of locking for some reason, and I was surprised to see 
anything related to SSI in the profile at all, because I'm not running 
serializable transactions (I was just running my queries from multiple 
psql sessions).


Is there a reasonable explanation for what I'm seeing?

Thanks in advance,
Sergey

More info: there is no other concurrent activity on the machine.
shared_buffers is 10G. The machine has ~ 60G of RAM, 24 cores and proper 
RAID.


Another note is that the oprofile and strace were obtained during the 
stage when the idt_photo.. table was scanned and data was being stored in 
the pgsql_tmp (not during the last bit when the data in pgsql_tmp were 
actually being sorted).


*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Merlin Moncure
On Thu, May 24, 2012 at 8:24 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Hi,

 I've been running some tests on pg 9.2beta1 and in particular a set
 of queries like

 create table _tmp0 as select * from (
        select *, (select healpixid from idt_match as m where
                                        m.transitid=o.transitid)
                as x from idt_photoobservation as o offset 0
          ) as y where x%16=ZZZ order by x;

 (where ZZZ is some integer number, 0 <= ZZZ < 16)

 With the following plan:
 --
  Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
    Sort Key: y.x
    ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
          Filter: ((y.x % 16::bigint) = 0)
          ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
                ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                      SubPlan 1
                        ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                              Index Cond: (transitid = o.transitid)

 The schema of the tables are:

 e2es2= \d idt_match
     Table public.idt_match
  Column   |   Type   | Modifiers
 ---+--+---
  sourceid  | bigint   |
  transitid | bigint   |
  healpixid | bigint   |
  flagsxm   | smallint |
 Indexes:
    idt_match_idx btree (healpixid)
    idt_match_transitid_idx btree (transitid)

 Table public.idt_photoobservation
  Column   |   Type   | Modifiers
 ---+--+---
  transitid | bigint   |
  fluxbp0   | real     |
 
 more columns
 
 Indexes:
    idt_photoobservation_idx btree (transitid)

 And I noticed that when I run a query like the one shown above in parallel
 (in multiple connections for ZZZ=0...8) the performance of each query drops
 significantly (factor of 2) (despite the fact that during the execution
 of the query postgres is mostly CPU bound).

 So I tried to oprofile it and strace it, And that's what I saw:

 The strace -c of each process  shows something like that
 #
 Process 18660 detached
 % time     seconds  usecs/call     calls    errors syscall
 -- --- --- - - 
  76.25    0.001342           0    268987           semop
  23.75    0.000418           0     61694           read
  0.00    0.00           0       138           lseek
  0.00    0.00           0       355           select
  0.00    0.00           0         3           kill
 -- --- --- - - 
 100.00    0.001760                331177           total
 ###

 And the oprofile shows this on top:
 ---
 2863981  25.7117  ReleasePredicateLocks
  2863981  100.000  ReleasePredicateLocks [self]
 ---
 1246629  11.1917  LocalBufferAlloc
  1246629  100.000  LocalBufferAlloc [self]
 ---
 1135393  10.1931  CheckForSerializableConflictIn
  1135393  100.000  CheckForSerializableConflictIn [self]
 

 So there is a lot of locking for some reason, and I was surprised to see
 anything related to SSI in the profile at all, because I'm not running
 serializable transactions (I was just running my queries from multiple psql
 sessions).

Are you sure?  I looked at all the ReleasePredicateLocks calls and
they appear to be guarded by:

/* Nothing to do if this is not a serializable transaction */
if (MySerializableXact == InvalidSerializableXact)
    return false;

What's the default isolation mode set to?
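
Easy to check from the session itself:

  SHOW default_transaction_isolation;
  SHOW transaction_isolation;   -- for the currently open transaction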

merlin



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Sergey Koposov

On Thu, 24 May 2012, Merlin Moncure wrote:


Are you sure?  I looked at all the ReleasePredicateLocks calls and
they appear to be guarded by:

   /* Nothing to do if this is not a serializable transaction */
   if (MySerializableXact == InvalidSerializableXact)
       return false;
What's the default isolation mode set to?


Sorry, it was indeed my mistake. I was pointing opreport at a binary 
which had been recompiled, while the old version was still running, so 
I guess oprofile was picking up the wrong function names.


Here is the correct oprofile when multiple queries are running:
samples  %symbol name
---
952457   13.8715  LWLockAcquire
  952457   100.000  LWLockAcquire [self]
---
779077   11.3464  PinBuffer
  779077   100.000  PinBuffer [self]
---
759898   11.0671  s_lock
  759898   100.000  s_lock [self]
---
6897536.7178  slot_deform_tuple
  689753   100.000  slot_deform_tuple [self]
---
5260025.1230  UnpinBuffer
  526002   100.000  UnpinBuffer [self]

When only one is running:
samples  %symbol name
---
163268   14.0343  slot_deform_tuple
  163268   100.000  slot_deform_tuple [self]
---
126018   10.8324  _bt_compare
  126018   100.000  _bt_compare [self]
---
1139759.7972  ExecProject
  113975   100.000  ExecProject [self]
---
49760 4.2773  FunctionCall2Coll
  49760100.000  FunctionCall2Coll [self]
---
49164 4.2261  LWLockAcquire
  49164100.000  LWLockAcquire [self]
---
43526 3.7414  hash_search_with_hash_value
  43526100.000  hash_search_with_hash_value [self]



I guess there is nothing catastrophically wrong with that, but still I'm 
very surprised that you get severe locking problems (a factor of two 
slow-down) when running parallel read-only transactions.


Sergey

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Jeff Janes
On Thu, May 24, 2012 at 6:24 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Hi,

 I've been running some tests on pg 9.2beta1 and in particular a set
 of queries like
...

 And I noticed that when I run a query like the one shown above in parallel
 (in multiple connections for ZZZ=0...8) the performance of each query drops
 significantly (factor of 2) (despite the fact that during the execution
 of the query postgres is mostly CPU bound).

Is this a regression from previous versions (9.0, 9.1), or has the
slowdown always been there?

Cheers,

Jeff



Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Robert Haas
On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 I guess there is nothing catastrophically wrong with that, but still I'm
 very surprised that you get severe locking problems (a factor of two slow-down)
 when running parallel read-only transactions.

Me, too.  How many concurrent connections are you running, and does
your working set exceed shared_buffers?  Can you provide a
self-contained reproducible test case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Sergey Koposov

On Thu, 24 May 2012, Robert Haas wrote:


On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:

I guess there is nothing catastrophically wrong with that, but still I'm
very surprised that you get severe locking problems (factor of two slow-down)
when running parallel read-only transactions.


Me, too.  How many concurrent connections are you running, and does
your working set exceed shared_buffers?  Can you provide a
self-contained reproducible test case?


The last tests I've been doing were with 8 connections.
And the working set is roughly 30 GB, which is ~3x the shared buffers
(but ~50% of RAM).


Regarding the test case, I'll try to see whether I can still observe the
same slowdown if I chop the main table down by a factor of a few, so I
can put the data somewhere for download.


S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Robert Haas
On Thu, May 24, 2012 at 2:19 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 24 May 2012, Robert Haas wrote:
 On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov kopo...@ast.cam.ac.uk
 wrote:

 I guess there is nothing catastrophically wrong with that, but still I'm
 very surprised that you get severe locking problems (factor of two
 slow-down)
 when running parallel read-only transactions.

 Me, too.  How many concurrent connections are you running, and does
 your working set exceed shared_buffers?  Can you provide a
 self-contained reproducible test case?

 The last tests I've been doing were with 8 connections.
 And the working set is roughly 30 GB, which is ~3x the shared buffers
 (but ~50% of RAM).

Given that additional information, I would say these results are
expected.  Unfortunately, our BufFreelistLock is a serious contention
point, and I think that's what you're hitting.  See the graph here:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

As you can see, raw performance isn't much worse with the larger data
sets, but scalability at high connection counts is severely degraded
once the working set no longer fits in shared_buffers.
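To make the shape of the problem concrete, here is a deliberately
simplified sketch -- not PostgreSQL's actual StrategyGetBuffer(), and
with threads standing in for backends -- of a clock sweep behind a
single free-list lock:

    #include <pthread.h>

    #define NBUFFERS 1024

    typedef struct BufferDesc
    {
        int usage_count;
        int pinned;
    } BufferDesc;

    static BufferDesc buffers[NBUFFERS];
    static int next_victim;                 /* the clock hand */
    static pthread_mutex_t buf_freelist_lock = PTHREAD_MUTEX_INITIALIZER;

    static BufferDesc *
    get_victim_buffer(void)
    {
        BufferDesc *buf;

        pthread_mutex_lock(&buf_freelist_lock);
        for (;;)
        {
            buf = &buffers[next_victim];
            next_victim = (next_victim + 1) % NBUFFERS;
            if (!buf->pinned && buf->usage_count == 0)
                break;                      /* evict this one */
            if (buf->usage_count > 0)
                buf->usage_count--;         /* age it, try the next */
        }
        pthread_mutex_unlock(&buf_freelist_lock);
        return buf;
    }

Once the working set exceeds the pool, every backend is in here on
nearly every read, and throughput is bounded by how fast that one lock
can be passed around.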

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Merlin Moncure
On Thu, May 24, 2012 at 1:43 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 24, 2012 at 2:19 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 On Thu, 24 May 2012, Robert Haas wrote:
 On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov kopo...@ast.cam.ac.uk
 wrote:

 I guess there is nothing catastrophically wrong with that, but still I'm
 very surprised that you get severe locking problems (factor of two
 slow-down)
 when running parallel read-only transactions.

 Me, too.  How many concurrent connections are you running, and does
 your working set exceed shared_buffers?  Can you provide a
 self-contained reproducible test case?

 The last tests I've been doing were with 8 connections.
 And the working set is roughly 30 GB, which is ~3x the shared buffers
 (but ~50% of RAM).

 Given that additional information, I would say these results are
 expected.  Unfortunately, our BufFreelistLock is a serious contention
 point, and I think that's what you're hitting.  See the graph here:

 http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

 As you can see, raw performance isn't much worse with the larger data
 sets, but scalability at high connection counts is severely degraded
 once the working set no longer fits in shared_buffers.

Hm, wouldn't the BufFreelistLock issue be ameliorated if
StrategyGetBuffer could reserve multiple buffers so that you'd draw
down your local list and only then go back to the global pool? (easier
said than done obviously).
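Roughly this kind of thing, perhaps -- an entirely hypothetical sketch,
with made-up names and threads standing in for backends:

    #include <pthread.h>

    #define NBUFFERS   1024
    #define BATCH_SIZE 16

    typedef struct BufferDesc
    {
        int usage_count;
        int pinned;
    } BufferDesc;

    static BufferDesc buffers[NBUFFERS];
    static int next_victim;
    static pthread_mutex_t buf_freelist_lock = PTHREAD_MUTEX_INITIALIZER;

    /* One step of the clock sweep; caller holds buf_freelist_lock. */
    static BufferDesc *
    sweep_one(void)
    {
        for (;;)
        {
            BufferDesc *buf = &buffers[next_victim];

            next_victim = (next_victim + 1) % NBUFFERS;
            if (!buf->pinned && buf->usage_count == 0)
                return buf;
            if (buf->usage_count > 0)
                buf->usage_count--;
        }
    }

    /* Backend-local stash, refilled BATCH_SIZE victims at a time, so
     * the global lock is taken 1/BATCH_SIZE as often. */
    static __thread BufferDesc *local_list[BATCH_SIZE];
    static __thread int local_count;

    static BufferDesc *
    get_victim_buffer_batched(void)
    {
        if (local_count == 0)
        {
            pthread_mutex_lock(&buf_freelist_lock);
            while (local_count < BATCH_SIZE)
                local_list[local_count++] = sweep_one();
            pthread_mutex_unlock(&buf_freelist_lock);
        }
        return local_list[--local_count];
    }

The catch, presumably, is that buffers parked on a local list can be
pinned or re-touched by other backends before they are actually reused,
which is where the "easier said than done" comes in.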

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Jeff Janes
On Thu, May 24, 2012 at 12:46 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, May 24, 2012 at 2:24 PM, Merlin Moncure mmonc...@gmail.com wrote:
 As you can see, raw performance isn't much worse with the larger data
 sets, but scalability at high connection counts is severely degraded
 once the working set no longer fits in shared_buffers.

 Hm, wouldn't the BufFreelistLock issue be ameliorated if
 StrategyGetBuffer could reserve multiple buffers so that you'd draw
 down your local list and only then go back to the global pool? (easier
 said than done obviously).

 hm, looking at the code some more, it looks like the whole point of
 the strategy system is to do that.

I thought you were suggesting that the StrategyGetBuffer would
pre-allocate multiple buffers to a backend under the cover of a single
BufFreelistLock.  If that is what you were suggesting, that is not
what the strategy system is currently for.  It is for locally reusing
buffers, not for gang-allocating them.

If a backend could somehow predict that the buffer it is about to read
in is likely going to be a cold buffer, perhaps it would make sense
for each backend to maintain a small ring of its own which it can
reuse for such cold buffers.
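Something along those lines, say -- purely hypothetical, loosely
modeled on the existing BufferAccessStrategy ring, with
get_victim_buffer() and buffer_is_pinned() standing in for the real
shared-pool path:

    #define RING_SIZE 8

    typedef struct BufferDesc BufferDesc;

    /* Stand-ins for the real buffer-manager entry points. */
    extern BufferDesc *get_victim_buffer(void);
    extern int buffer_is_pinned(const BufferDesc *buf);

    typedef struct ColdRing
    {
        BufferDesc *slots[RING_SIZE];       /* zero-initialize first */
        int         next;                   /* next slot to recycle */
    } ColdRing;

    static BufferDesc *
    ring_get_buffer(ColdRing *ring)
    {
        BufferDesc *buf = ring->slots[ring->next];

        if (buf != NULL && !buffer_is_pinned(buf))
        {
            /* Recycle our own previous cold buffer: no global lock. */
            ring->next = (ring->next + 1) % RING_SIZE;
            return buf;
        }

        /* Ring not primed yet, or someone pinned the buffer: fall back
         * to the shared allocator and remember the victim for reuse. */
        buf = get_victim_buffer();
        ring->slots[ring->next] = buf;
        ring->next = (ring->next + 1) % RING_SIZE;
        return buf;
    }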



 ISTM bulk insert type queries
 would be good candidates for a buffer strategy somehow?

Probably.  There is a code or README comment to that effect that I
stumbled upon just a couple of hours ago, but can't immediately re-find
it.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Robert Haas
On Thu, May 24, 2012 at 3:46 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, May 24, 2012 at 2:24 PM, Merlin Moncure mmonc...@gmail.com wrote:
 As you can see, raw performance isn't much worse with the larger data
 sets, but scalability at high connection counts is severely degraded
 once the working set no longer fits in shared_buffers.

 Hm, wouldn't the BufFreelistLock issue be ameliorated if
 StrategyGetBuffer could reserve multiple buffers so that you'd draw
 down your local list and only then go back to the global pool? (easier
 said than done obviously).

 hm, looking at the code some more, it looks like the whole point of
 the strategy system is to do that.  ISTM bulk insert type queries
 would be good candidates for a buffer strategy somehow?

Yep.

commit 85e2cedf985bfecaf43a18ca17433070f439fb0e
Author: Tom Lane t...@sss.pgh.pa.us
Date:   Thu Nov 6 20:51:15 2008 +

Improve bulk-insert performance by keeping the current target buffer pinned
(but not locked, as that would risk deadlocks).  Also, make it work in a small
ring of buffers to avoid having bulk inserts trash the whole buffer arena.

Robert Haas, after an idea of Simon Riggs'.

But that doesn't help here, since pgbench is only looking up one row per query.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Merlin Moncure
On Thu, May 24, 2012 at 3:35 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 24, 2012 at 3:46 PM, Merlin Moncure mmonc...@gmail.com wrote:
 hm, looking at the code some more, it looks like the whole point of
 the strategy system is to do that.  ISTM bulk insert type queries
 would be good candidates for a buffer strategy somehow?

 Yep.

 commit 85e2cedf985bfecaf43a18ca17433070f439fb0e
 Author: Tom Lane t...@sss.pgh.pa.us
 Date:   Thu Nov 6 20:51:15 2008 +

    Improve bulk-insert performance by keeping the current target buffer pinned
    (but not locked, as that would risk deadlocks).  Also, make it work in a
    small ring of buffers to avoid having bulk inserts trash the whole buffer
    arena.

    Robert Haas, after an idea of Simon Riggs'.

 But that doesn't help here, since pgbench is only looking up one row per 
 query.

Wait -- the OP's gripe isn't about standard pgbench, but about multiple
large concurrent 'insert into foo select...' queries.  I looked in the
code and it appears that the only operations using a bulk-insert
strategy are COPY, CREATE TABLE AS SELECT, and table rewrites.
Concurrent INSERT ... SELECT apparently doesn't get the benefit of a
strategy and should be fighting over the freelist once the pool is
exhausted.

We don't get to skip WAL of course, but we should be able to use a
bulk-insert strategy, especially if there were some way of predicting
that a large number of tuples were going to be inserted.  I'm
wondering, though, if contention on the free list is in fact the OP's
problem.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Robert Haas
On Thu, May 24, 2012 at 4:46 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Wait -- the OP's gripe isn't about standard pgbench, but about multiple
 large concurrent 'insert into foo select...' queries.  I looked in the
 code and it appears that the only operations using a bulk-insert
 strategy are COPY, CREATE TABLE AS SELECT, and table rewrites.
 Concurrent INSERT ... SELECT apparently doesn't get the benefit of a
 strategy and should be fighting over the freelist once the pool is
 exhausted.

I think you are right.

 We don't get to skip WAL of course, but we should be able to use a
 bulk-insert strategy, especially if there were some way of predicting
 that a large number of tuples were going to be inserted.  I'm
 wondering, though, if contention on the free list is in fact the OP's
 problem.

Not sure.  It might be some other LWLock, but it's hard to tell which
one from the information provided.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, May 24, 2012 at 4:46 PM, Merlin Moncure mmonc...@gmail.com wrote:
 We don't get to skip WAL of course, but we should be able to use a
 bulk-insert strategy, especially if there were some way of predicting
 that a large number of tuples were going to be inserted.  I'm
 wondering, though, if contention on the free list is in fact the OP's
 problem.

 Not sure.  It might be some other LWLock, but it's hard to tell which
 one from the information provided.

Yeah.  It seems quite plausible that Robert's select-only benchmark might
be mainly tripping over the freelist lock, but I'm less convinced about
something that's doing INSERT/SELECT, and therefore is also doing a lot
of WAL activity, index insertions, etc.  I'd want to see some
instrumentation results before assuming we know where the bottleneck is
there.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Sergey Koposov

Hi,

On Thu, 24 May 2012, Robert Haas wrote:

Not sure.  It might be some other LWLock, but it's hard to tell which
one from the information provided.


If you could tell me the best way to find out the info that you need,
I could run it reasonably quickly.


S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Jeff Janes
On Thu, May 24, 2012 at 3:36 PM, Sergey Koposov kopo...@ast.cam.ac.uk wrote:
 Hi,


 On Thu, 24 May 2012, Robert Haas wrote:

 Not sure.  It might be some other LWLock, but it's hard to tell which
 one from the information provided.


 If you could tell what's the best way to find out the info that you need,
 then I could run it reasonably quickly.

Add
#define LWLOCK_STATS
near the top of:
src/backend/storage/lmgr/lwlock.c

and recompile and run a reduced-size workload.  When the processes
exit, they will dump a lot of data about LWLock usage to the logfile.
 Generally the LWLock with the most blocks on it will be the main
culprit.
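
For the record, the change is literally a one-liner, something like:

    /* near the top of src/backend/storage/lmgr/lwlock.c */
    #define LWLOCK_STATS        /* compiles in the per-lock counters */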

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

2012-05-24 Thread Sergey Koposov

On Thu, 24 May 2012, Jeff Janes wrote:

Add
#define LWLOCK_STATS
near the top of:
src/backend/storage/lmgr/lwlock.c

and recompile and run a reduced-size workload.  When the processes
exits, they will dump a lot of data about LWLock usage to the logfile.
Generally the LWLock with the most blocks on it will be the main
culprit.


Here is the output from a multi-threaded run (8 threads, 22 seconds),
sorted by blk. Not sure whether that's of much use or not:


PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
PID 7111 lwlock 38: shacq 06 exacq 882 blk 152
PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
PID 7109 lwlock 38: shacq 40 exacq 322 blk 115
PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 109
PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
PID 7108 lwlock 0: shacq 0 exacq 6504 blk 108
PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
PID 7113 lwlock 37: shacq 127040 exacq