Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Jim Nasby

On 3/18/13 2:25 PM, Simon Riggs wrote:

On 18 March 2013 19:02, Jeff Davis pg...@j-davis.com wrote:

On Sun, 2013-03-17 at 22:26 -0700, Daniel Farina wrote:

as long as I am able to turn them off easily


To be clear: you don't get the performance back by doing
ignore_checksum_failure = on. You only get around the error itself,
which allows you to dump/reload the good data.


Given that the worst pain point comes from setting hint bits during a
large SELECT, it makes sense to offer an option to simply skip hint
bit setting when we are reading data (SELECT, not
INSERT/UPDATE/DELETE). That seems like a useful option even without
checksums. I know I have seen cases across many releases where setting
that would have been good, since it puts the cleanup back onto
VACUUM/writers, rather than occasional SELECTs.


+1


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jim Nasby

On 3/22/13 7:27 PM, Ants Aasma wrote:

On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:

Well, if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately. Only if the non-locking test comes back good
do you then do a hardware TAS.

You could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount. So you only lock if/when it's time to select a candidate
buffer, and even then only after a non-locking test first. This would
of course require some amusing adjustments to various logical checks
(usage_count = 0, heh).
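Merlin's two-step test can be sketched as follows (an illustrative Python model, not the actual bufmgr.c code; the `threading.Lock` stands in for the buffer-header spinlock, and the names are hypothetical):

```python
import threading

class Buffer:
    def __init__(self):
        self.usage_count = 0
        self.lock = threading.Lock()  # stands in for the buffer-header spinlock

def find_victim(buffers):
    """Clock-sweep step with a non-locking pre-check.

    The unlocked read may be stale, but getting it wrong is harmless: we
    either skip a buffer or take the lock unnecessarily, and the decision
    is re-checked under the lock (the 'hardware TAS' in the real code)."""
    for buf in buffers:
        if buf.usage_count > 0:        # cheap unlocked test: skip hot buffers
            continue
        with buf.lock:                 # only now pay for the locked test
            if buf.usage_count == 0:   # re-check; a racer may have bumped it
                return buf
    return None
```

The point is that the expensive atomic operation is only attempted on buffers the cheap read already suggested were candidates.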


Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (If you arrange it as a
read-test-inc/dec-store operation then you will never go out of
bounds.) However, clocksweep and usage_count maintenance is not what is
causing contention, because that workload is distributed. The issue is
pinning and unpinning: there we need an accurate count, and there are
some pages, like index roots, that get hit very heavily. In my opinion,
the things to do there are to convert to a futex-based spinlock, so
that contention doesn't completely kill performance, and then to try to
get rid of the contention itself. Converting to lock-free pinning
won't help much here, as what is killing us is the cacheline
bouncing.
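The parenthetical about read-test-inc/dec-store never going out of bounds can be modeled as clamped updates (a sketch only; the cap of 5 for BM_MAX_USAGE_COUNT is an assumption about the PostgreSQL source of the time, and a lost racing update merely loses one advisory bump):

```python
BM_MAX_USAGE_COUNT = 5  # assumed cap on usage_count, as in PostgreSQL's bufmgr

def bump(usage_count):
    """Read-test-inc-store: the stored value can never exceed the bound,
    even if a concurrent update is lost -- acceptable for an advisory count."""
    return usage_count + 1 if usage_count < BM_MAX_USAGE_COUNT else usage_count

def decay(usage_count):
    """Read-test-dec-store: the stored value can never go below zero."""
    return usage_count - 1 if usage_count > 0 else usage_count
```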

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what the problem is
there and figuring out how to fix it. A simple idea to test would be
to partition shared buffers, along with the whole clock sweep
machinery, into smaller pieces, the way the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep, even if it doesn't reduce the work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
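The partitioning idea could look like the existing buffer-mapping partitions: hash the buffer tag to pick a partition, and give each partition its own clock hand (an illustrative sketch; names and the partition count are hypothetical, loosely mirroring NUM_BUFFER_PARTITIONS of that era):

```python
NUM_SWEEP_PARTITIONS = 16  # illustrative; not a value taken from the source tree

def sweep_partition(buffer_tag):
    # Route each buffer to a partition by hashing its tag, the same way the
    # buffer mapping hash table is already partitioned.
    return hash(buffer_tag) % NUM_SWEEP_PARTITIONS

# Each partition gets its own clock hand (and could get its own free list),
# so concurrent evictions usually contend on different partitions.
clock_hands = [0] * NUM_SWEEP_PARTITIONS
```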


Partitioned clock sweep strikes me as a bad idea... you could certainly get 
unlucky and end up with a lot of hot stuff in one partition.

Another idea that's been brought up in the past is to have something in the
background keep a minimum number of buffers on the free list. That's how the OS VM
systems I'm familiar with work, so there's precedent for it.

I recall there were at least some theoretical concerns about this, but I don't 
remember if anyone actually tested the idea.




Re: [HACKERS] Let's invent a function to report lock-wait-blocking PIDs

2013-03-23 Thread Jim Nasby

On 3/20/13 10:36 PM, Tom Lane wrote:

Simon Riggs si...@2ndquadrant.com writes:

On 20 March 2013 18:02, Tom Lane t...@sss.pgh.pa.us wrote:

The API that comes to mind is (name subject to
bikeshedding)

pg_blocking_pids(pid int) returns int[]



Useful. Can we also have an SRF rather than an array?


I thought about that, but at least for the isolationtester use-case,
the array result is clearly easier to use.  You can get from one to the
other with unnest() or array_agg(), so I don't really feel a need to
provide both.  Can you generate use-cases where the set-result approach
is superior?


Unless pg_blocking_pids(..) RETURNS SETOF would be significantly faster than 
unnest(), not directly, BUT...

Anytime I'm looking at locks I almost always want to know not only who's
blocking whom, but what they're actually blocking on. Related to that, I also
wish we had a way to provide more info about why we're blocked on an XID, since
just pointing your finger at a backend often doesn't do much to tell you what
caused the block in the first place.

So from that standpoint, I'd prefer that pg_blocking_pids returned enough info 
to tell me exactly which locks were blocking.

*thinking*

Actually, is it possible for a backend to have more than one ungranted lock? If 
not then I suppose that would be good enough to tell you which lock had the 
problem.

On the performance side, I've also often wished for a way to pull data from
pg_* tables/functions atomically; would it be reasonable to have a separate
function that would copy everything from the proc array into local memory so
you could query it from there to your heart's content? Bonus points if it could
also copy all or part of the statistics file.





Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Jim Nasby

I realize Simon relented on this, but FWIW...

On 3/16/13 4:02 PM, Simon Riggs wrote:

Most other data we store doesn't consist of
large runs of 0x00 or 0xFF as data. Most data is more complex than
that, so any runs of 0s or 1s written to the block will be detected.

...

It's not that uncommon for folks to have tables that have a bunch of 
int[2,4,8]s all in a row, and I'd bet it's not uncommon for a lot of those 
fields to be zero.


Checksums are for detecting problems. What kind of problems? Sporadic
changes of bits? Or repeated errors. If we were trying to trap
isolated bit changes then CRC-32 would be appropriate. But I'm
assuming that whatever causes the problem is going to recur,


That's the opposite of my experience. When we've had corruption events, we normally have
one to several blocks with problems show up essentially all at once. Of course we can't
prove that all the corruption happened at exactly the same time, but I believe it's a 
strong possibility. If it wasn't exactly the same time it was certainly over a span of 
minutes to hours... *but* we've never seen new corruption occur after we start an 
investigation (we frequently wait several hours for the next time we can take an outage 
without incurring a huge loss in revenue). That we would run for a number of hours with 
no additional corruption leads me to believe that whatever caused the corruption was 
essentially a one-time [1] event.

[1] One-time except for the fact that there were several periods where we would 
have corruption occur in 12 or 6 month intervals.




Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Jim Nasby

On 3/20/13 8:41 AM, Bruce Momjian wrote:

On Mon, Mar 18, 2013 at 01:52:58PM -0400, Bruce Momjian wrote:

I assume a user would wait until they suspected corruption to turn it
on, and because it is only initdb-enabled, they would have to
dump/reload their cluster.  The open question is whether this is a
usable feature as written, or whether we should wait until 9.4.

pg_upgrade can't handle this because the old/new clusters would have the
same catalog version number and the tablespace directory names would
conflict.  Even if they are not using tablespaces, the old heap/index
files would not have checksums and therefore would throw an error as
soon as you accessed them.  In fact, this feature is going to need
pg_upgrade changes to detect from pg_controldata that the old/new
clusters have the same checksum setting.


A few more issues with pg_upgrade: if we ever decide to change the
checksum calculation in a later major release, pg_upgrade might not work
because of the checksum change but could still work for users who don't
use checksums.

Also, while I understand why we have to set the checksum option at
initdb time, it seems we could enable users to turn it off after initdb
--- is there any mechanism for this?

Also, if a user uses checksums in 9.3, could they initdb without
checksums in 9.4 and use pg_upgrade?  As coded, the pg_controldata
checksum settings would not match and pg_upgrade would throw an error,
but it might be possible to allow this, i.e. you could go from a checksum
cluster to a no-checksum cluster, but not from no checksum to checksum.  I
am wondering if the patch should reflect this.


If the docs don't warn about this, they should, but I don't think it's the 
responsibility of this patch to deal with that problem. The reason I don't 
believe this patch should deal with it is because that is a known, rather 
serious, limitation of pg_upgrade. It's something about pg_upgrade that just 
needs to be fixed, regardless of what patches might make the situation worse.




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Amit Kapila
On Saturday, March 23, 2013 9:34 AM Jim Nasby wrote:
 On 3/22/13 7:27 PM, Ants Aasma wrote:
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
 
  One other interesting idea I have seen is closeable scalable nonzero
  indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
  use a tree structure to dynamically stripe access to the shared lock
  counter when contention is detected. Downside is that considerable
  amount of shared memory is needed so there needs to be some way to
  limit the resource usage. This is actually somewhat isomorphic to the
  nailing idea.
 
  The issue with the current buffer management algorithm is that it
  seems to scale badly with increasing shared_buffers. I think the
  improvements should concentrate on finding out what is the problem
  there and figuring out how to fix it. A simple idea to test would be
  to just partition shared buffers along with the whole clock sweep
  machinery into smaller ones, like the buffer mapping hash tables
  already are. This should at the very least reduce contention for the
  clock sweep even if it doesn't reduce work done per page miss.
 
  [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
 
 Partitioned clock sweep strikes me as a bad idea... you could certainly
 get unlucky and end up with a lot of hot stuff in one partition.
 
 Another idea that's been brought up in the past is to have something in
 the background keep a minimum number of buffers on the free list.
 That's how OS VM systems I'm familiar with work, so there's precedent
 for it.
 
 I recall there were at least some theoretical concerns about this, but
 I don't remember if anyone actually tested the idea.

I have tried one of these ideas: adding the buffers the background writer finds
reusable to the freelist.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
This can reduce the clock sweep, as backends can find buffers on the freelist.

It shows a performance improvement for read loads when the data can be contained
in shared buffers, but when the data becomes large and I/O is involved, it shows
some dip as well.


With Regards,
Amit Kapila.





Re: [HACKERS] Single-argument variant for array_length and friends?

2013-03-23 Thread Brendan Jurd
On 22 March 2013 09:12, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Mar 21, 2013 at 2:00 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Lots of PostgreSQL functions calculate with all items in an array without
 respect to dimensions - like unnest.

 So the concept of using only the outermost dimension is not in pg now, and
 should not be introduced if avoidable. Moreover, it goes against the
 verbosity concept introduced by Ada and reused in PL/SQL and PL/pgSQL.

 and pl/psm*

Yeah, okay.  That argument works for me.  Let's go for option (a),
only allow the user to omit the dimension argument if the array is
1-D.

We still have the issue that Tom isn't convinced that the feature is
worth pursuing -- Tom, would you please elaborate a little on what you
dislike about it?  I don't see much of a downside (just 3 extra
pg_procs).

Cheers,
BJ




Re: [HACKERS] SDP query optimizer

2013-03-23 Thread Adriano Lange

Hi,

On 22-03-2013 21:22, Josh Berkus wrote:

Woah!  Way cool.

As a warning, we're in the closing throes of version 9.3 right now, so
if your code/ideas don't get the attention they deserve, that's why.


Ok. No problem. :-)


There is an incomplete project from a few years back to make the
non-exhaustive query planner pluggable so that we could use different
algorithms.  Unfortunately, it was never finished and merged with the
core code.  Your planner is yet another reason it would be great to
complete this.


Yes. I looked at Julius and Tomas's project on pgFoundry [1] some
years ago, but it was inactive. Therefore, I decided to start a new one.


[1] - http://pgfoundry.org/projects/optimizer/

Anyway, good work for all of you.

--
Adriano Lange





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 6:29 AM, Atri Sharma atri.j...@gmail.com wrote:
 One way to distribute memory contention in case of spinlocks could be
 to utilize the fundamentals of NUMA architecture. Specifically, we can
 let the contending backends spin on local flags instead on the buffer
 header flags directly. As access to local cache lines is much cheaper
 and faster than memory locations which are far away in NUMA, we could
 potentially reduce the memory overhead for a specific line and reduce
 the overall overheads as well.

This is not only a concern for NUMA architectures (which by now means all
multiprocessor machines); even multicore machines have overheads for
bouncing cache lines. The locks don't even have to be local: it's good
enough to give each contending backend a better probability of hitting a
different lock, if we take care that the locks don't share cache lines.
IMO that's the whole point of striping locks: the critical section is
usually cheaper than the cost of getting the cache line into an
exclusive state.
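The lock-striping idea can be modeled like this (an illustrative Python sketch; in C each slot would additionally be padded to its own cache line, which Python cannot express):

```python
import threading

N_STRIPES = 8  # illustrative stripe count

class StripedCounter:
    """Spread updates over several independently-locked slots so that
    concurrent updaters usually touch different locks (and, in C, different
    cache lines). Reading the total sums all slots."""
    def __init__(self):
        self.slots = [0] * N_STRIPES
        self.locks = [threading.Lock() for _ in range(N_STRIPES)]

    def add(self, backend_id, delta=1):
        i = backend_id % N_STRIPES      # different backends hit different slots
        with self.locks[i]:
            self.slots[i] += delta

    def value(self):
        return sum(self.slots)
```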

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby j...@nasby.net wrote:
 Partitioned clock sweep strikes me as a bad idea... you could certainly get
 unlucky and end up with a lot of hot stuff in one partition.

Surely that is no worse than having everything in a single partition.
Given a decent partitioning function, it's highly unlikely that more
than a few of the hottest buffers would end up in a single partition.

 Another idea that's been brought up in the past is to have something in the
 background keep a minimum number of buffers on the free list. That's how OS
 VM systems I'm familiar with work, so there's precedent for it.

 I recall there were at least some theoretical concerns about this, but I
 don't remember if anyone actually tested the idea.

Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed
out. But again, we have the issue of having good tests to see where
the changes hurt.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 5:14 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 Making zero a not checksummed magic value would significantly detract
 from the utility of checksums IMO.

FWIW, using a 65521 modulus to compress larger checksums into 16 bits
will leave 14 non-zero values unused.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Andres Freund
  Results for pgbench scale 100:
  No checksums:   tps = 56623.819783
  Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
  CRC Checksums:  tps = 50571.324795 (1.120x slowdown)
  SIMD Checksums: tps = 56608.888985 (1.000x slowdown)
 
  So to conclude, the 3 approaches:
 
  Great analysis. Still a tough choice.

+1

  One thing that might be interesting is to look at doing SIMD for both
  data and WAL. I wonder if that would be a noticeable speedup for WAL
  full-page writes? That would give greater justification for the extra
  work it will take (intrinsics/ASM), and it would be a nice win for
  non-checksum users.
 
 Andres showed that switching out the existing CRC for zlib's would
 result in an 8-30% increase in INSERT-SELECT speed
 (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
 with the sped-up CRC still showing up as 10% of the profile. So I
 guess another 5% speedup by doing the CRC 8 bytes at a time instead of
 the current 4, and another couple % by using Fletcher or SIMD.

I am not sure the considerations for WAL are the same as for page checksums -
the current WAL code only computes the CRCs in rather small chunks, so very
pipelinable algorithms/implementations don't necessarily show the same benefit
for WAL as they do for page checksums...

And even if the checksumming were changed to compute the CRC in larger
chunks - a very sensible thing imo - it would still be over relatively small
sizes in many workloads.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] SDP query optimizer

2013-03-23 Thread Andres Freund
On 2013-03-22 20:35:43 -0300, Adriano Lange wrote:
 Hi all,
 
 I have developed a new query optimizer for PostgreSQL and I would like to
 share it with the community. The optimizer's name is Sampling and Dynamic
 Programming (SDP). I put it into a plugin developed some years ago, named
 LJQO:
 
 https://github.com/alange0001/ljqo.git
 
 This plugin was configured to compile only against PostgreSQL 9.2. However,
 I guess it may be easily adjusted for other versions of PostgreSQL.
 
 I would be glad for any feedback about SDP or even about LJQO.
 
 I have some numbers about the SDP in comparison with GEQO. If interested,
 see a diff between the two .out2 files attached. The schema and query are
 from a previous email posted by Andres Freund in this list.

I just want to mention that unless you skew the statistics for the individual
tables away from their empty/default state, this mostly measures a pretty
degenerate case where optima are very rare and not very differentiated. That's
a useful thing to test, but not the target to optimize for.
So it might be interesting to run that thing with some table
stats/contents set up.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-23 Thread Andres Freund
On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
 Is someone planning to provide additional feedback about this patch at some
 point?

Yes, now that I have returned from my holidays - or well, am returning
from them - I do plan to. But it should probably also get some
implementation-level review from somebody other than Fujii and me...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 3:10 PM, Andres Freund and...@2ndquadrant.com wrote:
 Andres showed that switching out the existing CRC for zlib's would
 result in an 8-30% increase in INSERT-SELECT speed
 (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
 with the sped-up CRC still showing up as 10% of the profile. So I
 guess another 5% speedup by doing the CRC 8 bytes at a time instead of
 the current 4, and another couple % by using Fletcher or SIMD.

 I am not sure the considerations for WAL are the same as for page checksums -
 the current WAL code only computes the CRCs in rather small chunks, so very
 pipelinable algorithms/implementations don't necessarily show the same benefit
 for WAL as they do for page checksums...

Sure, but I think WAL checksums are not a big overhead in that case anyway.

I should point out that getting the SIMD algorithm to not be a loss
for small variable-sized workloads will take a considerable amount of
effort and code, whereas it's quite easy for pipelined CRC32 and
Fletcher (or should I say Adler, as we want to use mod 65521).
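For reference, the Adler variant of the Fletcher family mentioned here is simple to state; below is a minimal, unoptimized Python sketch of Adler-32 (the mod-65521 form), cross-checked against zlib's implementation. The real patch would of course use an optimized C implementation; this is illustration only:

```python
import zlib

MOD = 65521  # largest prime below 2**16, the modulus under discussion

def adler32(data: bytes) -> int:
    """Straightforward Adler-32: two running sums, each reduced mod 65521."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD   # sum of bytes (plus initial 1)
        b = (b + a) % MOD      # sum of the running sums
    return (b << 16) | a

payload = b"PostgreSQL page checksum demo"
assert adler32(payload) == zlib.adler32(payload)  # matches zlib's version
```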

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] pg_dump/restore syntax checking bug?

2013-03-23 Thread Joshua D. Drake


On 03/22/2013 10:13 PM, Josh Kupershmidt wrote:


On Fri, Mar 22, 2013 at 9:35 PM, Joshua D. Drake j...@commandprompt.com wrote:


postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc


Note, the pg_restore doc makes no mention of trying to squeeze
multiple function prototypes in a single argument you've done here, or
of using multiple -P flags.


It appears we need better syntax checking.


Can't really argue with this. But if you think these pg_restore
examples are bad, try this gem:
   reindexdb --table='foo; ALTER ROLE limited WITH superuser'


That is HORRIBLE! Looks like our base utilities need some attention.

jD



Josh








Re: [HACKERS] Enabling Checksums

2013-03-23 Thread Andres Freund
On 2013-03-23 15:36:03 +0200, Ants Aasma wrote:
 On Sat, Mar 23, 2013 at 3:10 PM, Andres Freund and...@2ndquadrant.com wrote:
  Andres showed that switching out the existing CRC for zlib's would
  result in an 8-30% increase in INSERT-SELECT speed
  (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
  with the sped-up CRC still showing up as 10% of the profile. So I
  guess another 5% speedup by doing the CRC 8 bytes at a time instead of
  the current 4, and another couple % by using Fletcher or SIMD.
 
  I am not sure the considerations for WAL are the same as for page checksums -
  the current WAL code only computes the CRCs in rather small chunks, so very
  pipelinable algorithms/implementations don't necessarily show the same benefit
  for WAL as they do for page checksums...
 
 Sure, but I think that WAL checksums are not a big overhead in that case anyway.

I have seen profiles that indicate rather the contrary... Even in the optimal
case of no FPWs, a single heap_insert() results in the CRC being computed in 5
steps or so, 4 of them over potentially noncontiguous, pointer-addressed memory.
If you add an index or two, where the situation is the same, the slowdown is
not all that surprising.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GENERAL] timeofday() and clock_timestamp() produce different results when casting to timestamptz

2013-03-23 Thread Tom Lane
Adrian Klaver adrian.kla...@gmail.com writes:
 Seems the time zone info still thinks Moscow time is being set
 forward an hour for DST when in fact the time remains constant through
 the year.

I think the discrepancy is between this in timezone/data/europe:

Zone Europe/Moscow   2:30:20 -  LMT 1880
 2:30   -   MMT 1916 Jul  3 # Moscow Mean Time
 2:30:48 Russia %s  1919 Jul  1 2:00
 3:00   Russia  MSK/MSD 1922 Oct
 2:00   -   EET 1930 Jun 21
 3:00   Russia  MSK/MSD 1991 Mar 31 2:00s
 2:00   Russia  EE%sT   1992 Jan 19 2:00s
 3:00   Russia  MSK/MSD 2011 Mar 27 2:00s
 4:00   -   MSK

and this in timezone/tznames/Default:

MSD 14400 D  # Moscow Daylight Time
 # (Europe/Moscow)
MSK 10800# Moscow Time
 # (Europe/Moscow)
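The inconsistency can be quantified directly from the two quoted files (a sketch; offsets expressed in seconds east of UTC, values transcribed from the excerpts above):

```python
# From timezone/data/europe: the final rule line "4:00 - MSK" says Moscow
# has been fixed at UTC+4 with no DST since 2011-03-27.
olson_msk_since_2011 = 4 * 3600        # 14400 seconds

# From timezone/tznames/Default: the abbreviation file still describes the
# pre-2011 regime, MSK at +3 with a daylight variant MSD at +4.
tznames_default = {"MSK": 10800, "MSD": 14400}

# So the abbreviation lookup disagrees with the zone definition, and 'MSD'
# still parses as a valid daylight abbreviation even though Moscow no
# longer observes DST.
print(tznames_default["MSK"], olson_msk_since_2011)  # 10800 vs 14400
```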

We really need to figure out a way to update the tznames data
automatically, or at least notice when it's become inconsistent with
the underlying Olson database.

regards, tom lane




[HACKERS] Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-03-23 Thread Nicholas White
 The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead,
lag, [...]. This is not implemented in PostgreSQL
(http://www.postgresql.org/docs/devel/static/functions-window.html)
I've had a go at implementing this, and I've attached the resulting patch.
It's not finished yet, but I was hoping to find out if my solution is along
the right lines.

In particular, I'm storing the ignore-nulls flag in the frameOptions of a
window function definition, and am adding a function to the windowapi.h to
get at these options. I'm keeping the last non-null value in
WinGetPartitionLocalMemory (which I hope is the right place), but I'm not
using any of the *GetDatum macros to access it.

An example of my change's behaviour:

nwhite=# select *, lag(num,0) ignore nulls over (order by generate_series)
from
nwhite-# (select generate_series from generate_series(0,10)) s
nwhite-# left outer join
nwhite-# numbers n
nwhite-# on (s.generate_series = n.num);
 generate_series | num | lag
-----------------+-----+-----
               0 |     |
               1 |   1 |   1
               2 |     |   1
               3 |     |   1
               4 |   4 |   4
               5 |   5 |   5
               6 |     |   5
               7 |     |   5
               8 |     |   5
               9 |   9 |   9
              10 |     |   9
(11 rows)
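The IGNORE NULLS behavior in that output amounts to forward-filling the most recent non-null value; here is a rough Python model of the semantics (not the patch's C implementation):

```python
from typing import List, Optional

def lag_ignore_nulls(values: List[Optional[int]]) -> List[Optional[int]]:
    """Mimic lag(num, 0) IGNORE NULLS: for each row, return the most
    recent non-null value seen so far (None until one appears)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v        # remember the latest non-null value
        out.append(last)    # emit it regardless of the current row's null-ness
    return out

num = [None, 1, None, None, 4, 5, None, None, None, 9, None]
print(lag_ignore_nulls(num))
# → [None, 1, 1, 1, 4, 5, 5, 5, 5, 9, 9]  (matches the lag column above)
```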

I'd find this feature really useful, so I hope you can help me get my patch
to a contributable state.

Thanks -

Nick


lead-lag-ignore-nulls.patch
Description: Binary data



Re: [HACKERS] SDP query optimizer

2013-03-23 Thread Adriano Lange

On 23-03-2013 10:15, Andres Freund wrote:

I just want to mention that unless you skew the statistics for the individual
tables away from their empty/default state, this mostly measures a pretty
degenerate case where optima are very rare and not very differentiated. That's
a useful thing to test, but not the target to optimize for.
So it might be interesting to run that thing with some table
stats/contents set up.



Yes, the search space obtained from this experiment may be much simpler
than a real case. Beyond this experiment, I can construct the classical
queries used to evaluate this kind of algorithm, such as stars, cliques,
chains and cycles. Beyond those queries, I have no idea how I can further
test it.


Regards,

Adriano Lange




Re: [HACKERS] Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-03-23 Thread Tom Lane
Nicholas White n.j.wh...@gmail.com writes:
 The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead,
 lag, [...]. This is not implemented in PostgreSQL
 (http://www.postgresql.org/docs/devel/static/functions-window.html)
 I've had a go at implementing this, and I've attached the resulting patch.
 It's not finished yet, but I was hoping to find out if my solution is along
 the right lines.

Since we're trying to get 9.3 to closure, this patch probably isn't
going to get much attention until the 9.4 development cycle starts
(in a couple of months, likely).  In the meantime, please add it to
the next commitfest list so we remember to come back to it:
https://commitfest.postgresql.org/action/commitfest_view?id=18

One comment just from a quick eyeball look is that we really hate
adding new keywords that aren't UNRESERVED, because that risks
breaking existing applications.  Please see if you can refactor the
grammar to make those new entries unreserved.

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Atri Sharma


 Partitioned clock sweep strikes me as a bad idea... you could certainly get
 unlucky and end up with a lot of hot stuff in one partition.

 Another idea that's been brought up in the past is to have something in the
 background keep a minimum number of buffers on the free list. That's how OS
 VM systems I'm familiar with work, so there's precedent for it.

 I recall there were at least some theoretical concerns about this, but I
 don't remember if anyone actually tested the idea.

 One way to handle this could be to have dynamic membership of pages
in the partitions. Based on activity for a page, it could be moved to
another partition. In this manner, we *could* distribute the hot and
not so hot buffer pages and hence it could help.

Regards,

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jeff Janes
On Thu, Mar 21, 2013 at 9:51 PM, Atri Sharma atri.j...@gmail.com wrote:

 Hello all,

 Sorry if this is a naive question.

 I was going through Greg Smith's slides on buffer
 cache(
 http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.

 Specifically, when a new entry is allocated in the buffer, it's
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.


It is replaced when the usage_count is already found to be zero, not when
it is made zero.


 I feel that this could lead to a bias towards replacement of
 relatively younger pages in the  cache over older pages. An entry
 which has just entered the cache with USAGE_COUNT=1 could be replaced
 soon, but it may be needed frequently in the near future,


Well, it may be needed.  But then again, it may not be needed.  And that
old page, that also may be needed frequently in the future (which is far
more likely than a new page--after all an old page likely got old for a
reason).  The best evidence that it will be needed again is that it
actually has been needed again.

I'm more convinced in the other direction, new pages should enter with 0
rather than with 1.  I think that the argument that a new buffer needs to
be given more of an opportunity to get used again is mostly bogus.  You
cannot bully the shared_buffers into being larger than it is.  If all the
incoming buffers get more opportunity, that just means the buffer-clock
ticks twice as fast, and really none of them has more opportunity when you
measure that opportunity against an outside standard (wall time, or
work-load accomplished).  All of our children cannot be above average.

Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma a...@cybertec.at wrote:
 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 well if you do a non-locking test first you could at least avoid some
 cases (and, if you get the answer wrong, so what?) by jumping to the
 next buffer immediately.  if the non locking test comes good, only
 then do you do a hardware TAS.

 you could in fact go further and dispense with all locking in front of
 usage_count, on the premise that it's only advisory and not a real
 refcount.  so you only then lock if/when it's time to select a
 candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
 logical checks (usage_count = 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds)

yeah. There's something to be said to have an upper bound in the
length of time to get a page out (except in the special case when most
of them are pinned).  Right now, any page contention on a buffer
header for any reason can shut down buffer allocation, and that's just
not good.  It's obviously not very likely to happen but I think it can
and does happen.  The more I think about it the more I think it's a bad
idea to spin during buffer allocation for any reason, ever.

 However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning. There we need an accurate count and there are
 some pages like index roots that get hit very heavily. Things to do
 there would be in my opinion convert to a futex based spinlock so when
 there is contention it doesn't completely kill performance and then
 try to get rid of the contention. Converting to lock-free pinning
 won't help much here as what is killing us here is the cacheline
 bouncing.

Yup -- futexes are another way to go.  They are linux specific though.

 One way to get rid of contention is the buffer nailing idea that
 Robert came up with. If some buffer gets so hot that maintaining
 refcount on the buffer header leads to contention, promote that buffer
 to a nailed status, let everyone keep their pin counts locally and
 sometime later revisit the nailing decision and if necessary convert
 pins back to the buffer header.

Yeah this is a more general (albeit more complicated) solution and
would likely be fantastic.  Is it safe to assume that refcounting is
the only likely cause of contention?

 One other interesting idea I have seen is closeable scalable nonzero
 indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
 use a tree structure to dynamically stripe access to the shared lock
 counter when contention is detected. Downside is that considerable
 amount of shared memory is needed so there needs to be some way to
 limit the resource usage. This is actually somewhat isomorphic to the
 nailing idea.

 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers. I think the
 improvements should concentrate on finding out what is the problem
 there and figuring out how to fix it. A simple idea to test would be
 to just partition shared buffers along with the whole clock sweep
 machinery into smaller ones, like the buffer mapping hash tables
 already are. This should at the very least reduce contention for the
 clock sweep even if it doesn't reduce work done per page miss.

 [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look.  Removing *all spinning* from from page
allocation though feels like it might be worthwhile to test (got to
give some bonus points for being a very local change and simple to
implement).  I wonder if with more shared buffers you tend to sweep
more buffers per allocation.  (IIRC Jeff J was skeptical of that).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 I'm more convinced in the other direction, new pages should enter with 0
 rather than with 1.  I think that the argument that a new buffer needs to
 be given more of an opportunity to get used again is mostly bogus.

IIRC, the argument for starting at 1 not 0 is that otherwise a new page
might have an infinitesmally small lifespan, if the clock sweep should
reach it just after it gets entered into the buffers.  By starting at
1, the uncertainty in a new page's lifespan runs from 1 to 2 sweep times
not 0 to 1 sweep time.

I think though that this argument only holds water if the buffer didn't
get found via the clock sweep to start with --- otherwise, it ought to
have just about one clock sweep of time before the sweep comes back to
it.  It does apply to buffers coming off the freelist, though.

Thus, if we were to get rid of the freelist then maybe we could change
the starting usage_count ... but whether that's a good idea in itself
is pretty uncertain.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jeff Janes
On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma atri.j...@gmail.com wrote:




 Not yet, I figured this might be a problem and am designing test cases
 for the same. I would be glad for some help there please.


Perhaps this isn't the help you were looking for, but I spent a long time
looking into this a few years ago.  Then I stopped and decided to work on
other things.  I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows
that there is a problem, then why should I believe that there actually is a
problem?  If you take a well-known problem (like, say, bad performance at
shared_buffers > 8GB (or even lower, on Windows)) and create an artificial
test case to exercise and investigate that, that is one thing.  But why
invent pathological test cases with no known correspondence to reality?
 There are plenty of real problems to work on, and some of them are just as
intellectually interesting as the artificial problems are.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy
doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I
believe), then what mostly matters is your OS's replacement policy, not
pgsql's.  Not much a pgsql hacker can do about this, other than turn into a
kernel hacker.

3) If little of the highly-used data fits in RAM, then any non-absurd
replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits shared_buffers
and shared_buffers takes most of RAM (or at least, most of RAM not already
needed for other things like work_mem and executables), then the
replacement policy matters a lot.  But different policies suit different
work-loads, and there is little reason to think we can come up with a way
to choose between them.  (Also, in these conditions, performance is very
chaotic.  You can run the same algorithm for a long time, and it can
suddenly switch from good to bad or the other way around, and then stay in
that new mode for a long time).  Also, even if you come up with a good
algorithm, if you make the data set 20% smaller or 20% larger, it is no
longer a good algorithm.

5) Having buffers enter with usage_count=0 rather than 1 would probably be
slightly better most of the time under conditions described in 4, but there
is no way get enough evidence of this over enough conditions to justify
making a change.  And besides, how often do people run with shared_buffers
being most of RAM, and the hot data just barely not fitting in it?


If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very
large number of CPUs and a read-only or read-mostly workload,
then BufFreelistLock can be a major bottleneck.  (But, on an Amazon
high-CPU instance, I did not see this very much.  I suspect the degree of
problem depends a lot on whether you have a lot of sockets with a few CPUs
each, versus one chip with many CPUs).  This is very easy to come up with
model cases for, pgbench -S -c8 -j8, for example, can often show it.

2) A major reason that people run with shared_buffers much lower than RAM
is that performance seems to suffer with shared_buffers > 8GB under
write-heavy workloads, even with spread-out checkpoints.  This is
frequently reported as a real world problem, but as far as I know has never
been reduced to a simple reproducible test case. (Although there was a
recent thread, maybe High CPU usage / load average after upgrading to
Ubuntu 12.04, that I thought might be relevant to this.  I haven't had the
time to seriously study the thread, or the hardware to investigate it
myself)

Cheers,

Jeff


Re: [HACKERS] Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-03-23 Thread Nicholas White
Thanks - I've added it here:
https://commitfest.postgresql.org/action/patch_view?id=1096 .

I've also attached a revised version that makes IGNORE and RESPECT
UNRESERVED keywords (following the pattern of NULLS_FIRST and NULLS_LAST).

Nick


On 23 March 2013 14:34, Tom Lane t...@sss.pgh.pa.us wrote:

 Nicholas White n.j.wh...@gmail.com writes:
  The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead,
  lag, [...]. This is not implemented in PostgreSQL
  (http://www.postgresql.org/docs/devel/static/functions-window.html)
  I've had a go at implementing this, and I've attached the resulting
 patch.
  It's not finished yet, but I was hoping to find out if my solution is
 along
  the right lines.

 Since we're trying to get 9.3 to closure, this patch probably isn't
 going to get much attention until the 9.4 development cycle starts
 (in a couple of months, likely).  In the meantime, please add it to
 the next commitfest list so we remember to come back to it:
 https://commitfest.postgresql.org/action/commitfest_view?id=18

 One comment just from a quick eyeball look is that we really hate
 adding new keywords that aren't UNRESERVED, because that risks
 breaking existing applications.  Please see if you can refactor the
 grammar to make those new entries unreserved.

 regards, tom lane



lead-lag-ignore-nulls.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] avoid buffer underflow in errfinish()

2013-03-23 Thread Xi Wang
CHECK_STACK_DEPTH checks if errordata_stack_depth is negative.
Move the dereference of errordata[errordata_stack_depth] after
the check to avoid out-of-bounds read.
---
 src/backend/utils/error/elog.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3a211bf..47a0a8b 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -393,13 +393,15 @@ void
 errfinish(int dummy,...)
 {
ErrorData  *edata = errordata[errordata_stack_depth];
-   int elevel = edata->elevel;
+   int elevel;
MemoryContext oldcontext;
ErrorContextCallback *econtext;
 
recursion_depth++;
CHECK_STACK_DEPTH();
 
+   elevel = edata->elevel;
+
/*
 * Do processing in ErrorContext, which we hope has enough reserved 
space
 * to report an error.
-- 
1.7.10.4



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] avoid buffer underflow in errfinish()

2013-03-23 Thread Xi Wang
A side question: at src/backend/storage/lmgr/proc.c:1150, is there a
null pointer dereference for `autovac'?

There is a null pointer check `autovac != NULL', but the pointer is
already dereferenced earlier when initializing `autovac_pgxact'.  Is
this null pointer check redundant, or should we move the dereference
`autovac->pgprocno' after the check?  Thanks.

On Sat, Mar 23, 2013 at 6:38 PM, Xi Wang xi.w...@gmail.com wrote:
 CHECK_STACK_DEPTH checks if errordata_stack_depth is negative.
 Move the dereference of errordata[errordata_stack_depth] after
 the check to avoid out-of-bounds read.
 ---
  src/backend/utils/error/elog.c |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

 diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
 index 3a211bf..47a0a8b 100644
 --- a/src/backend/utils/error/elog.c
 +++ b/src/backend/utils/error/elog.c
 @@ -393,13 +393,15 @@ void
  errfinish(int dummy,...)
  {
 ErrorData  *edata = errordata[errordata_stack_depth];
 -   int elevel = edata->elevel;
 +   int elevel;
 MemoryContext oldcontext;
 ErrorContextCallback *econtext;

 recursion_depth++;
 CHECK_STACK_DEPTH();

 +   elevel = edata->elevel;
 +
 /*
  * Do processing in ErrorContext, which we hope has enough reserved 
 space
  * to report an error.
 --
 1.7.10.4



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [GENERAL] timeofday() and clock_timestamp() produce different results when casting to timestamptz

2013-03-23 Thread Adrian Klaver

On 03/23/2013 08:16 AM, Tom Lane wrote:

Adrian Klaver adrian.kla...@gmail.com writes:

Seems the time zone info still thinks Moscow time is being setting
forward an hour for DST when in fact the time remains constant through
the year.


I think the discrepancy is between this in timezone/data/europe:

Zone Europe/Moscow   2:30:20 -  LMT 1880
 2:30   -   MMT 1916 Jul  3 # Moscow Mean Time
 2:30:48 Russia %s  1919 Jul  1 2:00
 3:00   Russia  MSK/MSD 1922 Oct
 2:00   -   EET 1930 Jun 21
 3:00   Russia  MSK/MSD 1991 Mar 31 2:00s
 2:00   Russia  EE%sT   1992 Jan 19 2:00s
 3:00   Russia  MSK/MSD 2011 Mar 27 2:00s
 4:00   -   MSK

and this in timezone/tznames/Default:

MSD 14400 D  # Moscow Daylight Time
  # (Europe/Moscow)
MSK 10800# Moscow Time
  # (Europe/Moscow)

We really need to figure out a way to update the tznames data
automatically, or at least notice when it's become inconsistent with
the underlying Olson database.


So a temporary fix would be to go into /share/timezonesets/Default

and change :

MSK 10800# Moscow Time

to

MSK 14400# Moscow Time

and then you get:

test= set time zone 'Europe/Moscow';
SET
test= select timeofday(), clock_timestamp();
              timeofday              |        clock_timestamp
-------------------------------------+--------------------------------
 Sun Mar 24 03:50:45.066537 2013 MSK | 2013-03-24 03:50:45.066582+04
(1 row)

test= select timeofday()::timestamptz, clock_timestamp()::timestamptz;
           timeofday           |        clock_timestamp
-------------------------------+--------------------------------
 2013-03-24 03:50:52.485092+04 | 2013-03-24 03:50:52.485188+04
(1 row)





regards, tom lane





--
Adrian Klaver
adrian.kla...@gmail.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-23 Thread Michael Paquier
On Sat, Mar 23, 2013 at 10:20 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
  Is someone planning to provide additional feedback about this patch at
 some
  point?

 Yes, now that I have returned from my holidays - or well, am returning
 from them, I do plan to. But it should probably get some implementation
 level review from somebody but Fujii and me...

Yeah, it would be good to have an extra pair of fresh eyes looking at those
patches.
Thanks,
-- 
Michael