Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma


Sent from my iPad

On 22-Mar-2013, at 11:28, Amit Kapila amit.kap...@huawei.com wrote:

 On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
 Hello all,
 
 Sorry if this is a naive question.
 
 I was going through Greg Smith's slides on the buffer cache
 (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
 While going through the page replacement algorithm that we use, i.e. the
 clock-sweep algorithm, I noticed a potential problem in our current
 system.
 
 Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented, and an entry whose USAGE_COUNT becomes
 zero is replaced.
 
 Yes, it is replaced, but in the next clock-sweep pass, not immediately
 after its count reaches 0.
 So, if nobody accesses the buffer before the next pass and all other
 buffers have higher counts, it can be replaced.
 Also, a freshly returned buffer, whose usage count starts at 1, only has
 its count reduced in the next pass.
 So in all, I think it takes 2 passes for a freshly returned buffer to be
 re-used in case no one uses it again.
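
To make the two-pass behaviour concrete, here is a minimal, self-contained
model of a clock sweep over usage counts. This is an illustrative sketch,
not PostgreSQL's actual buffer manager code (the real structures, pinning,
and locking are more involved):

#define NBUFFERS 16

/* Simplified buffer descriptor; field names are illustrative. */
typedef struct
{
    int usage_count;            /* set to 1 when a page is first loaded */
    int pinned;                 /* non-zero while a backend is using it */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int     clock_hand = 0;

/*
 * Sweep until a victim is found.  A freshly loaded buffer starts at
 * usage_count = 1, so the first pass over it only decrements the count
 * to 0; only the next pass can actually evict it -- the two passes
 * described above.  (The real code also copes with every buffer being
 * pinned; this sketch would just keep looping.)
 */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int      idx = clock_hand;
        BufDesc *buf = &buffers[idx];

        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (buf->pinned)
            continue;               /* cannot evict a pinned buffer */
        if (buf->usage_count > 0)
            buf->usage_count--;     /* spared on this pass */
        else
            return idx;             /* count already 0: evict this one */
    }
}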
 
 With Regards,
 Amit Kapila.
 

Hmm, so in the second pass it gets replaced, right?

I think that if the initialization of USAGE_COUNT starts at the maximum allowed 
value instead of one, we can have a better solution to this problem.

Another, more complex solution could be to introduce an ageing factor as well.

Regards,

Atri



Re: [HACKERS] Let's invent a function to report lock-wait-blocking PIDs

2013-03-22 Thread Heikki Linnakangas

On 21.03.2013 05:36, Tom Lane wrote:

Simon Riggs si...@2ndquadrant.com writes:

On 20 March 2013 18:02, Tom Lane t...@sss.pgh.pa.us wrote:

The API that comes to mind is (name subject to
bikeshedding)

pg_blocking_pids(pid int) returns int[]



Useful. Can we also have an SRF rather than an array?


I thought about that, but at least for the isolationtester use-case,
the array result is clearly easier to use.  You can get from one to the
other with unnest() or array_agg(), so I don't really feel a need to
provide both.  Can you generate use-cases where the set-result approach
is superior?


How about inverting the function into:

pg_pid_blocked_by(pid int) returns int

It would take as argument a pid, and return the pid of the process that 
is blocking the given process. That would feel more natural to me.


- Heikki




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 12:00 PM Atri Sharma wrote:
 
 
 Sent from my iPad
 
 On 22-Mar-2013, at 11:28, Amit Kapila amit.kap...@huawei.com wrote:
 
  On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
  Hello all,
 
  Sorry if this is a naive question.
 
  I was going through Greg Smith's slides on the buffer cache
  (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
  While going through the page replacement algorithm that we use, i.e. the
  clock-sweep algorithm, I noticed a potential problem in our current
  system.
 
  Specifically, when a new entry is allocated in the buffer, its
  USAGE_COUNT is set to 1. On each sweep of the algorithm, the
  USAGE_COUNT is decremented, and an entry whose USAGE_COUNT becomes
  zero is replaced.
 
  Yes, it is replaced, but in the next clock-sweep pass, not immediately
  after its count reaches 0.
  So, if nobody accesses the buffer before the next pass and all other
  buffers have higher counts, it can be replaced.
  Also, a freshly returned buffer, whose usage count starts at 1, only has
  its count reduced in the next pass.
  So in all, I think it takes 2 passes for a freshly returned buffer to be
  re-used in case no one uses it again.
 
  With Regards,
  Amit Kapila.
 
 
 Hmm, so in the second pass it gets replaced, right?
  Yes.


 I think that if the initialization of USAGE_COUNT starts at the maximum
 allowed value instead of one, we can have a better solution to this
 problem.

So what is your idea? If you start at the maximum, what will we do for
further accesses to it?
Why do you want to give more priority to a just-loaded page?


 Another, more complex solution could be to introduce an ageing factor as
 well.

With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 I think that if the initialization of USAGE_COUNT starts at the maximum
 allowed value instead of one, we can have a better solution to this
 problem.

 So what is your idea? If you start at the maximum, what will we do for
 further accesses to it?

I haven't chalked out a detailed plan yet, but I think the idea of
initializing USAGE_COUNT to the maximum value is not at all good. I was
just thinking off the top of my head.

 Why do you want to give more priority to a just-loaded page?

I just want it to have more chances to stay, rather than being replaced
pretty early. This is because, as I said earlier, a new page could be in
high demand in the near future; evicting it early would lead to repeated
replacement and re-reading of the page, and hence cause overhead.




 Another, more complex solution could be to introduce an ageing factor

This is the one I think would work out best: add an age factor for the
time an entry has spent in the cache, along with its usage count.

So what I am proposing here is to add another factor to the clock-sweep
algorithm when it selects victim pages for replacement. Specifically,
the selection of victim pages should be done with the usage_count AND
the time spent by the entry in the cache. This would give priority to
pages with many accesses while not ignoring relatively young pages
either. If a page is not accessed for a long time after it was
allocated, it would be the ideal victim for replacement, both in terms
of USAGE_COUNT and age.
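
As a strawman of what such a combined victim score could look like -- the
field names and the weighting below are assumptions for illustration, not
a worked-out design:

#include <time.h>

/* Hypothetical buffer descriptor extended with a load timestamp. */
typedef struct
{
    int    usage_count;
    time_t loaded_at;           /* when the page entered the cache */
} AgedBufDesc;

/*
 * Lower score == better eviction victim: rarely used AND resident for a
 * long time without accumulating accesses.  A young page (small age)
 * scores higher and is therefore protected for a while -- the grace
 * period argued for above.
 */
static double
victim_score(const AgedBufDesc *buf, time_t now)
{
    double age_seconds = difftime(now, buf->loaded_at);

    return (double) buf->usage_count - age_seconds / 60.0;
}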

Regards,

Atri




--
Regards,

Atri
l'apprenant




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 4:16 PM Atri Sharma wrote:
 
  I think that if the initialization of USAGE_COUNT starts at the
  maximum allowed value instead of one, we can have a better solution to
  this problem.

  So what is your idea? If you start at the maximum, what will we do for
  further accesses to it?

 I haven't chalked out a detailed plan yet, but I think the idea of
 initializing USAGE_COUNT to the maximum value is not at all good. I was
 just thinking off the top of my head.
 
  Why do you want to give more priority to a just-loaded page?

 I just want it to have more chances to stay, rather than being replaced
 pretty early. This is because, as I said earlier, a new page could be in
 high demand in the near future; evicting it early would lead to repeated
 replacement and re-reading of the page, and hence cause overhead.
 
 
 
 
  Another, more complex solution could be to introduce an ageing factor

 This is the one I think would work out best: add an age factor for the
 time an entry has spent in the cache, along with its usage count.

 So what I am proposing here is to add another factor to the clock-sweep
 algorithm when it selects victim pages for replacement. Specifically,
 the selection of victim pages should be done with the usage_count AND
 the time spent by the entry in the cache. This would give priority to
 pages with many accesses while not ignoring relatively young pages
 either. If a page is not accessed for a long time after it was
 allocated, it would be the ideal victim for replacement, both in terms
 of USAGE_COUNT and age.

What would you do if the only young page has usage count zero during the
second sweep?
I don't think introducing another factor along with the usage count would
help much.
Have you encountered any workload where relatively young pages are getting
victimized and causing performance issues?

With Regards,
Amit Kapila.






Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 What would you do if the only young page has usage count zero during the
 second sweep?

Umm... the same approach we take when there is no page with usage count
zero in a sweep in the current algorithm?


 I don't think introducing another factor along with the usage count
 would help much.

What other approach could we take here?


 Have you encountered any workload where relatively young pages are
 getting victimized and causing performance issues?

Not yet. I figured this might be a problem and am designing test cases
for it. I would be glad for some help there, please.

Regards,

Atri




--
Regards,

Atri
l'apprenant




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
 
  What would you do if the only young page has usage count zero during
  the second sweep?

 Umm... the same approach we take when there is no page with usage count
 zero in a sweep in the current algorithm?

It would give more priority to a young page as compared to a more
heavily used page. I don't know if that would be the correct thing to do.

With Regards,
Amit Kapila.
 





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma
On Fri, Mar 22, 2013 at 4:53 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
 
   What would you do if the only young page has usage count zero during
   the second sweep?

  Umm... the same approach we take when there is no page with usage
  count zero in a sweep in the current algorithm?

 It would give more priority to a young page as compared to a more
 heavily used page. I don't know if that would be the correct thing to do.

This is my idea: give equal priority to new pages when they enter the
cache, so that they all have an equal chance initially of being replaced.
With time, usage_count will become the deciding factor in victim
selection.

Regards,

Atri



--
Regards,

Atri
l'apprenant




Re: [HACKERS] Should commit_delay be PGC_SIGHUP?

2013-03-22 Thread Simon Riggs
On 22 March 2013 02:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Only one setting will be best for the whole cluster, so neither the
 user nor the DBA gains if a user sets this to a different value than
 the one that has been determined to be optimal.

 Since we wait while holding the lock it is actually harmful to
 everyone if anybody sets a stupid value and might even be considered a
 denial of service attack.

 So there is a very good reason to make this SIGHUP, not just a whim.

 Hmm.  If a malicious user could hurt performance for other sessions with
 a bad setting of commit_delay, then USERSET is clearly a bad idea.
 But it still seems like it could be SUSET rather than SIGHUP.

Agreed; everybody gets what they want. Committed.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Should commit_delay be PGC_SIGHUP?

2013-03-22 Thread Robert Haas
On Fri, Mar 22, 2013 at 8:06 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Hmm.  If a malicious user could hurt performance for other sessions with
 a bad setting of commit_delay, then USERSET is clearly a bad idea.
 But it still seems like it could be SUSET rather than SIGHUP.

 Agreed; everybody gets what they want. Committed.

This is fine with me, too, and I agree that it's warranted... but your
commit message supposes that this behavior is new in 9.3, and I think
it dates to 9.2.  I'm not inclined to think the issue is serious
enough to back-patch (and risk breaking current installations), but I
thought it was worth mentioning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Ants Aasma
On Mar 22, 2013 12:46 PM, Atri Sharma atri.j...@gmail.com wrote:

 This is the one I think would work out best, add an age factor as to
 the time duration which an entry has spent in the cache along with its
 usage count.

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate least frequently used instead of
least recently used. MySQL uses it for its buffer replacement
policy. There is also a clock-sweep-based approximation called
CLOCK-Pro. Papers describing and evaluating both are available on the
net. The evaluations in the papers showed significantly better
performance for both of those compared to the regular clock sweep or
even ARC.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create
a patch to capture page pinning traffic from PostgreSQL (maybe stream
out into a per backend file), run it with some production workloads
and use that to generate testing workloads for the cache replacement
policies. Haven't gotten round to actually doing that though.

[1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm
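
For what it's worth, the capture side of that idea could be as simple as
appending one fixed-size record per buffer pin to a per-backend file.
Everything below (names, fields, choice of identifiers) is an assumption
sketched for illustration, not an actual patch:

#include <stdint.h>
#include <stdio.h>

/* One trace record per buffer pin; fields are illustrative. */
typedef struct
{
    uint32_t db_node;           /* database the page belongs to */
    uint32_t rel_node;          /* relation the page belongs to */
    uint32_t block_num;         /* which block was pinned */
    uint32_t pad;               /* keep the record layout fixed */
    uint64_t timestamp_us;      /* when it was pinned */
} PinTraceRecord;

/* Append a record to this backend's trace file. */
static void
trace_pin(FILE *trace_file, const PinTraceRecord *rec)
{
    fwrite(rec, sizeof(*rec), 1, trace_file);
}

A replay tool could then feed the recorded reference stream to candidate
policies (clock sweep, LIRS, CLOCK-Pro) and compare hit rates offline.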

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Should commit_delay be PGC_SIGHUP?

2013-03-22 Thread Peter Geoghegan
On Fri, Mar 22, 2013 at 12:42 PM, Robert Haas robertmh...@gmail.com wrote:
 This is fine with me, too, and I agree that it's warranted... but your
 commit message supposes that this behavior is new in 9.3, and I think
 it dates to 9.2.

No, it doesn't. It just missed the deadline for 9.2.

I'm happy enough to have the setting be PGC_SUSET, since that more or
less conveys that commit_delay isn't something that is sensible to set
dynamically.


-- 
Peter Geoghegan




Re: [HACKERS] JSON Function Bike Shedding

2013-03-22 Thread Merlin Moncure
On Mon, Mar 18, 2013 at 3:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andrew Dunstan and...@dunslane.net writes:
 I've been sitting here for a while mulling none too happily over the
 debate on the names for the proposed JSON extraction functions. I
 haven't really been happy with any of the suggestions, much, not least
 my own original function names which were really only intended as
 placeholders. Last night in the still watches I decided I just couldn't
 go with a function name as almost totally content-free as get(), or even
get_text(). And I don't think prepending 'json_' to the name helps much
 either.

 Agreed.

 Just concentrating to start with on those get() functions, in the simple
 case we really don't need them at all. hstore has the -> operator
 without documenting the underlying function (fetchval). So maybe we
 should just do that.

 Well, not documenting the underlying function does not relieve you from
 having to name it in a reasonably sane fashion.  It still wouldn't do
 to call it get().

How about 'fetch'?  Or is that just skirting the content-free aspect?
Agreed that 'path' is out (as unnest is out due to unfortunate semantic
confusion).  At the end of the day, get() is simply referencing an
array (either associative or not).  Most languages do this with an
operator, but I think 'fetch' is a pretty solid term.


   * I'd be inclined to stick with json_array_length() and
     json_object_keys() - I think they describe pretty well what they do.
     hstore's skeys() does more or less the same as json_object_keys(),
     so we could use that if we want to be consistent. I don't think it's
     a terribly good name though.
   * json_unnest() should certainly be renamed. Alternatives that come to
     mind are json_unfold() or json_elements() or json_array_elements().
   * json_each(), json_each_as_text(), json_populate_record() and
     json_populate_recordset() - to be consistent with hstore we could
     remove the json_. We probably should remove the _as_ from
     json_each_as_text().

 I don't particularly have a dog in this fight, but do we really want
 some of these to have a json_ prefix and others not?

That's already baked in, because the 9.2 json functions have the prefix.
I'm still partial to json_unwrap for unnest, but out of Andrew's
suggestions I like json_elements best.  I also like removing _as_.


merlin




Re: [HACKERS] JSON Function Bike Shedding

2013-03-22 Thread Andrew Dunstan


On 03/22/2013 09:29 AM, Merlin Moncure wrote:

On Mon, Mar 18, 2013 at 3:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Andrew Dunstan and...@dunslane.net writes:

I've been sitting here for a while mulling none too happily over the
debate on the names for the proposed JSON extraction functions. I
haven't really been happy with any of the suggestions, much, not least
my own original function names which were really only intended as
placeholders. Last night in the still watches I decided I just couldn't
go with a function name as almost totally content-free as get(), or even
get_text(). And I don't think prepending 'json_' to the name helps much
either.

Agreed.


Just concentrating to start with on those get() functions, in the simple
case we really don't need them at all. hstore has the -> operator
without documenting the underlying function (fetchval). So maybe we
should just do that.

Well, not documenting the underlying function does not relieve you from
having to name it in a reasonably sane fashion.  It still wouldn't do
to call it get().

How about 'fetch'?  Or is that just skirting the content-free aspect?
Agreed that 'path' is out (as unnest is out due to unfortunate semantic
confusion).  At the end of the day, get() is simply referencing an
array (either associative or not).  Most languages do this with an
operator, but I think 'fetch' is a pretty solid term.



   * I'd be inclined to stick with json_array_length() and
     json_object_keys() - I think they describe pretty well what they do.
     hstore's skeys() does more or less the same as json_object_keys(),
     so we could use that if we want to be consistent. I don't think it's
     a terribly good name though.
   * json_unnest() should certainly be renamed. Alternatives that come to
     mind are json_unfold() or json_elements() or json_array_elements().
   * json_each(), json_each_as_text(), json_populate_record() and
     json_populate_recordset() - to be consistent with hstore we could
     remove the json_. We probably should remove the _as_ from
     json_each_as_text().

I don't particularly have a dog in this fight, but do we really want
some of these to have a json_ prefix and others not?

That's already baked in, because the 9.2 json functions have the prefix.


I have finally decided my position on this. I think we have lots of good 
precedents for using type names in function names: array functions, xml 
functions and enum functions, for example.  I think these are the 
precedents to follow, rather than hstore. Some people will be unhappy 
that this means more typing, but SQL is somewhat verbose anyway, and 
whatever we do will make someone unhappy :-)



  I'm still partial to json_unwrap for unnest, but out of Andrew's
suggestions I like json_elements best.  I also like removing _as_.



OK. I can live with that.

New version forthcoming soon.

cheers

andrew




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general. My idea was to create
 a patch to capture page pinning traffic from PostgreSQL (maybe stream
 out into a per backend file), run it with some production workloads
 and use that to generate testing workloads for the cache replacement
 policies. Haven't gotten round to actually doing that though.

 [1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm


Thanks for the link. I think LIRS can indeed be helpful in our case.

We should indeed build some test cases for testing this theory. I am
all for capturing page replacement and usage data and analyzing it.

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 You might want to check out the LIRS cache replacement algorithm [1].
 That algorithm tries to estimate least frequently used instead of
 least recently used. MySQL uses it for its buffer replacement
 policy. There is also a clock-sweep-based approximation called
 CLOCK-Pro. Papers describing and evaluating both are available on the
 net. The evaluations in the papers showed significantly better
 performance for both of those compared to the regular clock sweep or
 even ARC.

I seem to recall that CLOCK-Pro, or something named similarly to that,
was one of the alternatives discussed when we went over to the current
clock-sweep approach.  And we definitely looked at ARC.  It might be
worth checking the archives from back then to see what's already been
considered.

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general.

Yeah. You can prove almost anything with the right set of test cases :-(

regards, tom lane




Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-03-22 Thread Alvaro Herrera
Amit Kapila wrote:
 On Friday, March 22, 2013 8:57 AM Alvaro Herrera wrote:
  Amit Kapila wrote:

   I think adding a new syntax change is a little scary for me, not
   because of the implementation but because of building consensus on
   the syntax.
  
  I cannot but agree on that point.
 
 Sorry, I don't get your point.
 Do you mean to say that you don't agree with me and want the new syntax
 you proposed to be implemented?

On the contrary, I was saying I agree with you on the difficulty of
getting consensus on this.

I don't know where I learned this phrase or even if it's in common
usage.  After a not-so-quick search I see it explained here:
http://www.perfectyourenglish.com/usage/but.htm

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Let's invent a function to report lock-wait-blocking PIDs

2013-03-22 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 21.03.2013 05:36, Tom Lane wrote:
 The API that comes to mind is (name subject to bikeshedding)
 pg_blocking_pids(pid int) returns int[]

 How about inverting the function into:
 pg_pid_blocked_by(pid int) returns int
 It would take as argument a pid, and return the pid of the process that 
 is blocking the given process. That would feel more natural to me.

Hm, I'm not sure that's uniquely defined.  In the case I mentioned
before (A has AccessShare, B is blocked waiting for AccessExclusive,
C wants AccessShare and is queued behind B), which of A and B do
you think is blocking C?

Whichever answer you choose could be the wrong one for isolationtester:
I think it needs to consider that C is blocked if *either* A or B is
part of its set of test processes.  So that's why I thought an array
(or set) result including both A and B would be appropriate.  AFAICT,
what you're proposing isn't the inverse of what I said, it's the
same direction but you're assuming there's only one blocking process.

regards, tom lane




Re: [HACKERS] Strange Windows problem, lock_timeout test request

2013-03-22 Thread Robert Haas
On Thu, Mar 21, 2013 at 8:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Mar 18, 2013 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Because it's wrong.  Removing volatile means that the compiler is
 permitted to optimize away stores (and fetches!) on the basis of their
 being unnecessary according to straight-line analysis of the code.
 Write barriers don't fix that, they only say that stores that the
 compiler chooses to issue at all have to be ordered a certain way.

 I don't think this is correct.  The read and write barriers as
 implemented are designed to function as compiler barriers also, just
 as they do in the Linux kernel and every other piece of software I've
 found that implements anything remotely like this, with the lone
 exception of PostgreSQL.  In PostgreSQL, spinlock acquisition and
 release are defined as CPU barriers but not a compiler barrier, and
 this necessitates extensive use of volatile all over the code base
 which would be unnecessary if we did this the way it's done in Linux
 and elsewhere.

 I think you're just as mistaken as Zoltan.   Barriers enforce ordering
 of operations, not whether an operation occurs at all.

Surely not.  Suppose the user does this:

some_global_var = 1;
some_function();
some_global_var = 2;

I hope we can both agree that any compiler which thinks it doesn't
need to store 1 in some_global_var is completely nuts, because
some_function() might perform any arbitrary computation, including one
that depends on some_global_var being 1 rather than whatever value it
had before that.  Of course, if a global optimizer can prove that
some_function() can't do anything that can possibly care about
some_global_var, then the store could be omitted, but not otherwise.
Should the compiler omit the store and should there then be a bug in
the program, we wouldn't say "oh, some_global_var ought to be declared
volatile".  We would say that the compiler is buggy.

Now, conversely, in this situation, it seems to me (and I think you'll
agree) that the compiler could be forgiven for omitting one of the
stores:

some_global_var = 1;
local_var = 42;
some_global_var = 2;

If we declare some_global_var as volatile, it will force the compiler
to perform both stores regardless of what the optimizer thinks, and
moreover, the second store is required to happen after the first one,
because the definition of volatile is that volatile references are
globally sequenced with respect TO EACH OTHER.  However, the store to
local_var could be moved around with respect to the other two or
omitted altogether unless local_var is also declared volatile.

Now, consider this:

some_global_var = 1;
pg_compiler_barrier(); /* or in Linux, barrier() */
some_global_var = 2;

The compiler barrier is exactly equivalent to the unknown function in
the first example, except that no function is actually called.  The
semantics are precisely that the compiler must assume that, at the
point the barrier intervenes, an unknown operation will occur which
may depend on the contents of any word in memory and which may modify
any word in memory.  Thus, the compiler may not postpone stores
requested before the barrier until after the barrier on the grounds
that the values will be overwritten after the barrier; and if any
global variables have been loaded into registers before the barrier it
must be assumed that, after the barrier, the registers may no longer
match those global variables.
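
For reference, on GCC-style compilers a pure compiler barrier is
conventionally written as an empty asm statement with a memory clobber --
this is how Linux's barrier() is defined, and a pg_compiler_barrier()
built this way (a sketch, not necessarily PostgreSQL's exact definition)
behaves exactly as described above:

#define pg_compiler_barrier() __asm__ __volatile__("" ::: "memory")

int some_global_var;

void
example(void)
{
    some_global_var = 1;
    /* The empty asm may read and clobber any memory, so the store of 1
     * above cannot be discarded or merged with the store of 2 below. */
    pg_compiler_barrier();
    some_global_var = 2;
}

No machine instructions are emitted for the barrier itself; it only
constrains the optimizer.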

It is true that any barrier, including a compiler barrier, serves only
to separate instructions.  But I don't believe it follows in any way
that the barrier therefore prohibits reordering operations but not
omitting them altogether.  Such a definition would have no practical
utility.  The compiler is not free to willy-nilly leave out things
that the user asks it to do any more than it is free to willy-nilly
reorder them.  If it were, programming would be chaos, and everything
would have to be volatile.  What the compiler is free to do is to
reorder and omit instructions where the programmer won't, in a
single-thread execution context, be able to notice the difference.
And a compiler barrier is an explicit notification that, in essence,
the programmer will notice if things are not just the way he wrote
them when that point in the code is reached.

To see the difference between this and volatile, consider the following:

a = 1;
b = 1;
c = 1;
a = 2;
b = 2;
c = 2;

Absent any special precautions, the compiler may well optimize the
first set of stores away completely and perform the second set in any
order it likes.  If we make all the variables volatile, it will do all
6 stores in precisely the order specified, omitting nothing and
reordering nothing.  If we instead stick a compiler barrier just after
the assignment c = 1, then the compiler must store 1 in all three
variables (in any order that it likes), and then store 2 in all three
variables (in any order that it likes).  The barrier 

Re: [HACKERS] Enabling Checksums

2013-03-22 Thread Ants Aasma
On Fri, Mar 22, 2013 at 3:04 AM, Jeff Davis pg...@j-davis.com wrote:
 I've been following your analysis and testing, and it looks like there
 are still at least three viable approaches:

 1. Some variant of Fletcher
 2. Some variant of CRC32
 3. Some SIMD-based checksum

 Each of those has some open implementation questions, as well. If we
 settle on one of those approaches, we don't necessarily need the fastest
 implementation right away. I might even argue that the first patch to be
 committed should be a simple implementation of whatever algorithm we
 choose, and then optimization should be done in a separate patch (if it
 is tricky to get right).

+1 on correct first, fast second.

 Of course, it's hard to settle on the general algorithm to use without
 knowing the final performance numbers. So right now I'm in somewhat of a
 holding pattern until we settle on something.

For performance, the K8 results gave me confidence that we have a
reasonably good overview of what the performance is like for the class
of CPUs that PostgreSQL is likely to run on. I don't think there is
anything left to optimize there; all algorithms are pretty close to
maximum theoretical performance. Still, benchmarks on AMD's Bulldozer
arch and maybe on some non-x86 machines (Power, Itanium, Sparc) would
be very welcome to ensure that I haven't missed anything.

To see real-world performance numbers I dumped the algorithms on top
of the checksums patch. I set up postgres with 32MB shared buffers,
and ran a concurrency-4 select-only pgbench and a worst-case workload;
results are the median of 5 one-minute runs. I used Fletcher as it
was in the checksums patch, without unrolling. Unrolling would cut the
performance hit by a third or so.

The worst case workload is set up using
CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,100000);
VACUUM ANALYZE sparse;

The test query itself is a simple SELECT count(v) FROM sparse;

Results for the worst case workload:
No checksums:   tps = 14.710519
Fletcher checksums: tps = 10.825564 (1.359x slowdown)
CRC checksums:  tps =  5.844995 (2.517x slowdown)
SIMD checksums: tps = 14.062388 (1.046x slowdown)

Results for pgbench scale 100:
No checksums:   tps = 56623.819783
Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
CRC Checksums:  tps = 50571.324795 (1.120x slowdown)
SIMD Checksums: tps = 56608.888985 (1.000x slowdown)

So to conclude, the 3 approaches:

CRC:
Time to checksum 8192 bytes:
12'000 - 16'000 cycles best case without special hardware
 1'200 cycles with hardware (new Intel only)
Code size: 131 bytes
* Can calculate arbitrary number of bytes per invocation, state is 4
bytes. Implementation can be shared with WAL.
* Quite slow without hardware acceleration.
* A software implementation requires an 8kB table for the calculation,
or it will be even slower. Quite likely to fall out of cache.
* If we wish to use hardware acceleration then the polynomial should
be switched to Castagnoli. I think the old polynomial needs to stay as
the values seem to be stored in indexes by tsvector compression and
multibyte trigrams. (not 100% sure, just skimmed the code)
* Error detection of 32bit Castagnoli CRC is known to be good, the
effect of truncating to 16 bits is not analyzed yet.

Fletcher:
Time to checksum 8192 bytes:
 2'600 cycles +- 100
Code size: 170 bytes unrolled
* Very simple implementation for optimal speed.
* Needs to calculate 4 bytes at a time, requires 8 bytes of state. An
implementation that can work for WAL would be tricky but not
impossible, and probably wouldn't share code.
* Should give good enough error detection with suitable choice for
final recombination.

SIMD Checksums:
Time to checksum 8192 bytes:
   730 cycles for processors with 128bit SIMD units
  1830 cycles for processors with 64bit SIMD units
Code size: 436 bytes
* Requires vectorization, intrinsics or ASM for decent performance.
* Needs to calculate 128bytes at a time, requires 128 bytes of state.
Using for anything other than summing fixed size blocks looks tricky.
* Loosely based on Fowler-Noll-Vo and should have reasonably good
error detection capabilities.
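
To make the Fletcher option concrete, here is a minimal sketch of a
Fletcher-style block checksum folded down to 16 bits. It illustrates the
general technique only; the patch's actual word size, modulus, and final
recombination may differ:

#include <stdint.h>
#include <stddef.h>

/* Fletcher-style checksum: two running sums over 4-byte words, folded
 * to 16 bits at the end.  Assumes the page is 4-byte aligned and len
 * is a multiple of 4, as an 8kB page is. */
static uint16_t
fletcher_page_checksum(const void *page, size_t len)
{
    const uint32_t *words = (const uint32_t *) page;
    uint64_t sum1 = 0;
    uint64_t sum2 = 0;
    size_t   nwords = len / sizeof(uint32_t);
    size_t   i;

    for (i = 0; i < nwords; i++)
    {
        sum1 = (sum1 + words[i]) % 0xFFFFFFFF;  /* first running sum */
        sum2 = (sum2 + sum1) % 0xFFFFFFFF;      /* sum of sums */
    }
    /* Fold both 32-bit sums into a single 16-bit value. */
    return (uint16_t) ((sum1 ^ (sum1 >> 16) ^ sum2 ^ (sum2 >> 16)) & 0xFFFF);
}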

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] JSON Function Bike Shedding

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 8:58 AM, Andrew Dunstan and...@dunslane.net wrote:

 On 03/22/2013 09:29 AM, Merlin Moncure wrote:

 On Mon, Mar 18, 2013 at 3:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Andrew Dunstan and...@dunslane.net writes:

 I've been sitting here for a while mulling none too happily over the
 debate on the names for the proposed JSON extraction functions. I
 haven't really been happy with any of the suggestions, much, not least
 my own original function names which were really only intended as
 placeholders. Last night in the still watches I decided I just couldn't
 go with a function name as almost totally content-free as get(), or even
 get_text(). And I don't think prepending 'json_' to the name helps much
 either.

 Agreed.

 Just concentrating to start with on those get() functions, in the simple
 case we really don't need them at all. hstore has the -> operator
 without documenting the underlying function (fetchval). So maybe we
 should just do that.

 Well, not documenting the underlying function does not relieve you from
 having to name it in a reasonably sane fashion.  It still wouldn't do
 to call it get().

 How about 'fetch'?  Or is that just skirting the content-free aspect?
 Agreed that 'path' is out (as unnest is out due to unfortunate semantic
 confusion).  At the end of the day, get() is simply referencing an
 array (either associative or not).  Most languages do this with an
 operator, but I think 'fetch' is a pretty solid term.


* I'd be inclined to stick with json_array_length() and
  json_object_keys() - I think they describe pretty well what they do.
  hstore's skeys() does more or less the same as json_object_keys(),
  so we could use that if we want to be consistent. I don't think it's
  a terribly good name though.
* json_unnest() should certainly be renamed. Alternatives that come to
  mind are json_unfold() or json_elements() or json_array_elements().
* json_each(), json_each_as_text(), json_populate_record() and
  json_populate_recordset() - to be consistent with hstore we could
  remove the json_. We probably should remove the _as_ from
  json_each_as_text().

 I don't particularly have a dog in this fight, but do we really want
 some of these to have a json_ prefix and others not?

  That's already baked in, because the 9.2 json functions have the prefix.


 I have finally decided my position on this. I think we have lots of good
 precedents for using type names in function names: array functions, xml
 functions and enum functions, for example.  I think these are the precedents
 to follow, rather than hstore. Some people will be unhappy that this means
 more typing, but SQL is somewhat verbose anyway, and whatever we do will
 make someone unhappy :-)


   I'm still partial to json_unwrap for unnest, but out of Andrew's
 suggestions I like json_elements best.  I also like removing _as_.


 OK. I can live with that.

 New version forthcoming soon.

Thanks for that!  I'm super duper busy lately, but I'd still like to
bone up the docs a little bit, so if I can find the time I'd like to
squeeze some in before we lock in the beta, if that's all right.

merlin




[HACKERS] Default connection parameters for postgres_fdw and dblink

2013-03-22 Thread Tom Lane
It struck me while looking at the regression test arrangements for
postgres_fdw that as things are set up, the default username for
outgoing connections is going to be that of the operating system
user running the postmaster.  dblink is the same way.

Now, this might not be the world's worst default, but it's got to be a
pretty bad one.  The OS username of the server isn't really supposed to
be exposed at all on the SQL level.  And to the extent that it matches
the bootstrap superuser's SQL name, it's still a non-POLA-satisfying
default for any user other than the bootstrap superuser.

IMO it would make a lot more sense for the default to be the name of the
current database user.  Either that, or insist that the outgoing
username not be defaultable at all; though I'm not sure we can do the
latter without breaking the regression tests, since those are supposed
to be agnostic as to the name of the superuser running them.

A related issue is that libpq will happily acquire defaults from the
server's environment, such as PGPORT.  This seems like it's also
exposing things that shouldn't be exposed.  Unfortunately, I think we're
depending on that for the dblink and postgres_fdw regression tests to
work at all when the postmaster is listening to a nondefault port (ie,
make check).

Is there a better way to handle all this?  It may be too late to rethink
dblink's behavior anyhow, but perhaps it's not too late to change
postgres_fdw.  I think though that once we let 9.3 out the door, it
*will* be too late to make any major changes, because postgres_fdw's
usage is going to go through the roof now that it can do remote updates.

regards, tom lane




Re: [HACKERS] Strange Windows problem, lock_timeout test request

2013-03-22 Thread Greg Stark
On Fri, Mar 22, 2013 at 2:29 PM, Robert Haas robertmh...@gmail.com wrote:
 The barrier essentially
 divides up the code into chunks and requires that those chunks be
 optimized independently by the compiler without knowledge of what
 earlier or later chunks are doing

While all this sounds sensible, I would love to see a gcc or llvm
programmer actually comment on what they think volatile does and what
they want to implement in the compiler.

I'm a bit worried that we're making assumptions like things happen in
a specific order that aren't really justified. In these days of
superscalar execution and multi-level caches things may be weirder
than we're imagining.


-- 
greg




Re: [HACKERS] Enabling Checksums

2013-03-22 Thread Jeff Davis
On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
 For performance, the K8 results gave me confidence that we have a
 reasonably good overview of what the performance is like for the class
 of CPUs that PostgreSQL is likely to run on. I don't think there is
 anything left to optimize there; all algorithms are pretty close to
 maximum theoretical performance.

Great work!

 The worst case workload is set up using
 CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
 INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM 
 generate_series(1,100000);
 VACUUM ANALYZE sparse;
 
 The test query itself is a simple SELECT count(v) FROM sparse;
 
 Results for the worst case workload:
 No checksums:   tps = 14.710519
 Fletcher checksums: tps = 10.825564 (1.359x slowdown)
 CRC checksums:  tps =  5.844995 (2.517x slowdown)
 SIMD checksums: tps = 14.062388 (1.046x slowdown)

I assume this is in the bad region identified by Greg, where there is
no disk activity, but shared_buffers is small, leading to a lot of
movement between the OS cache and shared buffers?

What do you mean by TPS exactly? If the select query is writing hint
bits, then you wouldn't be able to repeat it because they are already
set. So are you repeating the creation/loading of the table, as well?

 Results for pgbench scale 100:
 No checksums:   tps = 56623.819783
 Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
 CRC Checksums:  tps = 50571.324795 (1.120x slowdown)
 SIMD Checksums: tps = 56608.888985 (1.000x slowdown)
 
 So to conclude, the 3 approaches:

Great analysis. Still a tough choice.

One thing that might be interesting is to look at doing SIMD for both
data and WAL. I wonder if that would be a noticeable speedup for WAL
full-page writes? That would give greater justification for the extra
work it will take (intrinsics/ASM), and it would be a nice win for
non-checksum users.

I also notice that
http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
explicitly mentions adapting FNV to a smaller
size. That gives me a little more confidence. Do you have other links we
should read about this approach, or possible weaknesses?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-22 Thread Jeff Davis
On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
 So to conclude, the 3 approaches:

One other question: assuming that the algorithms use the full 16-bit
space, is there a good way to avoid zero without skewing the result? Can
we do something like un-finalize (after we figure out that it's zero),
compute in an extra salt value, and then re-finalize? That might work
for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo
because the final value is the same as the state.

I'm still slightly concerned about differentiating checksummed pages in
the future if we want to offer a transition path, since we no longer use
header bits. Avoiding zero might help us there. Hopefully not necessary,
but something we might find useful. Also, it would help us identify
situations where the checksum is never set.

Regards,
Jeff Davis





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Greg Stark
On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).


-- 
greg




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma
On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).


 --
 greg

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

Regards,

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] Enabling Checksums

2013-03-22 Thread Ants Aasma
On Fri, Mar 22, 2013 at 7:35 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
  For performance, the K8 results gave me confidence that we have a
  reasonably good overview of what the performance is like for the class
  of CPUs that PostgreSQL is likely to run on. I don't think there is
  anything left to optimize there; all algorithms are pretty close to
  maximum theoretical performance.

 Great work!

Thanks.

 The worst case workload is set up using
 CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
 INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM 
  generate_series(1,100000);
 VACUUM ANALYZE sparse;

 The test query itself is a simple SELECT count(v) FROM sparse;

 Results for the worst case workload:
 No checksums:   tps = 14.710519
 Fletcher checksums: tps = 10.825564 (1.359x slowdown)
 CRC checksums:  tps =  5.844995 (2.517x slowdown)
 SIMD checksums: tps = 14.062388 (1.046x slowdown)

 I assume this is in the bad region identified by Greg, where there is
 no disk activity, but shared_buffers is small, leading to a lot of
 movement between the OS cache and shared buffers?

 What do you mean by TPS exactly? If the select query is writing hint
 bits, then you wouldn't be able to repeat it because they are already
 set. So are you repeating the creation/loading of the table, as well?

The table is created once; its size is 800MB with one hinted tuple per
page. Shared buffers is set to 32MB; the machine is an Intel Core
i5-2500K with 16GB of memory (2 memory channels, 1333MHz; overheads are
likely to be larger with faster memory). This is the worst case for an
in-memory workload that doesn't fit into shared_buffers, as almost no
work is done other than swapping buffer pages in. I think things like
bitmap heap scans might show similar characteristics.

 Results for pgbench scale 100:
 No checksums:   tps = 56623.819783
 Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
 CRC Checksums:  tps = 50571.324795 (1.120x slowdown)
 SIMD Checksums: tps = 56608.888985 (1.000x slowdown)

 So to conclude, the 3 approaches:

 Great analysis. Still a tough choice.

 One thing that might be interesting is to look at doing SIMD for both
 data and WAL. I wonder if that would be a noticeable speedup for WAL
 full-page writes? That would give greater justification for the extra
 work it will take (intrinsics/ASM), and it would be a nice win for
 non-checksum users.

Andres showed that switching out the existing CRC for zlib's would
result in an 8-30% increase in INSERT-SELECT speed
(http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de)
with the sped-up CRC still showing up as 10% of the profile. So I'd
guess another 5% speedup from doing the CRC 8 bytes at a time instead
of the current 4, and another couple of percent from using Fletcher or
SIMD.

 I also notice that http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%
 80%93Vo_hash_function explicitly mentions adapting FNV to a smaller
 size. That gives me a little more confidence. Do you have other links we
 should read about this approach, or possible weaknesses?

It mentions that one should use 32-bit FNV and fold it down to 16 bits
via xor. This doesn't work here because SSE2 doesn't have pmulld
(SSE4.1). I have taken some liberties by actually doing 64 16-bit
FNV-like operations in parallel and then doing an FNV-like combination
of them at the end. However, the choices there are concerned with good
hashing performance, while for checksums it shouldn't matter much even
if the average error detection rate goes from 99.998% to 99.99%, as
long as common error scenarios don't match up with the collisions. If
we decide to go this route we should definitely research what the
effectiveness consequences are and what good choices for the prime
values would be. On the face of it, multiply-by-prime and add/xor looks
like it provides pretty good mixing and resists transposed sequences
and zeroed-out blocks. The worst case seems to be bit errors. As far as
I can see, this implementation should detect all single-bit errors,
but if one of the bit errors is in the MSB, a second single error in
the MSB will cancel it out. I haven't done the math, but it should
still work out to a better than 99% chance of detecting random 2-bit
errors.
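
For readers unfamiliar with the mixing step under discussion, here is
plain scalar FNV-1a, which shows the multiply-by-prime and xor primitive.
The SIMD variant described above runs many independent lanes of a similar
operation and combines them at the end; this reference version only
illustrates the primitive, it is not the patch's code:

#include <stdint.h>
#include <stddef.h>

/* Scalar 32-bit FNV-1a over a byte buffer. */
static uint32_t
fnv1a_32(const unsigned char *buf, size_t len)
{
    uint32_t hash = 2166136261u;    /* FNV offset basis */
    size_t   i;

    for (i = 0; i < len; i++)
    {
        hash ^= buf[i];             /* mix in the next byte */
        hash *= 16777619u;          /* multiply by the FNV prime */
    }
    return hash;
}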

On Fri, Mar 22, 2013 at 8:00 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
 So to conclude, the 3 approaches:

 One other question: assuming that the algorithms use the full 16-bit
 space, is there a good way to avoid zero without skewing the result? Can
 we do something like un-finalize (after we figure out that it's zero),
 compute in an extra salt value, and then re-finalize? That might work
 for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo
 because the final value is the same as the state.

Taking the Fletcher or CRC32 result modulo 65521 (the largest prime
below 2^16) only gives a very slight skew that shouldn't 
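
A minimal sketch of the kind of folding being described; note that the
final +1 to keep the result non-zero, answering the question above, is an
assumption on my part rather than something stated in the (truncated)
message:

#include <stdint.h>

/* Fold a 32-bit checksum into 16 bits without ever producing zero:
 * raw % 65521 is in 0..65520, so the +1 maps it into 1..65521, which
 * still fits in 16 bits. */
static uint16_t
fold_checksum(uint32_t raw)
{
    return (uint16_t) (raw % 65521 + 1);
}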

Re: [HACKERS] Materialized view assertion failure in HEAD

2013-03-22 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:
 Kevin Grittner kgri...@ymail.com wrote:
 Robert Haas robertmh...@gmail.com wrote:

 It seems to me that the right place to fix this is in
 interpretOidsOption(), by returning false rather than
 default_with_oids whenever the relation is a materialized view.

 I like it.

 In working up a patch for this approach, I see that if CREATE
 FOREIGN TABLE is executed with default_with_oids set to true, it
 adds an oid column which appears to be always zero in my tests so
 far (although maybe other FDWs support it?).  Do we want to leave
 that alone?  If we're going to add code to ignore that setting for
 matviews, do we also want to ignore it for FDWs?

 [ thinks... ]

 I suppose I should post a patch which preserves the status quo for
 FDWs and treat that as a separate issue.  So, a rough cut is attached.
 Obviously some docs should be added around this, and I still need
 to do another pass to make sure I didn't miss anything; but it
 passes make check-world and make installcheck-world, and the
 regression database can be dumped and loaded without problem.

 Comments?

Tidied up, further tested, and pushed.
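
For readers following along, the shape of the fix Robert suggested is
roughly the following. This is a hypothetical sketch -- the real
interpretOidsOption() signature and surroundings differ -- meant only to
show the decision being made:

#include <stdbool.h>

/*
 * Decide whether a new relation gets an OID column.  An explicit
 * WITH (oids = ...) option wins; otherwise a materialized view is
 * forced to false instead of inheriting default_with_oids.
 */
static bool
interpret_oids_option(bool oids_given, bool oids_value,
                      bool default_with_oids, bool is_matview)
{
    if (oids_given)
        return oids_value;
    if (is_matview)
        return false;
    return default_with_oids;
}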

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] dump, restore, dump yields differences

2013-03-22 Thread Kevin Grittner
As part of testing the matview issues around dumping, I ran `make
installcheck-world`, ran pg_dump to dump the regression database,
loaded it into a new database, dumped the newly restored database,
and compared the output files from the two pg_dump runs.  There
were a few somewhat surprising differences, attached.  Do we want
to worry about these, or leave them alone?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

*** regression.dump	2013-03-22 10:58:28.341475752 -0500
--- rest.dump	2013-03-22 11:00:35.881475455 -0500
***
*** 7878,7887 
  -- Data for Name: b_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY b_star (class, aa, bb, a) FROM stdin;
! b	3	mumble	\N
  b	4	\N	\N
! b	\N	bumble	\N
  b	\N	\N	\N
  \.
  
--- 7878,7887 
  -- Data for Name: b_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY b_star (class, aa, a, bb) FROM stdin;
! b	3	\N	mumble
  b	4	\N	\N
! b	\N	\N	bumble
  b	\N	\N	\N
  \.
  
***
*** 48950,48959 
  -- Data for Name: c_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY c_star (class, aa, cc, a) FROM stdin;
! c	5	hi mom	\N
  c	6	\N	\N
! c	\N	hi paul	\N
  c	\N	\N	\N
  \.
  
--- 48950,48959 
  -- Data for Name: c_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY c_star (class, aa, a, cc) FROM stdin;
! c	5	\N	hi mom
  c	6	\N	\N
! c	\N	\N	hi paul
  c	\N	\N	\N
  \.
  
***
*** 49185,49206 
  -- Data for Name: d_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY d_star (class, aa, bb, cc, dd, a) FROM stdin;
! d	7	grumble	hi sunita	0	\N
! d	8	stumble	hi koko	\N	\N
! d	9	rumble	\N	1.10009	\N
! d	10	\N	hi kristin	10.0098	\N
! d	\N	crumble	hi boris	100.0010005	\N
! d	11	fumble	\N	\N	\N
! d	12	\N	hi avi	\N	\N
! d	13	\N	\N	1000.97	\N
! d	\N	tumble	hi andrew	\N	\N
! d	\N	humble	\N	1.09997	\N
! d	\N	\N	hi ginger	10.0093	\N
  d	14	\N	\N	\N	\N
! d	\N	jumble	\N	\N	\N
! d	\N	\N	hi jolly	\N	\N
! d	\N	\N	\N	100.001	\N
  d	\N	\N	\N	\N	\N
  \.
  
--- 49185,49206 
  -- Data for Name: d_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY d_star (class, aa, a, bb, cc, dd) FROM stdin;
! d	7	\N	grumble	hi sunita	0
! d	8	\N	stumble	hi koko	\N
! d	9	\N	rumble	\N	1.10009
! d	10	\N	\N	hi kristin	10.0098
! d	\N	\N	crumble	hi boris	100.0010005
! d	11	\N	fumble	\N	\N
! d	12	\N	\N	hi avi	\N
! d	13	\N	\N	\N	1000.97
! d	\N	\N	tumble	hi andrew	\N
! d	\N	\N	humble	\N	1.09997
! d	\N	\N	\N	hi ginger	10.0093
  d	14	\N	\N	\N	\N
! d	\N	\N	jumble	\N	\N
! d	\N	\N	\N	hi jolly	\N
! d	\N	\N	\N	\N	100.001
  d	\N	\N	\N	\N	\N
  \.
  
***
*** 59377,59390 
  -- Data for Name: e_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY e_star (class, aa, cc, ee, e, a) FROM stdin;
! e	15	hi carol	-1	\N	\N
! e	16	hi bob	\N	\N	\N
! e	17	\N	-2	\N	\N
! e	\N	hi michelle	-3	\N	\N
  e	18	\N	\N	\N	\N
! e	\N	hi elisa	\N	\N	\N
! e	\N	\N	-4	\N	\N
  \.
  
  
--- 59377,59390 
  -- Data for Name: e_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY e_star (class, aa, a, cc, ee, e) FROM stdin;
! e	15	\N	hi carol	-1	\N
! e	16	\N	hi bob	\N	\N
! e	17	\N	\N	-2	\N
! e	\N	\N	hi michelle	-3	\N
  e	18	\N	\N	\N	\N
! e	\N	\N	hi elisa	\N	\N
! e	\N	\N	\N	-4	\N
  \.
  
  
***
*** 59415,59437 
  -- Data for Name: f_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY f_star (class, aa, cc, ee, ff, f, e, a) FROM stdin;
! f	19	hi claire	-5	((1,3),(2,4))	10	\N	\N
! f	20	hi mike	-6	\N	10	\N	\N
! f	21	hi marcel	\N	((11,44),(22,55),(33,66))	10	\N	\N
! f	22	\N	-7	((111,555),(222,666),(333,777),(444,888))	10	\N	\N
! f	\N	hi keith	-8	((,),(,))	10	\N	\N
! f	24	hi marc	\N	\N	10	\N	\N
! f	25	\N	-9	\N	10	\N	\N
! f	26	\N	\N	((1,3),(2,4))	10	\N	\N
! f	\N	hi allison	-10	\N	10	\N	\N
! f	\N	hi jeff	\N	((11,33),(22,44))	10	\N	\N
! f	\N	\N	-11	((111,333),(222,444))	10	\N	\N
! f	27	\N	\N	\N	10	\N	\N
! f	\N	hi carl	\N	\N	10	\N	\N
! f	\N	\N	-12	\N	10	\N	\N
! f	\N	\N	\N	((,),(,))	10	\N	\N
! f	\N	\N	\N	\N	10	\N	\N
  \.
  
  
--- 59415,59437 
  -- Data for Name: f_star; Type: TABLE DATA; Schema: public; Owner: kgrittn
  --
  
! COPY f_star (class, aa, a, cc, ee, e, ff, f) FROM stdin;
! f	19	\N	hi claire	-5	\N	((1,3),(2,4))	10
! f	20	\N	hi mike	-6	\N	\N	10
! f	21	\N	hi marcel	\N	\N	((11,44),(22,55),(33,66))	10
! f	22	\N	\N	-7	\N	((111,555),(222,666),(333,777),(444,888))	10
! f	\N	\N	hi keith	-8	\N	((,),(,))	10
! f	24	\N	hi marc	\N	\N	\N	10
! f	25	\N	\N	-9	\N	\N	10
! f	26	\N	\N	\N	\N	((1,3),(2,4))	10
! f	\N	\N	hi allison	-10	\N	\N	10
! f	\N	\N	hi jeff	\N	\N	((11,33),(22,44))	10
! f	\N	\N	\N	-11	\N	

Re: [HACKERS] dump, restore, dump yields differences

2013-03-22 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 As part of testing the matview issues around dumping, I ran `make
 installcheck-world`, ran pg_dump to dump the regression database,
 loaded it into a new database, dumped the newly restored database,
 and compared the output files from the two pg_dump runs.  There
 were a few somewhat surprising differences, attached.  Do we want
 to worry about these, or leave them alone?

Those are expected.  You can trace the ALTER TABLE history of those
tables if you want to see why they're so odd, but basically there
are inheritance situations where it's hard to avoid this.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).


 --
 greg

 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly?  the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default connection parameters for postgres_fdw and dblink

2013-03-22 Thread Jeff Davis
On Fri, 2013-03-22 at 12:19 -0400, Tom Lane wrote:
 Is there a better way to handle all this?  It may be too late to rethink
 dblink's behavior anyhow, but perhaps it's not too late to change
 postgres_fdw.  I think though that once we let 9.3 out the door, it
 *will* be too late to make any major changes, because postgres_fdw's
 usage is going to go through the roof now that it can do remote updates.

The first thing that occurs to me is to have postgres_fdw install some
GUCs with reasonable defaults.

Perhaps the default could be a magic value that is replaced by the
current user or something (similar to search_path).

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: postgres_fdw vs data formatting GUCs (was Re: [HACKERS] [v9.3] writable foreign tables)

2013-03-22 Thread Tom Lane
Daniel Farina dan...@heroku.com writes:
 This contains some edits to comments that referred to the obsolete and
 bogus TupleDesc scanning.  No mechanical alterations.

Applied with some substantial revisions.  I didn't like where you'd put
the apply/restore calls, for one thing --- we need to wait to do the
applies until we have the PGresult in hand, else we might be applying
stale values of the remote's GUCs.  Also, adding a call that could throw
errors right before materializeResult() won't do, because that would
result in leaking the PGresult on error.  The struct for state seemed a
bit of a mess too, given that you couldn't always initialize it in one
place.  (In hindsight I could have left that alone given where I ended
up putting the calls, but it didn't seem to be providing any useful
isolation.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized view assertion failure in HEAD

2013-03-22 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 In working up a patch for this approach, I see that if CREATE
 FOREIGN TABLE is executed with default_with_oids set to true, it
 adds an oid column which appears to be always zero in my tests so
 far (although maybe other FDWs support it?).  Do we want to leave
 that alone?  If we're going to add code to ignore that setting for
 matviews do we also want to ignore it for FDWs?

I don't see any very good reason for the core code to be assuming
anything about whether an FDW will support OIDs or not.  It would have
been noticeably more painful to get postgres_fdw's writable-table
support working if the core code had tried to enforce an opinion that
foreign tables couldn't have ctid, to take something somewhat
comparable.

Where I'd like to see things going is to add a hook for FDWs to control
what system columns get created for their tables.  When that happens,
an FDW could throw error about OIDs if it wants to.  In the meantime
I don't feel a need to change the behavior.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

 well, what problem are you trying to solve exactly?  the main problems
 I see today are not so much in terms of page replacement but spinlock
 and lwlock contention.

Even back when we last hacked on that algorithm, the concerns were not
so much about which pages it replaced as how much overhead and
contention was created by the management algorithm.  I haven't seen any
reason to think we have a problem with the quality of the replacement
choices.  The proposal to increase the initial usage count would
definitely lead to more overhead/contention, though, because it would
result in having to circle around all the buffers more times (on
average) to get a free buffer.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dump, restore, dump yields differences

2013-03-22 Thread Greg Stark
On Fri, Mar 22, 2013 at 6:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Those are expected.  You can trace the ALTER TABLE history of those
 tables if you want to see why they're so odd, but basically there
 are inheritance situations where it's hard to avoid this.

Incidentally, it would still be cool to have make check do this dance
and intelligently compare the before and after. There has been more
than one patch where you've caught omissions in pg_dump this way
before applying.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default connection parameters for postgres_fdw and dblink

2013-03-22 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 On Fri, 2013-03-22 at 12:19 -0400, Tom Lane wrote:
 Is there a better way to handle all this?  It may be too late to rethink
 dblink's behavior anyhow, but perhaps it's not too late to change
 postgres_fdw.  I think though that once we let 9.3 out the door, it
 *will* be too late to make any major changes, because postgres_fdw's
 usage is going to go through the roof now that it can do remote updates.

 The first thing that occurs to me is to have postgres_fdw install some
 GUCs with reasonable defaults.

If there's anything I've learned in the last dozen years, it's that GUCs
with application-visible semantic effects are dangerous.  If the
semantic effects are relevant to security, that's probably even worse.

 Perhaps the default could be a magic value that is replaced by the
 current user or something (similar to search_path).

That seems like just an overcomplicated form of my suggestion that the
default should be the current user's name.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 2:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

 well, what problem are you trying to solve exactly?  the main problems
 I see today are not so much in terms of page replacement but spinlock
 and lwlock contention.

 Even back when we last hacked on that algorithm, the concerns were not
 so much about which pages it replaced as how much overhead and
 contention was created by the management algorithm.  I haven't seen any
 reason to think we have a problem with the quality of the replacement
 choices.  The proposal to increase the initial usage count would
 definitely lead to more overhead/contention, though, because it would
 result in having to circle around all the buffers more times (on
 average) to get a free buffer.


yup...absolutely.  I have a hunch that the occasional gripes we see
about server stalls under high load with read-only (or mostly
read-only) workloads come from spinlock contention under the lwlock
hitting a critical point and effectively shutting the server down
until, by chance, the backend holding the lwlock gets lucky and lands
the spinlock.

I think there is some very low-hanging optimization fruit in the clock
sweep loop.   first and foremost, I see no good reason why, when
scanning pages, we have to spin and wait on a buffer in order to
pedantically adjust usage_count.  some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock, as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all.   This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep.  This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.
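
(For illustration, a minimal self-contained sketch of that trylock-style
sweep, with stand-in types and names rather than the real bufmgr code:
an unlocked peek first, then a single TAS, and on any sign of contention
the hand simply advances to the next candidate.)

#include <stdbool.h>

/* Stand-ins for PostgreSQL's buffer structures -- assumed names and
 * shapes for this sketch, not the real bufmgr definitions. */
typedef struct BufferDescSketch
{
    volatile int lock;          /* 0 = free, 1 = held (spinlock word) */
    int          refcount;
    int          usage_count;
} BufferDescSketch;

#define SKETCH_NBUFFERS 1024
static BufferDescSketch buffers[SKETCH_NBUFFERS];
static unsigned int clock_hand;     /* advisory; races just skip buffers */

/* One test-and-set; true if we got the header lock. */
static bool
buf_trylock(BufferDescSketch *buf)
{
    return __sync_lock_test_and_set(&buf->lock, 1) == 0;
}

static void
buf_unlock(BufferDescSketch *buf)
{
    __sync_lock_release(&buf->lock);
}

/* Clock sweep that never spins on a contended buffer header: a cheap
 * non-locking peek, then a single TAS; on failure, advance the hand.
 * Returns a victim with its header lock still held.  (A real
 * implementation would bound the number of full passes.) */
static BufferDescSketch *
clock_sweep_trylock(void)
{
    for (;;)
    {
        BufferDescSketch *buf = &buffers[clock_hand++ % SKETCH_NBUFFERS];

        if (buf->lock != 0)         /* non-locking test: contended == hot */
            continue;
        if (!buf_trylock(buf))      /* lost the race: skip, don't spin */
            continue;
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;             /* we only need *a* buffer */
        if (buf->usage_count > 0)
            buf->usage_count--;
        buf_unlock(buf);
    }
}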

I also wonder if we shouldn't (perhaps in addition to the above)
resuscitate Jeff Janes's idea to get rid of the lwlock completely and
manage everything with spinlocks.

Naturally, all of this would have to be confirmed with some very robust testing.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 I think there is some very low-hanging optimization fruit in the clock
 sweep loop.   first and foremost, I see no good reason why, when
 scanning pages, we have to spin and wait on a buffer in order to
 pedantically adjust usage_count.  some simple refactoring there could
 set it up so that a simple TAS (or even a TTAS with the first test in
 front of the cache line lock, as is done automatically on x86 IIRC)
 could guard the buffer and, in the event of any lock detected, simply
 move on to the next candidate without messing around with that buffer
 at all.   This could be construed as a 'trylock' variant of a spinlock
 and might help out with cases where an especially hot buffer is
 locking up the sweep.  This is exploiting the fact that from
 StrategyGetBuffer we don't need a *particular* buffer, just *a*
 buffer.

Hm.  You could argue in fact that if there's contention for the buffer
header, that's proof that it's busy and shouldn't have its usage count
decremented.  So this seems okay from a logical standpoint.

However, I'm not real sure that it's possible to do a conditional
spinlock acquire that doesn't create just as much hardware-level
contention as a full acquire (ie, TAS is about as bad whether it
gets the lock or not).  So the actual benefit is a bit less clear.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 3:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 I think there is some very low-hanging optimization fruit in the clock
 sweep loop.   first and foremost, I see no good reason why, when
 scanning pages, we have to spin and wait on a buffer in order to
 pedantically adjust usage_count.  some simple refactoring there could
 set it up so that a simple TAS (or even a TTAS with the first test in
 front of the cache line lock, as is done automatically on x86 IIRC)
 could guard the buffer and, in the event of any lock detected, simply
 move on to the next candidate without messing around with that buffer
 at all.   This could be construed as a 'trylock' variant of a spinlock
 and might help out with cases where an especially hot buffer is
 locking up the sweep.  This is exploiting the fact that from
 StrategyGetBuffer we don't need a *particular* buffer, just *a*
 buffer.

 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.

 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  if the non-locking test comes back good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non-locking test first.
 this would of course require some amusing adjustments to various
logical checks (usage_count = 0, heh).
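
(A minimal sketch of that unlocked, advisory usage_count idea, with
assumed names: the plain read-test-store sequence can lose a racing
update, but the stored value always stays within
[0, BM_MAX_USAGE_COUNT], which is all an advisory counter needs.)

#define BM_MAX_USAGE_COUNT 5        /* the cap PostgreSQL uses today */

typedef struct BufferUsageSketch
{
    int usage_count;                /* advisory; no header lock taken */
} BufferUsageSketch;

/* Bump on access; a lost race just means one missed bump. */
static void
usage_count_bump(BufferUsageSketch *buf)
{
    int v = buf->usage_count;       /* possibly stale read */

    if (v < BM_MAX_USAGE_COUNT)
        buf->usage_count = v + 1;   /* stored value stays in bounds */
}

/* Decay during the sweep; the bounds test means no underflow. */
static void
usage_count_decay(BufferUsageSketch *buf)
{
    int v = buf->usage_count;

    if (v > 0)
        buf->usage_count = v - 1;
}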

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: postgres_fdw vs data formatting GUCs (was Re: [HACKERS] [v9.3] writable foreign tables)

2013-03-22 Thread Daniel Farina
On Fri, Mar 22, 2013 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 This contains some edits to comments that referred to the obsolete and
 bogus TupleDesc scanning.  No mechanical alterations.

 Applied with some substantial revisions.  I didn't like where you'd put
 the apply/restore calls, for one thing --- we need to wait to do the
 applies until we have the PGresult in hand, else we might be applying
 stale values of the remote's GUCs.  Also, adding a call that could throw
 errors right before materializeResult() won't do, because that would
 result in leaking the PGresult on error.

Good catches.

 The struct for state seemed a
 bit of a mess too, given that you couldn't always initialize it in one
 place.

Yeah, I had to give that up when pushing things around, unless I
wanted to push more state down.  It used to be neater.

 (In hindsight I could have left that alone given where I ended
 up putting the calls, but it didn't seem to be providing any useful
 isolation.)

I studied your commit.

Yeah, the idea I had was to try to avoid pushing down a loaded value
such as a PGconn into the lower-level helper functions, but perhaps
that economy was a false one after the modifications.  Earlier
versions used to push down the RemoteGucs struct instead of a
full-blown conn to hint at the restricted purpose of that reference.
By conceding to this pushdown I think the struct could have remained,
as you said, but the difference in clarity is likely marginal.  I
thought I had found a way to avoid widening the parameter list at all,
so I preferred that one, but clearly it was wrong w.r.t. the leaks and
the not-up-to-date protocol state.

Sorry you had to root around so much in there to get something you
liked, but thanks for going through it.

--
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cube extension improvement, GSoC

2013-03-22 Thread Stas Kelvich
Hello,

some time ago I started working on a data search system (about 100-200M 
records) with queries consisting of several range and equality conditions, 
e.g.:

  WHERE dim1 BETWEEN 128 AND 137 AND
  WHERE dim2 BETWEEN 4815 AND 162342 AND
  WHERE dim3 = 42
  ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid a full table 
scan I started using an R-Tree over the cube data type:

  CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist 
(http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), 
which changes the metric for nearest-neighbor search and allows cubes to be 
obtained ordered by a specific coordinate. Having changed some data types and 
function/operator definitions I ported this to 9.2 
(https://github.com/kelvich/postgres/commit/bb372). After this I could 
select ordered records right from the index:

  SELECT * FROM my_table
  WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
  ORDER BY cube(array[dim1, dim2, dim3]) * 5;

The main issue with this approach is the index size. In my case it was about 
100GB for 100M records, so index building becomes a very expensive 
disk-bound operation. For the same reason, reading a large number of records 
from the index is slow too.

I came to Oleg Bartunov, Theodor Sigaev, and after a while to Alexander 
Korotkov for help. (I'm very thankful to them and glad that they agreed to 
meet, listen to me, and give useful advice.) Having discussed it, we decided 
that there were several possible improvements for the cube extension:

  * Adding point data type support to the cube extension, in order to avoid 
storing coincident upper-left and lower-right vertices; this may cut the space 
that leaf nodes occupy almost in half (see the sketch after this list).
  * Checking the split algorithm with big datasets and, if possible, improving 
it.
  * Teaching the cube extension to store dimensions with different data types. 
Such an index would be a good alternative to a compound-key B-Tree multi-index 
(suitable for range queries and data ordering).
  * Providing support for KNN with metrics induced by different norms: 
euclidean, taxicab, p-norm. This can be useful in fields where we can extract 
a signature: similar-image search, similar-audio search.
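
(A hedged sketch of the first item, with assumed field names rather than 
the actual cube struct: steal a header bit to record that the upper-right 
corner coincides with the lower-left one, so a point stores each 
coordinate only once.)

#include <stdint.h>

#define POINT_BIT 0x80000000
#define DIM_MASK  0x7fffffff

typedef struct NDBOXSketch
{
    uint32_t vl_len_;   /* varlena length word stand-in */
    uint32_t header;    /* POINT_BIT | number of dimensions */
    double   x[];       /* dim coords for a point, 2*dim for a box */
} NDBOXSketch;

#define IS_POINT(c) (((c)->header & POINT_BIT) != 0)
#define DIM(c)      ((c)->header & DIM_MASK)

/* The lower-left corner is always x[0..dim-1]; for a point the
 * upper-right corner aliases the same array, halving leaf storage. */
static inline double
ur_coord(const NDBOXSketch *c, int i)
{
    return IS_POINT(c) ? c->x[i] : c->x[DIM(c) + i];
}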

I'd like to participate in GSoC with these improvements, and I'm very 
interested in any comments or suggestions about this feature list.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: DROP OWNED BY fails to drop privileges granted by non-owners (was Re: [GENERAL] Bug, Feature, or what else?)

2013-03-22 Thread Alvaro Herrera
Tom Lane wrote:

 I believe the problem is that DROP OWNED for privileges is implemented
 by calling REVOKE.  As noted upthread, when a superuser does REVOKE,
 it's executed as though the object owner did the REVOKE, so only
 privileges granted directly by the object owner go away.  In this
 particular example, DROP OWNED BY u1 makes the grant to u1 go away,
 and then the grant to u2 goes away via cascade ... but DROP OWNED BY
 u2 fails to accomplish anything at all, because postgres never granted
 anything directly to u2.
 
 We haven't seen this reported before, probably because the use of
 GRANT OPTIONS isn't very common, but AFAICS it's been wrong since
 the invention of DROP OWNED.

So it seems.

I have been mulling this over today, and it seems to me that one
way to fix it would be to have ExecGrant_Objecttype() loop over all
possible grantors when determining what to revoke when it's called by
DROP OWNED.  A proof-of-concept is below (RemoveRoleFromObjectACL would
set istmt.all_grantors to true, whereas ExecuteGrantStmt leaves it as
false and behaves as today).

I am not sure about the restrict_and_check_grant() call I left out in
the middle of the operation; it seems to me that if we limit DROP OWNED
to be called only by a superuser, we can get away without that check.
Or maybe we need a new version of that routine that will only apply the
bitmask and not raise any error messages.

The patch below is just for ExecGrant_Relation(), but as far as I can
tell all the ExecGrant_Objecttype() routines do pretty much the same
here, so I think it'd be better to refactor the select_best_grantor/
restrict_and_check_grant/merge_acl_with_grant sequence into a common
function, and then have that function apply the loop for all grantors.

Thoughts?


***
*** 1891,1932  ExecGrant_Relation(InternalGrant *istmt)
  AclObjectKind aclkind;
  
  /* Determine ID to do the grant as, and available grant options */
! select_best_grantor(GetUserId(), this_privileges,
! old_acl, ownerId,
! &grantorId, &avail_goptions);
! 
! switch (pg_class_tuple->relkind)
  {
! case RELKIND_SEQUENCE:
! aclkind = ACL_KIND_SEQUENCE;
! break;
! default:
! aclkind = ACL_KIND_CLASS;
! break;
  }
  
! /*
!  * Restrict the privileges to what we can actually grant, and emit
!  * the standards-mandated warning and error messages.
!  */
! this_privileges =
! restrict_and_check_grant(istmt->is_grant, avail_goptions,
!  istmt->all_privs, this_privileges,
!  relOid, grantorId, aclkind,
!  NameStr(pg_class_tuple->relname),
!  0, NULL);
  
! /*
!  * Generate new ACL.
!  */
! new_acl = merge_acl_with_grant(old_acl,
!istmt->is_grant,
!istmt->grant_option,
!istmt->behavior,
!istmt->grantees,
!this_privileges,
!grantorId,
!ownerId);
  
  /*
   * We need the members of both old and new ACLs so we can correct
--- 1895,1962 
  AclObjectKind aclkind;
  
  /* Determine ID to do the grant as, and available grant options */
! if (!istmt->all_grantors)
  {
! select_best_grantor(GetUserId(), this_privileges,
! old_acl, ownerId,
! &grantorId, &avail_goptions);
! 
! switch (pg_class_tuple->relkind)
! {
! case RELKIND_SEQUENCE:
! aclkind = ACL_KIND_SEQUENCE;
! break;
! default:
! aclkind = ACL_KIND_CLASS;
! break;
! }
! 
! /*
!  * Restrict the privileges to what we can actually grant, and emit
!  * the standards-mandated warning and error messages.
!  */
! this_privileges =
! restrict_and_check_grant(istmt->is_grant, avail_goptions,
!  istmt->all_privs, this_privileges,
!  relOid, grantorId, aclkind,
!  NameStr(pg_class_tuple->relname),
!   

Re: [HACKERS] SDP query optimizer

2013-03-22 Thread Josh Berkus
Adriano,

 I have developed a new query optimizer for PostgreSQL and I would like
 to share it with the community. The optimizer's name is Sampling and
 Dynamic Programming (SDP). I put it into a plugin developed some years
 ago, named LJQO:

Woah!  Way cool.

As a warning, we're in the closing throes of version 9.3 right now, so
if your code/ideas don't get the attention they deserve, that's why.

There is an incomplete project from a few years back to make the
non-exhaustive query planner pluggable so that we could use different
algorithms.  Unfortunately, it was never finished and merged with the
core code.  Your planner is yet another reason it would be great to
complete this.

Keep up the good work!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Ants Aasma
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 well if you do a non-locking test first you could at least avoid some
 cases (and, if you get the answer wrong, so what?) by jumping to the
 next buffer immediately.  if the non locking test comes good, only
 then do you do a hardware TAS.

 you could in fact go further and dispense with all locking in front of
 usage_count, on the premise that it's only advisory and not a real
 refcount.  so you only then lock if/when it's time to select a
 candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
 logical checks (usage_count = 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (If you arrange it to be a
read-test-inc/dec-store operation, then you will never go out of
bounds.) However, clocksweep and usage_count maintenance is not what
is causing contention, because that workload is distributed. The issue
is pinning and unpinning. There we need an accurate count, and there
are some pages, like index roots, that get hit very heavily. The
things to do there, in my opinion, are to convert to a futex-based
spinlock, so that contention doesn't completely kill performance, and
then try to get rid of the contention. Converting to lock-free pinning
won't help much here, as what is killing us is the cacheline bouncing.
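
As a point of reference, here is a minimal Linux-only sketch in the
style of the classic futex mutex (assumed names, not PostgreSQL code):
contenders spin briefly, then sleep in the kernel instead of hammering
the cache line.

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock states: 0 = free, 1 = locked, 2 = locked with sleepers. */

static long
futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void
futex_lock(atomic_int *lock)
{
    int c = 0;
    int i;

    /* brief optimistic spin: cheap when the holder releases quickly */
    for (i = 0; i < 100; i++)
    {
        c = 0;
        if (atomic_compare_exchange_strong(lock, &c, 1))
            return;
    }
    /* contended path: mark the lock "has sleepers" and sleep */
    if (c != 2)
        c = atomic_exchange(lock, 2);
    while (c != 0)
    {
        futex(lock, FUTEX_WAIT, 2);     /* sleep while value stays 2 */
        c = atomic_exchange(lock, 2);
    }
}

static void
futex_unlock(atomic_int *lock)
{
    if (atomic_exchange(lock, 0) == 2)
        futex(lock, FUTEX_WAKE, 1);     /* wake exactly one sleeper */
}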

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
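
(A sketch of the partitioning experiment suggested above, all names
assumed: give each slice of the buffer pool its own hand and lock and
route each backend to one slice, so sweeps from different backends
rarely touch the same shared state.)

#define NUM_SWEEP_PARTITIONS 16
#define SWEEP_NBUFFERS       16384

typedef struct SweepPartition
{
    int first_buf;      /* first buffer id owned by this slice */
    int nbufs;          /* number of buffers in this slice */
    int clock_hand;     /* per-partition hand, guarded by 'lock' */
    int lock;           /* stand-in for a per-partition spinlock */
} SweepPartition;

static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

/* Route a backend to one partition, much like the buffer mapping
 * hash tables are already partitioned. */
static SweepPartition *
sweep_partition_for(int backend_id)
{
    return &sweep_parts[backend_id % NUM_SWEEP_PARTITIONS];
}

/* Advance only this slice's hand; other partitions are untouched. */
static int
next_victim_candidate(SweepPartition *p)
{
    int buf = p->first_buf + p->clock_hand;

    p->clock_hand = (p->clock_hand + 1) % p->nbufs;
    return buf;
}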

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SDP query optimizer

2013-03-22 Thread Ants Aasma
On Sat, Mar 23, 2013 at 1:35 AM, Adriano Lange alange0...@gmail.com wrote:
 I have developed a new query optimizer for PostgreSQL and I would like to
 share it with the community. The optimizer's name is Sampling and Dynamic
 Programming (SDP). I put it into a plugin developed some years ago, named
 LJQO:

 https://github.com/alange0001/ljqo.git

Looks great. Do you happen to have any papers or insight into why SDP
works as well as it does? It isn't immediately clear to me why the
cheapest left-deep tree is a good heuristic start point for the
dynamic programming phase.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 7:33 PM Alvaro Herrera wrote:
 Amit Kapila escribió:
  On Friday, March 22, 2013 8:57 AM Alvaro Herrera wrote:
   Amit Kapila escribió:
 
I think adding new syntax change is little scary for me, not for
the matter of implementation but for building consensus on
 syntax.
  
   I cannot but agree on that point.
 
  Sorry, I don't get your point.
  Do you mean to say that you don't agree with me and want new syntax
 as
  proposed by you to be implemented?
 
 On the contrary, I was saying I agree with you on the difficulty on
 getting consensus on this.

Now, based on the final discussion, I will go ahead and document the
behavior rather than using SIGHUP at the end or giving a NOTICE at the end
of the command. Let me know if anybody still feels otherwise.

Apart from this, during Greg's testing he found a performance problem when
running pg_reload_conf() along with my patch; the reason actually turns out
to be memory growth. I have attempted to fix it in the patch
ctx_growth_fix_v1.patch, which is currently attached in the CF queue along
with my patch:
http://www.postgresql.org/message-id/004801ce216b$e37c3b30$aa74b190$@kapila@huawei.com
I think that should be done irrespective of the SET Persistent patch.


 I don't know where I learned this phrase or even if it's in common
 usage.  After a not-so-quick search I see it explained here:
 http://www.perfectyourenglish.com/usage/but.htm

You were right, I was not able to understand.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified

2013-03-22 Thread Michael Paquier
Hi all,

Please find attached a simple example of a bgworker that logs a message each
time it receives a SIGTERM or SIGHUP signal:
- hello signal: processed SIGHUP when SIGHUP is handled by my example
- hello signal: processed SIGTERM when SIGTERM is handled by my example
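
For readers following along, here is roughly what the worker's handlers
presumably look like (the handler names are my guesses; the attached
tarball is authoritative): each handler only sets a flag and pokes the
process latch, and the main loop does the logging.

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/proc.h"

static volatile sig_atomic_t got_sighup = false;
static volatile sig_atomic_t got_sigterm = false;

/* Assumed handler names; register with pqsignal(SIGHUP, ...) and
 * pqsignal(SIGTERM, ...) before BackgroundWorkerUnblockSignals(). */
static void
hello_sighup(SIGNAL_ARGS)
{
    int save_errno = errno;

    got_sighup = true;
    if (MyProc)
        SetLatch(&MyProc->procLatch);
    errno = save_errno;
}

static void
hello_sigterm(SIGNAL_ARGS)
{
    int save_errno = errno;

    got_sigterm = true;
    if (MyProc)
        SetLatch(&MyProc->procLatch);
    errno = save_errno;
}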

With the current master code, here is what I get:
$ for i in {1..5}; do pg_ctl reload -D ~/bin/pgsql/master/; sleep 1; done
server signaled
server signaled
server signaled
server signaled
server signaled
$ cat ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_112246.log
LOG:  starting background worker process hello signal worker
LOG:  database system was shut down at 2013-03-23 11:22:46 JST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

LOG:  received SIGHUP, reloading configuration files
LOG:  received SIGHUP, reloading configuration files
LOG:  received SIGHUP, reloading configuration files
LOG:  received SIGHUP, reloading configuration files
LOG:  received SIGHUP, reloading configuration files

SIGHUP is not received by my bgworker. But SIGTERM is:

$ tail -n 5 ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_112246.log
LOG:  autovacuum launcher shutting down
*LOG:  hello signal: processed SIGTERM*
LOG:  worker process: hello signal worker (PID 2873) exited with exit code 0
LOG:  shutting down
LOG:  database system is shut down

Now, if I apply my fix and redo the same tests, here is what I get:

$ for i in {1..5}; do pg_ctl reload -D ~/bin/pgsql/master/; sleep 1; done
server signaled
server signaled
server signaled
server signaled
server signaled
$ cat ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_113315.log
LOG:  starting background worker process hello signal worker
LOG:  database system was shut down at 2013-03-23 11:33:14 JST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

LOG:  received SIGHUP, reloading configuration files
*LOG:  hello signal: processed SIGHUP*
LOG:  received SIGHUP, reloading configuration files
*LOG:  hello signal: processed SIGHUP*
LOG:  received SIGHUP, reloading configuration files
*LOG:  hello signal: processed SIGHUP*
LOG:  received SIGHUP, reloading configuration files
*LOG:  hello signal: processed SIGHUP*
LOG:  received SIGHUP, reloading configuration files
*LOG:  hello signal: processed SIGHUP*

So SIGHUP is now correctly managed by the bgworker. As well as SIGTERM:

$ pg_ctl stop -D ~/bin/pgsql/master/
waiting for server to shut down.... done
server stopped
ioltas@nukkle:~/bin/extra(linux OK)$ tail -n 5
~/bin/pgsql/master/pg_log/postgresql-2013-03-23_113315.log
*LOG:  hello signal: processed SIGTERM*
LOG:  autovacuum launcher shutting down
LOG:  worker process: hello signal worker (PID 13781) exited with exit code
0
LOG:  shutting down
LOG:  database system is shut down

It would be great to get that fixed.
Thanks.
-- 
Michael


hello_signal.tar.gz
Description: GNU Zip compressed data


20130321_bgworker_sighup.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-22 Thread Craig Ringer
On 03/23/2013 02:00 AM, Jeff Davis wrote:
 On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
 So to conclude, the 3 approaches:
 One other question: assuming that the algorithms use the full 16-bit
 space, is there a good way to avoid zero without skewing the result? Can
 we do something like un-finalize (after we figure out that it's zero),
 compute in an extra salt value, and then re-finalize? That might work
 for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo
 because the final value is the same as the state.

 I'm still slightly concerned about differentiating checksummed pages in
 the future if we want to offer a transition path, since we no longer use
 header bits. Avoiding zero might help us there. Hopefully not necessary,
 but something we might find useful.
Avoiding a magic value for "not checksummed" might help, but IMO zero is
a terrible choice for that, since it's one of the most likely things to
be written in chunks over good data during some kinds of corruption
event (memory overwriting, etc).

Making zero the "not checksummed" magic value would significantly detract
from the utility of checksums, IMO.
  Also, it would help us identify
 situations where the checksum is never set.
Now that seems more useful - ERROR: BUG or disk corruption found.
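
(For what it's worth, one way to reserve zero without skewing the 16-bit
space, sketched on the assumption of a wider internal checksum state:
fold modulo 65535 and add one, which maps results uniformly onto
1..65535.)

#include <stdint.h>

/* Map a 32-bit internal checksum state onto 1..65535; zero is then
 * free to mean "no checksum has ever been set on this page". */
static uint16_t
finalize_checksum(uint32_t state)
{
    return (uint16_t) ((state % 65535) + 1);
}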

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning. There we need an accurate count and there are
 some pages like index roots that get hit very heavily. Things to do
 there would be in my opinion convert to a futex based spinlock so when
 there is contention it doesn't completely kill performance and then
 try to get rid of the contention. Converting to lock-free pinning
 won't help much here as what is killing us here is the cacheline
 bouncing.

 One way to get rid of contention is the buffer nailing idea that
 Robert came up with. If some buffer gets so hot that maintaining
 refcount on the buffer header leads to contention, promote that buffer
 to a nailed status, let everyone keep their pin counts locally and
 sometime later revisit the nailing decision and if necessary convert
 pins back to the buffer header.

 One other interesting idea I have seen is closeable scalable nonzero
 indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
 use a tree structure to dynamically stripe access to the shared lock
 counter when contention is detected. Downside is that considerable
 amount of shared memory is needed so there needs to be some way to
 limit the resource usage. This is actually somewhat isomorphic to the
 nailing idea.

 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers. I think the
 improvements should concentrate on finding out what is the problem
 there and figuring out how to fix it. A simple idea to test would be
 to just partition shared buffers along with the whole clock sweep
 machinery into smaller ones, like the buffer mapping hash tables
 already are. This should at the very least reduce contention for the
 clock sweep even if it doesn't reduce work done per page miss.


One way to distribute memory contention in the case of spinlocks could
be to utilize the fundamentals of the NUMA architecture. Specifically,
we can let the contending backends spin on local flags instead of on
the buffer header flags directly. As access to local cache lines is
much cheaper and faster than access to memory locations that are far
away in NUMA terms, we could potentially reduce the traffic on a
specific cache line and reduce the overall overhead as well.
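
What this describes is essentially an MCS-style queue lock. A hedged
sketch, not PostgreSQL code: each waiter enqueues a node of its own and
spins only on that node's flag, which stays in a cache line local to
the waiting backend.

#include <stdbool.h>

typedef struct mcs_node
{
    struct mcs_node *next;
    bool             locked;
} mcs_node;

typedef struct mcs_lock
{
    mcs_node *tail;
} mcs_lock;

static void
mcs_acquire(mcs_lock *lock, mcs_node *me)
{
    mcs_node *prev;

    me->next = NULL;
    me->locked = true;
    prev = __atomic_exchange_n(&lock->tail, me, __ATOMIC_ACQ_REL);
    if (prev != NULL)
    {
        prev->next = me;
        /* spin on our own node only; no remote cache line traffic */
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
            ;
    }
}

static void
mcs_release(mcs_lock *lock, mcs_node *me)
{
    if (me->next == NULL)
    {
        mcs_node *expected = me;

        if (__atomic_compare_exchange_n(&lock->tail, &expected, NULL,
                                        false, __ATOMIC_ACQ_REL,
                                        __ATOMIC_ACQUIRE))
            return;                 /* nobody was waiting */
        while (me->next == NULL)
            ;                       /* a waiter is mid-enqueue */
    }
    /* hand the lock to exactly one local spinner */
    __atomic_store_n(&me->next->locked, false, __ATOMIC_RELEASE);
}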

Regards.

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_dump/restore syntax checking bug?

2013-03-22 Thread Joshua D. Drake


Hello,

In testing some pg_restore functionality I found the following:

postgres@jd-laptop:~$ pg_dump -U postgres -Fc -s --file=foo.sqlc
postgres@jd-laptop:~$ dropdb test;
postgres@jd-laptop:~$ createdb test;
postgres@jd-laptop:~$ pg_restore -d test -P 'by()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test
psql (9.1.8)
Type help for help.

test=# select by();
 by 
----
 by
(1 row)

test=# select hello();
ERROR:  function hello() does not exist
LINE 1: select hello();
   ^
HINT:  No function matches the given name and argument types. You might 
need to add explicit type casts.


 The above is as expected.


test=# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type help for help.

test=# select hello();
ERROR:  function hello() does not exist
LINE 1: select hello();
   ^

 This is where I am confused. It didn't restore hello() and it also 
didn't error that the syntax of the restore command was invalid.
test=# drop function by();
DROP FUNCTION
test=#
test=# q
test-# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type help for help.

test=# drop function by();
ERROR:  function by() does not exist

 by() not restored by above command


test=# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by()','hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type help for help.

test=# drop function by();
ERROR:  function by() does not exist
test=# select hello();
ERROR:  function hello() does not exist
LINE 1: select hello();
   ^
HINT:  No function matches the given name and argument types. You might 
need to add explicit type casts.


 by() and hello() also not restored


test=# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by()' -P'hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type help for help.

test=# select hello();
 hello 
-------
 hello
(1 row)

test=# select by();
ERROR:  function by() does not exist
LINE 1: select by();

 hello() restored but by() was not.


It appears we need better syntax checking.

Sincerely,

JD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump/restore syntax checking bug?

2013-03-22 Thread Josh Kupershmidt
On Fri, Mar 22, 2013 at 9:35 PM, Joshua D. Drake j...@commandprompt.com wrote:

 postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc

Note, the pg_restore doc makes no mention of trying to squeeze
multiple function prototypes into a single argument as you've done
here, or of using multiple -P flags.

 It appears we need better syntax checking.

Can't really argue with this. But if you think these pg_restore
examples are bad, try this gem:
  reindexdb --table='foo; ALTER ROLE limited WITH superuser'

Josh


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers