Re: [HACKERS] Page replacement algorithm in buffer cache
On 22-Mar-2013, at 11:28, Amit Kapila <amit.kap...@huawei.com> wrote:
> On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
>> Hello all,
>>
>> Sorry if this is a naive question. I was going through Greg Smith's
>> slides on the buffer cache
>> (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
>> While going through the page replacement algorithm that we use, i.e.
>> the clock-sweep algorithm, I noticed a potential problem in our
>> current system.
>>
>> Specifically, when a new entry is allocated in the buffer, its
>> USAGE_COUNT is set to 1. On each sweep of the algorithm the
>> USAGE_COUNT is decremented, and an entry whose USAGE_COUNT becomes
>> zero is replaced.
>
> Yes, it is replaced, but in the next clock-sweep pass, not immediately
> after reaching 0. So until the next pass, if nobody accesses the buffer
> and all other buffers have a higher count, it can be replaced. Also,
> the buffer that was just returned with usage count 1 will only have its
> count reduced in the next pass. So in all, I think it takes 2 passes
> for a freshly returned buffer to be re-used in case no one accesses it
> again.
>
> With Regards,
> Amit Kapila.

Hmm, so in the second pass it gets replaced, right?

I think that if USAGE_COUNT were initialized to the maximum allowed value
instead of one, we could have a better solution to this problem.

Another, more complex solution could be to introduce an ageing factor as
well.

Regards,

Atri

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
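Amit's two-pass observation can be seen in a toy simulation. This is an illustrative sketch only (the names `Buffer` and `clock_sweep` are made up; the real logic lives in PostgreSQL's buffer manager): every freshly loaded buffer starts at usage_count 1, one revolution of the hand decrements it to 0, and only a later visit actually evicts it.

```python
# Toy model of clock-sweep: new buffers start at usage_count 1; the hand
# decrements counts as it passes and evicts only when it finds a 0.
class Buffer:
    def __init__(self, page):
        self.page = page
        self.usage_count = 1  # new entries start at 1

def clock_sweep(buffers, hand):
    """Advance the hand until a victim with usage_count 0 is found.
    Returns (victim_index, new_hand_position)."""
    while True:
        buf = buffers[hand % len(buffers)]
        if buf.usage_count == 0:
            return hand % len(buffers), (hand + 1) % len(buffers)
        buf.usage_count -= 1  # give it one more revolution to be reused
        hand += 1

buffers = [Buffer(p) for p in ("A", "B", "C")]
hand = 0
# First revolution decrements A, B, C from 1 to 0; only on coming around
# again is A evicted -- i.e. two passes over an unaccessed fresh buffer.
victim, hand = clock_sweep(buffers, hand)
print(victim)
print([b.usage_count for b in buffers])
```

With three untouched buffers the hand must decrement all of them before the second visit to slot 0 yields the victim, matching the "2 passes" description above.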
Re: [HACKERS] Let's invent a function to report lock-wait-blocking PIDs
On 21.03.2013 05:36, Tom Lane wrote:
> Simon Riggs <si...@2ndquadrant.com> writes:
>> On 20 March 2013 18:02, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> The API that comes to mind is (name subject to bikeshedding)
>>> pg_blocking_pids(pid int) returns int[]
>>
>> Useful. Can we also have an SRF rather than an array?
>
> I thought about that, but at least for the isolationtester use-case,
> the array result is clearly easier to use. You can get from one to the
> other with unnest() or array_agg(), so I don't really feel a need to
> provide both. Can you generate use-cases where the set-result approach
> is superior?

How about inverting the function into:

    pg_pid_blocked_by(pid int) returns int

It would take a pid as argument, and return the pid of the process that
is blocking the given process. That would feel more natural to me.

- Heikki
Re: [HACKERS] Page replacement algorithm in buffer cache
On Friday, March 22, 2013 12:00 PM Atri Sharma wrote:
> [quoted discussion of clock-sweep and USAGE_COUNT trimmed; see upthread]
>
> Hmm, so in the second pass it gets replaced, right?

Yes.

> I think that if USAGE_COUNT were initialized to the maximum allowed
> value instead of one, we could have a better solution to this problem.

So what is your idea? If you start at the maximum, what will we do for
further accesses to it? Why do you want to give more priority to a
just-loaded page?

> Another, more complex solution could be to introduce an ageing factor
> as well.

With Regards,
Amit Kapila.
Re: [HACKERS] Page replacement algorithm in buffer cache
>> I think that if USAGE_COUNT were initialized to the maximum allowed
>> value instead of one, we could have a better solution to this problem.
>
> So what is your idea? If you start at the maximum, what will we do for
> further accesses to it?

I haven't chalked out a detailed plan yet, and I now think the idea of
initializing USAGE_COUNT to the maximum value is not good at all. I was
just thinking off the top of my head.

> Why do you want to give more priority to a just-loaded page?

I just want it to have more chances to stay, rather than being replaced
early. As I said earlier, a new page could be in high demand in the near
future, and evicting it would lead to repeated replacement and re-reading
of the page, hence causing overhead.

>> Another, more complex solution could be to introduce an ageing factor.

This is the one I think would work out best: add an ageing factor for the
time an entry has spent in the cache, alongside its usage count.

So, what I am proposing here is to add another factor to the clock-sweep
algorithm when it selects victim pages for replacement. Specifically,
victim pages should be selected using the usage_count AND the time the
entry has spent in the cache. This would give priority to pages with many
accesses while not penalizing relatively young pages. If a page is not
accessed for a long time after it was allocated, it would be the ideal
victim for replacement both in terms of USAGE_COUNT and age.

Regards,

Atri
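One way the combined criterion could look is sketched below. Everything here is hypothetical (the `GRACE` window, the dict layout, the function name); it is an illustration of the proposal, not a patch: a buffer at usage_count 0 is only evictable once it has also been resident past a grace period, and if every zero-count buffer is still young we fall back to the current rule.

```python
GRACE = 5  # hypothetical: ticks a page must be resident to be evictable

def pick_victim(buffers, now):
    """buffers: list of dicts with 'usage_count' and 'loaded_at'.
    Prefer pages that are both unused and old; fall back to the plain
    usage_count-only rule if every zero-count page is still young."""
    zero = [i for i, b in enumerate(buffers) if b["usage_count"] == 0]
    aged = [i for i in zero if now - buffers[i]["loaded_at"] >= GRACE]
    candidates = aged or zero      # fall back to current behaviour
    if not candidates:
        return None                # caller decrements counts and retries
    # among candidates, evict the one that has been resident longest
    return min(candidates, key=lambda i: buffers[i]["loaded_at"])

bufs = [
    {"usage_count": 0, "loaded_at": 0},  # old and unused: ideal victim
    {"usage_count": 0, "loaded_at": 9},  # young: protected this round
    {"usage_count": 3, "loaded_at": 1},  # hot: protected by usage_count
]
print(pick_victim(bufs, now=10))
```

The fallback branch is what Amit's question probes: when the only zero-count page is young, this sketch still evicts it, exactly like today.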
Re: [HACKERS] Page replacement algorithm in buffer cache
On Friday, March 22, 2013 4:16 PM Atri Sharma wrote:
> [quoted proposal of adding an ageing factor to clock-sweep trimmed; see
> upthread]

What would you do if the only young page has usage count zero during the
second sweep? I don't think introducing another factor along with usage
count would help much. Have you encountered any workload where relatively
young pages are getting victimized and this is causing performance
issues?

With Regards,
Amit Kapila.
Re: [HACKERS] Page replacement algorithm in buffer cache
> What would you do if the only young page has usage count zero during
> the second sweep?

Umm... the same approach we take when there is no page with usage count
zero in a sweep in the current algorithm?

> I don't think introducing another factor along with usage count would
> help much.

What other approach can we take here?

> Have you encountered any workload where relatively young pages are
> getting victimized and this is causing performance issues?

Not yet. I figured this might be a problem and am designing test cases
for it. I would be glad for some help there, please.

Regards,

Atri
Re: [HACKERS] Page replacement algorithm in buffer cache
On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
>> What would you do if the only young page has usage count zero during
>> the second sweep?
>
> Umm... the same approach we take when there is no page with usage count
> zero in a sweep in the current algorithm?

That would give more priority to a young page than to a more heavily used
page. I don't know if that would be the correct thing to do.

With Regards,
Amit Kapila.
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 4:53 PM, Amit Kapila <amit.kap...@huawei.com> wrote:
> That would give more priority to a young page than to a more heavily
> used page. I don't know if that would be the correct thing to do.

This is my idea: give equal priority to new pages when they enter the
cache, so that they all have an equal chance of being replaced initially.
With time, usage_count becomes the deciding factor in victim selection.

Regards,

Atri
Re: [HACKERS] Should commit_delay be PGC_SIGHUP?
On 22 March 2013 02:14, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Simon Riggs <si...@2ndquadrant.com> writes:
>> Only one setting will be best for the whole cluster, so neither the
>> user nor the DBA gains if a user sets this to a different value than
>> the one that has been determined to be optimal. Since we wait while
>> holding the lock, it is actually harmful to everyone if anybody sets a
>> stupid value, and might even be considered a denial-of-service attack.
>> So there is a very good reason to make this SIGHUP, not just a whim.
>
> Hmm. If a malicious user could hurt performance for other sessions with
> a bad setting of commit_delay, then USERSET is clearly a bad idea. But
> it still seems like it could be SUSET rather than SIGHUP.

Agreed; everybody gets what they want. Committed.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Should commit_delay be PGC_SIGHUP?
On Fri, Mar 22, 2013 at 8:06 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> Hmm. If a malicious user could hurt performance for other sessions
>> with a bad setting of commit_delay, then USERSET is clearly a bad
>> idea. But it still seems like it could be SUSET rather than SIGHUP.
>
> Agreed; everybody gets what they want. Committed.

This is fine with me, too, and I agree that it's warranted... but your
commit message supposes that this behavior is new in 9.3, and I think it
dates to 9.2. I'm not inclined to think the issue is serious enough to
back-patch (and risk breaking current installations), but I thought it
worth mentioning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Page replacement algorithm in buffer cache
On Mar 22, 2013 12:46 PM, Atri Sharma <atri.j...@gmail.com> wrote:
> This is the one I think would work out best: add an ageing factor for
> the time an entry has spent in the cache, alongside its usage count.

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate least frequently used instead of least
recently used. MySQL uses it for its buffer replacement policy. There is
also a clock-sweep based approximation called CLOCK-Pro. Papers
describing and evaluating both are available on the net. The evaluations
in the papers showed significantly better performance for both of those
compared to a regular clock sweep, or even ARC.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create a
patch to capture page-pinning traffic from PostgreSQL (maybe streamed out
into a per-backend file), run it with some production workloads, and use
that to generate testing workloads for the cache replacement policies. I
haven't gotten round to actually doing that, though.

[1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
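The trace-capture-and-replay idea above can be prototyped without touching PostgreSQL at all: record a list of page accesses (standing in for captured pin traffic) and replay it against interchangeable policies, comparing hit counts. The two policies below, plain LRU and a usage-count clock, are simplified stand-ins for clock-sweep, LIRS or CLOCK-Pro; all names are invented for the sketch.

```python
from collections import OrderedDict

def replay_lru(trace, capacity):
    """Replay a page-access trace against a strict-LRU cache."""
    cache, hits = OrderedDict(), 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)       # mark most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return hits

def replay_clock(trace, capacity):
    """Replay the same trace against a usage-count clock sweep."""
    slots, hand, hits = [], 0, 0           # slots: [page, usage_count]
    for page in trace:
        hit = next((s for s in slots if s[0] == page), None)
        if hit:
            hits += 1
            hit[1] = min(hit[1] + 1, 5)    # cap the count, as in Postgres
            continue
        if len(slots) < capacity:
            slots.append([page, 1])
            continue
        while slots[hand][1] > 0:          # sweep until a zero is found
            slots[hand][1] -= 1
            hand = (hand + 1) % capacity
        slots[hand] = [page, 1]
        hand = (hand + 1) % capacity
    return hits

# A cyclic trace slightly larger than the cache: hostile to both policies.
trace = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(replay_lru(trace, 3), replay_clock(trace, 3))
```

Feeding a real captured trace into such a harness is exactly the "figure out whether performance is better in general" experiment; synthetic traces like the one here mostly show how badly a policy degrades on its own worst case.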
Re: [HACKERS] Should commit_delay be PGC_SIGHUP?
On Fri, Mar 22, 2013 at 12:42 PM, Robert Haas <robertmh...@gmail.com> wrote:
> This is fine with me, too, and I agree that it's warranted... but your
> commit message supposes that this behavior is new in 9.3, and I think
> it dates to 9.2.

No, it doesn't. It just missed the deadline for 9.2. I'm happy enough to
have the setting be PGC_SUSET, since that more or less conveys that
commit_delay isn't something that is sensible to set dynamically.

--
Peter Geoghegan
Re: [HACKERS] JSON Function Bike Shedding
On Mon, Mar 18, 2013 at 3:12 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andrew Dunstan <and...@dunslane.net> writes:
>> I've been sitting here for a while mulling none too happily over the
>> debate on the names for the proposed JSON extraction functions. I
>> haven't really been happy with any of the suggestions, much, not least
>> my own original function names, which were really only intended as
>> placeholders. Last night in the still watches I decided I just
>> couldn't go with a function name as almost totally content-free as
>> get(), or even get_text(). And I don't think prepending 'json_' to the
>> name helps much either.
>
> Agreed.
>
>> Just concentrating to start with on those get() functions, in the
>> simple case we really don't need them at all. hstore has the ->
>> operator without documenting the underlying function (fetchval). So
>> maybe we should just do that.
>
> Well, not documenting the underlying function does not relieve you from
> having to name it in a reasonably sane fashion. It still wouldn't do to
> call it get().

How about 'fetch'? Or is that just skirting the content-free aspect?
Agree that 'path' is out (as unnest is out, due to unfortunate semantic
confusion). At the end of the day, get() is simply referencing an array
(either associative or not). Most languages do this with an operator, but
I think fetch is a pretty solid term.

>> * I'd be inclined to stick with json_array_length() and
>>   json_object_keys() - I think they describe pretty well what they do.
>>   hstore's skeys() does more or less the same as json_object_keys(),
>>   so we could use that if we want to be consistent. I don't think it's
>>   a terribly good name though.
>> * json_unnest() should certainly be renamed. Alternatives that come to
>>   mind are json_unfold() or json_elements() or json_array_elements().
>> * json_each(), json_each_as_text(), json_populate_record() and
>>   json_populate_recordset() - to be consistent with hstore we could
>>   remove the json_. We probably should remove the _as_ from
>>   json_each_as_text().

I don't particularly have a dog in this fight, but do we really want
some of these to have a json_ prefix and others not? That's already
baked in, because the 9.2 json functions have the prefix.

I'm still partial to json_unwrap for unnest, but out of Andrew's
suggestions I like json_elements the best. Like removing _as_.

merlin
Re: [HACKERS] JSON Function Bike Shedding
On 03/22/2013 09:29 AM, Merlin Moncure wrote:
> [quoted naming discussion trimmed; see upthread]
>
> I don't particularly have a dog in this fight, but do we really want
> some of these to have a json_ prefix and others not? That's already
> baked in, because the 9.2 json functions have the prefix.

I have finally decided my position on this. I think we have lots of good
precedents for using type names in function names: array functions, xml
functions and enum functions, for example. I think these are the
precedents to follow, rather than hstore. Some people will be unhappy
that this means more typing, but SQL is somewhat verbose anyway, and
whatever we do will make someone unhappy :-)

> I'm still partial to json_unwrap for unnest, but out of Andrew's
> suggestions I like json_elements the best. Like removing _as_.

OK. I can live with that. New version forthcoming soon.

cheers

andrew
Re: [HACKERS] Page replacement algorithm in buffer cache
> However, I think the main issue isn't finding new algorithms that are
> better in some specific circumstances. The hard part is figuring out
> whether their performance is better in general. My idea was to create a
> patch to capture page-pinning traffic from PostgreSQL (maybe streamed
> out into a per-backend file), run it with some production workloads,
> and use that to generate testing workloads for the cache replacement
> policies. I haven't gotten round to actually doing that, though.
>
> [1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm

Thanks for the link. I think LIRS could indeed be helpful in our case. We
should build some test cases for testing this theory. I am all for
capturing page replacement and usage data and analyzing it.

Atri
Re: [HACKERS] Page replacement algorithm in buffer cache
Ants Aasma <a...@cybertec.at> writes:
> You might want to check out the LIRS cache replacement algorithm [1].
> That algorithm tries to estimate least frequently used instead of least
> recently used. MySQL uses it for its buffer replacement policy. There
> is also a clock-sweep based approximation called CLOCK-Pro. Papers
> describing and evaluating both are available on the net. The
> evaluations in the papers showed significantly better performance for
> both of those compared to a regular clock sweep, or even ARC.

I seem to recall that CLOCK-Pro, or something named similarly to that,
was one of the alternatives discussed when we went over to the current
clock-sweep approach. And we definitely looked at ARC. It might be worth
checking the archives from back then to see what's already been
considered.

> However, I think the main issue isn't finding new algorithms that are
> better in some specific circumstances. The hard part is figuring out
> whether their performance is better in general.

Yeah. You can prove almost anything with the right set of test cases :-(

regards, tom lane
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]
Amit Kapila wrote:
> On Friday, March 22, 2013 8:57 AM Alvaro Herrera wrote:
>> Amit Kapila wrote:
>>> I think adding a new syntax change is a little scary for me, not for
>>> the matter of implementation but for building consensus on the
>>> syntax.
>>
>> I cannot but agree on that point.
>
> Sorry, I don't get your point. Do you mean to say that you don't agree
> with me and want the new syntax as proposed by you to be implemented?

On the contrary, I was saying I agree with you on the difficulty of
getting consensus on this. I don't know where I learned this phrase or
even if it's in common usage. After a not-so-quick search I see it
explained here: http://www.perfectyourenglish.com/usage/but.htm

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Let's invent a function to report lock-wait-blocking PIDs
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
> On 21.03.2013 05:36, Tom Lane wrote:
>> The API that comes to mind is (name subject to bikeshedding)
>> pg_blocking_pids(pid int) returns int[]
>
> How about inverting the function into:
> pg_pid_blocked_by(pid int) returns int
> It would take a pid as argument, and return the pid of the process that
> is blocking the given process. That would feel more natural to me.

Hm, I'm not sure that's uniquely defined. In the case I mentioned before
(A has AccessShare, B is blocked waiting for AccessExclusive, C wants
AccessShare and is queued behind B), which of A and B do you think is
blocking C? Whichever answer you choose could be the wrong one for
isolationtester: I think it needs to consider that C is blocked if
*either* A or B is part of its set of test processes. So that's why I
thought an array (or set) result including both A and B would be
appropriate.

AFAICT, what you're proposing isn't the inverse of what I said; it's the
same direction, but you're assuming there's only one blocking process.

regards, tom lane
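Tom's A/B/C example can be made concrete with a small model of the lock queue. This sketch is not pg_blocking_pids' implementation; it only illustrates why the result is naturally a set: a waiter is blocked both by conflicting lock holders and by conflicting waiters queued ahead of it, taken transitively. The conflict table is cut down to the two modes in the example, and all names are invented.

```python
# Minimal lock-conflict table for the two modes in the example.
CONFLICTS = {
    ("AccessShare", "AccessExclusive"): True,
    ("AccessExclusive", "AccessShare"): True,
    ("AccessExclusive", "AccessExclusive"): True,
    ("AccessShare", "AccessShare"): False,
}

def direct_blockers(pid, holders, wait_queue):
    """Holders with a conflicting granted lock, plus conflicting waiters
    queued ahead of pid. holders/wait_queue: lists of (pid, mode)."""
    pos, mode = next((i, m) for i, (p, m) in enumerate(wait_queue)
                     if p == pid)
    out = [p for p, m in holders if CONFLICTS[(m, mode)]]
    out += [p for p, m in wait_queue[:pos] if CONFLICTS[(m, mode)]]
    return out

def blocking_pids(pid, holders, wait_queue):
    """Transitive closure: everything blocking pid, directly or through
    another blocked waiter -- the 'array including both A and B' shape."""
    waiting = {p for p, _ in wait_queue}
    result, todo = set(), [pid]
    while todo:
        p = todo.pop()
        for b in (direct_blockers(p, holders, wait_queue)
                  if p in waiting else []):
            if b not in result:
                result.add(b)
                todo.append(b)
    return sorted(result)

# A holds AccessShare; B waits for AccessExclusive; C wants AccessShare
# and is queued behind B.
holders = [("A", "AccessShare")]
wait_queue = [("B", "AccessExclusive"), ("C", "AccessShare")]
print(blocking_pids("C", holders, wait_queue))   # both A and B block C
```

C's only direct blocker is B (A's AccessShare does not conflict with C's request), but B is in turn blocked by A, so the closure yields both, which is why a single-pid return value is not uniquely defined.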
Re: [HACKERS] Strange Windows problem, lock_timeout test request
On Thu, Mar 21, 2013 at 8:16 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Mar 18, 2013 at 10:09 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Because it's wrong. Removing volatile means that the compiler is
>>> permitted to optimize away stores (and fetches!) on the basis of
>>> their being unnecessary according to straight-line analysis of the
>>> code. Write barriers don't fix that, they only say that stores that
>>> the compiler chooses to issue at all have to be ordered a certain
>>> way.
>>
>> I don't think this is correct. The read and write barriers as
>> implemented are designed to function as compiler barriers also, just
>> as they do in the Linux kernel and every other piece of software I've
>> found that implements anything remotely like this, with the lone
>> exception of PostgreSQL. In PostgreSQL, spinlock acquisition and
>> release are defined as CPU barriers but not compiler barriers, and
>> this necessitates extensive use of volatile all over the code base
>> which would be unnecessary if we did this the way it's done in Linux
>> and elsewhere.
>
> I think you're just as mistaken as Zoltan. Barriers enforce ordering of
> operations, not whether an operation occurs at all.

Surely not. Suppose the user does this:

    some_global_var = 1;
    some_function();
    some_global_var = 2;

I hope we can both agree that any compiler which thinks it doesn't need
to store 1 in some_global_var is completely nuts, because some_function()
might perform any arbitrary computation, including one that depends on
some_global_var being 1 rather than whatever value it had before that.
Of course, if a global optimizer can prove that some_function() can't do
anything that could possibly care about some_global_var, then the store
could be omitted, but not otherwise. Should the compiler omit the store,
and should there then be a bug in the program, we wouldn't say "oh,
some_global_var ought to be declared volatile". We would say that the
compiler is buggy.

Now, conversely, in this situation, it seems to me (and I think you'll
agree) that the compiler could be forgiven for omitting one of the
stores:

    some_global_var = 1;
    local_var = 42;
    some_global_var = 2;

If we declare some_global_var as volatile, it will force the compiler to
perform both stores regardless of what the optimizer thinks; moreover,
the second store is required to happen after the first one, because the
definition of volatile is that volatile references are globally sequenced
with respect TO EACH OTHER. However, the store to local_var could be
moved around with respect to the other two, or omitted altogether, unless
local_var is also declared volatile.

Now, consider this:

    some_global_var = 1;
    pg_compiler_barrier();    /* or in Linux, barrier() */
    some_global_var = 2;

The compiler barrier is exactly equivalent to the unknown function in the
first example, except that no function is actually called. The semantics
are precisely that the compiler must assume that, at the point the
barrier intervenes, an unknown operation will occur which may depend on
the contents of any word in memory and which may modify any word in
memory. Thus, the compiler may not postpone stores requested before the
barrier until after the barrier on the grounds that the values will be
overwritten after the barrier; and if any global variables have been
loaded into registers before the barrier, it must be assumed that, after
the barrier, the registers may no longer match those global variables.

It is true that any barrier, including a compiler barrier, serves only to
separate instructions. But I don't believe it follows in any way that the
barrier therefore prohibits reordering operations but not omitting them
altogether. Such a definition would have no practical utility. The
compiler is not free to willy-nilly leave out things that the user asks
it to do, any more than it is free to willy-nilly reorder them. If it
were, programming would be chaos, and everything would have to be
volatile. What the compiler is free to do is to reorder and omit
instructions where the programmer won't, in a single-threaded execution
context, be able to notice the difference. And a compiler barrier is an
explicit notification that, in essence, the programmer will notice if
things are not just the way he wrote them when that point in the code is
reached.

To see the difference between this and volatile, consider the following:

    a = 1;
    b = 1;
    c = 1;
    a = 2;
    b = 2;
    c = 2;

Absent any special precautions, the compiler may well optimize the first
set of stores away completely and perform the second set in any order it
likes. If we make all the variables volatile, it will do all 6 stores in
precisely the order specified, omitting nothing and reordering nothing.
If we instead stick a compiler barrier just after the assignment c = 1,
then the compiler must store 1 in all three variables (in any order that
it likes), and then store 2 in all three variables (in any order that it
likes). The barrier
Re: [HACKERS] Enabling Checksums
On Fri, Mar 22, 2013 at 3:04 AM, Jeff Davis pg...@j-davis.com wrote: I've been following your analysis and testing, and it looks like there are still at least three viable approaches: 1. Some variant of Fletcher 2. Some variant of CRC32 3. Some SIMD-based checksum Each of those has some open implementation questions, as well. If we settle on one of those approaches, we don't necessarily need the fastest implementation right away. I might even argue that the first patch to be committed should be a simple implementation of whatever algorithm we choose, and then optimization should be done in a separate patch (if it is tricky to get right). +1 on correct first, fast second. Of course, it's hard to settle on the general algorithm to use without knowing the final performance numbers. So right now I'm in somewhat of a holding pattern until we settle on something. For performance the K8 results gave me confidence that we have a reasonably good overview what the performance is like for the class of CPU's that PostgreSQL is likely to run on. I don't think there is anything left to optimize there, all algorithms are pretty close to maximum theoretical performance. Still, benchmarks on AMD's Bulldozer arch and maybe on some non-x86 machines (Power, Itanium, Sparc) would be very welcome to ensure that I haven't missed anything. To see real world performance numbers I dumped the algorithms on top of the checksums patch. I set up postgres with 32MB shared buffers, and ran with concurrency 4 select only pgbench and a worst case workload, results are median of 5 1-minute runs. I used fletcher as it was in the checksums patch without unrolling. Unrolling would cut the performance hit by a third or so. 
The worst case workload is set up using CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10); INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,10); VACUUM ANALYZE sparse; The test query itself is a simple SELECT count(v) FROM sparse; Results for the worst case workload: No checksums: tps = 14.710519 Fletcher checksums: tps = 10.825564 (1.359x slowdown) CRC checksums: tps = 5.844995 (2.517x slowdown) SIMD checksums: tps = 14.062388 (1.046x slowdown) Results for pgbench scale 100: No checksums: tps = 56623.819783 Fletcher checksums: tps = 55282.222687 (1.024x slowdown) CRC Checksums: tps = 50571.324795 (1.120x slowdown) SIMD Checksums: tps = 56608.888985 (1.000x slowdown) So to conclude, the 3 approaches: CRC: Time to checksum 8192 bytes: 12'000 - 16'000 cycles best case without special hardware 1'200 cycles with hardware (new Intel only) Code size: 131 bytes * Can calculate arbitrary number of bytes per invocation, state is 4 bytes. Implementation can be shared with WAL. * Quite slow without hardware acceleration. * Software implementation requires a 8kB table for calculation or it will be even slower. Quite likely to fall out of cache. * If we wish to use hardware acceleration then the polynomial should be switched to Castagnoli. I think the old polynomial needs to stay as the values seem to be stored in indexes by tsvector compression and multibyte trigrams. (not 100% sure, just skimmed the code) * Error detection of 32bit Castagnoli CRC is known to be good, the effect of truncating to 16 bits is not analyzed yet. Fletcher: Time to checksum 8192 bytes: 2'600 cycles +- 100 Code size: 170 bytes unrolled * Very simple implementation for optimal speed. * Needs to calculate 4 bytes at a time, requires 8 bytes of state. Implementation that can work for WAL would be tricky but not impossible. Probably wouldn't share code. * Should give good enough error detection with suitable choice for final recombination. 
SIMD checksums:
Time to checksum 8192 bytes: 730 cycles for processors with 128-bit SIMD units; 1830 cycles for processors with 64-bit SIMD units
Code size: 436 bytes
* Requires vectorization, intrinsics or ASM for decent performance.
* Needs to calculate 128 bytes at a time; requires 128 bytes of state. Using it for anything other than summing fixed-size blocks looks tricky.
* Loosely based on Fowler-Noll-Vo and should have reasonably good error detection capabilities.

Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
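For reference, the basic Fowler-Noll-Vo mixing step that the SIMD checksum is loosely based on is just xor-then-multiply-by-a-prime. A plain scalar FNV-1a with the standard published 32-bit constants looks like this (this is the textbook hash, not the patch's vectorized variant):

```python
def fnv1a_32(data: bytes) -> int:
    """Scalar FNV-1a over bytes, using the published 32-bit constants."""
    h = 0x811C9DC5                            # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte                             # mix the input byte in
        h = (h * 0x01000193) & 0xFFFFFFFF     # multiply by 32-bit FNV prime
    return h
```

The multiply is the expensive part to vectorize, which is why the SIMD variant discussed here has to restructure the computation rather than run FNV-1a verbatim.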
Re: [HACKERS] JSON Function Bike Shedding
On Fri, Mar 22, 2013 at 8:58 AM, Andrew Dunstan and...@dunslane.net wrote: On 03/22/2013 09:29 AM, Merlin Moncure wrote: On Mon, Mar 18, 2013 at 3:12 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: I've been sitting here for a while mulling none too happily over the debate on the names for the proposed JSON extraction functions. I haven't really been happy with any of the suggestions, much, not least my own original function names which were really only intended as placeholders. Last night in the still watches I decided I just couldn't go with a function name as almost totally content-free as get(), or even get_text(). And I don't think prepending 'json_' to the name helps much either. Agreed. Just concentrating to start with on those get() functions, in the simple case we really don't need them at all. hstore has the - operator without documenting the underlying function (fetchval). So maybe we should just do that. Well, not documenting the underlying function does not relieve you from having to name it in a reasonably sane fashion. It still wouldn't do to call it get(). How about 'fetch'? Or is that just skirting the content-free aspect? Agree that 'path' is out (as unnest is out due to unfortunate semantic confusion). At the end of the day, 'get()' is simply referencing an array (either associative or not). Most languages do this with an operator, but I think fetch is a pretty solid term. * I'd be inclined to stick with json_array_length() and json_object_keys() - I think they describe pretty well what they do. hstore's skeys() does more or less the same as json_object_keys(), so we could use that if we want to be consistent. I don't think it's a terribly good name though. * json_unnest() should certainly be renamed. Alternatives that come to mind are json_unfold() or json_elements() or json_array_elements().
* json_each(), json_each_as_text(), json_populate_record() and json_populate_recordset() - to be consistent with hstore we could remove the json_. We probably should remove the _as_ from json_each_as_text(). I don't particularly have a dog in this fight, but do we really want some of these to have a json_ prefix and others not? That's already baked in, because 9.2 json functions have prefix. I have finally decided my position on this. I think we have lots of good precedents for using type names in function names: array functions, xml functions and enum functions, for example. I think these are the precedents to follow, rather than hstore. Some people will be unhappy that this means more typing, but SQL is somewhat verbose anyway, and whatever we do will make someone unhappy :-) I'm still partial to json_unwrap for unnest, but out of Andrew's suggestions I like json_elements the best. Like removing _as_. OK. I can live with that. New version forthcoming soon. Thanks for that! I'm super duper busy lately, but i'd still like to bone up the docs a little bit, so if I can find the time I'd like to squeeze some in before we lock in the beta if that's all right. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Default connection parameters for postgres_fdw and dblink
It struck me while looking at the regression test arrangements for postgres_fdw that as things are set up, the default username for outgoing connections is going to be that of the operating system user running the postmaster. dblink is the same way. Now, this might not be the world's worst default, but it's got to be a pretty bad one. The OS username of the server isn't really supposed to be exposed at all on the SQL level. And to the extent that it matches the bootstrap superuser's SQL name, it's still a non-POLA-satisfying default for any user other than the bootstrap superuser. IMO it would make a lot more sense for the default to be the name of the current database user. Either that, or insist that the outgoing username not be defaultable at all; though I'm not sure we can do the latter without breaking the regression tests, since those are supposed to be agnostic as to the name of the superuser running them. A related issue is that libpq will happily acquire defaults from the server's environment, such as PGPORT. This seems like it's also exposing things that shouldn't be exposed. Unfortunately, I think we're depending on that for the dblink and postgres_fdw regression tests to work at all when the postmaster is listening to a nondefault port (ie, make check). Is there a better way to handle all this? It may be too late to rethink dblink's behavior anyhow, but perhaps it's not too late to change postgres_fdw. I think though that once we let 9.3 out the door, it *will* be too late to make any major changes, because postgres_fdw's usage is going to go through the roof now that it can do remote updates. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange Windows problem, lock_timeout test request
On Fri, Mar 22, 2013 at 2:29 PM, Robert Haas robertmh...@gmail.com wrote: The barrier essentially divides up the code into chunks and requires that those chunks be optimized independently by the compiler without knowledge of what earlier or later chunks are doing. While all this sounds sensible, I would love to see a gcc programmer or llvm programmer actually comment on what they think volatile does and what they want to implement in the compiler. I'm a bit worried that we're making assumptions, like things happening in a specific order, that aren't really justified. In these days of superscalar execution and multi-level caches things may be weirder than we're imagining. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote: For performance, the K8 results gave me confidence that we have a reasonably good overview of what the performance is like for the class of CPUs that PostgreSQL is likely to run on. I don't think there is anything left to optimize there; all algorithms are pretty close to maximum theoretical performance. Great work! The worst case workload is set up using CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10); INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,10); VACUUM ANALYZE sparse; The test query itself is a simple SELECT count(v) FROM sparse; Results for the worst case workload: No checksums: tps = 14.710519 Fletcher checksums: tps = 10.825564 (1.359x slowdown) CRC checksums: tps = 5.844995 (2.517x slowdown) SIMD checksums: tps = 14.062388 (1.046x slowdown) I assume this is in the bad region identified by Greg, where there is no disk activity, but shared_buffers is small, leading to a lot of movement between the OS cache and shared buffers? What do you mean by TPS exactly? If the select query is writing hint bits, then you wouldn't be able to repeat it because they are already set. So are you repeating the creation/loading of the table, as well? Results for pgbench scale 100: No checksums: tps = 56623.819783 Fletcher checksums: tps = 55282.222687 (1.024x slowdown) CRC checksums: tps = 50571.324795 (1.120x slowdown) SIMD checksums: tps = 56608.888985 (1.000x slowdown) So to conclude, the 3 approaches: Great analysis. Still a tough choice. One thing that might be interesting is to look at doing SIMD for both data and WAL. I wonder if that would be a noticeable speedup for WAL full-page writes? That would give greater justification for the extra work it will take (intrinsics/ASM), and it would be a nice win for non-checksum users. I also notice that http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function explicitly mentions adapting FNV to a smaller size.
That gives me a little more confidence. Do you have other links we should read about this approach, or possible weaknesses? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote: So to conclude, the 3 approaches: One other question: assuming that the algorithms use the full 16-bit space, is there a good way to avoid zero without skewing the result? Can we do something like un-finalize (after we figure out that it's zero), compute in an extra salt value, and then re-finalize? That might work for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo because the final value is the same as the state. I'm still slightly concerned about differentiating checksummed pages in the future if we want to offer a transition path, since we no longer use header bits. Avoiding zero might help us there. Hopefully not necessary, but something we might find useful. Also, it would help us identify situations where the checksum is never set. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: And we definitely looked at ARC We didn't just look at it. At least one release used it. Then patent issues were raised (and I think the implementation had some contention problems). -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote: On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: And we definitely looked at ARC We didn't just look at it. At least one release used it. Then patent issues were raised (and I think the implementation had some contention problems). -- greg What is the general thinking? Is it time to start testing again and thinking about improvements to the current algorithm? Regards, Atri -- Regards, Atri l'apprenant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Fri, Mar 22, 2013 at 7:35 PM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote: For performance, the K8 results gave me confidence that we have a reasonably good overview of what the performance is like for the class of CPUs that PostgreSQL is likely to run on. I don't think there is anything left to optimize there; all algorithms are pretty close to maximum theoretical performance. Great work! Thanks. The worst case workload is set up using CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10); INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,10); VACUUM ANALYZE sparse; The test query itself is a simple SELECT count(v) FROM sparse; Results for the worst case workload: No checksums: tps = 14.710519 Fletcher checksums: tps = 10.825564 (1.359x slowdown) CRC checksums: tps = 5.844995 (2.517x slowdown) SIMD checksums: tps = 14.062388 (1.046x slowdown) I assume this is in the bad region identified by Greg, where there is no disk activity, but shared_buffers is small, leading to a lot of movement between the OS cache and shared buffers? What do you mean by TPS exactly? If the select query is writing hint bits, then you wouldn't be able to repeat it because they are already set. So are you repeating the creation/loading of the table, as well? The table is created once; its size is 800MB with one hinted tuple per page. Shared buffers is set to 32MB, and the machine is an Intel Core i5-2500K with 16GB of memory (2 memory channels at 1333MHz; overheads are likely to be larger with faster memory). This is the worst case for an in-memory workload that doesn't fit into shared_buffers, as almost no work is done other than swapping buffer pages in. I think things like bitmap heap scans might show similar characteristics.
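For context on why even the fastest algorithm still registers in this workload, the per-8kB cycle counts quoted earlier in the thread translate into checksum throughput as follows (the 3.3 GHz figure is an assumption for the conversion, being the i5-2500K's nominal clock; the CRC number takes the middle of the quoted software range):

```python
# Convert cycles-per-8kB-block into bytes/cycle and implied GB/s.
BLOCK = 8192        # bytes per page
CLOCK_HZ = 3.3e9    # assumed clock; i5-2500K nominal frequency

for name, cycles in [("CRC (software)", 14000),
                     ("Fletcher", 2600),
                     ("SIMD", 730)]:
    bytes_per_cycle = BLOCK / cycles
    gb_per_s = bytes_per_cycle * CLOCK_HZ / 1e9
    print(f"{name}: {bytes_per_cycle:.2f} B/cycle, ~{gb_per_s:.1f} GB/s")
```

At roughly 11 bytes/cycle the SIMD variant is in the same ballpark as memory bandwidth itself, which fits Ants's observation that page movement, not checksumming, dominates this test.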
Results for pgbench scale 100: No checksums: tps = 56623.819783 Fletcher checksums: tps = 55282.222687 (1.024x slowdown) CRC checksums: tps = 50571.324795 (1.120x slowdown) SIMD checksums: tps = 56608.888985 (1.000x slowdown) So to conclude, the 3 approaches: Great analysis. Still a tough choice. One thing that might be interesting is to look at doing SIMD for both data and WAL. I wonder if that would be a noticeable speedup for WAL full-page writes? That would give greater justification for the extra work it will take (intrinsics/ASM), and it would be a nice win for non-checksum users. Andres showed that switching out the existing CRC for zlib's would result in an 8-30% increase in INSERT-SELECT speed (http://www.postgresql.org/message-id/201005202227.49990.and...@anarazel.de), with the sped-up CRC still showing up as 10% of the profile. So I guess another 5% speedup by doing the CRC 8 bytes at a time instead of the current 4. And another couple percent by using Fletcher or SIMD. I also notice that http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function explicitly mentions adapting FNV to a smaller size. That gives me a little more confidence. Do you have other links we should read about this approach, or possible weaknesses? It mentions that one should use 32-bit FNV and fold it down to 16 bits via xor. This doesn't work here because SSE2 doesn't have pmulld (SSE4.1). I have taken some liberties here by actually doing 64 16-bit FNV-like operations in parallel and then doing an FNV-like combination of them at the end. However, the choices there are concerned with good hashing performance, while for checksums it shouldn't matter much even if the average error detection rate goes from 99.998% to 99.99%, as long as common error scenarios don't match up with the collisions. If we decide to go this route we should definitely research what the consequences for error-detection effectiveness are and what good choices for the prime values would be.
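The scheme Ants describes, many independent 16-bit FNV-like lanes over the block combined with one more multiply-xor round at the end, can be sketched scalar-style like this. The lane count matches his description, but the 16-bit multiplier below is a toy value chosen only so the sketch runs; as he says, the real prime choices need analysis.

```python
import struct

LANES = 64      # 64 parallel 16-bit lanes, per the description upthread
PRIME = 257     # toy 16-bit multiplier; NOT an analyzed or real constant

def lanes_checksum(data: bytes) -> int:
    """Strided FNV-like mixing: word i feeds lane i % LANES, then the
    lane states are folded into a single 16-bit result."""
    assert len(data) % (2 * LANES) == 0
    state = [0] * LANES
    for i, (w,) in enumerate(struct.iter_unpack("<H", data)):
        lane = i % LANES
        state[lane] = ((state[lane] ^ w) * PRIME) & 0xFFFF
    h = 0
    for s in state:                       # FNV-like final combination
        h = ((h ^ s) * PRIME) & 0xFFFF
    return h
```

Because each lane only ever sees every LANES-th word, the per-word work maps directly onto SIMD registers; the serial dependency is confined to the short final fold.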
On the face of it, multiply-by-prime and add/xor looks like it provides pretty good mixing, and resists transposed sequences and zeroed-out blocks. The worst case seems to be bit errors. As far as I can see, this implementation should detect all single-bit errors, but if one of the bit errors is in the MSB, a second single error in the MSB will cancel it out. I haven't done the math, but it should still work out to better than a 99% chance of detecting random 2-bit errors. On Fri, Mar 22, 2013 at 8:00 PM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote: So to conclude, the 3 approaches: One other question: assuming that the algorithms use the full 16-bit space, is there a good way to avoid zero without skewing the result? Can we do something like un-finalize (after we figure out that it's zero), compute in an extra salt value, and then re-finalize? That might work for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo because the final value is the same as the state. Taking the Fletcher or CRC32 result modulo 65521 (the largest 16-bit prime) only gives a very slight skew that shouldn't matter in practice.
Re: [HACKERS] Materialized view assertion failure in HEAD
Kevin Grittner kgri...@ymail.com wrote: Kevin Grittner kgri...@ymail.com wrote: Robert Haas robertmh...@gmail.com wrote: It seems to me that the right place to fix this is in interpretOidsOption(), by returning false rather than default_with_oids whenever the relation is a materialized view. I like it. In working up a patch for this approach, I see that if CREATE FOREIGN TABLE is executed with default_with_oids set to true, it adds an oid column which appears to be always zero in my tests so far (although maybe other FDWs support it?). Do we want to leave that alone? If we're going to add code to ignore that setting for matviews do we also want to ignore it for FDWs? [ thinks... ] I suppose I should post a patch which preserves the status quo for FDWs and treat that as a separate issue. So, rough cut attached. Obviously some docs should be added around this, and I still need to do another pass to make sure I didn't miss anything; but it passes make check-world, make installcheck-world, and the regression database can be dumped and loaded without problem. Comments? Tidied up, further tested, and pushed. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] dump, restore, dump yields differences
As part of testing the matview issues around dumping, I ran `make installcheck-world`, ran pg_dump to dump the regression database, loaded it into a new database, dumped the newly restored database, and compared the output files from the two pg_dump runs. There were a few somewhat surprising differences, attached. Do we want to worry about these, or leave them alone? -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company*** regression.dump 2013-03-22 10:58:28.341475752 -0500 --- rest.dump 2013-03-22 11:00:35.881475455 -0500 *** *** 7878,7887 -- Data for Name: b_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY b_star (class, aa, bb, a) FROM stdin; ! b 3 mumble \N b 4 \N \N ! b \N bumble \N b \N \N \N \. --- 7878,7887 -- Data for Name: b_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY b_star (class, aa, a, bb) FROM stdin; ! b 3 \N mumble b 4 \N \N ! b \N \N bumble b \N \N \N \. *** *** 48950,48959 -- Data for Name: c_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY c_star (class, aa, cc, a) FROM stdin; ! c 5 hi mom \N c 6 \N \N ! c \N hi paul \N c \N \N \N \. --- 48950,48959 -- Data for Name: c_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY c_star (class, aa, a, cc) FROM stdin; ! c 5 \N hi mom c 6 \N \N ! c \N \N hi paul c \N \N \N \. *** *** 49185,49206 -- Data for Name: d_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY d_star (class, aa, bb, cc, dd, a) FROM stdin; ! d 7 grumble hi sunita 0 \N ! d 8 stumble hi koko \N \N ! d 9 rumble \N 1.10009 \N ! d 10 \N hi kristin 10.0098 \N ! d \N crumble hi boris 100.0010005 \N ! d 11 fumble \N \N \N ! d 12 \N hi avi \N \N ! d 13 \N \N 1000.97 \N ! d \N tumble hi andrew \N \N ! d \N humble \N 1.09997 \N ! d \N \N hi ginger 10.0093 \N d 14 \N \N \N \N ! d \N jumble \N \N \N ! d \N \N hi jolly \N \N ! d \N \N \N 100.001 \N d \N \N \N \N \N \. 
--- 49185,49206 -- Data for Name: d_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY d_star (class, aa, a, bb, cc, dd) FROM stdin; ! d 7 \N grumble hi sunita 0 ! d 8 \N stumble hi koko \N ! d 9 \N rumble \N 1.10009 ! d 10 \N \N hi kristin 10.0098 ! d \N \N crumble hi boris 100.0010005 ! d 11 \N fumble \N \N ! d 12 \N \N hi avi \N ! d 13 \N \N \N 1000.97 ! d \N \N tumble hi andrew \N ! d \N \N humble \N 1.09997 ! d \N \N \N hi ginger 10.0093 d 14 \N \N \N \N ! d \N \N jumble \N \N ! d \N \N \N hi jolly \N ! d \N \N \N \N 100.001 d \N \N \N \N \N \. *** *** 59377,59390 -- Data for Name: e_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY e_star (class, aa, cc, ee, e, a) FROM stdin; ! e 15 hi carol -1 \N \N ! e 16 hi bob \N \N \N ! e 17 \N -2 \N \N ! e \N hi michelle -3 \N \N e 18 \N \N \N \N ! e \N hi elisa \N \N \N ! e \N \N -4 \N \N \. --- 59377,59390 -- Data for Name: e_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY e_star (class, aa, a, cc, ee, e) FROM stdin; ! e 15 \N hi carol -1 \N ! e 16 \N hi bob \N \N ! e 17 \N \N -2 \N ! e \N \N hi michelle -3 \N e 18 \N \N \N \N ! e \N \N hi elisa \N \N ! e \N \N \N -4 \N \. *** *** 59415,59437 -- Data for Name: f_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY f_star (class, aa, cc, ee, ff, f, e, a) FROM stdin; ! f 19 hi claire -5 ((1,3),(2,4)) 10 \N \N ! f 20 hi mike -6 \N 10 \N \N ! f 21 hi marcel \N ((11,44),(22,55),(33,66)) 10 \N \N ! f 22 \N -7 ((111,555),(222,666),(333,777),(444,888)) 10 \N \N ! f \N hi keith -8 ((,),(,)) 10 \N \N ! f 24 hi marc \N \N 10 \N \N ! f 25 \N -9 \N 10 \N \N ! f 26 \N \N ((1,3),(2,4)) 10 \N \N ! f \N hi allison -10 \N 10 \N \N ! f \N hi jeff \N ((11,33),(22,44)) 10 \N \N ! f \N \N -11 ((111,333),(222,444)) 10 \N \N ! f 27 \N \N \N 10 \N \N ! f \N hi carl \N \N 10 \N \N ! f \N \N -12 \N 10 \N \N ! f \N \N \N ((,),(,)) 10 \N \N ! f \N \N \N \N 10 \N \N \. 
--- 59415,59437 -- Data for Name: f_star; Type: TABLE DATA; Schema: public; Owner: kgrittn -- ! COPY f_star (class, aa, a, cc, ee, e, ff, f) FROM stdin; ! f 19 \N hi claire -5 \N ((1,3),(2,4)) 10 ! f 20 \N hi mike -6 \N \N 10 ! f 21 \N hi marcel \N \N ((11,44),(22,55),(33,66)) 10 ! f 22 \N \N -7 \N ((111,555),(222,666),(333,777),(444,888)) 10 ! f \N \N hi keith -8 \N ((,),(,)) 10 ! f 24 \N hi marc \N \N \N 10 ! f 25 \N \N -9 \N \N 10 ! f 26 \N \N \N \N ((1,3),(2,4)) 10 ! f \N \N hi allison -10 \N \N 10 ! f \N \N hi jeff \N \N ((11,33),(22,44)) 10 ! f \N \N \N -11 \N
Re: [HACKERS] dump, restore, dump yields differences
Kevin Grittner kgri...@ymail.com writes: As part of testing the matview issues around dumping, I ran `make installcheck-world`, ran pg_dump to dump the regression database, loaded it into a new database, dumped the newly restored database, and compared the output files from the two pg_dump runs. There were a few somewhat surprising differences, attached. Do we want to worry about these, or leave them alone? Those are expected. You can trace the ALTER TABLE history of those tables if you want to see why they're so odd, but basically there are inheritance situations where it's hard to avoid this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote: On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote: On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: And we definitely looked at ARC We didn't just look at it. At least one release used it. Then patent issues were raised (and I think the implementation had some contention problems). -- greg What is the general thinking? Is it time to start testing again and thinking about improvements to the current algorithm? well, what problem are you trying to solve exactly? the main problems I see today are not so much in terms of page replacement but spinlock and lwlock contention. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Default connection parameters for postgres_fdw and dblink
On Fri, 2013-03-22 at 12:19 -0400, Tom Lane wrote: Is there a better way to handle all this? It may be too late to rethink dblink's behavior anyhow, but perhaps it's not too late to change postgres_fdw. I think though that once we let 9.3 out the door, it *will* be too late to make any major changes, because postgres_fdw's usage is going to go through the roof now that it can do remote updates. The first thing that occurs to me is to have postgres_fdw install some GUCs with reasonable defaults. Perhaps the default could be a magic value that is replaced by the current user or something (similar to search_path). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: postgres_fdw vs data formatting GUCs (was Re: [HACKERS] [v9.3] writable foreign tables)
Daniel Farina dan...@heroku.com writes: This contains some edits to comments that referred to the obsolete and bogus TupleDesc scanning. No mechanical alterations. Applied with some substantial revisions. I didn't like where you'd put the apply/restore calls, for one thing --- we need to wait to do the applies until we have the PGresult in hand, else we might be applying stale values of the remote's GUCs. Also, adding a call that could throw errors right before materializeResult() won't do, because that would result in leaking the PGresult on error. The struct for state seemed a bit of a mess too, given that you couldn't always initialize it in one place. (In hindsight I could have left that alone given where I ended up putting the calls, but it didn't seem to be providing any useful isolation.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Materialized view assertion failure in HEAD
Kevin Grittner kgri...@ymail.com writes: In working up a patch for this approach, I see that if CREATE FOREIGN TABLE is executed with default_with_oids set to true, it adds an oid column which appears to be always zero in my tests so far (although maybe other FDWs support it?). Do we want to leave that alone? If we're going to add code to ignore that setting for matviews do we also want to ignore it for FDWs? I don't see any very good reason for the core code to be assuming anything about whether an FDW will support OIDs or not. It would have been noticeably more painful to get postgres_fdw's writable-table support working if the core code had tried to enforce an opinion that foreign tables couldn't have ctid, to take something somewhat comparable. Where I'd like to see things going is to add a hook for FDWs to control what system columns get created for their tables. When that happens, an FDW could throw error about OIDs if it wants to. In the meantime I don't feel a need to change the behavior. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
Merlin Moncure mmonc...@gmail.com writes: On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote: What is the general thinking? Is it time to start testing again and thinking about improvements to the current algorithm? well, what problem are you trying to solve exactly? the main problems I see today are not so much in terms of page replacement but spinlock and lwlock contention. Even back when we last hacked on that algorithm, the concerns were not so much about which pages it replaced as how much overhead and contention was created by the management algorithm. I haven't seen any reason to think we have a problem with the quality of the replacement choices. The proposal to increase the initial usage count would definitely lead to more overhead/contention, though, because it would result in having to circle around all the buffers more times (on average) to get a free buffer. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
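Tom's overhead argument can be illustrated with a toy model of the sweep: for buffers that are never touched again, the number of complete passes the hand makes before a buffer becomes evictable equals its initial usage_count. This is illustrative Python, not PostgreSQL's C implementation, and it ignores pins, concurrent access, and the free list.

```python
def passes_until_eviction(initial_usage_count: int, nbuffers: int = 8) -> int:
    """Count full clock-sweep passes before the first untouched buffer
    can be evicted, starting with all buffers at the same usage count."""
    usage = [initial_usage_count] * nbuffers
    passes = 0
    hand = 0
    while True:
        if usage[hand] == 0:
            return passes           # found a victim buffer
        usage[hand] -= 1            # decrement usage count, keep sweeping
        hand = (hand + 1) % nbuffers
        if hand == 0:
            passes += 1             # hand completed a full circle
```

With the current initial value of 1 a freshly allocated buffer survives one full pass and is reclaimed on the next, matching Amit's two-pass description upthread; initializing at the maximum usage count (5 in current sources) would make the hand circle roughly five times as far before finding a victim.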
Re: [HACKERS] dump, restore, dump yields differences
On Fri, Mar 22, 2013 at 6:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: Those are expected. You can trace the ALTER TABLE history of those tables if you want to see why they're so odd, but basically there are inheritance situations where it's hard to avoid this. Incidentally it would still be cool to have make check do this dance and intelligently compare the before and after. There have been more than one patch where you've caught omissions in pg_dump before applying. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Default connection parameters for postgres_fdw and dblink
Jeff Davis pg...@j-davis.com writes: On Fri, 2013-03-22 at 12:19 -0400, Tom Lane wrote: Is there a better way to handle all this? It may be too late to rethink dblink's behavior anyhow, but perhaps it's not too late to change postgres_fdw. I think though that once we let 9.3 out the door, it *will* be too late to make any major changes, because postgres_fdw's usage is going to go through the roof now that it can do remote updates. The first thing that occurs to me is to have postgres_fdw install some GUCs with reasonable defaults. If there's anything I've learned in the last dozen years, it's that GUCs with application-visible semantic effects are dangerous. If the semantic effects are relevant to security, that's probably even worse. Perhaps the default could be a magic value that is replaced by the current user or something (similar to search_path). That seems like just an overcomplicated form of my suggestion that the default should be the current user's name. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 2:52 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote: What is the general thinking? Is it time to start testing again and thinking about improvements to the current algorithm? well, what problem are you trying to solve exactly? the main problems I see today are not so much in terms of page replacement but spinlock and lwlock contention. Even back when we last hacked on that algorithm, the concerns were not so much about which pages it replaced as how much overhead and contention was created by the management algorithm. I haven't seen any reason to think we have a problem with the quality of the replacement choices. The proposal to increase the initial usage count would definitely lead to more overhead/contention, though, because it would result in having to circle around all the buffers more times (on average) to get a free buffer. yup...absolutely. I have a hunch that the occasional gripes we see about server stalls under high load with read-only (or mostly read-only) workloads are coming from spinlock contention under the lwlock hitting a critical point, effectively shutting the server down until, by chance, the backend holding the lwlock gets lucky and lands the spinlock. I think there is some very low-hanging optimization fruit in the clock sweep loop. First and foremost, I see no good reason why, when scanning pages, we have to spin and wait on a buffer in order to pedantically adjust usage_count. Some simple refactoring there could set it up so that a simple TAS (or even a TTAS with the first test in front of the cache line lock, as is done automatically on x86 IIRC) could guard the buffer and, in the event any lock is detected, simply move on to the next candidate without messing around with that buffer at all.
This could be construed as a 'trylock' variant of a spinlock and might help out with cases where an especially hot buffer is locking up the sweep. This is exploiting the fact that from StrategyGetBuffer we don't need a *particular* buffer, just *a* buffer. I also wonder if we shouldn't (perhaps in addition to the above) resuscitate Jeff Janes's idea to get rid of the lwlock completely and manage everything with spinlocks. Naturally, all of this would have to be confirmed with some very robust testing. merlin
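To make the idea concrete, here is a minimal sketch in C11 of such a 'trylock' sweep. Everything here is illustrative — FakeBufferDesc, the flag-based "spinlock", and the function names are invented for the example and are not PostgreSQL's BufferDesc or s_lock machinery. The point is only the shape: a conditional test-and-set guards the header, and a contended buffer is skipped rather than waited on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdatomic.h>

/* Hypothetical, simplified buffer descriptor -- NOT PostgreSQL's BufferDesc.
 * One atomic flag stands in for the per-buffer header spinlock. */
typedef struct
{
    atomic_flag locked;         /* per-buffer "spinlock" */
    int         usage_count;
    bool        pinned;
} FakeBufferDesc;

#define NBUFFERS 8

/*
 * "Trylock" clock sweep: instead of spinning on a contended buffer header
 * just to decrement usage_count, skip it and move on to the next candidate.
 * Returns the index of a victim buffer, or -1 if none was found within
 * max_passes sweeps over the pool.
 */
static int
sweep_for_victim(FakeBufferDesc *buf, int *clock_hand, int max_passes)
{
    for (int tries = 0; tries < max_passes * NBUFFERS; tries++)
    {
        int         i = *clock_hand;

        *clock_hand = (*clock_hand + 1) % NBUFFERS;

        /* Conditional acquire: if someone else holds the header lock,
         * the buffer is evidently busy -- don't wait, don't decrement. */
        if (atomic_flag_test_and_set(&buf[i].locked))
            continue;

        if (!buf[i].pinned)
        {
            if (buf[i].usage_count == 0)
            {
                atomic_flag_clear(&buf[i].locked);
                return i;       /* found a victim */
            }
            buf[i].usage_count--;
        }
        atomic_flag_clear(&buf[i].locked);
    }
    return -1;
}
```

In the real code the equivalent change would be confined to StrategyGetBuffer()'s sweep loop, since, as noted, any buffer will do there.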
Re: [HACKERS] Page replacement algorithm in buffer cache
Merlin Moncure mmonc...@gmail.com writes: I think there is some very low hanging optimization fruit in the clock sweep loop. first and foremost, I see no good reason why when scanning pages we have to spin and wait on a buffer in order to pedantically adjust usage_count. some simple refactoring there could set it up so that a simple TAS (or even a TTAS with the first test in front of the cache line lock as we done automatically in x86 IIRC) could guard the buffer and, in the event of any lock detected, simply move on to the next candidate without messing around with that buffer at all. This could construed as a 'trylock' variant of a spinlock and might help out with cases where an especially hot buffer is locking up the sweep. This is exploiting the fact that from StrategyGetBuffer we don't need a *particular* buffer, just *a* buffer. Hm. You could argue in fact that if there's contention for the buffer header, that's proof that it's busy and shouldn't have its usage count decremented. So this seems okay from a logical standpoint. However, I'm not real sure that it's possible to do a conditional spinlock acquire that doesn't create just as much hardware-level contention as a full acquire (ie, TAS is about as bad whether it gets the lock or not). So the actual benefit is a bit less clear. regards, tom lane
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 3:16 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: I think there is some very low hanging optimization fruit in the clock sweep loop. first and foremost, I see no good reason why when scanning pages we have to spin and wait on a buffer in order to pedantically adjust usage_count. some simple refactoring there could set it up so that a simple TAS (or even a TTAS with the first test in front of the cache line lock as we done automatically in x86 IIRC) could guard the buffer and, in the event of any lock detected, simply move on to the next candidate without messing around with that buffer at all. This could construed as a 'trylock' variant of a spinlock and might help out with cases where an especially hot buffer is locking up the sweep. This is exploiting the fact that from StrategyGetBuffer we don't need a *particular* buffer, just *a* buffer. Hm. You could argue in fact that if there's contention for the buffer header, that's proof that it's busy and shouldn't have its usage count decremented. So this seems okay from a logical standpoint. However, I'm not real sure that it's possible to do a conditional spinlock acquire that doesn't create just as much hardware-level contention as a full acquire (ie, TAS is about as bad whether it gets the lock or not). So the actual benefit is a bit less clear. well if you do a non-locking test first you could at least avoid some cases (and, if you get the answer wrong, so what?) by jumping to the next buffer immediately. if the non locking test comes good, only then do you do a hardware TAS. you could in fact go further and dispense with all locking in front of usage_count, on the premise that it's only advisory and not a real refcount. so you only then lock if/when it's time to select a candidate buffer, and only then when you did a non locking test first. this would of course require some amusing adjustments to various logical checks (usage_count = 0, heh). 
merlin
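As a sketch of the fully lockless variant of that idea: maintain usage_count with a plain load, a bounds test, and a plain store, accepting the occasional lost update as harmless since the count is only advisory. The read-test-store shape (rather than a blind atomic increment/decrement) is what guarantees the value can never leave its bounds, per the parenthetical above. Names and the clamp value are illustrative, not PostgreSQL code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Advisory usage_count maintained without any lock.  A race may lose an
 * update, which is fine for an advisory counter; because we store v +/- 1
 * computed from a value we already range-checked, the counter can never
 * go outside [0, MAX_USAGE_COUNT]. */
#define MAX_USAGE_COUNT 5

static void
usage_count_bump(atomic_int *usage_count)
{
    int         v = atomic_load_explicit(usage_count, memory_order_relaxed);

    if (v < MAX_USAGE_COUNT)
        atomic_store_explicit(usage_count, v + 1, memory_order_relaxed);
}

static void
usage_count_decay(atomic_int *usage_count)
{
    int         v = atomic_load_explicit(usage_count, memory_order_relaxed);

    if (v > 0)
        atomic_store_explicit(usage_count, v - 1, memory_order_relaxed);
}
```

Locking would then only happen at the moment a candidate with usage_count == 0 is actually selected, after another unlocked pre-test.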
Re: postgres_fdw vs data formatting GUCs (was Re: [HACKERS] [v9.3] writable foreign tables)
On Fri, Mar 22, 2013 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Daniel Farina dan...@heroku.com writes: This contains some edits to comments that referred to the obsolete and bogus TupleDesc scanning. No mechanical alterations. Applied with some substantial revisions. I didn't like where you'd put the apply/restore calls, for one thing --- we need to wait to do the applies until we have the PGresult in hand, else we might be applying stale values of the remote's GUCs. Also, adding a call that could throw errors right before materializeResult() won't do, because that would result in leaking the PGresult on error. Good catches. The struct for state seemed a bit of a mess too, given that you couldn't always initialize it in one place. Yeah, I had to give that up when pushing things around, unless I wanted to push more state down. It used to be neater. (In hindsight I could have left that alone given where I ended up putting the calls, but it didn't seem to be providing any useful isolation.) I studied your commit. Yeah, the idea I had was to try to avoid pushing down a loaded value like a PGconn into the lower level helper functions, but perhaps that economy was a false one after the modifications. Earlier versions used to push down the RemoteGucs struct instead of a full-blown conn to hint at the restricted purpose of that reference. By conceding to this pushdown I think the struct could have remained, as you said, but the difference to clarity is likely marginal. I thought I found a way to not have to widen the parameter list at all, so I preferred that one, but clearly it is wrong, w.r.t. leaks and the not up-to-date protocol state. Sorry you had to root around so much in there to get something you liked, but thanks for going through it. -- fdr
[HACKERS] Cube extension improvement, GSoC
Hello, some time ago I started working on a data search system (about 100-200M records) with queries consisting of several range and equality conditions, e.g.:

WHERE dim1 BETWEEN 128 AND 137
  AND dim2 BETWEEN 4815 AND 162342
  AND dim3 = 42
ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid a full table scan I started using an R-Tree over the cube data type:

CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist (http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), which changes the metric for nearest-neighbor search and allows obtaining cubes ordered by a specific coordinate. Having changed some data types and function/operator definitions I ported this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). After this I could select ordered records right from the index:

SELECT * FROM my_table WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)' ORDER BY cube(array[dim1, dim2, dim3]) * 5;

The main issue with such an approach is the index size. In my case it was about 100GB for 100M records, so index building becomes a very expensive disk-bound operation. For the same reason, reading a large number of records from the index is slow too. I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander Korotkov for help. (I'm very thankful to them and glad that they agreed to meet, listen to me and give useful advice.) Having discussed it, we decided there were several possible improvements for the cube extension:

* Adding point data type support to the cube extension, in order to avoid storing coincident upper-left and lower-right vertices; this may reduce the volume that leaf nodes occupy by almost half.
* Checking the split algorithm with big datasets and, if possible, improving it.
* Teaching the cube extension to store dimensions of different data types. Such an index would be a good alternative to a compound-key B-Tree index (suitable for range queries and data ordering).
* Providing support for KNN with metrics induced by different norms: euclidean, taxicab, p-norm. This can be useful in fields where we can extract a signature: similar-image search, similar-audio search.

I'd like to participate in GSoC with these improvements, and I'm very interested in any comments or suggestions about this feature list.
[HACKERS] Re: DROP OWNED BY fails to drop privileges granted by non-owners (was Re: [GENERAL] Bug, Feature, or what else?)
Tom Lane escribió: I believe the problem is that DROP OWNED for privileges is implemented by calling REVOKE. As noted upthread, when a superuser does REVOKE, it's executed as though the object owner did the REVOKE, so only privileges granted directly by the object owner go away. In this particular example, DROP OWNED BY u1 makes the grant to u1 go away, and then the grant to u2 goes away via cascade ... but DROP OWNED BY u2 fails to accomplish anything at all, because postgres never granted anything directly to u2. We haven't seen this reported before, probably because the use of GRANT OPTIONS isn't very common, but AFAICS it's been wrong since the invention of DROP OWNED. So it seems. I have been mulling this over today, and it seems to me that one way to fix it would be to have ExecGrant_Objecttype() loop over all possible grantors when determining what to revoke when it's called by DROP OWNED. A proof-of-concept is below (RemoveRoleFromObjectACL would set istmt.all_grantors to true, whereas ExecuteGrantStmt leaves it as false and behaves as today). I am not sure about the restrict_and_check_grant() call I left out in the middle of the operation; it seems to me that if we limit DROP OWNED to be called only by a superuser, we can get away without that check. Or maybe we need a new version of that routine that will only apply the bitmask and not raise any error messages. The patch below is just for ExecGrant_Relation(), but as far as I can tell all the ExecGrant_Objecttype() routines do pretty much the same here, so I think it'd be better to refactor the select_best_grantor/ restrict_and_check_grant/merge_acl_with_grant sequence into a common function, and then have that function apply the loop for all grantors. Thoughts?

*************** ExecGrant_Relation(InternalGrant *istmt)
*** 1891,1932 ****
  		AclObjectKind aclkind;

  		/* Determine ID to do the grant as, and available grant options */
! 		select_best_grantor(GetUserId(), this_privileges,
! 							old_acl, ownerId,
! 							&grantorId, &avail_goptions);
!
! 		switch (pg_class_tuple->relkind)
  		{
! 			case RELKIND_SEQUENCE:
! 				aclkind = ACL_KIND_SEQUENCE;
! 				break;
! 			default:
! 				aclkind = ACL_KIND_CLASS;
! 				break;
  		}
! 		/*
! 		 * Restrict the privileges to what we can actually grant, and emit
! 		 * the standards-mandated warning and error messages.
! 		 */
! 		this_privileges =
! 			restrict_and_check_grant(istmt->is_grant, avail_goptions,
! 									 istmt->all_privs, this_privileges,
! 									 relOid, grantorId, aclkind,
! 									 NameStr(pg_class_tuple->relname),
! 									 0, NULL);
! 		/*
! 		 * Generate new ACL.
! 		 */
! 		new_acl = merge_acl_with_grant(old_acl,
! 									   istmt->is_grant,
! 									   istmt->grant_option,
! 									   istmt->behavior,
! 									   istmt->grantees,
! 									   this_privileges,
! 									   grantorId,
! 									   ownerId);

  		/*
  		 * We need the members of both old and new ACLs so we can correct
--- 1895,1962 ----
  		AclObjectKind aclkind;

  		/* Determine ID to do the grant as, and available grant options */
! 		if (!istmt->all_grantors)
  		{
! 			select_best_grantor(GetUserId(), this_privileges,
! 								old_acl, ownerId,
! 								&grantorId, &avail_goptions);
!
! 			switch (pg_class_tuple->relkind)
! 			{
! 				case RELKIND_SEQUENCE:
! 					aclkind = ACL_KIND_SEQUENCE;
! 					break;
! 				default:
! 					aclkind = ACL_KIND_CLASS;
! 					break;
! 			}
!
! 			/*
! 			 * Restrict the privileges to what we can actually grant, and emit
! 			 * the standards-mandated warning and error messages.
! 			 */
! 			this_privileges =
! 				restrict_and_check_grant(istmt->is_grant, avail_goptions,
! 										 istmt->all_privs, this_privileges,
! 										 relOid, grantorId, aclkind,
! 										 NameStr(pg_class_tuple->relname),
!
Re: [HACKERS] SDP query optimizer
Adriano, I have developed a new query optimizer for PostgreSQL and I would like to share it with the community. The optimizer's name is Sampling and Dynamic Programming (SDP). I put it into a plugin developed some years ago, named LJQO: Woah! Way cool. As a warning, we're in the closing throes of version 9.3 right now, so if your code/ideas don't get the attention they deserve, that's why. There is an incomplete project from a few years back to make the non-exhaustive query planner pluggable so that we could use different algorithms. Unfortunately, it was never finished and merged with the core code. Your planner is yet another reason it would be great to complete this. Keep up the good work! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Page replacement algorithm in buffer cache
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote: well if you do a non-locking test first you could at least avoid some cases (and, if you get the answer wrong, so what?) by jumping to the next buffer immediately. if the non locking test comes good, only then do you do a hardware TAS. you could in fact go further and dispense with all locking in front of usage_count, on the premise that it's only advisory and not a real refcount. so you only then lock if/when it's time to select a candidate buffer, and only then when you did a non locking test first. this would of course require some amusing adjustments to various logical checks (usage_count = 0, heh). Moreover, if the buffer happens to miss a decrement due to a data race, there's a good chance that the buffer is heavily used and wouldn't need to be evicted soon anyway. (if you arrange it to be a read-test-inc/dec-store operation then you will never go out of bounds) However, clocksweep and usage_count maintenance is not what is causing contention because that workload is distributed. The issue is pinning and unpinning. There we need an accurate count and there are some pages like index roots that get hit very heavily. Things to do there would be in my opinion convert to a futex based spinlock so when there is contention it doesn't completely kill performance and then try to get rid of the contention. Converting to lock-free pinning won't help much here as what is killing us here is the cacheline bouncing. One way to get rid of contention is the buffer nailing idea that Robert came up with. If some buffer gets so hot that maintaining refcount on the buffer header leads to contention, promote that buffer to a nailed status, let everyone keep their pin counts locally and sometime later revisit the nailing decision and if necessary convert pins back to the buffer header. One other interesting idea I have seen is closeable scalable nonzero indication (C-SNZI) from scalable rw-locks [1]. 
The idea there is to use a tree structure to dynamically stripe access to the shared lock counter when contention is detected. The downside is that a considerable amount of shared memory is needed, so there needs to be some way to limit the resource usage. This is actually somewhat isomorphic to the nailing idea. The issue with the current buffer management algorithm is that it seems to scale badly with increasing shared_buffers. I think the improvements should concentrate on finding out what the problem is there and figuring out how to fix it. A simple idea to test would be to just partition shared buffers along with the whole clock sweep machinery into smaller ones, like the buffer mapping hash tables already are. This should at the very least reduce contention for the clock sweep even if it doesn't reduce work done per page miss. [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
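As a toy illustration of the partitioning idea (all names invented, not PostgreSQL code): give each partition its own clock hand so that backends mapped to different partitions never touch each other's sweep state, much as the buffer mapping hash table is already partitioned.

```c
#include <assert.h>

/* Illustrative partitioned clock sweep: the buffer pool is divided into
 * NUM_PARTITIONS independent regions, each with its own sweep hand (and,
 * in a real implementation, its own lock), so backends contending for a
 * victim mostly touch different cache lines.  Not PostgreSQL code. */
#define NUM_PARTITIONS   4
#define BUFFERS_PER_PART 256

typedef struct
{
    int         hand;           /* this partition's clock hand */
    /* a per-partition lock would live here */
} SweepPartition;

static SweepPartition partitions[NUM_PARTITIONS];

/* Map a backend to a partition, e.g. by its pid. */
static int
pick_partition(int backend_pid)
{
    return backend_pid % NUM_PARTITIONS;
}

/* Advance this backend's partition hand and return the global buffer id
 * it points at; only backends mapped to the same partition contend. */
static int
next_victim_candidate(int backend_pid)
{
    SweepPartition *p = &partitions[pick_partition(backend_pid)];
    int         local = p->hand;

    p->hand = (p->hand + 1) % BUFFERS_PER_PART;
    return pick_partition(backend_pid) * BUFFERS_PER_PART + local;
}
```

The open question this sketch glosses over is the one the paragraph raises: partitioning changes eviction quality (a hot partition evicts sooner than a cold one), so it reduces contention without necessarily reducing work per page miss.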
Re: [HACKERS] SDP query optimizer
On Sat, Mar 23, 2013 at 1:35 AM, Adriano Lange alange0...@gmail.com wrote: I have developed a new query optimizer for PostgreSQL and I would like to share it with the community. The optimizer's name is Sampling and Dynamic Programming (SDP). I put it into a plugin developed some years ago, named LJQO: https://github.com/alange0001/ljqo.git Looks great. Do you happen to have any papers or insight into why SDP works as well as it does? It isn't immediately clear to me why the cheapest left-deep tree is a good heuristic start point for the dynamic programming phase. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]
On Friday, March 22, 2013 7:33 PM Alvaro Herrera wrote: Amit Kapila escribió: On Friday, March 22, 2013 8:57 AM Alvaro Herrera wrote: Amit Kapila escribió: I think adding new syntax change is little scary for me, not for the matter of implementation but for building consensus on syntax. I cannot but agree on that point. Sorry, I don't get your point. Do you mean to say that you don't agree with me and want the new syntax as proposed by you to be implemented? On the contrary, I was saying I agree with you on the difficulty of getting consensus on this. Now, based on the final discussion, I will go ahead and document the behavior rather than using SIGHUP at the end or giving a NOTICE at the end of the command. Let me know if anybody still feels otherwise. Apart from this, during Greg's testing he found a performance problem when running pg_reload_conf() along with my patch; the reason actually turns out to be memory growth. I have attempted to fix it in the patch ctx_growth_fix_v1.patch which is currently attached in the CF queue along with my patch. http://www.postgresql.org/message-id/004801ce216b$e37c3b30$aa74b190$@kapila@huawei.com I think that should be done irrespective of the SET Persistent patch. I don't know where I learned this phrase or even if it's in common usage. After a not-so-quick search I see it explained here: http://www.perfectyourenglish.com/usage/but.htm You were right, I was not able to understand. With Regards, Amit Kapila.
Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified
Hi all, Please find attached a simple example of a bgworker that logs a message each time a SIGTERM or SIGHUP signal is received by it:
- hello signal: processed SIGHUP when SIGHUP is handled by my example
- hello signal: processed SIGTERM when SIGTERM is handled by my example

With the current master code, here is what I get:

$ for i in {1..5}; do pg_ctl reload -D ~/bin/pgsql/master/; sleep 1; done
server signaled
server signaled
server signaled
server signaled
server signaled
$ cat ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_112246.log
LOG: starting background worker process hello signal worker
LOG: database system was shut down at 2013-03-23 11:22:46 JST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: received SIGHUP, reloading configuration files
LOG: received SIGHUP, reloading configuration files
LOG: received SIGHUP, reloading configuration files
LOG: received SIGHUP, reloading configuration files
LOG: received SIGHUP, reloading configuration files

SIGHUP is not received by my bgworker. But SIGTERM is:

$ tail -n 5 ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_112246.log
LOG: autovacuum launcher shutting down
*LOG: hello signal: processed SIGTERM*
LOG: worker process: hello signal worker (PID 2873) exited with exit code 0
LOG: shutting down
LOG: database system is shut down

Now, if I apply my fix and redo the same tests, here is what I get:

$ for i in {1..5}; do pg_ctl reload -D ~/bin/pgsql/master/; sleep 1; done
server signaled
server signaled
server signaled
server signaled
server signaled
$ cat ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_113315.log
LOG: starting background worker process hello signal worker
LOG: database system was shut down at 2013-03-23 11:33:14 JST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: received SIGHUP, reloading configuration files
*LOG: hello signal: processed SIGHUP*
LOG: received SIGHUP, reloading configuration files
*LOG: hello signal: processed SIGHUP*
LOG: received SIGHUP, reloading configuration files
*LOG: hello signal: processed SIGHUP*
LOG: received SIGHUP, reloading configuration files
*LOG: hello signal: processed SIGHUP*
LOG: received SIGHUP, reloading configuration files
*LOG: hello signal: processed SIGHUP*

So SIGHUP is now correctly managed by the bgworker. As well as SIGTERM:

$ pg_ctl stop -D ~/bin/pgsql/master/
waiting for server to shut down done
server stopped
ioltas@nukkle:~/bin/extra(linux OK)$ tail -n 5 ~/bin/pgsql/master/pg_log/postgresql-2013-03-23_113315.log
*LOG: hello signal: processed SIGTERM*
LOG: autovacuum launcher shutting down
LOG: worker process: hello signal worker (PID 13781) exited with exit code 0
LOG: shutting down
LOG: database system is shut down

It would be great to get that fixed. Thanks. 
-- Michael

Attachments: hello_signal.tar.gz, 20130321_bgworker_sighup.patch
Re: [HACKERS] Enabling Checksums
On 03/23/2013 02:00 AM, Jeff Davis wrote: On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote: So to conclude, the 3 approaches: One other question: assuming that the algorithms use the full 16-bit space, is there a good way to avoid zero without skewing the result? Can we do something like un-finalize (after we figure out that it's zero), compute in an extra salt value, and then re-finalize? That might work for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo, because the final value is the same as the state. I'm still slightly concerned about differentiating checksummed pages in the future if we want to offer a transition path, since we no longer use header bits. Avoiding zero might help us there. Hopefully not necessary, but something we might find useful. Avoiding a magic value for 'not checksummed' might help, but IMO zero is a terrible choice for that, since it's one of the most likely things to be written in chunks over good data during some kinds of corruption event (memory overwriting, etc). Making zero a 'not checksummed' magic value would significantly detract from the utility of checksums, IMO. Also, it would help us identify situations where the checksum is never set. Now that seems more useful - ERROR: BUG or disk corruption found. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
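For what it's worth, one standard way to reserve zero without spending a header bit and without meaningfully skewing the distribution is to fold the raw checksum into the range 1..65535, leaving 0 free to mean "no checksum set". A sketch (the function name is invented, and whether this interacts acceptably with a given algorithm's error-detection properties would still need checking):

```c
#include <assert.h>
#include <stdint.h>

/* Fold a raw checksum into 1..65535 so that 0 can serve as a
 * "no checksum set" marker.  Each output value corresponds to almost
 * exactly the same number of inputs, so the skew is negligible. */
static uint16_t
fold_checksum(uint32_t raw)
{
    return (uint16_t) ((raw % 65535) + 1);
}
```

This avoids the un-finalize/re-finalize dance entirely, at the cost of mapping two raw values per 65535 onto the same stored checksum.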
Re: [HACKERS] Page replacement algorithm in buffer cache
Moreover, if the buffer happens to miss a decrement due to a data race, there's a good chance that the buffer is heavily used and wouldn't need to be evicted soon anyway. [...] This should at the very least reduce contention for the clock sweep even if it doesn't reduce work done per page miss. One way to distribute memory contention in the case of spinlocks could be to utilize the fundamentals of NUMA architecture. Specifically, we can let the contending backends spin on local flags instead of on the buffer header flags directly. As access to local cache lines is much cheaper and faster than to memory locations which are far away in NUMA terms, we could potentially reduce the memory overhead for a specific line and reduce the overall overheads as well. Regards. Atri -- Regards, Atri l'apprenant
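The local-flag spinning Atri describes is essentially what queue-based locks such as the MCS lock do. Below is a self-contained C11 sketch, illustrative only (PostgreSQL's s_lock.c spinlocks do not work this way): each waiter enqueues a node and spins on its own waiting flag, so under contention each CPU spins on a cache line it owns instead of hammering the shared lock word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queue lock sketch: waiters form a linked queue and each one
 * spins only on its own node's flag (a local cache line). */
typedef struct mcs_node
{
    _Atomic(struct mcs_node *) next;
    atomic_bool waiting;
} mcs_node;

typedef struct
{
    _Atomic(mcs_node *) tail;   /* last waiter in the queue, or NULL */
} mcs_lock;

static void
mcs_acquire(mcs_lock *lock, mcs_node *me)
{
    mcs_node   *prev;

    atomic_store(&me->next, NULL);
    atomic_store(&me->waiting, true);
    prev = atomic_exchange(&lock->tail, me);
    if (prev != NULL)
    {
        /* Queue behind prev and spin on OUR OWN flag, not the lock word. */
        atomic_store(&prev->next, me);
        while (atomic_load(&me->waiting))
            ;                   /* a cpu_relax() would go here */
    }
}

static void
mcs_release(mcs_lock *lock, mcs_node *me)
{
    mcs_node   *successor = atomic_load(&me->next);

    if (successor == NULL)
    {
        mcs_node   *expected = me;

        /* No known successor: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;
        /* Someone is mid-enqueue; wait for them to link themselves in. */
        while ((successor = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&successor->waiting, false);   /* hand off to successor */
}
```

Whether this pays off for buffer headers is exactly the sort of thing that would need the "very robust testing" mentioned upthread; queue locks add a handoff cost in the uncontended case.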
[HACKERS] pg_dump/restore syntax checking bug?
Hello, In testing some pg_restore functionality I found the following:

postgres@jd-laptop:~$ pg_dump -U postgres -Fc -s --file=foo.sqlc
postgres@jd-laptop:~$ dropdb test;
postgres@jd-laptop:~$ createdb test;
postgres@jd-laptop:~$ pg_restore -d test -P 'by()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test
psql (9.1.8)
Type "help" for help.

test=# select by();
 by
----
 by
(1 row)

test=# select hello();
ERROR: function hello() does not exist
LINE 1: select hello();
               ^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.

The above is as expected.

test=# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type "help" for help.

test=# select hello();
ERROR: function hello() does not exist
LINE 1: select hello();
               ^

This is where I am confused. It didn't restore hello() and it also didn't error that the syntax of the restore command was invalid.

test=# drop function by();
DROP FUNCTION
test=#
test=# q
test-# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type "help" for help.

test=# drop function by();
ERROR: function by() does not exist

by() not restored by above command

test=# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by()','hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type "help" for help.

test=# drop function by();
ERROR: function by() does not exist
test=# select hello();
ERROR: function hello() does not exist
LINE 1: select hello();
               ^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.

by() and hello() also not restored

test=# \q
postgres@jd-laptop:~$ pg_restore -d test -P 'by()' -P'hello()' foo.sqlc
postgres@jd-laptop:~$ psql -U postgres test;
psql (9.1.8)
Type "help" for help.

test=# select hello();
 hello
-------
 hello
(1 row)

test=# select by();
ERROR: function by() does not exist
LINE 1: select by();

hello() restored but by() was not. It appears we need better syntax checking. Sincerely, JD
Re: [HACKERS] pg_dump/restore syntax checking bug?
On Fri, Mar 22, 2013 at 9:35 PM, Joshua D. Drake j...@commandprompt.com wrote: postgres@jd-laptop:~$ pg_restore -d test -P 'by(),hello()' foo.sqlc Note, the pg_restore doc makes no mention of trying to squeeze multiple function prototypes into a single argument as you've done here, or of using multiple -P flags. It appears we need better syntax checking. Can't really argue with this. But if you think these pg_restore examples are bad, try this gem: reindexdb --table='foo; ALTER ROLE limited WITH superuser' Josh