Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-10 12:17:34 +0530, Amit Kapila wrote: +++ b/src/backend/postmaster/bgreclaimer.c A fair number of comments in that file refer to bgwriter... Will fix. Fixed. @@ -0,0 +1,302 @@ +/*- + * + * bgreclaimer.c + * + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It + * attempts to keep regular backends from having to run clock sweep (which + * they would only do when they don't find a usable shared buffer from + * freelist to read in another page). That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. Agreed, but the same is used in the bgwriter file as well; so if we change it here, we might want to change the bgwriter file header as well. I have just changed bgreclaimer for this comment; if the same is required for bgwriter, I can create a separate patch, as that change is not related to this patch, so I thought it better to keep it separate. I'd phrase it as "which they only need to do if they don't find a victim buffer from the freelist". "victim buffer" sounds more like a buffer which it will get from clock sweep; how about "next candidate" (the same is used in the function header of StrategyGetBuffer()). Fixed. In the best scenario all requests + * for shared buffers will be fulfilled from freelist as the background + * reclaimer process always tries to maintain buffers on freelist. However, + * regular backends are still empowered to run clock sweep to find a usable + * buffer if the bgreclaimer fails to maintain enough buffers on freelist. "empowered" sounds strange to me. 'still can run the clock sweep'? No harm in changing it as you are suggesting; however, the same is used in the file header of bgwriter.c as well, so I think let's keep this usage as it is, because there is no correctness issue here. Not changed anything for this comment.
+static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. I have kept these based on bgwriter, so I'm not sure if it's good to change. However, if we want consistent naming, I would like to use something like: ReclaimShutdownHandler ReclaimQuickDieHandler .. .. Changed function names to make them consistent. + /* + * If an exception is encountered, processing resumes here. + * + * See notes in postgres.c about the design of this coding. + */ + if (sigsetjmp(local_sigjmp_buf, 1) != 0) + { + /* Since not using PG_TRY, must reset error stack by hand */ + error_context_stack = NULL; .. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). I will add LWLockReleaseAll() in exception handling as discussed elsewhere in the thread. Done. + + /* Now we can allow interrupts again */ + RESUME_INTERRUPTS(); Other processes sleep for a second here, I think that's a good idea. E.g. that bit: Agreed, will make the change as per suggestion. Done. + /* + * Loop forever + */ + for (;;) + { + int rc; + + + /* + * Backend will signal bgreclaimer when the number of buffers in + * freelist falls below the low water mark of freelist. + */ + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_POSTMASTER_DEATH, + -1); That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Could you please be more specific here? I wasn't sure if any change is required here, so kept the code as it is. +Background Reclaimer's Processing +--------------------------------- .. +Two water mark indicators are used to maintain a sufficient number of buffers +on freelist. The low water mark indicator is used by backends to wake bgreclaimer +when the number of buffers in freelist falls below it.
High water mark +indicator is used by bgreclaimer to move buffers to freelist. For me the description of the high water mark as stated here doesn't seem to explain anything. This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter
Re: [HACKERS] Postgres code for a query intermediate dataset
On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). It is certainly possible (or used to be via snapshot manipulation, I haven't looked at that code in a while sorry) to enable a session to see all the old versions, and is quite a cool idea (Postgres used to have this ability in older versions - called Time Travel). For pure practicality, this can be achieved without any code modifications using TRIGGERs and an extra table (as Gavin alludes to). Do tell us a bit more about what you are wanting to do! Cheers Mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith gregsmithpg...@gmail.com wrote: This looks like it's squashed one of the very fundamental buffer scaling issues though; well done Amit. Thanks. I'll go back to my notes and try to recreate the pathological cases that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync spreading effort I did; get those running again and see how they do on this new approach. I have a decent sized 24 core server that should be good enough for this job. I'll see what I can do. It will be really helpful if you can try out those cases. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 14/09/14 19:00, Amit Kapila wrote: On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith gregsmithpg...@gmail.com wrote: This looks like it's squashed one of the very fundamental buffer scaling issues though; well done Amit. Thanks. I'll go back to my notes and try to recreate the pathological cases that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync spreading effort I did; get those running again and see how they do on this new approach. I have a decent sized 24 core server that should be good enough for this job. I'll see what I can do. It will be really helpful if you can try out those cases. And if you want 'em run on the 60 core beast, just let me know the details and I'll do some runs for you. Cheers Mark
Re: [HACKERS] Postgres code for a query intermediate dataset
On Sunday, September 14, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). Any modifications of that sort are bound to introduce lots of pain, not to mention performance degradation and the added responsibility of ensuring that dead tuples don't bloat up the system (prevent vacuum from running at regular intervals and you can have an xid wraparound). I just mentioned that in case you are planning to go in that direction. If you only want the data, use the triggers as Gavin mentioned. Regards, Atri *l'apprenant*
Re: [HACKERS] run xmllint during build (was Re: need xmllint on borka)
Hello Peter, I've committed the $(missing) use separately, That was simple and is a definite improvement. Tiny detail: the new DBTOEPUB macro definition in src/Makefile.global.in lacks another tab to be nicely aligned with the other definitions. and rebased this patch on top of that. Applied and tested, everything looks fine. The only remaining question is whether the xmllint check should always be called. You stated that it was stricter than sgml processing, so I would think it worth always calling it, but this is really a marginal preference. I think it is okay if some slaves in the build farm do build the various targets. -- Fabien.
Re: [HACKERS] Postgres code for a query intermediate dataset
On 14/09/14 19:25, Atri Sharma wrote: On Sunday, September 14, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). Any modifications of that sort are bound to introduce lots of pain, not to mention performance degradation and the added responsibility of ensuring that dead tuples don't bloat up the system (prevent vacuum from running at regular intervals and you can have a xid wraparound). I just mentioned that in case you are planning to go in that direction. If you only want the data, use the triggers as Gavin mentioned. Obviously in the general case sure - but (as yet) we don't have much idea about Rohit's use case and workload. If retrieving past versions is the *primary* workload bias and high update concurrency is not required then this could well work better than a trigger based solution. And it does not seem too onerous to have the ability to switch this on as required, viz: ALTER TABLE table1 VERSIONING; (or similar syntax) which makes VACUUM leave this table alone. It might make more sense to make such a concept apply to a TABLESPACE instead mind you (i.e things in here are for archive/versioning purposes)... 
Clearly we'd need to see the code for any of this and evaluate whether it is good or terrible, but I'm not seeing the idea as being as bad as stated. Cheers Mark
Re: [HACKERS] Postgres code for a query intermediate dataset
Hi Mark, On Sun, Sep 14, 2014 at 8:57 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. Exactly, one visible version is there per session. But, I want to test my algorithm, in which I myself want to work on all the intermediate versions and find the correct one for the session. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). Yes, and I also want to know the portion of code where I can find all the historical or intermediate versions. It is certainly possible (or used to be via snapshot manipulation, I haven't looked at that code in a while sorry) to enable a session to see all the old versions, and is quite a cool idea (Postgres used to have this ability in older versions - called Time Travel). For pure practicality, this can be achieved without any code modifications using TRIGGERs and an extra table (as Gavin alludes to). Can you explain more about how to start working with triggers? Do tell us a bit more about what you are wanting to do! Cheers Mark -- Regards, Rohit Goyal
Re: [HACKERS] Postgres code for a query intermediate dataset
On Sun, Sep 14, 2014 at 1:30 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 19:25, Atri Sharma wrote: On Sunday, September 14, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). Any modifications of that sort are bound to introduce lots of pain, not to mention performance degradation and the added responsibility of ensuring that dead tuples don't bloat up the system (prevent vacuum from running at regular intervals and you can have a xid wraparound). I just mentioned that in case you are planning to go in that direction. If you only want the data, use the triggers as Gavin mentioned. Obviously in the general case sure - but (as yet) we don't have much idea about Rohit's use case and workload. If retrieving past versions is the *primary* workload bias and high update concurrency is not required then this could well work better than a trigger based solution. And it does not seem too onerous to have the ability to switch this on as required, viz: ALTER TABLE table1 VERSIONING; (or similar syntax) which makes VACUUM leave this table alone. How do you plan to do all that VACUUM does for this table then? 
It seems to me that you are saying to VACUUM that it need not be concerned with table 'A' and you are assuming ownership of all the tasks performed by VACUUM for this table. Seems pretty broken to me, not to mention the performance degradation. Regards, Atri
Re: [HACKERS] Postgres code for a query intermediate dataset
On Sun, Sep 14, 2014 at 1:30 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 19:25, Atri Sharma wrote: On Sunday, September 14, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). Any modifications of that sort are bound to introduce lots of pain, not to mention performance degradation and the added responsibility of ensuring that dead tuples don't bloat up the system (prevent vacuum from running at regular intervals and you can have a xid wraparound). I just mentioned that in case you are planning to go in that direction. If you only want the data, use the triggers as Gavin mentioned. Obviously in the general case sure - but (as yet) we don't have much idea about Rohit's use case and workload. If retrieving past versions is the *primary* workload bias and high update concurrency is not required then this could well work better than a trigger based solution. And it does not seem too onerous to have the ability to switch this on as required, viz: ALTER TABLE table1 VERSIONING; (or similar syntax) which makes VACUUM leave this table alone. 
It might make more sense to make such a concept apply to a TABLESPACE instead mind you (i.e. things in here are for archive/versioning purposes)... What I think can be done is have a tuplestore which holds the delta of updated rows, i.e. only the changes made in an update statement are stored in a tuplestore (it could be a part of RelationData). It should be simple enough to have the tuplestore store the oid of the inserted tuple and the difference between the new tuple and the old tuple. No changes need to be made for the old tuple, since it can be marked as deleted and VACUUM can remove it via the normal logic. Not a clean way, but it should work for what you proposed.
Re: [HACKERS] Postgres code for a query intermediate dataset
On 14/09/14 20:24, Atri Sharma wrote: On Sun, Sep 14, 2014 at 1:30 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 19:25, Atri Sharma wrote: On Sunday, September 14, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz mailto:mark.kirkwood@__catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 05:36, Rohit Goyal wrote: Hi All, I want to work on the code of intermediate dataset of select and update query. For example. Rohit's salary has been updated 4 times, so it has 4 different version of salary. I want to select salary of person named Rohit. Now suppose , in intermediate result, I found 4 different versions of the data. I want to know the code portion which i need to look for working on all 4 versions in dataset. :) Hi Rohit, Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). Any modifications of that sort are bound to introduce lots of pain, not to mention performance degradation and the added responsibility of ensuring that dead tuples don't bloat up the system (prevent vacuum from running at regular intervals and you can have a xid wraparound). I just mentioned that in case you are planning to go in that direction. If you only want the data, use the triggers as Gavin mentioned. Obviously in the general case sure - but (as yet) we don't have much idea about Rohit's use case and workload. If retrieving past versions is the *primary* workload bias and high update concurrency is not required then this could well work better than a trigger based solution. 
And it does not seem too onerous to have the ability to switch this on as required, viz: ALTER TABLE table1 VERSIONING; (or similar syntax) which makes VACUUM leave this table alone. How do you plan to do all that VACUUM does for this table then? It seems to me that you are saying to VACUUM that it need not be concerned with table 'A' and you are assuming ownership of all the tasks performed by VACUUM for this table. Seems pretty broken to me, not to mention the performance degradations. I think the whole point of such a modification is that nothing is done to such tables, as you want to see all the previous versions. Clearly this is less performant for standard workloads...but we are talking about non standard workloads surely... Regards Mark
Re: [HACKERS] Postgres code for a query intermediate dataset
On 14/09/14 20:11, Rohit Goyal wrote: Hi Mark, On Sun, Sep 14, 2014 at 8:57 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. Exactly, one visible version is there per session. But, I want to test my algorithm in which i myself want to work on all the intermediate version and find the correct one for the session. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). yes and I also want to know the portion of code where i can find all the historical or intermediate versions Well that's going to be a bit of a learning curve for you :-), the concept to get familiar with is snapshots (see src/backend/access/heap/heapam.c to get started). It is certainly possible (or used to be via snapshot manipulation, I haven't looked at that code in a while sorry) to enable a session to see all the old versions, and is quite a cool idea (Postgres used to have this ability in older versions - called Time Travel). For pure practicality, this can be achieved without any code modifications using TRIGGERs and an extra table (as Gavin alludes to). Can you explain me more about how to starting working using trigger? I'm not sure we have specific examples in the docs for what you want to do, but generally see http://www.postgresql.org/docs/9.4/static/plpgsql-trigger.html Cheers Mark
Re: [HACKERS] Tips/advice for implementing integrated RESTful HTTP API
FYI, got an initial implementation of http://wiki.postgresql.org/wiki/HTTP_API done in Java (intended to run as a servlet) at https://github.com/bjornharrtell/jdbc-http-server. Feedback is welcome :) Regards, Björn 2014-09-03 1:19 GMT+02:00 Álvaro Hernández Tortosa a...@nosys.es: On 02/09/14 04:47, Dobes Vandermeer wrote: Same idea as PgBouncer or PgPool. The advantage over hacking PgBouncer/PgPool for the job is that Tomcat can already do a lot of what you want using built-in, pre-existing functionality. Connection pool management, low level REST-style HTTP processing, JSON handling etc are all done for you. Yeah, those are nice conveniences but I still think installing Java and getting something to run on startup is a bit more of a hurdle. Better to make life easier up front by having a simple standalone proxy you can compile and run with just whatever is already available on a typical AWS ubuntu environment. If instead of Tomcat you use Jetty, you can embed the whole app+Jetty+dependencies in a single executable JAR, which eases deployment a lot. Installing a JVM in a Ubuntu environment is just one apt-get and even easier if you use CloudFormation for automation. I don't think it is a bad choice at all... you get most of the functionality you want already there, as Craig said, and it's lightweight. Hope it helps, Álvaro
Re: [HACKERS] Postgres code for a query intermediate dataset
Hi Mark Atri, :) Thanks for the reply. But, I think I confused you. I am talking about access using indexes. So, I assume that the B+ tree stores key-value pairs where rohit is the key and all the versions are its value. Another way to think of it: I have a secondary index on emp. name and there are 4 Rohits in the DB. So, now the B+ tree gives me 4 different tuple pointers for each Rohit. I want to know the code portion for this where I can see all 4 tuple pointers before each one incurs an I/O access to fetch its tuple. Are the suggestions still valid? On Sun, Sep 14, 2014 at 10:53 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 14/09/14 20:11, Rohit Goyal wrote: Hi Mark, On Sun, Sep 14, 2014 at 8:57 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Currently in Postgres, these intermediate versions all exist - however a given session can only see one of them. Also VACUUM is allowed to destroy versions that no other transactions can see. Exactly, one visible version is there per session. But, I want to test my algorithm in which i myself want to work on all the intermediate version and find the correct one for the session. So if I'm understanding you correctly, you would like to have some way for a session to see *all* these versions (and I guess preventing VACUUM from destroying them). yes and I also want to know the portion of code where i can find all the historical or intermediate versions Well that's going to be a bit of a learning curve for you :-), the concept to get familiar with is snapshots (see src/backend/access/heap/heapam.c to get started). Thanks, I will read it. :) Can you please tell me some specifics from this c file, if you already know :) It is certainly possible (or used to be via snapshot manipulation, I haven't looked at that code in a while sorry) to enable a session to see all the old versions, and is quite a cool idea (Postgres used to have this ability in older versions - called Time Travel). For pure practicality, this can be achieved without any code modifications using TRIGGERs and an extra table (as Gavin alludes to). Can you explain me more about how to starting working using trigger? I'm not sure we have specific examples in the docs for what you want to do, but generally see http://www.postgresql.org/docs/9.4/static/plpgsql-trigger.html Cheers Mark -- Regards, Rohit Goyal
Re: [HACKERS] Postgres code for a query intermediate dataset
On 14/09/14 21:18, Rohit Goyal wrote: Hi Mark Atri, :) Thanks for reply. But, I think i confused you. I am talking about access using indexes. So, I assume that B+ tree store key-value pair where rohit is the key and all the versions are its value. Another way to think is I have a secondary index on emp. name and there are 4 rohit exist in DB. So, now B+ tree gives me 4 different tuple pointer for each Rohit. I want to know the code portion for this where i can see all 4 tuple pointer before each one have I/O access to fetch its tuple. Are the suggestions still valid? Visibility rules mentioned earlier apply equally to tables and indexes (strictly speaking what happens is index tuples are checked against the relevant tables to see if your session can see them), so discussion of whether tuples are retrieved via index or table scans is not really relevant (i.e. query planning/optimization is separate from tuple visibility). Cheers Mark
Re: [HACKERS] Aussie timezone database changes incoming
On Thu, Sep 11, 2014 at 03:42:14PM +1000, Andrew McNamara wrote: Anyone from down under care to remark about the actual usage of old and new abbreviations? About bloody time! AEST/AEDT/etc are the official abbreviations and are commonly used. They have been increasingly used over the last 20 years or so, and the EST/EDT stuff in the Olson tz database has been a source of annoyance for a very long time, eg: http://thread.gmane.org/gmane.comp.time.tz/2262 Quite likely this change will break stuff, but my feeling is more people will be cheering than screaming. Indeed, this has been a pain in the ass for a long long time. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On 09/13/2014 11:28 PM, Peter Geoghegan wrote: Anyway, attached rough test program implements what you outline. This is for 30,000 32 byte strings (where just the final two bytes differ). On my laptop, output looks like this (edited to only show median duration in each case): Got to be careful to not let the compiler optimize away microbenchmarks like this. At least with my version of gcc, the strcoll calls get optimized away, as do the memcmp calls, if you don't use the result for anything. Clang was even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it optimizes away the loops altogether. Also, there should be a setlocale(LC_ALL, "") call somewhere. Otherwise it runs in C locale, and we don't use strcoll() at all for C locale. After fixing those, it runs much slower, so I had to reduce the number of strings. Here's a fixed program. I'm now getting numbers like this:

(baseline) duration of comparisons without useless memcmp()s: 6.007368 seconds
duration of comparisons with useless memcmp()s: 6.079826 seconds

Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless memcmp() is free with these parameters. Is this the worst case scenario?

- Heikki

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <locale.h>

/* STRING_SIZE does not include NUL byte */
#define STRING_SIZE 32
#define N_STRINGS 2000
#define LAST_N_DIFFER 2

#define INSTR_TIME_SUBTRACT(x,y) \
	do { \
		(x).tv_sec -= (y).tv_sec; \
		(x).tv_usec -= (y).tv_usec; \
		/* Normalize */ \
		while ((x).tv_usec < 0) \
		{ \
			(x).tv_usec += 1000000; \
			(x).tv_sec--; \
		} \
	} while (0)

#define INSTR_TIME_GET_DOUBLE(t) \
	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 1000000.0)

#define Min(x, y) ((x) < (y) ? (x) : (y))

/*
 * Generate single random alphabetical ASCII char. Uses an unseeded rand()
 * call, to ensure inter-run determinism.
 */
static inline char
generate_prandom_printable_char(void)
{
	for (;;)
	{
		char crand = rand() & 0x7F;

		if ((crand >= 'A' && crand <= 'Z') || (crand >= 'a' && crand <= 'z'))
			return crand;
	}
}

/*
 * Generate a random payload string. The returned string is pre-calculated
 * once, to form the bulk of each distinct string.
 */
static inline char *
generate_prandom_payload()
{
	char *payload = malloc(STRING_SIZE + 1);
	int i;

	/*
	 * The final LAST_N_DIFFER bytes will ultimately be clobbered with distinct
	 * bytes, but that occurs separately, per string.
	 */
	for (i = 0; i < STRING_SIZE; i++)
		payload[i] = generate_prandom_printable_char();

	/*
	 * Add NUL here, even though complete strings are not NULL terminated (or,
	 * rather, are copied into a buffer on the stack in order to add a NUL once
	 * per comparison)
	 */
	payload[STRING_SIZE] = '\0';

	return payload;
}

int
main(int argc, const char *argv[])
{
	char **strings = malloc(N_STRINGS * sizeof(char *));
	char *payload = generate_prandom_payload();
	int i, j;
	int nmatches = 0;

	setlocale(LC_ALL, "");

	/* Initialize strings */
	for (i = 0; i < N_STRINGS; i++)
	{
		int j;

		strings[i] = malloc(STRING_SIZE);
		memcpy(strings[i], payload, STRING_SIZE);

		/* Last LAST_N_DIFFER characters randomly vary */
		for (j = LAST_N_DIFFER; j > 0; j--)
		{
			char n_last = generate_prandom_printable_char();

			strings[i][STRING_SIZE - j] = n_last;
		}
		/* Don't terminate -- no room in buffer */
	}

	printf("Strings generated - beginning tests\n");

	for (;;)
	{
		struct timeval before, after;

		/*
		 * Baseline -- no wasted memcmp().
		 *
		 * Compare each string to each other string using only strcoll()
		 */
		gettimeofday(&before, NULL);
		for (i = 0; i < N_STRINGS; i++)
		{
			char *str = strings[i];

			/* Quadratic string comparison */
			for (j = 0; j < N_STRINGS; j++)
			{
				int cmp;
				char buf1[STRING_SIZE + 1];
				char buf2[STRING_SIZE + 1];
				char *other;

				other = strings[j];
				memcpy(buf1, str, STRING_SIZE);
				memcpy(buf2, other, STRING_SIZE);
				buf1[STRING_SIZE] = '\0';
				buf2[STRING_SIZE] = '\0';

				cmp = strcoll(buf1, buf2);
				if (cmp == 0)
					nmatches++;

				/*
				 * memcmp() tie-breaker (equivalent to the one that currently
				 * appears within varstr_cmp()) isn't considered. It's only
				 * relevant to a small number of locales, like hu_HU, where
				 * strcoll() might (rarely) indicate equality in respect of a
				 * pair of non-identical strings. If strcoll() did return 0,
				 * then for most locales it's certain that the first memcmp()
				 * would have worked out.
				 */
			}
		}

		/* Report no-wasted-memcmp case duration */
		gettimeofday(&after, NULL);
		INSTR_TIME_SUBTRACT(after, before);
		printf("(baseline) duration of comparisons without useless memcmp()s: %f seconds\n\n",
			   INSTR_TIME_GET_DOUBLE(after));

		/*
Re: [HACKERS] alter user set local_preload_libraries.
On 9/1/14 7:51 AM, Kyotaro HORIGUCHI wrote: The attached patch simply changes the context for local_... to PGC_USERSET and edits the doc. I had this ready to commit, but then Invent PGC_SU_BACKEND and mark log_connections/log_disconnections that way. was committed in the meantime. Does this affect what we should do with this change? I guess one thing to look into would be whether we could leave local_preload_libraries as PGC_BACKEND and change session_preload_libraries to PGC_SU_BACKEND, and then investigate whether we could allow settings made with ALTER ROLE / SET to change PGC_BACKEND settings. In the meantime, I have committed documentation fixes for the back branches 9.0 .. 9.3. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote: What we are looking for here is uniqueness thus better error detection. Not avalanche effect, nor cryptographically secure, nor bit distribution. As far as I'm aware CRC32C is unbeaten collision wise and time proven. I couldn't find tests with xxhash and crc32 on the same hardware so I spent some time putting together a benchmark (see attachment, to run it just start run.sh) I included a crc32 implementation using SSE4.2 instructions (which works on pretty much any Intel processor built after 2008 and AMD built after 2012), a portable Slice-By-8 software implementation and xxhash since it's the fastest software 32bit hash I know of. Here're the results running the test program on my i5-4200M:

crc sb8: 90444623 elapsed: 0.513688s speed: 1.485220 GB/s
crc hw: 90444623 elapsed: 0.048327s speed: 15.786877 GB/s
xxhash: 7f4a8d5 elapsed: 0.182100s speed: 4.189663 GB/s

The hardware version is insanely fast and works on the majority of Postgres setups, and the fallback software implementation is 2.8x slower than the fastest 32bit hash around. Hopefully it'll be useful in the discussion. Note that all these numbers aren't fully relevant to the use case here. For the WAL - which is what we're talking about and the only place where CRC32 is used with high throughput - the individual parts of a record are pretty darn small on average. So performance of checksumming small amounts of data is more relevant. Mind, that's not likely to go for CRC32, especially not slice-by-8. The cache footprint of the large tables is likely going to be noticeable in non micro benchmarks. Also, while I understand that CRC has a very venerable history and is well studied for transmission type errors, I have been unable to find any research on its applicability to validating file/block writes to a disk drive. Which incidentally doesn't really match what the CRC is used for here. It's used for individual WAL records.
Usually these are pretty small, far smaller than disk/postgres' blocks on average. There's a couple scenarios where they can get large, true, but most of them are small. The primary reason they're important is to correctly detect the end of the WAL. To ensure we're not interpreting half-written records, or records from before the WAL file was overwritten, as valid. While it is, to quote you, unbeaten collision wise, xxhash, both the 32-bit and 64-bit version are its equal. Aha? You take that from the smhasher results? Since there seems to be a lack of research on disk based error detection versus CRC polynomials, it seems likely that any of the proposed hash functions are on an equal footing in this regard. As Andres commented up-thread, xxhash comes along for free with lz4. This is pure handwaving. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] documentation update for doc/src/sgml/func.sgml
On 9/12/14 3:13 PM, Andreas 'ads' Scherbaum wrote: Of course a general rule how to link to WP would be nice ... I think Wikipedia links should be avoided altogether. We can assume that readers are technically proficient enough to look up general technical concepts on their own using a reference system of their choice. In cases where a link is warranted, it is better to construct a proper bibliographic citation to the primary source material, such as an IEEE standard or an academic paper, in a way that will stand the test of time. Another problem, which the bibliography system partially addresses, is that if we patch things like proposed here, we'll end up with inconsistent linking all over the documentation wherever the concept is mentioned. For example, we already make reference to the relevant IEEE standard where floating-point types are first introduced in section 8.1.3, and we probably don't need to repeat floating-point numbers are weird every time they are mentioned later on. Independent of that, it is actually not correct that we use the IEEE's rules, because we don't use any rules; that is up to the operating system/platform. While most platforms indeed do use the IEEE floating-point standard more or less, some don't. Section 8.1.3 tries to point that out.
Re: [HACKERS] alter user set local_preload_libraries.
Peter Eisentraut pete...@gmx.net writes: On 9/1/14 7:51 AM, Kyotaro HORIGUCHI wrote: The attached patch simply changes the context for local_... to PGC_USERSET and edits the doc. I had this ready to commit, but then Invent PGC_SU_BACKEND and mark log_connections/log_disconnections that way. was committed in the meantime. Does this affect what we should do with this change? I guess one thing to look into would be whether we could leave local_preload_libraries as PGC_BACKEND and change session_preload_libraries to PGC_SU_BACKEND, and then investigate whether we could allow settings made with ALTER ROLE / SET to change PGC_BACKEND settings. Yeah, I was wondering about that while I was making the other commit. I did not touch those variables at the time, but it would make sense to restrict them as you suggest. regards, tom lane
Re: [HACKERS] Aussie timezone database changes incoming
Martijn van Oosterhout klep...@svana.org writes: On Thu, Sep 11, 2014 at 03:42:14PM +1000, Andrew McNamara wrote: Quite likely this change will break stuff, but my feeling is more people will be cheering than screaming. Indeed, this has been a pain in the ass for a long long time. It's good news that people think this will be an improvement. I've not dug into the change details to be sure, but I think probably I was overthinking it upthread. We seem to already have some of the new abbreviations installed, and the other ones do not conflict with anything. So we'll just add them and be happy. What we should do with the Australia abbreviations file is re-document it as being appropriate for historical usage only. Anyone who's got that selected will continue to see the behavior they did before. regards, tom lane
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Sun, Sep 14, 2014 at 05:21:10PM +0200, Andres Freund wrote: On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote: Also, while I understand that CRC has a very venerable history and is well studied for transmission type errors, I have been unable to find any research on its applicability to validating file/block writes to a disk drive. Which incidentally doesn't really match what the CRC is used for here. It's used for individual WAL records. Usually these are pretty small, far smaller than disk/postgres' blocks on average. There's a couple scenarios where they can get large, true, but most of them are small. The primary reason they're important is to correctly detect the end of the WAL. To ensure we're not interpreting half-written records, or records from before the WAL file was overwritten, as valid. While it is to quote you unbeaten collision wise, xxhash, both the 32-bit and 64-bit version are its equal. Aha? You take that from the smhasher results? Yes. Since there seems to be a lack of research on disk based error detection versus CRC polynomials, it seems likely that any of the proposed hash functions are on an equal footing in this regard. As Andres commented up-thread, xxhash comes along for free with lz4. This is pure handwaving. Yes. But without research to support the use of CRC32 in this same environment, it is handwaving in the other direction. :) Regards, Ken
Re: [HACKERS] KNN-GiST with recheck
I added the point to polygon distance operator patch to the open CommitFest as ready for committer and added myself as reviewer to both of the patches. I think that for most use cases just some operators require further sorting and some do not. But it could appear one day that some index gives part of its knn answers exactly and part of them inexactly. The same happened with recheck of regular operators: initially the recheck flag was defined in the opclass, but later recheck became a runtime flag. I cannot think of a use case, but it makes sense to add the flag to the distance function just like the consistent function if we will go with this implementation. Cost estimation of GiST is a big problem anyway. It doesn't care (and can't) about the amount of recheck for regular operators. In this patch, the same would be true for knn recheck. The problem is that touching the heap from an access method breaks encapsulation. One idea about this is to do the sorting in other nodes. However, I wonder if that would be overengineering and overhead. In the attached patch I propose a different approach: put the code touching the heap into a separate index_get_heap_values function. Also the new version of the patch includes regression tests and some cleanup. While looking at it I found a bug. It returns the second column in the wrong order when both of the distance functions return recheck = true. A test script is attached to run on the regression database. I tried to fix it but could not. The searchTreeItemDistanceRecheck function is not very easy to follow. I think it deserves more comments. knn-gist-recheck-test-multicolumn.sql Description: application/sql
Re: [HACKERS] proposal: plpgsql - Assert statement
2014-09-09 7:54 GMT+02:00 Craig Ringer cr...@2ndquadrant.com: On 09/05/2014 05:21 PM, Pavel Stehule wrote: *shrug* Doing it in SQL would probably break more stuff. I'm trying to contain the damage. And arguably, this is mostly only useful in PL/PgSQL. I've wanted assertions in SQL enough that I often write trivial wrappers around `raise` in PL/PgSQL for use in `CASE` statements etc. At this moment we have no agreement on syntax, and no requirements for assertions have been defined. I looked at assertions in some languages, and the implementation and design of assertions is really varied. I thought about it, and assertions are not a plpgsql-only issue. They must be supported by core, and by other PLs. There are two usual requests for assertions: a) It shouldn't be possible to handle an assertion exception anywhere; it enforces a ROLLBACK in 100% of cases. b) Assertions should be possible to disable globally. I am not sure it is a good idea, but I can understand that some tests based on queries against data can be a performance issue. An important question is the relation between assertions and exceptions. Are assertions only a shortcut for an exception, or something different? Comments? Regards Pavel -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/14/2014 02:25 PM, Pavel Stehule wrote: a) Isn't possible handle a assertion exception anywhere .. it enforce ROLLBACK in 100% b) Assertions should be disabled globally .. I am not sure, it it is a good idea, but I can understand so some tests based on queries to data can be performance issue. Important question is a relation assertations and exceptions. Is it only shortcut for exception or some different? I think that most data integrity issues can be handled by a well designed database schema that uses UNIQUE, NOT NULL, REFERENCES and CHECK constraints. Assertions are usually found inside of complex code constructs to check values of local variables. I don't think it is even a good idea to implement assertions that can query arbitrary data. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] KNN-GiST with recheck
On Sun, Sep 14, 2014 at 10:09 PM, Emre Hasegeli e...@hasegeli.com wrote: I added the point to polygon distance operator patch to the open CommitFest as ready for committer and added myself as reviewer to both of the patches. Thanks. Cost estimation of GiST is a big problem anyway. It doesn't care (and can't) about amount of recheck for regular operators. In this patch, same would be for knn recheck. The problem is that touching heap from access method breaks incapsulation. One idea about this is to do sorting in another nodes. However, I wonder if it would be an overengineering and overhead. In attached patch I propose a different approach: put code touching heap into separate index_get_heap_values function. Also new version of patch includes regression tests and some cleanup. While looking it at I found a bug. It returns the second column in wrong order when both of the distance functions return recheck = true. Test script attached to run on the regression database. I tried to fix but could not. searchTreeItemDistanceRecheck function is not very easy to follow. I think it deserves more comments. Fixed, thanks. It was a logical error in the comparison function implementation. -- With best regards, Alexander Korotkov. knn-gist-recheck-4.patch Description: Binary data
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Got to be careful to not let the compiler optimize away microbenchmarks like this. At least with my version of gcc, the strcoll calls get optimized away, as do the memcmp calls, if you don't use the result for anything. Clang was even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it optimizes away the loops altogether. I suppose the fact that I saw results that fit my pre-conceived notion of what was happening made me lose my initial concern about that. Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise it runs in C locale, and we don't use strcoll() at all for C locale. Oops. This might be a useful mistake, though -- if the strcoll() using the C locale is enough to make the memcmp() not free, then that suggests that strcoll() is the hiding place for the useless memcmp(), where instructions relating to the memcmp() can execute in parallel to instructions relating to strcoll() that add latency from memory accesses (for non-C locales). With the C locale, strcoll() is equivalent to strcmp()/memcmp(). Commenting out the setlocale(LC_ALL, ) in your revised versions shows something like my original numbers (so I guess my compiler wasn't smart enough to optimize away the strcoll() + memcmp() cases). Whereas, there is no noticeable regression/difference between each case when I run the revised program unmodified. That seems to prove that strcoll() is a good enough hiding place. Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless memcmp() is free with these parameters. Is this the worst case scenario? Other than pushing the differences much much later in the strings (which you surely thought of already), yes. I think it's worse than the worst, because we've boiled this down to just the comparison part, leaving only the strcoll() as a hiding place, which is evidently good enough. 
I thought that it was important that there be an unpredictable access pattern (characteristic of quicksort), so that memory latency is added here and there. I'm happy to learn that I was wrong about that, and that a strcoll() alone hides the would-be memcmp() latency. Large strings matter much less anyway, I think. If you have a pair of strings both longer than CACHE_LINE_SIZE bytes, and the first CACHE_LINE_SIZE bytes are identical, and the lengths are known to match, it seems like a very sensible bet to anticipate that they're fully equal. So in a world where that affects the outcome of this test program, I think it still changes nothing (if, indeed, it matters at all, which it appears not to anyway, at least with 256 byte strings). We should probably do a fully opportunistic memcmp() == 0 within varstr_cmp() itself, so that Windows has the benefit of this too, as well as callers like compareJsonbScalarValue(). Actually, looking at it closely, I think that there might still be a microscopic regression, as there might have also been with my variant of your SQL test case [1] - certainly in the noise, but perhaps measurable with enough runs. If there is, that seems like an acceptable price to pay. When I test this stuff, I'm now very careful about power management settings on my laptop...there are many ways to be left with egg on your face with this kind of benchmark. [1] http://www.postgresql.org/message-id/cam3swzqy95sow00b+zjycrgmr-uf1mz8ryv4_ou2encvstn...@mail.gmail.com -- Peter Geoghegan
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 12 September 2014 18:19, Simon Riggs si...@2ndquadrant.com wrote: On 12 September 2014 15:30, Tom Lane t...@sss.pgh.pa.us wrote: After a little bit I remembered there was already a function for this. So specifically, I'd suggest using ExecRelationIsTargetRelation() to decide whether to mark the scan as requiring pruning. Sounds cool. Thanks both, this is sounding like a viable route now. Yes, this is viable. Patch attached, using Alvaro's idea of use-case specific pruning and Tom's idea of aiming at target relations. The patch uses or extends existing infrastructure, so it's shorter than it might have been, yet with all that bufmgr yuck removed. This is very, very good because while going through this I noticed the dozen or more places where we were pruning blocks in annoying places I didn't even know about, such as 4-5 constraint checks. In more than a few DDL commands like ALTER TABLE and CLUSTER we were even pruning the old relation prior to rewrite. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services hot_disable.v5.patch Description: Binary data
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 12 September 2014 03:56, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thomas Munro wrote: But to reach the case you mentioned, it would need to get past that (xmax is not a valid transaction) but then the tuple would need to be locked by another session before heap_lock_tuple is called a few lines below. That's a race scenario that I don't believe we can create using advisory lock tricks in an isolation test. Hm, are you able to reproduce it using GDB? Craig Ringer was saying elsewhere that there are other cases that are impossible to test reliably and was proposing adding hooks or something to block backends at convenient times. Not an easy problem ... +1, I think that is a great idea. FWIW here's some throwaway code that I used to do that:

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 79667f1..fbb3b55 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -54,6 +54,7 @@
 #include "storage/lmgr.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
+#include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
@@ -2029,6 +2030,20 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 	}
 
 	/*
+	 * Begin wait point debugging hack...
+	 * TODO: Only in a special build mode...
+	 * We tell anyone waiting that we have reached wait point #42.
+	 * We wait for permission to proceed from wait point #43.
+	 */
+	elog(WARNING, "XXX reached point 42, waiting at point 43");
+	DirectFunctionCall1(pg_advisory_unlock_int8, Int64GetDatum(42));
+	DirectFunctionCall1(pg_advisory_lock_int8, Int64GetDatum(43));
+	elog(WARNING, "XXX continuing after point 43");
+	/*
+	 * End wait point debugging hack.
+	 */
+
+	/*
 	 * This is a live tuple, so now try to lock it.
 	 */
 	test = heap_lock_tuple(relation, tuple,

Using the attached isolation spec, that race case is reached.
Yeah, it's crude and confusing having those three advisory locks (one to allow an update chain to be created after s1 takes a snapshot, and the other two so that s2 can block s1 at the right point to produce that race case), but I found this less messy than trying to reproduce complicated concurrency scenarios with GDB. IMHO it would be great if there were a tidy and supported way to do this kind of thing, perhaps with a formal notion of named wait points which are only compiled in in special test builds, and an optional set of extra isolation specs that use them. I attach some additional minor suggestions to your patch. Please feel free to reword comments differently if you think my wording isn't an improvement (or I've made English mistakes). Thanks, these are incorporated in the new version (also rebased). Great, thanks; I'll look at it again soon to commit, as I think we're done now. Thanks! Thomas Munro

# Test SKIP LOCKED with an updated tuple chain, race case with wait at
# control point #42

setup
{
  CREATE TABLE foo (
    id int PRIMARY KEY,
    data text NOT NULL
  );
  INSERT INTO foo VALUES (1, 'x'), (2, 'x');
}

teardown
{
  DROP TABLE foo;
}

session "s1"
setup
{
  -- we hold a lock that s2c can wait for
  SELECT pg_advisory_lock(42);
  BEGIN;
}
step "s1a" { SELECT * FROM foo WHERE pg_advisory_lock(0) IS NOT NULL ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED; }
step "s1b" { COMMIT; }

session "s2"
step "s2a"
{
  -- first, block s1a from running after it has taken its snapshot
  SELECT pg_advisory_lock(0);
}
step "s2b"
{
  -- generate an update chain and commit
  UPDATE foo SET data = data WHERE id = 1;
}
step "s2c"
{
  -- unblock s1a so that it starts running but make it wait at control point 43
  SELECT pg_advisory_lock(43);
  SELECT pg_advisory_unlock(0);
}
step "s2d"
{
  -- wait for s1a to reach wait point 42
  SELECT pg_advisory_lock(42);
  -- now lock the tuple and hold the lock
  BEGIN;
  UPDATE foo SET data = data WHERE id = 1;
  -- and finally allow s1a to continue
  SELECT pg_advisory_unlock(43);
}
step "s2e" { COMMIT; }

permutation "s2a" "s1a" "s2b" "s2c" "s2d" "s1b" "s2e"
Re: [HACKERS] Sequence Access Method WIP
On 18/11/13 11:50, Heikki Linnakangas wrote: I don't think the sequence AM should be in control of 'cached'. The caching is done outside the AM. And log_cnt probably should be passed to the _alloc function directly as an argument, ie. the server code asks the AM to allocate N new values in one call. I'm thinking that the alloc function should look something like this: seqam_alloc(Relation seqrel, int nrequested, Datum am_private) I was looking at this a bit today and what I see is that it's not that simple. Minimum input the seqam_alloc needs is: - Relation seqrel - int64 minv, maxv, incby, bool is_cycled - these are basically options giving info about how the new numbers are allocated (I guess some implementations are not going to support all of those) - bool is_called - the current built-in sequence generator behaves differently based on it and I am not sure we can get over it (it could perhaps be done in back-end independently of AM?) - int64 nrequested - number of requested values - Datum am_private - current private data In this light I agree with what Andres wrote - let's just send the whole Form_pg_sequence object. Also makes me think that the seqam options interface should also be passed the minv/maxv/incby/is_cycled etc options for validation, not just the amoptions. And it should return: int64 value - the first value allocated. int nvalues - the number of values allocated. am_private - updated private data. 
There is more needed than this; you need:
- int64 value - first value allocated (value to be returned)
- int64 nvalues - number of values allocated
- int64 last - last cached value (used for cached/last_value)
- int64 next - last logged value (used for wal logging)
- am_private - updated private data, must be possible to return as null
I personally don't like that we need all of nvalues, next and last, as it makes the seqam a little bit too aware of the sequence logging internals in my opinion, but I haven't found a way around it - it's impossible for the backend to know how the AM will act around incby/maxv/minv/cycling, so it can't really calculate these values by itself, unless of course we fix the behavior and require seqams to behave predictably, but that somewhat breaks the whole idea of leaving the allocation to the seqam. Obviously it would also work to return a list of allocated values and then the backend could calculate value, nvalues, last, next from that list by itself, but I am worried about the performance of that approach. The backend code handles the caching and logging of values. When it has exhausted all the cached values (or doesn't have any yet), it calls the AM's alloc function to get a new batch. The AM returns the new batch, and updates its private state as necessary. Then the backend code updates the relation file with the new values and the AM's private data. WAL-logging and checkpointing is the backend's responsibility. Agreed here. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 14/09/2014 12:21, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote: What we are looking for here is uniqueness thus better error detection. Not avalanche effect, nor cryptographically secure, nor bit distribution. As far as I'm aware CRC32C is unbeaten collision wise and time proven. I couldn't find tests with xxhash and crc32 on the same hardware so I spent some time putting together a benchmark (see attachment, to run it just start run.sh) I included a crc32 implementation using SSE4.2 instructions (which works on pretty much any Intel processor built after 2008 and AMD built after 2012), a portable Slice-By-8 software implementation and xxhash since it's the fastest software 32bit hash I know of. Here're the results running the test program on my i5-4200M:

crc sb8: 90444623 elapsed: 0.513688s speed: 1.485220 GB/s
crc hw: 90444623 elapsed: 0.048327s speed: 15.786877 GB/s
xxhash: 7f4a8d5 elapsed: 0.182100s speed: 4.189663 GB/s

The hardware version is insanely fast and works on the majority of Postgres setups, and the fallback software implementation is 2.8x slower than the fastest 32bit hash around. Hopefully it'll be useful in the discussion. Note that all these numbers aren't fully relevant to the use case here. For the WAL - which is what we're talking about and the only place where CRC32 is used with high throughput - the individual parts of a record are pretty darn small on average. So performance of checksumming small amounts of data is more relevant. Mind, that's not likely to go for CRC32, especially not slice-by-8. The cache footprint of the large tables is likely going to be noticeable in non micro benchmarks. Indeed, the small input sizes are something I was missing. Something more cache friendly would be better; it's just a matter of finding a better candidate.
Although I find it highly unlikely that the 4kb extra table of sb8 brings its performance down to sb4 level, even considering the small inputs and cache misses. For what it's worth, mysql, cassandra, kafka, ext4, xfs all use crc32c checksums in their WAL/journals. Also, while I understand that CRC has a very venerable history and is well studied for transmission type errors, I have been unable to find any research on its applicability to validating file/block writes to a disk drive. Which incidentally doesn't really match what the CRC is used for here. It's used for individual WAL records. Usually these are pretty small, far smaller than disk/postgres' blocks on average. There's a couple scenarios where they can get large, true, but most of them are small. The primary reason they're important is to correctly detect the end of the WAL. To ensure we're not interpreting half-written records, or records from before the WAL file was overwritten, as valid. While it is to quote you unbeaten collision wise, xxhash, both the 32-bit and 64-bit version are its equal. Aha? You take that from the smhasher results? Since there seems to be a lack of research on disk based error detection versus CRC polynomials, it seems likely that any of the proposed hash functions are on an equal footing in this regard. As Andres commented up-thread, xxhash comes along for free with lz4. This is pure handwaving. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: BRIN indexes (was Re: [HACKERS] Minmax indexes)
On 08/09/14 13:02, Alvaro Herrera wrote: Here's version 18. I have renamed it: These are now BRIN indexes. I have fixed numerous race conditions and deadlocks. In particular I fixed this problem you noted: Heikki Linnakangas wrote: Another race condition: If a new tuple is inserted to the range while summarization runs, it's possible that the new tuple isn't included in the tuple that the summarization calculated, nor does the insertion itself update it. I did it mostly in the way you outlined, i.e. by way of a placeholder tuple that gets updated by concurrent inserters and then the tuple resulting from the scan is unioned with the values in the updated placeholder tuple. This required the introduction of one extra support proc for opclasses (pretty simple stuff anyhow). There should be only minor items left now, such as silencing the WARNING: concurrent insert in progress within table sales which is emitted by IndexBuildHeapScan (possibly thousands of times) when doing a summarization of a range being inserted into or otherwise modified. Basically the issue here is that IBHS assumes it's being run with ShareLock in the heap (which blocks inserts), but here we're using it with ShareUpdateExclusive only, which lets inserts in. There is no harm AFAICS because of the placeholder tuple stuff I describe above. Debugging VACUUM VERBOSE ANALYZE over a table being concurrently updated/inserted into:

(gdb) Breakpoint 1, errfinish (dummy=0) at elog.c:411
411	ErrorData *edata = errordata[errordata_stack_depth];

The complete backtrace is at http://pastebin.com/gkigSNm7 Also, I found pages with an unknown type (using default parameters for the index creation):

 brin_page_type | array_agg
----------------+-----------
 unknown (00)   | {3,4}
 revmap         | {1}
 regular        | {2}
 meta           | {0}
(4 rows)

-- Emanuel Calvo @3manuek
Re: [HACKERS] pgbench throttling latency limit
On Sat, Sep 13, 2014 at 4:25 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: [about logging...] Here is an attempt at updating the log features, including the aggregate and sampling stuff, with skipped transactions under throttling. I moved the logging stuff into a function which is called when a transaction is skipped or finished. From a log file format perspective, I think that "-" would be better than "skipped". I like "skipped". That seems a lot more clear, and less likely to get parsed as a numeric value by a careless regex like [+-]\d* -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Obsolete comment within execTuples.c
Header comments within execTuples.c state: * - ExecutePlan() calls ExecSelect(), which passes the result slot * to printtup(), which uses slot_getallattrs() to extract the * individual Datums for printing. This isn't true, though - the function ExecSelect() no longer exists. Commit 8a5849b7ff24c637a1140c26fc171e45c9142005 removed it as part of the infrastructure work for data-modifying CTEs. In fact, ExecutePlan now directly and indifferently calls the receiveSlot callback; the tcop/dest stuff takes care of calling printtup(), which I guess is what was originally intended by this comment (by which I mean, that part probably isn't obsolete). -- Peter Geoghegan
Re: [HACKERS] Obsolete comment within execTuples.c
Peter Geoghegan p...@heroku.com writes: Header comments within execTuples.c state: * - ExecutePlan() calls ExecSelect(), which passes the result slot * to printtup(), which uses slot_getallattrs() to extract the * individual Datums for printing. This isn't true, though - the function ExecSelect() no longer exists. Hmm ... the adjacent comment about EndPlan is obsolete too, since that function is now called ExecEndPlan. I think the basic problem in both cases is that it didn't occur to us to go looking for references to a static function in files other than the one containing the static function. More generally, though, it seems like the header comments in execTuples.c are not the best place to document global behavior ... regards, tom lane
Re: [HACKERS] Obsolete comment within execTuples.c
On Sun, Sep 14, 2014 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: More generally, though, it seems like the header comments in execTuples.c are not the best place to document global behavior ... Yeah, my thoughts exactly. -- Peter Geoghegan
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 09/14/2014 09:27 AM, k...@rice.edu wrote: Thank you for running this sample benchmark. It definitely shows that the hardware version of the CRC is very fast; unfortunately it is really only available on x64 Intel/AMD processors, which leaves all the rest lacking. We're talking about something that'd land in 9.5 at best and, going by the adoption rates I see, get picked up slowly over the next couple of years by users. Given that hardware support is already widespread now, I'm not at all convinced that this is a problem. In mid-2015 we'd be talking about 4+ year old AMD CPUs and Intel CPUs that are 6+ years old. In a quick search around I did find one class of machine I have access to that doesn't have SSE 4.2 support. Well, two if you count the POWER7 boxes. It is a type of pre-OpenStack slated-for-retirement RackSpace server with an Opteron 2374. People on older, slower hardware won't get a big performance boost when adopting a new PostgreSQL major release on their old gear. This doesn't greatly upset me. It'd be another thing if we were talking about something where people without the required support would be unable to run the Pg release or take a massive performance hit, but that doesn't appear to be the case here. So I'm all for taking advantage of the hardware support. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] orangutan seizes up during isolation-check
On Tue, Sep 02, 2014 at 12:25:39AM -0400, Tom Lane wrote: Noah Misch n...@leadboat.com writes: Buildfarm member orangutan has failed chronically on both of the branches for which it still reports, HEAD and REL9_1_STABLE, for over two years. The postmaster appears to jam during isolation-check. Dave, orangutan currently has one such jammed postmaster for each branch. Could you gather some information about the running processes? What's particularly odd is that orangutan seems to be running an only slightly out-of-date OS X release, which is hardly an unusual configuration. My own laptop gets through isolation-check just fine. Seems like there must be something nonstandard about orangutan's software ... but what? Agreed. The difference is durable across OS X releases, because orangutan showed like symptoms under 10.7.3. Dave assisted me off-list with data collection and experimentation. Ultimately, --enable-nls was the key distinction, the absence of which spares the other OS X buildfarm animals. The explanation for ECONNREFUSED was more pedestrian than the reasons I had guessed. There were no jammed postmasters running as of the above writing. Rather, the postmasters were gone, but the socket directory entries remained. That happens when the postmaster suffers a kill -9, a SIGSEGV, an assertion failure, or a similar abrupt exit. When I reproduced the problem, CountChildren() was attempting to walk a corrupt BackendList. Sometimes, the list had an entry such that e->next == e; these send CountChildren() into an infinite loop. Other times, testing if (bp->dead_end) prompted a segfault. That explains orangutan sometimes failing quickly and other times hanging for hours. Every crash showed at least two threads running in the postmaster. Multiple threads bring trouble in the form of undefined behavior for fork() w/o exec() and for sigprocmask().
The postmaster uses sigprocmask() to block most signals when doing something nontrivial; this allows it to do such nontrivial work in signal handlers. A sequence of 74 buildfarm runs caught 27 cases of a secondary thread running a signal handler, 14 cases of two signal handlers running at once, and one user-visible postmaster failure. libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific APIs to determine the default locale when $LANG and similar environment variables are empty, as they are during make check NO_LOCALE=1. On OS X, it calls[1] CFLocaleCopyCurrent(), which in turn spins up a thread. See the end of this message for the postmaster thread stacks active upon hitting a breakpoint set at _dispatch_mgr_thread. I see two options for fixing this in pg_perm_setlocale(LC_x, ""): 1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale name through a pipe, and pass that name to setlocale() in the original process. The short-lived child will get the extra threads, and the postmaster will remain clean. 2. On OS X, check for relevant environment variables. Finding none, set LC_x=C before calling setlocale(LC_x, ""). A variation is to raise ereport(FATAL) if sufficient environment variables aren't in place. Either way ensures the libintl setlocale() will never call CFLocaleCopyCurrent(). This is simpler than (1), but it entails a behavior change: LANG= initdb will use LANG=C or fail rather than use the OS X user account locale. I'm skeptical of the value of looking up locale information using other OS X facilities when the usual environment variables are inconclusive, but I see no clear cause to reverse that decision now. I lean toward (1).
Thanks,
nm

[1] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/localename.c;h=78dc344bba191417855670fb751210d3608db6e6;hb=HEAD#l2883

thread #1: tid = 0xeccea9, 0x7fff9066b372 libsystem_notify.dylib`notify_register_check + 30, queue = 'com.apple.main-thread'
  frame #0: 0x7fff9066b372 libsystem_notify.dylib`notify_register_check + 30
  frame #1: 0x7fff987cf261 libsystem_info.dylib`__si_module_static_ds_block_invoke + 109
  frame #2: 0x7fff944d628d libdispatch.dylib`_dispatch_client_callout + 8
  frame #3: 0x7fff944d61fc libdispatch.dylib`dispatch_once_f + 79
  frame #4: 0x7fff987cf1f2 libsystem_info.dylib`si_module_static_ds + 42
  frame #5: 0x7fff987cec65 libsystem_info.dylib`si_module_with_name + 60
  frame #6: 0x7fff987cf0e7 libsystem_info.dylib`si_module_config_modules_for_category + 168
  frame #7: 0x7fff987cedbd libsystem_info.dylib`__si_module_static_search_block_invoke + 87
  frame #8: 0x7fff944d628d libdispatch.dylib`_dispatch_client_callout + 8
  frame #9: 0x7fff944d61fc libdispatch.dylib`dispatch_once_f + 79
  frame #10: 0x7fff987ced64 libsystem_info.dylib`si_module_static_search + 42
  frame #11: 0x7fff987cec65 libsystem_info.dylib`si_module_with_name + 60
  frame #12: 0x7fff987d0cf2 libsystem_info.dylib`getpwuid +