Re: [HACKERS] Replication vs. float timestamps is a disaster
>>>>> "TL" == Tom Lane <t...@sss.pgh.pa.us> writes:

TL> The question to be asked is whether there is still anybody out there
TL> using float timestamps.

Gentoo's ebuild includes:

    $(use_enable !pg_legacytimestamp integer-datetimes) \

meaning that by default --enable-integer-datetimes is passed to configure,
but if the pg_legacytimestamp USE flag is set, then
--disable-integer-datetimes is passed instead.

They document it as:

    Use double precision floating-point numbers instead of 64-bit
    integers for timestamp storage.

I.e., without any kind of deprecation notice. I don't know how many (how
few?) add pg_legacytimestamp to USE when merging postgresql. But it is
still available as of 9.6 and also with their live build of
git://git.postgresql.org/git/postgresql.git.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6
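For anyone wanting to check which mode a given server was built with, the
read-only integer_datetimes setting reports the build-time choice:

    -- "on" means 64-bit integer timestamps; "off" means float timestamps.
    SHOW integer_datetimes;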
[HACKERS] Multiple synchronous_standby_names rules
Hello,

When working with a production (P) and a DR (D) environment it is often a
requirement to be able to protect against data loss when promoting within a
site, and also when losing site P and promoting a master at D. The current
synchronous_standby_names setting does not allow this.

In a simple example we could take the following nodes:

    P1 (current master), P2, P3
    D1, D2, D3

Where P1 is replicating to (P2, P3, D1, D2, D3).

The closest synchronous_standby_names setting you could get to my use case
would be:

    1 (D1, D2, D3)

This would allow the loss of either site without losing data - however it
would not allow promotion within site P from P1 -> (P2 | P3) without the
potential for data loss.

What is needed to support this is the ability to configure Px with
something like:

    1 (P1, P2, P3), 1 (D1, D2, D3)

Would there be any appetite for this - or would it be seen as an
over-complication of the current rules?

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
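For concreteness, a sketch of the closest valid setting today versus the
proposed form, written as configuration statements. The second statement
uses the hypothetical multi-group syntax from this mail and is not accepted
by any released server:

    -- today: a single group with one quorum count
    ALTER SYSTEM SET synchronous_standby_names = '1 (D1, D2, D3)';
    SELECT pg_reload_conf();

    -- proposed (hypothetical): one quorum rule per site
    -- ALTER SYSTEM SET synchronous_standby_names = '1 (P1, P2, P3), 1 (D1, D2, D3)';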
Re: [HACKERS] One process per session lack of sharing
On 15/07/2016 09:28, Craig Ringer wrote:
> I don't think anyone's considering moving from multi-processing to
> multi-threading in PostgreSQL. I really, really like the protection that
> the shared-nothing-by-default process model gives us, among other things.

As I understand it, the main issue is that it is hard to integrate
extensions that use heavyweight runtimes and are focussed on isolation
within a virtual machine. It's not just...

Perhaps it would be possible for the postmaster (or a delegate process) to
host such a runtime, and find a way for a user process that wants to use
such a runtime to communicate with it, whether by copying function
parameters over RPC or by sharing some of its address space explicitly to
the runtime to operate on directly. Such a host delegate process could be
explicitly built with multithread support and not 'infect' the rest of the
code with its requirements.

Using granular RPC is nice for isolation, but I am concerned that the
latencies might be high.
Re: [HACKERS] Parallel Aggregate
Good news!

On Tuesday, 22 March 2016, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 22 March 2016 at 02:35, Robert Haas <robertmh...@gmail.com> wrote:
>> I have committed this after changing some of the comments.
>>
>> There might still be bugs ... but I don't see them. And the speedups
>> look very impressive.
>>
>> Really nice work, David.
>
> Thanks for that, and thank you for taking the time to carefully review
> it and commit it.
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] Choosing parallel_degree
OK cool, thanks.

Can we remove the minimum size limit when the per-table degree setting is
applied? This would help for tables with 2 - 1000 pages combined with a
high CPU cost aggregate.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Sun, Mar 20, 2016 at 11:23 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 18 March 2016 at 10:13, James Sewell <james.sew...@lisasoft.com> wrote:
>> This does bring up an interesting point I don't quite understand though.
>> If I run parallel agg on a table with 4 rows with 2 workers will it run
>> on two workers (2 rows each) or will the first one grab all 4 rows?
>
> It works on a per page basis, workers just each grab the next page to
> be scanned from a page counter that sits in shared memory, the worker
> just increments the page number, releases the lock on the counter and
> scans that page.
>
> See heap_parallelscan_nextpage()
>
> So the answer to your question is probably no. At least not unless the
> page only contained 2 rows.
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Choosing parallel_degree
On Thu, Mar 17, 2016 at 5:05 AM, Julien Rouhaud <julien.rouh...@dalibo.com> wrote:
>
> attached v3 drops the GUC part.

This looks good. I do think that some threshold control would be good in
the long term - but you are right Robert, it just feels strange. Maybe once
the final formula is implemented in 9.7+ and this gets some real world use
cases it can be revisited?

One thing I really, really like about the way the new patch works is that
you can set parallel_degree on an inheritance parent, then that will set
the minimum workers for all associated children (when accessing from the
parent).

Currently this patch will not fire on small tables even when
parallel_degree is set; can we fix this by adding a check for
ref->parallel_degree to the table size condition?

Cheers,
James
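For reference, a sketch of the per-table setting under discussion (the
reloption was called parallel_degree in this patch series; it shipped in
9.6 under the name parallel_workers, and the table name below is just
illustrative):

    -- request a fixed number of workers for one table, sidestepping the
    -- size-based formula:
    ALTER TABLE parent_table SET (parallel_degree = 4);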
Re: [HACKERS] Parallel Aggregate
Hi again,

This is probably me missing something, but is there a reason parallel
aggregate doesn't seem to ever create append nodes containing Index scans?

SET random_page_cost TO 0.2;
SET max_parallel_degree TO 8;

postgres=# explain SELECT sum(count_i) FROM base GROUP BY view_time_day;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=310596.32..310598.03 rows=31 width=16)
   Group Key: view_time_day
   ->  Sort  (cost=310596.32..310596.79 rows=186 width=16)
         Sort Key: view_time_day
         ->  Gather  (cost=310589.00..310589.31 rows=186 width=16)
               Number of Workers: 5
               ->  Partial HashAggregate  (cost=310589.00..310589.31 rows=31 width=16)
                     Group Key: view_time_day
                     ->  Parallel Seq Scan on base  (cost=0.00..280589.00 rows=600 width=12)

SET max_parallel_degree TO 0;

postgres=# explain SELECT sum(count_i) FROM base GROUP BY view_time_day;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.56..600085.92 rows=31 width=16)
   Group Key: view_time_day
   ->  Index Only Scan using base_view_time_day_count_i_idx on base  (cost=0.56..450085.61 rows=3000 width=12)
(3 rows)

Cheers,

James Sewell, Solutions Architect

On Thu, Mar 17, 2016 at 8:08 AM, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 17 March 2016 at 01:29, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Mar 16, 2016 at 8:19 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> Isn't it better to call it as Parallel Aggregate instead of Partial
>>> Aggregate. Initially, we have kept Partial for seqscan, but later on
>>> we changed to Parallel Seq Scan, so I am not able to think why it is
>>> better to call Partial in case of Aggregates.
>>
>> I think partial is the right terminology. Unlike a parallel
>> sequential scan, a partial aggregate isn't parallel-aware and could be
>> used in contexts having nothing to do with parallelism. It's just
>> that it outputs transition values instead of a finalized value.
>
> +1. The reason the partial aggregate patches have been kept separate
> from the parallel aggregate patches is that partial aggregate will
> serve for many other purposes. Parallel Aggregate is just one of many
> possible use cases for this, so it makes little sense to give it a
> name according to a single use case.
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Choosing parallel_degree
Hey,

I think there are definitely use cases for using parallel agg on a small
table when the time for each agg operation is very high. PostGIS can be
used to create many examples of low row count and table size but high CPU
operations.

This does bring up an interesting point I don't quite understand though. If
I run parallel agg on a table with 4 rows with 2 workers will it run on two
workers (2 rows each) or will the first one grab all 4 rows?

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Thu, Mar 17, 2016 at 10:30 PM, Julien Rouhaud <julien.rouh...@dalibo.com> wrote:
> On 17/03/2016 12:21, David Rowley wrote:
>> On 18 March 2016 at 00:13, Julien Rouhaud <julien.rouh...@dalibo.com> wrote:
>>> With the current threshold, you need a table bigger than 8 MB to be
>>> able to force parallel workers. I'm not sure there'll be benefits for
>>> multiple workers on a table smaller than 8 MB, since setting up all
>>> the parallel stuff takes time.
>>
>> It would be really nice if it were possible to drop the setting really
>> low, so that combined with a low parallel_setup_cost we could enable
>> parallel query on small tables in the regression test suite.
>
> Indeed. That could also be a use case for moving parallel_threshold to a
> GUC, but not sure what'd be best.
>
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
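For reference, a sketch of the cost knobs mentioned above for coaxing a
parallel plan onto a tiny table (max_parallel_degree is the development-era
name used throughout this thread, and tiny_table is a placeholder):

    SET parallel_setup_cost = 0;   -- treat worker launch as free
    SET parallel_tuple_cost = 0;   -- treat tuple transfer as free
    SET max_parallel_degree = 2;
    EXPLAIN SELECT count(*) FROM tiny_table;

Even with these settings, the hard-coded page threshold being debated here
can still suppress workers on very small tables.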
Re: [HACKERS] Choosing parallel_degree
On Wed, Mar 16, 2016 at 11:26 AM, Julien Rouhaud <julien.rouh...@dalibo.com> wrote:
>
> I'm not too familiar with parallel planning, but I tried to implement
> both in attached patch. I didn't put much effort into the
> parallel_threshold GUC documentation, because I didn't really see a good
> way to explain it. I'd be happy to improve it if needed. Also, to make
> this parameter easier to tune for users, perhaps we could divide the
> default value by 3 and use it as is in the first iteration in
> create_parallel_path() ?
>
> Also, global max_parallel_degree still needs to be at least 1 for the
> per table value to be considered.

All applies and works from my end.

Is the max_parallel_degree per table of much use here? It allows the max
number of workers per table to be set - but it's still bound by the same
formula (now from the GUC). So in reality it's only really useful for
limiting the number of workers, not raising it.

Would there be a common use case for limiting parallelism on a subset of
tables in a database you've explicitly set to have a higher amount of
parallel operations via the GUC? I struggle to think of one.

I think in practicality the reverse would be more common: you'd want to set
certain tables to a starting point of a certain number of workers (and ramp
up to more if the formula allowed it). You could set this to 0 to never use
parallel agg on a given table.

Another option is to allow access to the threshold multiplier (currently
hard coded to 3) per table - but this might become pretty hard to explain
succinctly in the documentation.

Cheers,
James
Re: [HACKERS] Choosing parallel_degree
Thanks David,

Eventually it would be great to take into account the cost of the function
doing the agg (pg_proc.procost, which is a multiple of CPU units). This
would allow people to mark specific aggregations as needing more CPU power,
therefore needing more workers per page (or should it be tuple in this
case?). In the meantime some way to manually influence this would be good.

I just did some testing (on an 8VCPU machine) with a 139MB table, which
gets 3 workers currently. For a count(*) I get a time of 131.754 ms. If I
increase this to 8 workers I get around 86.193 ms.

Obviously this doesn't mean much as YMMV - but it does show that the
ability to manually adjust the scaling would be great, especially in data
warehouse or reporting environments.

I did want to test with some really slow aggs, but even when I take out the
small table test in create_parallel_paths I can't seem to get a parallel
plan for a tiny table. Any idea on why this would be, David?

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Tue, Mar 15, 2016 at 12:25 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
> Over in [1] James mentioned wanting to be able to have more influence
> over the partial path's parallel_degree decision. At risk of a
> discussion on that hijacking the parallel aggregate thread, I thought
> I'd start this for anyone who would want to discuss making changes to
> that.
>
> I've attached a simple C program which shows the parallel_degree which
> will be chosen at the moment. For now it's based on the size of the
> base relation. Perhaps that will need to be rethought later, perhaps
> based on costs. But I just don't think it's something for 9.6.
>
> Here's the output of the C program.
>
> For 1 pages there will be 1 workers (rel size 0 MB, 0 GB)
> For 3001 pages there will be 2 workers (rel size 23 MB, 0 GB)
> For 9001 pages there will be 3 workers (rel size 70 MB, 0 GB)
> For 27001 pages there will be 4 workers (rel size 210 MB, 0 GB)
> For 81001 pages there will be 5 workers (rel size 632 MB, 0 GB)
> For 243001 pages there will be 6 workers (rel size 1898 MB, 1 GB)
> For 729001 pages there will be 7 workers (rel size 5695 MB, 5 GB)
> For 2187001 pages there will be 8 workers (rel size 17085 MB, 16 GB)
> For 6561001 pages there will be 9 workers (rel size 51257 MB, 50 GB)
> For 19683001 pages there will be 10 workers (rel size 153773 MB, 150 GB)
> For 59049001 pages there will be 11 workers (rel size 461320 MB, 450 GB)
> For 177147001 pages there will be 12 workers (rel size 1383960 MB, 1351 GB)
> For 531441001 pages there will be 13 workers (rel size 4151882 MB, 4054 GB)
> For 1594323001 pages there will be 14 workers (rel size 12455648 MB, 12163 GB)
>
> [1] http://www.postgresql.org/message-id/CANkGpBtUvzpdvF2=_iq64ujmvrpycs6d4i9-wepbusq1sq+...@mail.gmail.com
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
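The quoted thresholds triple at each step, so (assuming the patch-era base
threshold of 1000 heap pages) they match:

    workers(p) = 1 + max(0, floor(log3(p / 1000)))

i.e. each additional worker requires roughly three times as many table
pages, which is the "logarithmically based on the size" rule quoted
elsewhere in this thread.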
Re: [HACKERS] Parallel Aggregate
On Tue, Mar 15, 2016 at 9:32 AM, Robert Haas wrote:
>
> I kind of doubt this would work well, but somebody could write a patch
> for it and try it out.

OK, I'll give this a go today and report back.

Would the eventual plan be to use pg_proc.procost for the functions from
each aggregate concerned? If so I might have a peek at that too, although I
imagine I won't get far.

Cheers,
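For context, a sketch of where that cost already lives and how a user can
set it today (st_union is just an illustrative PostGIS name, and the COST
figure is arbitrary; nothing in the planner used it for worker counts at
this point):

    -- per-function planner cost estimates live in pg_proc.procost:
    SELECT proname, procost FROM pg_proc WHERE proname = 'st_union';
    -- raise it so cost-based decisions treat the function as CPU-heavy:
    ALTER FUNCTION st_union(geometry, geometry) COST 1000;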
Re: [HACKERS] Parallel Aggregate
On Tuesday, 15 March 2016, Robert Haas wrote:
>> Does the cost of the aggregate function come into this calculation at
>> all? In PostGIS land, much smaller numbers of rows can generate loads
>> that would be effective to parallelize (worker time much >> than
>> startup cost).
>
> Unfortunately, no - only the table size. This is a problem, and needs
> to be fixed. However, it's probably not going to get fixed for 9.6.
> :-(

Any chance of getting a GUC (say min_parallel_degree) added to allow
setting the initial value of parallel_degree, then changing the small
relation check to also pass if parallel_degree > 1?

That way you could set min_parallel_degree on a query by query basis if you
are running aggregates which you know will take a lot of CPU.

I suppose it wouldn't make much sense at all to set globally though, so it
could just confuse matters.

Cheers,
Re: [HACKERS] Parallel Aggregate
On Mon, Mar 14, 2016 at 3:05 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
>
> Things to try:
> 1. alter table a add column ts_date date; update a set ts_date =
>    date_trunc('DAY',ts); vacuum full analyze ts;
> 2. or, create index on a (date_trunc('DAY',ts)); analyze a;
> 3. or for testing, set the work_mem higher.

Ah, that makes sense.

Tried with a BTREE index, and it works perfectly - but the index is 428MB,
which is a bit rough. Removed that and put on a BRIN index, same result for
48kB - perfect!

Thanks for the help,
James
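For reference, a sketch of the two index definitions being compared (index
names are assumed, and the sizes are the ones reported above for this
particular data set):

    CREATE INDEX a_ts_day_btree ON a (date_trunc('DAY', ts));            -- ~428 MB here
    CREATE INDEX a_ts_day_brin  ON a USING brin (date_trunc('DAY', ts)); -- ~48 kB here
    SELECT pg_size_pretty(pg_relation_size('a_ts_day_brin'));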
Re: [HACKERS] Parallel Aggregate
Hi again,

I've been playing around with inheritance combined with this patch.
Currently it looks like you are taking max(parallel_degree) from all the
child tables and using that for the number of workers. For large machines
it makes much more sense to use sum(parallel_degree) - but I've just seen
this comment in the code:

    /*
     * Decide what parallel degree to request for this append path. For
     * now, we just use the maximum parallel degree of any member. It
     * might be useful to use a higher number if the Append node were
     * smart enough to spread out the workers, but it currently isn't.
     */

Does this mean that even though we are aggregating in parallel, we are only
operating on one child table at a time currently?

Cheers,

James Sewell, Solutions Architect

On Mon, Mar 14, 2016 at 2:39 PM, James Sewell <james.sew...@lisasoft.com> wrote:
> Cool,
>
> I've been testing how this works with partitioning (which seems to be
> strange, but I'll post separately about that) and something odd seems to
> be going on now with the parallel triggering:
>
> postgres=# create table a as select * from base_p2015_11;
> SELECT 2000
>
> postgres=# select * from a limit 1;
>              ts             | count |  a  |  b   |  c   |  d   | e
> ----------------------------+-------+-----+------+------+------+---
>  2015-11-26 21:10:04.856828 |   860 | 946 | 1032 | 1118 | 1204 |
> (1 row)
>
> postgres=# \d a
>             Table "datamart_owner.a"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  ts     | timestamp without time zone |
>  count  | integer                     |
>  a      | integer                     |
>  b      | integer                     |
>  c      | integer                     |
>  d      | integer                     |
>  e      | integer                     |
>
> postgres=# select pg_size_pretty(pg_relation_size('a'));
>  pg_size_pretty
> ----------------
>  1149 MB
>
> postgres=# explain select sum(count) from a group by date_trunc('DAY',ts);
>                                       QUERY PLAN
> --------------------------------------------------------------------------------------
>  Finalize GroupAggregate  (cost=218242.96..218254.46 rows=200 width=16)
>    Group Key: (date_trunc('DAY'::text, ts))
>    ->  Sort  (cost=218242.96..218245.96 rows=1200 width=16)
>          Sort Key: (date_trunc('DAY'::text, ts))
>          ->  Gather  (cost=218059.08..218181.58 rows=1200 width=16)
>                Number of Workers: 5
>                ->  Partial HashAggregate  (cost=217059.08..217061.58 rows=200 width=16)
>                      Group Key: date_trunc('DAY'::text, ts)
>                      ->  Parallel Seq Scan on a  (cost=0.00..197059.06 rows=405 width=12)
> (9 rows)
>
> postgres=# analyze a;
>
> postgres=# explain select sum(count) from a group by date_trunc('DAY',ts);
>                             QUERY PLAN
> ------------------------------------------------------------------
>  GroupAggregate  (cost=3164211.55..3564212.03 rows=2024 width=16)
>    Group Key: (date_trunc('DAY'::text, ts))
>    ->  Sort  (cost=3164211.55..3214211.61 rows=2024 width=12)
>          Sort Key: (date_trunc('DAY'::text, ts))
>          ->  Seq Scan on a  (cost=0.00..397059.30 rows=2024 width=12)
> (5 rows)
>
> Unsure what's happening here.
>
> James Sewell, PostgreSQL Team Lead / Solutions Architect
>
> On Mon, Mar 14, 2016 at 1:31 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
>> On 14 March 2016 at 14:52, James Sewell <james.sew...@lisasoft.com> wrote:
>>> One question - how is the upper limit of workers chosen?
>>
>> See create_parallel_paths() in allpaths.c. Basically the bigger the
>> relation (in pages) the more workers will be allocated, up until
>> max_parallel_degree.
>>
>> There is also a comment in that function which states:
>> /*
>>  * Limit the degree of parallelism logarithmically based on the size of
>>  * the relation. This probably needs to be a good deal more
>>  * sophisticated, but we need something here for now.
>>  */
>>
>> So this will likely see some revision at some point, after 9.6.
>>
>> --
>> David Rowley   http://www.2ndQuadrant.com/
Re: [HACKERS] Parallel Aggregate
Cool,

I've been testing how this works with partitioning (which seems to be
strange, but I'll post separately about that) and something odd seems to be
going on now with the parallel triggering:

postgres=# create table a as select * from base_p2015_11;
SELECT 2000

postgres=# select * from a limit 1;
             ts             | count |  a  |  b   |  c   |  d   | e
----------------------------+-------+-----+------+------+------+---
 2015-11-26 21:10:04.856828 |   860 | 946 | 1032 | 1118 | 1204 |
(1 row)

postgres=# \d a
            Table "datamart_owner.a"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 ts     | timestamp without time zone |
 count  | integer                     |
 a      | integer                     |
 b      | integer                     |
 c      | integer                     |
 d      | integer                     |
 e      | integer                     |

postgres=# select pg_size_pretty(pg_relation_size('a'));
 pg_size_pretty
----------------
 1149 MB

postgres=# explain select sum(count) from a group by date_trunc('DAY',ts);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=218242.96..218254.46 rows=200 width=16)
   Group Key: (date_trunc('DAY'::text, ts))
   ->  Sort  (cost=218242.96..218245.96 rows=1200 width=16)
         Sort Key: (date_trunc('DAY'::text, ts))
         ->  Gather  (cost=218059.08..218181.58 rows=1200 width=16)
               Number of Workers: 5
               ->  Partial HashAggregate  (cost=217059.08..217061.58 rows=200 width=16)
                     Group Key: date_trunc('DAY'::text, ts)
                     ->  Parallel Seq Scan on a  (cost=0.00..197059.06 rows=405 width=12)
(9 rows)

postgres=# analyze a;

postgres=# explain select sum(count) from a group by date_trunc('DAY',ts);
                            QUERY PLAN
------------------------------------------------------------------
 GroupAggregate  (cost=3164211.55..3564212.03 rows=2024 width=16)
   Group Key: (date_trunc('DAY'::text, ts))
   ->  Sort  (cost=3164211.55..3214211.61 rows=2024 width=12)
         Sort Key: (date_trunc('DAY'::text, ts))
         ->  Seq Scan on a  (cost=0.00..397059.30 rows=2024 width=12)
(5 rows)

Unsure what's happening here.

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Mon, Mar 14, 2016 at 1:31 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 14 March 2016 at 14:52, James Sewell <james.sew...@lisasoft.com> wrote:
>> One question - how is the upper limit of workers chosen?
>
> See create_parallel_paths() in allpaths.c. Basically the bigger the
> relation (in pages) the more workers will be allocated, up until
> max_parallel_degree.
>
> There is also a comment in that function which states:
> /*
>  * Limit the degree of parallelism logarithmically based on the size of
>  * the relation. This probably needs to be a good deal more
>  * sophisticated, but we need something here for now.
>  */
>
> So this will likely see some revision at some point, after 9.6.
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Parallel Aggregate
Hi,

Happy to test, really looking forward to seeing this stuff in core.

The explain analyze is below:

 Finalize HashAggregate  (cost=810142.42..810882.62 rows=59216 width=16) (actual time=2282.092..2282.202 rows=15 loops=1)
   Group Key: (date_trunc('DAY'::text, pageview_start_tstamp))
   ->  Gather  (cost=765878.46..808069.86 rows=414512 width=16) (actual time=2281.749..2282.060 rows=105 loops=1)
         Number of Workers: 6
         ->  Partial HashAggregate  (cost=764878.46..765618.66 rows=59216 width=16) (actual time=2276.879..2277.030 rows=15 loops=7)
               Group Key: date_trunc('DAY'::text, pageview_start_tstamp)
               ->  Parallel Seq Scan on celebrus_fact_agg_1_p2015_12  (cost=0.00..743769.76 rows=4221741 width=12) (actual time=0.066..1631.650 rows=3618887 loops=7)

One question - how is the upper limit of workers chosen?

James Sewell, Solutions Architect

On Mon, Mar 14, 2016 at 12:30 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 14 March 2016 at 14:16, James Sewell <james.sew...@lisasoft.com> wrote:
>> I've done some testing with one of my data sets in an 8VPU virtual
>> environment and this is looking really, really good.
>>
>> My test query is:
>>
>> SELECT pageview, sum(pageview_count)
>> FROM fact_agg_2015_12
>> GROUP BY date_trunc('DAY'::text, pageview);
>>
>> The query returns 15 rows. The fact_agg table is 5398MB and holds
>> around 25 million records.
>>
>> Explain with a max_parallel_degree of 8 tells me that the query will
>> only use 6 background workers. I have no indexes on the table
>> currently.
>>
>> Finalize HashAggregate  (cost=810142.42..810882.62 rows=59216 width=16)
>>    Group Key: (date_trunc('DAY'::text, pageview))
>>    ->  Gather  (cost=765878.46..808069.86 rows=414512 width=16)
>>          Number of Workers: 6
>>          ->  Partial HashAggregate  (cost=764878.46..765618.66 rows=59216 width=16)
>>                Group Key: date_trunc('DAY'::text, pageview)
>>                ->  Parallel Seq Scan on fact_agg_2015_12  (cost=0.00..743769.76 rows=4221741 width=12)
>
> Great! Thanks for testing this.
>
> If you run EXPLAIN ANALYZE on this with the 6 workers, does the actual
> number of Gather rows come out at 105? I'd just like to get an idea of
> whether my cost estimates for the Gather are going to be accurate for
> real world data sets.
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Parallel Aggregate
Hi,

I've done some testing with one of my data sets in an 8VPU virtual
environment and this is looking really, really good.

My test query is:

SELECT pageview, sum(pageview_count)
FROM fact_agg_2015_12
GROUP BY date_trunc('DAY'::text, pageview);

The query returns 15 rows. The fact_agg table is 5398MB and holds around 25
million records.

Explain with a max_parallel_degree of 8 tells me that the query will only
use 6 background workers. I have no indexes on the table currently.

 Finalize HashAggregate  (cost=810142.42..810882.62 rows=59216 width=16)
   Group Key: (date_trunc('DAY'::text, pageview))
   ->  Gather  (cost=765878.46..808069.86 rows=414512 width=16)
         Number of Workers: 6
         ->  Partial HashAggregate  (cost=764878.46..765618.66 rows=59216 width=16)
               Group Key: date_trunc('DAY'::text, pageview)
               ->  Parallel Seq Scan on fact_agg_2015_12  (cost=0.00..743769.76 rows=4221741 width=12)

I am getting the following timings (everything was cached before I started
testing). I didn't average the runtime, but I ran each one three times and
took the middle value.

 max_parallel_degree | runtime
---------------------+--------------
                   0 | 11693.537 ms
                   1 |  6387.937 ms
                   2 |  4328.629 ms
                   3 |  3292.376 ms
                   4 |  2743.148 ms
                   5 |  2278.449 ms
                   6 |  2000.599 ms

I'm pretty happy!

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Mon, Mar 14, 2016 at 8:44 AM, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 12 March 2016 at 16:31, David Rowley <david.row...@2ndquadrant.com> wrote:
>> I've attached an updated patch which is based on commit 7087166,
>> things are really changing fast in the grouping path area at the
>> moment, but hopefully the dust is starting to settle now.
>
> The attached patch fixes a harmless compiler warning about a possible
> uninitialised variable.
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Random note of encouragement
Argh, seems like a false alarm for now. I installed 9.5 from RPM source
(the other was one I had installed previously) and the performance matched
9.6.

Sorry about that, I must have *something* screwed up on the other one.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] Random note of encouragement
I've actually just tested this on 9.3 - and I get roughly the same as
9.6devel.

Now going back to make sure my 9.5 environment is sane. Hopefully this
isn't me jumping the gun.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Thu, Feb 25, 2016 at 10:50 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Thu, Feb 25, 2016 at 12:26 PM, Bruce Momjian <br...@momjian.us> wrote:
>> On Thu, Feb 25, 2016 at 10:06:34AM +1100, James Sewell wrote:
>>> Now when I run the following SQL (multiple times to allow for getting
>>> everything into shared buffers, which is 4GB on my machine):
>>>
>>> select sum(count_n) from base group by view_time_day;
>>>
>>> I get the following results:
>>>
>>> PSQL 9.5 - ~21 seconds
>>> PSQL 9.6 devel - ~8.5 seconds
>>>
>>> I think that's pretty good!
>>>
>>> I know this is a devel release, things may change, blah blah. But
>>> still, something has changed for the better here!
>>
>> Wow, that is cool. Can anyone suggest which commit improved this?
>
> Since it sums numerics, maybe integer transition functions from commit
> 959277a4f579da5243968c750069570a58e92b38 helped?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
[HACKERS] Random note of encouragement
Hey All,

I've been doing some (futile) work trying to speed up aggregates with a
group by in PostgreSQL 9.5. I installed PostgreSQL 9.6 on the same machine
to see if I could get anything running in parallel when using partitioning
- which didn't work. But - I did find this:

With the following setup:

CREATE TABLE base(
    view_time     TIMESTAMP WITHOUT time ZONE,
    view_time_day TIMESTAMP WITHOUT time ZONE,
    count_n       numeric);

INSERT INTO base
SELECT view_time, date_trunc('day', view_time), count::numeric
FROM (SELECT TIMESTAMP '2015-12-01' + random() * interval '30 days' AS view_time,
             trunc(random() * 99 + 1) AS count
      FROM generate_series(1,3000)) a;

analyze base;

Now when I run the following SQL (multiple times to allow for getting
everything into shared buffers, which is 4GB on my machine):

select sum(count_n) from base group by view_time_day;

I get the following results:

PSQL 9.5 - *~21 seconds*
PSQL 9.6 devel - *~8.5 seconds*

I think that's pretty good!

I know this is a devel release, things may change, blah blah. But still,
something has changed for the better here!

I get the same plan on both nodes:

 HashAggregate  (cost=670590.56..670590.95 rows=31 width=13)
   Group Key: view_time_day
   ->  Seq Scan on base  (cost=0.00..520590.04 rows=3104 width=13)

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] WAL Re-Writes
On 27/01/2016 13:30, Amit Kapila wrote:
> Thoughts?

Are the decreases observed with SSD as well as spinning rust? I might
imagine that decreasing the wear would be advantageous, especially if the
performance decrease is less with low read latency.
[HACKERS] Streaming Replication clusters and load balancing
Hello all,

I have recently been working with PostgreSQL and HAProxy to provide
seamless load balancing to a group of database servers. This on its own
isn't a hard thing: I have an implementation finished and am now thinking
about the best way to bring it to a production ready state which could be
used by others, and used in load-balancers other than HAProxy with minimal
config changes.

My initial requirements were: given a group of PostgreSQL servers, check
each x seconds and:

  - Allow read/write access only to the master server (via IPA / portA)
    - Disallow access if there are multiple master servers
  - Allow read access to all servers (via IPB / portB) as long as the
    following holds:
    - They are attached to the current master server via streaming
      replication (or they are the current master server)
    - They can currently contact the master server (safest option,
      disallow all access when master-less)
    - They are in the same timeline as the master server (do I need this
      check?)
    - The master server reports that they have less than x bytes lag

HAProxy can talk to PostgreSQL for a health check via TCP or PSQL
(connection check only). Neither of these allow the logic above -
therefore this logic has to be hosted outside of HAProxy. This might change
in the future if HAProxy gets the ability to send SQL statements (like an
F5 can).

Today the best way to provide this information to HAProxy (and many other
load balancers, application frameworks, proxies, monitoring solutions) is
via HTTP, where HTTP 200 is pass the check and HTTP 503 is fail the check
(and don't load balance to this node). In my case I have a script which
accepts HTTP requests to /read to check if this node is available for read
only and /write which checks if this node is available for read/writes.

The options as I see them are:

  - Implement a script / small app which connects to PostgreSQL and
    executes these checks
    - Implemented and proven today at many independent sites
    - Should it run on the HAProxy server or the PSQL server?
    - Integrated HTTP server or xinetd script?
    - Platform independence?
    - What if it dies?
  - Implement a custom PostgreSQL BGworker which provides this information
    over HTTP
    - No outside-of-PostgreSQL config needed
    - No reliance on another daemon / interface being up
    - libmicrohttpd or similar should help with platform independence
    - Security / acceptance by community?
    - Only newer versions of PostgreSQL
  - Spend the time working on getting SQL checks into HAProxy
    - What about other platforms which only support HTTP?

I think all of the options would benefit from a PSQL extension which does
the following:

  - Encapsulates the check logic (easier to upgrade, manipulate)
  - Stores historic check data for a number of hours / days / months
  - Stores defaults (override via HTTP could be possible for things like
    lag)

Does anyone else have any thoughts on this topic?

Eventually many cool features could flow out of this kind of work:

  - Integration with High Availability products - I have this working with
    EnterpriseDB EFM now.
    - Locate the current master using the HA product - more than one
      master doesn't cause loss of service as long as HA state is sane
    - Locate all clustered standby servers using the HA product
      - if a standby is removed from the HA cluster, it is removed from
        load balancing
      - if a standby is not part of the cluster, it is removed from load
        balancing (even if it is part of streaming replication)
  - HTTP replication status requests which facilitate dynamically managing
    HAProxy (or other) PostgreSQL server pools
    - Add a node to streaming replication, it automatically shows up in
      the pool and starts being checked to see if it can service reads
    - Great for cloud scale out
  - Allocation of additional load balancer groups based on some criteria
    (?), for example
    - read/write (as above)
    - read only (as above)
    - data warehouse (reporting reads only)
    - DR (replica with no reads or writes - until it becomes a master)

Keen to hear comments.

Cheers,

James Sewell, Solutions Architect
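For concreteness, a minimal sketch of the SQL such a health-check script
could run (the function names are the 9.x-era ones; thresholds and
application names would be site-specific):

    -- /write check: is this node the master?
    SELECT NOT pg_is_in_recovery() AS writable;

    -- /read check, run on the master: byte lag per attached standby
    SELECT application_name,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 replay_location) AS lag_bytes
    FROM pg_stat_replication;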
Re: [HACKERS] Rounding to even for numeric data type
>>>>> "MP" == Michael Paquier <michael.paqu...@gmail.com> writes:

MP> So, attached is a patch that does 1) and 2) to make clear to the
MP> user how numeric and double precision behave regarding rounding.
MP> I am adding it to CF 2015-06 to keep track of it...

Given that the examples show -2.5 rounds to -3, the IEEE term is
roundTiesToAway, and the typical conversational English is "round ties away
from zero". RoundUp means towards +Infinity.

754 specifies that for decimal, either roundTiesToEven or roundTiesToAway
are acceptable defaults, and which of the two applies is language
dependent.

Does ANSI SQL say anything about how numeric should round?

In general, for decimals (or anything other than binary), there are twelve
possible roundings:

    ToEven      ToOdd      AwayFromZero      ToZero      Up      Down
    TiesToEven  TiesToOdd  TiesAwayFromZero  TiesToZero  TiesUp  TiesDown

(Up is the same as ceil(3), Down as floor(3).)

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6
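A quick way to see the two behaviours side by side on a stock server (a
sketch; the float8 result assumes the platform's default round-to-even
mode):

    SELECT round(-2.5::numeric)          AS numeric_round,  -- -3, ties away from zero
           round(-2.5::double precision) AS float8_round;   -- -2, ties to even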
[HACKERS] Composite index and min()
Hello,

I have the following table:

\d a
        Table "phxconfig.a"
   Column   |  Type   | Modifiers
------------+---------+-----------
 phx_run_id | integer |
 cell_id    | integer |
Indexes:
    "a_phx_run_id_cell_id_idx" btree (phx_run_id, cell_id)

When I use a min() query I get the following plans:

test=# explain select min(phx_run_id) from a;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Result  (cost=0.22..0.23 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.14..0.22 rows=1 width=4)
           ->  Index Only Scan using a_phx_run_id_cell_id_idx on a  (cost=0.14..7.89 rows=100 width=4)
                 Index Cond: (phx_run_id IS NOT NULL)

test=# explain select min(cell_id) from a;
                       QUERY PLAN
---------------------------------------------------------
 Aggregate  (cost=2.25..2.26 rows=1 width=4)
   ->  Seq Scan on a  (cost=0.00..2.00 rows=100 width=4)

Can anyone comment on why this happens? The index kicks in when I do an
explicit cell_id comparison.

These are large tables, and they are in a partition layout so it really
hurts when I do the min call on the parent table.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
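For what it's worth, the min()/max() index optimisation can only use a
btree whose *leading* column is the one being aggregated, so (phx_run_id,
cell_id) can answer min(phx_run_id) but not min(cell_id). A sketch of the
usual fix (index name assumed):

    CREATE INDEX a_cell_id_idx ON a (cell_id);
    -- min(cell_id) can then be answered by an Index Only Scan plus Limit.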
[HACKERS] ADD FOREIGN KEY locking
Hello all,

When I add a FK with a statement like this:

ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id);

I see a lock on table b:

select locktype, mode, granted
from pg_locks, pg_stat_activity
where relation::regclass::text = 'b'
  AND pg_locks.pid = pg_stat_activity.pid;

locktype | relation
mode     | AccessShareLock
granted  | t
query    | SOME LONG RUNNING QUERY WHICH SELECTS FROM b

locktype | relation
mode     | AccessExclusiveLock
granted  | f
query    | ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id);

This means that my add key won't complete until my long running query does.
That seems a bit odd to me? In this database there are lots of
data-warehouse type queries running, which makes it a bit hard for me to
schedule this operation.

Is this just a manifestation of adding the key being in an ALTER TABLE,
which always needs an AccessExclusiveLock? Or am I missing some edge case
when this lock would be required in this circumstance?

No real urgency on this question, I just found it a bit strange and thought
someone might be able to shed some light.

James Sewell, Solutions Architect
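A common way to shrink the window (a sketch, not from this thread; it does
not lower the lock level, but the lock is held only briefly because the
full referential scan is deferred):

    ALTER TABLE a ADD CONSTRAINT a_id_fkey
        FOREIGN KEY (id) REFERENCES b(id) NOT VALID;
    -- later, validate the existing rows; in recent releases this takes a
    -- weaker lock than the ADD itself:
    ALTER TABLE a VALIDATE CONSTRAINT a_id_fkey;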
Re: [HACKERS] ADD FOREIGN KEY locking
Oh, I've just noticed something in the commitfest post - "Reducing lock
strength of trigger and foreign key DDL".

Perhaps I just need to be more patient.

Cheers,

James Sewell, Solutions Architect

On Wed, Feb 18, 2015 at 10:57 AM, James Sewell <james.sew...@lisasoft.com> wrote:
> Hello all,
>
> When I add a FK with a statement like this:
>
> ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id);
>
> I see a lock on table b:
>
> select locktype, mode, granted
> from pg_locks, pg_stat_activity
> where relation::regclass::text = 'b'
>   AND pg_locks.pid = pg_stat_activity.pid;
>
> locktype | relation
> mode     | AccessShareLock
> granted  | t
> query    | SOME LONG RUNNING QUERY WHICH SELECTS FROM b
>
> locktype | relation
> mode     | AccessExclusiveLock
> granted  | f
> query    | ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id);
>
> This means that my add key won't complete until my long running query
> does. That seems a bit odd to me? In this database there are lots of
> data-warehouse type queries running, which makes it a bit hard for me to
> schedule this operation.
>
> Is this just a manifestation of adding the key being in an ALTER TABLE,
> which always needs an AccessExclusiveLock? Or am I missing some edge case
> when this lock would be required in this circumstance?
>
> No real urgency on this question, I just found it a bit strange and
> thought someone might be able to shed some light.
>
> James Sewell, Solutions Architect
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
On 03/10/2014 05:53, Kouhei Kaigai wrote:
> Yep, that's my pain. Even though usual query does not take many buffers
> pinned, my use case needs to fetch megabytes scale data at once because
> of performance reason; page-by-page synchronous scan makes GPU being
> idle.

Doesn't your GPU have an async queue and exec mechanism? Then you could do
an async DMA to the GPU with an event, use that event in the GPU to start
the kernel, and in the DB to release the pin?
Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?
>>>>> "ST" == Shaun Thomas <stho...@optionshouse.com> writes:

ST> That said, the documentation here says FLOAT4 is an alias for REAL,
ST> so it's somewhat nonintuitive for FLOAT4 to be so much slower than
ST> FLOAT8, which is an alias for DOUBLE PRECISION.

There are some versions of glibc where doing certain math on double is
faster than doing it on float, depending on how things are compiled. Maybe
this is one of them?

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6
Re: [HACKERS] Securing make check (CVE-2014-0067)
On 02/03/2014 15:30, Magnus Hagander wrote:
> Terminal Services have definitely become more common over time, but with
> faster and cheaper virtualization, a lot of people have switched to that
> instead, which would remove the problem of course. I wonder how common
> it actually is, though, to *build postgres* on a terminal services
> machine with other users on it...

Well, the banks I've contracted at recently are all rather keen on virtual
desktops for developers, and some of those are terminal services. We're a
headache, and packaging up all the things we need is a pain, so there is
some mileage in buying grunty servers and doing specific installs that are
then shared, rather than making an MSI generally available.

Also I have experience of being given accounts for jenkins etc. that are
essentially terminal services logins, and having these things unable to
maintain a software stack can effectively disqualify tech we would
otherwise use.
Re: [HACKERS] PostgreSQL Failback without rebuild
I've just noticed that on PostgreSQL 9.3 I can do the following with a
master node A and a slave node B (as long as I have set
recovery_target_timeline = 'latest'):

1. Stop Node A
2. Promote Node B
3. Attach Node A as slave

This is sufficient for my needs (I know it doesn't cover a crash), can
anyone see any potential problems with this approach?

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Wed, Feb 5, 2014 at 6:03 PM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Wed, Feb 5, 2014 at 3:14 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Wed, Feb 5, 2014 at 10:30 AM, James Sewell <james.sew...@lisasoft.com> wrote:
>>> I've seen some proposals and a tool (pg_rewind), but all seem to have
>>> drawbacks.
>>
>> As far as I remember, one of the main drawbacks for pg_rewind was
>> related to hint bits which can be avoided by wal_log_hints. pg_rewind
>> is not part of core PostgreSQL code, however if you wish, you can try
>> that tool to see if it can solve your purpose.
>
> For 9.3, pg_rewind is only safe with page checksums enabled. For 9.4,
> yes, wal_log_hints or checksums is mandatory. The code contains as well
> some safety checks to ensure that a node not using those parameters
> cannot be rewinded.
>
> Regards,
> --
> Michael
Re: [HACKERS] PostgreSQL Failback without rebuild
Node A could get ahead even if it has been shut down cleanly BEFORE the
promotion? I'd always assumed if I shut down the master the slave would be
at the same point after shutdown - is this incorrect?

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Fri, Feb 7, 2014 at 4:58 PM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Fri, Feb 7, 2014 at 1:57 PM, James Sewell <james.sew...@lisasoft.com> wrote:
>> I've just noticed that on PostgreSQL 9.3 I can do the following with a
>> master node A and a slave node B (as long as I have set
>> recovery_target_timeline = 'latest'):
>>
>> 1. Stop Node A
>> 2. Promote Node B
>> 3. Attach Node A as slave
>>
>> This is sufficient for my needs (I know it doesn't cover a crash), can
>> anyone see any potential problems with this approach?
>
> Yes, node A could get ahead of the point where WAL forked when promoting
> B. In this case you cannot reconnect A to B, and need to actually
> recreate a node from a fresh base backup, or rewind it. pg_rewind
> targets the latter, postgres core is able to do the former, and
> depending on things like your environment and/or the size of your
> server, you might prefer one or the other.
>
> Regards,
> --
> Michael
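One way to check, before promoting B, that it has received and replayed
everything A generated (a sketch; the xlog-flavoured function names are the
9.3-era ones):

    -- on B, while A is still up and idle:
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();
    -- compare against A's current write position:
    SELECT pg_current_xlog_location();

After a clean shutdown of A, its final checkpoint location can instead be
read with the pg_controldata utility and compared the same way.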
[HACKERS] PostgreSQL Failback without rebuild
Hello All,

I have been reading through some of the recent discussions about failback
when in a streaming replication setup. I define failback as:

1. Node A is master, Node B is slave
2. Node A crashes || Node A is stopped || nothing happens
3. Promote Node B to Master
4. Attach Node A as slave

My understanding is that currently, to achieve step four, you need to take
a base backup of Node B and deploy it to Node A before starting streaming
replication (or use rsync etc...). This is very undesirable for many users,
especially if they have a very large database.

From the discussions I can see that the problem is to do with Node A
writing changes to disk that have not been streamed to Node B before Node A
crashes.

Has there been any consensus on this issue? Are there any solutions which
might make it into 9.4 or 9.5? I've seen some proposals and a tool
(pg_rewind), but all seem to have drawbacks.

I've been looking mainly at these threads:

http://www.postgresql.org/message-id/CAF8Q-Gy7xa60HwXc0MKajjkWFEbFDWTG=ggyu1kmt+s2xcq...@mail.gmail.com
http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com
http://www.postgresql.org/message-id/519df910.4020...@vmware.com

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote:
On Jan 13, 2014, at 19:03, Hannu Krosing ha...@2ndquadrant.com wrote:
On 01/13/2014 09:53 PM, Trond Myklebust wrote:
On Jan 13, 2014, at 15:40, Andres Freund and...@2ndquadrant.com wrote:
On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote:

I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access.

Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that?

I think that fixes some of the worst instances, but I've seen machines spending horrible amounts of CPU (& BUS) time in page reclaim nonetheless. If I analyzed it correctly it's in RAM << working set workloads where RAM is pretty large and most of it is used as page cache. The kernel ends up spending a huge percentage of time finding and potentially defragmenting pages when looking for victim buffers.

On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow.

I've wondered before if there wouldn't be a chance for postgres to say "my dear OS, that the file range 0-8192 of file x contains y, no need to reread" and do that when we evict a page from s_b but I never dared to actually propose that to kernel people...

O_DIRECT was specifically designed to solve the problem of double buffering between applications and the kernel. Why are you not able to use that in these situations?

What is asked is the opposite of O_DIRECT - the write from a buffer inside postgresql to linux *buffercache* and telling linux that it is the same as what is currently on disk, so don't bother to write it back ever.

I don't understand. Are we talking about mmap()ed files here? Why would the kernel be trying to write back pages that aren't dirty?

No ... if I have it right, it's pretty awful: they want to do a read of a file into a user provided buffer, thus obtaining a page cache entry and a copy in their userspace buffer, then insert the page of the user buffer back into the page cache as the page cache page ... that's right, isn't it postgres people? Effectively you end up with buffered read/write that's also mapped into the page cache. It's a pretty awful way to hack around mmap.

James
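The existing interface closest to the eviction half of this discussion is posix_fadvise(). A minimal sketch of reading an 8kb block into a local buffer pool and then dropping the kernel's copy (the helper name is invented for illustration; DONTNEED is advisory, and calling it on a dirty range forces writeback first):

    #include <fcntl.h>
    #include <unistd.h>

    /* Read one 8kb block, then ask the kernel to forget its copy so the
     * data is cached only once.  The kernel may keep the page anyway. */
    static ssize_t
    read_block_uncached(int fd, void *buf, off_t blkno)
    {
        const off_t off = blkno * 8192;
        ssize_t n = pread(fd, buf, 8192, off);

        if (n == 8192)
            (void) posix_fadvise(fd, off, 8192, POSIX_FADV_DONTNEED);
        return n;
    }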
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote:
On 01/14/2014 09:39 AM, Claudio Freire wrote:
On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:

Again, as said above the linux file system is doing fine. What we want is a few ways to interact with it to let it do even better when working with postgresql by telling it some stuff it otherwise would have to second guess and by sometimes giving it back some cache pages which were copied away for potential modifying but ended up clean in the end.

You don't need new interfaces. Only a slight modification of what fadvise DONTNEED does. This insistence on injecting pages from postgres to the kernel is just a bad idea.

Do you think it would be possible to map copy-on-write pages from linux cache to postgresql cache? This would be a step in the direction of solving the double-ram-usage of pages which have not been read from syscache to postgresql cache without sacrificing linux read-ahead (which I assume does not happen when reads bypass system cache).

The current mechanism for coherency between a userspace cache and the in-kernel page cache is mmap ... that's the only way you get the same page in both currently. glibc used to have an implementation of read/write in terms of mmap, so it should be possible to insert it into your current implementation without a major rewrite. The problem I think this brings you is uncontrolled writeback: you don't want dirty pages to go to disk until you issue a write(). I think we could fix this with another madvise(): something like MADV_WILLUPDATE telling the page cache we expect to alter the pages again, so don't be aggressive about cleaning them. Plus all the other issues with mmap() ... but if you can detail those, we might be able to fix them.

and we can write back the copy at the point when it is safe (from postgresql perspective) to let the system write them back?

Using MADV_WILLUPDATE, possibly ... you're still not going to have absolute control. The kernel will write back the pages if the dirty limits are exceeded, for instance, but we could tune it to be useful.

Do you think it is possible to make it work with good performance for a few million 8kb pages?

At the very least, it still needs postgres to know too much of the filesystem (block layout) to properly work. Ie: pg must be required to put entire filesystem-level blocks into the page cache, since that's how the page cache works.

I was more thinking of a simple write() interface with extra flags/sysctls to tell the kernel that we already have this on disk

At the very worst, it may introduce serious security and reliability implications, when applications can destroy the consistency of the page cache (even if full access rights are checked, there's still the possibility this inconsistency might be exploitable).

If you allow write() which just writes clean pages, I can not see where the extra security concerns are beyond what normal write can do.

The problem is we can't give you absolute control of when pages are written back because that interface can be used to DoS the system: once we get too many dirty uncleanable pages, we'll thrash looking for memory and the system will livelock.

James
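To make the shape of this interface concrete: MADV_WILLNEED and MADV_DONTNEED exist today, while MADV_WILLUPDATE is only the proposal floated above and exists in no shipped kernel. A sketch, with the helper name invented for illustration:

    #include <sys/mman.h>

    static void
    advise_buffer_pool(void *pool, size_t len)
    {
        madvise(pool, len, MADV_WILLNEED);    /* populate via readahead */
    #ifdef MADV_WILLUPDATE
        madvise(pool, len, MADV_WILLUPDATE);  /* proposed: clean lazily */
    #endif
    }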
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley james.bottom...@hansenpartnership.com wrote:

No, I'm sorry, that's never going to be possible. No user space application has all the facts. If we give you an interface to force unconditional holding of dirty pages in core you'll livelock the system eventually because you made a wrong decision to hold too many dirty pages. I don't understand why this has to be absolute: if you advise us to hold the pages dirty and we do up until it becomes a choice to hold on to the pages or to thrash the system into a livelock, why would you ever choose the latter? And if, as I'm assuming, you never would, why don't you want the kernel to make that choice for you?

If you don't understand how write-ahead logging works, this conversation is going nowhere. Suffice it to say that the word "ahead" is not optional.

No, I do ... you mean the order of write out, if we have to do it, is important. In the rest of the kernel, we do this with barriers which cause ordered grouping of I/O chunks. If we could force a similar ordering in the writeout code, is that enough?

James
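The ordering constraint being alluded to, reduced to syscalls in a minimal sketch (error handling abbreviated): a data page may only be written once the WAL describing the change is durably on disk; reversing the two steps can corrupt the database after a crash.

    #include <stdlib.h>
    #include <unistd.h>

    static void
    flush_dirty_page(int wal_fd, int data_fd, const void *page, off_t off)
    {
        /* WAL records for this page were already appended to wal_fd. */
        if (fsync(wal_fd) != 0)                   /* 1. WAL to stable storage */
            abort();
        (void) pwrite(data_fd, page, 8192, off);  /* 2. now the page may go  */
    }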
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:

In terms of avoiding double-buffering, here's my thought after reading what's been written so far. Suppose we read a page into our buffer pool. While the page is clean, it would be ideal for the mapping to be shared between the buffer cache and our pool, sort of like copy-on-write. That way, if we decide to evict the page, it will still be in the OS cache if we end up needing it again (remember, the OS cache is typically much larger than our buffer pool). But if the page is dirtied, then instead of copying it, just have the buffer pool forget about it, because at that point we know we're going to write the page back out anyway before evicting it. This would be pretty similar to copy-on-write, except without the copying. It would just be forget-from-the-buffer-pool-on-write.

But... either copy-on-write or forget-on-write needs a page fault, and thus a page mapping. Is a page fault more expensive than copying 8k? (I really don't know.)

A page fault can be expensive, yes ... but perhaps you don't need one. What you want is a range of memory that's read from a file but treated as anonymous for writeout (i.e. written to swap if we need to reclaim it). Then at some time later, you want to designate it as written back to the file instead so you control the writeout order. I'm not sure we can do this: the separation between file backed and anonymous pages is pretty deeply ingrained into the OS, but if it were possible, is that what you want?

James
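The nearest existing primitive to "share until first write" is a private file mapping, though it copies on the first write rather than forgetting. A sketch (error handling omitted):

    #include <sys/mman.h>

    static char
    touch_page(int fd, off_t off)
    {
        char *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, off);
        char c = p[0];   /* read served from the shared page-cache page */
        p[0] = c;        /* fault: kernel copies the page, now anonymous */
        return c;
    }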
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
James Bottomley james.bottom...@hansenpartnership.com writes:

The current mechanism for coherency between a userspace cache and the in-kernel page cache is mmap ... that's the only way you get the same page in both currently.

Right.

glibc used to have an implementation of read/write in terms of mmap, so it should be possible to insert it into your current implementation without a major rewrite. The problem I think this brings you is uncontrolled writeback: you don't want dirty pages to go to disk until you issue a write()

Exactly.

I think we could fix this with another madvise(): something like MADV_WILLUPDATE telling the page cache we expect to alter the pages again, so don't be aggressive about cleaning them.

"Don't be aggressive" isn't good enough. The prohibition on early write has to be absolute, because writing a dirty page before we've done whatever else we need to do results in a corrupt database. It has to be treated like a write barrier.

The problem is we can't give you absolute control of when pages are written back because that interface can be used to DoS the system: once we get too many dirty uncleanable pages, we'll thrash looking for memory and the system will livelock.

Understood, but that makes this direction a dead end. We can't use it if the kernel might decide to write anyway.

No, I'm sorry, that's never going to be possible. No user space application has all the facts. If we give you an interface to force unconditional holding of dirty pages in core you'll livelock the system eventually because you made a wrong decision to hold too many dirty pages. I don't understand why this has to be absolute: if you advise us to hold the pages dirty and we do up until it becomes a choice to hold on to the pages or to thrash the system into a livelock, why would you ever choose the latter? And if, as I'm assuming, you never would, why don't you want the kernel to make that choice for you?

James
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 12:39 -0500, Robert Haas wrote:
On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley james.bottom...@hansenpartnership.com wrote:
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:

In terms of avoiding double-buffering, here's my thought after reading what's been written so far. Suppose we read a page into our buffer pool. While the page is clean, it would be ideal for the mapping to be shared between the buffer cache and our pool, sort of like copy-on-write. That way, if we decide to evict the page, it will still be in the OS cache if we end up needing it again (remember, the OS cache is typically much larger than our buffer pool). But if the page is dirtied, then instead of copying it, just have the buffer pool forget about it, because at that point we know we're going to write the page back out anyway before evicting it. This would be pretty similar to copy-on-write, except without the copying. It would just be forget-from-the-buffer-pool-on-write.

But... either copy-on-write or forget-on-write needs a page fault, and thus a page mapping. Is a page fault more expensive than copying 8k? (I really don't know.)

A page fault can be expensive, yes ... but perhaps you don't need one. What you want is a range of memory that's read from a file but treated as anonymous for writeout (i.e. written to swap if we need to reclaim it). Then at some time later, you want to designate it as written back to the file instead so you control the writeout order. I'm not sure we can do this: the separation between file backed and anonymous pages is pretty deeply ingrained into the OS, but if it were possible, is that what you want?

Doesn't sound exactly like what I had in mind. What I was suggesting is an analogue of read() that, if it reads full pages of data to a page-aligned address, shares the data with the buffer cache until it's first written instead of actually copying the data.

The only way to make this happen is mmap the file to the buffer and use MADV_WILLNEED.

The pages are write-protected so that an attempt to write the address range causes a page fault. In response to such a fault, the pages become anonymous memory and the buffer cache no longer holds a reference to the page.

OK, so here I thought of another madvise() call to switch the region to anonymous memory. A page fault works too, of course, it's just that one per page in the mapping will be expensive. Do you care about handling aliases ... what happens if someone else reads from the file, or will that never occur? The reason for asking is that it's much easier if someone else mmapping the file gets your anonymous memory than we create an alias in the page cache.

James
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:09 -0500, Robert Haas wrote:
On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley james.bottom...@hansenpartnership.com wrote:

Doesn't sound exactly like what I had in mind. What I was suggesting is an analogue of read() that, if it reads full pages of data to a page-aligned address, shares the data with the buffer cache until it's first written instead of actually copying the data.

The only way to make this happen is mmap the file to the buffer and use MADV_WILLNEED.

The pages are write-protected so that an attempt to write the address range causes a page fault. In response to such a fault, the pages become anonymous memory and the buffer cache no longer holds a reference to the page.

OK, so here I thought of another madvise() call to switch the region to anonymous memory. A page fault works too, of course, it's just that one per page in the mapping will be expensive.

I don't think either of these ideas works for us. We start by creating a chunk of shared memory that all processes (we do not use threads) will have mapped at a common address, and we read() and write() into that chunk.

Yes, that's what I was thinking: it's a cache. About how many files comprise this cache? Are you thinking it's too difficult for every process to map the files?

Do you care about handling aliases ... what happens if someone else reads from the file, or will that never occur? The reason for asking is that it's much easier if someone else mmapping the file gets your anonymous memory than we create an alias in the page cache.

All reads and writes go through the buffer pool stored in shared memory, but any of the processes that have that shared memory region mapped could be responsible for any individual I/O request.

That seems to be possible with the abstraction. The initial mapping gets the file backed pages: you can do madvise to read them (using readahead), flush them (using wontneed) and flip them to anonymous (using something TBD). Since it's a shared mapping API based on the file, any of the mapping processes can do any operation. Future mappers of the file get the mix of real and anon memory, so it's truly shared.

Given that you want to use this as a shared cache, it seems that the API to flip back from anon to file mapped is wontneed. That would also trigger writeback of any dirty pages in the previously anon region ... which you could force with msync. As far as I can see, this is identical to read/write on a shared region with the exception that you don't need to copy in and out of the page cache.

From our point of view, the implementation is nice because the pages effectively never leave the page cache. We just use an extra per page flag (which I'll get shot for suggesting) to alter the writeout path (which is where the complexity which may kill the implementation is).

James
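The lifecycle sketched in that last reply, using only calls that exist today; the flip between file-backed and anonymous has no current kernel equivalent, so msync() stands in for the controlled writeback (illustrative only, error handling omitted):

    #include <sys/mman.h>

    static void
    shared_cache_cycle(int fd, size_t len)
    {
        void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        madvise(pool, len, MADV_WILLNEED);  /* populate via readahead */
        /* ... any process mapping the file reads/writes pages in place ... */
        msync(pool, len, MS_SYNC);          /* writeback at a safe point */
        madvise(pool, len, MADV_DONTNEED);  /* drop claim on the clean pages */
    }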
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
On 1/13/14, 2:27 PM, Claudio Freire wrote:
On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote:
On 1/13/14, 2:19 PM, Claudio Freire wrote:
On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote:

On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow.

There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use.

Isn't this a fairly simple matter of when we read a page into shared buffers tell the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers ("here kernel, please put this back into your cache").

That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive.

Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of "here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this."

This should be the madvise() interface (with MADV_WILLNEED and MADV_DONTNEED). Is there something in that interface that is insufficient?

James
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:

Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of "here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this."

This should be the madvise() interface (with MADV_WILLNEED and MADV_DONTNEED). Is there something in that interface that is insufficient?

For one, postgres doesn't use mmap for files (and can't without major new interfaces).

I understand, that's why you get double buffering: because we can't replace a page in the range you give us on read/write. However, you don't have to switch entirely to mmap: you can use mmap/madvise exclusively for cache control and still use read/write (and still pay the double buffer penalty, of course). It's only read/write with directio that would cause problems here (unless you're planning to switch to DIO?).

Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel.

Is this because of problems in the mmap_sem?

Also, that will mark that page dirty, which isn't what we want in this case.

You mean madvise (page_addr)? It shouldn't ... the state of the dirty bit should only be updated by actual writes. Which MADV_ primitive is causing the dirty marking, because we might be able to fix it (unless there's some weird corner case I don't know about).

One major usecase is transplanting a page coming from postgres' buffers into the kernel's buffercache because the latter has a much better chance of properly allocating system resources across independent applications running.

If you want to share pages between the application and the page cache, the only known interface is mmap ... perhaps we can discuss how better to improve mmap for you? We also do have a way of transplanting pages: it's called splice. How do the semantics of splice differ from what you need?

Oh, and the kernel's page-cache management while far from perfect, actually scales much better than postgres'.

Well, then, it sounds like the best way forward would be to get postgres to use the kernel page cache more efficiently.

James
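For readers unfamiliar with the splice() mentioned here, a minimal sketch (one end of a splice must always be a pipe; error handling omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move one 8kb page-cache page from a file into a pipe without
     * copying it through user space. */
    static ssize_t
    transplant_page(int file_fd)
    {
        int pfd[2];
        loff_t off = 0;

        if (pipe(pfd) != 0)
            return -1;
        return splice(file_fd, &off, pfd[1], NULL, 8192, SPLICE_F_MOVE);
    }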
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 21:29 +0000, Greg Stark wrote:
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:

For one, postgres doesn't use mmap for files (and can't without major new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel.

I may as well dump this in this thread. We've discussed this in person a few times, including at least once with Ted Ts'o when he visited Dublin last year. The fundamental conflict is that the kernel understands better the hardware and other software using the same resources, while Postgres understands better its own access patterns. We need to either add interfaces so Postgres can teach the kernel what it needs about its access patterns or add interfaces so Postgres can find out what it needs to know about the hardware context.

The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work but it would be tricky to keep it low enough overhead to be workable.

So in this case, the question would be what additional information do we need to exchange that's not covered by the existing interfaces. Between madvise and splice, we seem to have most of what you want; what's missing?

The less exciting, more conservative option would be to add kernel interfaces to teach Postgres about things like raid geometries. Then Postgres could use directio and decide to do prefetching based on the raid geometry, how much i/o bandwidth and iops are available, etc. Reimplementing i/o schedulers and all the rest of the work that the kernel provides inside Postgres just seems like something outside our competency and that none of us is really excited about doing.

This would also be a well trodden path ... I believe that some large database company introduced Direct IO for roughly this purpose.

James
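The "well trodden path" version, for concreteness: O_DIRECT bypasses the page cache entirely, at the price of alignment rules and of the application supplying its own readahead and I/O scheduling. A sketch (error handling omitted; the 4096-byte alignment is typical but device-dependent):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void *
    direct_read_block(const char *path, off_t off)
    {
        void *buf;
        int fd = open(path, O_RDONLY | O_DIRECT);

        posix_memalign(&buf, 4096, 8192);  /* alignment the device requires */
        pread(fd, buf, 8192, off);
        close(fd);
        return buf;
    }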
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 06/01/2014 03:14, Robert Haas wrote:

That's up to the application. After calling dsm_create(), you call dsm_segment_handle() to get the 32-bit integer handle for that segment. Then you have to get that to the other process(es) somehow. If you're trying to share a handle with a background worker, you can stuff it in bgw_main_arg. Otherwise, you'll probably need to store it in the main shared memory segment, or a file, or whatever.

Well, that works for sysv shm, sure. But I was interested (possibly from Konstantin) how the handle transfer takes place at the moment, particularly if it is possible to create additional segments dynamically. I haven't looked at the extension at all.
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 06/01/2014 04:20, Amit Kapila wrote:

Duplicate handle should work, but we need to communicate the handle to other process using IPC.

Only if the other process needs to use it. The IPC is not to transfer the handle to the other process, just to tell it which slot in its handle table contains the handle. If you just want to ensure that its use-count never goes to zero, the receiver does not need to know what the handle is. However ...

The point remains that you need to duplicate it into every process that might want to use it subsequently, so it makes sense to DuplicateHandle into the parent, and then to advertise that handle value publicly so that other child processes can DuplicateHandle it back into their own process. The handle value can change so you also need to refer to the handle in the parent and map it in each child to the local equivalent.
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 05/01/2014 16:50, Robert Haas wrote:

But on Windows, segments are *automatically* destroyed *by the operating system* when the last process unmaps them, so it's not quite so clear to me how we can allow it there. The main shared memory segment is no problem because the postmaster always has it mapped, even if no one else does, but that doesn't help for dynamic shared memory segments.

Surely you just need to DuplicateHandle into the parent process? If you want to (tidily) dispose of it at some time, then you'll need to tell the postmaster that you have done so and what the handle is in its process, but if you just want it to stick around, then you can just pass it up.
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 05/01/2014 18:02, Robert Haas wrote:
On Sun, Jan 5, 2014 at 12:34 PM, jamesja...@mansionfamily.plus.com wrote:
On 05/01/2014 16:50, Robert Haas wrote:

But on Windows, segments are *automatically* destroyed *by the operating system* when the last process unmaps them, so it's not quite so clear to me how we can allow it there. The main shared memory segment is no problem because the postmaster always has it mapped, even if no one else does, but that doesn't help for dynamic shared memory segments.

Surely you just need to DuplicateHandle into the parent process? If you want to (tidily) dispose of it at some time, then you'll need to tell the postmaster that you have done so and what the handle is in its process, but if you just want it to stick around, then you can just pass it up.

Uh, I don't know, maybe? Does the postmaster have to do something to receive the duplicated handle

In principle, no, so long as the child has a handle to the parent process that has the appropriate permissions. Given that these processes have a parent/child relationship that shouldn't be too hard to arrange.

, or can the child just throw it over the wall to the parent and let it rot until the postmaster finally exits?

Yes. Though it might be a good idea to record the handle somewhere (perhaps in a table) so that any potential issues from an insane system spamming the postmaster with handles are apparent.

I'm intrigued - how are the handles shared between children that are peers in the current scheme? Some handle transfer must already be in place. Could you share the handles to an immortal worker if you want to reduce any potential impact on the postmaster?

The latter would be nicer for our purposes, perhaps, as running more code from within the postmaster is risky for us. If a regular backend process dies, the postmaster will restart everything and the database will come back on line, but if the postmaster itself dies, we're hard down.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
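The "throw it over the wall" step, as a sketch (hPostmaster is assumed to be a handle to the postmaster process opened with PROCESS_DUP_HANDLE rights; the function name is invented for illustration):

    #include <windows.h>

    /* Duplicate a file-mapping handle into the postmaster so the segment
     * outlives this child.  The returned value is only meaningful inside
     * the postmaster's handle table; advertise it (e.g. in shared memory)
     * if other children need to DuplicateHandle it back. */
    HANDLE
    keep_segment_alive(HANDLE hPostmaster, HANDLE hSegment)
    {
        HANDLE hInParent = NULL;

        if (!DuplicateHandle(GetCurrentProcess(), hSegment,
                             hPostmaster, &hInParent,
                             0, FALSE, DUPLICATE_SAME_ACCESS))
            return NULL;
        return hInParent;
    }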
Re: [HACKERS] SSL: better default ciphersuite
MK == Marko Kreen mark...@gmail.com writes:
PE == Peter Eisentraut pete...@gmx.net writes:

MK Well, we should - the DEFAULT is clearly a client-side default
MK for compatibility only. No server should ever run with it.

PE Any other opinions on this out there?

For reference, see: https://wiki.mozilla.org/Security/Server_Side_TLS for the currently suggested suite for TLS servers. That is:

ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:
ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:
DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:
ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:
ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:
ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:
DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:
DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:
AES256-GCM-SHA384:ECDHE-RSA-RC4-SHA:ECDHE-ECDSA-RC4-SHA:
AES128:AES256:RC4-SHA:HIGH:
!aNULL:!eNULL:!EXPORT:!DES:!3DES:!MD5:!PSK

The page explains why. But for pgsql, I'd leave off the !PSK; pre-shared keys may prove useful for some. And RC4, perhaps, also should be !ed. And if anyone wants Kerberos tls-authentication, one could add KRB5-DES-CBC3-SHA, but that is ssl3-only.

Once salsa20-poly1305 lands in openssl, that should be added to the start of the list.

-JimC
--
James Cloos cl...@jhcloos.com  OpenPGP: 1024D/ED7DAEA6
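On the server side this maps onto the ssl_ciphers setting; a shortened sketch for postgresql.conf (illustrative only - trim to taste and verify the string with `openssl ciphers -v`):

    ssl = on
    ssl_ciphers = 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!3DES:!MD5'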
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey All,

I had missed these emails, sorry. The search+bind mode issue is one of documentation location; I have fixed it by moving the section to the "applied to both" list. As the patch is to do with post-auth response this is correct.

As for the issue when something other than 0 or 1 is set, I am happy to throw an error (although this doesn't seem to be how options such as LDAPTLS work: 1 if 1 else 0). I assume I would use the ereport() function to do this (using the second example from this page http://www.postgresql.org/docs/9.2/static/error-message-reporting.html)?

Cheers,
James

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Thu, Sep 19, 2013 at 12:56 AM, Peter Eisentraut pete...@gmx.net wrote:

On 7/8/13 9:33 PM, James Sewell wrote:

New patch attached. I've moved from using a boolean to an enum trivalue.

When ldapreferrals is set to something other than 0 or 1 exactly, it just ignores the option. That's not good, I think. It should probably be an error.
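Following the second example on that page, the error might look something like this (a sketch; the variable name and wording are illustrative):

    ereport(ERROR,
            (errcode(ERRCODE_CONFIG_FILE_ERROR),
             errmsg("invalid ldapreferrals value: \"%s\"", val),
             errhint("Valid values are 0 and 1.")));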
Re: [HACKERS] PSQL return coder
I was avoiding ON_ERROR_STOP because I was using ON_ERROR_ROLLBACK, but have just realised that if I encase my SQL in a transaction then rollback will still happen. Perfect!

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Fri, Oct 11, 2013 at 12:25 AM, Merlin Moncure mmonc...@gmail.com wrote:

On Thu, Oct 10, 2013 at 1:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:

James Sewell james.sew...@lisasoft.com writes: My question is in a rollback scenario is it possible to get PSQL to return a non 0 exit status?

Maybe you could use -c instead of -f?

$ psql -c 'select 1; select 1/0' regression
ERROR: division by zero
$ echo $?
1

You won't need explicit BEGIN/END because this is already a single transaction.

According to the man page:

EXIT STATUS
psql returns 0 to the shell if it finished normally, 1 if a fatal error of its own (out of memory, file not found) occurs, 2 if the connection to the server went bad and the session was not interactive, and 3 if an error occurred in a script and the variable ON_ERROR_STOP was set.

So for a longer script ON_ERROR_STOP might be the ticket (which is usually a good idea anyways).

merlin
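Putting the two together for a file-based script (table and database names are illustrative): the open transaction is rolled back when psql bails out, and the exit status is 3 as the man page excerpt above describes.

    $ cat script.sql
    BEGIN;
    INSERT INTO t VALUES (1);
    SELECT 1/0;              -- fails here
    COMMIT;
    $ psql -v ON_ERROR_STOP=1 -f script.sql mydb
    psql:script.sql:3: ERROR:  division by zero
    $ echo $?
    3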
[HACKERS] PSQL return coder
Hello,

I am using PSQL to run SQL from a file with the -f flag as follows:

BEGIN
SQL
SQL
...
END

This gives me rollback on error and a nicer output than -1. This works fine. My question is in a rollback scenario is it possible to get PSQL to return a non 0 exit status?

Cheers,
James

James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] pg_dump and schema names
I was under the impression that every command that references a relation makes use of the search path, regardless of what it is *doing*. Maybe this is different in older versions though? I actually ran into this recently and had to remove all the xx. schema components using vi before I could run the dump to move to a different schema.

Just to be sure:

postgres=# create schema test;
CREATE SCHEMA
postgres=# set search_path to test;
SET
postgres=# create table test1(id serial);
NOTICE: CREATE TABLE will create implicit sequence test1_id_seq for serial column test1.id
CREATE TABLE
postgres=# alter table test1 owner to postgres;
ALTER TABLE

Cheers,

James Sewell, Solutions Architect

On Fri, Aug 9, 2013 at 2:04 PM, Bruce Momjian br...@momjian.us wrote:

pg_dump goes to great lengths not to hard-code the schema name into commands like CREATE TABLE, instead setting the search_path before creating the table; these commands:

CREATE SCHEMA xx;
CREATE TABLE xx.test(x int);

generate this output:

SET search_path = xx, pg_catalog;
CREATE TABLE test (
    x integer
);

If you dump a schema and want to reload it into another schema, you should only need to update that one search_path line. However, later in the dump file, we hardcode the schema name for setting the object owner:

ALTER TABLE xx.test OWNER TO postgres;

Could we use search_path here to avoid the schema designation? I am not sure this is possible because while CREATE defaults to the first existing schema in the search_path, I am not sure ALTER has the same behavior because you are not _creating_ something with ALTER.

--
Bruce Momjian br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ It's impossible for everything to be true. +
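Until then, the vi step can at least be automated; a sketch, assuming the dump is being moved from schema xx to a hypothetical schema yy:

    $ pg_dump -n xx mydb > dump.sql
    $ sed -i -e 's/^ALTER TABLE xx\./ALTER TABLE /' \
             -e 's/^SET search_path = xx,/SET search_path = yy,/' dump.sql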
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 14/07/2013 20:13, Greg Smith wrote:

The most efficient way to write things out is to delay those writes as long as possible.

That doesn't smell right to me. It might be that delaying allows more combining and allows the kernel to see more at once and optimise it, but I think the counter-argument is that it is an efficiency loss to have either CPU or disk idle waiting on the other. It cannot make sense from a throughput point of view to have disks doing nothing and then become overloaded so they are a bottleneck (primarily seeking) and the CPU does nothing.

Now I have NOT measured behaviour, but I'd observe that we see disks that can stream 100MB/s but do only 5% of that if they are doing random IO. Some random seeks during sync can't be helped, but if they are done when we aren't waiting for sync completion then they are in effect free. The flip side is that we can't really know whether they will get merged with adjacent writes later, so it's hard to schedule them early. But we can observe that if we have a bunch of writes to adjacent data then a seek to do the write is effectively amortised across them.

So it occurs to me that perhaps we can watch for patterns where we have groups of adjacent writes that might stream, and when they form we might schedule them to be pushed out early (if not immediately), ideally out as far as the drive (but not flushed from its cache) and without forcing all other data to be flushed too. And perhaps we should always look to be getting drives dedicated to dbms to do something, even if it turns out to have been redundant in the end.

That's not necessarily easy on Linux without using direct unbuffered IO, but to me that is Linux' problem. For a start it's not the only target system, and having "we need" feedback from db and mail system groups to the NT kernel devs hasn't hurt, and it never hurt Solaris to hear what Oracle and Sybase devs felt they needed either.
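Linux does in fact have one knob in this direction already: sync_file_range() can start asynchronous writeback of just a range, without waiting and without the ordering side effects of a full fsync. A sketch of nudging an adjacent group of writes toward the drive early (helper name invented for illustration):

    #define _GNU_SOURCE
    #include <fcntl.h>

    static void
    nudge_writeback(int fd, off_t start, off_t len)
    {
        (void) sync_file_range(fd, start, len, SYNC_FILE_RANGE_WRITE);
    }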
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey,

New patch attached. I've moved from using a boolean to an enum trivalue. Let me know what you think.

Cheers,
James

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Thu, Jul 4, 2013 at 8:23 PM, Magnus Hagander mag...@hagander.net wrote:

On Thu, Jul 4, 2013 at 2:30 AM, James Sewell james.sew...@lisasoft.com wrote:

Heya, I see what you are saying, the problem as I see it is that the action we are taking here is disable chasing ldap referrals. If the name is ldapreferrals and we use a boolean then setting it to 1 reads in a counterintuitive manner:

That assumes that the default in the ldap library is always going to be to chase them. Does the standard somehow mandate that it should be?

set ldapreferrals=true to disable chasing LDAP referrals.

You'd obviously reverse the meaning as well. set ldapreferrals=false to disable chasing LDAP referrals.

Perhaps you are fine with this though if it's documented? It does work in the inverse way to pam_ldap, where setting to true enables referral chasing. pam_ldap works like so:

not set : library default
set to 0 : disable referral chasing
set to 1 : enable referral chasing

That is exactly what I'm suggesting it should do, and I'm pretty sure that's what Peter suggested as well.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Attachment: pgsql_ldapnochaseref_v1.2.diff
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Heya,

I see what you are saying, the problem as I see it is that the action we are taking here is disable chasing ldap referrals. If the name is ldapreferrals and we use a boolean then setting it to 1 reads in a counterintuitive manner:

That assumes that the default in the ldap library is always going to be to chase them. Does the standard somehow mandate that it should be?

set ldapreferrals=true to disable chasing LDAP referrals.

You'd obviously reverse the meaning as well. set ldapreferrals=false to disable chasing LDAP referrals.

Perhaps you are fine with this though if it's documented? It does work in the inverse way to pam_ldap, where setting to true enables referral chasing. pam_ldap works like so:

not set : library default
set to 0 : disable referral chasing
set to 1 : enable referral chasing

That is exactly what I'm suggesting it should do, and I'm pretty sure that's what Peter suggested as well.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey Peter,

You are correct, it is the same as the referrals option in pam_ldap. It's also the -C (sometimes -R - it seems ldapsearch options are pretty non-standard) option in ldapsearch. As far as I'm aware you can't pass this in an LDAP URL, primarily because this never gets sent to the LDAP server. The server always returns an LDIF with inline references; this just determines if you chase them client side or just list them as is.

I could be missing something here, but using:

ldapreferrals={0|1}

would require a three state type, as we need a way of not interfering with the library defaults? To 'enable' the new behavior here using a boolean you would need to set ldapreferrals=false - which with the normal way of dealing with config booleans would alter the default behavior if the option was not specified.

How do you feel about:

ldapdisablereferrals=(0|1)

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

On Tue, Jul 2, 2013 at 10:46 PM, Peter Eisentraut pete...@gmx.net wrote:

On 7/2/13 12:20 AM, James Sewell wrote:

Hey All, This patch request grew from this post (of mine) to pgsql-general: http://www.postgresql.org/message-id/cabuevezouae-g1_oejagujjmem675dnystwybp4d_wz6om+...@mail.gmail.com The patch adds another available LDAP option (ldapnochaseref) for search+bind mode in the pg_hba.conf file. If set to 1 (0 is default) then it performs a ldap_set_option which disables chasing of any LDAP references which are returned as part of the search LDIF.

This appears to be the same as the referrals option in pam_ldap (http://linux.die.net/man/5/pam_ldap). So it seems legitimate. For consistency, I would name the option ldapreferrals={0|1}. I prefer avoiding double negatives. Do you know of a standard way to represent this option in an LDAP URL, perhaps as an extension?
[HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey All,

This patch request grew from this post (of mine) to pgsql-general: http://www.postgresql.org/message-id/cabuevezouae-g1_oejagujjmem675dnystwybp4d_wz6om+...@mail.gmail.com

The patch adds another available LDAP option (ldapnochaseref) for search+bind mode in the pg_hba.conf file. If set to 1 (0 is default) then it performs a ldap_set_option which disables chasing of any LDAP references which are returned as part of the search LDIF.

I can think of two use cases for this:

1. (the case which spawned my email) A valid search is performed, but for some reason a ref: with a non responsive LDAP server is returned as well, which causes the authentication to time out (could be intermittent if DNS round robin or similar is used and some of the LDAP servers are not functioning / a packet dropping firewall is in the way).

2. (a case I found when testing with AD) A valid search is performed and 6 ref: entries are returned, which all must be chased before authentication can succeed. Setting ldapnochaseref speeds up authentication with no negative cost (assuming you understand your LDAP schema).

I think it's worth noting that this setting seems to be the default for ldapsearch on Linux these days.

Hopefully I found all the documentation that I was meant to update, let me know if not though.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect

Attachment: pgsql_ldapnochaseref_v1.diff
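A pg_hba.conf line using the proposed option might look like this (host names and base DN are illustrative; the option name is the one from this version of the patch):

    host all all 0.0.0.0/0 ldap ldapserver=ldap.example.com ldapbasedn="dc=example,dc=com" ldapsearchattribute=uid ldapnochaseref=1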
Re: [HACKERS] plpython implementation
On 01/07/2013 02:43, Claudio Freire wrote:

In essence, you'd have to use another implementation. CPython guys have left it very clear they don't intend to fix that, as they don't consider it a bug. It's just how it is.

Given how useful it is to have a scripting language that can be used outside of the database as well as inside it, would it be reasonable to consider 'promoting' pllua? My understanding is that it (lua) is much cleaner under the hood (than CPython). Although I do recognise that Python as a whole has always had more traction.
Re: [HACKERS] C++ compiler
On 25/06/2013 05:16, Tom Lane wrote:

It might be time to reconsider whether we should move the baseline portability requirement up to C99.

My understanding was that you picked up a lot of users when the Win32 port became useful. While you can build with msys, I would think that leaving Microsoft's tooling behind would be a mistake, and as far as I am aware they have said that they are supporting C++11 but not bothering with C99.

I'm really not in favor of moving to C++ though, as the portability-vs-usefulness tradeoffs seem pretty unattractive there.

As a long-time C++ programmer I don't see what the problem would be beyond (some) existing contributors being wary of the unknown. It's not as if any platform developed enough to be a sane db server has not got a decent C++ compiler or two. Portability is only really a problem with a subset of new C++11 features.
Re: [HACKERS] Parallel Sort
Have you considered GPU-based sorting? I know there's been discussion in the past. If you use OpenCL, then you can use a CPU driver if there is no GPU, and that can allow you to leverage all the CPU cores without having to do the multi-thread stuff in the backend. While the compilation of a specific kernel can be quite expensive, it also has the effect of a JIT compiler in terms of system independence.
Re: [HACKERS] [GENERAL] Floating point error
HL == Heikki Linnakangas hlinnakan...@vmware.com writes:

HL It would be nice to have a base-2 text format to represent floats.
HL It wouldn't be as human-friendly as base-10, but it could be used
HL when you don't want to lose precision. pg_dump in particular.

Hexadecimal notation for floats exists. The printf format flag is %a for minuscule and %A for majuscule. The result of 1./3. is 0xa.aabp-5.

This site has some info and a conversion demo: http://gregstoll.dyndns.org/~gregstoll/floattohex/

-JimC
--
James Cloos cl...@jhcloos.com  OpenPGP: 1024D/ED7DAEA6
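A quick demonstration in C (note the leading hex digit is not fixed by the standard; glibc normalizes to a leading 1, so the same value prints as 0x1.5555555555555p-2 rather than in the 0xa.aabp-5 form quoted above):

    #include <stdio.h>

    int main(void)
    {
        printf("%a\n", 1. / 3.);   /* e.g. 0x1.5555555555555p-2 with glibc */
        printf("%A\n", 1. / 3.);   /* the same, in majuscule */
        return 0;
    }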
Re: [HACKERS] Visual Studio 2012 RC
Anyway, this is getting way off track. The point is that the MS SDKs and compilers are a bit of a mess and that MinGW support is useful because we can't rely on them continuing to offer free SDKs and compilers in future.

Well, more compilers are always useful, but complaining that Microsoft might withdraw their working compilers smacks of 'what if?' paranoia. What if mingw support for Win64 was (sometimes/often/always/still) a bit rubbish? Oh wait ...
Re: [HACKERS] Visual Studio 2012 RC
On the contrary, only a few months ago there was a far from groundless fear that Microsoft would do just that. Following considerable outcry they changed their mind. But this is definitely not just paranoia. As for w64 support, the mingw-64 project exists more or less explicitly to produce 64 bit compilers, including those hosted on mingw/msys.

Huh. The only reason we have to use mingw64 or one of the assorted personal builds is because 'mingw support' doesn't deliver on its own, and last I looked there was a confusing variety of personal builds with various strengths and weaknesses. I managed to make some progress but we seem to be a ways off having a reference download (and ideally one with clang too I guess). I'd very much like there to be a good reference implementation, but the whole mingw/mingw64 thing is indicative of some problems, and reminds me of egcs.

You have references to back up your statements, and demonstrate that it wasn't primarily FUD? FWIW I think the higher entry prices of pay-for VStudio almost guarantee continued availability of a free compiler, though it might end up slightly crippled, but I'm not a product planner for MS any more than you are.
Re: [HACKERS] Strange Windows problem, lock_timeout test request
So, while no native 64-bit compilers are available for free as part of Visual Studio Express 2012 (11), it doesn't matter much. The issue is more that it's very much Microsoft's whim whether they release compilers at all and if so, which ones, when and how.

I think I have a pretty vanilla Visual Studio Express 2012 for Desktop and:

C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64>.\cl
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.51106.1 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
usage: cl [ option... ] filename... [ /link linkoption... ]

C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64>

Am I misunderstanding the discussion here? Isn't that the 64-bit tool suite? Does anyone care if the compiler is a 64 bit binary itself?
Re: [HACKERS] json api WIP patch
I had been wondering how to do such an insertion efficiently in the context of SPI, but it seems that there is no SPI_copy equiv that would allow a query parse and plan to be avoided.

Your query above would need to be planned too, although the plan will be trivial.

Ah yes, I meant that I had not found a way to avoid it (for multi-row inserts etc) from a stored proc context where I have SPI functions available.

You should not try to use it as a general bulk load facility. And it will not be as fast as COPY for several reasons, including that the Json parsing routines are necessarily much heavier than the COPY parse routines, which have in any case been optimized over quite a long period. Also, a single json datum is limited to no more than 1Gb. If you have such a datum, parsing it involves having it in memory and then taking a copy (I wonder if we could avoid that step - will take a look). Then each object is decomposed into a hash table of key value pairs, which is then used to construct the record datum. Each field name in the result record is used to look up the value in the hash table - this happens once in the case of populate_record() and once per object in the array in the case of populate_recordset(). In the latter case the resulting records are put into a tuplestore structure (which spills to disk if necessary) which is then returned to the caller when all the objects in the json array are processed. COPY doesn't have these sorts of issues. It knows without having to look things up where each datum is in each record, and it stashes the result straight into the target table. It can read and insert huge numbers of rows without significant memory implications.

Yes - but I don't think I can use COPY from a stored proc context can I? If I could use binary COPY from a stored proc that has received a binary param and unpacked the data, it would be handy. If SPI provided a way to perform a copy to a temp table and then some callback on an iterator that yields rows to it, that would do the trick I guess.

Perhaps if you give us a higher level view of what you're trying to achieve we can help you better.

I had been trying to identify a way to work with record sets where the records might be used for insert, or for updates or deletion statements, preferably without forming a large custom SQL statement that must then be parsed and planned (and which would be a PITA if I wanted to use the SQL-C preprocessor or some language bindings that like to prepare a statement and execute with params). The data I work with has a master-detail structure and insertion performance matters, so I'm trying to limit manipulations to one statement per table per logical operation even where there are multiple detail rows. Sometimes the network latency can be a pain too and that also suggests an RPC with unpack and insert locally.

Cheers
James
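The closest SPI gets today is preparing the statement once and keeping the plan, so only the first call pays for parse and plan. A sketch, with the table and parameter shapes invented for illustration:

    #include "postgres.h"
    #include "executor/spi.h"
    #include "catalog/pg_type.h"

    static SPIPlanPtr detail_plan = NULL;

    /* Insert one detail row, planning the INSERT only on the first call. */
    static void
    insert_detail(Datum master_id, Datum payload)
    {
        Datum values[2];

        if (detail_plan == NULL)
        {
            Oid argtypes[2] = { INT8OID, TEXTOID };

            detail_plan = SPI_prepare("insert into detail values ($1, $2)",
                                      2, argtypes);
            SPI_keepplan(detail_plan);  /* keep it beyond SPI_finish() */
        }
        values[0] = master_id;
        values[1] = payload;
        SPI_execute_plan(detail_plan, values, NULL, false, 0);
    }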
Re: [HACKERS] json api WIP patch
> You can use COPY from a stored procedure, but only to and from files.

I think that's in the chocolate fireguard realm though, as far as
efficiency for this sort of scenario goes, even if it's handled by
retaining an mmap'd file as workspace.

>> If SPI provided a way to perform a copy to a temp table and then some
>> callback on an iterator that yields rows to it, that would do the
>> trick I guess.
>
> SPI is useful, but it's certainly possible to avoid its use. After all,
> that's what almost the whole backend does, including the COPY code. Of
> course, it's a lot harder to write that way, which is part of why SPI
> exists. Efficiency has its price.

So it is possible to use a lower-level interface from a C stored proc?
SPI is the (only) documented direct function extension API, isn't it?

Is the issue with using the JSON data-to-record set that the parsing can
be costly? Perhaps it can be achieved with B64 of compressed protobuf, or
such. I don't mind if it seems a bit messy - the code can be generated
from the table easily enough, especially if I can use C++. I guess an
allocator that uses SPI_palloc would solve issues with memory management
on error?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json api WIP patch
> The processing functions have been extended to provide
> populate_record() and populate_recordset() functions. The latter in
> particular could be useful in decomposing a piece of json representing
> an array of flat objects (a fairly common pattern) into a set of
> Postgres records in a single pass.

So this would allow an 'insert into ... select ... from
unpack-the-JSON(...)'?

I had been wondering how to do such an insertion efficiently in the
context of SPI, but it seems that there is no SPI_copy equiv that would
allow a query parse and plan to be avoided.

Is this mechanism likely to be as fast as we can get at the moment in
contexts where copy is not feasible?
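Concretely, the sort of statement I have in mind - a sketch only, and I'm
guessing at the patch's function name and signature here:

create table target (a int, b text);

insert into target
select * from json_populate_recordset(null::target,
    '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]');

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers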
Re: [HACKERS] ILIKE vs indices
>>>>> "TL" == Tom Lane <t...@sss.pgh.pa.us> writes:

JC> Is there any contraindication to recasting:
JC>     foo ILIKE 'bar'
JC> into:
JC>     LOWER(foo) LIKE LOWER('bar')

TL> In some locales those are not equivalent, I believe, or at least
TL> shouldn't be. (What the current code actually does is a separate
TL> question.)

I see. After determining indexing based on the existence of an initial
fixed string, excluding anything matching isalpha(), it uses tolower(3)
and friends to do the actual match. So my proposal wouldn't change what
matches, but might make fixing any bugs in what *should* match more
difficult?

TL> In any case it's not obvious why LOWER rather than UPPER.

I suggested lower() because that matched all of the suggestions I found.
And, as it turns out, it matches the current behaviour too. The footnote
about adding explicit mention to the docs was expressly because it is not
otherwise obvious whether indices should use lower() or upper().

I'll ask on one of the unicode lists whether there are any locales where
a case-insensitive match should differ from a match of tolower() against
tolower().

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ILIKE vs indices
While tuning an application, I found the posts from 2003 recommending the
use of LOWER() and LIKE in place of ILIKE to take advantage of indices.

For this app, given the limitations of the upper-layer protocol it must
support, that change replaced about 30 minutes of repeated seq scans with
about 1 minute of repeated index scans! On a query-set often repeated
several times per day. (Probably more times per day now.)

Is there any contraindication to recasting:

    foo ILIKE 'bar'

into:

    LOWER(foo) LIKE LOWER('bar')

and documenting that an index has to be on LOWER(column) to benefit
ILIKE?

Perhaps the parser could read the former as the latter?
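In index-and-query form, the recast I mean looks like this (names
invented; text_pattern_ops only matters for wildcard-tail patterns in a
non-C locale):

CREATE TABLE t (foo text);
CREATE INDEX foo_lower_idx ON t (LOWER(foo) text_pattern_ops);

-- instead of:  SELECT foo FROM t WHERE foo ILIKE 'bar%';
SELECT foo FROM t WHERE LOWER(foo) LIKE LOWER('bar%');

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers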
Re: [HACKERS] Testing 9.2 in ~production environment
Updating pg_database to set datctype='C' did solve the speed issues with
the two large dbs. Presumably, since LC_CTYPE=en_US.UTF-8 was in the env
when I ran pg_restore, it overrode the ctype setting in the dump files.

Some of the slow selects do use ilike; even w/ datctype='C' the indices
are skipped for at least this query:

# explain analyze SELECT mb_id FROM mb WHERE name ILIKE 'foo@bar' AND ownerid=7;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on mb  (cost=0.00..570.96 rows=3 width=4) (actual time=9.443..25.039 rows=1 loops=1)
   Filter: ((name ~~* 'foo@bar'::text) AND (ownerid = 7))
   Rows Removed by Filter: 34827
 Total runtime: 25.071 ms
(4 rows)

The mb table has several indices, including separate ones on name and
ownerid. (Not my design, btw. And I really do need to re-write the
middleware.)

Whether it is strcoll(3) (even though LC_COLLATE is explicitly C) or
LIKE, it does make a significant difference for those two apps.
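(To check whether the planner merely prefers the seq scan, as opposed to
none of the indices being usable for ILIKE at all, I compare against:

SET enable_seqscan = off;
explain analyze SELECT mb_id FROM mb WHERE name ILIKE 'foo@bar' AND ownerid=7;
RESET enable_seqscan;

If it still seq scans with that set, the existing indices simply cannot
serve the ILIKE.)

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers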
Re: [HACKERS] Testing 9.2 in ~production environment
As a followup, I find that I can avoid the seq scan by adding an index to
that table as:

    create index mb_name_own_idx on mb ( lower(name), ownerid );

and changing the query from using the idiom:

    WHERE name ILIKE 'foo@bar' AND ownerid=7

to using:

    WHERE lower(name) = lower('foo@bar') AND ownerid=7

which saves 20+ ms on each of the 30+ k such selects in a full run. I
haven't tested how fast it would be with that change and a utf8 ctype.

Because of how the middleware achieves its portability between pg, mysql
et al., changing it to use lower() and = will require significant
surgery.

Is there any way to specify the index such that the ILIKE query will use
said index?
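The only route I've turned up so far for making ILIKE itself indexable is
a trigram index - a sketch, assuming the pg_trgm contrib module (index
name invented). Would that be the recommended approach here?

CREATE EXTENSION pg_trgm;
CREATE INDEX mb_name_trgm_idx ON mb USING gin (name gin_trgm_ops);

-- the unmodified ILIKE query can then use the trigram index:
explain analyze SELECT mb_id FROM mb WHERE name ILIKE 'foo@bar' AND ownerid=7;

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers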
Re: [HACKERS] Testing 9.2 in ~production environment
>>>>> "JB" == Josh Berkus <j...@agliodbs.com> writes:

JB> Can you check the collations of the two databases? I'm wondering if
JB> 9.1 is in C collation and 9.2 is something else.

Thanks!

pg_dump -C tells me these two differences:

-SET client_encoding = 'SQL_ASCII';
+SET client_encoding = 'UTF8';

-CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LC_COLLATE = 'C' LC_CTYPE = 'C';
+CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'C' LC_CTYPE = 'en_US.UTF-8';

for every db in the clusters. I presume that lc_ctype is the significant
difference?

LC_CTYPE *is* specified as 'C' in the dump from which I created the 9.2
cluster, so it must have been overridden by pg_restore. I see that my
dist's /etc rc script now sets LC_CTYPE. Would that explain why lc_ctype
changed between the two clusters?

Is there any way to alter a db's lc_ctype w/o dumping and restoring? I
want to preserve some of the changes made since I copied the 9.1 cluster.
ALTER DATABASE reports that lc_ctype cannot be changed.
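For the archives: the per-database settings can also be read straight
from the catalog, without a dump:

SELECT datname, pg_encoding_to_char(encoding) AS encoding,
       datcollate, datctype
  FROM pg_database;

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers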
[HACKERS] Testing 9.2 in ~production environment
I'm giving 9.2-beta2 a test simulating a production workflow. Everything
looks OK except the speed. Most (all?) queries take about five to six
times as long as they do with 9.1.

The configurations are essentially the same; the query plans are the
same.

A (hot) example, pulled semi-randomly from a run, with the names mangled
to protect the innocent:

= 9.1 =

 Nested Loop  (cost=0.00..26.92 rows=1 width=28) (actual time=0.114..0.514 rows=19 loops=1)
   ->  Index Scan using ms_pkey on ms msg  (cost=0.00..26.03 rows=1 width=20) (actual time=0.026..0.207 rows=19 loops=1)
         Index Cond: ((ms_id >= 407) AND (ms_id <= 435) AND (mb_id = 50222))
         Filter: (status = ANY ('{0,1,2}'::integer[]))
   ->  Index Scan using ph_pkey on ph pm  (cost=0.00..0.87 rows=1 width=16) (actual time=0.010..0.010 rows=1 loops=19)
         Index Cond: (id = msg.ph_id)
 Total runtime: 0.605 ms

= 9.2 =

 Nested Loop  (cost=0.00..30.12 rows=1 width=28) (actual time=0.439..2.540 rows=19 loops=1)
   ->  Index Scan using ms_pkey on ms msg  (cost=0.00..29.18 rows=1 width=20) (actual time=0.155..1.157 rows=19 loops=1)
         Index Cond: ((ms_id >= 407) AND (ms_id <= 435) AND (mb_id = 50222))
         Filter: (status = ANY ('{0,1,2}'::integer[]))
   ->  Index Scan using ph_pkey on ph pm  (cost=0.00..0.93 rows=1 width=16) (actual time=0.053..0.054 rows=1 loops=19)
         Index Cond: (id = msg.ph_id)
 Total runtime: 2.752 ms

All of the tables and indices for the run in question fit into ram. The
effective cache, work mem, costs, etc were optimized in 9.0 and kept for
9.1 and the beta. That the plans are the same suggests that isn't the
problem, yes?

I think I recall mention from a previous beta (but goog isn't helping me
confirm) that there is some extra debugging or such enabled in the betas.
If so, and if turning that off would provide a better comparison, where
in the src should I look?

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Testing 9.2 in ~production environment
>>>>> "AF" == Andres Freund <and...@2ndquadrant.com> writes:

AF> Is it possible that you compiled with assertions enabled? That would
AF> roughly fit that magnitude. "SHOW debug_assertions;" should show you
AF> whether it was enabled.

Thanks, but SHOW debug_assertions reports off.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Testing 9.2 in ~production environment
>>>>> "PE" == Peter Eisentraut <pete...@gmx.net> writes:

PE> That depends on how you built it. Just being a beta by itself doesn't
PE> turn on any extra debugging.

OK. So either I misremembered or it was something no longer done.

PE> That depends on how you built it.

It's a Gentoo box; both were built from their ebuilds, with the same gcc,
flags, etc.

PE> Compare the output of pg_config --configure from both installations.

The only differences are 9.1 vs 9.2 in the paths.

Thanks,
-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fake async rep target
Well, I was assuming that there was some intelligence in the receiver
that could effectively parse this for the application; are you suggesting
that it is effectively binary deltas to apply to raw pages?

Certainly, Sybase rep server works by creating function calls or SQL
updates (depending on how you set it all up), and a fairly simple parser
can be used to process the received stream.

I was hoping that the receiver code could be used to at least identify
which tuples are affected and their before/after primary keys. Ideally
the result would be a packet per transaction, listing the impacted tables
and, for each table, a list of affected primary keys and whether they
were inserted, deleted or updated.

James

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fake async rep target
How easy would it be to implement a fake async rep target? Perhaps even
as something that a server could allow a connection to request? (i.e. a
suitably permissioned connection could convert itself to receive an async
replication stream, rather than being statically configured?)

I know that it sounds a bit bonkers, but a while back I worked on a
system where we configured a rep target (using OpenServer) so we could
observe changes to tables and enqueue secondary processing. Rather
painful in that case because of the way that repserver is configured, and
I'm not sure it was worth the pain when configuring test and dev
environments.

However, in principle, it seems that this is quite an elegant stand-in
for a whole raft of trigger functions - and probably a lot cheaper to
execute. The key, I think, is to be able to allow dynamic attachment of
such a 'change feed' by an account that has god-like read access.

Is the existing async rep code amenable to this sort of abuse?
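To make that concrete, the kind of session I'm picturing - entirely
hypothetical syntax, nothing like this exists today:

-- hypothetical: a suitably-privileged, otherwise ordinary connection
-- asks to become a change-feed target for a set of tables
START CHANGE FEED FOR TABLE orders, order_lines;
-- the server would then stream, per committed transaction:
--   (table name, operation, before/after primary key)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers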
Re: [HACKERS] Bugs in our Windows socket code
> That is, if you request FD_WRITE events for a pre-existing socket with
> WSAEventSelect, you will not get one until the outbound network buffer
> has been filled and then has partially emptied. (This is incredibly
> broken, but Microsoft evidently has no intention of fixing it.)

I think you should distinguish between 'broken' and 'not what I am used
to'. Microsoft have always had edge-triggered delivery here, rather than
the level-triggered delivery on BSD. But WinSock is not BSD sockets. If
you want to remember what the state is, associate a flag with the socket.

Microsoft are very good at retaining compatibility; 'fixing it' in your
terms would be crass breakage.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Have we out-grown Flex?
I haven't tried quex, but I have tried lemon (which can be broken out of
SQLite), re2c and ragel. I like ragel and lemon, and from memory the
combination supports a push-parser style; many tools are inconvenient
unless you are prepared to suck in a whole message before parsing, or let
the parser drive a pull loop, or use a coroutine structure.

Could go all trendy and use a PEG tool like, er, peg
(http://piumarta.com/software/peg/). (I haven't tried them tho')

James

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Have we out-grown Flex?
Doesn't that imply that a plan cache might be worthwhile?

But no matter: didn't the OP really have an issue with packaging and
Windows support? There are a lot of Windows users, and in general there
are many Windows devs: making it easier for them to contribute has to be
good, doesn't it?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Have we out-grown Flex?
I believe there are tools that are significantly faster than flex; re2c,
for instance, I believe generates faster code. But the key thing is to
test, probably, or perhaps ask around. I'm out of touch, but from memory
flex wasn't the be-all and end-all.

Lemon is definitely easy to maintain/port and the result is pretty nice,
too (I know Bison/Yacc wasn't the focus here).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] swapcache-style cache?
Has anyone considered managing a system like the DragonFly swapcache for
a DBMS like PostgreSQL? I.e. where the admin can assign drives with good
random read behaviour (but perhaps also-ran random write), such as SSDs,
to provide a cache for blocks that were dirtied, with async write that
hopefully writes them out before they are forcibly discarded. And where a
cache fail (whether by timeout, hard fail, or CRC fail) just means having
to go back to the real transactional storage.

I'd been thinking that swapcache would help where the working set won't
fit in RAM, as would L2ARC on Solaris - but it seems to me that there is
no reason not to allow the DBMS to manage the set-aside area itself,
where it is given either access to the raw device or to a pre-sized file
on the device it can map in segments.

While L2ARC is obviously very heavyweight and entwined in ZFS,
DragonFly's swapcache seems to me remarkably elegant and, it would seem,
very effective.

James

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Postgres / plpgsql equivalent to python's getattr() ?
On Aug 4, 2011, at 4:55 AM, Florian Pflug wrote:

> @OP: Here's my implementation of the feature you desire as a set of
> C-language functions: https://github.com/fgp/pg_record_inspect. Other
> people did code up similar things in the past, but I currently cannot
> find any links to their work. But a little bit of digging in the
> mailing list archives should turn them up.

Many thanks, Florian, we'll be checking that out.

James
James Robinson
Socialserve.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Postgres / plpgsql equivalent to python's getattr() ?
Hackers,

Python's getattr() allows for dynamic lookup of attributes on an object,
as in:

    inst = MyClass(x=12, y=24)
    v = getattr(inst, 'x')
    assert v == 12

Oftentimes in writing data validating trigger functions, it'd be real
handy to be able to do a similar thing in plpgsql against column values
in a row or record type, such as making use of a trigger argument for a
hint as to what column to consider in this table's case. Oh, to be able
to do something like (toy example known to be equivalent to a check):

CREATE OR REPLACE FUNCTION must_be_positive() RETURNS TRIGGER AS $$
begin
    if getattr(NEW, TG_ARGV[0]) <= 0 then
        raise exception '% must be positive', TG_ARGV[0];
    end if;
    -- after trigger
    return null;
end;
$$ LANGUAGE PLPGSQL;

A function which takes a row + a text column name, and / or a peer
function taking row + index within row, would really open up plpgsql's
expressivity in cases where you're writing mainly SQL stuff, not really
wanting to go over to plpythonu or whatnot (whose description of rows are
as dicts).

Is there something in the internals which inherently prevents this? Or am
I a fool and it already exists? Not having to defer to EXECUTE would be
attractive.
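For contrast, the EXECUTE formulation I'm trying to avoid looks something
like this (a sketch, using 8.4's EXECUTE ... USING):

CREATE OR REPLACE FUNCTION must_be_positive() RETURNS TRIGGER AS $$
declare
    v numeric;
begin
    -- dynamically fish the named column's value out of NEW
    EXECUTE 'SELECT ($1).' || quote_ident(TG_ARGV[0])
        INTO v USING NEW;
    if v <= 0 then
        raise exception '% must be positive', TG_ARGV[0];
    end if;
    -- after trigger
    return null;
end;
$$ LANGUAGE PLPGSQL;

James Robinson
Socialserve.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers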
Re: [HACKERS] pl/python tracebacks
On Mar 1, 2011, at 12:10 PM, Jan Urbański wrote:

> So you end up with a context message saying "PL/Python function %s" and
> a detail message with the saved detail (if it's present) *and* the
> traceback. The problem is that the name of the function is already in
> the traceback, so there's no need for the context *if* there's a
> traceback present.
>
> The problem I'm having is technical: since the callback is already set
> when the code reaches the traceback-printing stage, you can't really
> unset it. AFAICS the elog code calls *all* callbacks from
> error_context_stack. So I can't prevent the context message from
> appearing. If I make the traceback part of the context as well, it's
> just going to appear together with the message from the callback.

I remember going through a lot of pain getting this done right in
pg-python [pl/py].

SELECT it_blows_up();
ERROR:  function's "main" raised a Python exception
CONTEXT: [exception from Python]
Traceback (most recent call last):
   File "public.it_blows_up()", line 13, in main
    three()
   File "public.it_blows_up()", line 10, in three
    return two()
   File "public.it_blows_up()", line 7, in two
    return one()
   File "public.it_blows_up()", line 4, in one
    raise OverflowError("there's water everywhere")
OverflowError: there's water everywhere
[public.it_blows_up()]

IIRC, I unconditionally print the [public.it_blows_up()] part iff it's
not an ERROR. If it is an ERROR, I let the traceback rendering part of
the code handle it on the PL's entry point exit.

It was really tricky to do this because I was rendering the traceback
*after* the error_context_stack had been called.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pl/python improvements
On Dec 23, 2010, at 3:38 AM, Jan Urbański wrote:

> Oh, didn't know that. I see that it does some more fancy things, like
> defining an inheritance hierarchy for these exceptions and adding some
> more into the mix.

Right, there were some cases that appeared to benefit from larger buckets
than what the existing code classes provided. Also, some of the
exceptions in there are strictly for py-postgresql/client-side things.

> The names I used are not really invented, they're just plpgsql
> condition names from
> http://www.postgresql.org/docs/current/static/errcodes-appendix.html
> with underscores changed to camel case. Also, since they're
> autogenerated from utils/errcodes.h they don't have any hierarchy, they
> just all inherit from SPIError.

For the backend setting, I think this is quite appropriate. However, for
pg-python, I had mixed feelings about this as I wanted to be able to
leverage py-postgresql's hierarchy, but still have the projects
independent. I ended up punting on this one by using a single error class
and forcing the user to compare the codes. =(

> Sticking Error to every one of them will result in things like
> SubstringErrorError, so I'm not really sold on that.

There was some creativity applied to the names in postgresql.exceptions
to accommodate things like that. (Like no redundant "Error".)

> Basically I think more PL/Python users will be familiar with condition
> names as you use them in pl/pgsql than with the names from
> py-postgresql.

I think that's a fair assumption. In fact, I think that might make a good
TODO for py-postgresql/pg-python: provide a plpgsql-code-name to
exception class mapping.

cheers, jwp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hstores in pl/python
On Dec 13, 2010, at 6:16 PM, Tom Lane wrote:

> how do you identify which type OID is really hstore?

How about an identification field on pg_type?

CREATE TYPE hstore ..., IDENTIFIER 'org.postgresql.hstore';

Where the identifier is an arbitrary string. Type information can be
looked up by the PL, and the I/O functions can be dynamically resolved
using the identifier.
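The lookup a PL would then do might be something along these lines
(typidentifier being the hypothetical new column; it does not exist
today):

SELECT oid, typinput, typoutput
  FROM pg_type
 WHERE typidentifier = 'org.postgresql.hstore';

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers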
Re: XLog vs SSD [Was: Re: [HACKERS] random write in xlog?]
>>>>> "JJ" == Jeff Janes <jeff.ja...@gmail.com> writes:

JJ> So PG always writing 8K at a time is unlikely to make a difference
JJ> compared to if it wrote a smaller amount.

Ah. Somehow I was thinking of the xlog files' 16M filesize rather than
the internal 8k block size.

If it is only writing 8k blocks then there is probably little hope of
making efficient use of flash.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
XLog vs SSD [Was: Re: [HACKERS] random write in xlog?]
>>>>> "JJ" == Jeff Janes <jeff.ja...@gmail.com> writes:

JJ> Anyway, the writes are logically sequential, but not physically. If
JJ> I remember correctly, it always writes out full blocks, even if the
JJ> last part of the block has not yet been filled with new data. When
JJ> the remainder gets filled, it then writes out the full block again,
JJ> both the already written and the new part.

What does that mean for use of a flash SSD for the xlog dir? Does the
block writing mesh well with the usage pattern a flash SSD needs to
maximize lifespan?

I'd love a dram ssd for pg_xlog and the journals for the other
filesystems, but they cost too much.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Floating-point timestamps versus Range Types
>>>>> "JD" == Jeff Davis <pg...@j-davis.com> writes:

JD> 2. Fix the input/output functions in a special mode for dump/reload,
JD> to make them true inverses.

That can be done by supporting the %A printf(3)/scanf(3) format.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Floating-point timestamps versus Range Types
>>>>> "JD" == Jeff Davis <pg...@j-davis.com> writes:

JD> 2. Fix the input/output functions in a special mode for dump/reload,
JD> to make them true inverses.

JC> That can be done by supporting the %A printf(3)/scanf(3) format.

JD> I don't happen to see a %A format in the man page, but I doubt the
JD> output would look like a timestamp.

Modern printf(1) also supports it, so an easy example:

:; printf '%A\n' 3.1415926535897932384626433832795029
0XC.90FDAA22168C235P-2

%a is the same, but with minuscule letters. It is the hex format of the
float types, and defaults to exactly enough precision. The length
modifier L makes %a expect a long double.

JD> And if it doesn't look like a timestamp, it violates the spirit of a
JD> logical dump of the data.

Point taken. Had I read the whole thread before replying I would have
been reminded that the float timestamps were archaic; that avoids any
need of %A for timestamps.

That said, the possibility of hex i/o format for the float datatypes
would be welcome.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Floating-point timestamps versus Range Types
>>>>> "TL" == Tom Lane <t...@sss.pgh.pa.us> writes:

JC> That said, the possibility of hex i/o format for the float datatypes
JC> would be welcome.

TL> It's unportable, for two different reasons:

TL> 2. The printf specifiers you want us to rely on are not standard.

They are in C99.

TL> 1. pg_dump output would become platform-specific. This is highly
TL> undesirable.

It is true that pg would have to test for them in configure and supply
alternative code wherever libc fails to support them.

I can readily accept that there are many more pressing needs. But would
such a patch for master be rejected?

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] .gitignore files, take two
I'm reading this a bit late, but...

We (Xorg) found that ignoring:

*~
*.bak
*.patch

in addition to the files generated by building is very helpful.

We do use git tag and git describe in the make dist process, as well as
git log > ChangeLog. That may be relevant; avoiding git describe's -dirty
designation is important when using it in that fashion.

But it helps to be able to run git add --all cleanly. I understand that
other git users have similar experience.

-JimC
--
James Cloos <cl...@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Postgres vs. intel ccNUMA on Linux
Hackers,

Any tips / conventional wisdom regarding running postgres on large-ish
memory ccNUMA intel machines, such as a 32G dual-quad-core showing two
NUMA nodes of 16G each? I expect each postgres backend's non-shared
memory usage to remain nice and reasonably sized, hopefully staying
within the confines of its processor's local memory region, but how will
accesses to shared memory and / or buffer cache play out? Do people tune
their backends via 'numactl'?

Furthermore, if one had more than one database being served by the
machine, would it be advisable to do this via multiple clusters instead
of a single cluster, tweaking the processor affinity of each postmaster
accordingly, trying to ensure each cluster's shared memory segments and
buffer cache pools remain local for the resulting backends?

Thanks!

James Robinson
Socialserve.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] Trouble with COPY IN
On Aug 9, 2010, at 11:49 AM, Kris Jurka wrote:

> Oh, duh. It's a server-side copy not going through the client at all.
> Here's a hopefully final patch.

Trying it out... Works for me.

I understand the resistance to the patch, but it would be quite nice to
see this wart in the rear view. =\

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Python 2.7 deprecated the PyCObject API?
On Aug 14, 2010, at 9:08 AM, Tom Lane wrote:

> Just to clarify, you're recommending something like
>
>      proc->me = PyCObject_FromVoidPtr(proc, NULL);
> +    if (proc->me == NULL)
> +        elog(ERROR, "could not create PyCObject for function");
>      PyDict_SetItemString(PLy_procedure_cache, key, proc->me);
>
> correct? (Hm, and it looks like we'd better move the pfree just above
> that...)

Almost, there's still a Python exception to report and/or clear. I only
glanced at this and didn't recall what the plpython mechanisms were for
that, thus the ambiguous complain().

> Yeah, and since we'll have to back-patch it, a fairly noninvasive patch
> would be nice. Will you work on that?

I was hoping that Peter would pop in with a patch, but I think a few
lines of CPP may suffice.. (warning: untested =)

#ifdef Py_CAPSULE_H
/*
 * Python.h (2.7 and up) includes pycapsule.h, so rely on the header
 * define to detect the API's existence.
 */
#define PyCObject_FromVoidPtr(POINTER, IGNORED) PyCapsule_New(POINTER, NULL, NULL)
#undef PyCObject_Check
#define PyCObject_Check(OBJ) PyCapsule_CheckExact(OBJ)
#define PyCObject_AsVoidPtr(OBJ) PyCapsule_GetPointer(OBJ, NULL)
#endif /* Py_CAPSULE_H */

http://svn.python.org/view/python/branches/release27-maint/Include/pycapsule.h?view=markup

yay? nay?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Python 2.7 deprecated the PyCObject API?
On Aug 13, 2010, at 5:20 PM, Tom Lane wrote:

> According to a discussion over in Fedora-land, $subject is true:
> http://lists.fedoraproject.org/pipermail/devel/2010-August/140995.html
>
> I see several calls in plpython.c that seem to refer to PyCObject
> stuff. Anybody have any idea if we need to do something about this?

Well, we should at least be checking for an exception here anyways:

    proc->me = PyCObject_FromVoidPtr(proc, NULL);
    PyDict_SetItemString(PLy_procedure_cache, key, proc->me);

    if (proc->me == NULL)
        complain();

That is, with those warnings adjustments, proc->me will be NULL and then
explode in PyDict_SetItemString:

[David Malcolm]
> However, if someone overrides the process-wide warnings settings, then
> the API can fail altogether, raising a PendingDeprecationWarning
> exception (which in CPython terms means setting a thread-specific error
> state and returning NULL).

AFA a better fix is concerned, the shortest route would seem to be to
use the new capsule stuff iff Python >= 2.7.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] Trouble with COPY IN
On Aug 6, 2010, at 4:31 PM, Kris Jurka wrote:

> binary-copy-end-v2.patch

I think there's a snag in the patch:

postgres=# COPY data FROM '/Users/jwp/DATA.bcopy' WITH BINARY;
ERROR:  row field count is -1, expected 1
CONTEXT:  COPY data, line 4

Probably a quick/small fix away, I imagine. But, I was able to trigger
the new ERROR with py-postgresql:

import postgresql as pg
db = pg.open('localhost/postgres')
q = db.prepare('copy data FROM STDIN WITH BINARY')
from itertools import chain
import sys
db.pq.tracer = sys.stderr.write
q.load_rows(chain(open('/Users/jwp/DATA.bcopy', 'rb'), (b'EXTRA',)))

↑ B(25): b'B\x00\x00\x00\x18\x00py:0x1268b30\x00\x00\x00\x00\x00\x00\x00'
↑ E(10): b'E\x00\x00\x00\t\x00\x00\x00\x00\x01'
↑ S(5): b'S\x00\x00\x00\x04'
↓ b'2'(0): b''
↓ b'G'(5): b'\x01\x00\x01\x00\x01'
↑__(7): b'PGCOPY\n'
↑__(3): b'\xff\r\n'
↑__(41): b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x04\x00\x00\x00\x01\x00\x01\x00\x00\x00\x04\x00\x00\x00\x02\x00\x01\x00\x00\x00\x04\x00\x00\x00\x03\xff\xff'
↑__(5): b'EXTRA'
↑ c(5): b'c\x00\x00\x00\x04'
↑ S(5): b'S\x00\x00\x00\x04'
↓ b'E'(95): b'SERROR\x00C22P04\x00Mreceived copy data after EOF marker\x00WCOPY data, line 4\x00Fcopy.c\x00L2081\x00RCopyFrom\x00\x00'
↓ b'Z'(1): b'I'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  <snip>
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/postgresql/driver/pq3.py", line 462, in raise_server_error
    raise server_error
postgresql.exceptions.BadCopyError: received copy data after EOF marker
  CODE: 22P04
  LOCATION: File 'copy.c', line 2081, in CopyFrom from SERVER
CONTEXT: COPY data, line 4
STATEMENT: [prepared]
  sql_parameter_types: []
  statement_id: py:0x1268b30
  string: copy data FROM STDIN WITH BINARY
CONNECTION: [idle]
  client_address: ::1/128
  client_port: 63922
  version: PostgreSQL 9.1devel on x86_64-apple-darwin10.4.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664), 64-bit
CONNECTOR: [Host] pq://jwp:*...@localhost:5432/postgres
  category: None
DRIVER: postgresql.driver.pq3.Driver

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] Trouble with COPY IN
On Jul 25, 2010, at 8:01 AM, Kris Jurka wrote:

> The JDBC driver reads server messages for multiple reasons. One of them
> is indeed to do early failure detection.

That's high quality. =)

> Another is to pickup NoticeResponse messages to avoid a network buffer
> deadlock.

That's a good catch. I don't think psql/restore would often run into this
as when COPY IN is in play, it's normally restoring a database. However,
with JDBC, I imagine COPY would more often be used to do bulk loading
into live tables that may very well cause a NOTICE. [Well, I reference
psql/libpq because I don't recall it recognizing failure during COPY IN
in the past, so I assume it's not receiving any data in that state.]

hrm, I suppose a lazy way around that problem would be to suspend all
client messages (client_min_messages) during COPY IN. Tho, I guess one
would still have to contend with NotificationResponse and
ParameterStatus..

> So this is possible to work around driver side by peeking into the
> network stream and delaying processing of the end of copy until the
> driver agrees that the copy is done,

But I don't think you would have to peek in. If the interface were to
always hold onto the last message or last n-bytes submitted to be sent,
it would be able to send the possible CopyData(EOF) and CopyDone once the
COPY operation (at the interface level) is closed/shutdown/terminated.
Granted, this is dependent on CopyData(EOF) not being in the middle of
regular CopyData, but I gather that that would end in an ErrorResponse
anyways.

> I still maintain that this is a server bug. It is not OK for the server
> to assume that the client is done and move on, the client must tell the
> server what it wants done.

I'm a bit torn here. While it would seem to be either a bug in the spec
or a bug in the server, I'm inclined to call it a wart in the server's
implementation of the spec. I don't see the fix as being dangerous, but I
imagine an implementor would want to have the workaround in place
regardless. I certainly would.

I'd be in favor of seeing this fixed in 9.x, and the documentation
updated to warn implementors about the wart in the older versions.. That
is, I don't see any reason why we can't get rid of this unsightly thing
considering the workarounds would still work with a wart-free server.

cheers, jwp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] Trouble with COPY IN
On Jul 28, 2010, at 9:53 AM, Kris Jurka wrote:

> Technically you won't get NotificationResponse until transaction end,
> so you don't need to worry about that mid copy.

Ah, thanks for noting that. It would appear my original reading of the
async section didn't get far enough beyond "Frontends must be prepared to
deal with these messages at any time, even when not engaged in a query."
I see the note below clarifying NotificationResponse.

> One of the key points of confusion is that CopyData(EOF) does not
> result in an error. It results in ignoring any further data. The
> problem I have is that for text mode it waits for CopyDone, but in
> binary mode it ends the copy sequence immediately.

That is bothersome. :\

> Additionally the interface exposed by the JDBC driver lets the user
> write arbitrary CopyData bytes to the server, so without parsing all of
> that we don't know whether they've issued CopyData(EOF) or not.

Okay, so you can't know with absolute certainty without parsing the data,
but the usual case would be handled by holding onto the last N bytes or
so. Enough to fit the EOF and perhaps a little more for paranoia's sake.

That's not to say that I'm missing the problem. When (not if, when) the
user feeds data past a CopyData(EOF), it's going to get interesting.

[Thinking about the logic necessary to handle such a case and avoid
network buffer deadlock...]

I would think the least invasive way to handle it would be to set the
CommandComplete and ReadyForQuery messages aside when they are received
if CopyDone hasn't been sent, continue the COPY operation as usual until
it is shutdown, send CopyDone and, finally, reinstate CommandComplete and
RFQ as if they were just received.. I don't think that really
accommodates CopyFail, as the status in RFQ will need to be adjusted to
match what was actually done..? Well, I'm not sure you would need to
worry about NoticeResponse after a premature CommandComplete as INSERTs
are no longer happening. ugh.

+1 for a fix.

Not directly regarding your patch, but while the discussion is in the
general area: I think it would be wise to throw an error when non-empty
CopyData messages are received after CopyData(EOF). Chances are that the
user is making a mistake and should be notified of it.

cheers, jwp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers