Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-11 Thread Etsuro Fujita

On 2015/09/11 6:02, Robert Haas wrote:

On Thu, Sep 3, 2015 at 6:25 AM, Etsuro Fujita
 wrote:

I gave it another thought; the following changes to ExecInitNode would make
the patch much simpler, ie, we would no longer need to call the new code in
ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan.  I think that would resolve the name problem also.

*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***
*** 247,254  ExecInitNode(Plan *node, EState *estate, int eflags)
 break;

  		case T_ForeignScan:
! 			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 													   estate, eflags);
  			break;

  		case T_CustomScan:
--- 247,269 ----
  			break;

  		case T_ForeignScan:
! 			{
! 				Index		scanrelid = ((ForeignScan *) node)->scan.scanrelid;
!
! 				if (estate->es_epqTuple != NULL && scanrelid == 0)
! 				{
! 					/*
! 					 * We are in foreign join inside an EvalPlanQual recheck.
! 					 * Initialize local join execution plan, instead.
! 					 */
! 					Plan	   *subplan = ((ForeignScan *) node)->fs_subplan;
!
! 					result = ExecInitNode(subplan, estate, eflags);
! 				}
! 				else
! 					result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 															   estate, eflags);
! 			}
  			break;


I don't think that's a good idea.  The Plan tree and the PlanState
tree should be mirror images of each other; breaking that equivalence
will cause confusion, at least.


IIRC, Horiguchi-san also pointed that out. Honestly, I also think that 
is weird, but IIUC, I think it can't hurt. What I was concerned 
about was EXPLAIN, but EXPLAIN doesn't handle an EvalPlanQual PlanState 
tree, at least not currently.


Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-11 Thread Michael Paquier
On Fri, Sep 11, 2015 at 12:08 AM, Nikolay Shaplov
 wrote:
> In a message of September 10, 2015, 15:46:25, Michael Paquier wrote:
>
>> > So if move tuple data parsing into separate function, then we have to pass
>> > these values alongside the tuple data. This is not convenient at all.
>> > So I did it in a way you see in the patch.
>>
>> Why is it not convenient at all? Yes, you have a point, we need those
>> fields to be able to parse the t_data properly. Still the possibility
>> to show individual fields of a tuple as a bytea array either with
>> toasted or detoasted values is a concept completely different from
>> simply showing the page items, which is what, it seems to me,
>> heap_page_items is aimed to only do. Hence, As t_infomask2, t_infomask
>> and t_bits are already available as return fields of heap_page_items,
>> we should simply add a function like that:
>> heap_page_item_parse(Oid relid, bytea data, t_infomask2 int,
>> t_infomask int, t_bits int, bool force_detoast, warning_mode bool)
>> returns bytea[]
>
> Just compare two expressions:
>
> select * from heap_page_item_attrs(get_raw_page('test', 0),'test'::regclass);
>
> and
>
> select  lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid,
> t_infomask2, t_infomask, t_hoff, t_bits, t_oid,  tuple_data_parse (
> t_data,  t_infomask, t_infomask2, t_bits, 'test'::regclass, true, false)
> from heap_page_item_attrs(get_raw_page('test', 0));
>
> The second variant is a total mess and almost unusable...

It is hard to believe that any sane application would use * in a SELECT
query. Users would, though, and we are talking about user education
here :)

> Though I've discussed this issue with friends, and we came to the conclusion
> that it might be good to implement tuple_data_parse and then implement an
> easy-to-use heap_page_item_attrs in pure SQL, using heap_page_items and
> tuple_data_parse.
> That would keep usage simple, and make the code simpler, as you suggest.

Yep. That's doable with a simple SQL function. I am not sure that it
is worth including in pageinspect though.

> The only argument against it is that in the internal representation t_bits is
> binary, while in the SQL output it is a string of '0' and '1' characters. We
> will have to convert it back to binary form. This is not difficult, just a
> useless conversion there and back again.

The reason why this is visibly converted from bit to text is that the
in-core bit data type has a fixed length, whereas in HeapTupleHeaderData
the null bitmap has a variable length.
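For illustration, converting that textual t_bits representation back into
the packed null bitmap is only a few lines of C; a hedged sketch, with a
made-up helper name:

#include "postgres.h"
#include "access/htup_details.h"	/* for BITMAPLEN */

/*
 * Sketch only (not part of the patch): rebuild the HeapTupleHeaderData-style
 * bitmap from the '0'/'1' string returned by heap_page_items
 * ('1' = bit set = attribute not null).
 */
static void
text_bits_to_bitmap(const char *t_bits_str, bits8 *bits, int natts)
{
	int			i;

	memset(bits, 0, BITMAPLEN(natts));
	for (i = 0; i < natts && t_bits_str[i] != '\0'; i++)
	{
		if (t_bits_str[i] == '1')
			bits[i / 8] |= (bits8) (1 << (i % 8));
	}
}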

> What do you think about this solution?

For code simplicity's sake this seems worth the cost.

>> Note that the data corruption checks apply only to this function as
>> far as I understand, so I think that things could be actually split
>> into two independent patches:
>> 1) Upgrade heap_page_items to add the tuple data as bytea.
>> 2) Add the new function able to parse those fields appropriately.
>>
>> As this code, as you justly mentioned, is aimed mainly for educational
>> purposes to understand a page structure, we should definitely make it
>> as simple as possible at code level, and it seems to me that this
>> comes with a separate SQL interface to control tuple data parsing as a
>> bytea[]. We are doing no favor to our users to complicate the code of
>> pageinspect.c as this patch is doing in its current state, especially
>> if beginners want to have a look at it.
>>
>> >> Actually not two functions, but just one, with an extra flag to be
>> >> able to enforce detoasting on the field values where this can be done.
>> >
>> > Yeah, I also thought about it. But did not come to any final conclusion.
>> > Should we add forth argument, that enables detoasting, instead of adding
>> > separate function?
>>
>> This is definitely something you want to control with a switch.
> OK. Let's come to a final decision on tuple_data_parse, and I will add this
> switch there and to the pure-SQL heap_page_item_attrs.

Fine for me.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-11 Thread Michael Paquier
On Fri, Sep 11, 2015 at 12:12 AM, Bruce Momjian  wrote:
> On Thu, Sep 10, 2015 at 03:46:25PM +0900, Michael Paquier wrote:
>> Why is it not convenient at all? Yes, you have a point, we need those
>> fields to be able to parse the t_data properly. Still the possibility
>> to show individual fields of a tuple as a bytea array either with
>> toasted or detoasted values is a concept completely different from
>> simply showing the page items, which is what, it seems to me,
>> heap_page_items is aimed to only do. Hence, As t_infomask2, t_infomask
>> and t_bits are already available as return fields of heap_page_items,
>> we should simply add a function like that:
>> heap_page_item_parse(Oid relid, bytea data, t_infomask2 int,
>> t_infomask int, t_bits int, bool force_detoast, warning_mode bool)
>> returns bytea[]
>
> Should pageinspect create a table that contains some of the constants
> used to interpret infomask?

Interesting idea. It may indeed be useful to show a user the mapping
between t_infomask flags and their textual meaning. I guess that we could
have an SRF function with a view on top of it that returns such a
list. The same can apply to t_infomask2.
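For illustration, the kind of flag-to-name mapping such an SRF could expose
might look like this (flag names as in access/htup_details.h; the SRF
plumbing is omitted and the struct name is illustrative):

#include "postgres.h"
#include "access/htup_details.h"

typedef struct infomask_flag
{
	uint16		mask;			/* bit value in t_infomask */
	const char *name;			/* its symbolic name */
} infomask_flag;

static const infomask_flag t_infomask_names[] = {
	{HEAP_HASNULL, "HEAP_HASNULL"},
	{HEAP_HASVARWIDTH, "HEAP_HASVARWIDTH"},
	{HEAP_HASEXTERNAL, "HEAP_HASEXTERNAL"},
	{HEAP_HASOID, "HEAP_HASOID"},
	{HEAP_XMAX_KEYSHR_LOCK, "HEAP_XMAX_KEYSHR_LOCK"},
	{HEAP_COMBOCID, "HEAP_COMBOCID"},
	{HEAP_XMAX_EXCL_LOCK, "HEAP_XMAX_EXCL_LOCK"},
	{HEAP_XMAX_LOCK_ONLY, "HEAP_XMAX_LOCK_ONLY"},
	{HEAP_XMIN_COMMITTED, "HEAP_XMIN_COMMITTED"},
	{HEAP_XMIN_INVALID, "HEAP_XMIN_INVALID"},
	{HEAP_XMAX_COMMITTED, "HEAP_XMAX_COMMITTED"},
	{HEAP_XMAX_INVALID, "HEAP_XMAX_INVALID"},
	{HEAP_XMAX_IS_MULTI, "HEAP_XMAX_IS_MULTI"},
	{HEAP_UPDATED, "HEAP_UPDATED"},
	{HEAP_MOVED_OFF, "HEAP_MOVED_OFF"},
	{HEAP_MOVED_IN, "HEAP_MOVED_IN"},
};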
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-11 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Friday, September 11, 2015 12:36 PM
> To: Robert Haas
> Cc: Kaigai Kouhei(海外 浩平); PostgreSQL-development; 花田茂
> Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> 
> On 2015/09/11 6:24, Robert Haas wrote:
> > On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
> >  wrote:
> >>> I'm wondering if there's another approach.  If I understand correctly,
> >>> there are two reasons why the current situation is untenable.  The
> >>> first is that ForeignRecheck always returns true, but we could instead
> >>> call an FDW-supplied callback routine there.  The callback could be
> >>> optional, so that we just return true if there is none, which is nice
> >>> for already-existing FDWs that then don't need to do anything.
> >>
> >> My question about this is, is the callback really needed?  If there are any
> >> FDWs that want to do the work *in their own way*, instead of just doing
> >> ExecProcNode for executing a local join execution plan in case of foreign
> >> join (or just doing ExecQual for checking remote quals in case of foreign
> >> table), I'd agree with introducing the callback, but if not, I don't think
> >> that that makes much sense.
> >
> > It doesn't seem to me that it hurts much of anything to add the
> > callback there, and it does provide some flexibility.  Actually, I'm
> > not really sure why we're thinking we need a subplan here at all,
> > rather than just having a ForeignRecheck callback that can do whatever
> > it needs to do with no particular help from the core infrastructure.
> > I think you wrote some code to show how postgres_fdw would use the API
> > you are proposing, but I can't find it.  Can you point me in the right
> > direction?
> 
> I've proposed the following API changes:
> 
> * I modified create_foreignscan_path, which is called from
> postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
> subpath, is passed as the eighth argument of the function. subpath
> represents a local join execution path if scanrelid==0, but NULL if
> scanrelid>0.
>
I'd like to suggest having multiple sub-path nodes, as custom-scan does, because
that infrastructure will also be helpful for implementing FDW drivers that can
have multiple sub-plans. One expected usage is here:
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f2...@bpxm15gp.gisp.nec.co.jp

> * I modified make_foreignscan, which is called from
> postgresGetForeignPlan, so that a list of quals, fdw_quals, is passed as
> the last argument of the function.  fdw_quals represents remote quals if
> scanrelid>0, but NIL if scanrelid==0.
>
If a callback on ForeignRecheck handles EPQ rechecks, core PostgreSQL
doesn't need to know what expression was pushed down or how it is kept in
the private field (fdw_exprs). Only the FDW driver knows which private field
holds the expression node that was pushed down to the remote side, so it
should not be an interface contract.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-11 Thread Etsuro Fujita

On 2015/09/11 6:30, Robert Haas wrote:

On Wed, Sep 9, 2015 at 2:30 AM, Etsuro Fujita
 wrote:

But that path might have already been discarded on the basis of cost.
I think Tom's idea is better: let the FDW consult some state cached
for this purpose in the RelOptInfo.


Do you have an idea of what information would be collected into that state,
and how the FDW would derive the parameterizations to consider for producing
pushed-down joins from that information?  What I'm concerned about is
reducing the number of parameterizations to consider, so as to reduce the
overhead of costing the corresponding queries.  Maybe I'm missing something, though.


I think the thing we'd want to store in the state would be enough
information to reconstruct a valid join nest.  For example, the
reloptinfo for (A B) might note that A needs to be left-joined to B.
When we go to construct paths for (A B C), and there is no
SpecialJoinInfo that mentions C, we know that we can construct (A LJ
B) IJ C rather than (A IJ B) IJ C.  If any paths survived, we could
find a way to pull that information out of the path, but pulling it
out of the RelOptInfo should always work.
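To make that concrete, the cached state for a pushed-down join could look
something like the following sketch (field names are purely illustrative,
not an agreed API):

/*
 * Illustrative only: per-joinrel state an FDW could keep in the RelOptInfo
 * (e.g. via fdw_private), sufficient to reconstruct a valid join nest -- and
 * hence the remote query text -- without depending on a surviving Path.
 */
typedef struct ForeignJoinState
{
	JoinType	jointype;		/* e.g. JOIN_LEFT for "A LJ B" */
	RelOptInfo *outerrel;		/* sides joined at this level */
	RelOptInfo *innerrel;
	List	   *joinclauses;	/* join quals applied at this level */
	List	   *remote_conds;	/* quals considered safe to push down */
} ForeignJoinState;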


So, information to address the how-to-build-the-query-text
problem would be stored in the state, in other words. Right?


I am not sure what to do about parameterizations.  That's one of my
remaining concerns about moving the hook.


I think we should also make it clear what to do about sort orderings.

Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-11 Thread David Rowley
On 3 September 2015 at 05:10, Andres Freund  wrote:

> > +/*
> > + * pg_int2str
> > + *   Converts 'value' into a decimal string representation of
> the number.
> > + *
> > + * Caller must ensure that 'str' points to enough memory to hold the
> result
> > + * (at least 12 bytes, counting a leading sign and trailing NUL).
> > + * Return value is a pointer to the new NUL terminated end of string.
> > + */
> > +char *
> > +pg_int2str(char *str, int32 value)
> > +{
> > + char *start;
> > + char *end;
> > +
> > + /*
> > +  * Handle negative numbers in a special way. We can't just append
> a '-'
> > +  * prefix and reverse the sign as on two's complement machines
> negative
> > +  * numbers can be 1 further from 0 than positive numbers, we do it
> this way
> > +  * so we properly handle the smallest possible value.
> > +  */
> > + if (value < 0)
> > + {
>
> We could alternatively handle this by special-casing INT_MIN, probably
> resulting in a bit less duplicated code.
>

Hi Andres,

I've made a few updates to the patch to clean up a few badly written
comments, and also to fix a missing Abs() call.
I've also renamed pg_int2str to pg_ltostr() to bring it more in
line with pg_ltoa.

Attached is also timestamp_delta.patch, which changes pg_ltostr() to use the
INT_MIN special case that pg_ltoa also uses. I didn't make a similar change
to pg_ltostr_zeropad() as I'd need to add even more special code to prepend
the correct number of zeros before the 2147483648. This zero-padding issue
was why I originally came up with the alternative way of handling
negative numbers. I had just thought that it was best to keep both
functions as similar as possible.
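For reference, a minimal standalone sketch of the INT_MIN special-case
approach, modelled on pg_ltoa (this is an illustration of the idea, not the
attached patch; the NUL-termination convention here is an assumption):

#include <stdint.h>
#include <string.h>

/*
 * Sketch: write the decimal representation of 'value' into 'str' (which must
 * have room for at least 12 bytes) and return a pointer to the trailing NUL.
 */
static char *
pg_ltostr_sketch(char *str, int32_t value)
{
	char	   *start;
	char	   *end;

	/* INT32_MIN cannot be negated, so emit it as a literal */
	if (value == INT32_MIN)
	{
		memcpy(str, "-2147483648", 12);
		return str + 11;
	}

	if (value < 0)
	{
		*str++ = '-';
		value = -value;
	}

	/* generate digits in reverse order ... */
	start = str;
	do
	{
		*str++ = (char) ('0' + (value % 10));
		value /= 10;
	} while (value > 0);

	end = str;
	*end = '\0';

	/* ... then reverse them in place */
	str--;
	while (start < str)
	{
		char		tmp = *start;

		*start++ = *str;
		*str-- = tmp;
	}
	return end;
}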

I've not done anything yet to remove the #ifdef HAVE_INT64_TIMESTAMP from
AppendSeconds(). The only way I can think to handle this is to just
make fsec_t unconditionally an int64 (since with float datetimes the
precision is 10 decimal digits after the dec point, this does not fit in
int32). I'd go off and do this, but this means I need to write an int64
version of pg_ltostr(). Should I do it this way?

I also did some tests on server grade hardware, and the performance
increase is looking a bit better.

create table ts (ts timestamp not null);
insert into ts select generate_series('2010-01-01 00:00:00', '2011-01-01
00:00:00', '1 sec'::interval);
vacuum freeze ts;

Master

tpch=# copy ts to '/dev/null';
COPY 31536001
Time: 17071.255 ms

Patched
tpch=# copy ts to '/dev/null';
COPY 31536001
Time: 6402.835 ms

The times don't seem to vary any depending on the attached
timestamp_delta.patch

Regards

David Rowley


timestamp_delta.patch
Description: Binary data


timestamp_out_speedup_2015-09-12.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi-column distinctness.

2015-09-11 Thread Tomas Vondra


On 09/07/2015 05:25 AM, Kyotaro HORIGUCHI wrote:

Hello,


but making the COLUMN
required is certainly much worse as it breaks many existing scripts. The
keyword only breaks cases that manipulate a "statistics" column.


Ouch! That is simply an accident, or my lack of carefulness. I will
come up with a fixed syntax later.


If any of this is unacceptable, then we probably need to come up with a
different syntax.


I've been thinking about the syntax, and I think both options (making 
COLUMN required or making STATISTICS a reserved keyword) will break 
something no matter what we do, forcing the users to either always use 
ADD COLUMN or quote all the existing uses of "statistics" (as column 
names, for example).


Maybe the best solution is to abandon the ALTER TABLE approach entirely, 
and instead invent a new set of commands


  CREATE STATISTICS
  DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics, just 
like we do for constraints, for example. Otherwise the DROP STATISTICS 
handling is rather awkward - for example if the user creates stats twice 
by mistake, he's unable to drop just one of them.


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small patch to fix an O(N^2) problem in foreign keys

2015-09-11 Thread Kevin Grittner
Jan Wieck  wrote:

>>> The patch uses a heuristic method of detecting when the hash table
>>> should be destroyed and recreated. InvalidateConstraintCacheCallBack()
>>> adds the current size of the hash table to a counter. When that sum
>>> reaches 1,000,000, the hash table is flushed. This improves the schema
>>> restore of a database with 100,000 foreign keys by factor 3.
>>>
>>> According to my tests the patch does not interfere with the bulk
>>> updates, the original feature was supposed to improve.
>>
>> In the real-world customer case that caused you to look into this,
>> I thought 45ba424f drove schema-only restore time from 2 hours to
>> about 25 hours, and that this patch takes it back down to 2 hours.
>> Am I remembering right?  And this came about because it added
>> 20-some hours to a pg_upgrade run?
>
> From 2 hours to >50, but yes, this is that case.
>
>> If there are no objections, I will push this as a bug fix back to
>> 9.3, where the performance regression was introduced.

Hearing no objections, pushed with one minor tweak Jan and I
discussed off-list -- the original patch duplicated a bit of code
that we agreed should be split into a static function and called
from both places, to protect against someone later changing one and
missing the other.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-11 Thread Andres Freund
On 2015-09-12 04:00:26 +1200, David Rowley wrote:
> I've not done anything yet to remove the #ifdef HAVE_INT64_TIMESTAMP from
> AppendSeconds(). The only way I can think to handle this is to just
> make fsec_t unconditionally an int64 (since with float datetimes the
> precision is 10 decimal digits after the dec point, this does not fit in
> int32). I'd go off and do this, but this means I need to write an int64
> version of pg_ltostr(). Should I do it this way?

I don't have time to look into this in detail right now. But isn't that
the wrong view?

The precision with float timestamps is still microseconds which you can
easily express in an integer. That the value can be higher is just
because fsec_t for floating points also includes seconds, right? It
shouldn't be too hard to separate those?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread David Rowley
On 11 September 2015 at 22:23, YUriy Zhuravlev 
wrote:

>
> Without patch we have 417523 TPS and with patch 965821 TPS for big x86
> server.
> All details here: https://gist.github.com/stalkerg/773a81b79a27b4d5d63f
>
>
Impressive!

I've run this on a single CPU server and don't see any speedup, so I assume
I'm not getting enough contention.
As soon as our 4 socket machine is free I'll try a pgbench run with that.

Just for fun, what's the results if you use -M prepared ?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread YUriy Zhuravlev
On Friday 11 September 2015 18:14:21 Andres Freund wrote:
> This way we can leave the for (;;) loop
> in BufferAlloc() thinking that the buffer is unused (and can't be further
> pinned because of the held spinlock!)

We lose the lock after PinBuffer_Locked in BufferAlloc. Therefore, in essence, 
nothing has changed. 
-- 
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 3:12 AM, Etsuro Fujita
 wrote:
> So, information to address the how-to-build-the-query-text
> problem would be stored in the state, in other words. Right?

Right.

>> I am not sure what to do about parameterizations.  That's one of my
>> remaining concerns about moving the hook.
>
> I think we should also make it clear what to do about sort orderings.

How does that come into it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] New gist vacuum.

2015-09-11 Thread Костя Кузнецов
Hello. I am a student from the GSoC program. My project is sequential access in GiST vacuum.

The new vacuum has two big steps: a physical-order scan of the pages, and a cleanup pass after the first step.

1. The first scan reads all pages, builds an information map (a hashmap), and adds pages that need rescanning to a rescan stack.
2. The second step works only with the pages (from the rescan stack) where there are changes.

Besides the increased speed, the new version of vacuum also deletes pages. Only leaf pages can be deleted. The process of deleting a page is: 1. delete the link to the page; 2. change rightlinks (if needed); 3. set the deleted flag. I added two WAL actions (when the delete flag is set and when rightlinks are changed). When deleting links to leaf pages from an inner page, I always keep one link to a leaf, to avoid empty inner pages.

I attach some speed benchmarks. I compared the old and new versions on my laptop (without SSD). The test: table "point_tbl" from the regression database. I inserted about 200 million rows, then deleted 33 million and ran vacuum. The size of the index is about 18 GB.

Old version:

INFO: vacuuming "public.point_tbl"
INFO: scanned index "gpointind" to remove 11184520 row versions
DETAIL: CPU 84.70s/72.26u sec elapsed 27007.14 sec.
INFO: "point_tbl": removed 11184520 row versions in 400715 pages
DETAIL: CPU 3.96s/3.10u sec elapsed 233.12 sec.
INFO: scanned index "gpointind" to remove 11184523 row versions
DETAIL: CPU 87.10s/69.05u sec elapsed 26410.44 sec.
INFO: "point_tbl": removed 11184523 row versions in 400715 pages
DETAIL: CPU 4.23s/3.36u sec elapsed 331.43 sec.
INFO: scanned index "gpointind" to remove 11184523 row versions
DETAIL: CPU 87.65s/65.73u sec elapsed 26230.35 sec.
INFO: "point_tbl": removed 11184523 row versions in 400715 pages
DETAIL: CPU 4.47s/3.41u sec elapsed 342.93 sec.
INFO: scanned index "gpointind" to remove 866 row versions
DETAIL: CPU 79.97s/39.64u sec elapsed 23341.88 sec.
INFO: "point_tbl": removed 866 row versions in 31 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "gpointind" now contains 201326592 row versions in 2336441 pages
DETAIL: 33554432 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.

New version:

INFO: vacuuming "public.point_tbl"
INFO: scanned index "gpointind" to remove 11184520 row versions
DETAIL: CPU 13.00s/27.57u sec elapsed 1864.22 sec.
INFO: "point_tbl": removed 11184520 row versions in 400715 pages
DETAIL: CPU 3.46s/2.86u sec elapsed 214.04 sec.
INFO: scanned index "gpointind" to remove 11184523 row versions
DETAIL: CPU 14.17s/27.02u sec elapsed 2163.67 sec.
INFO: "point_tbl": removed 11184523 row versions in 400715 pages
DETAIL: CPU 3.33s/2.99u sec elapsed 222.60 sec.
INFO: scanned index "gpointind" to remove 11184523 row versions
DETAIL: CPU 11.84s/25.23u sec elapsed 1828.71 sec.
INFO: "point_tbl": removed 11184523 row versions in 400715 pages
DETAIL: CPU 3.44s/2.81u sec elapsed 215.06 sec.
INFO: scanned index "gpointind" to remove 866 row versions
DETAIL: CPU 5.62s/6.68u sec elapsed 176.67 sec.
INFO: "point_tbl": removed 866 row versions in 31 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: index "gpointind" now contains 201326592 row versions in 2336360 pages
DETAIL: 33554432 index row versions were removed.
150833 index pages have been deleted, 150833 are currently reusable.
CPU 5.54s/2.08u sec elapsed 165.61 sec.
INFO: "point_tbl": found 33554432 removable, 201326592 nonremovable row versions in 1202176 out of 1202176 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 73.50s/116.82u sec elapsed 8300.73 sec.
INFO: analyzing "public.point_tbl"
INFO: "point_tbl": scanned 100 of 1202176 pages, containing 16756 live rows and 0 dead rows; 100 rows in sample, 201326601 estimated total rows
VACUUM

There is a big speedup, plus we can reuse some pages.

Thanks.

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0e49959..229d3f4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -619,6 +619,12 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			if(GistPageIsDeleted(stack->page)) {
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ff888e2..c99ff7e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1128,11 +1128,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} 

Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
 wrote:
> I'm arguing for fixing the existing bug, and then addressing the case of
> over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently.  I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value.  You are clearly defining the bug a bit differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad row estimation with indexed func returning bool

2015-09-11 Thread Tom Lane
Jehan-Guillaume de Rorthais  writes:
> I faced a correlation problem on a query today and tried the usual trick
> consisting of using a functional index and rewriting the query to use it.

The core reason this isn't doing anything useful is that
clause_selectivity() is hard-wired to estimate the selectivity of a
top-level WHERE clause that is a function call as 0.333, no matter
what:

else if (is_funcclause(clause))
{
/*
 * This is not an operator, so we guess at the selectivity. THIS IS A
 * HACK TO GET V4 OUT THE DOOR.  FUNCS SHOULD BE ABLE TO HAVE
 * SELECTIVITIES THEMSELVES.   -- JMH 7/9/92
 */
s1 = (Selectivity) 0.333;
}

Adding per-function selectivity estimators, as Joe was presumably
envisioning, would be a sufficiently large amount of work that it's not
too surprising nobody's gotten around to it in twenty-three years.  (The
infrastructure maybe wouldn't be so bad, but where would the estimators
themselves come from, especially for user-defined functions?)

However, in the case at hand, the complaint basically is why aren't we
treating the boolean function expression like a boolean variable, and
looking to see if there are stats available for it, like this other
bit in clause_selectivity:

/*
 * A Var at the top of a clause must be a bool Var. This is
 * equivalent to the clause reln.attribute = 't', so we compute
 * the selectivity as if that is what we have.
 */
        s1 = restriction_selectivity(root,
                                     BooleanEqualOperator,
                                     list_make2(var,
                                                makeBoolConst(true, false)),
                                     InvalidOid,
                                     varRelid);

Indeed you could argue that this ought to be the fallback behavior for
*any* unhandled case, not just function expressions.  Not sure if we'd
need to restrict it to single-relation expressions or not.
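For concreteness, a sketch of what that fallback might look like for the
function-clause case (illustrative only; it simply reuses the boolean-Var
path shown above for a boolean-returning function expression):

        else if (is_funcclause(clause))
        {
            /*
             * Sketch: treat the boolean function expression as "expr = true"
             * and let the regular machinery look for matching statistics
             * (e.g. from an expression index), instead of the hard-wired
             * 0.333 guess.
             */
            s1 = restriction_selectivity(root,
                                         BooleanEqualOperator,
                                         list_make2(clause,
                                                    makeBoolConst(true, false)),
                                         InvalidOid,
                                         varRelid);
        }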

The implication of doing it like this would be that the default estimate
in the absence of any matching stats would be 0.5 (since eqsel defaults
to 1/ndistinct, and get_variable_numdistinct will report 2.0 for any
boolean-type expression it has no stats for).  That's not a huge change
from the existing 0.333 estimate, which seems pretty unprincipled
anyway ... but it would probably be enough to annoy people if we did it in
stable branches.  So I'd be inclined to propose changing this in HEAD and
maybe 9.5, but not further back.  (For non-function expressions, 0.5 is
the default already, so those would not change behavior.)

Comments?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Andres Freund
Hi,

Partitioned checkpoints have the significant disadvantage that they
increase random write I/O by the number of passes. Which is a bad idea,
*especially* on SSDs.

> >So we'd need logic like this
> >1. Run through shared buffers and analyze the files contained in there
> >2. Assign files to one of N batches so we can make N roughly equal sized
> >mini-checkpoints
> >3. Make N passes through shared buffers, writing out files assigned to
> >each batch as we go

That's essentially what Fabien's sorting patch does by sorting all
writes.

> What I think might work better is actually keeping the write/fsync phases we
> have now, but instead of postponing the fsyncs until the next checkpoint we
> might spread them after the writes. So with target=0.5 we'd do the writes in
> the first half, then the fsyncs in the other half. Of course, we should sort
> the data like you propose, and issue the fsyncs in the same order (so that
> the OS has time to write them to the devices).

I think the approach in Fabien's patch of enforcing that there's not
very much dirty data to flush by forcing early cache flushes is
better. Having gigabytes worth of dirty data in the OS page cache can
have massive negative impact completely independent of fsyncs.

> I wonder how much the original paper (written in 1996) is effectively
> obsoleted by spread checkpoints, but the benchmark results posted by
> Horikawa-san suggest there's a possible gain. But perhaps partitioning the
> checkpoints is not the best approach?

I think it's likely that the patch will have only a very small effect if
applied on top of Fabien's patch (which'll require some massaging, I'm
sure). 

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread Andres Freund
On 2015-09-11 19:33:26 +0300, YUriy Zhuravlev wrote:
> On Friday 11 September 2015 18:14:21 Andres Freund wrote:
> > This way we can leave the for (;;) loop
> > in BufferAlloc() thinking that the buffer is unused (and can't be further
> > pinned because of the held spinlock!)
> 
> We lost lock after PinBuffer_Locked in BufferAlloc. Therefore, in essence, 
> nothing has changed.

The relevant piece of code is:
		/*
		 * Need to lock the buffer header too in order to change its tag.
		 */
		LockBufHdr(buf);

		/*
		 * Somebody could have pinned or re-dirtied the buffer while we were
		 * doing the I/O and making the new hashtable entry.  If so, we can't
		 * recycle this buffer; we must undo everything we've done and start
		 * over with a new victim buffer.
		 */
		oldFlags = buf->flags;
		if (buf->refcount == 1 && !(oldFlags & BM_DIRTY))
			break;

		UnlockBufHdr(buf);
		BufTableDelete(&newTag, newHash);
		if ((oldFlags & BM_TAG_VALID) &&
			oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
		LWLockRelease(newPartitionLock);
		UnpinBuffer(buf, true);
	}

	/*
	 * Okay, it's finally safe to rename the buffer.
	 *
	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
	 * paranoia.  We also reset the usage_count since any recency of use of
	 * the old content is no longer relevant.  (The usage_count starts out at
	 * 1 so that the buffer can survive one clock-sweep pass.)
	 */
	buf->tag = newTag;
	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
					BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
	if (relpersistence == RELPERSISTENCE_PERMANENT)
		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
	else
		buf->flags |= BM_TAG_VALID;
	buf->usage_count = 1;

	UnlockBufHdr(buf);

so unless I'm missing something, no, we haven't lost the lock.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Tom Lane
Dean Rasheed  writes:
> On 11 September 2015 at 15:49, Tom Lane  wrote:
>> However, I don't see why UPDATE/DELETE with RETURNING couldn't be
>> restricted according to *both* the UPDATE and SELECT policies,
>> ie if there's RETURNING then you can't update a row you could not
>> have selected.  Note this would be a nothing-happens result not a
>> throw-error result, else you still leak info about the existence of
>> the row.

> That's what I was suggesting, except I was advocating a throw-error
> result rather than a nothing-happens result.

> Regarding the possibility of leaking info about the existence of rows,
> that's something that already happens with INSERT if there are unique
> indexes, and we've effectively decided there is nothing we can do
> about it. So I don't buy that as an argument for doing nothing over
> throwing an error.

Well, I don't buy your argument either.  The unique-index problem means
that if you have INSERT or UPDATE privilege, you can check for existence
of a row conflicting with a row that RLS would let you insert or update.
It does not mean that DELETE privilege would let you make that check.
Also, the unique-index problem only applies if there's a relevant unique
index, which doesn't necessarily have anything at all to do with the RLS
filter condition.

I think this is coming back to Robert's concern, that it is far from clear
that we understand the interactions of per-command RLS policies well
enough to ship them.  Maybe we should rip that out for 9.5 and only allow
a single RLS policy that's the same for all commands.  We can always
add the more general facility later.

> Ultimately I think this will be an extremely rare case, probably more
> likely to happen as a result of accidentally misconfigured policies.
> But if that does happen, I'd rather have an error to alert me to the
> fact, than to silently do nothing.

I understand your concern about that, and it's reasonable.  But that
just leads me to the conclusion that maybe our ideas in this area are
not fully baked.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Dean Rasheed
On 11 September 2015 at 17:56, Tom Lane  wrote:
> Dean Rasheed  writes:
>> On 11 September 2015 at 15:49, Tom Lane  wrote:
>>> However, I don't see why UPDATE/DELETE with RETURNING couldn't be
>>> restricted according to *both* the UPDATE and SELECT policies,
>>> ie if there's RETURNING then you can't update a row you could not
>>> have selected.  Note this would be a nothing-happens result not a
>>> throw-error result, else you still leak info about the existence of
>>> the row.
>
>> That's what I was suggesting, except I was advocating a throw-error
>> result rather than a nothing-happens result.
>
>> Regarding the possibility of leaking info about the existence of rows,
>> that's something that already happens with INSERT if there are unique
>> indexes, and we've effectively decided there is nothing we can do
>> about it. So I don't buy that as an argument for doing nothing over
>> throwing an error.
>
> Well, I don't buy your argument either.  The unique-index problem means
> that if you have INSERT or UPDATE privilege, you can check for existence
> of a row conflicting with a row that RLS would let you insert or update.
> It does not mean that DELETE privilege would let you make that check.

OK, the unique-index argument wasn't the best argument. How about
this: if you have the DELETE privilege, you can already check for the
existence of any row that RLS would let you delete just by looking at
the command status. The same is true for UPDATE. So throwing an error
when attempting to use RETURNING to see a row that's not visible to
you doesn't leak any more existence info than before.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Waits monitoring

2015-09-11 Thread Robert Haas
On Thu, Sep 10, 2015 at 3:43 AM, Kyotaro HORIGUCHI
 wrote:
> Generated lwlocknames.[ch] don't have header comment because
> generate-lwlocknames.pl writes them into wrong place.
>
> lmgr/Makefile looks to have some mistakes.

Fixed.

>  - lwlocknames.c is not generated from (or using) lwlocknames.h,
>so the entry "lwlocknames.c: lwlocknames.h" doesn't look to
>be appropriate.

I think that's a pretty standard way of handling a case where a single
command generates multiple files.

>  - maintainer-clean in lmgr/Makefile forgets to remove lwlocknames.c.

Fixed.

> Perhaps uncommenting in pg_config_manual.h is left alone.
> (This is not included in the diff below)

Fixed.

And committed.  Thanks for the review, let's see what the buildfarm thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread Andres Freund
Hi,

On 2015-09-11 13:23:24 +0300, YUriy Zhuravlev wrote:
> Continuing the theme: 
> http://www.postgresql.org/message-id/3368228.mTSz6V0Jsq@dinodell

Please don't just start new threads for a new version of the patch.

> This time, we have rewritten 'refcount' and 'usage_count' to atomics in
> PinBuffer and UnpinBuffer (but kept the lock for the buffer flags in Unpin).

Hm.

> At the same time it doesn't affect the correctness of the buffer manager,
> because those variables already have an LWLock on top of them (for the
> hash table partition).

Note that there's a pending patch that removes the buffer mapping locks
entirely.

> If someone pinned the buffer after the call to StrategyGetBuffer, we just try
> again (in BufferAlloc).  Also, in the code there is one more check
> before deleting the old buffer, where changes can be rolled back.  The
> other functions that check 'refcount' and 'usage_count' take
> exclusive locks.

I don't think this is correct. This way we can leave the for (;;) loop
in BufferAlloc() thinking that the buffer is unused (and can't be further
pinned because of the held spinlock!) while it actually has been pinned
since by PinBuffer(). Additionally oldFlags can get out of sync there.


I don't think the approach of making some of the fields atomics but not
really caring about the rest is going to work. My suggestion is to add a
single 'state' 32bit atomic. This 32bit state is subdivided into:

10bit for flags,
3bit for usage_count,
16bit for refcount

then turn each operation that currently uses one of these fields into
corresponding accesses (just different values for flags, bit-shiftery &
mask for reading usage count, bit mask for reading refcount). The trick
then is to add a *new* flag value BM_LOCKED. This can then act as a sort
of a 'one bit' spinlock.

That should roughly look like (more or less pseudocode):

void
LockBufHdr(BufferDesc *desc)
{
    int state = pg_atomic_read_u32(&desc->state);

    for (;;)
    {
        /* wait till lock is free */
        while (unlikely(state & BM_LOCKED))
        {
            pg_spin_delay();
            state = pg_atomic_read_u32(&desc->state);

            /* add exponential backoff? Should seldomly be contended tho. */
        }

        /* and try to get lock */
        if (pg_atomic_compare_exchange_u32(&desc->state, &state,
                                           state | BM_LOCKED))
            break;
    }
}

static bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
    ...
    if (ref == NULL)
    {
        ReservePrivateRefCountEntry();
        ref = NewPrivateRefCountEntry(b);
        ...

        int state = pg_atomic_read_u32(&buf->state);
        int oldstate = state;

        while (true)
        {
            /* spin-wait till lock is free */
            while (unlikely(state & BM_LOCKED))
            {
                pg_spin_delay();
                state = pg_atomic_read_u32(&buf->state);
            }

            /* increase refcount */
            state += 1;

            /* increase usagecount unless already max */
            if ((state & USAGE_COUNT_MASK) != BM_MAX_USAGE_COUNT)
                state += BM_USAGE_COUNT_ONE;

            result = (state & BM_VALID) != 0;

            if (pg_atomic_compare_exchange_u32(&buf->state, &oldstate, state))
                break;

            /* get ready for next loop, oldstate has been updated by cas */
            state = oldstate;
        }
    ...
}

other callsites can either just plainly continue to use
LockBufHdr/UnlockBufHdr or be converted similarly to PinBuffer().
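For completeness, under this scheme releasing the one-bit lock would just
clear BM_LOCKED with an atomic AND; a sketch in the same pseudocode style
(the 'state' field and BM_LOCKED are the hypothetical additions described
above):

static void
UnlockBufHdr(BufferDesc *desc)
{
    /* clear the lock bit; the atomic fetch-and acts as a full barrier */
    pg_atomic_fetch_and_u32(&desc->state, ~BM_LOCKED);
}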

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-11 Thread Thom Brown
On 11 September 2015 at 15:43, Syed, Rahila  wrote:

>
> Hello,
>
> Please find attached updated VACUUM progress checker patch.
> Following have been accomplished in the patch
>
> 1. Accounts for index pages count while calculating  total progress of
> VACUUM.
> 2. Common location for storing progress parameters for any command. Idea
> is every command which needs to report progress can populate and interpret
> the shared variables in its own way.
>  Each can display progress by implementing separate views.
> 3. Separate VACUUM progress view to display various progress parameters
> has been implemented . Progress of various phases like heap scan, index
> scan, total pages scanned along with
> completion percentage is reported.
> 4.This view can display progress for all active backends running VACUUM.
>
> Basic testing has been performed. Thorough testing is yet to be done.
> Marking it as Needs Review in  Sept-Commitfest.
>
> ToDo:
> Display count of heap pages actually vacuumed(marking line pointers unused)
> Display percentage of work_mem being used to store dead tuples.
>
> Thank you,
> Rahila Syed
>

This doesn't seem to compile:

make[4]: Leaving directory `/home/thom/Development/postgresql/src/backend'
make[3]: Leaving directory `/home/thom/Development/postgresql/src/common'
make -C catalog schemapg.h
make[3]: Entering directory
`/home/thom/Development/postgresql/src/backend/catalog'
cd ../../../src/include/catalog && '/usr/bin/perl' ./duplicate_oids
3308
make[3]: *** [postgres.bki] Error 1
make[3]: Leaving directory
`/home/thom/Development/postgresql/src/backend/catalog'
make[2]: *** [submake-schemapg] Error 2
make[2]: Leaving directory `/home/thom/Development/postgresql/src/backend'
make[1]: *** [install-backend-recurse] Error 2
make[1]: Leaving directory `/home/thom/Development/postgresql/src'
make: *** [install-src-recurse] Error 2
make: Leaving directory `/home/thom/Development/postgresql'


Thom


Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Fabien COELHO


Hello Simon,


The idea to do a partial pass through shared buffers and only write a
fraction of dirty buffers, then fsync them is a good one.


Sure.


The key point is that we spread out the fsyncs across the whole checkpoint
period.


Yes, this is really Andres suggestion, as I understood it.


I think we should be writing out all buffers for a particular file in one
pass, then issue one fsync per file.  >1 fsyncs per file seems a bad idea.


This is one of the things done in the "checkpoint continuous flushing" 
patch: as buffers are sorted, they are written per file, and in order 
within a file, which helps get sequential writes instead of random 
writes.


See https://commitfest.postgresql.org/6/260/

However for now the final fsync is not called, but Linux is told that the 
written buffers must be flushed, which is akin to an "asynchronous fsync", 
i.e. it asks to move data but does not wait for the data to be actually 
written, as a blocking fsync would.
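For reference, on Linux this kind of hint is typically issued with
sync_file_range(); a minimal sketch of such an "asynchronous fsync" (an
assumption about the mechanism, not a quote from the patch):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Ask the kernel to start writeback of a file range without waiting for it
 * to complete.  Linux-specific; error handling omitted.
 */
static void
hint_flush_range(int fd, off_t offset, off_t nbytes)
{
	(void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}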


Andres's suggestion, which has some points in common with Takashi-san's 
patch, is to also integrate the fsync into the buffer writing process. There 
are some details to think about, because it is probably not a good idea to 
issue an fsync right after the corresponding writes; it is better to wait for 
some delay before doing so, so the implementation is not straightforward.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread YUriy Zhuravlev
On Friday 11 September 2015 18:37:00 you wrote:
> so unless I'm missing something, no, we haven't lost the lock.
This section is protected by LWLockAcquire(newPartitionLock, 
LW_EXCLUSIVE) before it (and we can't get this buffer from the hash table). 

-- 
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: check existency of table for -t option (pg_dump) when pattern...

2015-09-11 Thread Teodor Sigaev

https://commitfest.postgresql.org/6/201/

The patch looks good and is helpful in some use cases. I found and fixed some 
typos (new version attached), but the patch shows some inconsistent output:

% pg_dump -t 'aaa*'  postgres
pg_dump: No matching tables were found
% pg_dump -t 'aaa*' --strict-names postgres
pg_dump: Table "aaa*" not found.

In the second case the error message is obviously worse.
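A rough sketch of how the --strict-names failure could be reported in the
same style as the existing message (the helper comes from the attached
patch; the variable names and wording here are illustrative, not the
patch's actual code):

	/*
	 * Illustration only: after matching, complain about any -t/--table
	 * pattern that matched nothing, phrased like "No matching tables
	 * were found".
	 */
	const char *missed = simple_string_list_not_touched(&table_include_patterns);

	if (strict_names && missed != NULL)
		exit_horribly(NULL,
					  "no matching tables were found for pattern \"%s\"\n",
					  missed);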
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7467e86..eaa006d 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -545,6 +545,23 @@ PostgreSQL documentation
  
 
  
+  --strict-names
+  
+   
+Require that each schema (-n / --schema) and table (-t / --table)
+qualifier match at least one schema/table in the database to be dumped.
+Note that if none of the schema/table qualifiers find matches pg_dump
+will generate an error even without --strict-names.
+   
+   
+This option has no effect on -N/--exclude-schema, -T/--exclude-table
+or --exclude-table-data. An exclude pattern failing to match
+any objects is not considered an error.
+   
+  
+ 
+
+ 
   -T table
   --exclude-table=table
   
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 97e3420..a5a9394 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -432,6 +432,16 @@
  
 
  
+  --strict-names
+  
+   
+Require that each schema (-n / --schema) and table (-t / --table)
+qualifier match at least one schema/table in the backup file.
+   
+  
+ 
+
+ 
   -T trigger
   --trigger=trigger
   
diff --git a/src/bin/pg_dump/dumputils.c b/src/bin/pg_dump/dumputils.c
index d7506e1..52b2b98 100644
--- a/src/bin/pg_dump/dumputils.c
+++ b/src/bin/pg_dump/dumputils.c
@@ -1220,6 +1220,7 @@ simple_string_list_append(SimpleStringList *list, const char *val)
 		pg_malloc(offsetof(SimpleStringListCell, val) + strlen(val) + 1);
 
cell->next = NULL;
+   cell->touched = false;
strcpy(cell->val, val);
 
if (list->tail)
@@ -1237,7 +1238,23 @@ simple_string_list_member(SimpleStringList *list, const char *val)
for (cell = list->head; cell; cell = cell->next)
{
if (strcmp(cell->val, val) == 0)
+   {
+   cell->touched = true;
return true;
+   }
}
return false;
 }
+
+const char *
+simple_string_list_not_touched(SimpleStringList *list)
+{
+   SimpleStringListCell *cell;
+
+   for (cell = list->head; cell; cell = cell->next)
+   {
+   if (!cell->touched)
+   return cell->val;
+   }
+   return NULL;
+}
diff --git a/src/bin/pg_dump/dumputils.h b/src/bin/pg_dump/dumputils.h
index b176746..9f31bbc 100644
--- a/src/bin/pg_dump/dumputils.h
+++ b/src/bin/pg_dump/dumputils.h
@@ -38,6 +38,8 @@ typedef struct SimpleOidList
 typedef struct SimpleStringListCell
 {
 	struct SimpleStringListCell *next;
+	bool		touched;		/* true, when this string was searched
+								   and touched */
 	char		val[FLEXIBLE_ARRAY_MEMBER];		/* null-terminated string here */
 } SimpleStringListCell;
 
@@ -103,5 +105,7 @@ extern void set_dump_section(const char *arg, int *dumpSections);
 
 extern void simple_string_list_append(SimpleStringList *list, const char *val);
 extern bool simple_string_list_member(SimpleStringList *list, const char *val);
+extern const char *simple_string_list_not_touched(SimpleStringList *list);
+
 
 #endif   /* DUMPUTILS_H */
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 80df8fc..7126749 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -104,6 +104,7 @@ typedef struct _restoreOptions
int column_inserts;
int if_exists;
 	int			no_security_labels;		/* Skip security label entries */
+	int			strict_names;
 
const char *filename;
int dataOnly;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 8f1f6c1..2344937 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -108,6 +108,9 @@ static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
+static void 

Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Dean Rasheed
On 11 September 2015 at 15:49, Tom Lane  wrote:
> Dean Rasheed  writes:
>> Yeah, we had a similar discussion regarding UPDATE USING policies and
>> ON CONFLICT UPDATE clauses. I think the argument against filtering is
>> that the rows returned would then be misleading about what was
>> actually updated.
>
> It seems to me that it would be a horribly bad idea to allow RLS to act
> in such a way that rows could be updated and then not shown in RETURNING.
>
> However, I don't see why UPDATE/DELETE with RETURNING couldn't be
> restricted according to *both* the UPDATE and SELECT policies,
> ie if there's RETURNING then you can't update a row you could not
> have selected.  Note this would be a nothing-happens result not a
> throw-error result, else you still leak info about the existence of
> the row.
>

That's what I was suggesting, except I was advocating a throw-error
result rather than a nothing-happens result.

Regarding the possibility of leaking info about the existence of rows,
that's something that already happens with INSERT if there are unique
indexes, and we've effectively decided there is nothing we can do
about it. So I don't buy that as an argument for doing nothing over
throwing an error.

My concern about doing nothing is how confusing it might be that an
UPDATE without RETURNING might update more rows than an UPDATE with
RETURNING and an identical WHERE clause. Throwing an error is much
more explicit about why you can't return those rows.

Ultimately I think this will be an extremely rare case, probably more
likely to happen as a result of accidentally misconfigured policies.
But if that does happen, I'd rather have an error to alert me to the
fact, than to silently do nothing.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 3:08 AM, Etsuro Fujita
 wrote:
> IIRC, Horiguchi-san also pointed that out. Honestly, I also think that that
> is weird, but IIUC, I think it can't hurt. What I was concerned about was
> EXPLAIN, but EXPLAIN doesn't handle an EvalPlanQual PlanState tree at least
> currently.

This has come up a few times before and some people have argued for
changing the coding rule.  Nevertheless, for now, it is the rule.
IMHO, it's a pretty good rule that makes things easier to understand
and reason about.  If there's an argument for changing it, it's
performance, not developer convenience.  Anyway, we should try to fix
this problem without getting tangled in that argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread Andres Freund
On 2015-09-11 19:46:02 +0300, YUriy Zhuravlev wrote:
> On Friday 11 September 2015 18:37:00 you wrote:
> > so unless I'm missing something, no, we haven't lost the lock.
> This section is protected by like LWLockAcquire(newPartitionLock, 
> LW_EXCLUSIVE); before it (and we can't get this buffer from hash table). 

a) As I said upthread there's a patch to remove these locks entirely
b) It doesn't matter anyway. Not every pin goes through the buffer
   mapping table. StrategyGetBuffer(), SyncOneBuffer(), ...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Did we forget to unpin buf in function "revmap_physical_extend" ?

2015-09-11 Thread Alvaro Herrera
Jinyu Zhang wrote:
> In function "revmap_physical_extend",  should we add "ReleaseBuffer(buf);" 
> between line 438 and 439 ?
>  422 else
>  423 {
>  424 if (needLock)
>  425 LockRelationForExtension(irel, ExclusiveLock);
>  426 
>  427 buf = ReadBuffer(irel, P_NEW);
>  428 if (BufferGetBlockNumber(buf) != mapBlk)
>  429 {
>  430 /*
>  431  * Very rare corner case: somebody extended the relation
>  432  * concurrently after we read its length.  If this happens, 
> give
>  433  * up and have caller start over.  We will have to evacuate 
> that
>  434  * page from under whoever is using it.
>  435  */
>  436 if (needLock)
>  437 UnlockRelationForExtension(irel, ExclusiveLock);
>  438 LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
>  439 return;
>  440 }

Yeah, clearly there's a missing ReleaseBuffer call there.

There are two callers of this code (brinRevmapExtend calls
revmap_extend_and_get_blkno calls revmap_physical_extend) -- one is
brin_doupdate.  This function is concerned with updating an existing
index tuple, that is, one that already has a revmap entry.  It follows
that the revmap page must already exist.  In other words, this callsite
is dead code.

The other caller is in brin_doinsert, which itself is called from
brinbuild (both directly and via brinbuildCallback) and summarize_range.
In the first case, the index is just being created, and therefore no one
can be trying to extend the relation concurrently.  In the second case,
we always hold ShareUpdateExclusiveLock, and therefore no one else can be
running the same thing at the same time.

Another thing to keep in mind is that since the revmap allocates blocks
from the start of the main fork, the pages it wants to use are most of
the time already used by "regular index pages".  The only case where it
actually needs to physically extend the relation is when the index is
new.  I've been trying to think of a case in which the index is created
to an empty relation, and then we try to insert a tuple during a
vacuum-induced summarization, which decides to create a new revmap page,
and then somehow another process needs to insert another index tuple and
allocate a regular index page for it, which happens to get to the
relation extension before the first process.  But I don't think this can
happen either, because every path that inserts regular index tuples
extends the revmap first.

Therefore I think this is dead code and there is no live bug.

All that said, adding the call is simple enough and I don't want to
spend a lot of time constructing a test case to prove that this cannot
happen.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Dean Rasheed
On 10 September 2015 at 21:48, Robert Haas  wrote:
> The open items list here:
>
> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
>
> Dean's latest round of RLS refactoring.

This refactoring patch doesn't fix any behavioural issues. It is all
about trying to make the code simpler and more readable and
maintainable, and also addressing Alvaro's comments here:
http://www.postgresql.org/message-id/20150522180807.gb5...@postgresql.org

However, it has bit-rotted over the last 3 months. In particular it
doesn't take account of this change:
http://git.postgresql.org/pg/commitdiff/dee0200f0276c0f9da930a2c926f90f5615f2d64

I will happily work up a rebased version of the patch, but only if I
get a serious commitment from a committer to review it. Otherwise,
I'll let it drop.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-11 Thread Bernd Helmle
...and here it is ;)

--On 10. September 2015 19:45:46 -0300 Alvaro Herrera
 wrote:

> Anyway, can you please request pg_controldata to be run on the failed
> cluster and paste it here?

pg_control version number:            937
Catalog version number:               201306121
Database system identifier:           5995776571405068134
Database cluster state:               in archive recovery
pg_control last modified:             Di 08 Sep 2015 14:58:36 CEST
Latest checkpoint location:           1A52/3CFAF758
Prior checkpoint location:            1A52/3CFAF758
Latest checkpoint's REDO location:    1A52/2313FEF8
Latest checkpoint's REDO WAL file:    0000000100001A5200000023
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/2896610102
Latest checkpoint's NextOID:          261892
Latest checkpoint's NextMultiXactId:  1068223816
Latest checkpoint's NextMultiOffset:  2147460090
Latest checkpoint's oldestXID:        2693040605
Latest checkpoint's oldestXID's DB:   16400
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1012219584
Latest checkpoint's oldestMulti's DB: 16400
Time of latest checkpoint:            Di 08 Sep 2015 00:47:01 CEST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     1A52/2313FEF8
Min recovery ending loc's timeline:   1
Backup start location:                1A52/2313FEF8
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            archive
Current max_connections setting:      500
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

-- 

Bernd



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Fabien COELHO


Hello Takashi-san,

I wanted to do some tests with this POC patch. For this purpose, it would 
be nice to have a GUC that allows this feature to be activated or not.


Could you provide a patch with such a GUC? I would suggest having the 
number of partitions as the GUC, so that choosing 1 would basically reflect 
the current behavior.


Some general comments :

I understand that what this patch does is cut the checkpoint of 
buffers into 16 partitions, each addressing 1/16 of the buffers, and each with 
its own WAL-log entry, pacing, fsync and so on.


I'm not sure why it would be much better: I agree that it may 
have some small positive influence on performance, but I'm afraid it may 
also degrade performance in some conditions. So I think that a 
better understanding of why performance improves, and a focus on 
that, could help obtain a more systematic gain.


This method interacts with the current proposal to improve the 
checkpointer behavior by avoiding random I/Os, but it could be combined.


I'm wondering whether the benefits you see are linked to the file-flushing 
behavior induced by fsyncing more often, in which case it is quite close to 
the "flushing" part of the current "checkpoint continuous flushing" patch, 
and could be redundant with/less efficient than what is done there, especially 
as tests have shown that the effect of flushing is *much* better on sorted 
buffers.


Another proposal around, suggested by Andres Freund I think, is that the 
checkpointer could fsync files while checkpointing and not wait for the end 
of the checkpoint. I think that may also be one of the reasons why your 
patch brings a benefit, but Andres' approach would be more systematic, 
because there would be no need to fsync files several times (basically your 
patch issues 16 fsyncs per file). This suggests that the "partitioning" 
should be done at a lower level, from within CheckPointBuffers, which 
would take care of fsyncing files some time after writing buffers to them 
is finished.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH v2] GSSAPI encryption support

2015-09-11 Thread Michael Paquier
On Thu, Sep 10, 2015 at 4:27 PM, Michael Paquier
 wrote:
> On Thu, Sep 10, 2015 at 1:44 AM, Robbie Harwood  wrote:
>> Michael Paquier  writes:
>>
>>> On Wed, Sep 9, 2015 at 4:12 AM, Robbie Harwood wrote:
>>>> Michael Paquier writes:
>>>> As promised, here's a V2 to address your issues with comments.  I
>>>> haven't heard back on the issues you found in testing, so no other
>>>> changes are present.
>>>
>>> Well, the issue is still here: login through gssapi fails with your
>>> patch, not with HEAD. This patch is next on my review list by the way
>>> so I'll see what I can do about it soon even if I am in the US for
>>> Postgres Open next week. Still, how did you test it? I am just
>>> creating by myself a KDC, setting up a valid credential with kinit,
>>> and after setting up Postgres for this purpose the protocol
>>> communication just fails.
>>
>> My KDC is setup through freeIPA; I create a service for postgres,
>> acquire a keytab, set it in the config file, and fire up the server.  It
>> should go without saying that this is working for me, which is why I
>> asked you for more information so I could try to debug.  I wrote a post
>> on this back in June when this was still in development:
>> http://mivehind.net/page/view-page-slug/16/postgres-kerberos
> Hm. OK. I'll give it a try with freeipa and your patch with Fedora for
> example. Could you as well try the configuration I have used? In any
> case, it seems to me that we have a real problem with your patch: the
> gss authentication protocol is broken with your patch and *not* HEAD
> when using a custom kdc like the one I have set up manually on one of
> my VMs.

Looking more at this stuff. Your post assumes that you have an IPA
server available (I am not really familiar with this software stack)
already configured at hand, so that you do not need to worry about any
low-level configuration, and a KDC is provided as well, among other
things like ntpd or an apache instance. Well, the thing is that all we
need is a KDC for this patch to have an environment suitable for
testing, plus some magic commands with kadmin.local, kinit, etc., and
not the whole set of features that an IPA server provides (when
kicking ipa-server-install one needs to provide a realm name, a KDC
admin password, so that's basically just a wrapper setting up
krb5.conf, which is handy when you want to have the full set in your
hands actually, though just to test this patch it does not seem worth
it). And I imagine that you do have an IPA server already set up,
facilitating your work.

Still, I gave it a try on a Fedora host, but gave up after facing
several failures when trying to install the server.

Note that by duckduckgoing around I have bumped into some more
documentation to set up KDC with Postgres:
http://gpdb.docs.pivotal.io/4320/admin_guide/kerberos.html
This may be useful when testing this patch as well.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapshot too old, configured by time

2015-09-11 Thread Alvaro Herrera
I'm starting to read through this and have a few questions.

Kevin Grittner wrote:

> 4. Tests have been added.  They are currently somewhat minimal,
> since this is using a whole new technique for testing from any
> existing committed tests -- I wanted to make sure that this
> approach to testing was OK with everyone before expanding it.  If
> it is, I assume we will want to move some of the more generic
> portions to a .pm file to make it available for other tests.

I think your test module is a bit unsure about whether it's called tso
or sto.  I would place it inside src/test/modules, so that buildfarm
runs it automatically; I'm not sure it will pick up things in src/test.
(This is the whole point of src/test/modules, after all).  I haven't
written or reviewed our TestLib stuff so I can't comment on whether the
test code is good or not.  How long is the test supposed to last?

It bothers me a bit to add #include rel.h to snapmgr.h because of the
new macro definition.  It seems out of place.  I would instead move the
macro closer to bufmgr headers or rel.h itself perhaps.  Also, the
definition isn't accompanied by an explanatory comment (other than why
is it a macro instead of a function).

So if I understand correctly, every call to BufferGetPage needs to have
a TestForOldSnapshot immediately afterwards?  It seems easy to later
introduce places that fail to test for old snapshots.  What happens if
they do?  Does vacuum remove tuples anyway and then the query returns
wrong results?  That seems pretty dangerous.  Maybe the snapshot could
be an argument of BufferGetPage?
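
To make that last question concrete, here is a toy, self-contained sketch of the shape such an interface could take -- stub types and invented names only (ToyBufferGetPage, toy_old_snapshot_threshold); it is not the real bufmgr/snapmgr API, and the staleness rule is made up.  The point is simply that if the accessor takes the snapshot, callers cannot forget the test:

#include <stdio.h>
#include <stdlib.h>

/* Stub stand-ins for the real Snapshot and Buffer structures. */
typedef struct ToySnapshot { long takenAt; } ToySnapshot;
typedef struct ToyBuffer   { long newestRemovalAt; char page[8192]; } ToyBuffer;

static long toy_old_snapshot_threshold = 100;   /* invented units */

/* Hypothetical accessor: getting the page implies the "too old" test. */
static char *
ToyBufferGetPage(ToyBuffer *buf, ToySnapshot *snap)
{
    if (buf->newestRemovalAt - snap->takenAt > toy_old_snapshot_threshold)
    {
        fprintf(stderr, "ERROR: snapshot too old\n");
        exit(1);
    }
    return buf->page;
}

int
main(void)
{
    ToyBuffer   buf = { .newestRemovalAt = 500 };
    ToySnapshot recent = { .takenAt = 450 };
    ToySnapshot stale = { .takenAt = 10 };

    ToyBufferGetPage(&buf, &recent);        /* passes the check */
    printf("recent snapshot: page accessed\n");
    ToyBufferGetPage(&buf, &stale);         /* fails the check, exits */
    printf("never reached\n");
    return 0;
}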

Please have the comments be clearer on what "align to minute boundaries"
means.  Is it floor or ceil?  Also, in the OldSnapshotControlData big
comment you talk about "length" and then the variable is called "used".
Maybe that should be more consistent, for clarity.

How large is the struct kept in shmem?  I don't think the size is
configurable, is it?  I would have guessed that the size would depend on
the GUC param ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Kevin Grittner
Dean Rasheed  wrote:

> Ultimately I think this will be an extremely rare case, probably more
> likely to happen as a result of accidentally misconfigured policies.
> But if that does happen, I'd rather have an error to alert me to the
> fact, than to silently do nothing.

I agree with Tom and Robert on this -- if we are going to allow
RETURNING on a DELETE or UPDATE of a table with RLS, the SELECT
policy must filter rows going to the DELETE or UPDATE phase and
silently ignore those which the user is not allowed to read.
Anything else seems crazy to me.

If we can't do that, I think we should prohibit RETURNING on DELETE
or UPDATE if there is RLS affecting the user's SELECTs.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comment update to pathnode.c

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 6:22 AM, Etsuro Fujita
 wrote:
> The comments for create_foreignscan_path says as follows, but that it's
> now possible that the function is called by GetForeignJoinPaths, which
> was added in 9.5.
>
> 1450 /*
> 1451  * create_foreignscan_path
> 1452  *Creates a path corresponding to a scan of a foreign table,
> 1453  *returning the pathnode.
> 1454  *
> 1455  * This function is never called from core Postgres; rather, it's
> expected
> 1456  * to be called by the GetForeignPaths function of a foreign data
> wrapper.
> 1457  * We make the FDW supply all fields of the path, since we do not
> have any
> 1458  * way to calculate them in core.
> 1459  */
>
> So, I've updated the comments.  Please find attached a patch.

I would write "to be called by the GetForeignPaths or
GetForeignJoinPaths function" to keep it a bit more concise.  And I
would reflow the paragraph.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autonomous Transaction is back

2015-09-11 Thread Merlin Moncure
On Thu, Sep 10, 2015 at 8:39 PM, Noah Misch  wrote:
> On Wed, Sep 09, 2015 at 10:04:01AM -0400, Robert Haas wrote:
>> On Sun, Sep 6, 2015 at 1:56 AM, Noah Misch  wrote:
>> > What design principle(s) have you been using to decide how autonomous
>> > transactions should behave?
>>
>> I have to admit to a complete lack of principle.  I'm quite frightened
>> of what this is going to need from the lock manager, and I'm trying to
>> wriggle out of having to do things there that are going to be nastily
>> hard.  My wriggling isn't going very well, though.
>
> It's an exceptionally-challenging development project, agreed.  So much code
> assumes the 1:1 relationship between backends and top-level transactions.

I guess I'm being obtuse, but can you explain why that assumption must
be revisited?  I don't see why it has to be...I must be missing
something.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-11 Thread Rahila Syed
>This doesn't seem to compile
Oh. It compiled successfully when applied on HEAD on my machine. Anyway,
the OID is changed to 3309 in the attached patch; 3308 and 3309 are both in
the unused OID list.

Thank you,
Rahila Syed


Vacuum_progress_checker_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-11 Thread Alvaro Herrera
Rahila Syed wrote:
> >This doesn't seem to compile
> Oh. It compiled successfully when applied on HEAD on my machine. Anyway,
> the OID is changed to 3309 in the attached patch; 3308 and 3309 are both in
> the unused OID list.

I think Thom may have patched on top of some other patch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb_concat: make sure we always return a non-scalar value

2015-09-11 Thread Andrew Dunstan



On 09/09/2015 01:03 PM, Oskari Saarenmaa wrote:

09.09.2015, 18:53, Andrew Dunstan wrote:

On 09/08/2015 09:54 AM, Andrew Dunstan wrote:

On 09/05/2015 02:47 AM, Oskari Saarenmaa wrote:

There was a long thread about concatenating jsonb objects to each
other, but that discussion didn't touch concatenating other types.
Currently jsonb_concat always just returns the other argument if one
of arguments is considered empty.  This causes surprising behavior
when concatenating scalar values to empty arrays:

os=# select '[]'::jsonb || '1'::jsonb;
1

os=# select '[]'::jsonb || '[1]'::jsonb;
 [1]

os=# select '[]'::jsonb || '1'::jsonb || '2'::jsonb;
 [1, 2]

os=# select '0'::jsonb || '1'::jsonb;
 [0, 1]

os=# select '{"x": "y"}'::jsonb || '[1]'::jsonb;
 [{"x": "y"}, 1]

os=# select '{"x": "y"}'::jsonb || '1'::jsonb;
ERROR:  invalid concatenation of jsonb objects

Attached a patch to fix and test this.  Also added a test case for
concatenating two scalar values which currently produces an array..
I'm not sure that behavior makes sense, but didn't want to change
that in this patch as I guess someone could consider that feature
useful.




This looks correct. Barring objection I'll apply this shortly.



Actually, I'm not sure the test is sufficient here. It looks to me like
we should only be taking this fast path if one operand is an empty array
and the other is a non-scalar array.

Otherwise we get things like this (second case is wrong, I think):

andrew=# select '[]'::jsonb || '"a"';
  ?column?
 ----------
  ["a"]

andrew=# select '[]'::jsonb || '{"a":"b"}';
   ?column?
 ------------
  {"a": "b"}

andrew=# select '[1]'::jsonb || '{"a":"b"}';
 ?column?
 -----------------
  [1, {"a": "b"}]


It looks wrong, but I'm not sure what's right in that case.  I think 
it'd make sense to just restrict concatenation to object || object, 
array || array and array || scalar and document that.  I doubt many 
people expect their objects to turn into arrays if they accidentally 
concatenate an array into it.  Alternatively the behavior could depend 
on the order of arguments for concatenation, array || anything -> 
array, object || array -> error.







I don't think I want to change the semantics in general at this stage. A 
simple minimal fix to handle the case of the fastpath where one or the other 
operand is empty, making it entirely consistent with the case where it's 
not empty, is attached.


cheers

andrew

diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 3b8d42e..154a883 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3359,12 +3359,18 @@ jsonb_concat(PG_FUNCTION_ARGS)
 			   *it2;
 
 	/*
-	 * If one of the jsonb is empty, just return other.
+	 * If one of the jsonb is empty, just return the other if it's not
+	 * scalar and both are of the same kind.  If it's a scalar or they are
+	 * of different kinds we need to perform the concatenation even if one is
+	 * empty.
 	 */
-	if (JB_ROOT_COUNT(jb1) == 0)
-		PG_RETURN_JSONB(jb2);
-	else if (JB_ROOT_COUNT(jb2) == 0)
-		PG_RETURN_JSONB(jb1);
+	if (JB_ROOT_IS_OBJECT(jb1) == JB_ROOT_IS_OBJECT(jb2))
+	{
+		if (JB_ROOT_COUNT(jb1) == 0 && !JB_ROOT_IS_SCALAR(jb2))
+			PG_RETURN_JSONB(jb2);
+		else if (JB_ROOT_COUNT(jb2) == 0 && !JB_ROOT_IS_SCALAR(jb1))
+			PG_RETURN_JSONB(jb1);
+	}
 
 	it1 = JsonbIteratorInit(&jb1->root);
 	it2 = JsonbIteratorInit(&jb2->root);
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index 17656d4..82d1b69 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -2912,6 +2912,42 @@ select '"c"' || '["a", "b"]'::jsonb;
  ["c", "a", "b"]
 (1 row)
 
+select '[]'::jsonb || '["a"]'::jsonb;
+ ?column? 
+--
+ ["a"]
+(1 row)
+
+select '[]'::jsonb || '"a"'::jsonb;
+ ?column? 
+--
+ ["a"]
+(1 row)
+
+select '"b"'::jsonb || '"a"'::jsonb;
+  ?column?  
+
+ ["b", "a"]
+(1 row)
+
+select '{}'::jsonb || '{"a":"b"}'::jsonb;
+  ?column?  
+
+ {"a": "b"}
+(1 row)
+
+select '[]'::jsonb || '{"a":"b"}'::jsonb;
+   ?column?   
+--
+ [{"a": "b"}]
+(1 row)
+
+select '{"a":"b"}'::jsonb || '[]'::jsonb;
+   ?column?   
+--
+ [{"a": "b"}]
+(1 row)
+
 select '"a"'::jsonb || '{"a":1}';
 ERROR:  invalid concatenation of jsonb objects
 select '{"a":1}' || '"a"'::jsonb;
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 86b1162..2cbc0c7 100644
--- a/src/test/regress/expected/jsonb_1.out
+++ b/src/test/regress/expected/jsonb_1.out
@@ -2912,6 +2912,42 @@ select '"c"' || '["a", "b"]'::jsonb;
  ["c", "a", "b"]
 (1 row)
 
+select '[]'::jsonb || '["a"]'::jsonb;
+ ?column? 
+--
+ ["a"]
+(1 row)
+
+select '[]'::jsonb || '"a"'::jsonb;
+ ?column? 
+--
+ ["a"]
+(1 row)
+
+select '"b"'::jsonb || '"a"'::jsonb;
+  ?column?  
+
+ ["b", "a"]
+(1 row)
+
+select '{}'::jsonb 

Re: [HACKERS] Re: [PATCH] pg_upgrade fails when postgres/template1 isn't in default tablespace

2015-09-11 Thread Bruce Momjian
On Thu, Sep 10, 2015 at 09:59:41PM -0400, Bruce Momjian wrote:
> On Fri, Sep 11, 2015 at 12:41:09AM +, Hyongtae Lim wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   tested, passed
> > Spec compliant:   tested, passed
> > Documentation:not tested
> > 
> > This looks good to me.
> > I've done some tests(pg_dumpall & restore with sql, upgrade test(8.4, 9.x, 
> > ...)) and there was no problems.
> > so I'm marking this as ready for committer.
> 
> Thanks.  I am going to commit it in the next 24 hours.  Thanks.

Patch applied and marked as committed in the commitfest app.

One odd thing I found in exhaustive testing is that pre-9.3, we use a
line starting with \connect to identify when the pg_dumpall globals end
and the per-database information begins.  (In 9.3+, we do per-database
dumps to allow parallel dump/restore.)  I modified the patch for pre-9.3
to prefix a space for \connect lines when we are jumping in and out of
the database so we can alter its tablespace.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Double linking MemoryContext children

2015-09-11 Thread Tom Lane
Jan Wieck  writes:
> The attached script demonstrates an O(N^2) problem we recently became 
> aware of. The script simply executes a large anonymous code block that 
> doesn't do anything useful. Usage is

>  ./t1.py [NUM_STMTS] | psql [DBNAME]

> NUM_STMTS defaults to 20,000. The code block executes rather fast, but 
> the entire script runs for over a minute (at 20,000 statements) on a 
> 2.66 GHz Xeon.

> The time is spent when all the prepared SPI statements are freed at the 
> end of execution. All prepared SPI statements are children of the Cache 
> context. MemoryContext children are a single linked list where new 
> members are inserted at the head. This works best when children are 
> created and destroyed in a last-in-first-out pattern. SPI however 
> destroys the SPI statements in the order they were created, forcing 
> MemoryContextSetParent() to traverse the entire linked list for each child.

> The attached patch makes MemoryContext children a double linked list 
> that no longer needs any list traversal to find the position of the 
> child within the list.

Seems less invasive to fix SPI to delete in the opposite order?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parser emits mysterious error message for very long tokens

2015-09-11 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> Hello, this is a problem on an extreme situation.
> When parser encounters very long tokens, it returns an error
> message which should be mysterious for users.

>> $ perl -e "print \"select '\" . \"x\"x(512*1024*1024) . \"'\"" | psql 
>> postgres
>> ERROR:  invalid memory alloc request size 1073741824

I can't get terribly excited about that, because there is not that
much daylight between there and where the query fails because the
entire input string exceeds 1GB.  Moreover, anyone who tries working
with literals in this size range will soon learn not to ;-).  So
it seems quite an artificial example to me ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Double linking MemoryContext children

2015-09-11 Thread Jan Wieck
The attached script demonstrates an O(N^2) problem we recently became 
aware of. The script simply executes a large anonymous code block that 
doesn't do anything useful. Usage is


./t1.py [NUM_STMTS] | psql [DBNAME]

NUM_STMTS defaults to 20,000. The code block executes rather fast, but 
the entire script runs for over a minute (at 20,000 statements) on a 
2.66 GHz Xeon.


The time is spent when all the prepared SPI statements are freed at the 
end of execution. All prepared SPI statements are children of the Cache 
context. MemoryContext children are a single linked list where new 
members are inserted at the head. This works best when children are 
created and destroyed in a last-in-first-out pattern. SPI however 
destroys the SPI statements in the order they were created, forcing 
MemoryContextSetParent() to traverse the entire linked list for each child.


The attached patch makes MemoryContext children a double linked list 
that no longer needs any list traversal to find the position of the 
child within the list.
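
For illustration, here is a toy, self-contained sketch of that point -- toy struct and field names only, not the real MemoryContextData and not the attached patch.  With head insertion and FIFO destruction, a singly linked sibling list forces a walk from parent->firstchild for every unlink, while a prev pointer makes each unlink O(1):

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for MemoryContextData: just the linkage fields. */
typedef struct Ctx
{
    struct Ctx *parent;
    struct Ctx *firstchild;
    struct Ctx *prevchild;      /* the extra field a doubly linked list needs */
    struct Ctx *nextchild;
} Ctx;

/* New children are pushed onto the head of the parent's child list. */
static void
link_child(Ctx *parent, Ctx *child)
{
    child->parent = parent;
    child->prevchild = NULL;
    child->nextchild = parent->firstchild;
    if (parent->firstchild)
        parent->firstchild->prevchild = child;
    parent->firstchild = child;
}

/* With prevchild there is no walk from parent->firstchild. */
static void
unlink_child(Ctx *child)
{
    Ctx *parent = child->parent;

    if (child->prevchild)
        child->prevchild->nextchild = child->nextchild;
    else
        parent->firstchild = child->nextchild;
    if (child->nextchild)
        child->nextchild->prevchild = child->prevchild;
    child->parent = child->prevchild = child->nextchild = NULL;
}

int
main(void)
{
    enum { N = 20000 };
    Ctx  parent = {0};
    Ctx *kids = calloc(N, sizeof(Ctx));

    for (int i = 0; i < N; i++)
        link_child(&parent, &kids[i]);

    /* FIFO destruction, the order SPI uses: O(1) per unlink here,
     * O(i) per unlink with a singly linked sibling list. */
    for (int i = 0; i < N; i++)
        unlink_child(&kids[i]);

    printf("unlinked %d children, firstchild = %p\n", N, (void *) parent.firstchild);
    free(kids);
    return 0;
}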



Regards, Jan


--
Jan Wieck
Senior Software Engineer
http://slony.info
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 705f3ef..d1a2e02 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -331,21 +331,13 @@ MemoryContextSetParent(MemoryContext context, MemoryContext new_parent)
 	{
 		MemoryContext parent = context->parent;
 
-		if (context == parent->firstchild)
-			parent->firstchild = context->nextchild;
+		if (context->prevchild != NULL)
+			context->prevchild->nextchild = context->nextchild;
 		else
-		{
-			MemoryContext child;
-
-			for (child = parent->firstchild; child; child = child->nextchild)
-			{
-				if (context == child->nextchild)
-				{
-					child->nextchild = context->nextchild;
-					break;
-				}
-			}
-		}
+			parent->firstchild = context->nextchild;
+
+		if (context->nextchild != NULL)
+			context->nextchild->prevchild = context->prevchild;
 	}
 
 	/* And relink */
@@ -353,12 +345,16 @@ MemoryContextSetParent(MemoryContext context, MemoryContext new_parent)
 	{
 		AssertArg(MemoryContextIsValid(new_parent));
 		context->parent = new_parent;
+		context->prevchild = NULL;
 		context->nextchild = new_parent->firstchild;
+		if (new_parent->firstchild != NULL)
+			new_parent->firstchild->prevchild = context;
 		new_parent->firstchild = context;
 	}
 	else
 	{
 		context->parent = NULL;
+		context->prevchild = NULL;
 		context->nextchild = NULL;
 	}
 }
@@ -714,6 +710,7 @@ MemoryContextCreate(NodeTag tag, Size size,
 	node->methods = methods;
 	node->parent = NULL;		/* for the moment */
 	node->firstchild = NULL;
+	node->prevchild = NULL;
 	node->nextchild = NULL;
 	node->isReset = true;
 	node->name = ((char *) node) + size;
@@ -728,6 +725,8 @@ MemoryContextCreate(NodeTag tag, Size size,
 	{
 		node->parent = parent;
 		node->nextchild = parent->firstchild;
+		if (parent->firstchild != NULL)
+			parent->firstchild->prevchild = node;
 		parent->firstchild = node;
 		/* inherit allowInCritSection flag from parent */
 		node->allowInCritSection = parent->allowInCritSection;
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index 577ab9c..bbb58bd 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
 	MemoryContextMethods *methods;		/* virtual function table */
 	MemoryContext parent;		/* NULL if no parent (toplevel context) */
 	MemoryContext firstchild;	/* head of linked list of children */
+	MemoryContext prevchild;	/* previous child of same parent */
 	MemoryContext nextchild;	/* next child of same parent */
 	char	   *name;			/* context name (just for debugging) */
 	MemoryContextCallback *reset_cbs;	/* list of reset/delete callbacks */
#!/usr/bin/env python

import sys

def main(argv):
    if len(argv) == 0:
        num_stmts = 20000
    else:
        num_stmts = int(argv[0])

    print "\\timing"
    print "DO $$"
    print "DECLARE"
    print "ts1 timestamp;"
    print "ts2 timestamp;"
    print "dummy integer;"
    print "BEGIN"
    print "SELECT timeofday() INTO ts1;"
    print "RAISE NOTICE 'start: %', ts1;"

    for i in range(0, num_stmts):
        print "SELECT %d INTO dummy;"%i

    print "SELECT timeofday() INTO ts2;"
    print "RAISE NOTICE 'end: %', ts2;"
    print "RAISE NOTICE 'duration: %', (ts2 - ts1);"
    print "END;"
    print "$$;"

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Sep 11, 2015 at 7:33 AM, Stephen Frost  wrote:
> > I've updated the page to add more details about the various items,
> > though the only code changes at this point considered 'open' are this
> > refactoring and the question regarding the 'row-level-security disabled'
> > context which I posted a patch for discussion yesterday.
> >
> > Comments and help on these would certainly be welcome, of course.  We're
> > working on a set of documentation updates to hopefully finish up in the
> > next week to add more details about RLS (additional sub-sections,
> > coverage of the issue Peter raised, additional discussion of RETURNING,
> > etc).
> 
> Thanks for the updates.  My thoughts:

Certainly, happy to help.

> On RETURNING, it seems like we've got a fairly fundamental problem
> here.  If I understand correctly, the intention of allowing policies
> to be filtered by command type is to allow blind updates and deletes,

That's not correct, no, will clarify below.

> but this behavior means that they are not really blind.  You can
> always use BEGIN/UPDATE-or-DELETE-with-RETURNING/ROLLBACK as a
> substitute for SELECT.  So the only possible thing you can do with the
> ability to filter by command tag that is coherent from a security
> point of view is to make the update and delete predicates *tighter*
> than the select predicate.

The original intention and the primary expected use-case is to allow the
predicates to be tighter, yes.  What we're discussing here is if we want
to allow that flexibility at all or force users to have a single
visibility policy for all commands.

It's already possible to configure RLS with one visibility policy for
all commands: simply create only a USING clause for 'ALL' commands and
you're done.

The only reason to avoid providing that flexibility is the concern that
it might be misunderstood and users might misconfigure their system.
Removing the flexibility to have per-command visibility policies and
instead forcing a single visibility policy doesn't add any capabilities.

Further, what Dean and I discussed over the summer and, I believe,
agreed on is that providing *both* per-command visibility policies and
an "overall" visibility policy (as proposed nearby, where SELECT
always applies but the other per-command policies are combined with
that policy) ends up being more difficult to reason about and explain,
and simply doesn't add any new functionality.

> And if that's where we end up, then haven't we fundamentally
> mis-designed the feature?  I mean, without the blind update case, it
> just seems kooky to have different commands see different rows.  It
> would be better to have all the command see the same rows, and then
> have update/delete *error out* if you try to update a row you're not
> allowed to touch.

Having an error-out option which is based on the values of the existing
rows, instead of only allowing the error-out case based on the values of
the row being added to the relation, is certainly an interesting idea.

As is being discussed nearby with Zhaomo (thread 'CREATE POLICY and
RETURNING', which ends up being a pretty bad subject, unfortunately), if
the WITH CHECK option supported referring to 'OLD' and 'NEW' then that
could be supported.  I'm not against having that capability, but it
seems like a new feature rather than an issue with the current
implementation.  That would also imply adding a 'WITH CHECK' clause to
DELETE, to support the "error-instead" option, but that also looks like
a new feature and one which could certainly be added later without any
backwards compatibility concerns, as far as I can see, rather than a
current issue.

Perhaps we would even allow such a 'WITH CHECK' to be applied to SELECT
statements to cause an error to be returned instead, though that would
imply that the visibility policy allows users to query records which
would end up just erroring out due to the 'WITH CHECK' policy and that
might end up providing the user with information about what's in the
relation that they aren't allowed to see due to the 'WITH CHECK' policy.
I can see how it would be useful when it's the intent of the
administrators to produce errors in cases where a user is trying to
access data they are not allowed to, as that could then be audited.  In
such a case, the actual visibility rule might be simply 'true', but an
error is thrown if the rows actually returned do not pass the
'WITH CHECK' policy specified.

> On Dean's refactoring patch, I would tend to favor back-patching
> whatever we do there to 9.5, but I'm not volunteering to do the work.

Alright, I'll see about getting that done then.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Double linking MemoryContext children

2015-09-11 Thread Jan Wieck

On 09/11/2015 09:38 AM, Tom Lane wrote:

Jan Wieck  writes:

The attached script demonstrates an O(N^2) problem we recently became
aware of. The script simply executes a large anonymous code block that
doesn't do anything useful. Usage is



 ./t1.py [NUM_STMTS] | psql [DBNAME]



NUM_STMTS defaults to 20,000. The code block executes rather fast, but
the entire script runs for over a minute (at 20,000 statements) on a
2.66 GHz Xeon.



The time is spent when all the prepared SPI statements are freed at the
end of execution. All prepared SPI statements are children of the Cache
context. MemoryContext children are a single linked list where new
members are inserted at the head. This works best when children are
created and destroyed in a last-in-first-out pattern. SPI however
destroys the SPI statements in the order they were created, forcing
MemoryContextSetParent() to traverse the entire linked list for each child.



The attached patch makes MemoryContext children a double linked list
that no longer needs any list traversal to find the position of the
child within the list.


Seems less invasive to fix SPI to delete in the opposite order?


That would assume that there are no other code paths that destroy memory 
contexts out of LIFO order.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Simon Riggs
On 11 September 2015 at 09:07, Fabien COELHO  wrote:


> Some general comments :
>

Thanks for the summary Fabien.


> I understand that what this patch does is cutting the checkpoint of
> buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> its own wal-log entry, pacing, fsync and so on.
>
> I'm not sure why it would be much better, although I agree that it may
> have some small positive influence on performance, but I'm afraid it may
> also degrade performance in some conditions. So I think that maybe a better
> understanding of why there is a better performance and focus on that could
> help obtain a more systematic gain.
>

I think its a good idea to partition the checkpoint, but not doing it this
way.

Splitting with N=16 does nothing to guarantee the partitions are equally
sized, so there would likely be an imbalance that would reduce the
effectiveness of the patch.


> This method interacts with the current proposal to improve the
> checkpointer behavior by avoiding random I/Os, but it could be combined.
>
> I'm wondering whether the benefit you see are linked to the file flushing
> behavior induced by fsyncing more often, in which case it is quite close
> the "flushing" part of the current "checkpoint continuous flushing" patch,
> and could be redundant/less efficient that what is done there, especially
> as test have shown that the effect of flushing is *much* better on sorted
> buffers.
>
> Another proposal around, suggested by Andres Freund I think, is that
> checkpoint could fsync files while checkpointing and not wait for the end
> of the checkpoint. I think that it may also be one of the reason why your
> patch does bring benefit, but Andres approach would be more systematic,
> because there would be no need to fsync files several time (basically your
> patch issues 16 fsync per file). This suggest that the "partitionning"
> should be done at a lower level, from within the CheckPointBuffers, which
> would take care of fsyncing files some time after writting buffers to them
> is finished.


The idea to do a partial pass through shared buffers and only write a
fraction of dirty buffers, then fsync them is a good one.

The key point is that we spread out the fsyncs across the whole checkpoint
period.

I think we should be writing out all buffers for a particular file in one
pass, then issue one fsync per file.  >1 fsyncs per file seems a bad idea.

So we'd need logic like this
1. Run through shared buffers and analyze the files contained in there
2. Assign files to one of N batches so we can make N roughly equal sized
mini-checkpoints
3. Make N passes through shared buffers, writing out files assigned to each
batch as we go
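
Purely as an illustration of step 2 (made-up relfilenode names and dirty-buffer counts, greedy assignment into the currently lightest batch; nothing here is backend code), a self-contained sketch of spreading files across N roughly equal batches -- sorting the files by size first would balance them even better:

#include <stdio.h>

#define NBATCH 4

typedef struct { const char *filename; int ndirty; } FileInfo;

int
main(void)
{
    /* Pretend result of step 1: dirty-buffer counts per relation file. */
    FileInfo files[] = {
        {"16384", 900}, {"16390", 40},  {"16401", 350}, {"16412", 700},
        {"16423", 120}, {"16430", 610}, {"16447", 80},  {"16455", 500},
    };
    const int nfiles = (int) (sizeof(files) / sizeof(files[0]));
    int load[NBATCH] = {0};
    int batch_of[sizeof(files) / sizeof(files[0])];

    /* Step 2: assign each file to the batch with the least work so far. */
    for (int i = 0; i < nfiles; i++)
    {
        int best = 0;

        for (int b = 1; b < NBATCH; b++)
            if (load[b] < load[best])
                best = b;
        batch_of[i] = best;
        load[best] += files[i].ndirty;
    }

    /* Step 3 would write out and fsync each batch in its own pass. */
    for (int b = 0; b < NBATCH; b++)
    {
        printf("batch %d (%4d dirty buffers):", b, load[b]);
        for (int i = 0; i < nfiles; i++)
            if (batch_of[i] == b)
                printf(" %s", files[i].filename);
        printf("\n");
    }
    return 0;
}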

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Double linking MemoryContext children

2015-09-11 Thread Tom Lane
Jan Wieck  writes:
> On 09/11/2015 09:38 AM, Tom Lane wrote:
>> Seems less invasive to fix SPI to delete in the opposite order?

> That would assume that there are no other code paths that destroy memory 
> contexts out of LIFO order.

Given that we've not seen this sort of problem reported in the dozen or
more years the code has been like that, I'm a bit dubious that we need
to worry about that.  It's possible you're right that we need to expend
more space in the MemoryContext lists --- but I'd like to see more than
one example.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Dean Rasheed
On 11 September 2015 at 13:05, Robert Haas  wrote:
> Thanks for the updates.  My thoughts:
>
> On RETURNING, it seems like we've got a fairly fundamental problem
> here.  If I understand correctly, the intention of allowing policies
> to be filtered by command type is to allow blind updates and deletes,
> but this behavior means that they are not really blind.  You can
> always use BEGIN/UPDATE-or-DELETE-with-RETURNING/ROLLBACK as a
> substitute for SELECT.  So the only possible thing you can do with the
> ability to filter by command tag that is coherent from a security
> point of view is to make the update and delete predicates *tighter*
> than the select predicate.
>
> And if that's where we end up, then haven't we fundamentally
> mis-designed the feature?  I mean, without the blind update case, it
> just seems kooky to have different commands see different rows.  It
> would be better to have all the command see the same rows, and then
> have update/delete *error out* if you try to update a row you're not
> allowed to touch.
>

I think blind updates are a pretty niche case, and I think that it
wasn't the intention to support them, but more of an unintentional
side effect of the implementation. That said, there are
just-about-plausible use-cases where they might be considered useful,
e.g., allow a password to be nulled out, forcing a reset, but don't
allow the existing value to be read. But then, as you say, RETURNING
blows a hole in the security of that model.

I still think the answer is to make RETURNING subject to SELECT
policies, with an error thrown if you attempt a blind-update-returning
for a row not visible to you, e.g.:

DELETE FROM foo WHERE id=10; -- OK even if row 10 is not visible

DELETE FROM foo WHERE id=10 RETURNING *;
ERROR:  row returned by RETURNING is not visible using row level
security policies for "foo"

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE POLICY and RETURNING

2015-09-11 Thread Stephen Frost
Zhaomo,

* Zhaomo Yang (zmp...@gmail.com) wrote:
> I admit that I gave some bad examples in the previous email, and it is fair
> to say
> this (Being able to have something like NEW.value > 10 OR OLD.id = 1) is
> not an advantage of what I proposed
> before I can come up with any real-world examples.

I do believe that real-world examples would be helpful here.

> > So there would also be a SELECT policy anyway, which is just like the
> > existing UPDATE USING policy is today and what you're really asking for
> > is the ability to have the WITH CHECK policy reference both the OLD and
> > NEW records.
> 
> Yes. Then we won't need any USING clauses for UPDATE/DELETE. For
> UPDATE/DELETE, we only need
> one predicate which can reference both OLD and NEW.

I agree that if we force a single visibility policy for all commands
then we wouldn't need the USING clauses for UPDATE and DELETE, but we
would certainly need *some* policy for DELETE to prevent users from
being able to delete records that they aren't supposed to be allowed to.
Therefore, we'd just be replacing the USING policy with a 'WITH CHECK'
policy, no?  I'm not against supporting a 'WITH CHECK' policy for
DELETE, as outlined nearby in the discussion with Robert, but that
strikes me as a new feature rather than an issue with the current
system.  Removing the existing ability to control the visibility on a
per-command basis is pretty clearly a reduction in the overall
flexibility of the system without a clear gain to me.

As a thought experiment which might prove helpful, consider if we
decided to implement a single visibility policy but then support the
"error-out instead" options being proposed here first.  Would we then,
later, be against a patch which proposed adding the flexibility to
control the visibility on a per-command basis?  I tend to doubt it.  In
hindsight, perhaps that ordering would have been better, or it would
have been better to get everything in all at once, though that has
drawbacks (I seriously doubt we would have made the progress we have
thus far towards BDR if it had to be done all at once).

> > I might be able to get behind supporting that, but I'm not
> > terribly excited about it and you've not provided any real use-cases for
> > it that I've seen
> 
> I think that there are two major advantages:
> 
> 1)
> As many folks have pointed out in this and other threads, this will makes
> information leakage less likely.
> Now a permissive USING clause for UPDATE/DELETE can give an attacker chance
> to read rows he
> is not allowed to SELECT. Even without leaky functions, an attacker can
> easily figure out the rows by doing a
> binary search with tricks like division by zero.

Using a single USING policy for all commands rather than per-command
USING policies does precisely the same thing as what is being proposed
here.  I'm certainly all for documenting the dangers of the per-command
USING policies and the risks outlined regarding RETURNING but I dislike
the general notion that we have to protect users from themselves as
there are use-cases where blind updates/deletes could be quite handy, as
I outlined over the summer in prior discussions (queueing systems come
to mind, in particular).  That's using the visibility controls of RLS
in a different way than users familiar with the limitations of RLS in
other systems might ever consider, but that doesn't make the use-case
unreasonable.

> 2)
> This proposal allows a user to reference both the OLD and NEW records in
> the same clause. For example,
> NEW.id == OLD.id , or NEW.value <= OLD.value + 10. I think this should be
> useful for users since they may often
> need to check the new value against the old one.

As I've said up-thread, this is something which I'm not against, but it
doesn't depend on #1 in any way, nor does #1 depend on #2, and so I
believe they should be independently discussed and this brought up as a
new feature capability rather than a concern about the existing
implementation.

> > it still doesn't really change anything regarding
> > RETURNING any differently than the earlier suggestions did about having
> > the SELECT policy applied to all commands.
> 
> No, it doesn't. I proposed it here because there are some related
> discussions (applying SELECT policy to other commands).

Yet, the ability to refer to OLD from 'WITH CHECK' policies isn't
particularly related to limiting the visibility policies which are
supported to a single SELECT USING policy.  Hopefully I've outlined why
I feel that to be the case above.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 2:01 AM, Kouhei Kaigai  wrote:
> If a callback on ForeignRecheck processes EPQ rechecks, the core PostgreSQL
> don't need to know what expression was pushed down and how does it kept in
> the private field (fdw_exprs). Only FDW driver knows which private field is
> the expression node that was pushed down to the remote side. It shall not be
> an interface contract.

I agree.  It seems needless to involve the core code here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Tomas Vondra



On 09/11/2015 07:16 PM, Robert Haas wrote:

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
 wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.


Well, this is part of how we're looking it differently.  I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value.  You are clearly defining the bug a bit differently.


Yes, I see it differently.

I don't quite understand why limiting the value is more "proper" than 
using a function that can handle the actual value.


The proposed bugfix addresses the issue in the most straightforward way, 
without introducing additional considerations about possible 
over-estimations (which the current code completely ignores, so this is 
a new thing). I think bugfixes should not introduce such changes to 
behavior (albeit internal), especially not without any numbers.


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Takashi Horikawa
Hello Andres,
Thank you for discussion. It’s nice for me to discuss here.

> Partitioned checkpoint have the significant disadvantage that it increases
> random write io by the number of passes. Which is a bad idea,
> *especially* on SSDs.
I’m curious about the conclusion that the Partitioned Checkpointing
increases random write io by the number of passes.

The order of buffer syncs in original checkpointing is as follows.
b[0], b[1], …..  , b[N-1]
where b[i] means buffer[i], N is the number of buffers.
(To make the story simple, starting point of buffer sync determined by the
statement ‘buf_id = StrategySyncStart(NULL, NULL);’ in BufferSync() is
ignored here. If it is important in this discussion, please note that.)

While partitioned checkpointing is as follows.
1st round
  b[0], b[p], b[2p], … b[(n-1)p]
2nd round
  b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
…
last round
  b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
where p is the number of partitions and n = (N / p).

I think what is important here is that the ‘Partitioned checkpointing’ does not
change (increase) the total number of buffer writes.
I wonder why the sequence of b[0], b[1], …..  , b[N-1] is less random than
that of b[0], b[p], b[2p], … b[(n-1)p]. I think there is no relationship
between the neighboring buffers, like b[0] and b[1]. Is this wrong?
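
As a trivial cross-check of the first claim (toy NBuffers and partition count only, nothing backend-specific): enumerating the partitioned order shows that every buffer id is still synced exactly once, so the total number of buffer writes is unchanged and only the order differs.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    enum { N = 32, P = 4 };     /* toy NBuffers and number of partitions */
    int seen[N];
    int writes = 0;

    memset(seen, 0, sizeof(seen));

    /* round r syncs b[r], b[r+P], b[r+2P], ... as described above */
    for (int r = 0; r < P; r++)
    {
        printf("round %d:", r);
        for (int b = r; b < N; b += P)
        {
            printf(" b[%d]", b);
            seen[b]++;
            writes++;
        }
        printf("\n");
    }

    for (int b = 0; b < N; b++)
        if (seen[b] != 1)
            printf("buffer %d synced %d times!\n", b, seen[b]);
    printf("total buffer writes: %d (N = %d)\n", writes, N);
    return 0;
}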

Also, I believe that random ‘PAGE’ writes are not harmful for SSDs.
(Buffer sync is carried out in units of 8 Kbyte pages.) What is harmful for
SSDs is a partial write (a write smaller than the PAGE size), because it
increases the write amplification of the SSD, shortening its lifetime. On the
other hand, IIRC, random ‘PAGE’ writes do not increase the write
amplification. The wear-leveling algorithm of the SSD should handle
random ‘PAGE’ writes effectively.

> I think it's likely that the patch will have only a very small effect if
> applied ontop of Fabien's patch (which'll require some massaging I'm
> sure).
It may be true or not. Who knows?
I think only detailed experimentation will tell the truth.

Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Andres Freund
> Sent: Saturday, September 12, 2015 1:30 AM
> To: Tomas Vondra
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Partitioned checkpointing
>
> Hi,
>
> Partitioned checkpoint have the significant disadvantage that it increases
> random write io by the number of passes. Which is a bad idea,
> *especially* on SSDs.
>
> > >So we'd need logic like this
> > >1. Run through shared buffers and analyze the files contained in
> > >there 2. Assign files to one of N batches so we can make N roughly
> > >equal sized mini-checkpoints 3. Make N passes through shared buffers,
> > >writing out files assigned to each batch as we go
>
> That's essentially what Fabien's sorting patch does by sorting all writes.
>
> > What I think might work better is actually keeping the write/fsync
> > phases we have now, but instead of postponing the fsyncs until the
> > next checkpoint we might spread them after the writes. So with
> > target=0.5 we'd do the writes in the first half, then the fsyncs in
> > the other half. Of course, we should sort the data like you propose,
> > and issue the fsyncs in the same order (so that the OS has time to write
> them to the devices).
>
> I think the approach in Fabien's patch of enforcing that there's not very
> much dirty data to flush by forcing early cache flushes is better. Having
> gigabytes worth of dirty data in the OS page cache can have massive
> negative
> impact completely independent of fsyncs.
>
> > I wonder how much the original paper (written in 1996) is effectively
> > obsoleted by spread checkpoints, but the benchmark results posted by
> > Horikawa-san suggest there's a possible gain. But perhaps partitioning
> > the checkpoints is not the best approach?
>
> I think it's likely that the patch will have only a very small effect if
> applied ontop of Fabien's patch (which'll require some massaging I'm
> sure).
>
> Greetings,
>
> Andres Freund
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] proposal: function parse_ident

2015-09-11 Thread Pavel Stehule
Hi

new update of parse_ident function patch

Regards

Pavel
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 22d4f61..1581d5a
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 1707,1712 
--- 1707,1727 

 
  
+  parse_ident
+ 
+ parse_ident(str text,
+[ strictmode boolean DEFAULT true ] )
+
+text[]
+Split qualified identifier to array parts.
+
+parse_ident('"SomeSchema".someTable')
+"SomeSchema,sometable"
+   
+ 
+   
+
+ 
   pg_client_encoding
  
  pg_client_encoding()
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
new file mode 100644
index ccc030f..9a7e89d
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
*** RETURNS jsonb
*** 940,942 
--- 940,949 
  LANGUAGE INTERNAL
  STRICT IMMUTABLE
  AS 'jsonb_set';
+ 
+ CREATE OR REPLACE FUNCTION
+   parse_ident(str text, strictmode boolean DEFAULT true)
+ RETURNS text[]
+ LANGUAGE INTERNAL
+ STRICT IMMUTABLE
+ AS 'parse_ident';
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
new file mode 100644
index c0495d9..8a68533
*** a/src/backend/utils/adt/misc.c
--- b/src/backend/utils/adt/misc.c
***
*** 21,32 
--- 21,35 
  #include 
  
  #include "access/sysattr.h"
+ #include "access/htup_details.h"
  #include "catalog/catalog.h"
+ #include "catalog/namespace.h"
  #include "catalog/pg_tablespace.h"
  #include "catalog/pg_type.h"
  #include "commands/dbcommands.h"
  #include "funcapi.h"
  #include "miscadmin.h"
+ #include "parser/scansup.h"
  #include "parser/keywords.h"
  #include "postmaster/syslogger.h"
  #include "rewrite/rewriteHandler.h"
***
*** 38,43 
--- 41,47 
  #include "utils/ruleutils.h"
  #include "tcop/tcopprot.h"
  #include "utils/acl.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/timestamp.h"
  
*** pg_column_is_updatable(PG_FUNCTION_ARGS)
*** 598,600 
--- 602,754 
  
  	PG_RETURN_BOOL((events & REQ_EVENTS) == REQ_EVENTS);
  }
+ 
+ 
+ /*
+  * These simple parser utilities are compatible with the lexer
+  * implementation; they are used only in the parse_ident function.
+  */
+ static bool
+ is_ident_start(unsigned char c)
+ {
+ 	if (c == '_')
+ 		return true;
+ 	if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
+ 		return true;
+ 
+ 	if (c >= 0200 && c <= 0377)
+ 		return true;
+ 
+ 	return false;
+ }
+ 
+ static bool
+ is_ident_cont(unsigned char c)
+ {
+ 	if (c >= '0' && c <= '9')
+ 		return true;
+ 
+ 	return is_ident_start(c);
+ }
+ 
+ /*
+  * parse_ident - decompose text identifier to parts of
+  * identifiers and the rest (which doesn't follow an identifier rule).
+  */
+ Datum
+ parse_ident(PG_FUNCTION_ARGS)
+ {
+ 	text		*qualname;
+ 	char		*qualname_str;
+ 	bool		strict_mode;
+ 	ArrayBuildState *astate = NULL;
+ 	char	*nextp;
+ 
+ 	qualname = PG_GETARG_TEXT_PP(0);
+ 	qualname_str = text_to_cstring(qualname);
+ 	strict_mode = PG_GETARG_BOOL(1);
+ 
+ 	nextp = qualname_str;
+ 
+ 	/* skip leading whitespace */
+ 	while (isspace((unsigned char) *nextp))
+ 		nextp++;
+ 
+ 	do
+ 	{
+ 		char		*curname;
+ 		char		*endp;
+ 		bool		missing_ident;
+ 
+ 		missing_ident = true;
+ 
+ 		if (*nextp == '\"')
+ 		{
+ 			curname = nextp + 1;
+ 			for (;;)
+ 			{
+ endp = strchr(nextp + 1, '\"');
+ if (endp == NULL)
+ 	ereport(ERROR,
+ 		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 		 errmsg("unclosed double quotes"),
+ 		 errdetail("string \"%s\" is not valid identifier",
+ 			text_to_cstring(qualname))));
+ if (endp[1] != '\"')
+ 	break;
+ memmove(endp, endp + 1, strlen(endp));
+ nextp = endp;
+ 			}
+ 			nextp = endp + 1;
+ 			*endp = '\0';
+ 
+ 			if (endp - curname == 0)
+ ereport(ERROR,
+ 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 	 errmsg("identifier should not be empty"),
+ 	 errdetail("string \"%s\" is not valid identifier",
+ 			text_to_cstring(qualname))));
+ 
+ 			astate = accumArrayResult(astate,
+ CStringGetTextDatum(curname), false,
+ 		TEXTOID, CurrentMemoryContext);
+ 			missing_ident = false;
+ 		}
+ 		else
+ 		{
+ 			if (is_ident_start((unsigned char) *nextp))
+ 			{
+ char *downname;
+ int	len;
+ text	*part;
+ 
+ curname = nextp++;
+ while (is_ident_cont((unsigned char) *nextp))
+ 	nextp++;
+ 
+ len = nextp - curname;
+ 
+ downname = downcase_truncate_identifier(curname, len, false);
+ part = cstring_to_text_with_len(downname, len);
+ astate = accumArrayResult(astate,
+ 	PointerGetDatum(part), false,
+ 			TEXTOID, CurrentMemoryContext);
+ missing_ident = false;
+ 			}
+ 		}
+ 
+ 		if (missing_ident)
+ 			ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+  errmsg("missing 

Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 7:33 AM, Stephen Frost  wrote:
> I've updated the page to add more details about the various items,
> though the only code changes at this point considered 'open' are this
> refactoring and the question regarding the 'row-level-security disabled'
> context which I posted a patch for discussion yesterday.
>
> Comments and help on these would certainly be welcome, of course.  We're
> working on a set of documentation updates to hopefully finish up in the
> next week to add more details about RLS (additional sub-sections,
> coverage of the issue Peter raised, additional discussion of RETURNING,
> etc).

Thanks for the updates.  My thoughts:

On RETURNING, it seems like we've got a fairly fundamental problem
here.  If I understand correctly, the intention of allowing policies
to be filtered by command type is to allow blind updates and deletes,
but this behavior means that they are not really blind.  You can
always use BEGIN/UPDATE-or-DELETE-with-RETURNING/ROLLBACK as a
substitute for SELECT.  So the only possible thing you can do with the
ability to filter by command tag that is coherent from a security
point of view is to make the update and delete predicates *tighter*
than the select predicate.

And if that's where we end up, then haven't we fundamentally
mis-designed the feature?  I mean, without the blind update case, it
just seems kooky to have different commands see different rows.  It
would be better to have all the command see the same rows, and then
have update/delete *error out* if you try to update a row you're not
allowed to touch.

On Dean's refactoring patch, I would tend to favor back-patching
whatever do there to 9.5, but I'm not volunteering to do the work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Bad row estimation with indexed func returning bool

2015-09-11 Thread Jehan-Guillaume de Rorthais
Hi,

I faced a correlation problem on a query today and tried the usual trick
consisting of using a functional index and rewriting the query to use it.

However, after writing the function, indexing it and rewriting the query, I
faced an optimizer behavior I was not expecting. I wrote a short scenario to
reproduce it on current HEAD:

  CREATE TABLE correl AS SELECT (i-1)%2 AS i, i%2 AS j 
  FROM generate_series(1,100000) AS i;

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;

  -- Seq Scan on correl (rows=25000) (rows=0)
  --Filter: ((i = 1) AND (j = 1))
  --Rows Removed by Filter: 100000
  -- Planning time: 0.356 ms
  -- Execution time: 21.937 ms

  CREATE FUNCTION fix_correl(int, int) RETURNS bool AS
'BEGIN RETURN $1 = 1 AND $2 = 1; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl(i, j) );

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);

  -- Index Scan using correl_fix_correl_idx on correl  (rows=3) (rows=0)
  --Index Cond: (fix_correl(i, j) = true)
  --Filter: fix_correl(i, j)
  -- Planning time: 0.421 ms
  -- Execution time: 0.102 ms


Using a function returning an integer works as expected:

  CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS
'BEGIN RETURN $1 + $2 ; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl_add( i, j ) );

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add( i, j ) = 2;
  -- Index Scan using correl_fix_correl_add_idx on correl  (rows=1) (rows=0)
  --Index Cond: (fix_correl_add(i, j) = 2) 
  -- Planning time: 0.462 ms
  -- Execution time: 0.102 ms


It works as expected with a simple index on (i + j) with no function, but
I wanted to have the same conditions in both tests.

Why does the optimizer behave differently in both cases? Why does it add a Filter
when index scanning on correl_fix_correl_idx, which indexes booleans?

Please, find the complete scenario in attachment.

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
ioguix@erg:~$ dropdb correl 
ioguix@erg:~$ createdb correl
ioguix@erg:~$ psql -qtX correl 
correl=# CREATE TABLE correl AS SELECT (i-1)%2 AS i, i%2 AS j FROM 
generate_series(1,100000) AS i;

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;
 Seq Scan on correl  (cost=0.00..1943.00 rows=25000 width=8) (actual 
time=21.898..21.898 rows=0 loops=1)
   Filter: ((i = 1) AND (j = 1))
   Rows Removed by Filter: 100000
 Planning time: 0.356 ms
 Execution time: 21.937 ms

correl=# CREATE FUNCTION fix_correl(int, int) RETURNS bool AS '
BEGIN RETURN $1 = 1 AND $2 = 1; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

correl=# CREATE INDEX ON correl ( fix_correl(i, j) );

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);
 Index Scan using correl_fix_correl_idx on correl  (cost=0.29..4.56 rows=3 
width=8) (actual time=0.053..0.053 rows=0 loops=1)
   Index Cond: (fix_correl(i, j) = true)
   Filter: fix_correl(i, j)
 Planning time: 0.421 ms
 Execution time: 0.102 ms

correl=# SELECT * FROM pg_stats WHERE tablename ~ '^correl_';
schemaname | public
tablename  | correl_fix_correl_idx
attname| fix_correl
inherited  | f
null_frac  | 0
avg_width  | 1
n_distinct | 1
most_common_vals   | {f}
most_common_freqs  | {1}
histogram_bounds   | 
correlation| 1
most_common_elems  | 
most_common_elem_freqs | 
elem_count_histogram   | 


correl=# CREATE INDEX ON correl ( (i + j) );

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE ( i + j ) = 2;
 Index Scan using correl_expr_idx on correl  (cost=0.29..4.31 rows=1 width=8) 
(actual time=0.027..0.027 rows=0 loops=1)
   Index Cond: ((i + j) = 2)
 Planning time: 0.175 ms
 Execution time: 0.076 ms

correl=# CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS '
BEGIN RETURN $1 + $2 ; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

correl=# CREATE INDEX ON correl ( fix_correl_add( i, j ) );

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add( i, j ) = 2;
 Index Scan using correl_fix_correl_add_idx on correl  (cost=0.29..4.31 rows=1 
width=8) (actual time=0.052..0.052 rows=0 loops=1)
   Index Cond: (fix_correl_add(i, j) = 2)
 Planning time: 0.462 ms
 Execution time: 0.102 ms

correl=# SELECT * FROM pg_stats WHERE tablename ~ '^correl_';
schemaname | public
tablename  | correl_fix_correl_idx
attname| fix_correl
inherited  | f
null_frac  | 0
avg_width  | 1
n_distinct | 1
most_common_vals   | {f}
most_common_freqs  | {1}
histogram_bounds   | 
correlation| 1
most_common_elems  | 

Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> On 10 September 2015 at 21:48, Robert Haas  wrote:
> > The open items list here:
> >
> > https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
> >
> > Dean's latest round of RLS refactoring.
> 
> This refactoring patch doesn't fix any behavioural issues. It is all
> about trying to make the code simpler and more readable and
> maintainable, and also addressing Alvaro's comments here:
> http://www.postgresql.org/message-id/20150522180807.gb5...@postgresql.org
> 
> However, it has bit-rotted over the last 3 months. In particular it
> doesn't take account of this change:
> http://git.postgresql.org/pg/commitdiff/dee0200f0276c0f9da930a2c926f90f5615f2d64
> 
> I will happily work up a rebased version of the patch, but only if I
> get a serious commitment from a committer to review it. Otherwise,
> I'll let it drop.

There's no question about getting it reviewed and moving it forward;
either Joe or myself will certainly be able to handle that if others
don't step up. 

I believe there's just a question about if it should be done for 9.5 or
only in master.

That said, addressing Alvaro's comments and avoiding unnecessary code
differences between the back branches and current might be sufficient
justification to move forward with it for 9.5 too.  I thought the
refactoring was a good change in general.

All,

I've updated the page to add more details about the various items,
though the only code changes at this point considered 'open' are this
refactoring and the question regarding the 'row-level-security disabled'
context which I posted a patch for discussion yesterday.

Comments and help on these would certainly be welcome, of course.  We're
working on a set of documentation updates to hopefully finish up in the
next week to add more details about RLS (additional sub-sections,
coverage of the issue Peter raised, additional discussion of RETURNING,
etc).

Thanks!

Stephen


signature.asc
Description: Digital signature


[HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-11 Thread YUriy Zhuravlev
Hello hackers!

Continuing the theme: 
http://www.postgresql.org/message-id/3368228.mTSz6V0Jsq@dinodell

This time, we fairly rewrote 'refcount' and 'usage_count' to atomics in 
PinBuffer and UnpinBuffer (but keep the lock for the buffer flags in Unpin).

At the same time this doesn't affect the correctness of the buffer manager,
because those variables already have an LWLock on top of them (for the partition of 
the hashtable). If someone pinned the buffer after the call to StrategyGetBuffer we just 
try again (in BufferAlloc).  Also, in the code there is one more check before 
deleting the old buffer, where the changes can be rolled back. The other functions 
that check 'refcount' and 'usage_count' take exclusive locks. 
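
For illustration only, here is a standalone sketch (C11 atomics, not the
patch's actual PinBuffer) of the technique: the refcount is bumped with an
atomic fetch-add and usage_count with a bounded compare-and-swap, so no
buffer-header spinlock is needed on this path.  BM_MAX_USAGE_COUNT mirrors
the PostgreSQL constant; the struct is a stand-in, not the real BufferDesc.

#include <stdatomic.h>

#define BM_MAX_USAGE_COUNT 5

typedef struct
{
	atomic_uint refcount;
	atomic_uint usage_count;
} BufDescSketch;

static void
pin_buffer_sketch(BufDescSketch *buf)
{
	/* unconditional pin: plain atomic increment */
	atomic_fetch_add_explicit(&buf->refcount, 1, memory_order_acq_rel);

	/* bump usage_count, but never above the cap */
	unsigned int old = atomic_load_explicit(&buf->usage_count, memory_order_relaxed);
	while (old < BM_MAX_USAGE_COUNT &&
		   !atomic_compare_exchange_weak_explicit(&buf->usage_count, &old, old + 1,
												   memory_order_relaxed,
												   memory_order_relaxed))
		;						/* CAS failure refreshed 'old'; retry */
}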

Also stress test with 256 KB shared memory ended successfully.

Without patch we have 417523 TPS and with patch 965821 TPS for big x86 server. 
All details here: https://gist.github.com/stalkerg/773a81b79a27b4d5d63f

Thank you. 
-- 
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 6622d22..50ca2a5 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -33,14 +33,14 @@ typedef struct
 	BlockNumber blocknum;
 	bool		isvalid;
 	bool		isdirty;
-	uint16		usagecount;
+	uint32		usagecount;
 
 	/*
 	 * An int32 is sufficiently large, as MAX_BACKENDS prevents a buffer from
 	 * being pinned by too many backends and each backend will only pin once
 	 * because of bufmgr.c's PrivateRefCount infrastructure.
 	 */
-	int32		pinning_backends;
+	uint32		pinning_backends;
 } BufferCachePagesRec;
 
 
@@ -160,8 +160,8 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 			fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
 			fctx->record[i].forknum = bufHdr->tag.forkNum;
 			fctx->record[i].blocknum = bufHdr->tag.blockNum;
-			fctx->record[i].usagecount = bufHdr->usage_count;
-			fctx->record[i].pinning_backends = bufHdr->refcount;
+			fctx->record[i].usagecount = pg_atomic_read_u32(&bufHdr->usage_count);
+			fctx->record[i].pinning_backends = pg_atomic_read_u32(&bufHdr->refcount);
 
 			if (bufHdr->flags & BM_DIRTY)
 fctx->record[i].isdirty = true;
@@ -236,7 +236,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 			values[7] = Int16GetDatum(fctx->record[i].usagecount);
 			nulls[7] = false;
 			/* unused for v1.0 callers, but the array is always long enough */
-			values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+			values[8] = UInt32GetDatum(fctx->record[i].pinning_backends);
 			nulls[8] = false;
 		}
 
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..e139a7c 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -96,8 +96,8 @@ InitBufferPool(void)
 
 			CLEAR_BUFFERTAG(buf->tag);
 			buf->flags = 0;
-			buf->usage_count = 0;
-			buf->refcount = 0;
+			pg_atomic_init_u32(&buf->usage_count, 0);
+			pg_atomic_init_u32(&buf->refcount, 0);
 			buf->wait_backend_pid = 0;
 
 			SpinLockInit(&buf->buf_hdr_lock);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8c0358e..afba360 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -962,7 +962,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * into the buffer.
 		 */
 		buf = GetBufferDescriptor(buf_id);
-
 		valid = PinBuffer(buf, strategy);
 
 		/* Can release the mapping lock as soon as we've pinned it */
@@ -1013,7 +1012,15 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = StrategyGetBuffer(strategy);
 
-		Assert(buf->refcount == 0);
+		/*
+		 * Ok, we can skip this but then we have to remove new buffer from
+		 * hash table. Better to just try again.
+		 */
+		if (pg_atomic_read_u32(&buf->refcount) != 0)
+		{
+			UnlockBufHdr(buf);
+			continue;
+		}
 
 		/* Must copy buffer flags while we still hold the spinlock */
 		oldFlags = buf->flags;
@@ -1211,7 +1218,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * over with a new victim buffer.
 		 */
 		oldFlags = buf->flags;
-		if (buf->refcount == 1 && !(oldFlags & BM_DIRTY))
+		if (pg_atomic_read_u32(&buf->refcount) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf);
@@ -1234,10 +1241,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	buf->tag = newTag;
 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
 	if (relpersistence == RELPERSISTENCE_PERMANENT)
-		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
+		buf->flags|= BM_TAG_VALID | BM_PERMANENT;
 	else
 		buf->flags |= BM_TAG_VALID;
-	buf->usage_count = 1;
+	pg_atomic_write_u32(&buf->usage_count, 1);
 
 	UnlockBufHdr(buf);
 
@@ -1329,7 +1336,7 @@ retry:
 	 * yet done StartBufferIO, WaitIO will fall through and we'll effectively
 	 * 

[HACKERS] Comment update to pathnode.c

2015-09-11 Thread Etsuro Fujita
Hi,

The comment for create_foreignscan_path says the following, but it's
now possible that the function is called by GetForeignJoinPaths, which
was added in 9.5.

1450 /*
1451  * create_foreignscan_path
1452  *    Creates a path corresponding to a scan of a foreign table,
1453  *    returning the pathnode.
1454  *
1455  * This function is never called from core Postgres; rather, it's expected
1456  * to be called by the GetForeignPaths function of a foreign data wrapper.
1457  * We make the FDW supply all fields of the path, since we do not have any
1458  * way to calculate them in core.
1459  */

So, I've updated the comments.  Please find attached a patch.

Best regards,
Etsuro Fujita
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***
*** 1449,1459  create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
  
  /*
   * create_foreignscan_path
!  *	  Creates a path corresponding to a scan of a foreign table,
!  *	  returning the pathnode.
   *
   * This function is never called from core Postgres; rather, it's expected
!  * to be called by the GetForeignPaths function of a foreign data wrapper.
   * We make the FDW supply all fields of the path, since we do not have any
   * way to calculate them in core.
   */
--- 1449,1460 
  
  /*
   * create_foreignscan_path
!  *	  Creates a path corresponding to a scan of a foreign table or
!  *	  a foreign join, returning the pathnode.
   *
   * This function is never called from core Postgres; rather, it's expected
!  * to be called by the GetForeignPaths function or the GetForeignJoinPaths
!  * function of a foreign data wrapper.
   * We make the FDW supply all fields of the path, since we do not have any
   * way to calculate them in core.
   */



Re: [HACKERS] CREATE POLICY and RETURNING

2015-09-11 Thread Zhaomo Yang
>
> I really don't like the approach you're suggesting above where an 'OR'
> inside of
> such a clause could mean that users can arbitrarly change any existing row
> without any further check on that row and I have a hard time seeing the
> use-case which justifies the additional complexity and user confusion.


I admit that I gave some bad examples in the previous email, and it is fair
to say that this (being able to have something like NEW.value > 10 OR OLD.id = 1) is
not an advantage of what I proposed, at least until I can come up with any
real-world examples.

So there would also be a SELECT policy anyway, which is just like the
> existing UPDATE USING policy is today and what you're really asking for
> is the ability to have the WITH CHECK policy reference both the OLD and
> NEW records.

Yes. Then we won't need any USING clauses for UPDATE/DELETE. For
UPDATE/DELETE, we only need
one predicate which can reference both OLD and NEW.

I might be able to get behind supporting that, but I'm not
> terribly excited about it and you've not provided any real use-cases for
> it that I've seen


I think that there are two major advantages:

1)
As many folks have pointed out in this and other threads, this will make
information leakage less likely.
Now a permissive USING clause for UPDATE/DELETE can give an attacker a chance
to read rows he
is not allowed to SELECT. Even without leaky functions, an attacker can
easily figure out the rows by doing a
binary search with tricks like division by zero.

2)
This proposal allows a user to reference both the OLD and NEW records in
the same clause. For example,
NEW.id == OLD.id , or NEW.value <= OLD.value + 10. I think this should be
useful for users since they may often
need to check the new value against the old one.


it still doesn't really change anything regarding
> RETURNING any differently than the earlier suggestions did about having
> the SELECT policy applied to all commands.


No, it doesn't. I proposed it here because there are some related
discussions (applying SELECT policy to other commands).

Thanks,
Zhaomo

On Tue, Aug 25, 2015 at 8:17 AM, Stephen Frost  wrote:

> Zhaomo,
>
> * Zhaomo Yang (zmp...@gmail.com) wrote:
> > > If no NEW or OLD is used, what happens?  Or would you have
> > > to always specify OLD/NEW for UPDATE, and then what about for the other
> > > policies, and the FOR ALL policies?
> >
> > I should be clearer with references to OLD/NEW. SELECT Predicates cannot
> > reference any of them.
> > INSERT predicates cannot refer to OLD and DELETE predicates cannot refer
> to
> > NEW. Basically,
> > for INSERT/UPDATE/DELETE, we specify predicates the same way as we do for
> > triggers' WHEN condition.
> >
> > As for FOR ALL, I think we will abandon it if we apply SELECT policy to
> > other commands, since SELECT predicate
> > will be the new universally applicable read policy, which makes the FOR
> ALL
> > USING clause much less useful. Of course users may need to specify
> separate
> > predicates for different commands, but I think it is fine. How often do
> > users want the same predicate for all the commands?
>
> I can certainly see use-cases where you'd want to apply the same policy
> to all new records, regardless of how they're being added, and further,
> the use-case where you want the same policy for records which are
> visible and those which are added.  In fact, I'd expect that to be one
> of the most common use-cases as it maps directly to a set of rows which
> are owned by one user, where that user can see/modify/delete their own
> records but not impact other users.
>
> So, I don't think it would be odd at all for users to want the same
> predicate for all of the commands.
>
> > > This could be accomplished with "USING (bar > 1)" and "WITH CHECK (foo
> >
> > > 1)", no?
> > > Your sentence above that "USING and WITH CHECK are combined by AND"
> > > isn't correct either- they're independent and are therefore really
> OR'd.
> > > If they were AND'd then the new record would have to pass both USING
> and
> > > WITH CHECK policies.
> >
> > No, it is impossible with the current implementation.
> >
> > CREATE TABLE test {
> >  id int,
> >  v1 int,
> >  v2 int
> > };
> >
> > Suppose that the user wants an update policy which is OLD.v1 > 10 OR
> NEW.v2
> > < 10.
> > As you suggested, we use the following policy
> >
> > CREATE update_p ON test
> > FOR UPDATE TO test_user
> > USING v1 > 10
> > WITH CHECK v2 < 10;
> >
> > (1) Assume there is only one row in the table
> > id |  v1 | v2 |
> > 1  | 11 | 20 |
> >
> > Now we execute  UPDATE test SET v2 = 100.
> > this query is allowed by the policy and the only row should be updated
> > since v1's old value > 10, but will trigger an error because it violates
> > the WITH CHECK clause.
>
> In this scenario, you don't care what the value of the NEW record is, at
> all?  As long as the old record had 'v1 > 10', then the resulting row
> can be anything?  I have to admit, I have a hard timing seeing 

Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Takashi Horikawa
Hi,

>   I understand that what this patch does is cutting the checkpoint
> of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> its own wal-log entry, pacing, fsync and so on.
Right. 
However, 
> The key point is that we spread out the fsyncs across the whole checkpoint
> period.
this is not the key point of the 'partitioned checkpointing,' I think.
The original purpose is to mitigate the full-page-write rush that occurs 
immediately after the beginning of each checkpoint.
The amount of FPW at each checkpoint is reduced to 1/16 by the 
'Partitioned checkpointing.' 

>   This method interacts with the current proposal to improve the
> checkpointer behavior by avoiding random I/Os, but it could be combined.
I agree.

> Splitting with N=16 does nothing to guarantee the partitions are equally
> sized, so there would likely be an imbalance that would reduce the
> effectiveness of the patch.
That may be right.
However, the current method was designed to split the 
buffers so as to balance the load as equally as possible;
current patch splits the buffer as
---
1st round: b[0], b[p], b[2p], … b[(n-1)p]
2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
…
p-1 th round:b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
---
where N is the number of buffers,
p is the number of partitions, and n = (N / p).

It would be extremely unbalance if buffers are divided as follow.
---
1st round: b[0], b[1], b[2], … b[n-1]
2nd round: b[n], b[n+1], b[n+2], … b[2n-1]
…
p-1 th round:b[(p-1)n], b[(p-1)n+1], b[(p-1)n+2], … b[(p-1)n+(n-1)]
---


I'm afraid that I miss the point, but
> 2.
> Assign files to one of N batches so we can make N roughly equal sized
> mini-checkpoints
Splitting buffers along file boundaries makes the FPW-related 
processing 
(in xlog.c and xloginsert.c) intolerably complicated, as 'Partitioned 
checkpointing' is strongly related to the decision of whether this buffer 
needs a FPW or not at the time of inserting the xlog record.
# 'partition id = buffer id % number of partitions' is fairly simple.
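
As a standalone sketch of the mapping described above (not the patch itself;
'npartitions' and 'nbuffers' are just illustrative parameters): checkpoint
round 'part' visits every buffer whose id satisfies id % npartitions == part,
so each round covers an almost equal share of the buffer pool regardless of
how the dirty pages are distributed across files.

static void
checkpoint_one_partition(int part, int npartitions, int nbuffers)
{
	for (int buf_id = part; buf_id < nbuffers; buf_id += npartitions)
	{
		/* write-out of buffer buf_id (if dirty) would happen here */
	}
}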

Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories



> -Original Message-
> From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Sent: Friday, September 11, 2015 10:57 PM
> To: Fabien COELHO
> Cc: Horikawa Takashi(堀川 隆); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Partitioned checkpointing
> 
> On 11 September 2015 at 09:07, Fabien COELHO  wrote:
> 
> 
> 
>   Some general comments :
> 
> 
> 
> Thanks for the summary Fabien.
> 
> 
>   I understand that what this patch does is cutting the checkpoint
> of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> its own wal-log entry, pacing, fsync and so on.
> 
>   I'm not sure why it would be much better, although I agree that
> it may have some small positive influence on performance, but I'm afraid
> it may also degrade performance in some conditions. So I think that maybe
> a better understanding of why there is a better performance and focus on
> that could help obtain a more systematic gain.
> 
> 
> 
> I think its a good idea to partition the checkpoint, but not doing it this
> way.
> 
> Splitting with N=16 does nothing to guarantee the partitions are equally
> sized, so there would likely be an imbalance that would reduce the
> effectiveness of the patch.
> 
> 
>   This method interacts with the current proposal to improve the
> checkpointer behavior by avoiding random I/Os, but it could be combined.
> 
>   I'm wondering whether the benefit you see are linked to the file
> flushing behavior induced by fsyncing more often, in which case it is quite
> close the "flushing" part of the current "checkpoint continuous flushing"
> patch, and could be redundant/less efficient that what is done there,
> especially as test have shown that the effect of flushing is *much* better
> on sorted buffers.
> 
>   Another proposal around, suggested by Andres Freund I think, is
> that checkpoint could fsync files while checkpointing and not wait for the
> end of the checkpoint. I think that it may also be one of the reason why
> your patch does bring benefit, but Andres approach would be more systematic,
> because there would be no need to fsync files several time (basically your
> patch issues 16 fsync per file). This suggest that the "partitionning"
> should be done at a lower level, from within the CheckPointBuffers, which
> would take care of fsyncing files some time after writting buffers to them
> is finished.
> 
> 
> The idea to do a partial pass through shared buffers and only write a fraction
> of dirty buffers, then fsync them is a good one.
> 
> The key point is that we spread out the fsyncs across the whole checkpoint
> period.
> 
> I think we should be writing out all buffers for a particular file in one
> pass, then issue one fsync per file.  >1 fsyncs per file seems a bad idea.
> 
> So we'd need 

Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Takashi Horikawa
Hello Fabien,

> I wanted to do some tests with this POC patch. For this purpose, it would
> be nice to have a guc which would allow to activate or not this feature.
Thanks.

> Could you provide a patch with such a guc? I would suggest to have the
> number
> of partitions as a guc, so that choosing 1 would basically reflect the
> current behavior.
Sure.
Please find the attached patch.

A guc parameter named 'checkpoint_partitions' is added.
This can be set to 1, 2, 4, 8, or 16.
Default is 16. (It is trivial at present, I think.)

And some definitions are moved from bufmgr.h to xlog.h.
It would be also trivial.

Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories


partitioned-checkpointing-v2.patch
Description: Binary data


smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] Autonomous Transaction is back

2015-09-11 Thread Noah Misch
On Fri, Sep 11, 2015 at 02:30:53PM -0500, Merlin Moncure wrote:
> On Thu, Sep 10, 2015 at 8:39 PM, Noah Misch  wrote:
> > It's an exceptionally-challenging development project, agreed.  So much code
> > assumes the 1:1 relationship between backends and top-level transactions.
> 
> I guess I'm being obtuse, but can you explain why that assumption must
> be revisited?

It's not imperative, but the proposal on the table does so.  Robert described
doing so as "a key design goal for this feature"[1].  I have a general picture
of why the proposers chose that, but I will defer to them for any elaboration.

[1] 
http://www.postgresql.org/message-id/ca+tgmoze__kohfuyvzpxvtjsb85xm34kc_t41-oxhtb_111...@mail.gmail.com




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-11 Thread Thom Brown
On 11 September 2015 at 22:34, Alvaro Herrera 
wrote:

> Rahila Syed wrote:
> > >This doesn't seem to compile
> > Oh. It compiled successfully when applied on HEAD on my machine. Anyways,
> > the OID is changed to 3309 in the attached patch. 3308 / 3309 both are
> part
> > of OIDs in unused OID list.
>
> I think Thom may have patched on top of some other patch.


I think you might be right.  I had run "git stash" and thought that would
be sufficient, but it seems "git clean -f" was necessary.

It builds fine now.

Thom


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-09-11 Thread Amit Kapila
On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas  wrote:
>
> On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila 
wrote:
> > > Could you perhaps try to create a testcase where xids are accessed
that
> > > are so far apart on average that they're unlikely to be in memory? And
> > > then test that across a number of client counts?
> > >
> >
> > Now about the test, create a table with large number of rows (say
11617457,
> > I have tried to create larger, but it was taking too much time (more
than a day))
> > and have each row with different transaction id.  Now each transaction
should
> > update rows that are at least 1048576 (number of transactions whose
status can
> > be held in 32 CLog buffers) distance apart, that way ideally for each
update it will
> > try to access Clog page that is not in-memory, however as the value to
update
> > is getting selected randomly and that leads to every 100th access as
disk access.
>
> What about just running a regular pgbench test, but hacking the
> XID-assignment code so that we increment the XID counter by 100 each
> time instead of 1?
>

If I am not wrong, we need a difference of 1048576 transactions
for each record to make each CLOG access a disk access, so if we
increment the XID counter by 100, then probably every 10000th (or a multiple
of 10000) transaction would go for disk access.

The number 1048576 is derived by below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.
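
Spelled out as a small standalone program (BLCKSZ is assumed to be the
default 8192, and NUM_CLOG_BUFFERS here simply stands for the 32 buffers
mentioned above):

#include <stdio.h>

#define BLCKSZ              8192
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)	/* 32768 */
#define NUM_CLOG_BUFFERS    32

int
main(void)
{
	/* prints 1048576: the xid span covered by 32 clog buffers */
	printf("%d\n", CLOG_XACTS_PER_PAGE * NUM_CLOG_BUFFERS);
	return 0;
}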

I think reducing it to every 100th transaction-status access being a disk access
is sufficient to prove that there is no regression with the patch for the
scenario
asked about by Andres, or do you think it is not?

Now another possibility here could be that we try commenting out fsync
in the CLOG path to see how much it impacts the performance of this test and
then of the pgbench test.  I am not sure there will be any impact, because even
if every 100th transaction goes for disk access, that is still less than the
WAL fsync which we have to perform for each transaction.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Moving SS_finalize_plan processing to the end of planning

2015-09-11 Thread Robert Haas
On Thu, Sep 10, 2015 at 9:49 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Do you expect to do more work on the upper planner path-ification stuff soon?
>
> Yeah, I do actually have some work in progress.  Not sure how soon I'll be
> ready to show it --- there's a lot of code to break and reassemble :-(

Yeah, I remember looking at it a bit and finding it quite an
intimidating piece of code.  I'm just asking because it would be nice
if we got the infrastructure into this release in time to do something
with it - specifically, I'd like to find a way for FDWs to push down
aggregates, a capability for which EDB has had multiple customer
requests (and I'm sure we're not the only ones).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-11 Thread Anastasia Lubennikova

Hi, hackers!

Use case:
Index-only scans are a wonderful feature that allows speeding up select 
queries on indexed columns.
Therefore some users want to create multicolumn indexes on columns which 
are queried often.
But if there's a unique constraint on some column, they have to maintain 
another unique index, even if the column is already in the covering index.
This adds overhead to data manipulation operations and database size.

I've started work on a patch that allows combining covering and unique 
functionality.
The main idea is to allow the user to create multicolumn indexes with a 
definite number of unique columns.

For example (don't mind SQL syntax here, please):
CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c2);
Created index has three columns, but only two of them have unique 
constraint.


This idea has an obvious restriction: we can set unique only on the first (leading) 
index columns.

There is no clear way to maintain the following index.
CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c3);

So I suggest the following syntax:
CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}}] INDEX ON 
table_name (column_name1, column_name2 ...);


Examples:
CREATE UNIQUE INDEX ON table_name (c1, c2, c3); // (c1,c2, c3) must be 
UNIQUE. That's how it works now.


CREATE UNIQUE ON FIRST COLUMN INDEX ON table_name (c1, c2, c3); // (c1) 
must be UNIQUE
CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ON table_name (c1, c2, c3); // 
(c1,c2) must be UNIQUE
CREATE UNIQUE ON FIRST 3 COLUMNS INDEX ON table_name (c1, c2, c3); // 
(c1,c2, c3) must be UNIQUE


The next issue is the pg_index changes.
Now there's only a boolean flag

 * bool indisunique; /* is this a unique index? */

But the new algorithm requires storing a single number

 * uint16 n_unique_columns; /* number of leading index columns which
   have the unique constraint */

I think the numbers of the attributes themselves are not needed. Is that right?
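
To illustrate the idea (purely hypothetical code, not the patch; Datum and
the comparator array are stand-ins for the real executor machinery): with a
per-index n_unique_columns, the uniqueness check only has to compare the
first n_unique_columns attributes of two index tuples and can ignore the
trailing "covering" columns.

typedef long Datum;				/* stand-in type */
typedef int (*compare_fn) (Datum a, Datum b);

static int
keys_conflict(const Datum *a, const Datum *b,
			  compare_fn *cmp, int n_unique_columns)
{
	for (int i = 0; i < n_unique_columns; i++)
	{
		if (cmp[i](a[i], b[i]) != 0)
			return 0;			/* differs within the unique prefix: no conflict */
	}
	return 1;					/* unique prefix equal: would violate the constraint */
}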


I'd like to see your suggestions about syntax changes.
And of course any other comments are welcome.

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Moving SS_finalize_plan processing to the end of planning

2015-09-11 Thread Tom Lane
Robert Haas  writes:
> On Thu, Sep 10, 2015 at 9:49 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Do you expect to do more work on the upper planner path-ification stuff 
>>> soon?

>> Yeah, I do actually have some work in progress.  Not sure how soon I'll be
>> ready to show it --- there's a lot of code to break and reassemble :-(

> Yeah, I remember looking at it a bit and finding it quite an
> intimidating piece of code.  I'm just asking because it would be nice
> if we got the infrastructure into this release in time to do something
> with it - specifically, I'd like to find a way for FDWs to push down
> aggregates, a capability for which EDB has had multiple customer
> requests (and I'm sure we're not the only ones).

Yeah, I'm well aware that this is blocking progress in several areas.
I do intend to make it happen, but it's not the only demand on my time.

regards, tom lane




Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-11 Thread Christoph Berg
Re: Bernd Helmle 2015-09-10 <7E3C7F8D210AC9A423E96F3A@eje.local>
> 2015-09-08 11:40:59 CEST [27047] DETAIL:  Could not seek in file
> "pg_multixact/members/5FC4" to offset 4294950912: Invalid argument.
> 2015-09-08 11:40:59 CEST [27047] CONTEXT:  xlog redo create mxid 1068235595
> offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh) 
> 2015-09-08 11:40:59 CEST [27045] LOG:  startup process (PID 27047) exited
> with exit code 1
> 2015-09-08 11:40:59 CEST [27045] LOG:  aborting startup due to startup
> process failure
> 
> Some side notes:
> 
> An additional recovery from a base backup and archive recovery yield to the
> same error, as soon as the affected tuple was touched with a DELETE. The
> affected table was fully dumpable via pg_dump, though.

A few more words here: the archive recovery was a pitr to 00:45, so
well before the problem, and the cluster was initially working well,
but crashed shortly after with the same mxid 1068235595 message. The
crash was triggered from a delete on a different table (which was
related schema-wise, but iirc neither of these tables has any FKs).

We then rewound the system to a zfs snapshot taken when the archive
recovery had finished (db shut down cleanly), and put it up again,
when it again crashed with mxid 1068235595, this time on a third
table.

The original crash and the first post-recovery crash happened a few
minutes after pg_start_backup(), though the next crash was without
that.


(While the archive recovery was running, I had pg_resetxlog the
original cluster. It was possible to isolate the ctid of an affected
tuple, but it wasn't possible to DELETE it, yielding an error message
similar to the above, but the database would continue. I then zeroed
the bad block using dd (zero_damaged_pages didn't help), only to find
that at least one more tuple in that table was affected (with a
different mxid).)

Christoph




[HACKERS] Did we forget to unpin buf in function "revmap_physical_extend" ?

2015-09-11 Thread Jinyu Zhang
In function "revmap_physical_extend", should we add "ReleaseBuffer(buf);" 
between lines 438 and 439?
422 else
 423 {
 424 if (needLock)
 425 LockRelationForExtension(irel, ExclusiveLock);
 426 
 427 buf = ReadBuffer(irel, P_NEW);
 428 if (BufferGetBlockNumber(buf) != mapBlk)
 429 {
 430 /*
 431  * Very rare corner case: somebody extended the relation
 432  * concurrently after we read its length.  If this happens, 
give
 433  * up and have caller start over.  We will have to evacuate 
that
 434  * page from under whoever is using it.
 435  */
 436 if (needLock)
 437 UnlockRelationForExtension(irel, ExclusiveLock);
 438 LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
 439 return;
 440 }
 441 LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 442 page = BufferGetPage(buf);
 443 
 444 if (needLock)
 445 UnlockRelationForExtension(irel, ExclusiveLock);
 446 }


Jinyu,
regards



Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-11 Thread Robert Haas
On Wed, Sep 9, 2015 at 4:30 PM, Kevin Grittner  wrote:
> Robert Haas  wrote:
>
>> I think the problem we should be trying to solve is: Given a set
>> of server IPs, connect to one that is up.
>>
>> I believe this comes up in several different scenarios.
>>
>> Example #1: [single server; changing IP address gracefully]
>>
>> Example #2: [xDB/BDR client uses local master if up; else a remote]
>>
>> Example #3: [SR/HS with changing primary]
>
> For all of those it is clear that we do not need (or want!)
> heartbeat, STONITH, fencing, etc. to be handled by the connector.
> If the above are the sorts of problems we are trying to solve, a
> very simple solution is the best.  I know you outlined several; I'm
> not sure it would matter all that much which one we used -- any
> would work and someone should Just Do It.
>
>> I'm sure there are more.
>
> Ay, there's the rub.  Some people seemed to be suggesting that this
> should be far more than what you describe above.  That would, IMO,
> be a mistake.

It sounds like we are pretty much on the same page.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Stephen Frost
* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> I think blind updates are a pretty niche case, and I think that it
> wasn't the intention to support them, but more of an unintentional
> side effect of the implementation. That said, there are
> just-about-plausible use-cases where they might be considered useful,
> e.g., allow a password to be nulled out, forcing a reset, but don't
> allow the existing value to be read. But then, as you say, RETURNING
> blows a hole in the security of that model.
> 
> I still think the answer is to make RETURNING subject to SELECT
> policies, with an error thrown if you attempt a blind-update-returning
> for a row not visible to you, e.g.:
> 
> DELETE FROM foo WHERE id=10; -- OK even if row 10 is not visible
> 
> DELETE FROM foo WHERE id=10 RETURNING *;
> ERROR:  row returned by RETURNING is not visible using row level
> security policies for "foo"

For a DELETE, applying the SELECT policy to RETURNING works, but it
doesn't work for UPDATE as the row being compared to the SELECT policy
would be the user-modified row and not the original; or at least, that's
what I recall from our discussion earlier in the summer.

Or are you suggesting that both UPDATE and DELETE apply the SELECT
policy, only when RETURNING is specified, to the original rows from the
table and throw an error if the row wouldn't be allowed per that policy?

That seems like it might be workable and is in-line with the regular
permissions system where we require SELECT rights if you specify
RETURNING but not otherwise (unless a predicate is specified, of
course), though my recollection is that there was disagreement about
having the RETURNING case throw errors rather than simply filtering the
records out (which gets us back to the discussion around applying a
single visibility policy).  Still, perhaps opinions have changed
regarding that.

Of course, it seems like that further limits the use-cases where the
blind updates could be done; though our existing support for similar
such cases (DELETE without a WHERE clause) is similarly limiting, so
perhaps that's not an issue.  This case is, perhaps, a bit different
since the user has the capability to explicitly specify the visibility
for the command either way (by either including or not including the
predicates in the SELECT policy in the UPDATE/DELETE policy), but we
don't currently support a way to alter the policy used based on the
existence of a RETURNING clause.  I had suggested supporting that quite
a while ago, but as I recall it wasn't well received.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Tom Lane
Dean Rasheed  writes:
> Yeah, we had a similar discussion regarding UPDATE USING policies and
> ON CONFLICT UPDATE clauses. I think the argument against filtering is
> that the rows returned would then be misleading about what was
> actually updated.

It seems to me that it would be a horribly bad idea to allow RLS to act
in such a way that rows could be updated and then not shown in RETURNING.

However, I don't see why UPDATE/DELETE with RETURNING couldn't be
restricted according to *both* the UPDATE and SELECT policies,
ie if there's RETURNING then you can't update a row you could not
have selected.  Note this would be a nothing-happens result not a
throw-error result, else you still leak info about the existence of
the row.

regards, tom lane




Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Dean Rasheed
On 11 September 2015 at 15:20, Stephen Frost  wrote:
> * Dean Rasheed (dean.a.rash...@gmail.com) wrote:
>> I think blind updates are a pretty niche case, and I think that it
>> wasn't the intention to support them, but more of an unintentional
>> side effect of the implementation. That said, there are
>> just-about-plausible use-cases where they might be considered useful,
>> e.g., allow a password to be nulled out, forcing a reset, but don't
>> allow the existing value to be read. But then, as you say, RETURNING
>> blows a hole in the security of that model.
>>
>> I still think the answer is to make RETURNING subject to SELECT
>> policies, with an error thrown if you attempt a blind-update-returning
>> for a row not visible to you, e.g.:
>>
>> DELETE FROM foo WHERE id=10; -- OK even if row 10 is not visible
>>
>> DELETE FROM foo WHERE id=10 RETURNING *;
>> ERROR:  row returned by RETURNING is not visible using row level
>> security policies for "foo"
>
> For a DELETE, applying the SELECT policy to RETURNING works, but it
> doesn't work for UPDATE as the row being compared to the SELECT policy
> would be the user-modified row and not the original; or at least, that's
> what I recall from our discussion earlier in the summer.
>
> Or are you suggesting that both UPDATE and DELETE apply the SELECT
> policy, only when RETURNING is specified, to the original rows from the
> table and throw an error if the row wouldn't be allowed per that policy?
>

Yes, that's what I was suggesting -- for UPDATE and DELETE with
RETURNING, test the OLD row against the table's SELECT policies. For
INSERT (or UPDATE and DELETE without RETURNING), do not check the
SELECT policies, allowing blind updates, but not blind updates with
RETURNING.


> That seems like it might be workable and is in-line with the regular
> permissions system where we require SELECT rights if you specify
> RETURNING but not otherwise (unless a predicate is specified, of
> course), though my recollection is that there was disagreement about
> having the RETURNING case throw errors rather than simply filtering the
> records out (which gets us back to the discussion around applying a
> single visibility policy).  Still, perhaps opinions have changed
> regarding that.
>

Yeah, we had a similar discussion regarding UPDATE USING policies and
ON CONFLICT UPDATE clauses. I think the argument against filtering is
that the rows returned would then be misleading about what was
actually updated.

Regards,
Dean




Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Joe Conway
On 09/11/2015 04:33 AM, Stephen Frost wrote:
> * Dean Rasheed (dean.a.rash...@gmail.com) wrote:
>> I will happily work up a rebased version of the patch, but only if I
>> get a serious commitment from a committer to review it. Otherwise,
>> I'll let it drop.
> 
> There's no question about getting it reviewed and moving it forward;
> either Joe or myself will certainly be able to handle that if others
> don't step up. 

Yes, we'll get it done.

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-09-11 Thread Amit Kapila
On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund  wrote:
>
> On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
> > pgbench setup
> > 
> > scale factor - 300
> > Data is on magnetic disk and WAL on ssd.
> > pgbench -M prepared tpc-b
> >
> > HEAD - commit 0e141c0f
> > Patch-1 - increase_clog_bufs_v1
> >
>
> The buffer replacement algorithm for clog is rather stupid - I do wonder
> where the cutoff is that it hurts.
>
> Could you perhaps try to create a testcase where xids are accessed that
> are so far apart on average that they're unlikely to be in memory? And
> then test that across a number of client counts?
>

Okay, I have tried one such test, but what I could come up with is one where on
average every 100th access is a disk access, and then tested it with
different numbers of clog buffers and client counts.  Below is the result:

Non-default parameters

max_connections = 300
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout=35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
autovacuum=off


HEAD - commit 49124613
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128

Client Count/Patch_ver      1      8     64    128
HEAD                     1395   8336  37866  34463
Patch-1                  1615   8180  37799  35315
Patch-2                  1409   8219  37068  34729
So there is not much difference in test results with different values for
Clog
buffers, probably because the I/O has dominated the test, and it shows that
increasing the clog buffers won't regress the current behaviour even though
there are a lot more accesses for transaction status outside the CLOG buffers.

Now about the test: create a table with a large number of rows (say 11617457;
I have tried to create a larger one, but it was taking too much time (more than a
day))
and have each row with a different transaction id.  Now each transaction
should
update rows that are at least 1048576 (the number of transactions whose status
can
be held in 32 CLog buffers) apart; that way, ideally, for each update
it will
try to access a Clog page that is not in memory.  However, as the value to
update
is getting selected randomly, that leads to every 100th access being a disk
access.

Test
---
1. Attached file clog_prep.sh should create and populate the required
table and create the function used to access the CLOG pages.  You
might want to update the no_of_rows based on the rows you want to
create in table
2. Attached file  access_clog_disk.sql is used to execute the function
with random values. You might want to update nrows variable based
on the rows created in previous step.
3. Use pgbench as follows with different client count
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 300 postgres
4. To ensure that clog access function always accesses same data
during each run, the test ensures to copy the data_directory created by
step-1
before each run.

I have checked by adding some instrumentation that approximately
every 100th access is disk access, attached patch clog_info-v1.patch
adds the necessary instrumentation in code.

As an example, pgbench test yields below results:
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 180 postgres

LOG:  trans_status(3169396)
LOG:  trans_status_disk(29546)
LOG:  trans_status(3054952)
LOG:  trans_status_disk(28291)
LOG:  trans_status(3131242)
LOG:  trans_status_disk(28989)
LOG:  trans_status(3155449)
LOG:  trans_status_disk(29347)

Here 'trans_status' is the number of times the process went for accessing
the CLOG status and 'trans_status_disk' is the number of times it went
to disk for accessing CLOG page.


>
> >  /*
> >   * Number of shared CLOG buffers.
> >   *
>
>
>
> I think the comment should be more drastically rephrased to not
> reference individual versions and numbers.
>

Updated comments and the patch (increase_clog_bufs_v2.patch)
containing the same is attached.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


clog_prep.sh
Description: Bourne shell script


access_clog_disk.sql
Description: Binary data


clog_info-v1.patch
Description: Binary data


increase_clog_bufs_v2.patch
Description: Binary data



Re: [HACKERS] Moving SS_finalize_plan processing to the end of planning

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 9:14 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Sep 10, 2015 at 9:49 PM, Tom Lane  wrote:
>>> Robert Haas  writes:
 Do you expect to do more work on the upper planner path-ification stuff 
 soon?
>
>>> Yeah, I do actually have some work in progress.  Not sure how soon I'll be
>>> ready to show it --- there's a lot of code to break and reassemble :-(
>
>> Yeah, I remember looking at it a bit and finding it quite an
>> intimidating piece of code.  I'm just asking because it would be nice
>> if we got the infrastructure into this release in time to do something
>> with it - specifically, I'd like to find a way for FDWs to push down
>> aggregates, a capability for which EDB has had multiple customer
>> requests (and I'm sure we're not the only ones).
>
> Yeah, I'm well aware that this is blocking progress in several areas.
> I do intend to make it happen, but it's not the only demand on my time.

Sure, I understand that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-11 Thread Syed, Rahila

Hello,

Please find attached updated VACUUM progress checker patch.
Following have been accomplished in the patch

1. Accounts for index pages count while calculating  total progress of VACUUM.
2. A common location for storing progress parameters for any command. The idea is that 
every command which needs to report progress can populate and interpret the 
shared variables in its own way.
 Each can display progress by implementing separate views.
3. A separate VACUUM progress view to display various progress parameters has 
been implemented.  Progress of various phases like heap scan, index scan, and total 
pages scanned, along with 
the completion percentage, is reported.
4.This view can display progress for all active backends running VACUUM.

Basic testing has been performed. Thorough testing is yet to be done. Marking 
it as Needs Review in  Sept-Commitfest.

ToDo:
Display count of heap pages actually vacuumed(marking line pointers unused)
Display percentage of work_mem being used to store dead tuples.

Thank you,
Rahila Syed



Vacuum_progress_checker_v2.patch
Description: Vacuum_progress_checker_v2.patch



Re: [HACKERS] Latent cache flush hazard in RelationInitIndexAccessInfo

2015-09-11 Thread Robert Haas
On Wed, Sep 9, 2015 at 3:00 PM, Tom Lane  wrote:
> Some stuff Salesforce has been doing turned up the fact that
> RelationInitIndexAccessInfo is not safe against a relcache flush on
> its target index.  Specifically, the failure goes like this:
>
> * RelationInitIndexAccessInfo loads up relation->rd_indextuple.
>
> * Within one of the subsequent catalog fetches, eg the pg_am read, a
> relcache flush occurs (this can be forced with CLOBBER_CACHE_ALWAYS).
>
> * RelationClearRelation keeps the relcache entry, since it has
> positive refcount, but doesn't see a reason to keep the index-related
> fields.
>
> * RelationInitIndexAccessInfo dumps core when it tries to fetch
> fields out of relation->rd_indextuple, which has been reset to NULL.
>
> Now, it turns out this scenario cannot happen in the standard Postgres
> code as it stands today (if it could, we'd have seen it in the buildfarm's
> CLOBBER_CACHE_ALWAYS testing).  The reason for that is that
> RelationInitIndexAccessInfo is actually only called on newly-minted
> Relation structs that are not yet installed into the relcache hash table;
> hence RelationCacheInvalidate can't find them to zap them.  It can happen
> in Salesforce's modified code though, and it's not hard to imagine that
> future changes in the community version might expose the same hazard.
> (For one reason, the current technique implies that an error halfway
> through relcache load results in leaking the Relation struct and
> subsidiary data; we might eventually decide that's not acceptable.)
>
> We could just ignore the problem until/unless it becomes real for us,
> but now that I've figured it out I'm inclined to do something before
> the knowledge disappears again.
>
> The specific reason there's a problem is that there's a disconnect between
> RelationClearRelation's test for whether a relcache entry has valid index
> info (it checks whether rd_indexcxt != NULL) and the order of operations
> in RelationInitIndexAccessInfo (creating the index context is not the
> first thing it does).  Given that if RelationClearRelation thinks the
> info is valid, what it does is call RelationReloadIndexInfo which needs
> rd_index to be valid, I'm thinking the best fix is to leave the order of
> operations in RelationInitIndexAccessInfo alone and instead make the
> "relcache entry has index info" check be "rd_index != NULL".  There might
> need to be some additional work to keep RelationReloadIndexInfo from
> setting rd_isvalid = true too soon.
>
> Thoughts?

Doesn't seem like a bad thing to clean up.
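For concreteness, here is a minimal sketch of the check change Tom describes,
paraphrasing (not quoting) the relevant block of RelationClearRelation() in
src/backend/utils/cache/relcache.c:

/*
 * Sketch only; the surrounding code is paraphrased and may not match the
 * tree exactly.  The point is that the "this relcache entry has valid
 * index info" test should look at rd_index, which is what
 * RelationReloadIndexInfo() actually depends on, rather than rd_indexcxt.
 */
if (relation->rd_rel->relkind == RELKIND_INDEX &&
    relation->rd_refcnt > 0 &&
    relation->rd_index != NULL)    /* proposed; currently rd_indexcxt != NULL */
{
    relation->rd_isvalid = false;  /* needs to be revalidated */
    if (IsTransactionState())
        RelationReloadIndexInfo(relation);
    return;
}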

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Double linking MemoryContext children

2015-09-11 Thread Jan Wieck

On 09/11/2015 09:58 AM, Tom Lane wrote:

Jan Wieck  writes:

On 09/11/2015 09:38 AM, Tom Lane wrote:

Seems less invasive to fix SPI to delete in the opposite order?



That would assume that there are no other code paths that destroy memory
contexts out of LIFO order.


Given that we've not seen this sort of problem reported in the dozen or
more years the code has been like that, I'm a bit dubious that we need
to worry about that.  It's possible you're right that we need to expend
more space in the MemoryContext lists --- but I'd like to see more than
one example.


I added a bit of debug code to MemoryContextSetParent(), tracking loop 
iterations per context name. This is what happens during "make check" of 
current master:


         name          | count | not_first | iterations
-----------------------+-------+-----------+------------
 CacheMemoryContext    |  6271 |      2634 |     104846
 ExecutorState         |    90 |      2951 |       7176
 TopMemoryContext      | 28464 |       620 |       2716
 SPI Proc              |  8424 |       225 |        381
 PortalMemory          | 24616 |        31 |        291
 TopTransactionContext | 19312 |       114 |        141
 ExprContext           |  3317 |         1 |          1

There was a total of 6,271 calls to MemoryContextSetParent() where the
old parent was CacheMemoryContext. For 2,634 of those, the context was
not parent->firstchild, and this led to 104,846 loop iterations in total.


The remaining numbers indicate that other contexts are mostly used in 
the intended fashion, but not strictly. This means there is definitely 
potential for more edge cases, similar to the SPI example above.
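To illustrate what those iteration counts mean, here is a toy sketch (not
PostgreSQL code; the struct is reduced to just the linkage fields) contrasting
the current singly linked unlink with the proposed doubly linked one:

#include <stddef.h>

typedef struct Ctx
{
    struct Ctx *parent;
    struct Ctx *firstchild;
    struct Ctx *nextchild;      /* singly linked sibling list (current) */
    struct Ctx *prevchild;      /* the proposed extra back-link */
} Ctx;

/* Singly linked unlink: cost grows with the child's position in the list. */
static void
unlink_singly(Ctx *node)
{
    Ctx *parent = node->parent;

    if (parent->firstchild == node)
        parent->firstchild = node->nextchild;
    else
    {
        Ctx *prev = parent->firstchild;

        while (prev->nextchild != node)     /* the loop being counted above */
            prev = prev->nextchild;
        prev->nextchild = node->nextchild;
    }
}

/* Doubly linked unlink: constant time regardless of position. */
static void
unlink_doubly(Ctx *node)
{
    if (node->prevchild != NULL)
        node->prevchild->nextchild = node->nextchild;
    else
        node->parent->firstchild = node->nextchild;

    if (node->nextchild != NULL)
        node->nextchild->prevchild = node->prevchild;
}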



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info




Re: [HACKERS] Do Layered Views/Relations Preserve Sort Order ?

2015-09-11 Thread Robert Haas
On Wed, Sep 9, 2015 at 7:53 PM, Charles Sheridan  wrote:
> When there are several views defined on top of each other, are SELECTs on
> views that do not specify a SORT order guaranteed to preserve the cumulative
> sort order of the lower-level views?
>
> Is the answer true for any arbitrarily large set of layered views?
>
> Is the answer the same if the layers of relations are a mix of views and
> tables?

If a view definition includes an ORDER BY clause, the output of that
view will be sorted accordingly.  But if you do something with that
output, like join it to another table, then the join might disturb the
sort order.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 10:49 AM, Tom Lane  wrote:
> Dean Rasheed  writes:
>> Yeah, we had a similar discussion regarding UPDATE USING policies and
>> ON CONFLICT UPDATE clauses. I think the argument against filtering is
>> that the rows returned would then be misleading about what was
>> actually updated.
>
> It seems to me that it would be a horribly bad idea to allow RLS to act
> in such a way that rows could be updated and then not shown in RETURNING.
>
> However, I don't see why UPDATE/DELETE with RETURNING couldn't be
> restricted according to *both* the UPDATE and SELECT policies,
> ie if there's RETURNING then you can't update a row you could not
> have selected.  Note this would be a nothing-happens result not a
> throw-error result, else you still leak info about the existence of
> the row.

Yes, this seems like an entirely reasonable way forward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RLS open items are vague and unactionable

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 9:48 AM, Stephen Frost  wrote:
> The only reason to avoid providing that flexibility is the concern that
> it might be misunderstood and users might misconfigure their system.
> Removing the flexibility to have per-command visibility policies and
> instead force a single visibility policy doesn't add any capabilities.

That seems like an extremely weak argument.  If a feature can't be
used for anything useful, the fact that it doesn't actively interfere
with the use of other features that are useful is not a reason to keep
it.  Clearly, something needs to be done about this.  Saying that you
can restrict by something other than ALL, but that it adds no security
and serves no use cases, is, frankly, a ridiculous position.  Tom's
proposal downthread is a reasonable one, and I endorse it: there may
be other approaches as well.  But regardless of the particular
approach, if we're going to have per-command policies, then you need
to do the work to make them useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-11 Thread Andres Freund
On 2015-09-11 14:25:39 +0200, Christoph Berg wrote:
> A few more words here: the archive recovery was a pitr to 00:45, so
> well before the problem, and the cluster was initially working well,
> but crashed shortly after with the same mxid 1068235595 message. The
> crash was triggered from a delete on a different table (which was
> related schema-wise, but iirc neither of these tables has any FKs).
>
> We then rewound the system to a zfs snapshot taken when the archive
> recovery had finished (db shut down cleanly), and put it up again,
> when it again crashed with mxid 1068235595, this time on a third
> table.
>
> The original crash and the first post-recovery crash happened a few
> minutes after pg_start_backup(), though the next crash was without
> that.

Do you still have access to that data? That'd make investigation of both
the issue and fixes/workaround far easier.

Greetings,

Andres Freund




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila  wrote:
> > Could you perhaps try to create a testcase where xids are accessed that
> > are so far apart on average that they're unlikely to be in memory? And
> > then test that across a number of client counts?
> >
>
> Now about the test: create a table with a large number of rows (say 11617457;
> I tried to create a larger one, but it was taking too much time, more than a
> day) and have each row carry a different transaction id.  Each transaction
> should then update rows that are at least 1048576 apart (the number of
> transactions whose status can be held in 32 CLOG buffers), so that ideally
> each update tries to access a CLOG page that is not in memory.  However, as
> the value to update is selected randomly, in practice only about every 100th
> access ends up being a disk access.

What about just running a regular pgbench test, but hacking the
XID-assignment code so that we increment the XID counter by 100 each
time instead of 1?
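(For concreteness, the kind of test-only hack I mean might look roughly like
the sketch below.  This is an assumption about how one would do it, not an
actual patch; it would sit inside GetNewTransactionId() in
src/backend/access/transam/varsup.c, while XidGenLock is still held.)

/*
 * Test-only hack, sketched for illustration: after handing out the current
 * XID, burn an extra 99 XIDs so that consecutive transactions end up
 * roughly 100 XIDs apart and therefore touch many more CLOG pages.  The
 * skipped XIDs still need their CLOG/subtrans pages created, hence the
 * Extend* calls before each advance.
 */
{
    int     i;

    for (i = 0; i < 99; i++)
    {
        ExtendCLOG(ShmemVariableCache->nextXid);
        ExtendSUBTRANS(ShmemVariableCache->nextXid);
        TransactionIdAdvance(ShmemVariableCache->nextXid);
    }
}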

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Partitioned checkpointing

2015-09-11 Thread Tomas Vondra



On 09/11/2015 03:56 PM, Simon Riggs wrote:


The idea to do a partial pass through shared buffers and only write a
fraction of dirty buffers, then fsync them is a good one.

The key point is that we spread out the fsyncs across the whole
checkpoint period.


I doubt that's really what we want to do, as it defeats one of the
purposes of spread checkpoints. With spread checkpoints, we write the
data to the page cache and then let the OS actually write it to disk.
This is handled by the kernel, which marks the data as expired after
some time (say, 30 seconds) and then flushes it to disk.


The goal is to have everything already written to disk when we call
fsync at the beginning of the next checkpoint, so that the fsyncs are
cheap and don't cause I/O issues.


What you propose (spreading the fsyncs) significantly changes that,
because it cuts the amount of time the OS has to write the data to disk
in the background down to 1/N of the checkpoint period. That's a
significant change, and I'd bet it's for the worse.




I think we should be writing out all buffers for a particular file
in one pass, then issue one fsync per file. >1 fsyncs per file seems
a bad idea.

So we'd need logic like this
1. Run through shared buffers and analyze the files contained in there
2. Assign files to one of N batches so we can make N roughly equal sized
mini-checkpoints
3. Make N passes through shared buffers, writing out files assigned to
each batch as we go


What I think might work better is actually keeping the write/fsync 
phases we have now, but instead of postponing the fsyncs until the next 
checkpoint we might spread them after the writes. So with target=0.5 
we'd do the writes in the first half, then the fsyncs in the other half. 
Of course, we should sort the data like you propose, and issue the 
fsyncs in the same order (so that the OS has time to write them to the 
devices).


I wonder how much the original paper (written in 1996) is effectively 
obsoleted by spread checkpoints, but the benchmark results posted by 
Horikawa-san suggest there's a possible gain. But perhaps partitioning 
the checkpoints is not the best approach?


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-11 Thread Robert Haas
On Thu, Sep 10, 2015 at 11:36 PM, Etsuro Fujita
 wrote:
> I've proposed the following API changes:
>
> * I modified create_foreignscan_path, which is called from
> postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
> subpath, is passed as the eighth argument of the function. subpath
> represents a local join execution path if scanrelid==0, but NULL if
> scanrelid>0.

OK, I see now.  But I don't much like the way
get_unsorted_unparameterized_path() looks.

First, it's basically praying that MergePath, HashPath, and NestPath
can be flat-copied without breaking anything.  In general, we have
copyfuncs.c support for nodes that we need to be able to copy, and we
use copyObject() to do it.  Even if what you've got here works today,
it's not very future-proof.

Second, what guarantee do we have that we'll find a path with no
pathkeys and a NULL param_info?  Why can't all of the paths for a join
relation have pathkeys?  Why can't they all be parameterized?  I can't
think of anything that would guarantee that.

Third, even if such a guarantee existed, why is this the right
behavior?  Any join type will produce the same output; it's just a
question of performance.  And if you have only one tuple on each side,
surely a nested loop would be fine.

It seems to me that what you ought to be doing is using data hung off
the fdw_private field of each RelOptInfo to cache a NestPath that can
be used for EPQ rechecks at that level.  When you go to consider
pushing down another join, you can build up a new NestPath that's
suitable for the new level.  That seems much cleaner than groveling
through the list of surviving paths and hoping you find the right kind
of thing.
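A minimal sketch of that suggestion, with hypothetical names (this is not
postgres_fdw's actual code):

#include "postgres.h"
#include "nodes/relation.h"     /* RelOptInfo, Path (9.5-era header) */

/* Hypothetical FDW-private struct hung off joinrel->fdw_private. */
typedef struct FdwJoinPrivate
{
    Path       *epq_path;       /* locally built join path cached for EPQ
                                 * rechecks at this join level */
} FdwJoinPrivate;

/*
 * Called from the FDW's GetForeignJoinPaths hook: remember the NestPath
 * built for this join level so it can be reused, and built upon, when
 * considering pushdown of the next join level.
 */
static void
cache_epq_path(RelOptInfo *joinrel, Path *local_join_path)
{
    FdwJoinPrivate *fpinfo = (FdwJoinPrivate *) joinrel->fdw_private;

    if (fpinfo == NULL)
    {
        fpinfo = (FdwJoinPrivate *) palloc0(sizeof(FdwJoinPrivate));
        joinrel->fdw_private = fpinfo;
    }
    fpinfo->epq_path = local_join_path;
}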

And all that having been said, I still don't really understand why you
are resisting the idea of providing a callback so that the FDW can
execute arbitrary code in the recheck path.  There doesn't seem to be
any reason not to let the FDW take control of the rechecks if it
wishes, and there's no real cost in complexity that I can see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Robert Haas
On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
 wrote:
> Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
> buckets are just pointers). No matter how awful an estimate you get (or how
> insanely high you set work_mem), you can't exceed this.

OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit.  Right now, if we would have allocated more than 512MB,
we instead fail.  There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.

My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate.  Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.
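For illustration, clamping as in #1 might look something like this (the names
and the particular limit are placeholders, not the actual nodeHash.c code):

/*
 * Illustrative only: clamp the initial bucket count so the bucket array
 * itself cannot exceed a fixed byte budget.  BUCKET_ARRAY_LIMIT is a
 * made-up name standing in for whatever limit is chosen (512MB vs. 16GB
 * is what we're arguing about); with 8-byte pointers a 512MB budget works
 * out to the 64 million buckets mentioned above.
 */
#define BUCKET_ARRAY_LIMIT  (512L * 1024L * 1024L)

static long
clamp_nbuckets(long nbuckets_estimated)
{
    long    max_buckets = BUCKET_ARRAY_LIMIT / (long) sizeof(void *);
    long    nbuckets = 1;

    /*
     * Round the estimate up to a power of two (the executor masks the hash
     * value rather than taking a modulo), but never let the bucket array
     * grow past the byte budget.
     */
    while (nbuckets < nbuckets_estimated && nbuckets * 2 <= max_buckets)
        nbuckets <<= 1;

    return nbuckets;
}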

I guess we'll need to wait for some other opinions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Tomas Vondra



On 09/11/2015 06:55 PM, Robert Haas wrote:

On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
 wrote:

Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
buckets are just pointers). No matter how awful an estimate you get (or how
insanely high you set work_mem), you can't exceed this.


OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit.  Right now, if we would have allocated more than 512MB,
we instead fail.  There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.


I'm arguing for fixing the existing bug, and then addressing the case of 
over-estimation separately, with proper analysis.




My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate.  Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.


Not quite, my judgment is that

- We shouldn't address this in this particular bugfix, because it's a
  separate problem (even if we limit the initial allocation, we still
  have to fix the repalloc after we build the Hash).

- I assume the "might cause problems" refers to malloc() issues on some
  platforms. In that case we still have to apply it to both places, not
  just to the initial allocation. I don't know if this is a problem (I
  haven't heard any such reports so far), but if it is, we'd better
  address this consistently everywhere, not just this one place.

- I'm not really sure about the impact of the additional resize. I
  surely don't want to significantly penalize the well-estimated cases,
  so I'd like to see some numbers first.



I guess we'll need to wait for some other opinions.



OK

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

