Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread David Rowley
On 17 December 2017 at 16:22, Robert Haas  wrote:
> On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera  
> wrote:
>> We have two options for marking valid:
>>
>> 1. after each ALTER INDEX ATTACH, verify whether the set of partitions
>> that contain the index is complete; if so, mark it valid, otherwise do
>> nothing.  This sucks because we have to check that over and over for
>> every index that we attach
>>
>> 2. We invent yet another command, say
>> ALTER INDEX  VALIDATE

It's not perfect that we need to validate each time, but validation
might not be that expensive, since we only really need to count the
pg_index rows with indisvalid = true that have a parent index matching
the index we're ATTACHing to, then ensure that count matches the
number of leaf partitions.
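As a sketch, the completeness check could boil down to a single
catalog query along these lines (the pg_index column holding the
parent index, here called indparentidx, is an assumption of the
proposal and not an existing column; 'parent_idx' and 'parent_tab'
are placeholder names):

```sql
-- Hypothetical completeness check: does the number of valid child indexes
-- attached to the parent index match the number of partitions?  Note that
-- pg_inherits only lists direct children, so a multi-level hierarchy would
-- need a recursive variant.
SELECT (SELECT count(*)
          FROM pg_index
         WHERE indisvalid
           AND indparentidx = 'parent_idx'::regclass) =
       (SELECT count(*)
          FROM pg_inherits
         WHERE inhparent = 'parent_tab'::regclass) AS index_is_complete;
```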

> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
> think it might as well try to validate while it's at it.  But if not
> then we might want to go with #2.

I'm now not that clear on what the behaviour is if the ONLY keyword is
not specified on the CREATE INDEX for the partitioned index. Does that
go and create each leaf partition index regardless of whether there is
a suitable candidate to ATTACH?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: GSoC 2018

2017-12-17 Thread Alvaro Hernandez



On 15/12/17 13:57, Aleksander Alekseev wrote:

Hi Stephen,


HA/fail-over is a very broad topic, with a lot of pieces that need to be
done such that I'm not sure it's really viable, but perhaps a precursor
project (synchronous logical replication seems like a prereq, no?) would
make more sense.  Or, perhaps, a different piece of the HA question, but
solving the whole thing in a summer strikes me as unlikely to be
reasonable.

[...]




    I agree with most of the comments down-thread that this project
seems to undertake much more work than a GSoC project could cover. I'd
like to chime in with a few comments regarding other aspects of the
proposal:


- I believe we're mixing up what HA and replication are. HA is about
promoting a secondary node when the primary fails, preventing two
primaries at the same time, ensuring one master is always up, and so
forth. Replication is just a data replication mechanism; it *helps*
with HA, but it is not HA, nor a necessary prerequisite (you could
actually build an HA solution based on shared disk). Indeed, HA can be
built on top of SR, logical replication (with many caveats, as already
commented), disk replication techniques outside of PG, and probably
others. The topic is way broader.


- If the project's goal is to improve on some of logical replication's
shortcomings, so that it becomes a useful replication technique on
which HA solutions could be built, my +1 for that.


- If the project's goal is to augment an existing HA solution to
abstract away the data replication technique (so that SR is not the
only option, but also logical or disk-based replication or whatever)
from the core algorithm, I believe this is also a very sensible GSoC
project.


- Adding a proxy layer to any existing HA project is IMO a project on
its own. I don't even see this as a feasible GSoC project.



    Cheers,

    Álvaro


--

Alvaro Hernandez


---
OnGres




Small typo in comment in json_agg_transfn

2017-12-17 Thread David Rowley
The attached fixed a small typo in json_agg_transfn.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


json_agg_transfn_comment_fix.patch
Description: Binary data


Parallel Aggregates for string_agg and array_agg

2017-12-17 Thread David Rowley
Hi,

While working on partial aggregation a few years ago, I didn't really
think it was worthwhile allowing partial aggregation of string_agg and
array_agg. I soon realised that I was wrong about that: allowing
parallelisation of these aggregates could still be very useful when
many rows are filtered out during the scan.

Some benchmarks that I've done locally show that parallel string_agg
and array_agg do actually perform better, despite the fact that the
aggregate state grows linearly with each aggregated item. Obviously,
the performance will get even better when workers are filtering out
rows before aggregation takes place, so it seems worthwhile doing
this. However, the main reason I'm motivated to do this is that there
are more uses for partial aggregation than just parallel aggregation,
and it seems a shame to disable all those features if a single
aggregate does not support partial mode.

I've attached a patch which implements all this. I've had most of it
stashed away for a while now, but I managed to get some time this
weekend to get it into a more completed state.

Things are now looking pretty good for the number of aggregates that
support partial mode.

Just a handful of aggregates now don't support partial aggregation;

postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and
aggkind='n';
 aggfnoid
------------------
 xmlagg
 json_agg
 json_object_agg
 jsonb_agg
 jsonb_object_agg
(5 rows)

... and a good number do support it;

postgres=# select count(*) from pg_aggregate where aggcombinefn<>0 and
aggkind='n';
 count
-------
   122
(1 row)

There's probably no reason why the last 5 of those couldn't be done
either; it might just require shifting a bit more work into the final
functions. However, I'm not planning on that for this patch.

As for the patch: there's a bit of a quirk in the implementation of
string_agg. We previously always threw away the delimiter that belongs
to the first aggregated value, but we now need to keep that around so
we can put it in between two states in the combine function. I decided
the path of least resistance was to borrow StringInfo's cursor
variable as a pointer to the start of the first value and put the
first delimiter before that. Both string_agg(text) and
string_agg(bytea) already have final functions, so we just need to
skip over the bytes up to the cursor position to get rid of the first
delimiter. I could invent some new state type to do the same, but I
don't really see the trouble with what I've done with StringInfo; I'll
certainly listen if someone else thinks this is wrong.
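For reference, the serial behaviour the combine function has to
reproduce is that the delimiter only ever appears *between* values,
which is why each partial state must remember the leading delimiter it
initially suppressed:

```sql
-- Serial aggregation drops the delimiter before the first value:
SELECT string_agg(x, ',' ORDER BY x)
FROM (VALUES ('a'), ('b'), ('c')) v(x);
-- result: a,b,c
-- Conceptually, two partial states "[,]a" and "[,]b,c" (with the first
-- delimiter kept aside) must combine to 'a,b,c', not 'a,,b,c' or ',a,b,c'.
```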

Another thing that I might review later about this is seeing about
getting rid of some of the code duplication between
array_agg_array_combine and accumArrayResultArr.

I'm going to add this to PG11's final commitfest rather than the
January 'fest as it seems more like a final commitfest type of patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


combinefn_for_string_and_array_aggs_v1.patch
Description: Binary data


Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread David Rowley
On 17 December 2017 at 16:24, Robert Haas  wrote:
> On Thu, Dec 14, 2017 at 12:23 AM, Amit Langote
>  wrote:
>> You may have guessed from $subject that the two don't work together.
>
> It works exactly as documented:
>
> pg_total_relation_size(regclass) - Total disk space used by the
> specified table, including all indexes and TOAST data
>
> It says nothing about including partitions.  If we change this, then
> we certainly need to update the documentation (that might be a good
> idea if we decide not to update this).
>
> Personally, I'm -1 on including partitions, because then you can no
> longer expect that the sum of pg_total_relation_size(regclass) across
> all relations in the database will equal the size of the database
> itself.  Partitions will be counted a number of times equal to their
> depth in the partitioning hierarchy.  However, I understand that I
> might get outvoted.

I'd also vote to leave the relation_size functions alone.

Perhaps it's worth thinking of changing pg_table_size() instead. We
have taken measures to try to hide the fact that a table is made up of
a bunch of partitions from the user in some cases, e.g. DROP TABLE
works without CASCADE for a partitioned table. I'm sure there are
arguments in both directions here too, though.

I generally think of the difference between a relation and a table
similarly to the difference between a tuple and a row. A relation is
just what we use to implement tables, and there can be multiple
relations per table (in the partitioning case), similar to how there
can be multiple tuple versions for a single row. So that might back up
the idea that pg_table_size() should include all relations that make
up the table.
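Under that reading, an inclusive pg_table_size() would amount to
something like the query below (a sketch only, covering a single level
of partitioning; 'parent_tab' is a placeholder, and a multi-level
hierarchy would need a recursive query):

```sql
-- Sum pg_table_size() over a partitioned table and its direct partitions.
SELECT pg_table_size('parent_tab'::regclass) +
       coalesce(sum(pg_table_size(inhrelid)), 0) AS inclusive_table_size
  FROM pg_inherits
 WHERE inhparent = 'parent_tab'::regclass;
```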

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why does array_position_common bitwise NOT an Oid type?

2017-12-17 Thread David Rowley
On 17 December 2017 at 14:53, Tom Lane  wrote:
> David Rowley  writes:
>> I was puzzled to see the following code:
>
>> my_extra->element_type = ~element_type;
>
> If memory serves, the idea was to force the subsequent datatype-lookup
> path to be taken, even if for some reason element_type is InvalidOid.
> If we take the lookup path then the bogus element_type will be detected
> and reported; if we don't, it won't be.

That makes sense. I'd just have expected more documentation on that,
although perhaps I just didn't look hard enough. When I sent this
originally, I failed to notice that the same pattern occurs over and
over.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera  
> wrote:
> > We have two options for marking valid:
> >
> > 1. after each ALTER INDEX ATTACH, verify whether the set of partitions
> > that contain the index is complete; if so, mark it valid, otherwise do
> > nothing.  This sucks because we have to check that over and over for
> > every index that we attach
> >
> > 2. We invent yet another command, say
> > ALTER INDEX  VALIDATE
> 
> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
> think it might as well try to validate while it's at it.  But if not
> then we might want to go with #2.

The problem I have with it is that restoring a dump containing indexes
on partitions becomes an O(N^2) deal, as it has to do the full check
once for every index we attach.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Top-N sorts verses parallelism

2017-12-17 Thread Robert Haas
On Fri, Dec 15, 2017 at 4:13 PM, Thomas Munro
 wrote:
> Looks right to me.  Commit 3452dc52 just forgot to tell the planner.
> I'm pleased about that because it makes this a slam-dunk bug-fix and
> not some confusing hard to justify costing problem.

Jeff Janes inquired off-list about other places where cost_sort() gets called.

In costsize.c, it is called from initial_cost_mergejoin, which seems
like it does not present an opportunity for pushing down limits, since
we don't know how many rows we'll have to join to get a certain number
of outputs.

In createplan.c, it is called only from label_sort_with_costsize().
That in turn is called from create_merge_append_plan(), passing the tuple
limit from the path being converted to a plan, which seems
unimpeachable.  It's also called from make_unique_plan() and
create_mergejoin_plan() with -1, which seems OK since in neither case
do we know how many input rows we need to read.

In planner.c, it's called from plan_cluster_use_sort() with -1;
CLUSTER has to read the whole input, so that's fine.

In prepunion.c, it's called from choose_hashed_setop() with -1.  I
think set operations also need to read the whole input.

In pathnode.c, it's called from create_merge_append_path,
create_unique_path, create_gather_merge_path, create_groupingsets_path,
and create_sort_path.  create_merge_append_path passes any limit
applicable to the subpath.  create_unique_path passes -1.
create_gather_merge_path also passes -1, which as Jeff also pointed
out seems possibly wrong.  create_sort_path also passes -1, and so
does create_groupingsets_path.

I went through the callers to create_sort_path and the only one that
looks like it can pass a limit is the one you and Jeff already
identified.  So I think the question is just whether
create_gather_merge_path needs a similar fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Backfill bgworker Extension?

2017-12-17 Thread Jeremy Finzel
On Sat, Dec 16, 2017 at 8:31 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 12/15/17 23:50, Jeremy Finzel wrote:
> > The common ground is some column in some table needs to be bulk updated.
> > I may not be explaining well, but in our environment we have done
> > hundreds of these using a generic framework to build a backfill. So I’m
> > not sure what you are questioning about the need? We have had to build a
> > worker to accomplish this because it can’t be done as a sql script alone.
>
> I'm trying to identify the independently useful pieces in your use case.
>  A background worker to backfill large tables is a very specific use
> case.  If instead we had a job/scheduler mechanism and a way to have
> server-side scripts that can control transactions, then that might
> satisfy your requirements as well (I'm not sure), but it would also
> potentially address many other uses.


I follow you now.  Yes, I think it probably would.  I think it would
at least provide a framework on which to build the tool I want.  It
would be great to have a "worker-capable" tool inside postgres rather
than always having to write external logic to do things like this.

> I’m not sure what you mean by a stored procedure in the background.
> > Since it would not be a single transaction, it doesn’t fit as a stored
> > procedure at least in Postgres when a function is 1 transaction.
>
> In progress: https://commitfest.postgresql.org/16/1360/


Looking forward to this.  I think this will help.  A stored procedure
with subtransactions could have the logic for the backfill in it, but
we would still need an external worker that could retry it in case of
failure, especially when things like a server restart happen.

Thanks,
Jeremy


Re: [HACKERS] Something for the TODO list: deprecating abstime and friends

2017-12-17 Thread David Fetter
On Wed, Dec 13, 2017 at 12:05:06AM -0800, Andres Freund wrote:
> On 2017-07-15 18:00:34 -0400, Tom Lane wrote:
> > * contrib/spi/timetravel depends on abstime columns to represent what
> > would nowadays be better done as a tstzrange.  I'd have thought we
> > could maybe toss that example module overboard, but we just today got
> > a question about its usage, so I'm afraid it's not quite dead yet.
> > What shall we do with that?
> 
> Looking at the code I'd be pretty strongly inclined to scrap it.
> 
> 
> Before I'd discovered this thread, I'd started to write up a
> patch. Attached. It's clearly not fully done. Questions I'd while
> hacking things up:
> - what to do with contrib/spi/timetravel - I'd just removed it from the
>   relevant Makefile for now.

Porting it to tstzrange seems friendlier, however:

- Does porting it look like significant work?
- What about the work involved in upgrading existing installations?

> - nabstime.c currently implements timeofday() which imo is a pretty
>   weird function. I'd be quite inclined to remove it at the same time as
>   this.

+1 for binning it.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [sqlsmith] Parallel worker executor crash on master

2017-12-17 Thread Thomas Munro
On Sun, Dec 17, 2017 at 12:26 PM, Andreas Seltenreich
 wrote:
> Thomas Munro writes:
>
>> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich
>>  wrote:
>>> Core was generated by `postgres: smith regression [local] SELECT
>>> '.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
>>> 283 estate->es_query_dsa = 
>>> gatherstate->pei->area;
>>> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
>>
>> Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is
>> NULL, or if it's somehow pointing to garbage?
>
> It was NULL on all the coredumps I looked into.  Below[1] is a full
> gatherstate.
>
>> Not sure how either of those things could happen, since we only set it
>> to NULL in ExecShutdownGather() after which point we shouldn't call
>> ExecGather() again, and any MemoryContext problems with pei should
>> have caused problems already without this patch (for example in
>> ExecParallelCleanup).  Clearly I'm missing something.
>
> FWIW, all backtraces collected so far are identical for the first nine
> frames.  After ExecProjectSet, they are pretty random executor innards.

Oh, right.  We only create pei if (gather->num_workers > 0 &&
estate->es_use_parallel_mode), so I need to teach that patch that pei
may be NULL.  I'll go and post a new version like that over on the
other thread.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Protect syscache from bloating with negative cache entries

2017-12-17 Thread Robert Haas
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund  wrote:
>> I'm not sure we should regard very quick bloating as a problem in need
>> of solving.  Doesn't that just mean we need the cache to be bigger, at
>> least temporarily?
>
> Leaving that aside, is that actually not at least to a good degree,
> solved by that problem? By bumping the generation on hash resize, we
> have recency information we can take into account.

I agree that we can do it.  I'm just not totally sure it's a good
idea.  I'm also not totally sure it's a bad idea, either.  That's why
I asked the question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 9:54 AM, David Rowley
 wrote:
> I'd also vote to leave the relation_size functions alone.
>
> Perhaps it's worth thinking of changing pg_table_size() instead. We
> have taken measures to try and hide the fact that a table is made up
> of a bunch of partitions from the user in some cases, e.g DROP TABLE
> works without CASCADE for a partitioned table. I'm sure there are
> arguments in both directions here too though.

Yeah, I don't really understand why changing pg_table_size() is any
more desirable than changing pg_relation_size().

I mean, we could have a table-size function that takes an array of
things you want to include (indexes, toast, partitions, etc.), but
changing the semantics of existing functions seems like it's going to
be more painful than helpful (aside from the arguments I brought up
before).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Michael Paquier
On Sun, Dec 17, 2017 at 11:54 PM, David Rowley
 wrote:
> I'd also vote to leave the relation_size functions alone.

Count me in that bucket as well.

> Perhaps it's worth thinking of changing pg_table_size() instead. We
> have taken measures to try and hide the fact that a table is made up
> of a bunch of partitions from the user in some cases, e.g DROP TABLE
> works without CASCADE for a partitioned table. I'm sure there are
> arguments in both directions here too though.
>
> I generally think of the difference between a relation and a table
> similarly to the difference between a tuple and a row. A relation is
> just what we use to implement tables, and there can be multiple
> relations per table (in the partitioning case), similar to how there
> can be multiple tuple versions for a single row. So that might back up
> that pg_table_size should include all relations that make up that
> table.

The barrier here is thin. What's proposed here is already doable with
a WITH RECURSIVE query, so why not just document that query and be
done with it instead of complicating the code? It seems to me that the
performance of calling pg_relation_size() in a cascading fashion would
not matter much. Or one could invent an additional cascading option
which scans inheritance and/or partition chains, or simply add a new
function.
-- 
Michael



Re: worker_spi example BGW code GUC tweak

2017-12-17 Thread Robert Haas
On Thu, Dec 14, 2017 at 5:41 PM, Chapman Flack  wrote:
> Would this sample code make an even better teaching example if it
> used the existing GUC way to declare that worker_spi.naptime is
> in units of seconds?
>
> Or does it not do that for some reason I've overlooked?

Making it use GUC_UNIT_S seems like a good idea to me, but removing
the mention of seconds from the description doesn't seem like a good
idea to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: es_query_dsa is broken

2017-12-17 Thread Thomas Munro
On Thu, Dec 7, 2017 at 12:51 PM, Thomas Munro
 wrote:
> On Wed, Dec 6, 2017 at 10:16 PM, Amit Kapila  wrote:
>> On Wed, Dec 6, 2017 at 1:14 AM, Robert Haas  wrote:
>>> On Tue, Dec 5, 2017 at 7:24 AM, Amit Kapila  wrote:
 + EState *estate = gatherstate->ps.state;
 +
 + /* Install our DSA area while executing the plan. */
 + estate->es_query_dsa = gatherstate->pei->area;
   outerTupleSlot = ExecProcNode(outerPlan);
 + estate->es_query_dsa = NULL;

 Won't the above coding pattern create a problem, if ExecProcNode
 throws an error and outer block catches it and continues execution
 (consider the case of execution inside PL blocks)?
>>>
>>> I don't see what the problem is.  The query that got aborted by the
>>> error wouldn't be sharing an EState with one that didn't.
>>
>> That's right.  Ignore my comment, I got confused.   Other than that, I
>> don't see any problem with the code as such apart from that it looks
>> slightly hacky.  I think Thomas or someone needs to develop a patch on
>> the lines you have mentioned or what Thomas was trying to describe in
>> his email and see how it comes out.

Andreas Seltenreich found a query[1] that suffers from this bug
(thanks!), and Amit confirmed that this patch fixes it, but further
sqlsmith fuzz testing with the patch applied also revealed that it
failed to handle the case where pei is NULL because parallelism was
inhibited.  Here's a new version to fix that.  Someone might prefer a
coding with "if" rather than the ternary operator, but I didn't have a
strong preference.

[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D2tfz1XNDcyU_a5ZiEaopz6%2BbBo9vQY%2BiJVLTtUVNztcQ%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


fix-es_query_dsa-pg10-v3.patch
Description: Binary data


fix-es_query_dsa-pg11-v3.patch
Description: Binary data


Re: worker_spi example BGW code GUC tweak

2017-12-17 Thread Chapman Flack
On 12/17/2017 07:32 PM, Robert Haas wrote:

> Making it use GUC_UNIT_S seems like a good idea to me, but removing
> the mention of seconds from the description doesn't seem like a good
> idea to me.

I took as my model a quick survey of existing GUCs that use
GUC_UNIT_(MS|S|MIN), most of which do not restate the time unit
in the text description. (It's not a shut-out; some do, but only
a handful.)

I think that makes sense, because once the GUC_UNIT_foo is
specified, you get output like:

select current_setting('worker_spi.naptime');
 current_setting
-----------------
 10s

and, if you set it to, say, 120s or 180min, it will be
displayed as 2min or 3h, etc., making 'seconds' in the text
description a little redundant in the best case (when the
current value is most naturally shown with s) and a little
goofy in the other cases, where the value would be displayed
with min, h, or d, and reading the value combined with the text
description makes the snarky little voice in your head go
"nap for 3 hours seconds??".

-Chap



Re: genomic locus

2017-12-17 Thread Robert Haas
On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov  wrote:
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.
>
> I noticed somebody took good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?

Have you thought about just using a composite type?
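A minimal sketch of the composite-type idea (all names here are
illustrative, not a worked-out design): pairing the sequence name with
a built-in int4range gets the basic interval logic for free via the
range operators:

```sql
-- Sketch: a genomic position as sequence name plus integer range.
-- int4range provides overlap (&&) and containment (@>) operators without
-- writing any C code, though it won't match seg's indexing behaviour.
CREATE TYPE locus AS (
    seqname text,       -- chromosome/contig name
    span    int4range   -- start/end positions
);

-- Usage: overlap test between two loci on the same sequence.
SELECT (a).seqname = (b).seqname AND (a).span && (b).span AS overlaps
  FROM (SELECT ROW('chr1', int4range(100, 200))::locus AS a,
               ROW('chr1', int4range(150, 300))::locus AS b) t;
-- overlaps = true
```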

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: worker_spi example BGW code GUC tweak

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 8:30 PM, Chapman Flack  wrote:
> On 12/17/2017 07:32 PM, Robert Haas wrote:
>> Making it use GUC_UNIT_S seems like a good idea to me, but removing
>> the mention of seconds from the description doesn't seem like a good
>> idea to me.
>
> I took for my model a quick survey of existing GUCs that use
> GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit
> in the text description. (It's not a shut-out; some do, but only
> a handful.)
>
> I think that makes sense, because once the GUC_UNIT_foo is
> specified, you get output like:
>
> select current_setting('worker_spi.naptime');
>  current_setting
> -
>  10s
>
> and, if you set it to, say, 120s or 180min, it will be displayed as
> 2min or 3h, etc., making 'seconds' in the text description a little
> redundant in the best case (when the current value is most naturally
> shown with s) and a little goofy in the other cases, where the value
> would be displayed with min, h, or d, and reading the value combined
> with the text description makes the snarky little voice in your head
> go "nap for 3 hours seconds??".

Well, you have a point, at that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 10:50 AM, Alvaro Herrera
 wrote:
>> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
>> think it might as well try to validate while it's at it.  But if not
>> then we might want to go with #2.
>
> The problem I have with it is that restoring a dump containing indexes
> on partitions becomes an O(N^2) deal, as it has to do the full check
> once for every index we attach.

Sure.  If the constant factor is high enough to matter, then VALIDATE
makes sense.

IMHO, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
 wrote:
>> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
>> think it might as well try to validate while it's at it.  But if not
>> then we might want to go with #2.
>
> I'm now not that clear on what the behaviour is if the ONLY keyword is
> not specified on the CREATE INDEX for the partitioned index. Does that
> go and create each leaf partition index regardless of if there is a
> suitable candidate to ATTACH?

No, the other way around.  ONLY is being proposed as a way to create
an initially-not-valid parent to which we can then ATTACH
subsequently-created child indexes.  But because we will have REPLACE
rather than DETACH, once you get the index valid it never goes back to
not-valid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-17 Thread Robert Haas
On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer  wrote:
> On 15 December 2017 at 09:24, Greg Stark  wrote:
>> Another simpler option would be to open up a new file in the log
>> directory
>
> ... if we have one.
>
> We might be logging to syslog, or whatever else.
>
> I'd rather keep it simple(ish).

+1.  I still think just printing it to the log is fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread David Rowley
On 18 December 2017 at 15:04, Robert Haas  wrote:
> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
>  wrote:
>> I'm now not that clear on what the behaviour is if the ONLY keyword is
>> not specified on the CREATE INDEX for the partitioned index. Does that
>> go and create each leaf partition index regardless of if there is a
>> suitable candidate to ATTACH?
>
> No, the other way around.  ONLY is being proposed as a way to create
> an initially-not-valid parent to which we can then ATTACH
> subsequently-created child indexes.  But because we will have REPLACE
> rather than DETACH, once you get the index valid it never goes back to
> not-valid.

I understand what the ONLY is proposed to do. My question was in
regards to the behaviour without ONLY.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Michael Paquier
On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier
 wrote:
> The barrier here is thin. What's proposed here is already doable with
> a WITH RECURSIVE query. So why not just documenting this query and be
> done with it instead of complicating the code? It seems to me that the
> performance in calling pg_relation_size() in a cascading times fashion
> would not matter much. Or one could invent an additional cascading
> option which scans inheritance and/or partition chains, or simply have
> a new function.

I just blogged on the matter, and here is one possibility here
compatible with v10:
WITH RECURSIVE partition_info
  (relid, relname, relsize, relispartition, relkind) AS (
    SELECT oid AS relid,
           relname,
           pg_relation_size(oid) AS relsize,
           relispartition,
           relkind
      FROM pg_catalog.pg_class
     WHERE relname = 'parent_name' AND
           relkind = 'p'
  UNION ALL
    SELECT c.oid AS relid,
           c.relname AS relname,
           pg_relation_size(c.oid) AS relsize,
           c.relispartition AS relispartition,
           c.relkind AS relkind
      FROM partition_info AS p,
           pg_catalog.pg_inherits AS i,
           pg_catalog.pg_class AS c
     WHERE p.relid = i.inhparent AND
           c.oid = i.inhrelid AND
           c.relispartition
  )
SELECT * FROM partition_info;

This is not really straightforward. You could as well have the
pg_relation_size() call in the outer query.
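If a single inclusive size figure is wanted, the final SELECT of the
query above can simply be replaced by an aggregate, for instance:

```sql
-- Replaces "SELECT * FROM partition_info;" in the recursive query above:
-- total on-disk size of the parent and all of its partitions.
SELECT pg_size_pretty(sum(relsize)) AS total_size
  FROM partition_info;
```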
-- 
Michael



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 9:38 PM, David Rowley
 wrote:
> On 18 December 2017 at 15:04, Robert Haas  wrote:
>> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
>>  wrote:
>>> I'm now not that clear on what the behaviour is if the ONLY keyword is
>>> not specified on the CREATE INDEX for the partitioned index. Does that
>>> go and create each leaf partition index regardless of if there is a
>>> suitable candidate to ATTACH?
>>
>> No, the other way around.  ONLY is being proposed as a way to create
>> an initially-not-valid parent to which we can then ATTACH
>> subsequently-created child indexes.  But because we will have REPLACE
>> rather than DETACH, once you get the index valid it never goes back to
>> not-valid.
>
> I understand what the ONLY is proposed to do. My question was in
> regards to the behaviour without ONLY.

Oh, sorry -- I was confused.  I'm not sure whether that should try to
attach to something if it exists, or just create unconditionally...
what do you think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread David Rowley
On 18 December 2017 at 15:59, Robert Haas  wrote:
> On Sun, Dec 17, 2017 at 9:38 PM, David Rowley
>  wrote:
>> On 18 December 2017 at 15:04, Robert Haas  wrote:
>>> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
>>>  wrote:
 I'm now not that clear on what the behaviour is if the ONLY keyword is
 not specified on the CREATE INDEX for the partitioned index. Does that
 go and create each leaf partition index regardless of if there is a
 suitable candidate to ATTACH?
>>>
>>> No, the other way around.  ONLY is being proposed as a way to create
>>> an initially-not-valid parent to which we can then ATTACH
>>> subsequently-created child indexes.  But because we will have REPLACE
>>> rather than DETACH, once you get the index valid it never goes back to
>>> not-valid.
>>
>> I understand what the ONLY is proposed to do. My question was in
>> regards to the behaviour without ONLY.
>
> Oh, sorry -- I was confused.  I'm not sure whether that should try to
> attach to something if it exists, or just create unconditionally...
> what do you think?

I think you feel quite strongly about not having the code select a
random matching index, so if we want to stick to that rule, then we'll
need to create a set of new leaf indexes rather than select a random
one.

If we don't do that, then we might as well go with my idea and ditch
the ONLY syntax and have the CREATE INDEX try to find a suitable leaf
index, or create a new leaf index if one cannot be found.

Would it be surprising to users if CREATE INDEX ON partitioned_table
created a bunch of duplicate new indexes on the leaf tables?
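For reference, the workflow being debated looks roughly like this (syntax
sketch only; the commands are still under discussion and may change):

```sql
-- Proposed pg_dump-friendly sequence for partitioned indexes:
CREATE INDEX idx ON ONLY parent_table (col);  -- parent index, initially not valid
CREATE INDEX idx_p1 ON part1 (col);           -- per-partition leaf index
ALTER INDEX idx ATTACH PARTITION idx_p1;      -- marked valid once every
                                              -- leaf partition is covered
```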

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 10:05 PM, David Rowley
 wrote:
> I think you feel quite strongly about not having the code select a
> random matching index, so if we want to stick to that rule, then we'll
> need to create a set of new leaf indexes rather than select a random
> one.

I feel quite strongly about it *in the context of pg_dump*.  Outside
of that scenario, I'm happy to go with whatever behavior you and
others think will satisfy the POLA.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-12-17 Thread Robert Haas
On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada  wrote:
> +1 from me.

Works for me, too, although I still don't really follow how it's
happening in the present coding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-12-17 Thread Michael Paquier
On Mon, Dec 18, 2017 at 12:14 PM, Robert Haas  wrote:
> On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada  
> wrote:
>> +1 from me.
>
> Works for me, too, although I still don't really follow how it's
> happening in the present coding.

Craig has mentioned at least one way upthread:
https://www.postgresql.org/message-id/CAMsr+YHj4asAu8dJ1Q5hiW+wxL2Jd1=uxo52dvxhmxive-x...@mail.gmail.com
And that's possible at least when building with LOCK_DEBUG and
trace_lwlocks enabled. So I would rather not rely on assuming that
CHECK_FOR_INTERRUPTS() is a code path never taken for the cases
discussed here.
-- 
Michael



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-12-17 Thread Masahiko Sawada
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas  wrote:
> On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada  
> wrote:
>> Here is the result.
>> I've measured the throughput for several cases on my virtual machine.
>> Each client loads a 48k file into its own separate relation located on
>> either an xfs filesystem or an ext4 filesystem, for 30 sec.
>>
>> Case 1: COPYs to relations on different filesystems (xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1024
>>
>> clients = 2, avg = 296.2068
>> clients = 5, avg = 372.0707
>> clients = 10, avg = 389.8850
>> clients = 50, avg = 428.8050
>>
>> Case 2: COPYs to relations on different filesystems (xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1
>>
>> clients = 2, avg = 294.3633
>> clients = 5, avg = 358.9364
>> clients = 10, avg = 383.6945
>> clients = 50, avg = 424.3687
>>
>> And the result of current HEAD is following.
>>
>> clients = 2, avg = 284.9976
>> clients = 5, avg = 356.1726
>> clients = 10, avg = 375.9856
>> clients = 50, avg = 429.5745
>>
>> In case 2, the throughput decreased compared to case 1, but it seems
>> to be almost the same as current HEAD. Because acquiring and releasing
>> the extension lock got 10x faster than current HEAD, as I mentioned
>> before, the performance degradation may be smaller than I expected
>> even in case 2.
>> Since my machine doesn't have enough resources, the result for
>> clients = 50 might not be valid.
>
> I have to admit that result is surprising to me.
>

I think the environment I used for performance measurement did not
have enough resources. I will run the same benchmark on another
environment to see whether this was a valid result, and will share it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Amit Langote
Thanks all for your thoughts.

I agree with Robert's point, which both David and Michael seem to agree
with, that we shouldn't really be changing what pg_relation_size() is doing
under the covers.  And I guess the same goes for pg_table_size(), too.  Both of
those functions and their siblings work with relations that possess
on-disk structures and have associated relations (TOAST, indexes) that in
turn possess on-disk structures.  It seems quite clearly documented as
such.  Partitioned tables are different in that they neither possess
on-disk structures nor have any relations (TOAST, indexes) associated
directly with them.  Instead, they have partitions that are the relations
that aforementioned dbsize.c functions are familiar with.

So, I withdraw the patch I originally posted in favor of some other approach.

Reply continues below...

On 2017/12/18 11:51, Michael Paquier wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier
>  wrote:
>> The barrier here is thin. What's proposed here is already doable with
>> a WITH RECURSIVE query. So why not just document this query and be
>> done with it instead of complicating the code? It seems to me that the
>> performance cost of calling pg_relation_size() multiple times in a
>> cascading fashion would not matter much. Or one could invent an additional
>> cascading option which scans inheritance and/or partition chains, or simply
>> have a new function.
> 
> I just blogged on the matter, and here is one possibility here
> compatible with v10:
> WITH RECURSIVE partition_info
>   (relid,
>relname,
>relsize,
>relispartition,
>relkind) AS (
> SELECT oid AS relid,
>relname,
>pg_relation_size(oid) AS relsize,
>relispartition,
>relkind
> FROM pg_catalog.pg_class
> WHERE relname = 'parent_name' AND
>   relkind = 'p'
>   UNION ALL
> SELECT
>  c.oid AS relid,
>  c.relname AS relname,
>  pg_relation_size(c.oid) AS relsize,
>  c.relispartition AS relispartition,
>  c.relkind AS relkind
> FROM partition_info AS p,
>  pg_catalog.pg_inherits AS i,
>  pg_catalog.pg_class AS c
> WHERE p.relid = i.inhparent AND
>  c.oid = i.inhrelid AND
>  c.relispartition
>   )
> SELECT * FROM partition_info;
> 
> This is not really straightforward. You could as well have the
> pg_relation_size call in the outer query.

Thanks Michael for coming up with that.

Do you (and/or others) think that's something that we can wrap inside a
built-in function(s), that is, one defined in system_views.sql?  Or if we
decide to have new functions, say, pg_get_partitions() and/or
pg_get_partition_sizes(), we might as well implement them as C functions
inside dbsize.c.  If so, do we want to have "partition" variants of
all *_size() functions viz. pg_relation_size(), pg_total_relation_size(),
pg_indexes_size(), and pg_table_size()?
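As a strawman, a thin SQL-language wrapper over that recursive query could
be defined in system_views.sql; the function name here is purely a
hypothetical placeholder:

```sql
-- Hypothetical helper: total on-disk size of a table and all of the
-- relations below it in the inheritance/partition tree.
CREATE FUNCTION pg_partition_tree_size(parent regclass)
RETURNS bigint
LANGUAGE sql STABLE
AS $$
  WITH RECURSIVE partition_info (relid) AS (
      SELECT parent::oid
    UNION ALL
      SELECT i.inhrelid
      FROM partition_info AS p
        JOIN pg_catalog.pg_inherits AS i ON p.relid = i.inhparent
  )
  SELECT sum(pg_relation_size(relid))::bigint FROM partition_info;
$$;
```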

Thanks,
Amit




Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Michael Paquier
On Mon, Dec 18, 2017 at 2:17 PM, Amit Langote
 wrote:
> Do you (and/or others) think that's something that we can wrap inside a
> built-in function(s), that is, one defined in system_views.sql?  Or if we
> decide to have new functions, say, pg_get_partitions() and/or
> pg_get_partition_sizes(), we might as well implement them as C functions
> inside dbsize.c.  If so, do we have want to have "partition" variants of
> all *_size() functions viz. pg_relation_size(), pg_total_relation_size(),
> pg_indexes_size(), and pg_table_size()?

I can see value in something like what Robert is proposing upthread:
a function one is able to customize to decide what to include
in the calculation. There could be options for at least:
- partitions, or relation cascading.
- index.
- toast tables.
- visibility maps.
- FSM.
The first three are what Robert mentioned; still, I would see the
last two as interesting things to have as options. Or we could have
just an SRF that returns one row per object scanned.
-- 
Michael



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-17 Thread Rushabh Lathia
On Sat, Dec 16, 2017 at 12:40 AM, Robert Haas  wrote:

> On Fri, Dec 15, 2017 at 12:40 PM, Peter Eisentraut
>  wrote:
> > On 12/14/17 22:59, Rushabh Lathia wrote:
> >> I noted that no_priv_msg and not_owner_msg array been removed
> >> and code fitted the code into aclcheck_error().  Actually that
> >> makes the code more complex then what it used to be.  I would
> >> prefer the array rather then code been fitted into the function.
> >
> > There is an argument for having a big array versus the switch/case
> > approach.  But most existing code around object addresses uses the
> > switch/case approach, so it's better to align it that way, I think.
> > It's weird to have to maintain two different styles.
>
>
The only motivation is that the earlier approach looks cleaner. Also, the
patch is getting bigger - so continuing with the old approach would make
review easier. In case the switch/case approach is the way to go, it could
be done as part of a separate cleanup patch.

Thanks,
Rushabh Lathia


Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread David Rowley
On 17 December 2017 at 16:22, Robert Haas  wrote:
> On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera  
> wrote:
>> We have two options for marking valid:
>>
>> 1. after each ALTER INDEX ATTACH, verify whether the set of partitions
>> that contain the index is complete; if so, mark it valid, otherwise do
>> nothing.  This sucks because we have to check that over and over for
>> every index that we attach
>>
>> 2. We invent yet another command, say
>> ALTER INDEX  VALIDATE

It's not perfect that we need to validate each time, but validation
might not be that expensive, since we only really need to count the
pg_index rows with indisvalid = true whose parent index is the index
we're ATTACHing to, then ensure that count matches the number of leaf
partitions.

> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
> think it might as well try to validate while it's at it.  But if not
> then we might want to go with #2.

I'm now not that clear on what the behaviour is if the ONLY keyword is
not specified on the CREATE INDEX for the partitioned index. Does that
go and create each leaf partition index regardless of if there is a
suitable candidate to ATTACH?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: GSoC 2018

2017-12-17 Thread Alvaro Hernandez



On 15/12/17 13:57, Aleksander Alekseev wrote:

Hi Stephen,


HA/fail-over is a very broad topic, with a lot of pieces that need to be
done such that I'm not sure it's really viable, but perhaps a precursor
project (synchronous logical replication seems like a prereq, no?) would
make more sense.  Or, perhaps, a different piece of the HA question, but
solving the whole thing in a summer strikes me as unlikely to be
reasonable.

[...]




    I agree with most of the comments down-thread that this project 
seems to undertake much more work than what a GSoC project could be. I'd 
like to chime in with a few other comments regarding other aspects the 
proposal:


- I believe we're mixing up what HA and replication are. HA is about 
promoting a secondary node when the primary fails, preventing two 
primaries at the same time, ensuring one primary is always up, and so 
forth. Replication is just a data replication mechanism, and that 
*helps* with HA, but it is neither HA nor a necessary prerequisite 
(actually, you could build an HA solution based on shared disk). Indeed, 
HA can be built on top of SR, logical replication (with many caveats, as 
has already been commented), disk replication techniques outside of PG, 
and probably others. The topic is way broader.


- If the project's goal is to improve on some of logical replication's 
shortcomings, which would be required to make it a useful replication 
technique on which HA solutions could be built, my +1 for that.


- If the project's goal is to augment an existing HA solution to 
abstract away data replication techniques (so that SR is not the only 
option, but also logical or disk-based replication or whatever) from the 
core algorithm, I believe this is also a very sensible GSoC project.


- To add a proxy layer to any existing HA project is IMO a project on 
its own. I don't even see this as a feasible GSoC project.



    Cheers,

    Álvaro


--

Alvaro Hernandez


---
OnGres




Small typo in comment in json_agg_transfn

2017-12-17 Thread David Rowley
The attached patch fixes a small typo in json_agg_transfn.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


json_agg_transfn_comment_fix.patch
Description: Binary data


Parallel Aggregates for string_agg and array_agg

2017-12-17 Thread David Rowley
Hi,

While working on partial aggregation a few years ago, I didn't really
think it was worthwhile allowing partial aggregation of string_agg and
array_agg. I soon realised that I was wrong about that, and that allowing
parallelisation of these aggregates could still be very useful when
many rows are filtered out during the scan.

Some benchmarks that I've done locally show that parallel string_agg
and array_agg do actually perform better, despite the fact that the
aggregate state grows linearly with each aggregated item. Obviously,
the performance will get even better when workers are filtering out
rows before aggregation takes place, so it seems worthwhile doing
this. However, the main reason that I'm motivated to do this is that
there are more uses for partial aggregation other than just parallel
aggregation, and it seems a shame to disable all these features if a
single aggregate does not support partial mode.

I've attached a patch which implements all this. I've had most of it
stashed away for a while now, but I managed to get some time this
weekend to get it into a more completed state.

Things are now looking pretty good for the number of aggregates that
support partial mode.

Just a handful of aggregates now don't support partial aggregation;

postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and
aggkind='n';
 aggfnoid
--
 xmlagg
 json_agg
 json_object_agg
 jsonb_agg
 jsonb_object_agg
(5 rows)

... and a good number do support it;

postgres=# select count(*) from pg_aggregate where aggcombinefn<>0 and
aggkind='n';
 count
---
   122
(1 row)

There's probably no reason why the last 5 of those couldn't be done
either; it might just require shifting a bit more work into the final
functions. However, I'm not planning on that for this patch.

As for the patch; there's a bit of a quirk in the implementation of
string_agg. We previously always threw away the delimiter that belongs
to the first aggregated value, but we do now need to keep that around
so we can put it in between two states in the combine function. I
decided the path of least resistance to do this was just to borrow
StringInfo's cursor variable to use as a pointer to the state of the
first value and put the first delimiter before that. Both the
string_agg(text) and string_agg(bytea) already have a final function,
so we just need to skip over the bytes up until the cursor position to
get rid of the first delimiter. I could go and invent some new state
type to do the same, but I don't really see the trouble with what I've
done with StringInfo; I'll certainly listen if someone else thinks
this is wrong.
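To illustrate just the intended semantics (this is not the C
implementation): each partial state keeps its own leading delimiter, the
combine step is plain concatenation, and the final function skips the bytes
before the first value, i.e. up to the cursor:

```sql
-- Two hypothetical partial states for string_agg(v, ',') that each
-- retained their leading delimiter; substr() plays the role of the
-- final function stripping the first delimiter.
SELECT substr(',a,b' || ',c,d', 2) AS final_result;  -- yields a,b,c,d
```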

Another thing that I might review later about this is seeing about
getting rid of some of the code duplication between
array_agg_array_combine and accumArrayResultArr.

I'm going to add this to PG11's final commitfest rather than the
January 'fest as it seems more like a final commitfest type of patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


combinefn_for_string_and_array_aggs_v1.patch
Description: Binary data


Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread David Rowley
On 17 December 2017 at 16:24, Robert Haas  wrote:
> On Thu, Dec 14, 2017 at 12:23 AM, Amit Langote
>  wrote:
>> You may have guessed from $subject that the two don't work together.
>
> It works exactly as documented:
>
> pg_total_relation_size(regclass) - Total disk space used by the
> specified table, including all indexes and TOAST data
>
> It says nothing about including partitions.  If we change this, then
> we certainly need to update the documentation (that might be a good
> idea if we decide not to update this).
>
> Personally, I'm -1 on including partitions, because then you can no
> longer expect that the sum of pg_total_relation_size(regclass) across
> all relations in the database will equal the size of the database
> itself.  Partitions will be counted a number of times equal to their
> depth in the partitioning hierarchy.  However, I understand that I
> might get outvoted.

I'd also vote to leave the relation_size functions alone.

Perhaps it's worth thinking of changing pg_table_size() instead. We
have taken measures to try and hide the fact that a table is made up
of a bunch of partitions from the user in some cases, e.g. DROP TABLE
works without CASCADE for a partitioned table. I'm sure there are
arguments in both directions here too though.

I generally think of the difference between a relation and a table
similarly to the difference between a tuple and a row. A relation is
just what we use to implement tables, and there can be multiple
relations per table (in the partitioning case), similar to how there
can be multiple tuple versions for a single row. So that might back up
that pg_table_size should include all relations that make up that
table.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why does array_position_common bitwise NOT an Oid type?

2017-12-17 Thread David Rowley
On 17 December 2017 at 14:53, Tom Lane  wrote:
> David Rowley  writes:
>> I was puzzled to see the following code:
>
>> my_extra->element_type = ~element_type;
>
> If memory serves, the idea was to force the subsequent datatype-lookup
> path to be taken, even if for some reason element_type is InvalidOid.
> If we take the lookup path then the bogus element_type will be detected
> and reported; if we don't, it won't be.

That makes sense. I'd just have expected more documentation on that.
Although, perhaps I just didn't look hard enough. I did fail to notice
that the same pattern occurs in several other places when I sent this
originally.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera  
> wrote:
> > We have two options for marking valid:
> >
> > 1. after each ALTER INDEX ATTACH, verify whether the set of partitions
> > that contain the index is complete; if so, mark it valid, otherwise do
> > nothing.  This sucks because we have to check that over and over for
> > every index that we attach
> >
> > 2. We invent yet another command, say
> > ALTER INDEX  VALIDATE
> 
> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
> think it might as well try to validate while it's at it.  But if not
> then we might want to go with #2.

The problem I have with it is that restoring a dump containing indexes
on partitions becomes an O(N^2) deal, as it has to do the full check once
for every index we attach.
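For what it's worth, assuming the parent/child index relationship ends up
being recorded in pg_inherits (the catalog representation is still to be
decided, so this is purely a hypothetical sketch), each individual
completeness check would amount to little more than comparing two counts:

```sql
-- Hypothetical sketch: is every leaf partition covered by a valid
-- child index attached to this parent index?
SELECT (SELECT count(*)
        FROM pg_inherits AS pi
        JOIN pg_index AS i ON i.indexrelid = pi.inhrelid
        WHERE pi.inhparent = 'parent_index'::regclass
          AND i.indisvalid)
     = (SELECT count(*)
        FROM pg_inherits
        WHERE inhparent = 'parent_table'::regclass) AS index_is_complete;
```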

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Top-N sorts verses parallelism

2017-12-17 Thread Robert Haas
On Fri, Dec 15, 2017 at 4:13 PM, Thomas Munro
 wrote:
> Looks right to me.  Commit 3452dc52 just forgot to tell the planner.
> I'm pleased about that because it makes this a slam-dunk bug-fix and
> not some confusing hard to justify costing problem.

Jeff Janes inquired off-list about other places where cost_sort() gets called.

In costsize.c, it is called from initial_cost_mergejoin, which seems
like it does not present an opportunity for pushing down limits, since
we don't know how many rows we'll have to join to get a certain number
of outputs.

In createplan.c, it is called only from label_sort_with_costsize().
That in turn is called from create_merge_append_plan(), passing the tuple
limit from the path being converted to a plan, which seems
unimpeachable.  It's also called from make_unique_plan() and
create_mergejoin_plan() with -1, which seems OK since in neither case
do we know how many input rows we need to read.

In planner.c, it's called from plan_cluster_use_sort() with -1;
CLUSTER has to read the whole input, so that's fine.

In prepunion.c, it's called from choose_hashed_setop() with -1.  I
think set operations also need to read the whole input.

In pathnode.c, it's called from create_merge_append_path,
create_unique_path, create_gather_merge_path, create_groupingset_path,
and create_sort_path.  create_merge_append_path passes any limit
applicable to the subpath.  create_unique_path passes -1.
create_gather_merge_path also passes -1, which as Jeff also pointed
out seems possibly wrong.  create_sort_path also passes -1, and so
does create_groupingsets_path.

I went through the callers to create_sort_path and the only one that
looks like it can pass a limit is the one you and Jeff already
identified.  So I think the question is just whether
create_gather_merge_path needs a similar fix.
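For context, the affected plan shape is a parallel top-N sort; in a query
like the following (table name hypothetical), pushing the limit down lets
each worker run a cheap bounded heap sort rather than a full sort before
Gather Merge:

```sql
EXPLAIN (COSTS OFF)
SELECT * FROM big_table ORDER BY x LIMIT 10;
```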

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Backfill bgworker Extension?

2017-12-17 Thread Jeremy Finzel
On Sat, Dec 16, 2017 at 8:31 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 12/15/17 23:50, Jeremy Finzel wrote:
> > The common ground is some column in some table needs to be bulk updated.
> > I may not be explaining well, but in our environment we have done
> > hundreds of these using a generic framework to build a backfill. So I’m
> > not sure what you are questioning about the need? We have had to build a
> > worker to accomplish this because it can’t be done as a sql script alone.
>
> I'm trying to identify the independently useful pieces in your use case.
>  A background worker to backfill large tables is a very specific use
> case.  If instead we had a job/scheduler mechanism and a way to have
> server-side scripts that can control transactions, then that might
> satisfy your requirements as well (I'm not sure), but it would also
> potentially address many other uses.


I follow you now.  Yes, I think it probably would.  I think it would at
least provide a framework on which to build the tool I want.  It would be
great to have a "worker-capable" tool inside postgres rather than always
having to write external logic to do things like this.

> > I’m not sure what you mean by a stored procedure in the background.
> > Since it would not be a single transaction, it doesn’t fit as a stored
> > procedure at least in Postgres when a function is 1 transaction.
>
> In progress: https://commitfest.postgresql.org/16/1360/


Looking forward to this.  I think this will help.  A stored procedure with
subtransactions could have the logic for the backfill in it, but would
still need an external worker that could retry it in case of failure,
especially when things like a server restart happen.

Thanks,
Jeremy


Re: [HACKERS] Something for the TODO list: deprecating abstime and friends

2017-12-17 Thread David Fetter
On Wed, Dec 13, 2017 at 12:05:06AM -0800, Andres Freund wrote:
> On 2017-07-15 18:00:34 -0400, Tom Lane wrote:
> > * contrib/spi/timetravel depends on abstime columns to represent what
> > would nowadays be better done as a tstzrange.  I'd have thought we
> > could maybe toss that example module overboard, but we just today got
> > a question about its usage, so I'm afraid it's not quite dead yet.
> > What shall we do with that?
> 
> Looking at the code I'd be pretty strongly inclined to scrap it.
> 
> 
> Before I'd discovered this thread, I'd started to write up a
> patch. Attached. It's clearly not fully done. Questions I'd while
> hacking things up:
> - what to do with contrib/spi/timetravel - I'd just removed it from the
>   relevant Makefile for now.

Porting it to tstzrange seems friendlier, however:

- Does this port look like significant work?
- How about the work involved with upgrading existing installations?

> - nabstime.c currently implements timeofday() which imo is a pretty
>   weird function. I'd be quite inclined to remove it at the same time as
>   this.

+1 for binning it.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [sqlsmith] Parallel worker executor crash on master

2017-12-17 Thread Thomas Munro
On Sun, Dec 17, 2017 at 12:26 PM, Andreas Seltenreich
 wrote:
> Thomas Munro writes:
>
>> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich
>>  wrote:
>>> Core was generated by `postgres: smith regression [local] SELECT
>>> '.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
>>> 283 estate->es_query_dsa = gatherstate->pei->area;
>>> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
>>
>> Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is
>> NULL, or if it's somehow pointing to garbage?
>
> It was NULL on all the coredumps I looked into.  Below[1] is a full
> gatherstate.
>
>> Not sure how either of those things could happen, since we only set it
>> to NULL in ExecShutdownGather() after which point we shouldn't call
>> ExecGather() again, and any MemoryContext problems with pei should
>> have caused problems already without this patch (for example in
>> ExecParallelCleanup).  Clearly I'm missing something.
>
> FWIW, all backtraces collected so far are identical for the first nine
> frames.  After ExecProjectSet, they are pretty random executor innards.

Oh, right.  We only create pei if (gather->num_workers > 0 &&
estate->es_use_parallel_mode), so I need to teach that patch that pei
may be NULL.  I'll go and post a new version like that over on the
other thread.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Protect syscache from bloating with negative cache entries

2017-12-17 Thread Robert Haas
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund  wrote:
>> I'm not sure we should regard very quick bloating as a problem in need
>> of solving.  Doesn't that just mean we need the cache to be bigger, at
>> least temporarily?
>
> Leaving that aside, is that actually not at least to a good degree,
> solved by that problem? By bumping the generation on hash resize, we
> have recency information we can take into account.

I agree that we can do it.  I'm just not totally sure it's a good
idea.  I'm also not totally sure it's a bad idea, either.  That's why
I asked the question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 9:54 AM, David Rowley
 wrote:
> I'd also vote to leave the relation_size functions alone.
>
> Perhaps it's worth thinking of changing pg_table_size() instead. We
> have taken measures to try and hide the fact that a table is made up
> of a bunch of partitions from the user in some cases, e.g DROP TABLE
> works without CASCADE for a partitioned table. I'm sure there are
> arguments in both directions here too though.

Yeah, I don't really understand why changing pg_table_size() is any
more desirable than changing pg_relation_size().

I mean, we could have a table-size function that takes an array of
things you want to include (indexes, toast, partitions, etc), but
changing the semantics of existing functions seems like it's going to
be more painful than helpful (aside from the arguments I brought up
before).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Michael Paquier
On Sun, Dec 17, 2017 at 11:54 PM, David Rowley
 wrote:
> I'd also vote to leave the relation_size functions alone.

Count me in that bucket as well.

> Perhaps it's worth thinking of changing pg_table_size() instead. We
> have taken measures to try and hide the fact that a table is made up
> of a bunch of partitions from the user in some cases, e.g DROP TABLE
> works without CASCADE for a partitioned table. I'm sure there are
> arguments in both directions here too though.
>
> I generally think of the difference between a relation and a table
> similarly to the difference between a tuple and a row. A relation is
> just what we use to implement tables, and there can be multiple
> relations per table (in the partitioning case), similar to how there
> can be multiple tuple versions for a single row. So that might back up
> that pg_table_size should include all relations that make up that
> table.

The barrier here is thin. What's proposed here is already doable with
a WITH RECURSIVE query. So why not just document this query and be
done with it instead of complicating the code? It seems to me that the
performance cost of calling pg_relation_size() multiple times in a
cascading fashion would not matter much. Or one could invent an additional
cascading option which scans inheritance and/or partition chains, or simply
have a new function.
-- 
Michael



Re: worker_spi example BGW code GUC tweak

2017-12-17 Thread Robert Haas
On Thu, Dec 14, 2017 at 5:41 PM, Chapman Flack  wrote:
> Would this sample code make an even better teaching example if it
> used the existing GUC way to declare that worker_spi.naptime is
> in units of seconds?
>
> Or does it not do that for some reason I've overlooked?

Making it use GUC_UNIT_S seems like a good idea to me, but removing
the mention of seconds from the description doesn't seem like a good
idea to me.
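With GUC_UNIT_S set on the GUC, the setting would additionally accept time
units on input, along the lines of (hypothetical session, assuming the flag
is added to worker_spi's DefineCustomIntVariable call):

```sql
SET worker_spi.naptime = '2min';  -- accepted and normalized to seconds
SHOW worker_spi.naptime;
```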

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: es_query_dsa is broken

2017-12-17 Thread Thomas Munro
On Thu, Dec 7, 2017 at 12:51 PM, Thomas Munro
 wrote:
> On Wed, Dec 6, 2017 at 10:16 PM, Amit Kapila  wrote:
>> On Wed, Dec 6, 2017 at 1:14 AM, Robert Haas  wrote:
>>> On Tue, Dec 5, 2017 at 7:24 AM, Amit Kapila  wrote:
 + EState *estate = gatherstate->ps.state;
 +
 + /* Install our DSA area while executing the plan. */
 + estate->es_query_dsa = gatherstate->pei->area;
   outerTupleSlot = ExecProcNode(outerPlan);
 + estate->es_query_dsa = NULL;

 Won't the above coding pattern create a problem, if ExecProcNode
 throws an error and outer block catches it and continues execution
 (consider the case of execution inside PL blocks)?
>>>
>>> I don't see what the problem is.  The query that got aborted by the
>>> error wouldn't be sharing an EState with one that didn't.
>>
>> That's right.  Ignore my comment, I got confused.   Other than that, I
>> don't see any problem with the code as such apart from that it looks
>> slightly hacky.  I think Thomas or someone needs to develop a patch on
>> the lines you have mentioned or what Thomas was trying to describe in
>> his email and see how it comes out.

Andreas Seltenreich found a query[1] that suffers from this bug
(thanks!), and Amit confirmed that this patch fixes it, but further
sqlsmith fuzz testing with the patch applied also revealed that it
failed to handle the case where pei is NULL because parallelism was
inhibited.  Here's a new version to fix that.  Someone might prefer
coding it with "if" rather than the ternary operator, but I didn't
have a strong preference.

[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D2tfz1XNDcyU_a5ZiEaopz6%2BbBo9vQY%2BiJVLTtUVNztcQ%40mail.gmail.com
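For readers following along, the install/restore pattern under discussion can be sketched as a Python analogue (illustrative only — the real change is C inside the executor, and the names `EState`, `ParallelExecutorInfo`, and `run_with_query_dsa` here are stand-ins). It shows the NULL-`pei` case handled with a conditional expression, and how a restore-on-exit keeps the state consistent regardless of the error path Amit asked about:

```python
class EState:
    """Stand-in for the executor state; only the field we care about."""
    def __init__(self):
        self.es_query_dsa = None

class ParallelExecutorInfo:
    """Stand-in for the parallel executor info holding the DSA area."""
    def __init__(self, area):
        self.area = area

def run_with_query_dsa(estate, pei, exec_proc_node):
    # Install our DSA area while executing the plan.  pei may be None
    # when parallelism was inhibited, hence the conditional expression
    # (the "ternary operator" mentioned above).
    saved = estate.es_query_dsa
    estate.es_query_dsa = pei.area if pei is not None else None
    try:
        return exec_proc_node()
    finally:
        # Restore unconditionally so the state stays consistent even if
        # exec_proc_node raises.
        estate.es_query_dsa = saved
```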

-- 
Thomas Munro
http://www.enterprisedb.com


fix-es_query_dsa-pg10-v3.patch
Description: Binary data


fix-es_query_dsa-pg11-v3.patch
Description: Binary data


Re: worker_spi example BGW code GUC tweak

2017-12-17 Thread Chapman Flack
On 12/17/2017 07:32 PM, Robert Haas wrote:

> Making it use GUC_UNIT_S seems like a good idea to me, but removing
> the mention of seconds from the description doesn't seem like a good
> idea to me.

I took for my model a quick survey of existing GUCs that use
GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit
in the text description. (It's not a shut-out; some do, but only
a handful.)

I think that makes sense, because once the GUC_UNIT_foo is
specified, you get output like:

select current_setting('worker_spi.naptime');
 current_setting
-----------------
 10s

and, if you set it to, say, 120s or 180min, it will be
displayed as 2min or 3h, etc., making 'seconds' in the text
description a little redundant in the best case—when the
current value is most naturally shown with s—and a little
goofy in the other cases, where the value would be displayed
with min, h, or d, and reading the value combined with the text
description makes the snarky little voice in your head go
"nap for 3 hours seconds??".
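The display behaviour described above can be modelled with a short Python sketch — a simplified, assumed model, not PostgreSQL's actual conversion code in guc.c — of a seconds-based GUC rendered in the largest time unit that divides it evenly:

```python
# Simplified model of how an integer GUC flagged GUC_UNIT_S is shown:
# pick the largest time unit that divides the stored value exactly.
UNITS = [("d", 86400), ("h", 3600), ("min", 60), ("s", 1)]

def show_seconds_guc(seconds):
    if seconds == 0:
        return "0"
    for suffix, factor in UNITS:
        if seconds % factor == 0:
            return f"{seconds // factor}{suffix}"

print(show_seconds_guc(10))      # 10s
print(show_seconds_guc(120))     # 2min
print(show_seconds_guc(10800))   # 3h
```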

-Chap



Re: genomic locus

2017-12-17 Thread Robert Haas
On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov  wrote:
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.
>
> I noticed somebody took good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?

Have you thought about just using a composite type?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: worker_spi example BGW code GUC tweak

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 8:30 PM, Chapman Flack  wrote:
> On 12/17/2017 07:32 PM, Robert Haas wrote:
>> Making it use GUC_UNIT_S seems like a good idea to me, but removing
>> the mention of seconds from the description doesn't seem like a good
>> idea to me.
>
> I took for my model a quick survey of existing GUCs that use
> GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit
> in the text description. (It's not a shut-out; some do, but only
> a handful.)
>
> I think that makes sense, because once the GUC_UNIT_foo is
> specified, you get output like:
>
> select current_setting('worker_spi.naptime');
>  current_setting
> -----------------
>  10s
>
> and, if you set it to, say, 120s or 180min, it will be
> displayed as 2min or 3h, etc., making 'seconds' in the text
> description a little redundant in the best case—when the
> current value is most naturally shown with s—and a little
> goofy in the other cases, where the value would be displayed
> with min, h, or d, and reading the value combined with the text
> description makes the snarky little voice in your head go
> "nap for 3 hours seconds??".

Well, you have a point, at that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 10:50 AM, Alvaro Herrera
 wrote:
>> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
>> think it might as well try to validate while it's at it.  But if not
>> then we might want to go with #2.
>
> The problem I have with it is that restoring a dump containing indexes
> on partitions becomes a O(N^2) deal as it has to do the full check once
> for every index we attach.

Sure.  If the constant factor is high enough to matter, then VALIDATE
makes sense.

IMHO, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
 wrote:
>> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
>> think it might as well try to validate while it's at it.  But if not
>> then we might want to go with #2.
>
> I'm now not that clear on what the behaviour is if the ONLY keyword is
> not specified on the CREATE INDEX for the partitioned index. Does that
> go and create each leaf partition index regardless of if there is a
> suitable candidate to ATTACH?

No, the other way around.  ONLY is being proposed as a way to create
an initially-not-valid parent to which we can then ATTACH
subsequently-created child indexes.  But because we will have REPLACE
rather than DETACH, once you get the index valid it never goes back to
not-valid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-17 Thread Robert Haas
On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer  wrote:
> On 15 December 2017 at 09:24, Greg Stark  wrote:
>> Another simpler option would be to open up a new file in the log
>> directory
>
> ... if we have one.
>
> We might be logging to syslog, or whatever else.
>
> I'd rather keep it simple(ish).

+1.  I still think just printing it to the log is fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread David Rowley
On 18 December 2017 at 15:04, Robert Haas  wrote:
> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
>  wrote:
>> I'm now not that clear on what the behaviour is if the ONLY keyword is
>> not specified on the CREATE INDEX for the partitioned index. Does that
>> go and create each leaf partition index regardless of if there is a
>> suitable candidate to ATTACH?
>
> No, the other way around.  ONLY is being proposed as a way to create
> an initially-not-valid parent to which we can then ATTACH
> subsequently-created child indexes.  But because we will have REPLACE
> rather than DETACH, once you get the index valid it never goes back to
> not-valid.

I understand what the ONLY is proposed to do. My question was in
regards to the behaviour without ONLY.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Michael Paquier
On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier
 wrote:
> The barrier here is thin. What's proposed here is already doable with
> a WITH RECURSIVE query. So why not just document this query and be
> done with it, instead of complicating the code? It seems to me that
> the performance of calling pg_relation_size() in a cascading fashion
> would not matter much. Or one could invent an additional cascading
> option which scans inheritance and/or partition chains, or simply add
> a new function.

I just blogged on the matter, and here is one possibility here
compatible with v10:
WITH RECURSIVE partition_info
  (relid,
   relname,
   relsize,
   relispartition,
   relkind) AS (
SELECT oid AS relid,
   relname,
   pg_relation_size(oid) AS relsize,
   relispartition,
   relkind
FROM pg_catalog.pg_class
WHERE relname = 'parent_name' AND
  relkind = 'p'
  UNION ALL
SELECT
 c.oid AS relid,
 c.relname AS relname,
 pg_relation_size(c.oid) AS relsize,
 c.relispartition AS relispartition,
 c.relkind AS relkind
FROM partition_info AS p,
 pg_catalog.pg_inherits AS i,
 pg_catalog.pg_class AS c
WHERE p.relid = i.inhparent AND
 c.oid = i.inhrelid AND
 c.relispartition
  )
SELECT * FROM partition_info;

This is not really straightforward. You could just as well have the
pg_relation_size call in the outer query.
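What the recursive CTE computes can also be sketched procedurally — here in Python against a toy stand-in for pg_class/pg_inherits, purely to illustrate the traversal (no real catalog access; relation names and sizes are made up):

```python
# Toy catalog: parent -> child partitions, and per-relation sizes in
# bytes.  "p2" is itself sub-partitioned, like a multi-level tree.
children = {"parent": ["p1", "p2"], "p2": ["p2a", "p2b"]}
relsize = {"parent": 0, "p1": 8192, "p2": 0, "p2a": 16384, "p2b": 8192}

def partition_info(rel):
    """Return (relname, relsize) rows for rel and every partition below
    it, mirroring the anchor and recursive members of the CTE above."""
    rows = [(rel, relsize[rel])]
    for child in children.get(rel, []):
        rows.extend(partition_info(child))
    return rows

rows = partition_info("parent")
total = sum(size for _, size in rows)
print(total)  # 32768
```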
-- 
Michael



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 9:38 PM, David Rowley
 wrote:
> On 18 December 2017 at 15:04, Robert Haas  wrote:
>> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
>>  wrote:
>>> I'm now not that clear on what the behaviour is if the ONLY keyword is
>>> not specified on the CREATE INDEX for the partitioned index. Does that
>>> go and create each leaf partition index regardless of if there is a
>>> suitable candidate to ATTACH?
>>
>> No, the other way around.  ONLY is being proposed as a way to create
>> an initially-not-valid parent to which we can then ATTACH
>> subsequently-created child indexes.  But because we will have REPLACE
>> rather than DETACH, once you get the index valid it never goes back to
>> not-valid.
>
> I understand what the ONLY is proposed to do. My question was in
> regards to the behaviour without ONLY.

Oh, sorry -- I was confused.  I'm not sure whether that should try to
attach to something if it exists, or just create unconditionally...
what do you think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread David Rowley
On 18 December 2017 at 15:59, Robert Haas  wrote:
> On Sun, Dec 17, 2017 at 9:38 PM, David Rowley
>  wrote:
>> On 18 December 2017 at 15:04, Robert Haas  wrote:
>>> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley
>>>  wrote:
 I'm now not that clear on what the behaviour is if the ONLY keyword is
 not specified on the CREATE INDEX for the partitioned index. Does that
 go and create each leaf partition index regardless of if there is a
 suitable candidate to ATTACH?
>>>
>>> No, the other way around.  ONLY is being proposed as a way to create
>>> an initially-not-valid parent to which we can then ATTACH
>>> subsequently-created child indexes.  But because we will have REPLACE
>>> rather than DETACH, once you get the index valid it never goes back to
>>> not-valid.
>>
>> I understand what the ONLY is proposed to do. My question was in
>> regards to the behaviour without ONLY.
>
> Oh, sorry -- I was confused.  I'm not sure whether that should try to
> attach to something if it exists, or just create unconditionally...
> what do you think?

I think you feel quite strongly about not having the code select a
random matching index, so if we want to stick to that rule, then we'll
need to create a set of new leaf indexes rather than select a random
one.

If we don't do that, then we might as well go with my idea and ditch
the ONLY syntax and have the CREATE INDEX try to find a suitable leaf
index, or create a new leaf index if one cannot be found.

Would it be surprising to users if CREATE INDEX ON partitioned_table
created a bunch of duplicate new indexes on the leaf tables?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-17 Thread Robert Haas
On Sun, Dec 17, 2017 at 10:05 PM, David Rowley
 wrote:
> I think you feel quite strongly about not having the code select a
> random matching index, so if we want to stick to that rule, then we'll
> need to create a set of new leaf indexes rather than select a random
> one.

I feel quite strongly about it *in the context of pg_dump*.  Outside
of that scenario, I'm happy to go with whatever behavior you and
others think will satisfy the POLA.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-12-17 Thread Robert Haas
On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada  wrote:
> +1 from me.

Works for me, too, although I still don't really follow how it's
happening in the present coding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-12-17 Thread Michael Paquier
On Mon, Dec 18, 2017 at 12:14 PM, Robert Haas  wrote:
> On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada  
> wrote:
>> +1 from me.
>
> Works for me, too, although I still don't really follow how it's
> happening in the present coding.

Craig has mentioned at least one way upthread:
https://www.postgresql.org/message-id/CAMsr+YHj4asAu8dJ1Q5hiW+wxL2Jd1=uxo52dvxhmxive-x...@mail.gmail.com
And that's possible when building with LOCK_DEBUG with trace_lwlocks
enabled, at least. So I would rather not assume that
CHECK_FOR_INTERRUPTS() is a code path never taken for the cases
discussed here.
-- 
Michael



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-12-17 Thread Masahiko Sawada
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas  wrote:
> On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada  
> wrote:
>> Here is the result.
>> I've measured the throughput for some cases on my virtual machine.
>> Each client loads a 48k file into a different relation located on
>> either an xfs or an ext4 filesystem, for 30 sec.
>>
>> Case 1: COPYs to relations on different filesystems (xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1024
>>
>> clients = 2, avg = 296.2068
>> clients = 5, avg = 372.0707
>> clients = 10, avg = 389.8850
>> clients = 50, avg = 428.8050
>>
>> Case 2: COPYs to relations on different filesystems (xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1
>>
>> clients = 2, avg = 294.3633
>> clients = 5, avg = 358.9364
>> clients = 10, avg = 383.6945
>> clients = 50, avg = 424.3687
>>
>> And the result of current HEAD is following.
>>
>> clients = 2, avg = 284.9976
>> clients = 5, avg = 356.1726
>> clients = 10, avg = 375.9856
>> clients = 50, avg = 429.5745
>>
>> In case 2, the throughput decreased compared to case 1, but it seems
>> to be almost the same as current HEAD. Because acquiring and
>> releasing the extension lock got 10x faster than current HEAD, as I
>> mentioned before, the performance degradation may be smaller than I
>> expected even in case 2.
>> Since my machine doesn't have enough resources, the result for
>> clients = 50 might not be valid.
>
> I have to admit that result is surprising to me.
>

I think the environment I used for performance measurement did not
have enough resources. I will do the same benchmark on another
environment to see if it was a valid result, and will share it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Amit Langote
Thanks all for your thoughts.

I agree with Robert's point, which both David and Michael seem to
share, that we shouldn't really be changing what pg_relation_size() is
doing under the covers.  And I guess the same goes for pg_table_size(),
too.  Both of those functions and their siblings work with relations
that possess on-disk structures and have associated relations (TOAST,
indexes) that in turn possess on-disk structures.  It seems quite
clearly documented as such.  Partitioned tables are different in that
they neither possess on-disk structures nor have any relations (TOAST,
indexes) associated directly with them.  Instead, they have partitions,
which are the relations that the aforementioned dbsize.c functions are
familiar with.

So, I withdraw the patch I originally posted in favor of some other approach.

Reply continues below...

On 2017/12/18 11:51, Michael Paquier wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier
>  wrote:
>> The barrier here is thin. What's proposed here is already doable with
>> a WITH RECURSIVE query. So why not just document this query and be
>> done with it, instead of complicating the code? It seems to me that
>> the performance of calling pg_relation_size() in a cascading fashion
>> would not matter much. Or one could invent an additional cascading
>> option which scans inheritance and/or partition chains, or simply add
>> a new function.
> 
> I just blogged on the matter, and here is one possibility here
> compatible with v10:
> WITH RECURSIVE partition_info
>   (relid,
>relname,
>relsize,
>relispartition,
>relkind) AS (
> SELECT oid AS relid,
>relname,
>pg_relation_size(oid) AS relsize,
>relispartition,
>relkind
> FROM pg_catalog.pg_class
> WHERE relname = 'parent_name' AND
>   relkind = 'p'
>   UNION ALL
> SELECT
>  c.oid AS relid,
>  c.relname AS relname,
>  pg_relation_size(c.oid) AS relsize,
>  c.relispartition AS relispartition,
>  c.relkind AS relkind
> FROM partition_info AS p,
>  pg_catalog.pg_inherits AS i,
>  pg_catalog.pg_class AS c
> WHERE p.relid = i.inhparent AND
>  c.oid = i.inhrelid AND
>  c.relispartition
>   )
> SELECT * FROM partition_info;
> 
> This is not really straightforward. You could just as well have the
> pg_relation_size call in the outer query.

Thanks Michael for coming up with that.

Do you (and/or others) think that's something that we can wrap inside a
built-in function(s), that is, one defined in system_views.sql?  Or if we
decide to have new functions, say, pg_get_partitions() and/or
pg_get_partition_sizes(), we might as well implement them as C functions
inside dbsize.c.  If so, do we want to have "partition" variants of
all *_size() functions viz. pg_relation_size(), pg_total_relation_size(),
pg_indexes_size(), and pg_table_size()?

Thanks,
Amit




Re: pg_(total_)relation_size and partitioned tables

2017-12-17 Thread Michael Paquier
On Mon, Dec 18, 2017 at 2:17 PM, Amit Langote
 wrote:
> Do you (and/or others) think that's something that we can wrap inside a
> built-in function(s), that is, one defined in system_views.sql?  Or if we
> decide to have new functions, say, pg_get_partitions() and/or
> pg_get_partition_sizes(), we might as well implement them as C functions
> inside dbsize.c.  If so, do we want to have "partition" variants of
> all *_size() functions viz. pg_relation_size(), pg_total_relation_size(),
> pg_indexes_size(), and pg_table_size()?

I can see value in something like what Robert is proposing upthread:
a function one can customize to decide what to include in the
calculation. There could be options for at least:
- partitions, or relation cascading.
- indexes.
- toast tables.
- visibility maps.
- FSM.
The first three are what Robert mentioned; still, I would see the last
two as interesting if optional. Or we could have just an SRF that
returns one row per object scanned.
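Such a customizable function might look like the following Python sketch — an assumed design for illustration only (the real thing would be a C function in dbsize.c, and the component sizes here are made-up stand-ins):

```python
# Made-up per-relation component sizes in bytes: the heap main fork
# plus the optional pieces listed above.
COMPONENTS = {
    "t1":  {"heap": 8192, "indexes": 16384, "toast": 8192,
            "fsm": 24576, "vm": 8192},
    "t1a": {"heap": 8192, "indexes": 8192, "toast": 0,
            "fsm": 0, "vm": 0},
}
PARTITIONS = {"t1": ["t1a"]}  # parent -> child partitions

def table_size(rel, partitions=True, indexes=True, toast=True,
               fsm=False, vm=False):
    """Sum the selected on-disk components of rel, optionally cascading
    into its partitions."""
    c = COMPONENTS[rel]
    total = c["heap"]
    for enabled, key in ((indexes, "indexes"), (toast, "toast"),
                         (fsm, "fsm"), (vm, "vm")):
        if enabled:
            total += c[key]
    if partitions:
        for child in PARTITIONS.get(rel, []):
            total += table_size(child, partitions, indexes, toast,
                                fsm, vm)
    return total
```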
-- 
Michael



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-17 Thread Rushabh Lathia
On Sat, Dec 16, 2017 at 12:40 AM, Robert Haas  wrote:

> On Fri, Dec 15, 2017 at 12:40 PM, Peter Eisentraut
>  wrote:
> > On 12/14/17 22:59, Rushabh Lathia wrote:
> >> I noted that the no_priv_msg and not_owner_msg arrays have been
> >> removed and the code folded into aclcheck_error().  Actually that
> >> makes the code more complex than it used to be.  I would prefer
> >> the arrays rather than the code being folded into the function.
> >
> > There is an argument for having a big array versus the switch/case
> > approach.  But most existing code around object addresses uses the
> > switch/case approach, so it's better to align it that way, I think.
> > It's weird to have to maintain two different styles.
>
>
My only motivation is that the earlier approach looks cleaner. Also,
the patch is getting bigger, so continuing with the old approach would
make review easier. If the switch/case approach is the way to go, it
can be done as part of a separate cleanup patch.

Thanks,
Rushabh Lathia