Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-16 Thread Rafia Sabih
On Tue, Oct 17, 2017 at 3:22 AM, Andres Freund  wrote:
> Hi Rafia,
>
> On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
>> head:
>> explain  analyse select * from t where i < 3000;
>>  QUERY PLAN
>
> Could you share how exactly you generated the data? Just so others can
> compare a bit better with your results?
>

Sure. I used generate_series(1, 1000);
Please find the attached script for the detailed steps.
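
For readers without the attachment, a minimal sketch of that kind of
setup could look like the following (the table definition and row count
here are illustrative only; the attached large_tbl_gen.sql is the
authoritative version):

CREATE TABLE t (i int, j text);
INSERT INTO t
SELECT g, repeat('x', 100)
FROM generate_series(1, 10000000) g;  -- placeholder row count
ANALYZE t;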

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


large_tbl_gen.sql
Description: Binary data



Re: [HACKERS] possible encoding issues with libxml2 functions

2017-10-16 Thread Pavel Stehule
2017-10-17 1:57 GMT+02:00 Noah Misch :

> On Sun, Aug 20, 2017 at 10:37:10PM +0200, Pavel Stehule wrote:
> > > We have xpath-bugfix.patch and xpath-parsing-error-fix.patch.  Both are
> > > equivalent under supported use cases (xpath in UTF8 databases).  Among
> > > non-supported use cases, they each make different things better and
> > > different
> > > things worse.  We should prefer to back-patch the version harming fewer
> > > applications.  I expect non-ASCII data is more common than xml
> declarations
> > > with "encoding" attribute, so xpath-bugfix.patch will harm fewer
> > > applications.
> > >
> > > Having said that, I now see a third option.  Condition this thread's
> > > patch's
> > > effects on GetDatabaseEncoding()==PG_UTF8.  That way, we fix supported
> > > cases,
> > > and we remain bug-compatible in unsupported cases.  I think that's
> better
> > > than
> > > the other options discussed so far.  If you agree, please send a patch
> > > based
> > > on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 change
> and
> > > the
> > > two edits I described earlier.
> > >
> >
> > I am sorry -  too long day today. Do you think some like
> >
> > diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
> > index 24229c2dff..9fd6f3509f 100644
> > --- a/src/backend/utils/adt/xml.c
> > +++ b/src/backend/utils/adt/xml.c
> > @@ -3914,7 +3914,14 @@ xpath_internal(text *xpath_expr_text, xmltype
> *data,
> > ArrayType *namespaces,
> > if (ctxt == NULL || xmlerrcxt->err_occurred)
> > xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
> > "could not allocate parser context");
> > -   doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL,
> 0);
> > +
> > +   /*
> > +* Passed XML is always in server encoding. When server encoding
> > +* is UTF8, we can pass this information to libxml2 to ignore
> > +* possible invalid encoding declaration in XML document.
> > +*/
> > +   doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> > +   GetDatabaseEncoding() == PG_UTF8 ? "UTF-8" : NULL, 0);
> > if (doc == NULL || xmlerrcxt->err_occurred)
> > xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
> > "could not parse XML document");
>
> No, that doesn't match my description above.  I don't see a way to clarify
> the
> description.  Feel free to try again.  Alternately, if you wait, I will
> eventually construct the patch I described.
>

Please, if you can, try to write it. I am a little bit lost :)

Regards

Pavel


Re: [HACKERS] [PATCH] Add recovery_min_apply_delay_reconnect recovery option

2017-10-16 Thread Michael Paquier
On Tue, Oct 17, 2017 at 12:51 AM, Eric Radman  wrote:
> This administrative compromise is necessary because the WalReceiver is
> not resumed after a network interruption until all records are read,
> verified, and applied from the archive on disk.

Taking a step back here... recoveryApplyDelay() uses
XLogCtl->recoveryWakeupLatch, which gets set if the WAL receiver has
received new WAL, or if the WAL receiver shuts down properly. So if
the WAL receiver goes down for whatever reason during the loop in
recoveryApplyDelay(), the startup process waits for a record to be
applied, possibly for a long time, and as there is no WAL receiver we
actually don't receive any new WAL records. New WAL records would be
received only once WaitForWALToBecomeAvailable() is called, which
happens once the apply delay is over. If the postmaster dies, then
HandleStartupProcInterrupts() takes care of taking down the startup
process immediately, which is cool.

I see what you are trying to achieve and that seems worth it. It is
indeed a waste to not have a WAL receiver online while waiting for a
delay to be applied. If there is a flaky network between the primary
and a standby, you may end up with a standby way behind its primary,
and that could penalize a clean shutdown of the primary, as the primary
waits for the shutdown checkpoint record to be flushed on the standby.

I think that your way of dealing with the problem is messy, though. If
you think about it, no parameters are actually needed. What you should
try to achieve is to make recoveryApplyDelay() smarter, by making the
wait stop forcibly if you detect a failure, getting out of the redo
routine, and then forcing the record to be read again. This way, the
startup process would try to start a new WAL receiver again if it
thinks that the source it should read WAL from is a stream. That may
turn out to be a more complicated patch than you think, though.

Your patch also actually breaks the use case of standbys doing
recovery using only archives and no streaming. In this case
WalRcvStreaming returns false, and recovery_min_apply_delay_reconnect
would be used unconditionally, so you would break a lot of
applications silently.

> Is it possible to verify the archive on disk independently of
> application?  Adding a second delay parameter provides a workaround for
> some use cases without complecting xlog.c.

That's possible, though not with core. The archives are in a location
not controlled by the backend, but by archive_command, which may not
be local to the instance where Postgres is running. You could always
hack your own functions to do this work; here is an example of
something I came up with:
https://github.com/michaelpq/pg_plugins/tree/master/wal_utils
This prototype (use and hack at your own risk) is able to look at the
local contents of an archive. This can be used easily with a
client-side tool to copy a series of segments, or just to perform
sanity checks on them.

For those reasons, -1 for the patch as proposed.
-- 
Michael




[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-10-16 Thread Noah Misch
On Mon, Oct 16, 2017 at 12:57:39PM -0700, Peter Geoghegan wrote:
> On Fri, Oct 13, 2017 at 7:09 PM, Noah Misch  wrote:
> > The checker should
> > consider circumstances potentially carried from past versions via 
> > pg_upgrade.
> 
> Right. False positives are simply unacceptable.

False positives are bugs, but they're not exceptionally-odious bugs.

> > Fortunately, if you get some details wrong, it's cheap to recover from 
> > checker
> > bugs.
> 
> Ideally, amcheck will become a formal statement of the contracts
> provided by major subsystems, such as the heapam, the various SLRUs,
> and so on. While I agree that having bugs there is much less severe
> than having bugs in backend code, I would like the tool to reach a
> point where it actually *defines* correctness (by community
> consensus).

That presupposes construction of two pieces of software, the server and the
checker, such that every disagreement is a bug in the server.  But checkers
get bugs just like servers get bugs.  Checkers do provide a sort of
double-entry bookkeeping.  When a reproducible test case prompts a checker
complaint, we'll know *some* code is wrong.  That's an admirable contribution.

> If a bug in amcheck reflects a bug in our high level
> thinking about correctness, then that actually is a serious problem.

My notion of data file correctness is roughly this:

  A data file is correct if the server's reads and mutations thereof will not
  cause it to deviate from documented behavior.  Where the documentation isn't
  specific, fall back on SQL standards.  Where no documentation or SQL
  standard addresses a particular behavior, we should debate the matter and
  document the decision.

I'm essentially saying that the server is innocent until proven guilty.  It
would be cool to have a self-contained specification of PostgreSQL data files,
but where the server disagrees with the spec without causing problem
behaviors, we'd ultimately update the spec to fit the server.

Thanks,
nm




Re: [HACKERS] [PATCH] Lockable views

2017-10-16 Thread Tatsuo Ishii
> I'm a bit confused. What is the difference between tables and functions
> in a subquery with regard to view locking? I think non-view queries
> using a subquery also do not care about the changes of tables in the
> subquery while executing the query. I might be misunderstanding
> the problem you mentioned.

The difference is that in the function case we are concerned with the
function definition, while the table case needs to care about the table
definition *and* the contents of the table.

If we are changing the table definition, AccessExclusiveLock will be
held on the table and the update will be blocked anyway. So we don't
need to care about table definition changes.

On the other hand, table content changes need to be taken care of,
because no automatic locks are taken in some cases. I think whether the
tables in the subquery need locking or not depends on the use case.

So I dug into the previous candidates a little bit more:

1) Leave as it is (ignore tables appearing in a subquery)

2) Lock all tables including in a subquery

3) Check subquery in the view definition. If there are some tables
   involved, emit an error and abort.

I think one of the problems with #2 is that we would lock the tables
involved in the view in random order, which could cause unwanted
deadlocks. This is not good and I cannot see any easy way to avoid it.
Also, some tables may not need to be locked.

The problem with #3 is that it does not help a user who wants to
control the locking himself/herself.

So it seems #1 is the most reasonable way to deal with the problem,
assuming that it's the user's responsibility to take appropriate locks
on the tables in the subquery.
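
To make the trade-off concrete, here is a rough sketch of what #1 means
in practice (hypothetical objects; LOCK on a view is of course the
behaviour proposed by this patch, not something released PostgreSQL
accepts today):

CREATE TABLE tbl  (i int);
CREATE TABLE tbl2 (j int);
CREATE VIEW v AS SELECT i FROM tbl WHERE i IN (SELECT j FROM tbl2);

BEGIN;
LOCK TABLE v IN ACCESS EXCLUSIVE MODE;    -- locks v and tbl under #1
LOCK TABLE tbl2 IN ACCESS EXCLUSIVE MODE; -- the user locks the subquery
                                          -- table explicitly if needed
COMMIT;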

> BTW, I found that if we have to handle subqueries in where clause, we would
> also have to care about subqueries in target list... The view defined as
> below is also updatable.
> 
>  =# create view v7 as select (select * from tbl2 limit 1) from tbl;

The view is not updatable. You will get something like this if you try to update v7:

DETAIL:  Views that have no updatable columns are not automatically updatable.

On the other hand this:

create view v7 as select i, (select j from tbl2 limit 1) from tbl;

will be updatable. In this case column j of v7 will never be
updatable. And you should do something like:

insert into v7(i) values...

In short, you don't need to care about a subquery appearing in the TLE
as far as view locking is concerned.
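
Putting the pieces above together, an end-to-end illustration (the
table definitions are assumed for the example; the explicit AS j alias
is only there for clarity):

CREATE TABLE tbl  (i int);
CREATE TABLE tbl2 (j int);
CREATE VIEW v7 AS SELECT i, (SELECT j FROM tbl2 LIMIT 1) AS j FROM tbl;

INSERT INTO v7 (i) VALUES (1);        -- works: i is an updatable column
INSERT INTO v7 (i, j) VALUES (1, 2);  -- fails: j comes from a sub-select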

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Determine state of cluster (HA)

2017-10-16 Thread Craig Ringer
On 17 October 2017 at 01:02, Joshua D. Drake  wrote:
> On 10/15/2017 07:39 PM, Craig Ringer wrote:
>>
>> On 13 October 2017 at 08:50, Joshua D. Drake  wrote:
>>>
>>> -Hackers,
>>>
>>> I had a long call with a firm developing front end proxy/cache/HA for
>>> Postgres today. Essentially the software is a replacement for PGPool in
>>> entirety but also supports analytics etc... When I was asking them about
>>> pain points they talked about the below and I was wondering if this is a
>>> problem we would like to solve.
>>
>>
>> IMO: no one node knows the full state of the system, or can know it.
>
>
> That isn't exactly true. We do know if our replication state is current but
> only from the master which is part of the problem.

Sure. But unless you have a perfectly-reliable, latency-ignoring
wormhole link between master and standby, the standby always has
imperfect knowledge of the master. More importantly, it can never know
for sure how old its knowledge is.

https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-WAL-RECEIVER-VIEW
already does about the best we can probably do. In particular
last_msg_send_time and last_msg_receipt_time, used in combination with
latest_end_lsn and latest_end_time.
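
For example, on the standby something like this is about the best the
standby itself can tell you (the lag expression is necessarily only an
approximation, for the reasons above):

SELECT status,
       last_msg_send_time,
       last_msg_receipt_time,
       latest_end_lsn,
       latest_end_time,
       now() - latest_end_time AS apparent_lag  -- approximate by definition
FROM pg_stat_wal_receiver;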

>> That said, I do think it'd be very desirable for us to introduce a
>> greater link from a standby to master:
>>
>> - Get info about master. We should finish merging recovery.conf into
>> postgresql.conf.
>
>
> Definitely.

There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL
11 for that.


>>> 1.  The dblink call doesn't have a way to specify a timeout, so we have
>>> to
>>> use Java futures to control how long this may take to a reasonable amount
>>> of
>>> time;
>>
>>
>> statement_timeout doesn't work?
>
>
> That would be a work around definitely but I think it would be better to
> say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (Example, let's not get off into
> the weeds :P) and if the standby can't receive a ping/ack within 120 it will
> promote itself.

I'm confused by this. I thought you were talking about timeouts
querying status of an upstream over dblink. Not automatic
self-promotion.

I'm really not a fan of Pg standbys self-promoting without working
with an external co-ordinator that handles STONITH/fencing. It's a
recipe for disaster. That's what I was saying upthread, that
implementing bits and pieces here can be quite dangerous.

This also takes it well outside what you were talking about, improving
the ability to detect Pg's state, and into having it become its own
co-ordinator for HA actions.

So lets go back to the original question. What's missing that
statement_timeout doesn't provide for querying remote servers for
their status over dblink?
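
To make that concrete: something along these lines (untested, with an
invented connection string, and assuming the dblink extension) already
bounds how long a status probe can take, with connect_timeout covering
connection establishment and statement_timeout the query itself:

BEGIN;
SET LOCAL statement_timeout = '5s';
SELECT *
FROM dblink('host=upstream dbname=postgres connect_timeout=5',
            'SELECT pg_is_in_recovery()') AS t(in_recovery bool);
COMMIT;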

If you want a nicer way to say "look up whatever your conninfo in
recovery.conf is, connect to it, get me some info on it and return it,
possibly daisy-chaining up a chain of replicas if you reach the
master" ... that's fine. But it's a different thing.

>> Er, yes? I don't understand what you are getting at here.
>
>
> Yes, I will need to go back to them on this one. I think what they mean is
> that if we have a connection that is getting closed it doesn't return why it
> is closing. It just throws an error.

Yes, we do.  From
https://www.postgresql.org/docs/current/static/errcodes-appendix.html:

Class 57 — Operator Intervention
  57000 operator_intervention
  57014 query_canceled
  57P01 admin_shutdown
  57P02 crash_shutdown
  57P03 cannot_connect_now
  57P04 database_dropped

Maybe they want more granularity in terms of what reasons are given
and what errors are reported. That's fine. But please provide
specifics.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] BLK_DONE state in XLogReadBufferForRedoExtended

2017-10-16 Thread Michael Paquier
On Mon, Oct 16, 2017 at 9:50 PM, Amit Kapila  wrote:
> If above analysis is correct, then I think we can say that row state
> in a page will be same during recovery as it was when the original
> operation was performed if the full_page_writes are enabled. I am not
> sure how much this can help in current heap format, but this can help
> in zheap (undo based heap).

If I understood that correctly, that looks like a sane assumption. For
REGBUF_NO_IMAGE you may need to be careful though with undo
operations.

> In zheap, we are writing complete tuple for Delete operation in undo
> so that we can reclaim the corresponding tuple space as soon as the
> deleting transaction is committed.  Now, during recovery, we have to
> generate the complete undo record (which includes the entire tuple)
> and for that ideally, we should write the complete tuple in WAL, but
> instead of that, I think we can regenerate it from the original page.
> This is only applicable when full_page_writes are enabled, otherwise,
> a complete tuple is required in WAL.

Yeah, you should really try to support both modes as well.
Fortunately, it is possible to know if full page writes are enforced
at the moment a record is assembled and inserted, so you could rely on
that.

> I am not sure how much above makes sense to anyone without a detailed
> explanation, but I thought I should give some background on why I
> asked this question.  However, if anybody needs more explanation or
> sees any fault in above understanding, please let me know.

Thanks for clarifying; I was wondering about the reason behind the
question as well. It is the second time that I have seen the word zheap
on -hackers, and the first time was no more than 2 days ago, by Robert.
-- 
Michael




[HACKERS] Fix typo in blvacuum.c

2017-10-16 Thread Seki, Eiji
Hi all,

I found a typo in a comment in blvacuum.c, so I have attached a patch for it.

--
Regards,
Eiji Seki
Fujitsu




fix_comment_in_blvacuum_c.patch
Description: fix_comment_in_blvacuum_c.patch



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-10-16 Thread Michael Paquier
On Thu, Oct 12, 2017 at 5:43 PM, Alvaro Herrera  wrote:
> Michael Paquier wrote:
>> On Thu, Oct 12, 2017 at 7:55 AM, Peter Eisentraut
>>  wrote:
>> > It seems to me that having ACL_OBJECT_* symbols alongside OBJECT_*
>> > symbols is not useful and leads to duplication.  Digging around in the
>> > past suggests that we used to have a lot of these command-specific
>> > symbols but got rid of them over time, except that the ACL stuff was
>> > never touched.  The attached patch accomplishes that.
>
> +1 for this.
>
>> -bool
>> -EventTriggerSupportsGrantObjectType(GrantObjectType objtype)
>> -{
>> -   switch (objtype)
>> -   {
>> -   case ACL_OBJECT_DATABASE:
>> -   case ACL_OBJECT_TABLESPACE:
>> -   /* no support for global objects */
>> -   return false;
>> By removing that, if any GRANT/REVOKE support is added for another
>> object type, then EventTriggerSupportsObjectType would return true by
>> default instead of getting a compilation failure.
>
> Yeah, we've found it useful to remove default: clauses in some switch
> blocks so that we get compile warnings when we forget to add a new type
> (c.f. commit e84c0195980f).  Let's not add any more of those.

Here is an idea: let's keep EventTriggerSupportsGrantObjectType, but
have it use OBJECT_* instead of ACL_OBJECT_*, and complain with an
error if the object type is not supported with GRANT. No need for a
default in this case.
-- 
Michael




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-10-16 Thread Peter Geoghegan
On Sat, Oct 14, 2017 at 2:47 PM, Peter Geoghegan  wrote:
> I am working on an experimental version of pg_filedump, customized to
> output XML that can be interpreted by an open source hex editor. The
> XML makes the hex editor produce color coded, commented
> tags/annotations for any given heap or B-Tree relation. This includes
> tooltips with literal values for all status bits (including
> t_infomask/t_infomask2 bits, IndexTuple bits, B-Tree meta page status
> bits, PD_* page-level bits, ItemId bits, and others).

This is now available from: https://github.com/petergeoghegan/pg_hexedit

-- 
Peter Geoghegan




Re: [HACKERS] possible encoding issues with libxml2 functions

2017-10-16 Thread Noah Misch
On Sun, Aug 20, 2017 at 10:37:10PM +0200, Pavel Stehule wrote:
> > We have xpath-bugfix.patch and xpath-parsing-error-fix.patch.  Both are
> > equivalent under supported use cases (xpath in UTF8 databases).  Among
> > non-supported use cases, they each make different things better and
> > different
> > things worse.  We should prefer to back-patch the version harming fewer
> > applications.  I expect non-ASCII data is more common than xml declarations
> > with "encoding" attribute, so xpath-bugfix.patch will harm fewer
> > applications.
> >
> > Having said that, I now see a third option.  Condition this thread's
> > patch's
> > effects on GetDatabaseEncoding()==PG_UTF8.  That way, we fix supported
> > cases,
> > and we remain bug-compatible in unsupported cases.  I think that's better
> > than
> > the other options discussed so far.  If you agree, please send a patch
> > based
> > on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 change and
> > the
> > two edits I described earlier.
> >
> 
> I am sorry -  too long day today. Do you think some like
> 
> diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
> index 24229c2dff..9fd6f3509f 100644
> --- a/src/backend/utils/adt/xml.c
> +++ b/src/backend/utils/adt/xml.c
> @@ -3914,7 +3914,14 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
> ArrayType *namespaces,
> if (ctxt == NULL || xmlerrcxt->err_occurred)
> xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
> "could not allocate parser context");
> -   doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
> +
> +   /*
> +* Passed XML is always in server encoding. When server encoding
> +* is UTF8, we can pass this information to libxml2 to ignore
> +* possible invalid encoding declaration in XML document.
> +*/
> +   doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> +   GetDatabaseEncoding() == PG_UTF8 ? "UTF-8" : NULL, 0);
> if (doc == NULL || xmlerrcxt->err_occurred)
> xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
> "could not parse XML document");

No, that doesn't match my description above.  I don't see a way to clarify the
description.  Feel free to try again.  Alternately, if you wait, I will
eventually construct the patch I described.
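
For anyone else following along, the class of case at issue is roughly
this shape (invented content, run in a UTF8 database): an xpath() call
on non-ASCII data whose embedded declaration names a different
encoding.

SELECT xpath('/r/text()',
             '<?xml version="1.0" encoding="LATIN1"?><r>Grüße</r>'::xml);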




Re: [HACKERS] Still another race condition in recovery TAP tests

2017-10-16 Thread Michael Paquier
On Tue, Oct 17, 2017 at 5:01 AM, Robert Haas  wrote:
> On Fri, Oct 6, 2017 at 5:57 AM, Craig Ringer  wrote:
>>> Fewer people will test as we grow the list of modules they must first 
>>> install.
>> At worst, all we have to do is provide a script
>> that fetches them, from distro repos if possible, and failing that
>> from CPAN.
>>
>> With cpanminus, that's pretty darn simple too.
>
> I think the idea that this is going to work for everybody is not very
> realistic.  I am not going to run some random script to install stuff
> on my system in whatever manner it feels like, because when I do stuff
> like that I usually end up regretting it.  If it's a throw-away VM,
> sure, but not otherwise.

Yeah, having something which is network-dependent to run the full set
of regression tests is not a good idea. I have seen users, at least in
Japan, deploying Postgres on hosts that have no external connection,
just a base OS image built with all packages present. Data center
rules can be very strict.

> We don't even have good documentation of what packages you should
> install on various packaging systems to build the server, let alone
> all of the additional things that may be required depending on build
> options and TAP tests you might want to run.

On top of that, let's make sure that we take the time to improve the
modules we already have in the code tree. Let's not forget that ;)
-- 
Michael




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-16 Thread Andres Freund
Hi Rafia,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> head:
> explain  analyse select * from t where i < 3000;
>  QUERY PLAN

Could you share how exactly you generated the data? Just so others can
compare a bit better with your results?

Regards,

Andres




Re: [HACKERS] Is it time to kill support for very old servers?

2017-10-16 Thread Andres Freund
On 2017-10-16 16:59:59 -0400, Peter Eisentraut wrote:
> On 9/20/17 04:32, Andres Freund wrote:
> > Here's what I roughly was thinking of.  I don't quite like the name, and
> > the way the version is specified for libpq (basically just the "raw"
> > integer).
> 
> "forced_protocol_version" reads wrong to me.  I think
> "force_protocol_version" might be better.  Other than that, no issues
> with this concept.

Yea, I agree. I've read through the patch since, and it struck me as
odd. Not sure how I came up with it...

Greetings,

Andres Freund




[HACKERS] EXPLAIN (ANALYZE, BUFFERS) reports bogus temporary buffer reads

2017-10-16 Thread Thomas Munro
Hi hackers,

Vik Fearing asked off-list why hash joins appear to read slightly more
temporary data than they write.  The reason is that we notch up a
phantom block read when we hit the end of each file.  Harmless but it
looks a bit weird and it's easily fixed.

Unpatched, a 16 batch hash join reports that we read 30 more blocks
than we wrote (2 per batch after the first, as expected):

   Buffers: shared hit=434 read=16234, temp read=5532 written=5502

With the attached patch:

   Buffers: shared hit=547 read=16121, temp read=5502 written=5502
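
For anyone wanting to reproduce that kind of output, a query along
these lines shows the temp counters (table names and sizes invented; a
small work_mem forces multiple batches):

SET work_mem = '1MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM big t1
JOIN big t2 USING (id);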

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Don-t-count-EOF-as-a-temporary-buffer-read-in-EXPLAI.patch
Description: Binary data



Re: [HACKERS] Is it time to kill support for very old servers?

2017-10-16 Thread Peter Eisentraut
On 9/20/17 04:32, Andres Freund wrote:
> Here's what I roughly was thinking of.  I don't quite like the name, and
> the way the version is specified for libpq (basically just the "raw"
> integer).

"forced_protocol_version" reads wrong to me.  I think
"force_protocol_version" might be better.  Other than that, no issues
with this concept.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] coverage analysis improvements

2017-10-16 Thread Peter Eisentraut
On 9/27/17 01:52, Michael Paquier wrote:
> I am marking the full set of patches as ready for committer.

All these patches have now been committed.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] oversight in EphemeralNamedRelation support

2017-10-16 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Oct 13, 2017 at 10:01 AM, Tom Lane  wrote:
>> But I see very
>> little case for allowing CTEs to capture such references, because surely
>> we are never going to allow that to do anything useful, and we have
>> several years of precedent now that they don't capture.

> For what it's worth, SQL Server allows DML in CTEs like us but went
> the other way on this.  Not only are its CTEs in scope as DML targets,
> it actually lets you update them in cases where a view would be
> updatable, rewriting as base table updates.  I'm not suggesting that
> we should do that too (unless of course it shows up in a future
> standard), just pointing it out as a curiosity.

Interesting.  Still, given that we have quite a few years of precedent
that CTEs aren't in scope as DML targets, I'm disinclined to change
our semantics unless the point does show up in the standard.

I've not heard anyone speaking against the choices you made in your
prior message, so I'll go review your v3 patch, and push unless
I find problems.

regards, tom lane




Re: [HACKERS] Aggregate FILTER option is broken in v10

2017-10-16 Thread Andres Freund
On 2017-10-16 11:12:09 -0400, Tom Lane wrote:
> I wrote:
> > I think possibly the best answer is to revert 8ed3f11bb.  We could
> > think about some compromise solution like merging the projections
> > only for aggregates without FILTER.  But that would require two
> > code paths through the relevant functions in nodeAgg.c, which would
> > be a substantial maintenance burden; and the extra branches required
> > means that this would be a net negative for performance in the
> > simplest case with only one aggregate.
> 
> Hmm ... on closer inspection, the only performance-critical place
> where this matters is advance_aggregates, and that already has a check
> for whether the particular aggregate has a filter.  So we could do
> something like
> 
> /* Skip anything FILTERed out */
> if (filter)
> {
> // existing code to eval/check filter expr
> +
> +   /* Now it's safe to evaluate this agg's arguments */
> +   slot = ExecProject(pertrans->argproj);
> }
> +   else
> +   slot = aggstate->evalslot;
> 
> which seems like a pretty minimal extra cost for the normal case
> with no filter.

Thanks, that looks like a reasonable fix.

Greetings,

Andres Freund




Re: [HACKERS] Still another race condition in recovery TAP tests

2017-10-16 Thread Robert Haas
On Fri, Oct 6, 2017 at 5:57 AM, Craig Ringer  wrote:
>> Fewer people will test as we grow the list of modules they must first 
>> install.
>
> Meh, I don't buy that.

Well, I do.  Prerequisites are a pain, and the more of them there are,
the more pain it is.

> At worst, all we have to do is provide a script
> that fetches them, from distro repos if possible, and failing that
> from CPAN.
>
> With cpanminus, that's pretty darn simple too.

I think the idea that this is going to work for everybody is not very
realistic.  I am not going to run some random script to install stuff
on my system in whatever manner it feels like, because when I do stuff
like that I usually end up regretting it.  If it's a throw-away VM,
sure, but not otherwise.

We don't even have good documentation of what packages you should
install on various packaging systems to build the server, let alone
all of the additional things that may be required depending on build
options and TAP tests you might want to run.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-10-16 Thread Peter Geoghegan
On Fri, Oct 13, 2017 at 7:09 PM, Noah Misch  wrote:
> All good questions; I don't know offhand.  Discovering those answers is
> perhaps the chief labor required of such a project.

ISTM that by far the hardest part of the project is arriving at a
consensus around what a good set of invariants for CLOG and MultiXact
looks like.

I think that it's fair to say that this business with relfrozenxid now
appears to be more complicated than many of us would have thought. Or,
at least, more complicated than I thought when I first started
thinking about it. Once we're measuring this complexity (by having
checks), we should be in a better position to keep it under control,
and to avoid ambiguity.

> The checker should
> consider circumstances potentially carried from past versions via pg_upgrade.

Right. False positives are simply unacceptable.

> Fortunately, if you get some details wrong, it's cheap to recover from checker
> bugs.

Ideally, amcheck will become a formal statement of the contracts
provided by major subsystems, such as the heapam, the various SLRUs,
and so on. While I agree that having bugs there is much less severe
than having bugs in backend code, I would like the tool to reach a
point where it actually *defines* correctness (by community
consensus). If a bug in amcheck reflects a bug in our high level
thinking about correctness, then that actually is a serious problem.
Arguably, it's the most costly variety of bug that Postgres can have.

I may never be able to get general buy-in here; building broad
consensus like that is a lot harder than writing some code for a
contrib module. Making the checking code the *authoritative* record of
how invariants are *expected* to work is a major goal of the project,
though.

-- 
Peter Geoghegan




Re: [HACKERS] Determine state of cluster (HA)

2017-10-16 Thread Joshua D. Drake

On 10/16/2017 03:55 AM, Magnus Hagander wrote:



On Mon, Oct 16, 2017 at 4:39 AM, Craig Ringer wrote:


On 13 October 2017 at 08:50, Joshua D. Drake wrote:
> 5.  There is no way to connect to a db node with something akin to
> SQL-Server's "application intent" flags, to allow a connection to be
> rejected if we wish it to be a read/write connection.  This helps detect 
the
> state of the node directly without having to ask any further questions of
> the node, and makes it easier to "stall" during connection until a proper
> connection can be made.

That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master.
Which is where I'd like to land up eventually.


It also sounds a lot like the connection parameter target_session_attrs


Ahh, this is part of the new libpq failover right?

Thanks,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
* Unless otherwise stated, opinions are my own.   *




Re: [HACKERS] Determine state of cluster (HA)

2017-10-16 Thread Joshua D. Drake

On 10/15/2017 07:39 PM, Craig Ringer wrote:

On 13 October 2017 at 08:50, Joshua D. Drake  wrote:

-Hackers,

I had a long call with a firm developing front end proxy/cache/HA for
Postgres today. Essentially the software is a replacement for PGPool in
entirety but also supports analytics etc... When I was asking them about
pain points they talked about the below and I was wondering if this is a
problem we would like to solve.


IMO: no one node knows the full state of the system, or can know it.


That isn't exactly true. We do know if our replication state is current 
but only from the master which is part of the problem.




I'd love PostgreSQL to help users more with scaling, HA, etc. But I
think it's a big job. We'd need:

- a node topology of some kind, communicated between nodes
- heartbeat and monitoring
- failover coordination
- pooling/proxying
- STONITH/fencing
- etc.


I don't think we need all of that. This is more of a request to make it 
easier for those deploying HA to determine the state of Postgres.




That said, I do think it'd be very desirable for us to introduce a
greater link from a standby to master:

- Get info about master. We should finish merging recovery.conf into
postgresql.conf.


Definitely.


b. Attempt to connect to the host directly, if not...
c. use the slave and use the hostname via dblink to connect to the master,
as the hostname , i.e. select * from dblink('" + connInfo + "
dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
This is necessary in the event the hostname used in the recovery.conf file
is not resolvable from the outside.


OK, so "connect directly" here means from some 3rd party, the one
doing the querying of the replica.


1.  The dblink call doesn't have a way to specify a timeout, so we have to
use Java futures to control how long this may take to a reasonable amount of
time;


statement_timeout doesn't work?


That would definitely be a workaround, but I think it would be better to
say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (example, let's not get off
into the weeds :P) and if the standby can't receive a ping/ack within
120 it will promote itself.



PostgreSQL can't do anything about this one.


Yes that's true.


4.  It doesn't support cascading replication very well, although we could
augment the logic to allow us to map the relationship between nodes.
5.  There is no way to connect to a db node with something akin to
SQL-Server's "application intent" flags, to allow a connection to be
rejected if we wish it to be a read/write connection.  This helps detect the
state of the node directly without having to ask any further questions of
the node, and makes it easier to "stall" during connection until a proper
connection can be made.


That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master.
Which is where I'd like to land up eventually.

Again, that'd be a sensible patch to submit, quite separately to the
other topics.


Great.




6.  The master, on shutdown, will not actually close and disable connections
as it shuts down, instead, it will issue an error that it is shutting down
as it does so.


Er, yes? I don't understand what you are getting at here.


Yes, I will need to go back to them on this one. I think what they mean 
is that if we have a connection that is getting closed it doesn't return 
why it is closing. It just throws an error.




Can you describe expected vs actual behaviour in more detail?



I will need to get back to them on this but I think the behavior would 
be to have a return value of why the connection was closed vs just 
throwing an error. Say, "RETURN 66" means someone executed 
pg_going_to_failover() vs pg_terminate_backend() which could be for 
different reasons.


Thanks for responding, I am mostly the intermediary here,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
* Unless otherwise stated, opinions are my own.   *




Re: [HACKERS] ERROR: MultiXactId 3268957 has not been created yet -- apparent wraparound after missused pg_resetxlogs

2017-10-16 Thread Alvaro Herrera
RADIX Alain - externe wrote:

> I'm facing a problem with a PostgreSQL 9.6.2 reporting this error when 
> selecting a table
> ERROR:  MultiXactId 3268957 has not been created yet -- apparent wraparound
> 
> The root cause is not a Postgres bug but a buggy person that used 
> pg_resetxlogs obviously without reading or understanding the documentation.
> 
> I'm trying to extract data from the db to create a new one ;-)
> But pg_dump logically end with the same error.
> 
> I've seen the row with t_max  = 3268957 page_inspect, I might use it to 
> extract row from the page, but this will not be quite easy.

I'm not quite sure what it is that you want, other than to chastise the
person who deleted pg_wal.  Maybe pageinspect's heap_page_item_attrs()
function, new in 9.6, can help you extract data from tuples with
"corrupt" xmax.  Or perhaps it is easier to return the pg_control
multixact counters to the original values?
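
Something along these lines, for whatever it's worth (relation name and
block number invented; check the pageinspect docs for the exact
signature and output columns):

CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp, t_xmin, t_xmax, t_infomask, t_attrs
FROM heap_page_item_attrs(get_raw_page('damaged_table', 0),
                          'damaged_table'::regclass);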

> Is there a way to change the t_max to a value that won't fire the error, with 
> this row.

You could try setting it to 0, and/or set/reset the xmax flags in
infomask.

> Hopefully, this is not a production database :-D but the user want to
> start qualification of the new application version today .. no luck
> for him.

Hurray.  The other good news is that one of your engineers is now an
experienced person.

> At least, there is something to learn, renaming pg_xlog to pg_wal won't stop 
> everybody :-/

Yay?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-16 Thread Robert Haas
On Mon, Oct 16, 2017 at 2:36 AM, David Rowley
 wrote:
> I personally think it's great we're starting to see a useful feature
> materialise that can help with poor row estimates from the planner.

I agree.  My original post to this thread was more of a throw-away
comment than anything, and I'm not attacking the feature.  I didn't
think it was a very clear example and, TBH, I still don't.  But I
don't want to blow that up into a big debate on the virtues of this
feature, which I never intended to question, or on the correctness of
the patch, which I also did not intend to question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2017-10-16 Thread Tomas Vondra


On 10/16/2017 01:13 PM, Amit Kapila wrote:
> On Sat, Oct 14, 2017 at 11:44 PM, Tomas Vondra
>  wrote:
>> Hi,
>>
>>
>> I propose is to add a new cursor option (PARALLEL), which would allow
>> parallel plans for that particular user-defined cursor. Attached is an
>> experimental patch doing this (I'm sure there are some loose ends).
>>
> 
> How will this work for backward scans?
> 

It may not work, just like for regular cursors without SCROLL. And if
you specify SCROLL, then I believe a Materialize node will be added
automatically if needed, but I haven't tried.

> 
>>
>> Regarding (2), if the user suspends the cursor for a long time, bummer.
>> The parallel workers will remain assigned, doing nothing. I don't have
>> any idea how to get around that, but I don't see how we could do better.
>>
> 
> One idea could be that whenever someone uses Parallel option, we can 
> fetch and store all the data locally before returning control to the 
> user (something like WITH HOLD option).
> 

I don't like that very much, as it adds unnecessary overhead. As I said
before, I'm particularly interested in cursors returning large amounts
of data, so the overhead may be quite significant.
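
For context, intended usage looks roughly like this (hypothetical
syntax sketch only; the attached patch is authoritative and the keyword
placement may differ):

BEGIN;
DECLARE c PARALLEL CURSOR FOR
    SELECT * FROM big_table WHERE expensive_predicate(x);
FETCH 1000 FROM c;
FETCH 1000 FROM c;
CLOSE c;
COMMIT;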

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] [PATCH] Add recovery_min_apply_delay_reconnect recovery option

2017-10-16 Thread Eric Radman
This is a patch I am using in production using the following parameters
in recovery.conf:

recovery_min_apply_delay = '1d'
recovery_min_apply_delay_reconnect = '10 min'

In our environment we expect that standby servers with an apply delay
provide some protection against mistakes by the DBA (myself), and that
they contain a valid copy of the data that can be used in the event that
the master dies.

Does this feature seem applicable to a wider community?


== delay-reconnect-param ==

Add recovery_min_apply_delay_reconnect recovery option

'recovery_min_apply_delay_reconnect' allows an administrator to specify
how a standby using 'recovery_min_apply_delay' responds when streaming
replication is interrupted.

Combining these two parameters provides a fixed delay under normal
operation while maintaining some assurance that the standby contains an
up-to-date copy of the WAL.

This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.

Is it possible to verify the archive on disk independently of
application?  Adding a second delay parameter provides a workaround for
some use cases without complecting xlog.c.

 doc/src/sgml/recovery-config.sgml   | 24 
 src/backend/access/transam/xlog.c   | 59 ++-
 src/test/recovery/t/005_replay_delay.pl |  8 ++--
 3 files changed, 76 insertions(+), 15 deletions(-)

commit b8807b43c6a44c0d85a6a86c13b48b47f56ea45f
Author: Eric Radman 
Date:   Mon Oct 16 10:07:55 2017 -0400

Add recovery_min_apply_delay_reconnect recovery option

'recovery_min_apply_delay_reconnect' allows an administrator to specify
how a standby using 'recovery_min_apply_delay' responds when streaming
replication is interrupted.

Combining these two parameters provides a fixed delay under normal
operation while maintaining some assurance that the standby contains an
up-to-date copy of the WAL.

This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.

Is it possible to verify the archive on disk independently of
application?  Adding a second delay parameter provides a workaround for
some use cases without complecting xlog.c.

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0a5d086248..4f8823ee50 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -502,6 +502,30 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
   
  
 
+ 
+  recovery_min_apply_delay_reconnect (integer)
+  
+recovery_min_apply_delay_reconnect recovery parameter
+  
+  
+  
+   
+If streaming replication is interrupted while
+recovery_min_apply_delay is set, WAL records will be
+replayed from the archive. After all records have been processed from
+local disk, PostgreSQL will attempt to resume streaming
+and connect to the master.
+   
+   
+This parameter is used to compromise the fixed apply delay in order to
+re-establish streaming. In this way a standby server can be run in fair
+conditions with a long delay (hours or days) while specifying the
+maximum delay that can be expected before the WAL archive is brought
+back up to date with the master after a network failure.
+   
+  
+ 
+
  

 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a12a4..36a4779f70 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -267,6 +267,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static XLogRecPtr recoveryTargetLSN;
 static int recovery_min_apply_delay = 0;
+static int recovery_min_apply_delay_reconnect = 0;
 static TimestampTz recoveryDelayUntilTime;
 
 /* options taken from recovery.conf for XLOG streaming */
@@ -5227,6 +5228,7 @@ readRecoveryCommandFile(void)
   *head = NULL,
   *tail = NULL;
bool        recoveryTargetActionSet = false;
+   const char  *hintmsg;
 
 
fd = AllocateFile(RECOVERY_COMMAND_FILE, "r");
@@ -5452,8 +5454,6 @@ readRecoveryCommandFile(void)
}
else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
{
-   const char *hintmsg;
-
if (!parse_int(item->value, &recovery_min_apply_delay, GUC_UNIT_MS,
   &hintmsg))
ereport(ERROR,
@@ -5463,6 +5463,25 @@ 

Re: [HACKERS] Aggregate FILTER option is broken in v10

2017-10-16 Thread Tom Lane
I wrote:
> I think possibly the best answer is to revert 8ed3f11bb.  We could
> think about some compromise solution like merging the projections
> only for aggregates without FILTER.  But that would require two
> code paths through the relevant functions in nodeAgg.c, which would
> be a substantial maintenance burden; and the extra branches required
> means that this would be a net negative for performance in the
> simplest case with only one aggregate.

Hmm ... on closer inspection, the only performance-critical place
where this matters is advance_aggregates, and that already has a check
for whether the particular aggregate has a filter.  So we could do
something like

/* Skip anything FILTERed out */
if (filter)
{
// existing code to eval/check filter expr
+
+   /* Now it's safe to evaluate this agg's arguments */
+   slot = ExecProject(pertrans->argproj);
}
+   else
+   slot = aggstate->evalslot;

which seems like a pretty minimal extra cost for the normal case
with no filter.

Let me see what I can make of that ...

regards, tom lane




[HACKERS] ERROR: MultiXactId 3268957 has not been created yet -- apparent wraparound after missused pg_resetxlogs

2017-10-16 Thread RADIX Alain - externe
Hi,

I'm facing a problem with a PostgreSQL 9.6.2 reporting this error when 
selecting a table
ERROR:  MultiXactId 3268957 has not been created yet -- apparent wraparound

The root cause is not a Postgres bug but a buggy person that used pg_resetxlogs 
obviously without reading or understanding the documentation.

I'm trying to extract data from the db to create a new one ;-)
But pg_dump logically ends with the same error.

I've seen the row with t_xmax = 3268957 using pageinspect; I might use it to
extract rows from the page, but this will not be easy.

Is there a way to change the t_xmax to a value that won't fire the error for
this row?

Could pg_filedump be used in this problem configuration?

As usual, the xlogs were deleted even though they were stored in a directory
named pg_wal :( (I made the change before 10), and the xlogs were reset by the
same person who commented out the backup months ago.

Hopefully, this is not a production database :-D but the user wants to start
qualification of the new application version today ... no luck for him.

Any advice of how to extract data is welcome.

At least, there is something to learn, renaming pg_xlog to pg_wal won't stop 
everybody :-/

Regards.



Alain RADIX
Expert SGBD
EDF - DSP/CSP IT-DMA
Solutions Groupe EDF
Expertise Applicative
32 Avenue Pablo Picasso
92016 Nanterre
alain-externe.ra...@edf.fr
Tél. : 01.78.66.66.83

BAL Générique : 
support-oracle-nive...@edf.fr
BAL Générique : 
support-postgres-nive...@edf.fr
URL de notre communauté : https://sissi.edf.fr/web/expertise-sgbd/










This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.


[HACKERS] Aggregate FILTER option is broken in v10

2017-10-16 Thread Tom Lane
Consider

regression=# select sum(1/ten) filter (where ten>0) from tenk1;
ERROR:  division by zero

This query works without error in versions before 10.  It was evidently
broken by commit 8ed3f11bb, which rearranged nodeAgg.c to evaluate all
aggregate input expressions before considering any FILTERs.

This is not an acceptable behavioral change.  This sort of thing seems
like perhaps the primary use-case for FILTER.  It's stated to work by
our own manual --- see the last sentence in
https://www.postgresql.org/docs/devel/static/sql-expressions.html#syntax-express-eval
And it's required by the SQL spec, which states clearly that the
aggregate's input expression is only evaluated at rows for which
the filter expression yields true.  (In SQL:2011, see 10.9 <aggregate function>, general rules 4 and 5a.)
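
To spell that out: under the spec semantics neither of these may raise
a division-by-zero error, and for sum() the two are equivalent, since
the CASE rewrite yields NULL (ignored by the aggregate) for
filtered-out rows:

SELECT sum(1/ten) FILTER (WHERE ten > 0) FROM tenk1;
SELECT sum(CASE WHEN ten > 0 THEN 1/ten END) FROM tenk1;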

I think possibly the best answer is to revert 8ed3f11bb.  We could
think about some compromise solution like merging the projections
only for aggregates without FILTER.  But that would require two
code paths through the relevant functions in nodeAgg.c, which would
be a substantial maintenance burden; and the extra branches required
means that this would be a net negative for performance in the
simplest case with only one aggregate.  In any case, since that
patch went in before the v10 expression evaluation rewrite, I think
any argument that it's worth keeping would need to be made afresh.
The overhead that it was hoping to save should be much lower now.

Thoughts?

regards, tom lane




Re: [HACKERS] BLK_DONE state in XLogReadBufferForRedoExtended

2017-10-16 Thread Amit Kapila
On Fri, Oct 13, 2017 at 11:57 AM, Amit Kapila  wrote:
> On Fri, Oct 13, 2017 at 10:32 AM, Michael Paquier
>  wrote:
>> On Thu, Oct 12, 2017 at 10:47 PM, Amit Kapila  
>> wrote:
>>> Today, I was trying to think about cases when we can return BLK_DONE
>>> in XLogReadBufferForRedoExtended.  One thing that occurred to me is
>>> that it can happen during the replay of WAL if the full_page_writes is
>>> off.  Basically, when the full_page_writes is on, then during crash
>>> recovery, it will always first restore the full page image of page and
>>> then apply the WAL corresponding to that page, so we will never hit
>>> the case where the lsn of the page is greater than lsn of WAL record.
>>>
>>> Are there other cases for which we can get BLK_DONE state?  Is it
>>> mentioned somewhere in code/comments which I am missing?
>>
>> Remember the thread about meta page optimization... Some index AMs use
>> XLogInitBufferForRedo() to redo their meta pages and those don't have
>> a FPW so when doing crash recovery you may finish by not replaying a
>> meta page if its LSN on the page header is newer than what's being
>> replayed.
>>
>
> I think for metapage usage, it will anyway apply the changes.   See
> _bt_restore_page.
>
>> That's another case where BLK_DONE can be reached, even if
>> full_page_writes is on.
>>
>
> Yeah and probably if someone uses REGBUF_NO_IMAGE.  However, I was
> mainly thinking about cases where caller is not doing anything to
> prevent full_page_image explicitly.
>
>

If the above analysis is correct, then I think we can say that the row
state in a page will be the same during recovery as it was when the
original operation was performed, provided full_page_writes is enabled.
I am not sure how much this can help in the current heap format, but it
can help in zheap (an undo-based heap).

In zheap, we write the complete tuple to undo for a Delete operation
so that we can reclaim the corresponding tuple space as soon as the
deleting transaction is committed.  Now, during recovery, we have to
generate the complete undo record (which includes the entire tuple),
and for that we would ideally have to write the complete tuple to WAL;
instead of that, I think we can regenerate it from the original page.
This is only applicable when full_page_writes is enabled; otherwise,
the complete tuple is required in WAL.

I am not sure how much of the above makes sense without a detailed
explanation, but I thought I should give some background on why I
asked this question.  If anybody needs more explanation or sees any
fault in the above understanding, please let me know.
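
For context, a minimal sketch of how a redo routine typically
distinguishes these outcomes (the usual pattern around
XLogReadBufferForRedo; record-specific details elided):

    XLogRecPtr      lsn = record->EndRecPtr;
    XLogRedoAction  action;
    Buffer          buffer;

    action = XLogReadBufferForRedo(record, 0, &buffer);
    if (action == BLK_NEEDS_REDO)
    {
        Page    page = BufferGetPage(buffer);

        /* apply the changes described by this WAL record to the page */
        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    /* BLK_DONE: the page's LSN already covers this record, nothing to apply */
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);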


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] ERROR: MultiXactId 3268957 has not been created yet -- apparent wraparound after misused pg_resetxlog

2017-10-16 Thread alain radix
Hi,



I’m facing a problem with PostgreSQL 9.6.2, which reports this error when
selecting from a table:

ERROR:  MultiXactId 3268957 has not been created yet -- apparent wraparound



The root cause is not a Postgres bug but a buggy person who used
pg_resetxlog, obviously without reading or understanding the documentation.



I’m trying to extract data from the db to create a new one ;-)

But pg_dump logically ends with the same error.



I’ve seen the row with t_xmax = 3268957 using pageinspect; I might use it to
extract rows from the page, but this will not be quite easy.
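
For reference, a rough sketch of that pageinspect approach (the table name
and block number are placeholders):

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    SELECT lp, t_ctid, t_xmin, t_xmax, t_infomask
      FROM heap_page_items(get_raw_page('damaged_table', 0))
     WHERE t_xmax = 3268957;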



Is there a way to change the t_xmax of this row to a value that won’t trigger
the error?



Could pg_filedump be used in this situation?



As usual, the WAL files were deleted even though they were stored in a
directory named pg_wal (I made the change before 10), and the WAL was reset by
the same person who commented out the backup months ago.



Hopefully, this is not a production database :-D but the user wants to start
qualification of the new application version today... no luck for him.



Any advice on how to extract the data is welcome.



At least there is something to learn: renaming pg_xlog to pg_wal won’t
stop everybody :-/



Regards.


Alain Radix


Re: [HACKERS] Determine state of cluster (HA)

2017-10-16 Thread Jehan-Guillaume de Rorthais
On Mon, 16 Oct 2017 10:39:16 +0800
Craig Ringer  wrote:

> On 13 October 2017 at 08:50, Joshua D. Drake  wrote:

> > I had a long call with a firm developing front end proxy/cache/HA for
> > Postgres today. Essentially the software is a replacement for PGPool in
> > entirety but also supports analytics etc... When I was asking them about
> > pain points they talked about the below and I was wondering if this is a
> > problem we would like to solve.  
> 
> IMO: no one node knows the full state of the system, or can know it.

+1

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
> 
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

And some items on this list cannot be in core. However, there are some things
PostgreSQL can do to make HA easier to deal with.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
> 
> - Get info about master. We should finish merging recovery.conf into
> postgresql.conf.

agree. +1.

To make things easier from the "cluster manager" piece outside of PostgreSQL, I
would add: 

* being able to "demote" a master as a standby without restart.
* being able to check the status of each node without eating a backend
  connection (to avoid hitting "max_connection" limit)
* being able to monitor each step of a switchover (or "controlled
  failover": standby/master role swapping between two nodes)

> > b. Attempt to connect to the host directly, if not...
> > c. use the slave and use the hostname via dblink to connect to the master,
> > as the hostname , i.e. select * from dblink('" + connInfo + "
> > dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
> > This is necessary in the event the hostname used in the recovery.conf file
> > is not resolvable from the outside.  
> 
> OK, so "connect directly" here means from some 3rd party, the one
> doing the querying of the replica.

It seems to me the failover process is issuing all the required commands to
move the master role to another available standby. The knowledge of the
orchestration and the final status (if everything went well) is in this piece
of software. If you want to know where your master is in an exotic or complex
setup, ask whoever was responsible for promoting it.

HA should stay as simple as possible. The more complex the architecture, the
more failure scenarios you will have.

> > 1.  The dblink call doesn't have a way to specify a timeout, so we have to
> > use Java futures to control how long this may take to a reasonable amount of
> > time;  
> 
> statement_timeout doesn't work?
> 
> If not, that sounds like a sensible, separate feature to add. Patches welcome!
> 
> > 2.  NAT mapping may result in us detecting IP ranges that are not accessible
> > to the application nodes.  
> 
> PostgreSQL can't do anything about this one.

You could get the master IP address from the "pg_stat_wal_receiver" view, but
this is still not enough. You might have separate dedicated networks for
applications and for pgsql replication. If you want a standby to tell the
application where to connect to the master, then you'll have to put this
information somewhere yourself, accessible from the application nodes.
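
For example, on a standby something like this shows the connection string the
WAL receiver is actually using (which only covers the replication network):

    SELECT status, conninfo FROM pg_stat_wal_receiver;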

> > 3.  there is no easy way to monitor for state changes as they happen,
> > allowing faster failovers, everything has to be polled based on events;  

In the corosync world (the clustering piece of the Pacemaker ecosystem), node
failures are detected really fast, in about 1s.

Considering application failure (pgsql here), this will be polling, yes. But I
fail to imagine how a dying application could warn the cluster before dying.
Not only crashing (systemd could help there), but e.g. before entering an
infinite dummy loop or an exhausting one.

> It'd be pretty simple to write a function that sleeps in the backend
> until it's promoted. I don't know off the top of my head if we set all
> proc latches when we promote, but if we don't it's probably harmless
> and somewhat useful to do so.

As soon as the cluster manager has promoted a new master, it can trigger an
event to notify whatever you need.

> Either way, you'd do long-polling. Call the function and let the
> connection block until something interesting happens. Use TCP
> keepalives to make sure you notice if it dies. Have the function
> return when the state changes.

This would still rely on TCP keepalive frequency, back to polling :(

Regards,


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-16 Thread Ashutosh Bapat
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas  wrote:

>>
>> The fix is to copy the relevant partitioning information from relcache
>> into PartitionSchemeData and RelOptInfo. Here's a quick patch with
>> that fix.
>
> Committed.  I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

set_append_rel_size() crashes when it encounters a partitioned table
with a dropped column. Dropped columns do not have any translations
saved in AppendRelInfo::translated_vars; the corresponding entry is NULL
per make_inh_translation_list().
1802 att = TupleDescAttr(old_tupdesc, old_attno);
1803 if (att->attisdropped)
1804 {
1805 /* Just put NULL into this list entry */
1806 vars = lappend(vars, NULL);
1807 continue;
1808 }

In set_append_rel_size() we try to build attr_needed for child tables. While
doing so we try to translate a user attribute number of the parent to that
of a child and crash since the translated Var is NULL. Here's a patch to
fix the crash. The patch also contains a testcase covering dropped
columns in a partitioned table.

Sorry for not noticing this problem earlier.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From eca9e03410b049bf79d767657ad4d90fb974d19c Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Mon, 16 Oct 2017 13:23:49 +0530
Subject: [PATCH] Ignore dropped columns in set_append_rel_size().

A parent Var corresponding to column dropped from a parent table will
not have any translation for a child.  It won't have any attr_needed
information since it can not be referenced from SQL. Hence ignore
dropped columns while constructing attr_needed information for a child
table.

Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |   12 
 src/test/regress/expected/alter_table.out |7 +++
 src/test/regress/sql/alter_table.sql  |4 
 3 files changed, 23 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5535b63..1bc5e01 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -950,6 +950,18 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 	attno - 1);
 	int			child_index;
 
+	/*
+	 * Ignore any column dropped from the parent.
+	 * Corresponding Var won't have any translation. It won't
+	 * have attr_needed information, since it can not be
+	 * referenced in the query.
+	 */
+	if (var == NULL)
+	{
+		Assert(attr_needed == NULL);
+		continue;
+	}
+
 	child_index = var->varattno - childrel->min_attr;
 	childrel->attr_needed[child_index] = attr_needed;
 }
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index dbe438d..cc56651 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3604,6 +3604,13 @@ ALTER TABLE list_parted2 DROP COLUMN b;
 ERROR:  cannot drop column named in partition key
 ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
 ERROR:  cannot alter type of column named in partition key
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a 
+---
+(0 rows)
+
 -- cleanup
 DROP TABLE list_parted, list_parted2, range_parted;
 DROP TABLE fail_def_part;
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index 0c8ae2a..fc7bd66 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -2390,6 +2390,10 @@ ALTER TABLE part_2 INHERIT inh_test;
 ALTER TABLE list_parted2 DROP COLUMN b;
 ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
 
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+
 -- cleanup
 DROP TABLE list_parted, list_parted2, range_parted;
 DROP TABLE fail_def_part;
-- 
1.7.9.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2017-10-16 Thread Amit Kapila
On Sat, Oct 14, 2017 at 11:44 PM, Tomas Vondra
 wrote:
> Hi,
>
>
> I propose is to add a new cursor option (PARALLEL), which would allow
> parallel plans for that particular user-defined cursor. Attached is an
> experimental patch doing this (I'm sure there are some loose ends).
>

How will this work for backward scans?


>
> Regarding (2), if the user suspends the cursor for a long time, bummer.
> The parallel workers will remain assigned, doing nothing. I don't have
> any idea how to get around that, but I don't see how we could do better.
>

One idea could be that whenever someone uses the Parallel option, we
fetch and store all the data locally before returning control to the
user (something like the WITH HOLD option).
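
For comparison, the existing WITH HOLD behavior this would resemble
materializes the remaining result set at commit so the cursor can outlive the
transaction:

    BEGIN;
    DECLARE c CURSOR WITH HOLD FOR SELECT * FROM t;
    COMMIT;            -- result set is materialized here
    FETCH 10 FROM c;
    CLOSE c;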

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel safety for extern params

2017-10-16 Thread Amit Kapila
On Sat, Oct 14, 2017 at 1:51 AM, Robert Haas  wrote:
> On Fri, Oct 13, 2017 at 1:19 AM, Amit Kapila  wrote:
>> After fixing this problem, when I ran the regression tests with
>> force_parallel_mode = regress, I saw multiple other failures.  All the
>> failures boil down to two kinds of cases:
>>
>> 1. There was an assumption while extracting simple expressions that
>> the target list of gather node can contain constants or Var's.  Now,
>> once the above patch allows extern params as parallel-safe, that
>> assumption no longer holds true.  We need to handle params as well.
>> Attached patch fix_simple_expr_interaction_gather_v1.patch handles
>> that case.
>
> - * referencing the child node's output ... but setrefs.c might also have
> - * copied a Const as-is.
> + * referencing the child node's output or a Param... but setrefs.c might
> + * also have copied a Const as-is.
>
> I think the Param case should be mentioned after "... but" not before
> - i.e. referencing the child node's output... but setrefs.c might also
> have copied a Const or Param is-is.
>

I am not sure if we can write the comment like that ("... copied a Const
or Param as-is.") because fix_upper_expr_mutator in setrefs.c has
special handling for Var and Param, whereas constants are copied as-is
via expression_tree_mutator.  Also, as proposed, the handling for
Params in exec_save_simple_expr is more like that for Vars.

>> 2. We don't allow execution to use parallelism if the plan can be
>> executed multiple times.  This has been enforced in ExecutePlan, but
>> it seems like that we miss to handle the case where we are already in
>> parallel mode by the time we enforce that condition.  So, what
>> happens, as a result, is that it will allow to use parallelism when it
>> shouldn't (when the same plan is executed multiple times) and lead to
>> a crash.  One way to fix is that we temporarily exit the parallel mode
>> in such cases and reenter in the same state once the current execution
>> is over. Attached patch fix_parallel_mode_nested_execution_v1.patch
>> fixes this problem.
>
> This seems completely unsafe.  If somebody's already entered parallel
> mode, they are counting on having the parallel-mode restrictions
> enforced until they exit parallel mode.  We can't just disable those
> restrictions for a while in the middle and then put them back.
>

Right.

> I think the bug is in ExecGather(Merge): it assumes that if we're in
> parallel mode, it's OK to start workers.  But actually, it shouldn't
> do this unless ExecutePlan ended up with use_parallel_mode == true,
> which isn't quite the same thing.  I think we might need ExecutePlan
> to set a flag in the estate that ExecGather(Merge) can check.
>

The attached patch fixes the problem in the suggested way.
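
In rough outline (the field and flag names here are illustrative, not
necessarily what the attached patch uses):

    /* ExecutePlan(): remember whether this execution chose parallel mode */
    estate->es_use_parallel_mode = use_parallel_mode;
    if (use_parallel_mode)
        EnterParallelMode();

    /* ExecGather()/ExecGatherMerge(): launch workers only if this execution
     * opted in, not merely because IsInParallelMode() happens to be true */
    if (gather->num_workers > 0 && estate->es_use_parallel_mode)
    {
        /* ... register and launch parallel workers ... */
    }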



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_parallel_mode_nested_execution_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Determine state of cluster (HA)

2017-10-16 Thread Magnus Hagander
On Mon, Oct 16, 2017 at 4:39 AM, Craig Ringer  wrote:

> On 13 October 2017 at 08:50, Joshua D. Drake  wrote:
> > 5.  There is no way to connect to a db node with something akin to
> > SQL-Server's "application intent" flags, to allow a connection to be
> > rejected if we wish it to be a read/write connection.  This helps detect
> the
> > state of the node directly without having to ask any further questions of
> > the node, and makes it easier to "stall" during connection until a proper
> > connection can be made.
>
> That sounds desirable, and a good step toward eventually being able to
> transparently re-route read/write queries from replica to master.
> Which is where I'd like to land up eventually.
>

It also sounds a lot like the connection parameter target_session_attrs,
does it not? We don't reroute active connections based on it, and we're not
smart enough to do anything beyond "try them one by one until you reach the
one with the correct attributes", but the basic functionality is there.
Basically what we already have fulfills what JD is suggesting, but not what
Craig is, if I understand it correctly.
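
For example (host names are placeholders), libpq already accepts:

    psql "host=node1,node2,node3 dbname=app target_session_attrs=read-write"

and tries the listed hosts in order, only settling on a server where SHOW
transaction_read_only reports off.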

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] relkind check in DefineIndex

2017-10-16 Thread Michael Paquier
On Mon, Oct 16, 2017 at 2:56 PM, Amit Langote
 wrote:
> On 2017/10/14 4:32, Robert Haas wrote:
>> On Fri, Oct 13, 2017 at 12:38 PM, Alvaro Herrera
>>  wrote:
>>> The relkind check in DefineIndex has grown into an ugly rats nest of
>>> 'if' statements.  I propose to change it into a switch, as per the
>>> attached.
>>
>> wfm
>
> +1

+1. There are as well CreateTrigger(), analyze_rel(), and
ATRewriteCatalogs() that do similar things, but those are not using
multiple layers of checks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-16 Thread Ashutosh Bapat
On Mon, Oct 16, 2017 at 2:36 PM, Ashutosh Bapat
 wrote:

>
> Probably we should move changes to partition_bounds_copy() in 0003 to
> 0001, whose name suggests so.
>

We can't do this; the hash partition strategy is introduced by 0002. Sorry
for the noise.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-10-16 Thread Masahiko Sawada
On Fri, Oct 13, 2017 at 1:14 AM, Robert Haas  wrote:
> On Tue, Oct 10, 2017 at 5:55 AM, Pavel Golub  wrote:
>> DP> The new status of this patch is: Ready for Committer
>>
>> Seems  like,  we  may  also  going to hit it and it would be cool this
>> vacuum issue solved for next PG version.
>
> Exactly which patch on this thread is someone proposing for commit?
>

I guess that is the patch I proposed. However, I think there is still
room for discussion because the patch cannot skip the cleanup vacuum
during an aggressive vacuum, which is one of the situations where I
really wanted to skip it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-16 Thread Ashutosh Bapat
On Tue, Oct 10, 2017 at 4:37 PM, amul sul  wrote:
> On Tue, Oct 10, 2017 at 3:42 PM, Ashutosh Bapat
>  wrote:
>> On Tue, Oct 10, 2017 at 3:32 PM, amul sul  wrote:
>>
 +hash_part? true : 
 key->parttypbyval[j],
 +key->parttyplen[j]);
 parttyplen is the length of partition key attribute, whereas what you want 
 here
 is the length of type of modulus and remainder. Is that correct? Probably 
 we
 need some special handling wherever parttyplen and parttypbyval is used 
 e.g. in
 call to partition_bounds_equal() from build_joinrel_partition_info().

>>>
>>> Unless I am missing something, I don't think we should worry about 
>>> parttyplen
>>> because in the datumCopy() when the datatype is pass-by-value then typelen
>>> is ignored.
>>
>> That's true, but it's ugly, passing typbyvalue of one type and len of other.
>>
>
> How about the attached patch(0003)?
> Also, the dim variable is renamed to natts.

Probably we should move changes to partition_bounds_copy() in 0003 to
0001, whose name suggests so.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-10-16 Thread Michael Paquier
On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund  wrote 
> in <20170906192353.ufp2dq7wm5fd6...@alap3.anarazel.de>
>> I'm not following. All we need to use is the beginning of the relevant
>> records, that's easy enough to keep track of. We don't need to read the
>> WAL or anything.
>
> The beginning is already tracked and nothing more to do.

I have finally allocated time to look at your newly-proposed patch,
sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
allow the debugging tool to compile :)

> The first *problem* was WaitForWALToBecomeAvailable requests the
> beginning of a record, which is not on the page the function has
> been told to fetch. Still tliRecPtr is required to determine the
> TLI to request, it should request RecPtr to be streamed.

[...]

> The rest to do is let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader@xlogreader.c
> public (and renamed to XLogReaderValidatePageHeader).
>
> The patch attached fixes the problem and passes recovery
> tests. However, the test for this problem is not added. It needs
> to go to the last page in a segment then put a record continues
> to the next segment, then kill the standby after receiving the
> previous segment but before receiving the whole record.

+typedef struct XLogPageHeaderData *XLogPageHeader;
[...]
+/* Validate a page */
+extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
+   XLogRecPtr recptr, XLogPageHeader hdr);
Instead of exposing XLogPageHeaderData, I would recommend just using
char* and remove this declaration. The comment on top of
XLogReaderValidatePageHeader needs to make clear what caller should
provide.

+if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
+  (XLogPageHeader) readBuf))
+goto next_record_is_invalid;
+
[...]
-ptr = tliRecPtr;
+ptr = RecPtr;
 tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);

 if (curFileTLI > 0 && tli < curFileTLI)
The core of the patch is here (the patch has no comments, so it is hard
to understand the point of what is being done). If I understand it
correctly, you allow the receiver to fetch the portions of a record
spanning multiple segments from different sources, and I am not sure
that we'd want to break that promise.
Shouldn't we instead have the receiver side track the beginning of the
record and send that position for the physical slot's restart_lsn?
This way the receiver would retain WAL segments from the real
beginning of a record. restart_lsn for replication slots is set when
processing the standby message in ProcessStandbyReplyMessage(), currently
using the flush LSN, so a more correct value should be provided using
that. Andres, what's your take on the matter?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Lockable views

2017-10-16 Thread Yugo Nagata
On Mon, 16 Oct 2017 10:07:48 +0900 (JST)
Tatsuo Ishii  wrote:

> >> >> test=# CREATE VIEW v3 AS SELECT count(*) FROM v1;
> >> >> CREATE VIEW
> >> >> test=# BEGIN;
> >> >> BEGIN
> >> >> test=# LOCK TABLE v3;
> >> >> ERROR:  cannot lock view "v3"
> >> >> DETAIL:  Views that return aggregate functions are not automatically 
> >> >> updatable.
> >> > 
> >> > It would be nice if the message would be something like:
> >> > 
> >> > DETAIL:  Views that return aggregate functions are not lockable
> > 
> > This uses messages from view_query_is_auto_updatable() of the rewrite 
> > system directly. 
> > Although we can modify the messages, I think it is not necessary for now
> > since we can lock only automatically updatable views.
> 
> You could add a flag to view_query_is_auto_updatable() to switch the
> message between
> 
>  DETAIL:  Views that return aggregate functions are not automatically 
> updatable.
> 
> and
> 
>  DETAIL:  Views that return aggregate functions are not lockable

OK. I'll change view_query_is_auto_updatable() so.

> 
> >> > I wonder if we should lock tables in a subquery as well. For example,
> >> > 
> >> > create view v1 as select * from t1 where i in (select i from t2);
> >> > 
> >> > In this case should we lock t2 as well?
> >> 
> >> Current the patch ignores t2 in the case above.
> >> 
> >> So we have options below:
> >> 
> >> - Leave as it is (ignore tables appearing in a subquery)
> >> 
> >> - Lock all tables including in a subquery
> >> 
> >> - Check subquery in the view definition. If there are some tables
> >>   involved, emit an error and abort.
> >> 
> >> The first one might be different from what users expect. There may be
> >> a risk that the second one could cause deadlock. So it seems the third
> >> one seems to be the safest IMO.
> > 
> > Make sense. Even if the view is locked, when tables in a subquery is
> > modified, the contents of view can change. To avoid it, we have to
> > lock tables, or give up to lock such views. 
> > 
> > We can say the same thing for functions in a subquery. If the definition
> > of the functions are changed, the result of the view can change.
> > We cannot lock functions, but should we abtain row-level lock on pg_proc
> > in such cases? (of cause, or give up to lock such views)
> 
> I think we don't need to care about function definition changes used
> in where clauses in views. None view queries using functions do not
> care about the definition changes of functions while executing the
> query. So why updatable views need to care them?

I'm a bit confused. What is the difference between tables and functions
in a subquery with regard to view locking? I think non-view queries
using a subquery also do not care about changes to tables in the
subquery while executing the query. I might be misunderstanding
the problem you mentioned.

BTW, I found that if we have to handle subqueries in the WHERE clause, we would
also have to care about subqueries in the target list... The view defined
below is also updatable.

 =# create view v7 as select (select * from tbl2 limit 1) from tbl;
 

> 
> > BTW, though you mentioned the risk of deadlocks, even when there
> > are no subquery, deadlock can occur in the current patch.
> > 
> > For example, lock a table T in Session1, and then lock a view V
> > whose base relation is T in Session2. Session2 will wait for 
> > Session1 to release the lock on T. After this, when Session1 try to
> > lock view V, the deadlock occurs and the query is canceled.
> 
> You are right. Dealocks could occur in any case.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp


-- 
Yugo Nagata 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-16 Thread David Rowley
On 15 October 2017 at 06:49, Robert Haas  wrote:
> On Fri, Oct 13, 2017 at 4:49 PM, David Rowley
>  wrote:
>> tps = 8282.481310 (including connections establishing)
>> tps = 8282.750821 (excluding connections establishing)
>
> vs.
>
>> tps = 8520.822410 (including connections establishing)
>> tps = 8521.132784 (excluding connections establishing)
>>
>> With the patch we are making use of the extended statistics, which we
>> do expect to be more work for the planner. Although, we didn't add
>> extended statistics to speed up the planner.
>
> Sure, I understand.  That's actually a pretty substantial regression -
> I guess that means that it's pretty important to avoid creating
> extended statistics that are not needed, at least for short-running
> queries.

To be honest, I ran that on a VM on my laptop. I was getting quite a
bit of noise. I just posted that to show that the 12x slowdown didn't
exist. I don't know what the actual slowdown is. I just know extended
stats are not free and that nobody expected that they ever would be.
The good news is that they're off by default, and if the bad ever
outweighs the good then the fix for that starts with "DROP STATISTICS".
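
For example (object names are illustrative):

    CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
    ANALYZE t1;
    -- and if it ever costs more than it helps:
    DROP STATISTICS s1;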

I personally think it's great we're starting to see a useful feature
materialise that can help with poor row estimates from the planner.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-16 Thread Jeevan Chalke
On Fri, Oct 13, 2017 at 1:13 PM, David Rowley 
wrote:

>
> I looked over the patch and saw this:
>
> @@ -1800,6 +1827,9 @@ cost_merge_append(Path *path, PlannerInfo *root,
>   */
>   run_cost += cpu_operator_cost * tuples;
>
> + /* Add MergeAppend node overhead like we do it for the Append node */
> + run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
> +
>   path->startup_cost = startup_cost + input_startup_cost;
>   path->total_cost = startup_cost + run_cost + input_total_cost;
>  }
>
> You're doing that right after a comment that says we don't do that. It
> also does look like the "run_cost += cpu_operator_cost * tuples" is
> trying to do the same thing, so perhaps it's worth just replacing
> that, which by default will double that additional cost, although
> doing so would have the planner slightly prefer a MergeAppend to an
> Append than previously.
>

I think we can remove that code block entirely; I have already added relevant
comments around DEFAULT_APPEND_COST_FACTOR.
However, I am not sure about doing this since, as you correctly said, it may
make the planner prefer a MergeAppend to an Append. Will it be fine if we
remove that code block?


> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I don't really think the DEFAULT_APPEND_COST_FACTOR adds much. it
> means very little by itself. It also seems that most of the other cost
> functions just use the magic number.
>

Agreed, but each of those magic numbers is used only once in its place, whereas
here there are two places. So if someone wants to update it, they need to make
sure to update it in both places. To minimize that risk, having a #define seems
better.


>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company