Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-18 Thread Andreas Seltenreich
Pavan Deolasee writes:

> Thanks for doing those tests. I've just sent v16a version of the patch and
> I think it fixes the issues reported so far. Can you please recheck? Please
> let me know if there are other issues detected by sqlsmith or otherwise.

I re-did the testing with merge_v16a applied to master at 7923118c16
with ad7dbee368a reverted because of conflicts.  I can confirm that the
previous testcases don't fail anymore, but sqlsmith readily triggers the
following assertion:

TRAP: FailedAssertion("!(mergeTargetRelation > 0)", File: "planner.c",
Line: 1496)

Testcase attached.

regards,
Andreas


merge-v16a-trap-planner.c.sql
Description: application/sql


Re: [HACKERS] Can ICU be used for a database's default sort order?

2018-02-18 Thread Michail Nikolaev
Hello.

Just to let you know:
I have run the check, installcheck, plcheck, contribcheck, modulescheck,
ecpgcheck, isolationcheck, and upgradecheck tests on Windows 10 with VC2017,
with the patch applied on top of
2a41507dab0f293ff241fe8ae326065998668af8, as Andrey asked.

Everything passes with and without $config->{icu} =
'D:\Dev\postgres\icu\';

Best regards,
Michail.


On Fri, 16 Feb 2018 at 11:13, Andrey Borodin wrote:

> Hi everyone!
>
> > On 10 Feb 2018, at 20:45, Andrey Borodin wrote:
> >
> > I'm planning to provide review
> >
>
> So, I was looking into the patch.
> The patch adds:
> 1. Ability to specify collation provider (with version) in --locale for
> initdb and createdb.
> 2. Changes to locale checks
> 3. Sets ICU as the default collation provider. For example,
> "ru_RU@icu.153.80.32.1" is the default on my machine with the patch.
> 4. Tests and necessary changes to documentation
>
> With the patch I get correct ICU ordering by default:
> postgres=# select unnest(array['е','ё','ж']) order by 1;
>  unnest
> 
>  е
>  ё
>  ж
> (3 rows)
>
> While the libc locale provides an incorrect order (I also get the same
> ordering by default without the patch):
>
> postgres=# select c from unnest(array['е','ё','ж']) c order by c collate
> "ru_RU";
>  c
> ---
>  е
>  ж
>  ё
> (3 rows)
>
>
> Unfortunately, neither "ru_RU@icu.153.80.32.1" (exposed by LC_COLLATE and
> other places) nor "ru_RU@icu" can be used in the COLLATE SQL clause.
> Also, the patch removes compatibility with MSVC 1800 (Visual Studio 2013) on
> Windows XP and Windows Server 2003. This is done to use newer
> locale-related functions in the VS2013 build.
>
> If the database was initialized with the default locale without this patch,
> one cannot connect to it anymore:
> psql: FATAL:  could not find out the collation provider for datcollate
> "ru_RU.UTF-8" of database "postgres"
> This problem is mentioned in the commit message of the patch. I think
> it should be addressed somehow.
> What do you think?
>
> Overall, the patch looks like solid and thoughtful work and adds important
> functionality.
>
> Best regards, Andrey Borodin.
>


Re: missing toast table for pg_policy

2018-02-18 Thread Joe Conway
On 02/17/2018 11:39 AM, Tom Lane wrote:
> Joe Conway  writes:
>> Yes, exactly. I'm fine with not backpatching, just wanted to raise the
>> possibility. I will push later today to HEAD (with a catalog version bump).
> 
> BTW, I was wondering if it'd be a good idea to try to forestall future
> oversights of this sort by adding a test query in, say, misc_sanity.sql.
> Something like
> 
> select relname, attname, atttypid::regtype
> from pg_class c join pg_attribute a on c.oid = attrelid
> where c.oid < 16384 and reltoastrelid=0 and relkind = 'r' and attstorage != 'p'
> order by 1,2;
> 
> If you try that you'll see the list is quite long:



> I think in most of these cases we've consciously decided not to toast-ify,
> but maybe some of them need a second look.

Is there really a compelling reason to not just create toast tables for
all system catalogs as in the attached? Then we could just check for 0
rows in misc_sanity.sql.

For what it's worth:

HEAD

# du -h --max-depth=1 $PGDATA
[...]
22M     /usr/local/pgsql-head/data/base
[...]
584K    /usr/local/pgsql-head/data/global
[...]
38M     /usr/local/pgsql-head/data

time make check
real    0m16.295s
user    0m3.597s
sys     0m1.465s


with patch

# du -h --max-depth=1 $PGDATA
[...]
23M     /usr/local/pgsql-head/data/base
[...]
632K    /usr/local/pgsql-head/data/global
[...]
39M     /usr/local/pgsql-head/data

time make check
real    0m16.462s
user    0m3.521s
sys     0m1.531s

select relname, attname, atttypid::regtype
from pg_class c join pg_attribute a on c.oid = attrelid
where c.oid < 16384 and reltoastrelid=0 and relkind = 'r' and attstorage != 'p'
order by 1,2;
 relname | attname | atttypid
---------+---------+----------
(0 rows)


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 809749add9..813b3b87cc 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -258,7 +258,19 @@ IsSharedRelation(Oid relationId)
 		relationId == PgDbRoleSettingToastTable ||
 		relationId == PgDbRoleSettingToastIndex ||
 		relationId == PgShseclabelToastTable ||
-		relationId == PgShseclabelToastIndex)
+		relationId == PgShseclabelToastIndex ||
+		relationId == PgAuthidToastTable ||
+		relationId == PgAuthidToastIndex ||
+		relationId == PgDatabaseToastTable ||
+		relationId == PgDatabaseToastIndex ||
+		relationId == PgPlTemplateToastTable ||
+		relationId == PgPlTemplateToastIndex ||
+		relationId == PgReplicationOriginToastTable ||
+		relationId == PgReplicationOriginToastIndex ||
+		relationId == PgSubscriptionToastTable ||
+		relationId == PgSubscriptionToastIndex ||
+		relationId == PgTablespaceToastTable ||
+		relationId == PgTablespaceToastIndex)
 		return true;
 	return false;
 }
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index f6387ae143..3ef3990ea3 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -46,25 +46,64 @@ extern void BootstrapToastTable(char *relName,
  */
 
 /* normal catalogs */
+DECLARE_TOAST(pg_aggregate, 4139, 4140);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
+DECLARE_TOAST(pg_attribute, 4141, 4142);
+DECLARE_TOAST(pg_class, 4143, 4144);
+DECLARE_TOAST(pg_collation, 4145, 4146);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
+DECLARE_TOAST(pg_default_acl, 4147, 4148);
 DECLARE_TOAST(pg_description, 2834, 2835);
+DECLARE_TOAST(pg_event_trigger, 4149, 4150);
+DECLARE_TOAST(pg_extension, 4151, 4152);
+DECLARE_TOAST(pg_foreign_data_wrapper, 4153, 4154);
+DECLARE_TOAST(pg_foreign_server, 4155, 4156);
+DECLARE_TOAST(pg_foreign_table, 4157, 4158);
+DECLARE_TOAST(pg_index, 4159, 4160);
+DECLARE_TOAST(pg_init_privs, 4161, 4162);
+DECLARE_TOAST(pg_language, 4163, 4164);
+DECLARE_TOAST(pg_largeobject, 4165, 4166);
+DECLARE_TOAST(pg_largeobject_metadata, 4167, 4168);
+DECLARE_TOAST(pg_namespace, 4169, 4170);
+DECLARE_TOAST(pg_partitioned_table, 4171, 4172);
+DECLARE_TOAST(pg_policy, 4173, 4174);
 DECLARE_TOAST(pg_proc, 2836, 2837);
 DECLARE_TOAST(pg_rewrite, 2838, 2839);
 DECLARE_TOAST(pg_seclabel, 3598, 3599);
 DECLARE_TOAST(pg_statistic, 2840, 2841);
 DECLARE_TOAST(pg_statistic_ext, 3439, 3440);
 DECLARE_TOAST(pg_trigger, 2336, 2337);
+DECLARE_TOAST(pg_ts_dict, 4175, 4176);
+DECLARE_TOAST(pg_type, 4177, 4178);
+DECLARE_TOAST(pg_user_mapping, 4179, 4180);
 
 /* shared catalogs */
-DECLARE_TOAST(pg_shdescription, 2846, 2847);
-#define PgShdescriptionToastTable 2846
-#define PgShdescriptionToastIndex 2847
+DECLARE_TOAST(pg_authid, 4181, 4182);
+#define PgAuthidToastTable 4181
+#define PgAuthidToastIndex 4182
+DECLARE_TOAST(pg_database, 4183, 4184);	
+#define PgDatabaseToastTable 4183
+#define PgDatabaseToastIndex 4184
 DECLARE_TOAST(pg_db_role_setting, 2966, 2967);
 #define PgDbRoleSettingToastTable 2966
#define PgDbRoleSettingToastIndex 2967

Re: missing toast table for pg_policy

2018-02-18 Thread Tom Lane
Joe Conway  writes:
> Is there really a compelling reason to not just create toast tables for
> all system catalogs as in the attached?

What happens when you VACUUM FULL pg_class?  (The associated toast table
would have to be nonempty for the test to prove much.)

I'm fairly suspicious of toasting anything that the toast mechanism itself
depends on, actually, and that would include at least pg_attribute and
pg_index as well as pg_class.  Maybe we could get away with it because
there would never be any actual recursion, only potential recursion ...
but it seems scary.

On the whole, I'm dubious that the risk/reward ratio is attractive here.

regards, tom lane



Re: different results from plpgsql functions related to last changes in master

2018-02-18 Thread Tom Lane
Pavel Stehule  writes:
> I did an update of plpgsql_check and I see that some functions return
> different results than on older PostgreSQL. Probably this is the desired
> behavior, but it should be mentioned as a partial compatibility break,
> because some regression tests can be broken too.

This is mentioned in the relevant commit message (4b93f5799):

...  A lesser, but still real, annoyance is that ROW format cannot
represent a true NULL composite value, only a row of per-field NULL
values, which is not exactly the same thing.

In the case you're showing here, a true NULL got changed into ROW(NULL)
by the old code, but that no longer happens.
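
To illustrate with a made-up composite type: IS NULL reports true for both
forms, but converting to text shows they're not interchangeable:

create type pair as (a int, b int);

select null::pair is null            as true_null,     -- t
       null::pair::text              as text_form;     -- NULL

select row(null, null)::pair is null as fields_null,   -- t, per SQL rules
       row(null, null)::pair::text   as text_form;     -- "(,)"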

regards, tom lane



Re: different results from plpgsql functions related to last changes in master

2018-02-18 Thread Pavel Stehule
2018-02-18 17:48 GMT+01:00 Tom Lane :

> Pavel Stehule  writes:
> > I did an update of plpgsql_check and I see that some functions return
> > different results than on older PostgreSQL. Probably this is the desired
> > behavior, but it should be mentioned as a partial compatibility break,
> > because some regression tests can be broken too.
>
> This is mentioned in the relevant commit message (4b93f5799):
>
> ...  A lesser, but still real, annoyance is that ROW format cannot
> represent a true NULL composite value, only a row of per-field NULL
> values, which is not exactly the same thing.
>
> In the case you're showing here, a true NULL got changed into ROW(NULL)
> by the old code, but that no longer happens.
>

I understand, and I have no problem with this behavior. I am just
expecting that a lot of people will be surprised.

Regards

Pavel


>
> regards, tom lane
>


Re: [HACKERS] Bug in to_timestamp().

2018-02-18 Thread Dmitry Dolgov
> On 17 February 2018 at 10:02, Arthur Zakirov wrote:
>
> I think it could be fixed by another patch. But I'm not sure that it
> will be accepted as well as this patch :).

Why do you think there should be another patch for it?


Re: missing toast table for pg_policy

2018-02-18 Thread Joe Conway
On 02/18/2018 11:18 AM, Tom Lane wrote:
> Joe Conway  writes:
>> Is there really a compelling reason to not just create toast tables for
>> all system catalogs as in the attached?
> 
> What happens when you VACUUM FULL pg_class?  (The associated toast table
> would have to be nonempty for the test to prove much.)

I tried this:
create table foo (id int);
do $$declare i int; begin for i in 1..1000 loop execute 'create user u' || i; end loop; end;$$;
do $$declare i int; begin for i in 1..1000 loop execute 'grant all on foo to u' || i; end loop; end;$$;
vacuum full pg_class;

Worked without issue FWIW.

> I'm fairly suspicious of toasting anything that the toast mechanism itself
> depends on, actually, and that would include at least pg_attribute and
> pg_index as well as pg_class.  Maybe we could get away with it because
> there would never be any actual recursion only potential recursion ...
> but it seems scary.

Well, that is the other approach we could pursue -- instead of justifying
which system catalogs need toast tables, we could create an exclusion
list of those that should not have them, with the current candidates
being pg_class, pg_attribute, and pg_index.
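
A sketch of what the misc_sanity.sql check could then look like (the
excluded names below are just the current candidates mentioned above):

select relname, attname, atttypid::regtype
from pg_class c join pg_attribute a on c.oid = attrelid
where c.oid < 16384 and reltoastrelid = 0 and relkind = 'r'
  and attstorage != 'p'
  and relname not in ('pg_class', 'pg_attribute', 'pg_index')
order by 1, 2;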

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Cache lookup errors with functions manipulation object addresses

2018-02-18 Thread Michael Paquier
On Sat, Feb 17, 2018 at 07:17:24PM -0300, Alvaro Herrera wrote:
> Pushed 0003.

Thanks.

> Maybe we can get rid of format_type_be_qualified too, but I didn't
> care too much about it either; it's not a big deal I think. 

Agreed.  There are 16 callers spread across objectaddress.c and regproc.c,
so this would generate some diffs.  If there are extra opinions later on,
we can always revisit that.  The new API is modular enough anyway.

> What interested me more was whether we could get rid of the
> FORMAT_TYPE_TYPEMOD_GIVEN flag, but ended up deciding not to pursue that
> as a phenomenal waste of time.  Here are some references in case you
> care.
> 
> https://www.postgresql.org/message-id/flat/20001659.fAAGxKX06044%40postgresql.org
> https://git.postgresql.org/pg/commitdiff/a585c20d12d0e22befc8308e9f8ccb6f54a5df69

Thanks for the threads, I didn't know about them.  I thought as well
about trying to remove FORMAT_TYPE_TYPEMOD_GIVEN but avoided doing so,
so as not to break long-standing behavior; this code has been the way it
is now for at least as long as I have been working on Postgres.  I didn't
check the git history for the logic behind the code, though, which I
really should have done.  So thanks for the references.
--
Michael


signature.asc
Description: PGP signature


Re: TODO item: WAL replay of CREATE TABLESPACE with differing directory structure

2018-02-18 Thread Patrick Krecker
On Tue, Feb 13, 2018 at 8:24 PM, Michael Paquier  wrote:
> On Tue, Feb 13, 2018 at 01:44:34PM -0800, Patrick Krecker wrote:
>> I am searching for a way to make a contribution to Postgres and I came
>> across this TODO item (I realize there has been some controversy
>> around the TODO list [1], and I hope that my use of it doesn't spark
>> another discussion about removing it altogether):
>
> Well, it will point out again that TODO items are hard, complicated and
> mostly impossible projects.
>
>> "Allow WAL replay of CREATE TABLESPACE to work when the directory
>> structure on the recovery computer is different from the original"
>>
>> Currently it looks like tablespaces have to live inside the data
>> directory on the replica, notwithstanding administrator intervention
>> by manipulating the tablespace directory with symlinks after (or even
>> before?) it has been created via replay.
>
> Let's be clear here. There is no hard restriction with tablespace paths
> within the data directory, though you should not do that, and you get a
> nice warning when trying to do so with CREATE TABLESPACE (see 33cb8ff6).
> This also causes pg_basebackup to fail.  It is also bad design to create
> tablespaces within the data directory as those are aimed at making hot
> paths work on different partitions with different I/O properties.

Sorry, my language was imprecise here. What I meant is that the
pg_tblspc directory contains no symlinks when a tablespace
creation is streamed to a replica, i.e. the data files reside within
pg_tblspc on the replica.
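
For reference, where each tablespace resolves on a given node can be checked
with something like the query below (the built-in pg_default and pg_global
report an empty location because they live inside the data directory):

select spcname, pg_tablespace_location(oid) as location
from pg_tablespace;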

>> Is the idea behind this task to allow the master to instruct the
>> replica where to put the tablespace on its filesystem, so as to allow
>> it to live outside of the data directory without direct manipulation
>> of the filesystem?
>
> WAL records associated to CREATE TABLESPACE (xl_tblspc_create_rec)
> register the location where a tablespace is located.  The location of a
> tablespace is not saved in the system catalogs, which offers flexibility
> in the way the symlink from pg_tblspc can be handled.  This is where the
> tablespace path remapping of pg_basebackup becomes handy, because you
> can repurpose paths easily when taking a base backup, but this forces
> you to create tablespaces first, and then create standbys.  We have also
> a set of existing problems:
> 1) If a primary and its standby are on the same server and you issue a
> CREATE TABLESPACE, then they would try to write to the same paths.
> 2) How do we design at DDL level a command which allows for specifying
> different paths depending on the node where the recovery happens.
>
> You would need in both cases a sort of ability to define a node name, so
> as for 1) you append the node name to the path and both primary and
> standby can use the same tablespace path, but with different sub-paths.
> And for 2), you can enforce a patch name by defining as well a path
> associated to a node name so as when xl_tblspc_create_rec records are
> replayed at recovery, you know which path to create.  Just designing
> that the right way as its own set of complications.
>
>> If this task is a worthwhile endeavor, I would be happy to take it on.
>> If not, I am open to other ideas :)
>
> This is part of the difficult, perhaps-not-worth-doing, impossible
> problems.  As a first contribution, you may want something easier.

Thank you for the response. I would suggest that we link to it from
the wiki so as to provide clarification to future readers of the todo
list.

> --
> Michael



Re: ALTER TABLE ADD COLUMN fast default

2018-02-18 Thread Andrew Dunstan
On Sun, Feb 18, 2018 at 2:52 PM, David Rowley
 wrote:
> On 17 February 2018 at 10:46, Andrew Dunstan
>  wrote:
>> The attached version largely fixes the performance degradation
>> that Tomas Vondra reported, replacing an O(N^2) piece of code in
>> slot_getmissingattrs() by one that is O(N).
>
> I've looked over this patch and had a play around with it.
>
> Just a couple of things that I noticed while reading:
>

[ various typos ]

>
> I'll try to spend some time reviewing the code in detail soon.
>


Thanks. I'll fix the typos in the next version of the patch, hopefully
after your review.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2018-02-18 Thread Tomas Vondra
Hi,

On 01/25/2018 01:55 PM, David Rowley wrote:
> I've attached an updated patch which takes care of the build failure 
> caused by the new call to create_append_path added in bb94ce4.
> 

I've been looking at the patch over the past few days. I haven't found
any obvious issues with it, but I do have a couple of questions.

1) I can confirm that it indeed eliminates the Append overhead, using
the example from [1], modified to use a table with a single partition. Of
course, this is a pretty formal check, because the patch simply removes
the Append node and so it'd be utterly broken if the overhead did not
disappear too.

[1]
https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com


2) If I had to guess where the bugs will be, my guess would be a missing
finalize_plan_tlist call - not necessarily in existing code, but perhaps
in future improvements.

I wonder if we could automate this call somehow, say by detecting when
create_plan_recurse is called after build_path_tlist (for a given path),
and if needed calling finalize_plan_tlist from create_plan_recurse.

Although, now that I look at it, promote_child_relation may be doing
something like that by adding stuff to the translate_* variables. I'm
not quite sure why we need to append to the lists, though?


3) I'm not quite sure what replace_translatable_exprs does, or more
precisely why do we actually need it. At this point the comment pretty
much says "We need to do translation. This function does translation."
Perhaps it should mention why removing/skipping the AppendPath implies
the need for translation or something?


4) The thread talks about "[Merge]Append" but the patch apparently only
tweaks create_append_path and does absolutely nothing for the "merge" case.
While looking into why, I noticed that generate_mergeappend_paths always
gets all_child_pathkeys=NULL in this case (with a single subpath), and so
we never create the MergeAppend path at all. I suppose there's some
condition somewhere responsible for doing that, but I haven't found it ...


5) The comment before AppendPath typedef says this:

 * An AppendPath with a single subpath can be set up to become a "proxy"
 * path. This allows a Path which belongs to one relation to be added to
 * the pathlist of some other relation.  This is intended as generic
 * infrastructure, but its primary use case is to allow Appends with
 * only a single subpath to be removed from the final plan.

I'm not quite sure how adding an 'isproxy' flag into the AppendPath node
makes this a generic infrastructure? Wouldn't an explicit ProxyPath be a
better way to achieve that?

What other proxy paths do you envision, and why should they be
represented as AppendPath? Although, there seems to be a precedent in
using AppendPath to represent dummy paths ...

FWIW one advantage of a separate ProxyPath would be that it would be a
much clearer breakage for third-party code (say, extensions tweaking
paths using hooks), either at compile or execution time. With just a new
flag in an existing node, that may easily slip through. On the other hand,
no one promises the structures to be a stable API ...


6) I suggest we add this assert to create_append_path:

Assert(!isproxy || (isproxy && (list_length(subpaths)==1)));

and perhaps we should do s/isproxy/is_proxy/ which seems like the usual
naming for boolean variables.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE ADD COLUMN fast default

2018-02-18 Thread David Rowley
On 19 February 2018 at 13:48, Andrew Dunstan
 wrote:
> On Sun, Feb 18, 2018 at 2:52 PM, David Rowley
>  wrote:
>> I'll try to spend some time reviewing the code in detail soon.
>>
>
> Thanks. I'll fix the typos in the next version of the patch, hopefully
> after your review.

Here's a more complete review... I reserve the right to be wrong about
a few things and be nitpicky on a few more...

3. All doc changes are indented 1 space too much. Many lines have also
been broken needlessly before 80 chars

4. Surplus "it" at end of line.

+ * Return the missing value of an attribute, or NULL if it
+ * there is no missing value.

5. "sttributes" -> "attributes"

+ /* initialize all the missing sttributes to null */

6. Surplus space before >=

+ if (missattn  >= startAttNum && !attrmiss[i].ammissingNull)

7. Why bother with the else here?

+ if (slot->tts_tupleDescriptor->constr)
+ {
+ missingnum = slot->tts_tupleDescriptor->constr->num_missing -1;
+ attrmiss = slot->tts_tupleDescriptor->constr->missing;
+ }
+ else
+ {
+ missingnum = -1;
+ attrmiss = NULL;
+ }

You may as well just do:

+ if (!slot->tts_tupleDescriptor->constr)
+ return;

+ missingnum = slot->tts_tupleDescriptor->constr->num_missing -1;
+ attrmiss = slot->tts_tupleDescriptor->constr->missing;

8. -1; should be - 1;

+ missingnum = slot->tts_tupleDescriptor->constr->num_missing -1;

9. Surplus braces can be removed:

+ if (attno < attnum)
  {
- slot->tts_values[attno] = (Datum) 0;
- slot->tts_isnull[attno] = true;
+ slot_getmissingattrs(slot, attno);
  }

10. You've made a behavioral change in the following code:

- for (; attno < attnum; attno++)
+ if (attno < attnum)
  {
- slot->tts_values[attno] = (Datum) 0;
- slot->tts_isnull[attno] = true;
+ slot_getmissingattrs(slot, attno);
  }

Previously this just used to initialise up to attnum, but
slot_getmissingattrs does not know about attnum and always seems to
fill the entire slot with all atts in the TupleDesc.
This may cause some performance regression in cases where only
part of the tuple needs to be deformed. I think Tomas' benchmarks used
the final column in the table, so likely wouldn't have noticed this,
although it may have if he'd aggregated one of the middle columns.

Once that's fixed, you could ditch the Min() call in the following code:

attno = Min(attno, attnum);

slot_deform_tuple(slot, attno);

/*
* If tuple doesn't have all the atts indicated by tupleDesc, read the
* rest as NULLs or missing values
*/
if (attno < attnum)
{
slot_getmissingattrs(slot, attno);
}

And change it to something more like:

/*
* If tuple doesn't have all the atts indicated by tupleDesc, deform as
* much as possible, then fill the remaining required atts with nulls or
* missing values.
*/
if (attno < attnum)
{
slot_deform_tuple(slot, attno);
slot_getmissingattrs(slot, attno, attnum); // <-- (includes new parameter)
}
else
slot_deform_tuple(slot, attnum);

Which should result in a small performance increase due to not having
to compare attno < attnum twice. Although, perhaps the average compiler
may optimise this anyway. I've not checked.

11. Is there a reason why the following code in getmissingattr() can't
be made into a bsearch?

+ for (missingnum = tupleDesc->constr->num_missing - 1;
+ missingnum >= 0; missingnum--)
+ {
+ if (attrmiss[missingnum].amnum == attnum)
+ {
+ if (attrmiss[missingnum].ammissingNull)
+ {
+ *isnull = true;
+ return (Datum) 0;
+ }
+ else
+ {
+ *isnull = false;
+ return attrmiss[missingnum].ammissing;
+ }
+ }
+ }
+ Assert(false); /* should not be able to get here */
+ }

As far as I can see, the attrmiss is sorted by amnum. But maybe I just
failed to find a case where it's not.

I'd imagine doing this would further improve Tomas' benchmark score
for the patched version, since he was performing a sum() on the final
column.

Although, if done, I'm mindful of the fact that
users may one day complain that their query became slower after a
table rewrite :-)

Update: I've stumbled upon the following code:

+ /*
+ * We have a dependency on the attrdef array being filled in in
+ * ascending attnum order. This should be guaranteed by the index
+ * driving the scan. But we want to be double sure
+ */
+ if (!(attp->attnum > attnum))
+ elog(ERROR, "attribute numbers not ascending");

and the code below this seems to also assemble the missing values in
attnum order.

While I'm here, I'd rather see this if test written as: if
(attp->attnum <= attnum)

Since the attnum order in the missing values appears to be well
defined in ascending order, you can also simplify the comparison logic
in equalTupleDescs(). The inner-most nested loop shouldn't be
required.

12. Additional braces not required in:

+ if (tupleDesc &&
+ attnum <= tupleDesc->natts &&
+ TupleDescAttr(tupleDesc, attnum - 1)->atthasmissing)
+ {
+ return false;
+ }
+ else
+ {
+ return true;
+ }

13. Also, just on study of the following condition:

+ if (tupleDesc &&
+ attnum <= tupleDesc->natts &&
+ 

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-18 Thread Tsunakawa, Takayuki
Thank you for reviewing.

From: Michael Paquier [mailto:mich...@paquier.xyz]
> It seems to me that the consolidation of the page read should happen directly
> in xlogreader.c and not even in one of its callbacks, so that no garbage data
> is presented back to the caller using its own XLogReader.
> I think that you need to initialize XLogReaderState->readBuf directly in
> ReadPageInternal() before reading a page and you should be good.  With your
> patch, visibly only one portion of things gets addressed; what about other
> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?

ReadPageInternal() doesn't know where the end of valid WAL is, so it cannot 
determine where to do memset(0).  For example, in non-streaming cases, it reads 
the entire WAL block into readbuf, including the garbage.

/*
 * If the current segment is being streamed from master, calculate how
 * much of the current page we have received already. We know the
 * requested record has been received, but this is for the benefit of
 * future calls, to allow quick exit at the top of this function.
 */
    if (readSource == XLOG_FROM_STREAM)
    {
        if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
            readLen = XLOG_BLCKSZ;
        else
            readLen = receivedUpto % XLogSegSize - targetPageOff;
    }
    else
        readLen = XLOG_BLCKSZ;


So we have to zero-fill the empty space of a WAL block before writing it.  
Currently, the master does that in AdvanceXLInsertBuffer(), StartupXLOG(), 
XLogFileCopy().  The cascading standby receives WAL from the master block by 
block, so it doesn't suffer from the garbage.  pg_receivexlog zero-fills a new 
WAL file.

> Please note that your patch has some unnecessary garbage in two places:

Ouch, cleaned up.


> + if (zbuffer == NULL)
> + zbuffer = palloc0(XLOG_BLCKSZ);
> You could just use a static buffer which is MemSet'd with 0 only if necessary.

I wanted to allocate memory only when necessary.  Currently, only the cascaded
standby needs this.  To that end, I moved the memory allocation code to the
right place.  Thanks for pointing this out.

Regards
Takayuki Tsunakawa




zerofill_walblock_on_standby_v2.patch
Description: zerofill_walblock_on_standby_v2.patch


Re: Documenting PROVE_TESTS in section of TAP tests

2018-02-18 Thread Craig Ringer
On 17 February 2018 at 22:03, Michael Paquier  wrote:

> Hi all,
>
> The section of the documentation dedicated to TAP tests mentions
> PROVE_FLAGS:
> https://www.postgresql.org/docs/devel/static/regress-tap.html
>
> I think that it would be a good idea to mention PROVE_TESTS as well.  I
> personally use and abuse it, and documenting it instead of keeping it
> hidden in Makefile.global.in is good for new developers and testers.
>
> Proposal of patch attached.
>
>
Good plan. I thought it was already mentioned, but that mention is only
in src/test/perl/README where I doubt people routinely running the tests
will see it. I think I put it there mainly to help people writing/fixing
tests.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: TODO item: WAL replay of CREATE TABLESPACE with differing directory structure

2018-02-18 Thread Michael Paquier
On Sun, Feb 18, 2018 at 04:43:38PM -0800, Patrick Krecker wrote:
> On Tue, Feb 13, 2018 at 8:24 PM, Michael Paquier  wrote:
>> Let's be clear here. There is no hard restriction with tablespace paths
>> within the data directory, though you should not do that, and you get a
>> nice warning when trying to do so with CREATE TABLESPACE (see 33cb8ff6).
>> This also causes pg_basebackup to fail.  It is also bad design to create
>> tablespaces within the data directory as those are aimed at making hot
>> paths work on different partitions with different I/O properties.
> 
> Sorry, my language was imprecise here. What I meant is that the
> pg_tblspc directory contains no symlinks when a tablespace
> creation is streamed to a replica, i.e. the data files reside within
> pg_tblspc on the replica.

There is nothing preventing you from doing so I think, and base backups
should work properly as basebackup.c just loops through the paths of the
tablespace links, which leads to errors if the links are within the data
folder itself.  I don't think that we would want a mode where CREATE
TABLESPACE does not create a link at recovery as well, be it controlled
by a system-wide GUC or a switch at DDL level.  That's more likely to
trap users by putting hot data on the same partition as the data folder.

> Thank you for the response. I would suggest that we link to it from
> the wiki so as to provide clarification to future readers of the todo
> list.

Good idea!  I have just updated the wiki page with a link to my previous
post.
--
Michael


signature.asc
Description: PGP signature


Re: spelling of enable_partition_wise_join

2018-02-18 Thread Ashutosh Bapat
On Fri, Feb 16, 2018 at 9:09 PM, Peter Eisentraut
 wrote:
> On 2/14/18 03:28, Ashutosh Bapat wrote:
>> On Wed, Feb 14, 2018 at 2:15 AM, Alvaro Herrera  
>> wrote:
>>> Peter Eisentraut wrote:
 I wonder how others feel about this, but the spelling of
 enable_partition_wise_join feels funny to me every time I look at it.  I
 would write it enable_partitionwise_join.
>>>
>>> +1
>>>
>>> See also: https://postgr.es/m/20171005134847.shzldz2ublrb3ny2@alvherre.pgsql
>>
>> To that I replied with
>> https://www.postgresql.org/message-id/CAFjFpRcsZnxCen88a-16R5EYqPCwFYnFThM%2Bmjagu%3DB1QvxPVA%40mail.gmail.com.
>> I didn't get any further response, so nothing was changed that time.
>> Alvaro, Peter, Gavin have voted for partitionwise in this thread and
>> Robert had similar objections earlier. Looks like we should change it.
>
> done

Thanks. There are functions like try_partition_wise_join(),
generate_partition_wise_join_paths() which use partition_wise
spelling. Should we update those as well?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Creation of wiki page for open items of v11

2018-02-18 Thread Michael Paquier
On Fri, Feb 09, 2018 at 11:37:52AM +0530, Ashutosh Bapat wrote:
> On Fri, Feb 9, 2018 at 9:14 AM, Michael Paquier  wrote:
>> Are there any objections in creating a wiki page to track all those bugs
>> and issues?  We don't want to lose track of all that.
> 
> +1.

Okay, I have created a skeleton of page here:
https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items
Feel free to add any bugs or issues related to v11 that need to be
tracked before the release.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-18 Thread Ashutosh Bapat
On Sat, Feb 17, 2018 at 8:17 AM, Amit Kapila  wrote:
> On Fri, Feb 16, 2018 at 9:29 AM, Ashutosh Bapat
>  wrote:
>> On Thu, Feb 15, 2018 at 7:47 PM, Amit Kapila  wrote:
>>> On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
>>>  wrote:
 I happened to look at the patch for something else. But here are some
 comments. If any of those have been already discussed, please feel
 free to ignore. I have gone through the thread cursorily, so I might
 have missed something.

 In grouping_planner() we call query_planner() first which builds the
 join tree and creates paths, then calculate the target for scan/join
 rel which is applied on the topmost scan join rel. I am wondering
 whether we can reverse this order to calculate the target list of
 scan/join relation and pass it to standard_join_search() (or the hook)
 through query_planner().

>>>
>>> I think the reason for doing in that order is that we can't compute
>>> target's width till after query_planner().  See the below note in
>>> code:
>>>
>>> /*
>>> * Convert the query's result tlist into PathTarget format.
>>> *
>>> * Note: it's desirable to not do this till after query_planner(),
>>> * because the target width estimates can use per-Var width numbers
>>> * that were obtained within query_planner().
>>> */
>>> final_target = create_pathtarget(root, tlist);
>>>
>>> Now, I think we can try to juggle the code in a way that the width can
>>> be computed later, but the rest can be done earlier.
>>
>> /* Convenience macro to get a PathTarget with valid cost/width fields */
>> #define create_pathtarget(root, tlist) \
>> set_pathtarget_cost_width(root, make_pathtarget_from_tlist(tlist))
>> create_pathtarget already works that way. We will need to split it.
>>
>> Create the Pathtarget without widths. Apply width estimates once we
>> know the width of Vars somewhere here in query_planner()
>> /*
>>  * We should now have size estimates for every actual table involved in
>>  * the query, and we also know which if any have been deleted from the
>>  * query by join removal; so we can compute total_table_pages.
>>  *
>>  * Note that appendrels are not double-counted here, even though we don't
>>  * bother to distinguish RelOptInfos for appendrel parents, because the
>>  * parents will still have size zero.
>>  *
>> The next step is building the join tree. Set the pathtarget before that.
>>
>>> However, I think
>>> that will be somewhat major change
>>
>> I agree.
>>
>>>  still won't address all kind of
>>> cases (like for ordered paths) unless we can try to get all kind of
>>> targets pushed down in the call stack.
>>
>> I didn't understand that.
>>
>
> The places where we use a target different than the target which is
> pushed via query planner can cause a similar problem (ex. see the
> usage of adjust_paths_for_srfs) because the cost of that target
> wouldn't be taken into consideration for Gather's costing.
>

Right. Right now apply_projection_to_path() or adjust_paths_for_srfs()
do not take into consideration the type of path whose cost is being
adjusted for the new targetlist. That works for most paths, but not
for all of them, e.g. custom, FDW or parallel paths. The idea I am
proposing is to compute the final targetlist before the query planner runs
so that it's available when we create paths for the topmost scan/join
relation. That way, any path creation logic can take advantage of
this list to compute costs, not just parallel paths.

--
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: non-bulk inserts and tuple routing

2018-02-18 Thread Amit Langote
Fujita-san,

On 2018/02/16 19:50, Etsuro Fujita wrote:
> (2018/02/16 18:23), Amit Langote wrote:
>> Good idea.  Done.
> Thanks.  I fixed a typo (s/Converti/Convert/) and adjusted these comments
> a bit further to match the preceding code/comments.  Attached is the
> updated version.

Thank you for updating the patch.

>> Updated patch attached.
> 
> Thanks, I think the patch is in good shape, so I'll mark this as ready for
> committer.

Thanks a lot for reviewing!

Attached rebased patch.

Regards,
Amit
From 20852fe6cc2f1b8bd57d8bae528083c9f8092dab Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 19 Feb 2018 13:15:44 +0900
Subject: [PATCH v11] During tuple-routing, initialize per-partition objects
 lazily

Those objects include ResultRelInfo, tuple conversion map,
WITH CHECK OPTION quals and RETURNING projections.

This means we don't allocate these objects for partitions that are
never inserted into.
---
 src/backend/commands/copy.c|  10 +-
 src/backend/executor/execPartition.c   | 309 -
 src/backend/executor/nodeModifyTable.c | 131 ++
 src/include/executor/execPartition.h   |   9 +-
 4 files changed, 252 insertions(+), 207 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d5883c98d1..4562a5121d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2469,7 +2469,7 @@ CopyFrom(CopyState cstate)
PartitionTupleRouting *proute;
 
proute = cstate->partition_tuple_routing =
-           ExecSetupPartitionTupleRouting(NULL, cstate->rel, 1, estate);
+   ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
/*
 * If we are capturing transition tuples, they may need to be
@@ -2606,6 +2606,14 @@ CopyFrom(CopyState cstate)
 */
saved_resultRelInfo = resultRelInfo;
resultRelInfo = proute->partitions[leaf_part_index];
+   if (resultRelInfo == NULL)
+   {
+           resultRelInfo = ExecInitPartitionInfo(NULL,
+                                                 saved_resultRelInfo,
+                                                 proute, estate,
+                                                 leaf_part_index);
+   Assert(resultRelInfo != NULL);
+   }
 
/* We do not yet have a way to insert into a foreign 
partition */
if (resultRelInfo->ri_FdwRoutine)
diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 00523ce250..04f76b123a 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -44,18 +44,23 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation 
rel,
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
+ *
+ * While we allocate the arrays of pointers of ResultRelInfo and
+ * TupleConversionMap for all partitions here, actual objects themselves are
+ * lazily allocated for a given partition if a tuple is actually routed to it;
+ * see ExecInitPartitionInfo.  However, if the function is invoked for update
+ * tuple routing, caller would already have initialized ResultRelInfo's for
+ * some of the partitions, which are reused and assigned to their respective
+ * slot in the aforementioned array.
  */
 PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
-                              Relation rel, Index resultRTindex,
-                              EState *estate)
+ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
TupleDesc   tupDesc = RelationGetDescr(rel);
List   *leaf_parts;
ListCell   *cell;
int i;
-   ResultRelInfo *leaf_part_arr = NULL,
-  *update_rri = NULL;
+   ResultRelInfo *update_rri = NULL;
int num_update_rri = 0,
update_rri_index = 0;
boolis_update = false;
@@ -76,6 +81,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
    proute->parent_child_tupconv_maps =
        (TupleConversionMap **) palloc0(proute->num_partitions *
                                        sizeof(TupleConversionMap *));
+   proute->partition_oids = (Oid *) palloc(proute->num_partitions *
+                                           sizeof(Oid));
 
/* Set up detail

Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-18 Thread Ashutosh Bapat
On Mon, Feb 19, 2018 at 9:35 AM, Ashutosh Bapat
 wrote:
> On Sat, Feb 17, 2018 at 8:17 AM, Amit Kapila  wrote:
>> On Fri, Feb 16, 2018 at 9:29 AM, Ashutosh Bapat
>>  wrote:
>>> On Thu, Feb 15, 2018 at 7:47 PM, Amit Kapila  
>>> wrote:
 On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
  wrote:
> I happened to look at the patch for something else. But here are some
> comments. If any of those have been already discussed, please feel
> free to ignore. I have gone through the thread cursorily, so I might
> have missed something.
>
> In grouping_planner() we call query_planner() first which builds the
> join tree and creates paths, then calculate the target for scan/join
> rel which is applied on the topmost scan join rel. I am wondering
> whether we can reverse this order to calculate the target list of
> scan/join relation and pass it to standard_join_search() (or the hook)
> through query_planner().
>

 I think the reason for doing in that order is that we can't compute
 target's width till after query_planner().  See the below note in
 code:

 /*
 * Convert the query's result tlist into PathTarget format.
 *
 * Note: it's desirable to not do this till after query_planner(),
 * because the target width estimates can use per-Var width numbers
 * that were obtained within query_planner().
 */
 final_target = create_pathtarget(root, tlist);

 Now, I think we can try to juggle the code in a way that the width can
 be computed later, but the rest can be done earlier.
>>>
>>> /* Convenience macro to get a PathTarget with valid cost/width fields */
>>> #define create_pathtarget(root, tlist) \
>>> set_pathtarget_cost_width(root, make_pathtarget_from_tlist(tlist))
>>> create_pathtarget already works that way. We will need to split it.
>>>
>>> Create the Pathtarget without widths. Apply width estimates once we
>>> know the width of Vars somewhere here in query_planner()
>>> /*
>>>  * We should now have size estimates for every actual table involved in
>>>  * the query, and we also know which if any have been deleted from the
>>>  * query by join removal; so we can compute total_table_pages.
>>>  *
>>>  * Note that appendrels are not double-counted here, even though we 
>>> don't
>>>  * bother to distinguish RelOptInfos for appendrel parents, because the
>>>  * parents will still have size zero.
>>>  *
>>> The next step is building the join tree. Set the pathtarget before that.
>>>
 However, I think
 that will be somewhat major change
>>>
>>> I agree.
>>>
  still won't address all kind of
 cases (like for ordered paths) unless we can try to get all kind of
 targets pushed down in the call stack.
>>>
>>> I didn't understand that.
>>>
>>
>> The places where we use a target different than the target which is
>> pushed via query planner can cause a similar problem (ex. see the
>> usage of adjust_paths_for_srfs) because the cost of that target
>> wouldn't be taken into consideration for Gather's costing.
>>
>
> Right. Right now apply_projection_to_path() or adjust_paths_for_srfs()
> do not take into consideration the type of path whose cost is being
> adjusted for the new targetlist. That works for most paths, but not
> for all of them, e.g. custom, FDW or parallel paths. The idea I am
> proposing is to compute the final targetlist before the query planner runs
> so that it's available when we create paths for the topmost scan/join
> relation. That way, any path creation logic can take advantage of
> this list to compute costs, not just parallel paths.

In fact, we should do this not just for scan/join relations, but for all
the upper relations as well. Upper relations too create parallel
paths, whose costs need to be adjusted after their targetlists are
updated by adjust_paths_for_srfs(). Similar adjustments are needed for
any FDW or custom paths that cost targetlists differently.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] path toward faster partition pruning

2018-02-18 Thread Amit Langote
Hi David.

On 2018/02/17 18:24, David Rowley wrote:
> On 2 February 2018 at 23:03, Amit Langote  
> wrote:
>> I started wondering if it's not such a good idea to make
>> PartitionClauseInfo a Node at all?  I went back to your earlier message
>> [1] where you said that it's put into the Append node for run-time pruning
>> to use, but it doesn't sound nice that we'd be putting into the plan
>> something that's looks more like scratchpad for the partition.c code.  I
>> think we should try to keep PartitionClauseInfo in partition.h and put
>> only the list of matched bare clauses into Append.
> 
> That sounds like a good idea.
> 
> A patch which puts this back is attached.
> 
> I've changed the run-time prune patch to process the clause lists
> during execution instead.

Thank you.  I'll incorporate it in the version I'll send next.

Regards,
Amit




Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2018-02-18 Thread David Rowley
On 19 February 2018 at 15:11, Tomas Vondra  wrote:
> 1) I can confirm that it indeed eliminates the Append overhead, using
> the example from [1], modified to use table with a single partition. Of
> course, this is a pretty formal check, because the patch simply removes
> the Append node and so it'd be utterly broken if the overhead did not
> disappear too.
>
> [1]
> https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com

Thanks for testing that.

> 2) If I had to guess where the bugs will be, my guess would be a missing
> finalize_plan_tlist call - not necessarily in existing code, but perhaps
> in future improvements.

I agree with that. I'd say, though, that in the future we might
additionally end up with some missing translation calls to
replace_translatable_exprs().

To try to ensure I didn't miss anything, I previously modified the
regression test tables "tenk" and "tenk1" to become partitioned tables
with a single partition allowing all possible values. I just failed to
find a reasonable way to make this a more permanent test without
forcing that change on the tables permanently.

> I wonder if we could automate this call somehow, say by detecting when
> create_plan_recurse is called after build_path_tlist (for a given path),
> and if needed calling finalize_plan_tlist from create_plan_recurse.

Sounds nice, but I'm unsure how we could do that.

> Although, now that I look at it, promote_child_relation may be doing
> something like that by adding stuff to the translate_* variables.

Do you mean this?

/*
* Record this child as having been promoted.  Some places treat child
* relations in a special way, and this will give them a VIP ticket to
* adulthood, where required.
*/
root->translated_childrelids =
bms_add_members(root->translated_childrelids, child->relids);

That's to get around a problem in use_physical_tlist() where there's
different behaviour for RELOPT_OTHER_MEMBER_REL than there is for
RELOPT_BASEREL. Another option might be to instead change the
RELOPT_OTHER_MEMBER_REL into RELOPT_BASEREL, although I was a little
too scared that might result in many other areas requiring changes.
I may be wrong about that though. We do, for example,
occasionally change a RELOPT_BASEREL into a RELOPT_DEADREL, so
RelOptInfos changing their RelOptKind is not a new concept, so perhaps
it's fine.

> I'm
> not quite sure why we need to append to the lists, though?

Multiple Append rels may be replaced by their only-child relation, so
we'd need to be able to translate the targetlists for both. For
example, joins above the Appends may contain Vars which belong to
either of the Appends and those need to be translated into the
promoted child relation's Vars.

> 3) I'm not quite sure what replace_translatable_exprs does, or more
> precisely why do we actually need it. At this point the comment pretty
> much says "We need to do translation. This function does translation."
> Perhaps it should mention why removing/skipping the AppendPath implies
> the need for translation or something?

It does mention "Expr translation is required to translate the
targetlists of nodes above the Append", but perhaps I can make that a
bit more clear.

Let's say we have a query such as:

SELECT * FROM fact f INNER JOIN dim1 d ON f.dim1_id = d.id WHERE
f.date BETWEEN '2018-01-01' AND '2018-01-31';

Which results in a hash join and a single Jan 2018 partition being
scanned for "fact". A plan without the Append node having been removed
would have the Append targetlist containing the Vars from the "fact"
table, as would the Hash Join node... The problem this function is
trying to solve is translating the Vars in the Hash Join node (or any
node above the Append) to change the "fact" Vars into "fact_jan2018"
Vars. In create_plan.c we skip over any isproxy Append paths and
instead return the recursive call to the single proxy-path. Without
the translation we'd run into problems when trying to find Vars in the
targetlists later.
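
For reference, the kind of (made-up) schema I have in mind for that example:

create table dim1 (id int primary key);
create table fact (dim1_id int, date date) partition by range (date);
create table fact_jan2018 partition of fact
    for values from ('2018-01-01') to ('2018-02-01');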

> 4) The thread talks about "[Merge]Append" but the patch apparently only
> tweaks create_append_path and does absolutely nothing for the "merge" case.
> While looking into why, I noticed that generate_mergeappend_paths always
> gets all_child_pathkeys=NULL in this case (with a single subpath), and so
> we never create the MergeAppend path at all. I suppose there's some condition
> somewhere responsible for doing that, but I haven't found it ...

Yeah, only Append paths are used as proxy paths. The confusion here is
that the path creation logic has also been changed in allpaths.c (see
generate_proxy_paths()), so that we no longer build MergeAppend paths
when there's a single subpath. It's probably just the fact that Append
is being hijacked to act as a ProxyPath that's causing the confusion
there. If you look over generate_proxy_paths and where it gets called
from, you'll see that we, instead of creating App

Re: Server Crash while executing pg_replication_slot_advance (second time)

2018-02-18 Thread Masahiko Sawada
On Sat, Feb 17, 2018 at 11:26 AM, Arseny Sher  wrote:
>
> Hello,
>
> I confirm this bug. The idea is that while usually we start decoding
> from the safe data.restart_lsn point, here we don't care about consistent
> snapshots and rush into decoding right away from data.confirmed_flush
> (slotfunc.c:475). The latter points to the first page's header instead
> of a valid record if we log a SWITCH record and confirm it without any
> additional records (while doing this manually you'd better hurry up to
> outrun xl_running_xacts slipping through). This can be reproduced with a
> somewhat simpler example:
>
> SELECT * FROM pg_create_logical_replication_slot(
>   'regression_slot1', 'test_decoding', true);
> select pg_switch_wal();
> -- ensure we are on clean segment and xl_running_xacts didn't slip yet
> select pg_current_wal_lsn();
>  pg_current_wal_lsn
> 
>  0/200
> (1 row)
>
> -- set confirmed_flush_lsn to segment start and verify that
> select pg_replication_slot_advance('regression_slot1', '0/600');
> select slot_name, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn
>   from pg_replication_slots;
> slot_name | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
> --+--+--+-+-
>  regression_slot1 |  |  557 | 0/15D0EE8   | 0/200
> (1 row)
>
> -- insert some wal to advance GetFlushRecPtr
> create table t (i int);
> -- now start decoding from start of segment, boom
> select pg_replication_slot_advance('regression_slot1', '0/600');
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> : @:!>
> : @:!>
> : @:!> \q
>
>
> A simple way to fix that is to start decoding traditionally from
> restart_lsn, which certainly points to a valid record. The attached patch
> shows the idea. However, I am not sure this is the right decision: decoding
> from restart_lsn is slower since it lags behind confirmed_lsn, and we really
> don't care about consistent snapshots in this function. Well, it would
> be good if we assembled one on our way, serialized it, and advanced
> restart_lsn -- and this is AFAIU the main reason we ever decode
> something at all instead of just calling LogicalConfirmReceivedLocation
> -- but that's not the main goal of the function.
>
> Another approach is to notice a pointer to the page start and replace it
> with a pointer to the first record, but that doesn't sound elegant.
>

Or I think we need something like the XLogFindNextRecord facility to find
a valid LSN before reading WAL from startlsn. Note that it might wait
for a new record for up to bgwriter_delay.
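
As a quick diagnostic sketch (assuming the default 8 kB WAL page size), one
can check whether a slot's confirmed_flush_lsn sits exactly on a page
boundary, i.e. on a page header rather than on a record:

select slot_name,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(confirmed_flush_lsn, '0/0') % 8192 = 0 as at_page_boundary
from pg_replication_slots;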

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-18 Thread Ashutosh Bapat
Commit 2fb1abaeb016aeb45b9e6d0b81b7a7e92bb251b9 changed
enable_partition_wise_join to enable_partitionwise_join. This patch
too should use enable_partitionwise_agg instead of
enable_partition_wise_agg.



Re: [PATCH] Add missing type conversion functions for PL/Python

2018-02-18 Thread Anthony Bykov
On Wed, 31 Jan 2018 11:57:12 +0800
Haozhou Wang  wrote:

> Hi All,
> 
> PL/Python already has different type conversion functions to
> convert a PostgreSQL datum to a Python object. However, for converting
> a Python object to a PostgreSQL datum, only Boolean, Bytea and String
> functions exist.
> 
> In this patch, we add the rest of the Integer- and Float-related datatype
> conversion functions, which greatly increases the performance of data
> conversion, especially when returning a large array.
> 
> We did a quick test about the performance of returning array in
> PL/Python:
> 
> The following UDF is used for test:
> 
> ```
> CREATE OR REPLACE FUNCTION pyoutfloat8(num int) RETURNS float8[] AS $$
> return [x/3.0 for x in range(num)]
> $$ LANGUAGE plpythonu;
> ```
> 
> The test command is
> 
> ```
> select count(pyoutfloat8(n));
> ```
> 
> The count is used to avoid large output; n is the number of
> elements in the returned array, and the size ranges from 1.5k to 15m.
> 
> Size of array   | 1.5k    | 15k      | 150k      | 1.5m      | 15m
> Origin          | 2.324ms | 19.834ms | 194.991ms | 1927.28ms | 19982.1ms
> With this patch | 1.168ms | 3.052ms  | 21.888ms  | 213.39ms  | 2138.5ms
> 
> All test for both PG and PL/Python are passed.
> 
> Thanks very much.
> 
> 

Hello,
this sounds like a really nice patch. I've started looking
through the code and noticed a sort of typo (or maybe I just couldn't
understand what you meant) in the comments.

file "src/pl/plpython/plpy_typeio.c"
the comment is 
* If can not convert if directly, fallback to PLyObject_ToDatum
* to convert it

Maybe it should be something like ("it" instead of second "if")
* If can not convert it directly, fallback to PLyObject_ToDatum
* to convert it

And the same typo is repeated several times in the comments.

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[doc fix] Correct calculation of vm.nr_hugepages

2018-02-18 Thread Tsunakawa, Takayuki
Hello,

The current PostgreSQL documentation overestimates the number of huge pages 
(vm.nr_hugepages) because the calculation uses the maximum virtual address 
space.  In practice, huge pages are only used for the anonymous shared memory 
segment.  The attached patch fixes the documentation.
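
For example, the relevant settings can be eyeballed from SQL, though this only
gives a rough lower bound because the actual segment also covers WAL buffers,
lock tables and other shared structures beyond shared_buffers:

select name, setting, unit
from pg_settings
where name in ('huge_pages', 'shared_buffers', 'wal_buffers');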

FYI, Oracle presents a shell script to calculate the number of huge pages for 
its shared memory segments:

https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR385

Regards
Takayuki Tsunakawa



hugepage_size_doc.patch
Description: hugepage_size_doc.patch


Re: spelling of enable_partition_wise_join

2018-02-18 Thread Ashutosh Bapat
On Mon, Feb 19, 2018 at 9:08 AM, Ashutosh Bapat
 wrote:
>
> Thanks. There are functions like try_partition_wise_join(),
> generate_partition_wise_join_paths() which use partition_wise
> spelling. Should we update those as well?

I see that the commit already did that. Sorry. Although, I would have
liked comments to spell "partition-wise" instead of "partitionwise".
https://dictionary.cambridge.org/dictionary/english/wise
differentiates between "-wise" and "wise". The latter is used when
indicating a direction and the first one indicates "relating to". I
think here we want the first one.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: pgbench - allow to specify scale as a size

2018-02-18 Thread Fabien COELHO


Hello Alvaro & Tom,

Why not then insert a "few" rows, measure size, truncate the table, 
compute the formula and then insert to the desired user requested 
size? (or insert what should be the minimum, scale 1, measure, and 
extrapolate what's missing). It doesn't sound too complicated to me, 
and targeting a size is something that I believe it's quite good for 
user.


The formula I used approximates the whole database, not just one table. 
There was one for the table, but this is only part of the issue. In 
particular, ISTM that index sizes should be included when caching is 
considered.


Also, index sizes are probably in n ln(n), so some level of 
approximation is inevitable.


Moreover, the intrinsic granularity of TPC-B as multiple of 100,000 
rows makes it not very precise wrt size anyway.


Sure, makes sense, so my second suggestion seems more reasonable: insert 
with scale 1, measure there (ok, you might need to create indexes only to 
later drop them), and if computed scale > 1 then insert whatever is left 
to insert. Shouldn't be a big deal to me.


I could implement that, even if it would lead to some approximation 
nevertheless: ISTM that the very large scale regression performed by 
Kaarel is significantly more precise than testing with scale 1 (typically 
a few MiB) and extrapolating that to hundreds of GiB.


Maybe it could be done with a kind of open-ended dichotomy, but creating 
and recreating indexes looks like an ugly solution, and what should be 
significant is the whole database size, including tellers & branches 
tables and all indexes, so I'm not convinced. Now as tellers & branches 
tables have basically the same structure as accounts, it could be just 
scaled by assuming that it would incur the same storage per row.
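
For illustration, a minimal sketch of the measure-then-extrapolate idea,
assuming a freshly initialized scale-1 database with the standard pgbench
table names:

```
-- after "pgbench -i -s 1", per-table footprint including indexes and toast
SELECT relname, pg_total_relation_size(oid) AS bytes
FROM pg_class
WHERE relname LIKE 'pgbench_%' AND relkind = 'r'
ORDER BY relname;

-- whole-database size, the basis for extrapolating to the requested size
SELECT pg_size_pretty(pg_database_size(current_database()));
```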


Anyway, even if I do not like it, it could be better than nothing. The key 
point for me is that if Tom is dead set against the feature the patch is 
dead anyway.


Tom, would Alvaro's approach be more admissible to you than a fixed formula 
that would need updating, keeping in mind that such a feature implies 
some level of approximation?


--
Fabien.



Re: missing toast table for pg_policy

2018-02-18 Thread Joe Conway
On 02/17/2018 11:39 AM, Tom Lane wrote:
> Joe Conway  writes:
>> Yes, exactly. I'm fine with not backpatching, just wanted to raise the
>> possibility. I will push later today to HEAD (with a catalog version bump).
> 
> BTW, I was wondering if it'd be a good idea to try to forestall future
> oversights of this sort by adding a test query in, say, misc_sanity.sql.
> Something like
> 
> select relname, attname, atttypid::regtype
> from pg_class c join pg_attribute a on c.oid = attrelid
> where c.oid < 16384 and reltoastrelid=0 and relkind = 'r' and attstorage != 
> 'p'
> order by 1,2;
> 
> If you try that you'll see the list is quite long:



> I think in most of these cases we've consciously decided not to toast-ify,
> but maybe some of them need a second look.

Is there really a compelling reason to not just create toast tables for
all system catalogs as in the attached? Then we could just check for 0
rows in misc_sanity.sql.

For what it's worth:

HEAD

# du -h --max-depth=1 $PGDATA
[...]
22M /usr/local/pgsql-head/data/base
[...]
584K    /usr/local/pgsql-head/data/global
[...]
38M /usr/local/pgsql-head/data

time make check
real    0m16.295s
user    0m3.597s
sys 0m1.465s


with patch

# du -h --max-depth=1 $PGDATA
[...]
23M /usr/local/pgsql-head/data/base
[...]
632K    /usr/local/pgsql-head/data/global
[...]
39M /usr/local/pgsql-head/data

time make check
real    0m16.462s
user    0m3.521s
sys 0m1.531s

select relname, attname, atttypid::regtype
from pg_class c join pg_attribute a on c.oid = attrelid
where c.oid < 16384 and reltoastrelid=0 and relkind = 'r' and attstorage
!= 'p'
order by 1,2;
 relname | attname | atttypid
-+-+--
(0 rows)


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 809749add9..813b3b87cc 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -258,7 +258,19 @@ IsSharedRelation(Oid relationId)
 		relationId == PgDbRoleSettingToastTable ||
 		relationId == PgDbRoleSettingToastIndex ||
 		relationId == PgShseclabelToastTable ||
-		relationId == PgShseclabelToastIndex)
+		relationId == PgShseclabelToastIndex ||
+		relationId == PgAuthidToastTable ||
+		relationId == PgAuthidToastIndex ||
+		relationId == PgDatabaseToastTable ||
+		relationId == PgDatabaseToastIndex ||
+		relationId == PgPlTemplateToastTable ||
+		relationId == PgPlTemplateToastIndex ||
+		relationId == PgReplicationOriginToastTable ||
+		relationId == PgReplicationOriginToastIndex ||
+		relationId == PgSubscriptionToastTable ||
+		relationId == PgSubscriptionToastIndex ||
+		relationId == PgTablespaceToastTable ||
+		relationId == PgTablespaceToastIndex)
 		return true;
 	return false;
 }
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index f6387ae143..3ef3990ea3 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -46,25 +46,64 @@ extern void BootstrapToastTable(char *relName,
  */
 
 /* normal catalogs */
+DECLARE_TOAST(pg_aggregate, 4139, 4140);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
+DECLARE_TOAST(pg_attribute, 4141, 4142);
+DECLARE_TOAST(pg_class, 4143, 4144);
+DECLARE_TOAST(pg_collation, 4145, 4146);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
+DECLARE_TOAST(pg_default_acl, 4147, 4148);
 DECLARE_TOAST(pg_description, 2834, 2835);
+DECLARE_TOAST(pg_event_trigger, 4149, 4150);
+DECLARE_TOAST(pg_extension, 4151, 4152);
+DECLARE_TOAST(pg_foreign_data_wrapper, 4153, 4154);
+DECLARE_TOAST(pg_foreign_server, 4155, 4156);
+DECLARE_TOAST(pg_foreign_table, 4157, 4158);
+DECLARE_TOAST(pg_index, 4159, 4160);
+DECLARE_TOAST(pg_init_privs, 4161, 4162);
+DECLARE_TOAST(pg_language, 4163, 4164);
+DECLARE_TOAST(pg_largeobject, 4165, 4166);
+DECLARE_TOAST(pg_largeobject_metadata, 4167, 4168);
+DECLARE_TOAST(pg_namespace, 4169, 4170);
+DECLARE_TOAST(pg_partitioned_table, 4171, 4172);
+DECLARE_TOAST(pg_policy, 4173, 4174);
 DECLARE_TOAST(pg_proc, 2836, 2837);
 DECLARE_TOAST(pg_rewrite, 2838, 2839);
 DECLARE_TOAST(pg_seclabel, 3598, 3599);
 DECLARE_TOAST(pg_statistic, 2840, 2841);
 DECLARE_TOAST(pg_statistic_ext, 3439, 3440);
 DECLARE_TOAST(pg_trigger, 2336, 2337);
+DECLARE_TOAST(pg_ts_dict, 4175, 4176);
+DECLARE_TOAST(pg_type, 4177, 4178);
+DECLARE_TOAST(pg_user_mapping, 4179, 4180);
 
 /* shared catalogs */
-DECLARE_TOAST(pg_shdescription, 2846, 2847);
-#define PgShdescriptionToastTable 2846
-#define PgShdescriptionToastIndex 2847
+DECLARE_TOAST(pg_authid, 4181, 4182);
+#define PgAuthidToastTable 4181
+#define PgAuthidToastIndex 4182
+DECLARE_TOAST(pg_database, 4183, 4184);	
+#define PgDatabaseToastTable 4183
+#define PgDatabaseToastIndex 4184
 DECLARE_TOAST(pg_db_role_setting, 2966, 2967);
 #define PgDbRoleSettingToastTable 2966
 #define PgDbRoleSettingToastInde

Re: missing toast table for pg_policy

2018-02-18 Thread Tom Lane
Joe Conway  writes:
> Is there really a compelling reason to not just create toast tables for
> all system catalogs as in the attached?

What happens when you VACUUM FULL pg_class?  (The associated toast table
would have to be nonempty for the test to prove much.)

I'm fairly suspicious of toasting anything that the toast mechanism itself
depends on, actually, and that would include at least pg_attribute and
pg_index as well as pg_class.  Maybe we could get away with it because
there would never be any actual recursion, only potential recursion ...
but it seems scary.

On the whole, I'm dubious that the risk/reward ratio is attractive here.

regards, tom lane



Re: different results from plpgsql functions related to last changes in master

2018-02-18 Thread Tom Lane
Pavel Stehule  writes:
> I did an update of plpgsql_check and I see that some functions return
> different results than on older PostgreSQL. Probably this is wanted behavior,
> but it should be mentioned as a partial compatibility break, because some
> regression tests can be broken too.

This is mentioned in the relevant commit message (4b93f5799):

...  A lesser, but still real, annoyance is that ROW format cannot
represent a true NULL composite value, only a row of per-field NULL
values, which is not exactly the same thing.

In the case you're showing here, a true NULL got changed into ROW(NULL)
by the old code, but that no longer happens.

regards, tom lane
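
For illustration, a minimal sketch of the kind of case affected, using a
made-up composite type rather than anything from the plpgsql_check report:

```
CREATE TYPE pair AS (a int, b int);

CREATE FUNCTION f() RETURNS pair LANGUAGE plpgsql AS $$
DECLARE r pair;
BEGIN
  r := NULL;        -- a true NULL composite
  RETURN r;
END $$;

-- With the old ROW format the result would typically come back as
-- ROW(NULL,NULL), displayed as "(,)"; after 4b93f5799 it stays a plain NULL.
SELECT f();
```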



Re: different results from plpgsql functions related to last changes in master

2018-02-18 Thread Pavel Stehule
2018-02-18 17:48 GMT+01:00 Tom Lane :

> Pavel Stehule  writes:
> > I did an update of plpgsql_check and I see that some functions return
> > different results than on older PostgreSQL. Probably this is wanted behavior,
> > but it should be mentioned as a partial compatibility break, because some
> > regression tests can be broken too.
>
> This is mentioned in the relevant commit message (4b93f5799):
>
> ...  A lesser, but still real, annoyance is that ROW format cannot
> represent a true NULL composite value, only a row of per-field NULL
> values, which is not exactly the same thing.
>
> In the case you're showing here, a true NULL got changed into ROW(NULL)
> by the old code, but that no longer happens.
>

I understand, and I have no problem with this behavior. I am just
expecting that a lot of people will be surprised.

Regards

Pavel


>
> regards, tom lane
>


Re: [HACKERS] Bug in to_timestamp().

2018-02-18 Thread Dmitry Dolgov
> On 17 February 2018 at 10:02, Arthur Zakirov 
wrote:
>
> I think it could be fixed by another patch. But I'm not sure that it
> will be accepted as well as this patch :).

Why do you think there should be another patch for it?


Re: missing toast table for pg_policy

2018-02-18 Thread Joe Conway
On 02/18/2018 11:18 AM, Tom Lane wrote:
> Joe Conway  writes:
>> Is there really a compelling reason to not just create toast tables for
>> all system catalogs as in the attached?
> 
> What happens when you VACUUM FULL pg_class?  (The associated toast table
> would have to be nonempty for the test to prove much.)

I tried this:
create table foo (id int);
do $$declare i int; begin for i in 1..1000 loop execute 'create user u'
|| i; end loop; end;$$;
do $$declare i int; begin for i in 1..1000 loop execute 'grant all on
foo to u' || i; end loop; end;$$;
vacuum full pg_class;

Worked without issue FWIW.
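
To address the caveat that the toast table has to be nonempty for the test to
prove much, a quick check along these lines (a sketch, not from the thread)
could be run before the VACUUM FULL:

```
SELECT reltoastrelid::regclass AS toast_rel,
       pg_relation_size(reltoastrelid) AS toast_bytes
FROM pg_class
WHERE oid = 'pg_class'::regclass;
```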

> I'm fairly suspicious of toasting anything that the toast mechanism itself
> depends on, actually, and that would include at least pg_attribute and
> pg_index as well as pg_class.  Maybe we could get away with it because
> there would never be any actual recursion only potential recursion ...
> but it seems scary.

Well that is the other approach we could pursue -- instead of justifying
which system catalogs need toast tables we could create an exclusion
list of which ones should not have toast tables, with the current
candidates being pg_class, pg_attribute, and pg_index.
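
A sketch of what the misc_sanity.sql check might look like with such an
exclusion list (the query body is the one quoted upthread; only the NOT IN
clause is new):

```
SELECT relname, attname, atttypid::regtype
FROM pg_class c JOIN pg_attribute a ON c.oid = attrelid
WHERE c.oid < 16384 AND reltoastrelid = 0 AND relkind = 'r'
  AND attstorage != 'p'
  AND relname NOT IN ('pg_class', 'pg_attribute', 'pg_index')
ORDER BY 1, 2;
```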

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Cache lookup errors with functions manipulation object addresses

2018-02-18 Thread Michael Paquier
On Sat, Feb 17, 2018 at 07:17:24PM -0300, Alvaro Herrera wrote:
> Pushed 0003.

Thanks.

> Maybe we can get rid of format_type_be_qualified too, but I didn't
> care too much about it either; it's not a big deal I think. 

Agreed. There are 16 callers spread across objectaddress.c and regproc.c,
so this would generate some diffs.  If there are extra opinions later on,
we could always revisit that.  The new API is modular enough anyway.

> What interested me more was whether we could get rid of the
> FORMAT_TYPE_TYPEMOD_GIVEN flag, but ended up deciding not to pursue that
> as a phenomenal waste of time.  Here are some references in case you
> care.
> 
> https://www.postgresql.org/message-id/flat/20001659.fAAGxKX06044%40postgresql.org
> https://git.postgresql.org/pg/commitdiff/a585c20d12d0e22befc8308e9f8ccb6f54a5df69

Thanks for the threads, I didn't know about them.  I also thought
about trying to remove FORMAT_TYPE_TYPEMOD_GIVEN but avoided doing so,
so as not to break something that has been the way it is for a long time;
this code has been in its current shape for at least as long as I have
been working on Postgres.
I didn't check the git history to see the logic behind the code though,
which I really should have done.  So thanks for the references.
--
Michael


signature.asc
Description: PGP signature


Re: TODO item: WAL replay of CREATE TABLESPACE with differing directory structure

2018-02-18 Thread Patrick Krecker
On Tue, Feb 13, 2018 at 8:24 PM, Michael Paquier  wrote:
> On Tue, Feb 13, 2018 at 01:44:34PM -0800, Patrick Krecker wrote:
>> I am searching for a way to make a contribution to Postgres and I came
>> across this TODO item (I realize there has been some controversy
>> around the TODO list [1], and I hope that my use of it doesn't spark
>> another discussion about removing it altogether):
>
> Well, it will point out again that TODO items are hard, complicated and
> mostly impossible projects.
>
>> "Allow WAL replay of CREATE TABLESPACE to work when the directory
>> structure on the recovery computer is different from the original"
>>
>> Currently it looks like tablespaces have to live inside the data
>> directory on the replica, notwithstanding administrator intervention
>> by manipulating the tablespace directory with symlinks after (or even
>> before?) it has been created via replay.
>
> Let's be clear here. There is no hard restriction with tablespace paths
> within the data directory, though you should not do that, and you get a
> nice warning when trying to do so with CREATE TABLESPACE (see 33cb8ff6).
> This also causes pg_basebackup to fail.  It is also bad design to create
> tablespaces within the data directory as those are aimed at making hot
> paths work on different partitions with different I/O properties.

Sorry, my language was imprecise here. What I meant is that the
pg_tblspc directory contains no symlinks when a tablespace
creation is streamed to a replica, i.e. the data files reside within
pg_tblspc on the replica.
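
For what it's worth, one way to see where a cluster resolved a tablespace to
is pg_tablespace_location(), which reads the pg_tblspc symlink (it returns an
empty string for pg_default and pg_global):

```
SELECT spcname, pg_tablespace_location(oid) AS location
FROM pg_tablespace;
```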

>> Is the idea behind this task to allow the master to instruct the
>> replica where to put the tablespace on its filesystem, so as to allow
>> it to live outside of the data directory without direct manipulation
>> of the filesystem?
>
> WAL records associated to CREATE TABLESPACE (xl_tblspc_create_rec)
> register the location where a tablespace is located.  The location of a
> tablespace is not saved in the system catalogs, which offers flexibility
> in the way the symlink from pg_tblspc can be handled.  This is where the
> tablespace path remapping of pg_basebackup becomes handy, because you
> can repurpose paths easily when taking a base backup, but this forces
> you to create tablespaces first, and then create standbys.  We have also
> a set of existing problems:
> 1) If a primary and its standby are on the same server and you issue a
> CREATE TABLESPACE, then they would try to write to the same paths.
> 2) How do we design at DDL level a command which allows for specifying
> different paths depending on the node where the recovery happens.
>
> You would need in both cases a sort of ability to define a node name, so
> as for 1) you append the node name to the path and both primary and
> standby can use the same tablespace path, but with different sub-paths.
> And for 2), you can enforce a path name by defining as well a path
> associated to a node name so that when xl_tblspc_create_rec records are
> replayed at recovery, you know which path to create.  Just designing
> that the right way has its own set of complications.
>
>> If this task is a worthwhile endeavor, I would be happy to take it on.
>> If not, I am open to other ideas :)
>
> This is part of the difficult, perhaps-not-worth doing impossible
> problems.  As a first contribution, you may want something easier.

Thank you for the response. I would suggest that we link to it from
the wiki so as to provide clarification to future readers of the todo
list.

> --
> Michael



Re: ALTER TABLE ADD COLUMN fast default

2018-02-18 Thread Andrew Dunstan
On Sun, Feb 18, 2018 at 2:52 PM, David Rowley
 wrote:
> On 17 February 2018 at 10:46, Andrew Dunstan
>  wrote:
>> The attached version largely fixes with the performance degradation
>> that Tomas Vondra reported, replacing an O(N^2) piece of code in
>> slot_getmissingattrs() by one that is O(N).
>
> I've looked over this patch and had a play around with it.
>
> Just a couple of things that I noticed while reading:
>

[ various typos ]

>
> I'll try to spend some time reviewing the code in detail soon.
>


Thanks. I'll fix the typos in the next version of the patch, hopefully
after your review.

cheers

andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2018-02-18 Thread Tomas Vondra
Hi,

On 01/25/2018 01:55 PM, David Rowley wrote:
> I've attached an updated patch which takes care of the build failure 
> caused by the new call to create_append_path added in bb94ce4.
> 

I've been looking at the patch over the past few days. I haven't found
any obvious issues with it, but I do have a couple of questions.

1) I can confirm that it indeed eliminates the Append overhead, using
the example from [1], modified to use table with a single partition. Of
course, this is a pretty formal check, because the patch simply removes
the Append node and so it'd be utterly broken if the overhead did not
disappear too.

[1]
https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com
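
For reference, a minimal single-partition setup of the sort used for this
check (my own sketch, not the exact tables from [1]):

```
CREATE TABLE parted (a int, b text) PARTITION BY LIST (a);
CREATE TABLE parted_only PARTITION OF parted FOR VALUES IN (1);

-- With the patch, the Append node is expected to disappear and the plan
-- should show only a scan on parted_only.
EXPLAIN (COSTS OFF) SELECT * FROM parted WHERE a = 1;
```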


2) If I had to guess where the bugs will be, my guess would be a missing
finalize_plan_tlist call - not necessarily in existing code, but perhaps
in future improvements.

I wonder if we could automate this call somehow, say by detecting when
create_plan_recurse is called after build_path_tlist (for a given path),
and if needed calling finalize_plan_tlist from create_plan_recurse.

Although, now that I look at it, promote_child_relation may be doing
something like that by adding stuff to the translate_* variables. I'm
not quite sure why we need to append to the lists, though?


3) I'm not quite sure what replace_translatable_exprs does, or more
precisely why do we actually need it. At this point the comment pretty
much says "We need to do translation. This function does translation."
Perhaps it should mention why removing/skipping the AppendPath implies
the need for translation or something?


4) The thread talks about "[Merge]Append" but the patch apparently only
tweaks create_append_path and does absolutely nothing to "merge" case.
While looking into why, I noticed that generate_mergeappend_paths always
gets all_child_pathkeys=NULL in this case (with single subpath), and so
we never create the MergePath at all. I suppose there's some condition
somewhere responsible for doing that, but I haven't found it ...


5) The comment before AppendPath typedef says this:

 * An AppendPath with a single subpath can be set up to become a "proxy"
 * path. This allows a Path which belongs to one relation to be added to
 * the pathlist of some other relation.  This is intended as generic
 * infrastructure, but its primary use case is to allow Appends with
 * only a single subpath to be removed from the final plan.

I'm not quite sure how adding a 'isproxy' flag into AppendPath node
makes this a generic infrastructure? Wouldn't an explicit ProxyPath be a
better way to achieve that?

What other proxy paths do you envision, and why should they be
represented as AppendPath? Although, there seems to be a precedent in
using AppendPath to represent dummy paths ...

FWIW one advantage of a separate ProxyPath would be that it would be a
much clearer breakage for third-party code (say, extensions tweaking
paths using hooks), either at compile or execution time. With just a new
flag in existing node, that may easily slip through. On the other hand
no one promises the structures to be stable API ...


6) I suggest we add this assert to create_append_path:

Assert(!isproxy || (isproxy && (list_length(subpaths)==1)));

and perhaps we should do s/isproxy/is_proxy/ which seems like the usual
naming for boolean variables.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE ADD COLUMN fast default

2018-02-18 Thread David Rowley
On 19 February 2018 at 13:48, Andrew Dunstan
 wrote:
> On Sun, Feb 18, 2018 at 2:52 PM, David Rowley
>  wrote:
>> I'll try to spend some time reviewing the code in detail soon.
>>
>
> Thanks. I'll fix the typos in the next version of the patch, hopefully
> after your review.

Here's a more complete review... I reserve the right to be wrong about
a few things and be nitpicky on a few more...

3. All doc changes are indented 1 space too much. Many lines have also
been broken needlessly before 80 chars

4. Surplus "it" at end of line.

+ * Return the missing value of an attribute, or NULL if it
+ * there is no missing value.

5. "sttributes" -> "attributes"

+ /* initialize all the missing sttributes to null */

6. Surplus space before >=

+ if (missattn  >= startAttNum && !attrmiss[i].ammissingNull)

7. Why bother with the else here?

+ if (slot->tts_tupleDescriptor->constr)
+ {
+ missingnum = slot->tts_tupleDescriptor->constr->num_missing -1;
+ attrmiss = slot->tts_tupleDescriptor->constr->missing;
+ }
+ else
+ {
+ missingnum = -1;
+ attrmiss = NULL;
+ }

You may as well just do:

+ if (!slot->tts_tupleDescriptor->constr)
+ return;

+ missingnum = slot->tts_tupleDescriptor->constr->num_missing -1;
+ attrmiss = slot->tts_tupleDescriptor->constr->missing;

8. -1; should be - 1;

+ missingnum = slot->tts_tupleDescriptor->constr->num_missing -1;

9. Surplus braces can be removed:

+ if (attno < attnum)
  {
- slot->tts_values[attno] = (Datum) 0;
- slot->tts_isnull[attno] = true;
+ slot_getmissingattrs(slot, attno);
  }

10. You've made a behavioral change in the following code:

- for (; attno < attnum; attno++)
+ if (attno < attnum)
  {
- slot->tts_values[attno] = (Datum) 0;
- slot->tts_isnull[attno] = true;
+ slot_getmissingattrs(slot, attno);
  }

Previously this just used to initialise up to attnum, but
slot_getmissingattrs does not know about attnum and always seems to
fill the entire slot with all atts in the TupleDesc.
This may cause some performance regression in cases where only
part of the tuple needs to be deformed. I think Tomas' benchmarks used
the final column in the table, so likely wouldn't have notice this,
although it may have if he'd aggregated one of the middle columns.

Once that's fixed, you could ditch the Min() call in the following code:

attno = Min(attno, attnum);

slot_deform_tuple(slot, attno);

/*
* If tuple doesn't have all the atts indicated by tupleDesc, read the
* rest as NULLs or missing values
*/
if (attno < attnum)
{
slot_getmissingattrs(slot, attno);
}

And change it to something more like:

/*
* If tuple doesn't have all the atts indicated by tupleDesc, deform as
* much as possible, then fill the remaining required atts with nulls or
* missing values.
*/
if (attno < attnum)
{
slot_deform_tuple(slot, attno);
slot_getmissingattrs(slot, attno, attnum); // <-- (includes new parameter)
}
else
slot_deform_tuple(slot, attnum);

Which should result in a small performance increase due to not having
to compare attno < attnum twice. Although, perhaps the average compiler
may optimise this anyway. I've not checked.

11. Is there a reason why the following code in getmissingattr() can't
be made into a bsearch?

+ for (missingnum = tupleDesc->constr->num_missing - 1;
+ missingnum >= 0; missingnum--)
+ {
+ if (attrmiss[missingnum].amnum == attnum)
+ {
+ if (attrmiss[missingnum].ammissingNull)
+ {
+ *isnull = true;
+ return (Datum) 0;
+ }
+ else
+ {
+ *isnull = false;
+ return attrmiss[missingnum].ammissing;
+ }
+ }
+ }
+ Assert(false); /* should not be able to get here */
+ }

As far as I can see, the attrmiss is sorted by amnum. But maybe I just
failed to find a case where it's not.

I'd imagine doing this would further improve Tomas' benchmark score
for the patched version, since he was performing a sum() on the final
column.

Although, If done, I'm actually holding some regard to the fact that
users may one day complain that their query became slower after a
table rewrite :-)

Update: I've stumbled upon the following code:

+ /*
+ * We have a dependency on the attrdef array being filled in in
+ * ascending attnum order. This should be guaranteed by the index
+ * driving the scan. But we want to be double sure
+ */
+ if (!(attp->attnum > attnum))
+ elog(ERROR, "attribute numbers not ascending");

and the code below this seems to also assemble the missing values in
attnum order.

While I'm here, I'd rather see this if test written as: if
(attp->attnum <= attnum)

Since the attnum order in the missing values appears to be well
defined in ascending order, you can also simplify the comparison logic
in equalTupleDescs(). The inner-most nested loop shouldn't be
required.

12. Additional braces not required in:

+ if (tupleDesc &&
+ attnum <= tupleDesc->natts &&
+ TupleDescAttr(tupleDesc, attnum - 1)->atthasmissing)
+ {
+ return false;
+ }
+ else
+ {
+ return true;
+ }

13. Also, just on study of the following condition:

+ if (tupleDesc &&
+ attnum <= tupleDesc->natts &&
+ 

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-18 Thread Tsunakawa, Takayuki
Thank you for reviewing.

From: Michael Paquier [mailto:mich...@paquier.xyz]
> It seems to me that the consolidation of the page read should happen directly
> in xlogreader.c and not even in one of its callbacks so as no garbage data
> is presented back to the caller using its own XLogReader.
> I think that you need to initialize XLogReaderState->readBuf directly in
> ReadPageInternal() before reading a page and you should be good.  With your
> patch you get visibly only one portion of things addressed, what of other
> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?

ReadPageInternal() doesn't know where the end of valid WAL is, so it cannot 
determine where to do memset(0).  For example, in non-streaming cases, it reads 
the entire WAL block into readbuf, including the garbage.

    /*
     * If the current segment is being streamed from master, calculate how
     * much of the current page we have received already. We know the
     * requested record has been received, but this is for the benefit of
     * future calls, to allow quick exit at the top of this function.
     */
    if (readSource == XLOG_FROM_STREAM)
    {
        if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
            readLen = XLOG_BLCKSZ;
        else
            readLen = receivedUpto % XLogSegSize - targetPageOff;
    }
    else
        readLen = XLOG_BLCKSZ;


So we have to zero-fill the empty space of a WAL block before writing it.  
Currently, the master does that in AdvanceXLInsertBuffer(), StartupXLOG(), 
XLogFileCopy().  The cascading standby receives WAL from the master block by 
block, so it doesn't suffer from the garbage.  pg_receivexlog zero-fills a new 
WAL file.

> Please note that your patch has some unnecessary garbage in two places:

Ouch, cleaned up.


> + if (zbuffer == NULL)
> + zbuffer = palloc0(XLOG_BLCKSZ);
> You could just use a static buffer which is MemSet'd with 0 only if necessary.

I wanted to allocate memory only when necessary.  Currently, only the cascaded 
standby needs this.  To that end, I moved the memory allocation code to the right 
place.  Thanks for letting me notice this.

Regards
Takayuki Tsunakawa




zerofill_walblock_on_standby_v2.patch
Description: zerofill_walblock_on_standby_v2.patch


Re: Documenting PROVE_TESTS in section of TAP tests

2018-02-18 Thread Craig Ringer
On 17 February 2018 at 22:03, Michael Paquier  wrote:

> Hi all,
>
> The section of the documentation dedicated to TAP tests mentions
> PROVE_FLAGS:
> https://www.postgresql.org/docs/devel/static/regress-tap.html
>
> I think that it would be a good idea to mention PROVE_TESTS as well.  I
> personally use and abuse of it, and documenting it instead of keeping it
> hidden in Makefile.global.in is good for new developers and testers.
>
> Proposal of patch attached.
>
>
Good plan. I thought it was already mentioned, but that mention is only
in src/test/perl/README where I doubt people routinely running the tests
will see it. I think I put it there mainly to help people writing/fixing
tests.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: TODO item: WAL replay of CREATE TABLESPACE with differing directory structure

2018-02-18 Thread Michael Paquier
On Sun, Feb 18, 2018 at 04:43:38PM -0800, Patrick Krecker wrote:
> On Tue, Feb 13, 2018 at 8:24 PM, Michael Paquier  wrote:
>> Let's be clear here. There is no hard restriction with tablespace paths
>> within the data directory, though you should not do that, and you get a
>> nice warning when trying to do so with CREATE TABLESPACE (see 33cb8ff6).
>> This also causes pg_basebackup to fail.  It is also bad design to create
>> tablespaces within the data directory as those are aimed at making hot
>> paths work on different partitions with different I/O properties.
> 
> Sorry, my language was imprecise here. What I meant is that the
> pg_tablespace directory contains no symlinks when a tablespace
> creation is streamed to a replica, i.e. the data files reside within
> pg_tablespace on the replica.

There is nothing preventing you to do so I think, and base backups
should work properly as basebackup.c just loops through the paths of the
tablespace links, which leads to errors if the links are within the data
folder itself.  I don't think that we would want a mode where CREATE
TABLESPACE does not create a link at recovery as well, be it controlled
by a system-wide GUC or a switch at DDL level.  That's more likely to
trap users by putting hot data on the same partition as the data folder.

> Thank you for the response. I would suggest that we link to it from
> the wiki so as to provide clarification to future readers of the todo
> list.

Good idea!  I have just updated the wiki page with a link to my previous
post.
--
Michael


signature.asc
Description: PGP signature


Re: spelling of enable_partition_wise_join

2018-02-18 Thread Ashutosh Bapat
On Fri, Feb 16, 2018 at 9:09 PM, Peter Eisentraut
 wrote:
> On 2/14/18 03:28, Ashutosh Bapat wrote:
>> On Wed, Feb 14, 2018 at 2:15 AM, Alvaro Herrera  
>> wrote:
>>> Peter Eisentraut wrote:
 I wonder how others feel about this, but the spelling of
 enable_partition_wise_join feels funny to me every time I look at it.  I
 would write it enable_partitionwise_join.
>>>
>>> +1
>>>
>>> See also: https://postgr.es/m/20171005134847.shzldz2ublrb3ny2@alvherre.pgsql
>>
>> To that I replied with
>> https://www.postgresql.org/message-id/CAFjFpRcsZnxCen88a-16R5EYqPCwFYnFThM%2Bmjagu%3DB1QvxPVA%40mail.gmail.com.
>> I didn't get any further response, so nothing was changed that time.
>> Alvaro, Peter, Gavin have voted for partitionwise in this thread and
>> Robert had similar objections earlier. Looks like we should change it.
>
> done

Thanks. There are functions like try_partition_wise_join(),
generate_partition_wise_join_paths() which use partition_wise
spelling. Should we update those as well?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Creation of wiki page for open items of v11

2018-02-18 Thread Michael Paquier
On Fri, Feb 09, 2018 at 11:37:52AM +0530, Ashutosh Bapat wrote:
> On Fri, Feb 9, 2018 at 9:14 AM, Michael Paquier  wrote:
>> Are there any objections in creating a wiki page to track all those bugs
>> and issues?  We don't want to lose track of all that.
> 
> +1.

Okay, I have created a skeleton of page here:
https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items
Feel free to add any bugs or issues related to v11 that need to be
tracked before the release.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-18 Thread Ashutosh Bapat
On Sat, Feb 17, 2018 at 8:17 AM, Amit Kapila  wrote:
> On Fri, Feb 16, 2018 at 9:29 AM, Ashutosh Bapat
>  wrote:
>> On Thu, Feb 15, 2018 at 7:47 PM, Amit Kapila  wrote:
>>> On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
>>>  wrote:
 I happened to look at the patch for something else. But here are some
 comments. If any of those have been already discussed, please feel
 free to ignore. I have gone through the thread cursorily, so I might
 have missed something.

 In grouping_planner() we call query_planner() first which builds the
 join tree and creates paths, then calculate the target for scan/join
 rel which is applied on the topmost scan join rel. I am wondering
 whether we can reverse this order to calculate the target list of
 scan/join relation and pass it to standard_join_search() (or the hook)
 through query_planner().

>>>
>>> I think the reason for doing in that order is that we can't compute
>>> target's width till after query_planner().  See the below note in
>>> code:
>>>
>>> /*
>>> * Convert the query's result tlist into PathTarget format.
>>> *
>>> * Note: it's desirable to not do this till after query_planner(),
>>> * because the target width estimates can use per-Var width numbers
>>> * that were obtained within query_planner().
>>> */
>>> final_target = create_pathtarget(root, tlist);
>>>
>>> Now, I think we can try to juggle the code in a way that the width can
>>> be computed later, but the rest can be done earlier.
>>
>> /* Convenience macro to get a PathTarget with valid cost/width fields */
>> #define create_pathtarget(root, tlist) \
>> set_pathtarget_cost_width(root, make_pathtarget_from_tlist(tlist))
>> create_pathtarget already works that way. We will need to split it.
>>
>> Create the Pathtarget without widths. Apply width estimates once we
>> know the width of Vars somewhere here in query_planner()
>> /*
>>  * We should now have size estimates for every actual table involved in
>>  * the query, and we also know which if any have been deleted from the
>>  * query by join removal; so we can compute total_table_pages.
>>  *
>>  * Note that appendrels are not double-counted here, even though we don't
>>  * bother to distinguish RelOptInfos for appendrel parents, because the
>>  * parents will still have size zero.
>>  *
>> The next step is building the join tree. Set the pathtarget before that.
>>
>>> However, I think
>>> that will be somewhat major change
>>
>> I agree.
>>
>>>  still won't address all kind of
>>> cases (like for ordered paths) unless we can try to get all kind of
>>> targets pushed down in the call stack.
>>
>> I didn't understand that.
>>
>
> The places where we use a target different than the target which is
> pushed via query planner can cause a similar problem (ex. see the
> usage of adjust_paths_for_srfs) because the cost of that target
> wouldn't be taken into consideration for Gather's costing.
>

Right. Right now apply_projection_to_path() or adjust_paths_for_srfs()
do not take into consideration the type of path whose cost is being
adjusted for the new targetlist. That works for most of the paths but
not all the paths like custom, FDW or parallel paths. The idea I am
proposing is to compute final targetlist before query planner so that
it's available when we create paths for the topmost scan/join
relation. That way, any path creation logic can then take advantage of
this list to compute costs, not just parallel paths.

--
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: non-bulk inserts and tuple routing

2018-02-18 Thread Amit Langote
Fujita-san,

On 2018/02/16 19:50, Etsuro Fujita wrote:
> (2018/02/16 18:23), Amit Langote wrote:
>> Good idea.  Done.
> Thanks.  I fixed a typo (s/Converti/Convert/) and adjusted these comments
> a bit further to match the preceding code/comments.  Attached is the
> updated version.

Thank you for updating the patch.

>> Updated patch attached.
> 
> Thanks, I think the patch is in good shape, so I'll mark this as ready for
> committer.

Thanks a lot for reviewing!

Attached rebased patch.

Regards,
Amit
From 20852fe6cc2f1b8bd57d8bae528083c9f8092dab Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 19 Feb 2018 13:15:44 +0900
Subject: [PATCH v11] During tuple-routing, initialize per-partition objects
 lazily

Those objects include ResultRelInfo, tuple conversion map,
WITH CHECK OPTION quals and RETURNING projections.

This means we don't allocate these objects for partitions that are
never inserted into.
---
 src/backend/commands/copy.c|  10 +-
 src/backend/executor/execPartition.c   | 309 -
 src/backend/executor/nodeModifyTable.c | 131 ++
 src/include/executor/execPartition.h   |   9 +-
 4 files changed, 252 insertions(+), 207 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d5883c98d1..4562a5121d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2469,7 +2469,7 @@ CopyFrom(CopyState cstate)
PartitionTupleRouting *proute;
 
proute = cstate->partition_tuple_routing =
-   ExecSetupPartitionTupleRouting(NULL, cstate->rel, 1, 
estate);
+   ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
/*
 * If we are capturing transition tuples, they may need to be
@@ -2606,6 +2606,14 @@ CopyFrom(CopyState cstate)
 */
saved_resultRelInfo = resultRelInfo;
resultRelInfo = proute->partitions[leaf_part_index];
+   if (resultRelInfo == NULL)
+   {
+   resultRelInfo = ExecInitPartitionInfo(NULL,
+   
  saved_resultRelInfo,
+   
  proute, estate,
+   
  leaf_part_index);
+   Assert(resultRelInfo != NULL);
+   }
 
/* We do not yet have a way to insert into a foreign 
partition */
if (resultRelInfo->ri_FdwRoutine)
diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 00523ce250..04f76b123a 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -44,18 +44,23 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation 
rel,
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
+ *
+ * While we allocate the arrays of pointers of ResultRelInfo and
+ * TupleConversionMap for all partitions here, actual objects themselves are
+ * lazily allocated for a given partition if a tuple is actually routed to it;
+ * see ExecInitPartitionInfo.  However, if the function is invoked for update
+ * tuple routing, caller would already have initialized ResultRelInfo's for
+ * some of the partitions, which are reused and assigned to their respective
+ * slot in the aforementioned array.
  */
 PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
-  Relation rel, Index 
resultRTindex,
-  EState *estate)
+ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
TupleDesc   tupDesc = RelationGetDescr(rel);
List   *leaf_parts;
ListCell   *cell;
int i;
-   ResultRelInfo *leaf_part_arr = NULL,
-  *update_rri = NULL;
+   ResultRelInfo *update_rri = NULL;
int num_update_rri = 0,
update_rri_index = 0;
boolis_update = false;
@@ -76,6 +81,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
proute->parent_child_tupconv_maps =
(TupleConversionMap **) palloc0(proute->num_partitions *

sizeof(TupleConversionMap *));
+   proute->partition_oids = (Oid *) palloc(proute->num_partitions *
+   
sizeof(Oid));
 
/* Set up detail

Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-18 Thread Ashutosh Bapat
On Mon, Feb 19, 2018 at 9:35 AM, Ashutosh Bapat
 wrote:
> On Sat, Feb 17, 2018 at 8:17 AM, Amit Kapila  wrote:
>> On Fri, Feb 16, 2018 at 9:29 AM, Ashutosh Bapat
>>  wrote:
>>> On Thu, Feb 15, 2018 at 7:47 PM, Amit Kapila  
>>> wrote:
 On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
  wrote:
> I happened to look at the patch for something else. But here are some
> comments. If any of those have been already discussed, please feel
> free to ignore. I have gone through the thread cursorily, so I might
> have missed something.
>
> In grouping_planner() we call query_planner() first which builds the
> join tree and creates paths, then calculate the target for scan/join
> rel which is applied on the topmost scan join rel. I am wondering
> whether we can reverse this order to calculate the target list of
> scan/join relation and pass it to standard_join_search() (or the hook)
> through query_planner().
>

 I think the reason for doing in that order is that we can't compute
 target's width till after query_planner().  See the below note in
 code:

 /*
 * Convert the query's result tlist into PathTarget format.
 *
 * Note: it's desirable to not do this till after query_planner(),
 * because the target width estimates can use per-Var width numbers
 * that were obtained within query_planner().
 */
 final_target = create_pathtarget(root, tlist);

 Now, I think we can try to juggle the code in a way that the width can
 be computed later, but the rest can be done earlier.
>>>
>>> /* Convenience macro to get a PathTarget with valid cost/width fields */
>>> #define create_pathtarget(root, tlist) \
>>> set_pathtarget_cost_width(root, make_pathtarget_from_tlist(tlist))
>>> create_pathtarget already works that way. We will need to split it.
>>>
>>> Create the Pathtarget without widths. Apply width estimates once we
>>> know the width of Vars somewhere here in query_planner()
>>> /*
>>>  * We should now have size estimates for every actual table involved in
>>>  * the query, and we also know which if any have been deleted from the
>>>  * query by join removal; so we can compute total_table_pages.
>>>  *
>>>  * Note that appendrels are not double-counted here, even though we 
>>> don't
>>>  * bother to distinguish RelOptInfos for appendrel parents, because the
>>>  * parents will still have size zero.
>>>  *
>>> The next step is building the join tree. Set the pathtarget before that.
>>>
 However, I think
 that will be somewhat major change
>>>
>>> I agree.
>>>
  still won't address all kind of
 cases (like for ordered paths) unless we can try to get all kind of
 targets pushed down in the call stack.
>>>
>>> I didn't understand that.
>>>
>>
>> The places where we use a target different than the target which is
>> pushed via query planner can cause a similar problem (ex. see the
>> usage of adjust_paths_for_srfs) because the cost of that target
>> wouldn't be taken into consideration for Gather's costing.
>>
>
> Right. Right now apply_projection_to_path() or adjust_paths_for_srfs()
> do not take into consideration the type of path whose cost is being
> adjusted for the new targetlist. That works for most of the paths but
> not all the paths like custom, FDW or parallel paths. The idea I am
> proposing is to compute final targetlist before query planner so that
> it's available when we create paths for the topmost scan/join
> relation. That way, any path creation logic can then take advantage of
> this list to compute costs, not just parallel paths.

In fact, we should do this not just for scan/join relations, but all
the upper relations as well. Upper relations too create parallel
paths, whose costs need to be adjusted after their targetlists are
updated by adjust_paths_for_srfs(). Similar adjustments are needed for
any FDWs, custom paths which cost targetlists differently.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] path toward faster partition pruning

2018-02-18 Thread Amit Langote
Hi David.

On 2018/02/17 18:24, David Rowley wrote:
> On 2 February 2018 at 23:03, Amit Langote  
> wrote:
>> I started wondering if it's not such a good idea to make
>> PartitionClauseInfo a Node at all?  I went back to your earlier message
>> [1] where you said that it's put into the Append node for run-time pruning
>> to use, but it doesn't sound nice that we'd be putting into the plan
>> something that's looks more like scratchpad for the partition.c code.  I
>> think we should try to keep PartitionClauseInfo in partition.h and put
>> only the list of matched bare clauses into Append.
> 
> That sounds like a good idea.
> 
> A patch which puts this back is attached.
> 
> I've changed the run-time prune patch to process the clause lists
> during execution instead.

Thank you.  I'll incorporate it in the version I'll send next.

Regards,
Amit




Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2018-02-18 Thread David Rowley
On 19 February 2018 at 15:11, Tomas Vondra  wrote:
> 1) I can confirm that it indeed eliminates the Append overhead, using
> the example from [1], modified to use table with a single partition. Of
> course, this is a pretty formal check, because the patch simply removes
> the Append node and so it'd be utterly broken if the overhead did not
> disappear too.
>
> [1]
> https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com

Thanks for testing that.

> 2) If I had to guess where the bugs will be, my guess would be a missing
> finalize_plan_tlist call - not necessarily in existing code, but perhaps
> in future improvements.

I agree with that. I'd say though that, additionally, in the future
we might end up with some missing translation calls to
replace_translatable_exprs().

To try to ensure I didn't miss anything, I did previously modify the
regression test tables "tenk" and "tenk1" to become partitioned tables
with a single partition which allowed all possible values to try to
ensure I'd not missed anywhere. I just failed to find a reasonable way
to make this a more permanent test without enforcing that change on
the tables as a permanent change.

> I wonder if we could automate this call somehow, say by detecting when
> create_plan_recurse is called after build_path_tlist (for a given path),
> and if needed calling finalize_plan_tlist from create_plan_recurse.

Sounds nice, but I'm unsure how we could do that.

> Although, now that I look at it, promote_child_relation may be doing
> something like that by adding stuff to the translate_* variables.

Do you mean this?

/*
* Record this child as having been promoted.  Some places treat child
* relations in a special way, and this will give them a VIP ticket to
* adulthood, where required.
*/
root->translated_childrelids =
bms_add_members(root->translated_childrelids, child->relids);

That's to get around a problem in use_physical_tlist() where there's
different behaviour for RELOPT_OTHER_MEMBER_REL than there is for
RELOPT_BASEREL. Another option might be to instead change the
RELOPT_OTHER_MEMBER_REL  into  RELOPT_BASEREL, although I was a little
too scared that might result in many other areas requiring being
changed. I may be wrong about that though. We do, for example,
occasionally change a RELOPT_BASEREL into a RELOPT_DEADREL, so
RelOptInfos changing their RelOptKind is not a new concept, so perhaps
it's fine.

> I'm
> not quite sure why we need to append to the lists, though?

Multiple Append rels may be replaced by their only-child relation, so
we'd need to be able to translate the targetlists for both. For
example, joins above the Appends may contain Vars which belong to
either of the Appends and those need to be translated into the
promoted child relation's Vars.

> 3) I'm not quite sure what replace_translatable_exprs does, or more
> precisely why do we actually need it. At this point the comment pretty
> much says "We need to do translation. This function does translation."
> Perhaps it should mention why removing/skipping the AppendPath implies
> the need for translation or something?

It does mention "Expr translation is required to translate the
targetlists of nodes above the Append", but perhaps I can make that a
bit more clear.

Let's say we have a query such as:

SELECT * FROM fact f INNER JOIN dim1 d ON f.dim1_id = d.id WHERE
f.date BETWEEN '2018-01-01' AND '2018-01-31';

Which results in a hash join and a single Jan 2018 partition being
scanned for "fact". A plan without the Append node having been removed
would have the Append targetlist containing the Vars from the "fact"
table, as would the Hash Join node... The problem this function is
trying to solve is translating the Vars in the Hash Join node (or any
node above the Append) to change the "fact" Vars into "fact_jan2018"
Vars. In create_plan.c we skip over any isproxy Append paths and
instead return the recursive call to the single proxy-path. Without
the translation we'd run into problems when trying to find Vars in the
targetlists later.

> 4) The thread talks about "[Merge]Append" but the patch apparently only
> tweaks create_append_path and does absolutely nothing to "merge" case.
> While looking into why, I notices that generate_mergeappend_paths always
> gets all_child_pathkeys=NULL in this case (with single subpath), and so
> we never create the MergePath at all. I suppose there's some condition
> somewhere responsible for doing that, but I haven't found it ...

Yeah, only Append paths are used as proxy paths. The confusion here is
that the path creation logic has also been changed in allpaths.c (see
generate_proxy_paths()), so that we no longer build MergeAppend paths
when there's a single subpath. It's probably just the fact that Append
is being hi-jacked to act as ProxyPath that's causing the confusion
there. If you look over generate_proxy_paths and where it gets called
from, you'll see that we, instead of creating App

Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-18 Thread Ashutosh Bapat
Commit 2fb1abaeb016aeb45b9e6d0b81b7a7e92bb251b9, changed
enable_partition_wise_join to enable_partitionwise_join. This patch
too should use enable_partitionwise_agg instead of
enable_partition_wise_agg.



Re: [PATCH] Add missing type conversion functions for PL/Python

2018-02-18 Thread Anthony Bykov
On Wed, 31 Jan 2018 11:57:12 +0800
Haozhou Wang  wrote:

> Hi All,
> 
> PL/Python already has various type conversion functions to
> convert a PostgreSQL datum to a Python object. However, the conversion
> functions from Python object to PostgreSQL datum only cover the Boolean,
> Bytea and String types.
> 
> In this patch, we add the remaining integer- and float-related datatype
> conversion functions, which greatly increases the performance of data
> conversion, especially when returning a large array.
> 
> We did a quick test about the performance of returning array in
> PL/Python:
> 
> The following UDF is used for test:
> 
> ```
> CREATE OR REPLACE FUNCTION pyoutfloat8(num int) RETURNS float8[] AS $$
> return [x/3.0 for x in range(num)]
> $$ LANGUAGE plpythonu;
> ```
> 
> The test command is
> 
> ```
> select count(pyoutfloat8(n));
> ```
> 
> The count is used to avoid large output; n is the number of
> elements in the returned array, ranging from 1.5k to 15m.
> 
> Size of Array    | 1.5k     | 15k       | 150k      | 1.5m      | 15m
> -----------------+----------+-----------+-----------+-----------+-----------
> Origin           | 2.324ms  | 19.834ms  | 194.991ms | 1927.28ms | 19982.1ms
> With this patch  | 1.168ms  | 3.052ms   | 21.888ms  | 213.39ms  | 2138.5ms
> 
> All tests for both PG and PL/Python pass.
> 
> Thanks very much.
> 
> 

Hello,
this sounds like a really nice patch. I've started looking
through the code and noticed what seems to be a typo (or I just couldn't
understand what you meant) in the comments.

file "src/pl/plpython/plpy_typeio.c"
the comment is 
* If can not convert if directly, fallback to PLyObject_ToDatum
* to convert it

Maybe it should be something like ("it" instead of second "if")
* If can not convert it directly, fallback to PLyObject_ToDatum
* to convert it

And the same typo is repeated several times in comments.
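
As a side note, for reviewers less familiar with plpy_typeio.c: the shape of
such a fast path (try a direct C-level conversion and fall back to the
generic PLyObject_ToDatum route otherwise) might look roughly like the
untested sketch below. The function name and the exact signature are
invented here for illustration and may differ from the actual patch.

```
/*
 * Hypothetical sketch of a direct float8 conversion.  Try to read the
 * Python object as a C double; if that fails, clear the Python error
 * and fall back to the generic conversion, which goes through the
 * type's input function.
 */
static Datum
PLyNumber_ToFloat8(PLyObToDatum *arg, PyObject *plrv, bool *isnull, bool inarray)
{
	double		val;

	if (plrv == Py_None)
	{
		*isnull = true;
		return (Datum) 0;
	}
	*isnull = false;

	val = PyFloat_AsDouble(plrv);
	if (val == -1.0 && PyErr_Occurred())
	{
		/* not a number: clear the error and take the slow, generic path */
		PyErr_Clear();
		return PLyObject_ToDatum(arg, plrv, isnull, inarray);
	}

	return Float8GetDatum(val);
}
```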

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[doc fix] Correct calculation of vm.nr_hugepages

2018-02-18 Thread Tsunakawa, Takayuki
Hello,

The current PostgreSQL documentation overestimates the number of huge pages 
(vm.nr_hugepages) because the calculation uses the maximum virtual address 
space.  In practice, huge pages are only used for the anonymous shared memory 
segment.  The attached patch fixes the documentation.
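
For example, if the postmaster's anonymous shared memory segment (visible in
its pmap output as one large anonymous mapping) is about 6490428 kB and
/proc/meminfo reports Hugepagesize: 2048 kB, then about 6490428 / 2048 = 3169.2,
so roughly 3170 huge pages are sufficient -- typically far fewer than a value
derived from the process's total virtual address space. (These numbers are only
an illustration; the actual segment size depends mainly on shared_buffers.)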

FYI, Oracle presents a shell script to calculate the number of huge pages for 
its shared memory segments:

https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR385

Regards
Takayuki Tsunakawa



hugepage_size_doc.patch
Description: hugepage_size_doc.patch


Re: spelling of enable_partition_wise_join

2018-02-18 Thread Ashutosh Bapat
On Mon, Feb 19, 2018 at 9:08 AM, Ashutosh Bapat
 wrote:
>
> Thanks. There are functions like try_partition_wise_join(),
> generate_partition_wise_join_paths() which use partition_wise
> spelling. Should we update those as well?

I see that the commit already did that. Sorry. Still, I would have
liked the comments to spell "partition-wise" instead of "partitionwise".
https://dictionary.cambridge.org/dictionary/english/wise
differentiates between "-wise" and "wise". The latter is used when
indicating a direction and the former indicates "relating to". I
think here we want the former.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: pgbench - allow to specify scale as a size

2018-02-18 Thread Fabien COELHO


Hello Alvaro & Tom,

> Why not then insert a "few" rows, measure the size, truncate the table,
> compute the formula and then insert up to the desired user-requested
> size? (Or insert what should be the minimum, scale 1, measure, and
> extrapolate what's missing.) It doesn't sound too complicated to me,
> and targeting a size is something that I believe is quite useful for
> the user.


The formula I used approximates the whole database, not just one table. 
There was one for the table, but this is only part of the issue. In 
particular, ISTM that index sizes should be included when caching is 
considered.


Also, index sizes probably grow as n ln(n), so some level of
approximation is inevitable.
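
(In other words, the fitted model is presumably something of the form
db_size(scale) ~ c0 + c1 * scale + c2 * scale * ln(scale), with the
coefficients obtained by regression over measured databases; the exact
coefficients are not quoted here and would need to come from the patch.)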


Moreover, the intrinsic granularity of TPC-B, as a multiple of 100,000 
rows, makes it not very precise wrt size anyway.


> Sure, makes sense, so my second suggestion seems more reasonable: insert
> with scale 1, measure there (ok, you might need to create indexes only to
> later drop them), and if the computed scale > 1 then insert whatever is
> left to insert. Shouldn't be a big deal to me.


I could implement that, even though it would still involve some 
approximation: ISTM that the regression Kaarel performed over very large 
scales is significantly more precise than testing with scale 1 (typically 
a few MiB) and extrapolating that to hundreds of GiB.
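
As a rough illustration (numbers approximate): a freshly initialized scale-1 
database, indexes included, weighs a bit over 20 MiB, but each additional 
unit of scale only adds something like 15 MiB, because the fixed per-database 
overhead is counted only once. So dividing a 100 GiB target by the scale-1 
footprint would suggest a scale around 5000 and end up noticeably below the 
target, whereas the regression over large scales absorbs that bias.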


Maybe it could be done with a kind of open-ended dichotomy, but creating 
and recreating indexes looks like an ugly solution, and what should matter 
is the whole database size, including the tellers & branches tables and 
all indexes, so I'm not convinced. Now, as the tellers & branches tables 
have basically the same structure as accounts, they could simply be scaled 
by assuming the same storage cost per row.


Anyway, even if I do not like it, it could be better than nothing. The key 
point for me is that if Tom is dead set against the feature, the patch is 
dead anyway.


Tom, would Alvaro's approach be more acceptable to you than a fixed formula 
that would need updating, keeping in mind that such a feature implies 
some level of approximation?


--
Fabien.