Re: [HACKERS] Why does plpython delay composite type resolution?

2016-12-20 Thread Andreas Karlsson

On 12/21/2016 04:14 AM, Jim Nasby wrote:

Why do functions that accept composite types delay type resolution until
execution? I have a naive patch that speeds up plpy.execute() by 8% by
caching interned Python strings for the dictionary key names (which are
repeated over and over). The next step is to just pre-allocate those
strings as appropriate for the calling context, but it's not clear how
to handle that for input arguments.


Does your patch handle "ALTER TYPE name ADD ATTRIBUTE ..."? My immediate 
guess would be that it could be a cache invalidation thing.


Andreas




Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-20 Thread Ants Aasma
On Wed, Dec 21, 2016 at 4:14 AM, Craig Ringer  wrote:
> Re the patch, I don't like
>
> -   static bool master_has_standby_xmin = false;
> +   static bool master_has_standby_xmin = true;
>
> without any comment. It's addressed in the comment changes on the
> header func, but the link isn't obvious. Maybe a oneliner to say
> "ensure we always send at least one feedback message" ?

Will fix.

> I think this part of the patch is correct and useful.
>
>
>
> I don't see the point of
>
> +*
> +* If Hot Standby is not yet active we reset the xmin value. Check this
> +* after the interval has expired to reduce number of calls.
>  */
> -   if (hot_standby_feedback)
> +   if (hot_standby_feedback && HotStandbyActive())
>
> though. Forcing feedback to send InvalidTransactionId until hot
> standby feedback is actively running seems counterproductive; we want
> to lock in feedback as soon as possible, not wait until we're
> accepting transactions. Simon and I are in fact working on changes to
> do the opposite of what you've got here and ensure that we don't allow
> hot standby connections until we know hot_standby_feedback is in
> effect and a usable xmin is locked in. That way the master won't
> remove tuples we need as soon as we start an xact and cause a conflict
> with recovery.
>
> If there are no running xacts, GetOldestXmin() will return the slot
> xmin if any. We should definitely not be clearing that just because
> we're not accepting hot standby connections yet; we want to make very
> sure it remains in effect unless the user explicitly de-configures hot
> standby.
>
>  (There's another pre-existing problem there, where we can start the
> walsender before slots are initialized, that I'll be addressing
> separately).
>
> If there's no slot xmin either, we'll send nextXid. That's a sensible
> starting point for hot standby feedback until we start some
> transactions.
>
> So -1 on this part of the patch, unless there's something I've misunderstood.

Currently no feedback is sent if hot standby is not active. I was not
sure whether it was safe to call GetOldestXmin() in that case. However,
I did not consider slots on cascading replicas wanting to hold back
xmin, where resetting the parent's xmin is indeed wrong. Do you know
whether GetOldestXmin() is safe at this point, so we can just remove
the HotStandbyActive() check? Otherwise I think the correct approach
is to move the check and early return inside the hot_standby_feedback
case, like this:

if (hot_standby_feedback)
{
    if (!HotStandbyActive())
        return;
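
Filled out slightly, the restructured path might look like this (a
sketch against the 9.6-era walreceiver feedback code, not a tested
patch; the GetOldestXmin() arguments here are an assumption):

TransactionId xmin;

if (hot_standby_feedback)
{
    if (!HotStandbyActive())
        return;                     /* too early to compute a useful xmin */

    /* safe now to compute the oldest xmin to report upstream */
    xmin = GetOldestXmin(NULL, false);
}
else
{
    /* feedback turned off: explicitly release any previous reservation */
    xmin = InvalidTransactionId;
}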

Regards,
Ants Aasma




Re: [HACKERS] too low cost of Bitmap index scan

2016-12-20 Thread Pavel Stehule
2016-12-21 0:01 GMT+01:00 Tom Lane :

> Pavel Stehule  writes:
> > I am trying to fix slow query on PostgreSQL 9.5.4.
> > The data are almost in RAM
>
> If it's all in RAM, you'd likely be well-served to lower random_page_cost.
> It looks to me like the planner is estimating pretty accurately how many
> heap fetches will be eliminated by using the extra index; where it's off
> seems to be in the cost of those heap fetches relative to the index work.
>

When I decrease random_page_cost, the cost of the bitmap scan decreases
too:

https://explain.depesz.com/s/7CAJ .. random page cost 2
https://explain.depesz.com/s/iEBW .. random page cost 2, bitmapscan off
https://explain.depesz.com/s/W4zw .. random page cost 1
https://explain.depesz.com/s/Gar .. random page cost 1, bitmapscan off

I played with other costs, but without any success; the cost of the
bitmap scan remains significantly cheaper than the index scan.

Regards

Pavel


> regards, tom lane
>


Re: [HACKERS] Logical decoding on standby

2016-12-20 Thread Petr Jelinek
On 21/12/16 04:06, Craig Ringer wrote:
> On 20 December 2016 at 15:03, Petr Jelinek wrote:
> 
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).
> 
> Yes, and that was actually my originally preferred approach, though
> the one above does offer the advantage that if something goes wrong we
> can detect it and fail gracefully. Possibly not worth the complexity
> though.
> 
> Your approach requires us to make very sure that hot_standby_feedback
> does not get turned off by user or become ineffective once we're
> replicating, though, since we won't have any way to detect when needed
> tuples are removed. We'd probably just bail out with relcache/syscache
> lookup errors, but I can't guarantee we wouldn't crash if we tried
> logical decoding on WAL where needed catalogs have been removed.
> 
> I initially ran into trouble doing that, but now think I have a way forward.
> 
>> WRT preventing
>> hs_feedback going off, we can refuse to start with hs_feedback off when
>> there are logical slots detected.
> 
> Yes. There are some ordering issues there though. We load slots quite
> late in startup and they don't get tracked in checkpoints. So we might
> launch the walreceiver before we load slots and notice their needed
> xmin/catalog_xmin. So we need to prevent sending of
> hot_standby_feedback until slots are loaded, or load slots earlier in
> startup. The former sounds less intrusive and safer - probably just
> add an "initialized" flag to ReplicationSlotCtlData and suppress
> hs_feedback until it becomes true.
> 
> We'd also have to suppress the validation callback action on the
> hot_standby_feedback GUC until we know replication slot state is
> initialised, and perform the check during slot startup instead. The
> hot_standby_feedback GUC validation check might get called before
> shmem is even set up so we have to guard against attempts to access a
> shmem segment that may not even exist yet.
> 
> The general idea is workable though. Refuse to start if logical slots
> exist and hot_standby_feedback is off or walreceiver isn't using a
> physical slot. Refuse to allow hot_standby_feedback to change
> 

These are all problems associated with replication slots being in shmem,
if I understand correctly. I wonder, could we put just a bool someplace
that is available early and says whether any logical slots are defined?
We don't actually need all the slot info, just to know whether there
are any.

> 
>> You may ask what if user drops the slot and recreates or somehow
>> otherwise messes up catalog_xmin on master, well, considering that under
>> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
>> know about it so we'd either mark the local slots broken/drop them or
>> plainly refuse to connect to the master same way as if it didn't have
>> required WAL anymore (not sure which behavior is better). Note that user
>> could mess up catalog_xmin even in your design the same way, so it's not
>> really a regression.
> 
> Agreed. Checking catalog_xmin of the slot when we connect is
> sufficient to guard against that, assuming we can trust that the
> catalog_xmin is actually in effect on the master. Consider cascading
> setups, where we set our catalog_xmin but it might not be "locked in"
> until the middle cascaded server relays it to the master.
> 
> I have a proposed solution to that which I'll outline in a separate
> patch+post; it ties in to some work on addressing the race between hot
> standby feedback taking effect and queries starting on the hot
> standby.  It boils down to "add a hot_standby_feedback reply protocol
> message".
> 
>> Plus
>> it might even be okay to only allow creating logical slots on standbys
>> connected directly to master in v1.
> 
> True. I didn't consider that.
> 
> We haven't had much luck in the past with such limitations, but
> personally I'd consider it a perfectly reasonable one.
> 

I think it's infinitely better with that limitation than the status quo,
especially for the failover scenario (you usually won't fail over to a
replica further down the cascade, as it's always more behind). Not to
mention that with every level of cascading you automatically get more
lag, which means more bloat, so it might not even be desirable to go
that route immediately in v1 when we don't have a way to control that
bloat/maximum slot lag.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Dev

Re: [HACKERS] Logical decoding on standby

2016-12-20 Thread Petr Jelinek
On 21/12/16 04:31, Craig Ringer wrote:
> On 20 December 2016 at 15:03, Petr Jelinek wrote:
> 
>> But in 0003 I don't understand following code:
>>> + if (endpos != InvalidXLogRecPtr && !do_start_slot)
>>> + {
>>> + fprintf(stderr,
>>> + _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>>> + progname);
>>> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>>> + progname);
>>> + exit(1);
>>> + }
>>
>> Why is --create-slot and --endpos not allowed together?
> 
> Actually, the test is fine, the error is just misleading due to my
> misunderstanding.
> 
> The fix is simply to change the error message to
> 
> _("%s: --endpos may only be specified with --start\n"),
> 
> so I won't post a separate followup patch.
> 

Ah, okay, that makes sense. --create-slot + --endpos should definitely
be an allowed combination, especially now that we can extend this to
optionally use a temporary slot.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_background contrib module proposal

2016-12-20 Thread Andrew Borodin
2016-12-21 4:55 GMT+05:00 David Fetter :
> I see.
>
> I find the following a little easier to follow, and the sleeps still
> work even when very short.

Now I know a little more SQL :) Thank you.

I'm not sure every platform supports microsecond sleeps, but it will
work anyway: this test is deterministic. Without the sequence there is
no race condition, so it is not sleepsort; it is deterministic. Though
I agree that is a good thing for a test, I'd still add some
milliseconds to the case where the main query definitely has to wait
for the other sleeping query to finish.
>
> CREATE TABLE input AS
> SELECT x, row_number() OVER (ORDER BY x) n
> FROM
> generate_series(0,.05,0.01) x
> ORDER BY x DESC;
>
> CREATE TABLE output(place int,value float);
>
> CREATE TABLE handles AS
> SELECT pg_background_launch(
> format($$
> SELECT pg_sleep(%s);
> INSERT INTO output VALUES (%s, %s)
> $$,
> x, n, x
> )
> ) h
> FROM input;
>
> --wait until everyone finishes
> SELECT * FROM handles JOIN LATERAL pg_background_result(handles.h) AS (x 
> TEXT) ON (true);
>
> --output results
> SELECT * FROM output ORDER BY place;

Best regards, Andrey Borodin.




Re: [HACKERS] bigint vs txid user confusion

2016-12-20 Thread Jim Nasby

On 12/20/16 10:20 PM, Craig Ringer wrote:

Tools look at pg_class.relfrozenxid and pg_database.datfrozenxid more
than probably anything else, so making changes that ignore them is
pretty pointless.


Except the only useful way I know of to access *frozenxid is using
age(), and even that is a royal PITA when the xid is a special xid. So
I'd argue that we should effectively remove xid from the user's view.
Even if we don't want to bloat pg_class by 4 bytes, we should just make
xid even more opaque than it is today and tell users to cast it to
bigxid.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] simplehash vs. pgindent

2016-12-20 Thread Tom Lane
Robert Haas  writes:
> In commit b30d3ea824c5ccba43d3e942704f20686e7dbab8, when Andres added
> the simplehash stuff, he also added SH_TYPE, SH_ITERATOR, and
> SH_STATUS to typedefs.list.  When I subsequently updated typedefs.list
> from the buildfarm in acddbe221b084956a0efd6e4b6c6586e8fd994d7, it of
> course removed those entries since they are merely placeholders for
> real types, not anything that would ever appear in a symbol table.  So
> now if you pgindent the simplehash stuff, it does stupid things.

> I think we should add a file src/tools/pgindent/typedefs.extra.list.
> Somehow teach pgindent to search whatever directory it searches for
> typedefs.list for typedefs.extra.list as well.  Then we can put stuff
> in there that the buildfarm doesn't find, and pgindent can combine the
> files in load_typedefs().  That way we can still update typedefs.list
> straight from the buildfarm, but there's a place to put manual entries
> that we don't want to lose.  We could also teach load_typedefs() to
> strip out lines that are empty or start with #, and then we'd be able
> to put comments in typedefs.extra.list explaining why each entry needs
> to be present there rather than relying on the normal mechanism.

Or we could just fix the code so it doesn't use phony typedefs?
If a typedef is never used as a variable or field type, how much
do we really need it?

regards, tom lane




Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Amit Langote
On 2016/12/21 14:03, Amit Langote wrote:
> OK, updated patch attached.

Oops, that was an incomplete patch.  Complete patch attached this time.

Thanks,
Amit
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1c219b03dd..115b98313e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13297,8 +13297,11 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 		}
 	}
 
+	/* It's safe to skip the validation scan after all */
 	if (skip_validate)
-		elog(NOTICE, "skipping scan to validate partition constraint");
+		ereport(INFO,
+				(errmsg("partition constraint for table \"%s\" is implied by existing constraints",
+						RelationGetRelationName(attachRel))));
 
 	/*
 	 * Set up to have the table to be scanned to validate the partition
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 99e20eb922..62e18961d3 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3179,7 +3179,7 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-NOTICE:  skipping scan to validate partition constraint
+INFO:  partition constraint for table "part_3_4" is implied by existing constraints
 -- check validation when attaching range partitions
 CREATE TABLE range_parted (
 	a int,
@@ -3204,7 +3204,7 @@ CREATE TABLE part2 (
 	b int NOT NULL CHECK (b >= 10 AND b < 18)
 );
 ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
-NOTICE:  skipping scan to validate partition constraint
+INFO:  partition constraint for table "part2" is implied by existing constraints
 -- check that leaf partitions are scanned when attaching a partitioned
 -- table
 CREATE TABLE part_5 (
@@ -3219,7 +3219,7 @@ ERROR:  partition constraint is violated by some row
 DELETE FROM part_5_a WHERE a NOT IN (3);
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-NOTICE:  skipping scan to validate partition constraint
+INFO:  partition constraint for table "part_5" is implied by existing constraints
 -- check that the table being attached is not already a partition
 ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
 ERROR:  "part_2" is already a partition



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

2016-12-20 Thread Kyotaro HORIGUCHI
At Tue, 20 Dec 2016 22:42:48 -0500, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO  wrote:
> > Would this approach be acceptable, or is modifying Nodes a no-go area?
> >
> > If it is acceptable, I can probably put together a patch and submit it.
> >
> > If not, I suggest to update the documentation to tell that
> > pg_stat_statements does not work properly with combined queries.
> 
> I think you've found a bug, but I'm a little doubtful about your
> proposed fix.  However, I haven't studied the code, so I don't know
> what other approach might be better.

That would work and is doable, but the more fundamental issue here
seems to be that post_parse_analyze_hook and other similar hooks
are called with a Query inconsistent with debug_query_string. The
variable is very convenient to use here, but its name seems to
suggest that such usage is outside its intended purpose. I'm not
sure, though.

A similar behavior shows up in error messages, but there it does far
less harm than in the pg_stat_statements case.

ERROR:  column "b" does not exist
LINE 1: ...elect * from t where a = 1; select * from t where b = 2; com...
 ^


1. Let every parse node carry its location in
   debug_query_string. (The original proposal; a sketch follows
   below.)

2. Let the parser return once for each statement (like the psql
   parser) and store the corresponding query string in a
   variable other than debug_query_string.

...
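
A sketch of what option 1 could look like (a hypothetical node shape,
not existing PostgreSQL code):

typedef struct RawStmt
{
    NodeTag     type;
    Node       *stmt;           /* the parse tree proper */
    int         stmt_location;  /* byte offset into debug_query_string */
    int         stmt_len;       /* length in bytes; 0 means rest of string */
} RawStmt;

/* a hook such as pg_stat_statements could then slice out its statement: */
const char *querytext = debug_query_string + raw->stmt_location;
int         querylen = raw->stmt_len > 0 ? raw->stmt_len : strlen(querytext);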


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Parallel bitmap heap scan

2016-12-20 Thread Dilip Kumar
Rebased on the current head.

On Tue, Dec 13, 2016 at 12:06 PM, Dilip Kumar  wrote:
> On Sat, Dec 10, 2016 at 5:36 PM, Amit Kapila  wrote:
>> Few assorted comments:
>
> Thanks for the review
>
>>
>> 1.
>> + else if (needWait)
>> + {
>> + /* Add ourself to wait queue */
>> + ConditionVariablePrepareToSleep(&pbminfo->cv);
>> + queuedSelf = true;
>> + }
>>
>> With the committed version of condition variables, you can avoid
>> calling ConditionVariablePrepareToSleep().  Refer latest parallel
>> index scan patch [1].
>
> Oh, I see, Fixed.
>>
>> 2.
>> + ParallelBitmapScan
>> + Waiting for leader to complete the TidBitmap.
>> +
>> +
>>
>> Isn't it better to write it as below:
>>
>> Waiting for the leader backend to form the TidBitmap.
> Done this way..
>>
>> 3.
>> + * Update snpashot info in heap scan descriptor.
>>
>> /snpashot/snapshot
> Fixed
>>
>> 4.
>>  #include "utils/tqual.h"
>> -
>> +#include "miscadmin.h"
>>
>>  static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
>> -
>> +static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
>>
>>   TBMIterateResult *tbmres;
>> -
>> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
>>
>>  static int tbm_comparator(const void *left, const void *right);
>> -
>> +void * tbm_alloc_shared(Size size, void *arg);
>>
>> It seems line deletes at above places are spurious.  Please check for
>> similar occurrences at other places in patch.
>>
> Fixed
>> 5.
>> + bool is_shared; /* need to build shared tbm if set*/
>>
>> space is required towards the end of the comment (set */).
> Fixed
>>
>> 6.
>> + /*
>> + * If we have shared TBM means we are running in parallel mode.
>> + * So directly convert dsa array to page and chunk array.
>> + */
>>
>> I think the above comment can be simplified as "For shared TBM,
>> directly convert dsa array to page and chunk array"
>
> Done this way..
>>
>> 7.
>> + dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));
>>
>> extra space is missing at multiple places in above line. It should be
>> written as below:
>>
>> dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));
> Fixed..
>
>>
>> Similar stuff needs to be taken care at other places in the patch as
>> well.  I think it will be better if you run pgindent on your patch.
>
> I have run pgindent and taken all the changes applicable to the
> added code.
>
> There is also some cleanup for "hash-support-alloc-free", so I'm
> attaching a new patch for that as well.
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


hash-support-alloc-free-v5.patch
Description: Binary data


parallel-bitmap-heap-scan-v5.patch
Description: Binary data



Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Amit Langote
On 2016/12/21 13:42, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 9:14 PM, Amit Langote wrote:
>> On 2016/12/21 1:45, Alvaro Herrera wrote:
>>> Robert Haas wrote:
 On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote:
> Even if we decide to keep the message, I think it's not very good
> wording anyhow; as a translator I disliked it on sight.  Instead of
> "skipping scan to validate" I would use "skipping validation scan",
> except that it's not clear what it is we're validating.  Mentioning
> partition constraint in errcontext() doesn't look like a great solution, but
> I can't think of anything better.

 Maybe something like: partition constraint for table \"%s\" is implied
 by existing constraints
>>>
>>> Actually, shouldn't we emit a message if we *don't* skip the check?
>>
>> Scanning (aka, not skipping) to validate the partition constraint is the
>> default behavior, so a user would be expecting it anyway, IOW, need not be
>> informed of it.  But when ATExecAttachPartition's efforts to avoid the
>> scan by comparing the partition constraint against existing constraints
>> (which the user most probably deliberately added just for this) succeed,
>> that seems like a better piece of information to provide the user with,
>> IMHO.  But then again, having a message printed before a potentially long
>> validation scan seems like something a user would like to see, to know
>> what it is that is going to take so long.  Hmm.
>>
>> Anyway, what would the opposite of Robert's suggested message look like:
>> "scanning table \"%s\" to validate partition constraint"?
> 
> Maybe: partition constraint for table \"%s\" is implied by existing 
> constraints

OK, updated patch attached.

Thanks,
Amit
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 2a324c0b49..62e18961d3 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3179,7 +3179,7 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-INFO:  skipping scan to validate partition constraint
+INFO:  partition constraint for table "part_3_4" is implied by existing constraints
 -- check validation when attaching range partitions
 CREATE TABLE range_parted (
 	a int,
@@ -3204,7 +3204,7 @@ CREATE TABLE part2 (
 	b int NOT NULL CHECK (b >= 10 AND b < 18)
 );
 ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
-INFO:  skipping scan to validate partition constraint
+INFO:  partition constraint for table "part2" is implied by existing constraints
 -- check that leaf partitions are scanned when attaching a partitioned
 -- table
 CREATE TABLE part_5 (
@@ -3219,7 +3219,7 @@ ERROR:  partition constraint is violated by some row
 DELETE FROM part_5_a WHERE a NOT IN (3);
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-INFO:  skipping scan to validate partition constraint
+INFO:  partition constraint for table "part_5" is implied by existing constraints
 -- check that the table being attached is not already a partition
 ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
 ERROR:  "part_2" is already a partition



[HACKERS] simplehash vs. pgindent

2016-12-20 Thread Robert Haas
In commit b30d3ea824c5ccba43d3e942704f20686e7dbab8, when Andres added
the simplehash stuff, he also added SH_TYPE, SH_ITERATOR, and
SH_STATUS to typedefs.list.  When I subsequently updated typedefs.list
from the buildfarm in acddbe221b084956a0efd6e4b6c6586e8fd994d7, it of
course removed those entries since they are merely placeholders for
real types, not anything that would ever appear in a symbol table.  So
now if you pgindent the simplehash stuff, it does stupid things.

I think we should add a file src/tools/pgindent/typedefs.extra.list.
Somehow teach pgindent to search whatever directory it searches for
typedefs.list for typedefs.extra.list as well.  Then we can put stuff
in there that the buildfarm doesn't find, and pgindent can combine the
files in load_typedefs().  That way we can still update typedefs.list
straight from the buildfarm, but there's a place to put manual entries
that we don't want to lose.  We could also teach load_typedefs() to
strip out lines that are empty or start with #, and then we'd be able
to put comments in typedefs.extra.list explaining why each entry needs
to be present there rather than relying on the normal mechanism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 9:14 PM, Amit Langote wrote:
> On 2016/12/21 1:45, Alvaro Herrera wrote:
>> Robert Haas wrote:
>>> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote:
 Even if we decide to keep the message, I think it's not very good
 wording anyhow; as a translator I disliked it on sight.  Instead of
 "skipping scan to validate" I would use "skipping validation scan",
 except that it's not clear what it is we're validating.  Mentioning
 partition constraint in errcontext() doesn't look like a great solution, but
 I can't think of anything better.
>>>
>>> Maybe something like: partition constraint for table \"%s\" is implied
>>> by existing constraints
>>
>> Actually, shouldn't we emit a message if we *don't* skip the check?
>
> Scanning (aka, not skipping) to validate the partition constraint is the
> default behavior, so a user would be expecting it anyway, IOW, need not be
> informed of it.  But when ATExecAttachPartition's efforts to avoid the
> scan by comparing the partition constraint against existing constraints
> (which the user most probably deliberately added just for this) succeed,
> that seems like a better piece of information to provide the user with,
> IMHO.  But then again, having a message printed before a potentially long
> validation scan seems like something a user would like to see, to know
> what it is that is going to take so long.  Hmm.
>
> Anyway, what would the opposite of Robert's suggested message look like:
> "scanning table \"%s\" to validate partition constraint"?

Maybe: partition constraint for table \"%s\" is implied by existing constraints

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgstattuple documentation clarification

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 10:01 AM, Tom Lane  wrote:
> Andrew Dunstan  writes:
>> Recently a client was confused because there was a substantial
>> difference between the reported table_len of a table and the sum of the
>> corresponding tuple_len, dead_tuple_len and free_space. The docs are
>> fairly silent on this point, and I agree that in the absence of
>> explanation it is confusing, so I propose that we add a clarification
>> note along the lines of:
>
>> The table_len will always be greater than the sum of the tuple_len,
>> dead_tuple_len and free_space. The difference is accounted for by
>> page overhead and space that is not free but cannot be attributed to
>> any particular tuple.
>
>> Or perhaps we should be more explicit and refer to the item pointers on
>> the page.
>
> I find "not free but cannot be attributed to any particular tuple"
> to be entirely useless weasel wording, not to mention wrong with
> respect to item pointers in particular.
>
> Perhaps we should start counting the item pointers in tuple_len.
> We'd still have to explain about page header overhead, but that
> would be a pretty small and fixed-size discrepancy.

It's pretty weird to count unused or dead line pointers as part of
tuple_len, and it would screw things up for anybody trying to
calculate the average width of their tuples, which is an entirely
reasonable thing to want to do.  I think if we're going to count item
pointers as anything, it needs to be some new category -- either item
pointers specifically, or an "other stuff" bucket.
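
For concreteness, a back-of-the-envelope sketch of the discrepancy being
discussed (assuming the default 8 kB BLCKSZ, the 24-byte page header, and
4-byte line pointers):

#include <stdio.h>

int
main(void)
{
    const long  page_size = 8192;   /* BLCKSZ */
    const long  page_header = 24;   /* sizeof(PageHeaderData) */
    const long  line_pointer = 4;   /* sizeof(ItemIdData) */
    long        nlp = 100;          /* line pointers on the page, incl. dead */
    long        overhead = page_header + nlp * line_pointer;

    /* bytes attributed to neither tuples, dead tuples, nor free space */
    printf("%ld bytes/page unaccounted (%.1f%% of the page)\n",
           overhead, 100.0 * overhead / page_size);
    return 0;
}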

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal : Parallel Merge Join

2016-12-20 Thread Dilip Kumar
On Tue, Dec 13, 2016 at 8:34 PM, Dilip Kumar  wrote:
> I have created two patches as per the suggestion.
>
> 1. mergejoin_refactoring_v2.patch --> Move functionality of
> considering various merge join path into new function.
> 2. parallel_mergejoin_v2.patch -> This adds the functionality of
> supporting partial mergejoin paths. This will apply on top of
> mergejoin_refactoring_v2.patch.

We have done further analysis of the performance with the TPC-H
benchmark at a higher scale factor. I have tested the parallel merge
join patch along with the parallel index scan patch [1].

I have observed that with query 3 we are getting linear scalability
('explain analyze' results are attached).

Test Setup:
---
TPCH 300 scale factor
work_mem = 1GB
shared_buffer = 1GB
random_page_cost=seq_page_cost=0.1
max_parallel_workers_per_gather=4 (warm cache ensured)
The median of 3 runs (readings are quite stable).

On Head:   2702568.099 ms
With Patch: 547363.164 ms

Other Experiments:

* I have also verified readings on head without modifying
random_page_cost/seq_page_cost; there is no change in plan or
execution time.

* I have tried to increase the max_parallel_workers_per_gather to 8
but I did not observe further scaling.

[1] 
https://www.postgresql.org/message-id/CAA4eK1KA4LwTYXZG%3Dk3J2bA%2BZOEYnz2gkyWBKgY%3D_q0xJRBMDw%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


3_head.out
Description: Binary data


3_patch.out
Description: Binary data



Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0

2016-12-20 Thread Amit Kapila
On Tue, Dec 20, 2016 at 8:04 PM, Robert Haas  wrote:
> On Sat, Dec 17, 2016 at 5:46 AM, Amit Kapila  wrote:
>> Yeah, but we are planning to add a generic flag in WaitEvent structure
>> which can be used for similar purpose.  However, as per your
>> suggestion, I have changed it only for Win32 port.  Also, I kept the
>> change in ModifyWaitEvent API as that seems to be okay considering
>> 'reset' is a generic flag in WaitEvent structure.
>
> Well, we don't really need the change in ModifyWaitEvent if we have
> the change in WaitEventSetWaitBlock, right?
>

Yes.

>  I'd be inclined to ditch
> the former and keep the latter.
>

Okay, I think you want to keep the change just for Win32, which seems
fair, as we haven't noticed such a problem on any other platform.

>  Also, this doesn't look right:
>
> +   for (cur_event = set->events;
> +cur_event < (set->events + set->nevents)
> +&& returned_events < nevents;
> +cur_event++)
> +   {
> +   if (cur_event->reset)
> +   {
> +   WaitEventAdjustWin32(set, cur_event);
> +   cur_event->reset = false;
> +   }
> +   }
>
> There's no need to include returned_events < nevents in the loop
> condition here because returned_events isn't changing.
>

Fixed.

> I think I'd also guard the reset flag with #ifdef WIN32.  If it's not
> properly supported elsewhere it's just a foot-gun, and there seems to
> be no reason to write the code to properly support it elsewhere,
> whatever that would mean.
>

Okay.
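
Putting both of those review points together, the loop would end up
roughly like this (a sketch of the intent, not the attached patch):

#ifdef WIN32
    for (cur_event = set->events;
         cur_event < (set->events + set->nevents);
         cur_event++)
    {
        /* re-arm any event whose Windows wait handle needs resetting */
        if (cur_event->reset)
        {
            WaitEventAdjustWin32(set, cur_event);
            cur_event->reset = false;
        }
    }
#endif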


Ashutosh Sharma has helped to test that pldebugger issue is fixed with
attached version.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


reset_wait_events_v4.patch
Description: Binary data



Re: [HACKERS] bigint vs txid user confusion

2016-12-20 Thread Craig Ringer
On 17 December 2016 at 00:13, Robert Haas  wrote:
> On Thu, Dec 15, 2016 at 3:02 AM, Craig Ringer  wrote:
>> I really wish we could just change the pg_stat_activity and
>> pg_stat_replication xid fields to be epoch qualified in a 64-bit wide
>> 'fullxid' type, or similar.
>
> I think that approach is worth considering.

I'm worried about how many monitoring tools it'd break.

Arguably most or all will already be broken and not know it, though.
Those that aren't will be using age(xid), which we can support just
fine for the new type too, so they won't notice.

Ideally I'd just like to widen 'xid' to 64-bits. This would upset
clients that use binary mode and "know" it's a 32-bit type on the
wire, though, so I'm not sure it's a good idea even though it'd be
nicer in pretty much every other way.

So the idea then is to add a new bigxid type, a 64-bit wide
epoch-extended xid, and replace _all_ appearances of 'xid' with it in
catalogs, views, etc, such that 'xid' is entirely deprecated.

Rather than convert 64-bit extended XIDs to 32-bit for internal
comparisons/functions/operators we'll just epoch-extend our 32-bit
xids, getting rid of the need to handle any sort of "too old" concept
in most cases.
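
For illustration, the epoch extension itself is cheap; something along
these lines (a sketch with hypothetical names, mirroring what
txid_current() already does internally):

typedef uint64 BigXid;              /* hypothetical: (epoch << 32) | xid */

static BigXid
xid_to_bigxid(TransactionId xid, uint32 epoch, TransactionId next_xid)
{
    /*
     * An xid logically "ahead" of nextXid must belong to the previous
     * epoch, so step the epoch back before widening.
     */
    if (xid > next_xid)
        epoch--;
    return ((BigXid) epoch << 32) | xid;
}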

The return type of txid_current() should probably change to the new
bigxid type. This'll upset apps that expect to do maths on it, since
the new bigxid type won't have many operators, but since most (all?)
of that maths will be wrong I don't think that's a bad thing. Banging
in a ::bigint cast is a quick fix, so adaptation is easy, and it'll
highlight incorrect uses (many) of the call. The type is on-wire
compatible with the current bigint return type until you hit epoch
2^31, at which point you have bigger things to worry about, like
pending epoch wraparound.

HOWEVER, if we're going to really remove 'xid' from user view, it
should vanish from pg_database and pg_class too. That's a LOT more
intrusive, and widens pg_class by 4 bytes per row for minimal gain. No
entry there can ever have an epoch older than the current epoch - 1,
and then only for the part that's past the wraparound threshold.
pg_class is a raw relation, so we can't transform what the user sees
via a view or function.

Tools look at pg_class.relfrozenxid and pg_database.datfrozenxid more
than probably anything else, so making changes that ignore them is
pretty pointless.

Everything else looks easy and minimally intrusive, but I'm not sure
there's a sensible answer to pg_class.relfrozenxid.



test=> select table_name, column_name from information_schema.columns
where data_type = 'xid';
      table_name      |  column_name
----------------------+---------------
 pg_class             | relfrozenxid
 pg_class             | relminmxid
 pg_database          | datfrozenxid
 pg_database          | datminmxid
 pg_locks             | transactionid
 pg_prepared_xacts    | transaction
 pg_stat_activity     | backend_xid
 pg_stat_activity     | backend_xmin
 pg_stat_replication  | backend_xmin
 pg_replication_slots | xmin
 pg_replication_slots | catalog_xmin
(11 rows)

test=> SELECT proname FROM pg_proc WHERE prorettype = 'xid'::regtype
OR 'xid'::regtype = ANY (proargtypes);
         proname
--------------------------
 xidin
 xidout
 xideq
 age
 mxid_age
 xideqint4
 pg_get_multixact_members
 pg_xact_commit_timestamp
 xidrecv
 xidsend
(10 rows)

test=> SELECT oprname FROM pg_operator WHERE 'xid'::regtype IN
(oprleft, oprright, oprresult);
 oprname
---------
 =
 =
(2 rows)





-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Faster methods for getting SPI results

2016-12-20 Thread Jim Nasby
I've been looking at the performance of SPI calls within plpython.
There's a roughly 1.5x difference from equivalent Python code just in
pulling data out of the SPI tuplestore. Some of that is due to an
inefficiency in how plpython creates result dictionaries, but fixing
that is ultimately a dead end: if you're dealing with a lot of results
in Python, you want a tuple of arrays, not an array of tuples.


While we could just brute-force a tuple of arrays by plowing through the 
SPI tuplestore (this is what pl/r does), there's still a lot of extra 
work involved in doing that. AFAICT there's at least 2 copies that 
happen between the executor producing a tuple and it making it into the 
tuplestore, plus the tuplestore is going to consume a potentially very 
large amount of memory for a very short period of time, before all the 
data gets duplicated (again) into python objects.


It would be a lot more efficient if we could just grab datums from the 
executor and make a single copy into plpython (or R), letting the PL 
deal with all the memory management overhead.
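
One shape this could take is a custom DestReceiver that hands each slot
straight to the PL, bypassing SPI's tuplestore entirely. A sketch under
assumptions (PLResultReceiver and pl_result_receive are hypothetical
names; receiveSlot returning bool matches recent branches):

#include "postgres.h"
#include "tcop/dest.h"
#include "executor/tuptable.h"

typedef struct PLResultReceiver
{
    DestReceiver pub;           /* must be first */
    void       *pl_state;       /* e.g. one Python list per column */
} PLResultReceiver;

static bool
pl_result_receive(TupleTableSlot *slot, DestReceiver *self)
{
    PLResultReceiver *r = (PLResultReceiver *) self;

    /*
     * Deform the tuple once; tts_values/tts_isnull can then be copied
     * a single time into PL-managed objects.
     */
    slot_getallattrs(slot);
    /* ... append slot->tts_values[i] to r->pl_state's columns here ... */
    return true;                /* keep executing */
}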


I briefly looked at using SPI cursors to do just that, but that looks 
even worse: every fetch is executed in a subtransaction, and every fetch 
creates an entire tuplestore even if it's just going to return a single 
value. (But hey, we never claimed cursors were fast...)


Is there any way to avoid all of this? I'm guessing one issue might be 
that we don't want to call an external interpreter while potentially 
holding page pins, but even then couldn't we just copy a single tuple at 
a time and save a huge amount of palloc overhead?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] pg_background contrib module proposal

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 7:32 PM, Simon Riggs  wrote:
> On 9 December 2016 at 13:00, Robert Haas  wrote:
>> That might be good, because then we wouldn't have to maintain two
>> copies of the code.
>
> So why are there two things at all? Why is this being worked on as
> well as Peter's patch? What will that give us?

A feature that can be accessed without writing C code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan  wrote:
> Without meaning to sound glib, unification is the process by which
> parallel CREATE INDEX has the leader read temp files from workers
> sufficient to complete its final on-the-fly merge.

That's not glib, but you can't in the end define BufFile unification
in terms of what parallel CREATE INDEX needs.  Whatever changes we
make to lower-level abstractions in the service of some higher-level
goal need to be explainable on their own terms.

>> It's fine to make up new words -- indeed, in some sense that is the essence
>> of writing any complex problem -- but you have to define them.
>
> I invite you to help me define this new word.

If at some point I'm able to understand what it means, I'll try to do
that.  I think you're loosely using "unification" to mean combining
stuff from different backends in some way that depends on the
particular context, so that "BufFile unification" can be different
from "LogicalTape unification".  But that's just punting the question
of what each of those things actually are.

> Maybe it's inferior to that, but I think what Heikki proposes is more
> or less complementary to what I've proposed, and has nothing to do
> with resource management and plenty to do with making the logtape.c
> interface look nice, AFAICT. It's also about refactoring/simplifying
> logtape.c itself, while we're at it. I believe that Heikki has yet to
> comment either way on my approach to resource management, one aspect
> of the patch that I was particularly keen on your looking into.

My reading of Heikki's point was that there's not much point in
touching the BufFile level of things if we can do all of the necessary
stuff at the LogicalTape level, and I agree with him about that.  If a
shared BufFile had a shared read-write pointer, that would be a good
justification for having it.  But it seems like unification at the
BufFile level is just concatenation, and that can be done just as well
at the LogicalTape level, so why tinker with BufFile?  As I've said, I
think there's some low-level hacking needed here to make sure files
get removed at the correct time in all cases, but apart from that I
see no good reason to push the concatenation operation all the way
down into BufFile.

> The theory of operation here is that workers own their own BufFiles,
> and are responsible for deleting them when they die. The assumption,
> rightly or wrongly, is that it's sufficient that workers flush
> everything out (write out temp files), and yield control to the
> leader, which will open their temp files for the duration of the
> leader final on-the-fly merge. The resource manager in the leader
> knows it isn't supposed to ever delete worker-owned files (just
> close() the FDs), and the leader errors if it cannot find temp files
> that match what it expects. If there is a an error in the leader, it
> shuts down workers, and they clean up, more than likely. If there is
> an error in the worker, or if the files cannot be deleted (e.g., if
> there is a classic hard crash scenario), we should also be okay,
> because nobody will trip up on some old temp file from some worker,
> since fd.c has some gumption about what workers need to do (and what
> the leader needs to avoid) in the event of a hard crash. I don't see a
> risk of file descriptor leaks, which may or may not have been part of
> your concern (please clarify).

I don't think there's any new issue with file descriptor leaks here,
but I think there is a risk of calling unlink() too early or too late
with your design.  My proposal was an effort to nail that down real
tight.

>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> I'll remind you that parallel CREATE INDEX doesn't actually ever need
> to be randomAccess, and so we are not actually going to ever need to
> do this as things stand. I wrote the code that way in order to not
> break the existing interface, which seemed like a blocker to posting
> the patch. I am open to the idea of such an "adoption" occurring, even
> though it actually wouldn't help any case that exists in the patch as
> proposed. I didn't go that far in part because it seemed premature,
> given that nobody had looked at my work to date at the time, and given
> the fact that there'd be no initial user-visible benefit, and given
> how the exact meaning of "unification" was (and is) somewhat in flux.
>
> I see no good reason to not do that, although that might change if I
> actually seriously undertook to teach the leader about this kind of
> "adoption". I suspect that the interface specification would make for
> confusing reading, which isn't terribly appealing, but I'm sure I
> could manage to make it work given time.

I think the interface is pretty clear: the worker's logical tapes get
incorporated into the leader's LogicalTapeSet as if they'd bee

Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO  wrote:
> Would this approach be acceptable, or is modifying Nodes a no-go area?
>
> If it is acceptable, I can probably put together a patch and submit it.
>
> If not, I suggest to update the documentation to tell that
> pg_stat_statements does not work properly with combined queries.

I think you've found a bug, but I'm a little doubtful about your
proposed fix.  However, I haven't studied the code, so I don't know
what other approach might be better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical decoding on standby

2016-12-20 Thread Craig Ringer
On 20 December 2016 at 15:03, Petr Jelinek  wrote:

> But in 0003 I don't understand following code:
>> + if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> + {
>> + fprintf(stderr,
>> + _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> + progname);
>> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> + progname);
>> + exit(1);
>> + }
>
> Why is --create-slot and --endpos not allowed together?

Actually, the test is fine, the error is just misleading due to my
misunderstanding.

The fix is simply to change the error message to

_("%s: --endpos may only be specified with --start\n"),

so I won't post a separate followup patch.

Okano Naoki tried to bring this to my attention earlier, but I didn't
understand, as I hadn't yet realised that you could use --create-slot
with --start; they aren't mutually exclusive.

(We already test earlier that --start --drop-slot isn't specified, so
no test for do_drop_slot is required.)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Why does plpython delay composite type resolution?

2016-12-20 Thread Jim Nasby
Why do functions that accept composite types delay type resolution until 
execution? I have a naive patch that speeds up plpy.execute() by 8% by 
caching interned Python strings for the dictionary key names (which are 
repeated over and over). The next step is to just pre-allocate those 
strings as appropriate for the calling context, but it's not clear how 
to handle that for input arguments.
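
For reference, the caching involved is small; roughly (a sketch with
hypothetical structure names, assuming CPython interning -- Python 2
would use PyString_InternFromString -- and a 9.6-era TupleDesc layout):

#include "postgres.h"
#include "access/tupdesc.h"
#include <Python.h>

typedef struct PLyKeyCache
{
    int         natts;
    PyObject  **keys;           /* interned column-name strings */
} PLyKeyCache;

static void
cache_key_names(PLyKeyCache *cache, TupleDesc desc)
{
    int         i;

    cache->natts = desc->natts;
    cache->keys = palloc(sizeof(PyObject *) * desc->natts);
    for (i = 0; i < desc->natts; i++)
        cache->keys[i] =
            PyUnicode_InternFromString(NameStr(desc->attrs[i]->attname));
}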

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] Logical decoding on standby

2016-12-20 Thread Craig Ringer
On 20 December 2016 at 15:03, Petr Jelinek  wrote:

>> The biggest change in this patch, and the main intrusive part, is that
>> procArray->replication_slot_catalog_xmin is no longer directly used by
>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>> added, with a corresponding CheckPoint field.
>> [snip]
>
> If this mechanism would not be needed most of the time, wouldn't it be
> better to not have it and just have a way to ask physical slot about
> what's the current reserved catalog_xmin (in most cases the standby
> should actually know what it is since it's sending the hs_feedback, but
> first slot created on such standby may need to check).

Yes, and that was actually my originally preferred approach, though
the one above does offer the advantage that if something goes wrong we
can detect it and fail gracefully. Possibly not worth the complexity
though.

Your approach requires us to make very sure that hot_standby_feedback
does not get turned off by user or become ineffective once we're
replicating, though, since we won't have any way to detect when needed
tuples are removed. We'd probably just bail out with relcache/syscache
lookup errors, but I can't guarantee we wouldn't crash if we tried
logical decoding on WAL where needed catalogs have been removed.

I initially ran into trouble doing that, but now think I have a way forward.

> WRT preventing
> hs_feedback going off, we can refuse to start with hs_feedback off when
> there are logical slots detected.

Yes. There are some ordering issues there though. We load slots quite
late in startup and they don't get tracked in checkpoints. So we might
launch the walreceiver before we load slots and notice their needed
xmin/catalog_xmin. So we need to prevent sending of
hot_standby_feedback until slots are loaded, or load slots earlier in
startup. The former sounds less intrusive and safer - probably just
add an "initialized" flag to ReplicationSlotCtlData and suppress
hs_feedback until it becomes true.
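
Sketch of that flag idea (the field and where it's checked are
hypothetical, not an existing patch):

typedef struct ReplicationSlotCtlData
{
    bool        initialized;    /* true once StartupReplicationSlots() ran */
    /* ... existing members ... */
} ReplicationSlotCtlData;

/* in the walreceiver feedback path: */
if (hot_standby_feedback &&
    (!ReplicationSlotCtl || !ReplicationSlotCtl->initialized))
    return;                     /* don't send feedback before slots load */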

We'd also have to suppress the validation callback action on the
hot_standby_feedback GUC until we know replication slot state is
initialised, and perform the check during slot startup instead. The
hot_standby_feedback GUC validation check might get called before
shmem is even set up so we have to guard against attempts to access a
shmem segment that may not even exist yet.

The general idea is workable though. Refuse to start if logical slots
exist and hot_standby_feedback is off or walreceiver isn't using a
physical slot. Refuse to allow hot_standby_feedback to change

> We can also refuse to connect to the
> master without physical slot if there are logical slots detected. I
> don't see problem with either of those.

Agreed. We must also be able to reliably enforce that the walreceiver
is using a replication slot to connect to the master and refuse to
start if it is not. The user could change recovery.conf and restart
the walreceiver while we're running, after we perform that check, so
walreceiver must also refuse to start if logical replication slots
exist but it has no primary slot name configured.

> You may ask what if user drops the slot and recreates or somehow
> otherwise messes up catalog_xmin on master, well, considering that under
> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
> know about it so we'd either mark the local slots broken/drop them or
> plainly refuse to connect to the master same way as if it didn't have
> required WAL anymore (not sure which behavior is better). Note that user
> could mess up catalog_xmin even in your design the same way, so it's not
> really a regression.

Agreed. Checking catalog_xmin of the slot when we connect is
sufficient to guard against that, assuming we can trust that the
catalog_xmin is actually in effect on the master. Consider cascading
setups, where we set our catalog_xmin but it might not be "locked in"
until the middle cascaded server relays it to the master.

I have a proposed solution to that which I'll outline in a separate
patch+post; it ties in to some work on addressing the race between hot
standby feedback taking effect and queries starting on the hot
standby.  It boils down to "add a hot_standby_feedback reply protocol
message".

> Plus
> it might even be okay to only allow creating logical slots on standbys
> connected directly to master in v1.

True. I didn't consider that.

We haven't had much luck in the past with such limitations, but
personally I'd consider it a perfectly reasonable one.

> But in 0003 I don't understand following code:
>> + if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> + {
>> + fprintf(stderr,
>> + _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> + progname);
>> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> + progname);
>> + 

Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-20 Thread Michael Paquier
On Wed, Dec 21, 2016 at 11:14 AM, Craig Ringer  wrote:
> I didn't see this in the CF app so I created it in
> https://commitfest.postgresql.org/12/ as
> https://commitfest.postgresql.org/12/924/ . I couldn't find your
> PostgreSQL community username, so please log in and set yourself as
> author. I have set patch state to "waiting on author" pending your
> addressing these remarks.

You need to import the user first, which is what I did, and I have
added the author name correctly.
(Glad you jumped on this patch, I was just about to begin looking at
it. So that's nice timing.)
-- 
Michael




Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Amit Langote
On 2016/12/21 1:45, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote:
>>> Even if we decide to keep the message, I think it's not very good
>>> wording anyhow; as a translator I disliked it on sight.  Instead of
>>> "skipping scan to validate" I would use "skipping validation scan",
>>> except that it's not clear what it is we're validating.  Mentioning
>>> partition constraint in errcontext() doesn't look like a great solution, but
>>> I can't think of anything better.
>>
>> Maybe something like: partition constraint for table \"%s\" is implied
>> by existing constraints
> 
> Actually, shouldn't we emit a message if we *don't* skip the check?

Scanning (aka, not skipping) to validate the partition constraint is the
default behavior, so a user would be expecting it anyway, IOW, need not be
informed of it.  But when ATExecAttachPartition's efforts to avoid the
scan by comparing the partition constraint against existing constraints
(which the user most probably deliberately added just for this) succeed,
that seems like a better piece of information to provide the user with,
IMHO.  But then again, having a message printed before a potentially long
validation scan seems like something a user would like to see, to know
what it is that is going to take so long.  Hmm.

Anyway, what would the opposite of Robert's suggested message look like:
"scanning table \"%s\" to validate partition constraint"?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-20 Thread Craig Ringer
On 21 December 2016 at 00:52, Ants Aasma  wrote:

> The simple fix seems to be to always send out at least one feedback
> message on each connect regardless of hot_standby_feedback setting.

I agree.

> Patch attached. Looks like this goes back to version 9.4. It could
> conceivably cause issues for replication middleware that does not know
> how to handle hot standby feedback messages. Not sure if any exist and
> if that is a concern.

If it exists it's already broken. Not our problem IMO.

> A shell script to reproduce the problem is also attached, adjust the
> PGPATH variable to your postgres install and run in an empty
> directory.

I wrote some TAP tests for 9.6 (they're on the logical decoding on
standby thread) that can be extended to check this. Once they're
committed we can add a test for this change.

Re the patch, I don't like

-   static bool master_has_standby_xmin = false;
+   static bool master_has_standby_xmin = true;

without any comment. It's addressed in the comment changes on the
header func, but the link isn't obvious. Maybe a oneliner to say
"ensure we always send at least one feedback message" ?

I think this part of the patch is correct and useful.



I don't see the point of

+*
+* If Hot Standby is not yet active we reset the xmin value. Check this
+* after the interval has expired to reduce number of calls.
 */
-   if (hot_standby_feedback)
+   if (hot_standby_feedback && HotStandbyActive())

though. Forcing feedback to send InvalidTransactionId until hot
standby feedback is actively running seems counterproductive; we want
to lock in feedback as soon as possible, not wait until we're
accepting transactions. Simon and I are in fact working on changes to
do the opposite of what you've got here and ensure that we don't allow
hot standby connections until we know hot_standby_feedback is in
effect and a usable xmin is locked in. That way the master won't
remove tuples we need as soon as we start an xact and cause a conflict
with recovery.

If there are no running xacts, GetOldestXmin() will return the slot
xmin if any. We should definitely not be clearing that just because
we're not accepting hot standby connections yet; we want to make very
sure it remains in effect unless the user explicitly de-configures hot
standby.

 (There's another pre-existing problem there, where we can start the
walsender before slots are initialized, that I'll be addressing
separately).

If there's no slot xmin either, we'll send nextXid. That's a sensible
starting point for hot standby feedback until we start some
transactions.
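
To spell that out, the xmin we feed back is derived roughly like this
(paraphrasing walreceiver.c, not quoting it exactly):

    if (hot_standby_feedback)
        xmin = GetOldestXmin(NULL, false);  /* slot xmin, or nextXid if idle */
    else
        xmin = InvalidTransactionId;        /* asks the master to forget us */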

So -1 on this part of the patch, unless there's something I've misunderstood.


I didn't see this in the CF app so I created it in
https://commitfest.postgresql.org/12/ as
https://commitfest.postgresql.org/12/924/ . I couldn't find your
PostgreSQL community username, so please log in and set yourself as
author. I have set patch state to "waiting on author" pending your
addressing these remarks.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Michael Paquier
On Tue, Dec 20, 2016 at 9:23 PM, Heikki Linnakangas  wrote:
> On 12/16/2016 03:31 AM, Michael Paquier wrote:
> Actually, it does still perform that check. There's a new function,
> plain_crypt_verify, that passwordcheck uses now. plain_crypt_verify() is
> intended to work with any hash formats we might introduce in the future
> (including SCRAM), so that passwordcheck doesn't need to know about
> all the hash formats.

Bah. I have misread the first version of the patch, and it is indeed
keeping the username checks. Now that things don't crash, it behaves
as expected with the patch:
=# load 'passwordcheck';
LOAD
=# alter role mpaquier password 'mpaquier';
ERROR:  22023: password must not contain user name
LOCATION:  check_password, passwordcheck.c:101
=# alter role mpaquier password 'md58349d3a1bc8f4f7399b1ff9dea493b15';
ERROR:  22023: password must not contain user name
LOCATION:  check_password, passwordcheck.c:82

>> +case PASSWORD_TYPE_PLAINTEXT:
>> +shadow_pass = &shadow_pass[strlen("plain:")];
>> +break;
>> It would be a good idea to have a generic routine able to get the plain
>> password value. In short I think that we should reduce the amount of
>> locations where "plain:" prefix is hardcoded.
>
> There is such a function included in the patch, get_plain_password(char
> *shadow_pass), actually. Contrib/passwordcheck uses it. I figured that in
> crypt.c itself, it's OK to do the above directly, but get_plain_password()
> is intended to be used elsewhere.

The idea would be to have the function not return an allocated string,
just a pointer to the position inside it. That would be useful in
plain_crypt_verify() for example, for a total of 4 places, including
get_plain_password() where the new string allocation is done. Well,
it's not like this "plain:" prefix is going to change in the future,
nor is it going to spread much.
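
Something like this is what I had in mind (a sketch, not code from the
patch):

    /* Return a pointer past the "plain:" prefix, without allocating */
    static const char *
    get_plain_password_ptr(const char *shadow_pass)
    {
        if (strncmp(shadow_pass, "plain:", strlen("plain:")) == 0)
            return shadow_pass + strlen("plain:");
        return shadow_pass;     /* already a bare password */
    }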

> Thanks for having a look! Attached is a new version, with that bug fixed.

I have been able to do more advanced testing without the crash and things
seem to work properly. The attached set of tests is also able to pass
for all the combinations of hba configurations and password formats.
And looking at the code I don't have more comments.
-- 
Michael


009_authentication.pl
Description: Perl program

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-20 Thread Kyotaro HORIGUCHI
At Tue, 20 Dec 2016 23:47:22 +0900, Fujii Masao  wrote:
> On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier
>  wrote:
> > On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada  
> > wrote:
> >> Do we need to consider the sorting method and the selecting k-th
> >> latest LSN method?
> >
> > Honestly, nah. Tests are showing that there are many more bottlenecks
> > before that with just memory allocation and parsing.
> 
> I think that it's worth prototyping alternative algorithm, and
> measuring the performances of those alternative and current
> algorithms. This measurement should check not only the bottleneck
> but also how much each algorithm increases the time that backends
> need to wait for before they receive ack from walsender.
> 
> If it's reported that the current algorithm is "efficient" enough,
> we can just leave the code as it is. OTOH, if not, let's adopt
> the alternative one.

I'm personally interested in the difference between them, but it
doesn't seem urgently required. If we have nothing else pressing to
do about this, considering other ordering methods would still be
valuable.

It is a not-well-grounded thought, but maintaining the top-k list
with an insertion sort looks more promising than running a top-k
sort over the whole list every time. A sort over all walsenders is
still needed the first time, and in some other situations, though.
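
Roughly this kind of shape, just as a sketch (names made up, not tied
to the actual syncrep code):

    /*
     * Keep the k largest LSNs seen so far in a descending-sorted array;
     * after offering every walsender's LSN, lsns[k - 1] is the k-th
     * latest one.
     */
    static void
    topk_offer(XLogRecPtr *lsns, int k, int *n, XLogRecPtr lsn)
    {
        int     i;

        if (*n == k && lsn <= lsns[k - 1])
            return;                 /* not among the k largest */
        if (*n < k)
            (*n)++;                 /* array not yet full */
        for (i = *n - 1; i > 0 && lsns[i - 1] < lsn; i--)
            lsns[i] = lsns[i - 1];  /* shift smaller entries down */
        lsns[i] = lsn;
    }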


By the way, do we continue dispu^h^hcussing on the format of
s_s_names and/or a successor right now?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Kyotaro HORIGUCHI
Thank you for the discussion.

At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane  wrote in 
<23492.1482264...@sss.pgh.pa.us>
> The bigger picture here though is that we used to have limits on syscache
> size, and we got rid of them (commit 8b9bc234a, see also
> https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> not only because of the problem you mentioned about performance falling
> off a cliff once the working-set size exceeded the arbitrary limit, but
> also because enforcing the limit added significant overhead --- and did so
> whether or not you got any benefit from it, ie even if the limit is never
> reached.  Maybe the present patch avoids imposing a pile of overhead in
> situations where no pruning is needed, but it doesn't really look very
> promising from that angle in a quick once-over.

Indeed. As mentioned in the mail at the beginning of this thread,
it falls back to scanning the whole cache if even one negative entry
exists, even when that entry is unrelated to the target relid, and
that scan can take significantly long on a fat cache.

A list of negative entries, something like CatCacheList, would help,
but it needs additional memory.

> BTW, I don't see the point of the second patch at all?  Surely, if
> an object is deleted or updated, we already have code that flushes
> related catcache entries.  Otherwise the caches would deliver wrong
> data.

Maybe you have misread the patch. Negative entries won't be flushed
by any means. Deletion of a namespace causes cascaded object deletion
according to the dependencies, which finally leads to invalidation of
the non-negative cache entries. But removal of the *negative entries*
in RELNAMENSP won't happen.

The test script for the case (gen2.pl) does the following thing,

CREATE SCHEMA foo;
SELECT * FROM foo.invalid;
DROP SCHEMA foo;

Removing the schema foo leaves a negative cache entry for
'foo.invalid' in RELNAMENSP.

However, I'm not sure the above situation happens so frequently
that it is worth amending.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-20 Thread Peter Geoghegan
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas  wrote:
>> What I have in mind is something like the attached patch. It refactors
>> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
>> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
>> doesn't have the concept of a tape number anymore, it can contain any number
>> of tapes, and you can create more on the fly. With that, it'd be fairly easy
>> to make tuplesort.c merge LogicalTapes that came from different tape sets,
>> backed by different BufFiles. I think that'd avoid much of the unification
>> code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version
> of the patch and I agree with this criticism, though maybe not with
> the proposed solution.  I actually don't understand what "unification"
> is supposed to mean.  The patch really doesn't explain that anywhere
> that I can see.  It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process.  This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which
parallel CREATE INDEX has the leader read temp files from workers
sufficient to complete its final on-the-fly merge. So, it's a term
that's a bit like "speculative insertion" was up until UPSERT was
committed: a concept that is somewhat in flux, and describes a new
low-level mechanism built to support a higher level operation, which
must accord with a higher level set of requirements (so, for
speculative insertion, that would be avoiding "unprincipled
deadlocks" and so on). That being the case, maybe "unification" isn't
useful as a precise piece of terminology at this point, but that will
change.

While I'm fairly confident that I basically have the right idea with
this patch, I think that you are better at judging the ins and outs of
resource management than I am, not least because of the experience of
working on parallel query itself. Also, I'm signed up to review
parallel hash join in large part because I think there might be some
convergence concerning the sharing of BufFiles among parallel workers.
I don't think I'm qualified to judge what a general abstraction like
this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via
> an unspecified interface for unspecified reasons using unspecified
> conventions -- but it doesn't tell you what the semantics of it are
> supposed to be.  For example, if we "unify" several BufFiles, do they
> then have a shared seek pointer?

No.

> Do the existing contents effectively
> get concatenated in an unpredictable order, or are they all expected
> to be empty at the time unification happens?  Or something else?

The order is the same order as ordinal identifiers are assigned to
workers within tuplesort.c, which is undefined, with the notable
exception of the leader's own identifier (-1) and area of the unified
BufFile space (this is only relevant in randomAccess cases, where
leader may write stuff out to its own reserved part of the BufFile
space). It only matters that the bit of metadata in shared memory is
in that same order, which it clearly will be. So, it's unpredictable,
but in the same way that ordinal identifiers are assigned in a
not-well-defined order; it doesn't or at least shouldn't matter. We
can imagine a case where it does matter, and we probably should, but
that case isn't parallel CREATE INDEX.

> It's fine to make up new words -- indeed, in some sense that is the essence
> of writing any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak
> resources.  Maybe that's been thought about here, but it isn't very
> clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this
unification stuff is about. Suggestions on where that should live are
welcome.

> In Heikki's proposal, if
> process A is trying to read a file owned by process B, and process B
> dies and removes the file before process A gets around to reading it,
> we have got trouble, especially on Windows which apparently has low
> tolerance for such things.  Peter's proposal avoids that - I *think* -
> by making the leader responsible for all resource cleanup, but that's
> inferior to the design we've used for other sorts of shared resource
> cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
> detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more
or less complementary to what I've proposed, and has nothing to do
with resource management and plenty to do with making the logtape.c
interface look nice, AFAICT. It's also about refactoring/simplifying
logtape.c itself, while we're at it. I believe that Heik

Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Stephen Frost
David,

* David Fetter (da...@fetter.org) wrote:
> On Tue, Dec 20, 2016 at 06:14:40PM -0500, Stephen Frost wrote:
> > * David Fetter (da...@fetter.org) wrote:
> > > On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote:
> > > > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > > > > Even if you have a separate "verifier type" column, it's not fully
> > > > > normalized, because there's still a dependency between the
> > > > > verifier and verifier type columns. You will always need to look
> > > > > at the verifier type to make sense of the verifier itself.
> > > > 
> > > > That's true- but you don't need to look at the verifier, or even
> > > > have *access* to the verifier, to look at the verifier type.
> > > 
> > > Would a view that shows only what's to the left of the first semicolon
> > > suit this purpose?
> > 
> > Obviously a (security barrier...) view or a (security definer) function
> > could be used, but I don't believe either is actually a good idea.
> 
> Would you be so kind as to help me understand what's wrong with that idea?

For starters, it doubles-down on the assumption that we'll always be
happy with that particular separator and implies to anyone watching that
they'll be able to trust it.  Further, it's additional complication
which, at least to my eyes, is entirely in the wrong direction.

We could push everything in pg_authid into a single colon-separated text
field and call it simpler because we don't have to deal with those silly
column things, and we'd have something a lot closer to a unix passwd
file too!  But that wouldn't make it a terribly smart thing to do.  We
aren't a bunch of individual C programs having to parse things out
of flat text files, after all.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] pg_background contrib module proposal

2016-12-20 Thread Simon Riggs
On 9 December 2016 at 13:00, Robert Haas  wrote:

> That might be good, because then we wouldn't have to maintain two
> copies of the code.

So why are there two things at all? Why is this being worked on as
well as Peter's patch? What will that give us?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-20 Thread David Fetter
On Mon, Dec 19, 2016 at 09:30:32PM +0500, Andrew Borodin wrote:
> 2016-12-19 4:21 GMT+05:00 David Fetter :
> > Couldn't it sleep in increments smaller than a second?  Like maybe
> > milliseconds?  Also, it's probably cleaner (or at least more
> > comprehensible) to write something using format() and dollar quoting
> > than the line with the double 's.
> 
> Right. Here's a version which waits for half a second. I do not see how
> to pass doubles via dollar quoting; pg_background_launch() gets a string
> as a parameter, it's not EXECUTE USING.

I see.

I find the following a little easier to follow, and the sleeps still
work even when very short.

Best,
David.

CREATE TABLE input AS
SELECT x, row_number() OVER (ORDER BY x) n
FROM generate_series(0, .05, 0.01) x
ORDER BY x DESC;

CREATE TABLE output(place int, value float);

CREATE TABLE handles AS
SELECT pg_background_launch(
           format($$
               SELECT pg_sleep(%s);
               INSERT INTO output VALUES (%s, %s)
           $$,
           x, n, x)
       ) h
FROM input;

-- wait until everyone finishes
SELECT *
FROM handles
JOIN LATERAL pg_background_result(handles.h) AS (x TEXT) ON (true);

-- output results
SELECT * FROM output ORDER BY place;

-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread David Fetter
On Tue, Dec 20, 2016 at 06:14:40PM -0500, Stephen Frost wrote:
> David,
> 
> * David Fetter (da...@fetter.org) wrote:
> > On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote:
> > > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > > > Even if you have a separate "verifier type" column, it's not fully
> > > > normalized, because there's still a dependency between the
> > > > verifier and verifier type columns. You will always need to look
> > > > at the verifier type to make sense of the verifier itself.
> > > 
> > > That's true- but you don't need to look at the verifier, or even
> > > have *access* to the verifier, to look at the verifier type.
> > 
> > Would a view that shows only what's to the left of the first semicolon
> > suit this purpose?
> 
> Obviously a (security barrier...) view or a (security definer) function
> could be used, but I don't believe either is actually a good idea.

Would you be so kind as to help me understand what's wrong with that idea?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 3:10 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane  wrote:
>>> I don't understand why we'd make that a system-wide behavior at all,
>>> rather than expecting each process to manage its own cache.
>
>> Individual backends don't have a really great way to do time-based
>> stuff, do they?  I mean, yes, there is enable_timeout() and friends,
>> but I think that requires quite a bit of bookkeeping.
>
> If I thought that "every ten minutes" was an ideal way to manage this,
> I might worry about that, but it doesn't really sound promising at all.
> Every so many queries would likely work better, or better yet make it
> self-adaptive depending on how much is in the local syscache.

I don't think "every so many queries" is very promising at all.
First, it has the same problem as a fixed cap on the number of
entries: if your workload round-robins over slightly more tables than
that value, performance will be poor.  Second, what's really important here
is to keep the percentage of wall-clock time spent populating the
system caches small.  If a backend is doing 4000 queries/second and
each of those 4000 queries touches a different table, it really needs
a cache of at least 4000 entries or it will thrash and slow way down.
But if it's doing a query every 10 minutes and those queries
round-robin between 4000 different tables, it doesn't really need a
4000-entry cache.  If those queries are long-running, the time to
repopulate the cache will only be a tiny fraction of runtime.  If the
queries are short-running, then the effect is, percentage-wise, just
the same as for the high-volume system, but in practice it isn't
likely to be felt as much.  I mean, if we keep a bunch of old cache
entries around on a mostly-idle backend, they are going to be pushed
out of CPU caches and maybe even paged out.  One can't expect a
backend that is woken up after a long sleep to be quite as snappy as
one that's continuously active.

Which gets to my third point: anything that's based on number of
queries won't do anything to help the case where backends sometimes go
idle and sit there for long periods.  Reducing resource utilization in
that case would be beneficial.  Ideally I'd like to get rid of not
only the backend-local cache contents but the backend itself, but
that's a much harder project.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Stephen Frost
David,

* David Fetter (da...@fetter.org) wrote:
> On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote:
> > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > > Even if you have a separate "verifier type" column, it's not fully
> > > normalized, because there's still a dependency between the
> > > verifier and verifier type columns. You will always need to look
> > > at the verifier type to make sense of the verifier itself.
> > 
> > That's true- but you don't need to look at the verifier, or even
> > have *access* to the verifier, to look at the verifier type.
> 
> Would a view that shows only what's to the left of the first semicolon
> suit this purpose?

Obviously a (security barrier...) view or a (security definer) function
could be used, but I don't believe either is actually a good idea.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] too low cost of Bitmap index scan

2016-12-20 Thread Tom Lane
Pavel Stehule  writes:
> I am trying to fix slow query on PostgreSQL 9.5.4.
> The data are almost in RAM

If it's all in RAM, you'd likely be well-served to lower random_page_cost.
It looks to me like the planner is estimating pretty accurately how many
heap fetches will be eliminated by using the extra index; where it's off
seems to be in the cost of those heap fetches relative to the index work.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make pg_basebackup -x stream the default

2016-12-20 Thread Michael Paquier
On Tue, Dec 20, 2016 at 10:38 PM, Fujii Masao  wrote:
> On Mon, Dec 19, 2016 at 7:51 PM, Vladimir Rusinov  wrote:
>> The server must also be configured with max_wal_senders set high
>> enough to leave at least one session available for the backup.
>
> I think that it's better to explain explicitly here that max_wal_senders
> should be higher than one by default.

Recovery tests are broken by this patch, the backup() method in
PostgresNode.pm uses pg_basebackup -x:

sub backup
{
    my ($self, $backup_name) = @_;
    my $backup_path = $self->backup_dir . '/' . $backup_name;
    my $port        = $self->port;
    my $name        = $self->name;

    print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
    TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p', $port,
        '-x', '--no-sync');
    print "# Backup finished\n";
}
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-20 Thread Robert Haas
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas  wrote:
> I find this unification business really complicated. I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I just looked at the buffile.c/buffile.h changes in the latest version
of the patch and I agree with this criticism, though maybe not with
the proposed solution.  I actually don't understand what "unification"
is supposed to mean.  The patch really doesn't explain that anywhere
that I can see.  It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process.  This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via
an unspecified interface for unspecified reasons using unspecified
conventions -- but it doesn't tell you what the semantics of it are
supposed to be.  For example, if we "unify" several BufFiles, do they
then have a shared seek pointer?  Do the existing contents effectively
get concatenated in an unpredictable order, or are they all expected
to be empty at the time unification happens?  Or something else?  It's
fine to make up new words -- indeed, in some sense that is the essence
of writing any complex problem -- but you have to define them.  As far
as I can tell, the idea is that we're somehow magically concatenating
the BufFiles into one big super-BufFile, but I'm fuzzy on exactly
what's supposed to be going on there.

It's hard to understand how something like this doesn't leak
resources.  Maybe that's been thought about here, but it isn't very
clear to me how it's supposed to work.  In Heikki's proposal, if
process A is trying to read a file owned by process B, and process B
dies and removes the file before process A gets around to reading it,
we have got trouble, especially on Windows which apparently has low
tolerance for such things.  Peter's proposal avoids that - I *think* -
by making the leader responsible for all resource cleanup, but that's
inferior to the design we've used for other sorts of shared resource
cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
detach always takes responsibility.  That avoids assuming that we're
always dealing with a leader-follower situation, it doesn't
categorically require the leader to be the one who creates the shared
resource, and it doesn't require the leader to be the last process to
die.

Imagine a data structure that is stored in dynamic shared memory and
contains space for a filename, a reference count, and a mutex.  Let's
call this thing a SharedTemporaryFile or something like that.  It
offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared
DSM segment, you call SharedTemporaryFileInitialize() to initialize
them.  Then, every process that cares about the file does
SharedTemporaryFileAttach(), which bumps the reference count and sets
an on_dsm_detach hook to decrement the reference count and unlink the
file if the reference count thereby reaches 0.  One of those processes
does SharedTemporaryFileAssign(), which fills in the pathname and
clears FD_TEMPORARY.  Then, any process that has attached can call
SharedTemporaryFileGetFile() to get a File which can then be accessed
normally.  So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize()
and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then
SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled
in, there's no window where the file is at risk of being leaked.
Before SharedTemporaryFileAssign(), the worker is solely responsible
for removing the file; after that call, whichever of the leader and
the worker exits last will remove the file.
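
In terms of layout, I'm imagining something like this (a sketch; the
field names are illustrative, not a final proposal):

    typedef struct SharedTemporaryFile
    {
        slock_t     mutex;          /* protects refcnt and pathname */
        int         refcnt;         /* attached processes; unlink at 0 */
        char        pathname[MAXPGPATH];    /* empty until Assign() */
    } SharedTemporaryFile;

SharedTemporaryFileAttach() would bump refcnt under the mutex and
register the on_dsm_detach hook that decrements it and unlinks the
file once it reaches zero.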


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread Peter Eisentraut
On 12/20/16 3:49 PM, Tom Lane wrote:
> Peter Eisentraut  writes:
>> Maybe the fix is to make --wait the default?
> I was wondering about that too ... does anyone remember the rationale
> for the current behavior?

Probably because that didn't work reliably before pg_ctl learned how to
get the right port number and PQping() and such things.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Michael Paquier
On Wed, Dec 21, 2016 at 1:08 AM, David Fetter  wrote:
> Would a view that shows only what's to the left of the first semicolon
> suit this purpose?

Of course it would, you would just need to make the routines now
checking the shape of MD5 and SCRAM identifiers available at SQL level
and feed the strings into them. Now I am not sure that it's worth
having a new superuser view for that. pg_roles and pg_shadow hide the
information about verifiers.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread David G. Johnston
On Tue, Dec 20, 2016 at 1:49 PM, Tom Lane  wrote:

> Peter Eisentraut  writes:
> > Maybe the fix is to make --wait the default?
>
> I was wondering about that too ... does anyone remember the rationale
> for the current behavior?  But the message for the non-wait case seems
> like it could stand to be improved independently of that.
>

Not totally independent.

If the default is changed to --wait then the message can be written
assuming the user understands what "--no-wait" does; but if the default is
left "--no-wait" then cluing the user into the asynchronous behavior and
telling them how to get the more expected synchronous behavior would be
helpful.

David J.


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread Alvaro Herrera
Tom Lane wrote:
> Ryan Murphy  writes:
> > I'm concerned some new users may not understand this behavior of pg_ctl, so
> > I wanted to suggest that we add some additional messaging after "server
> > starting" - something like:
> 
> > $ pg_ctl -D datadir -l logfile start
> > server starting
> > (to wait for confirmation that server actually started, try pg_ctl again
> > with --wait)
> 
> That seems annoyingly verbose and nanny-ish.  Perhaps we could get the
> point across like this:
> 
> $ pg_ctl -D datadir -l logfile start
> requested server to start

+1, but also +1 to making --wait the default.  Extra points if systemd
start scripts are broken by the change ;-)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread Tom Lane
Peter Eisentraut  writes:
> Maybe the fix is to make --wait the default?

I was wondering about that too ... does anyone remember the rationale
for the current behavior?  But the message for the non-wait case seems
like it could stand to be improved independently of that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread Tom Lane
Ryan Murphy  writes:
> I'm concerned some new users may not understand this behavior of pg_ctl, so
> I wanted to suggest that we add some additional messaging after "server
> starting" - something like:

> $ pg_ctl -D datadir -l logfile start
> server starting
> (to wait for confirmation that server actually started, try pg_ctl again
> with --wait)

That seems annoyingly verbose and nanny-ish.  Perhaps we could get the
point across like this:

$ pg_ctl -D datadir -l logfile start
requested server to start
$

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread David Fetter
On Tue, Dec 20, 2016 at 03:43:11PM -0500, Peter Eisentraut wrote:
> On 12/20/16 3:31 PM, Ryan Murphy wrote:
> > I'm concerned some new users may not understand this behavior of pg_ctl,
> > so I wanted to suggest that we add some additional messaging after
> > "server starting" - something like:
> > 
> > $ pg_ctl -D datadir -l logfile start
> > server starting
> > (to wait for confirmation that server actually started, try pg_ctl again
> > with --wait)
> 
> Maybe the fix is to make --wait the default?

+1

It's not super useful to have the prompt return while the server is
still starting up.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread Peter Eisentraut
On 12/20/16 3:31 PM, Ryan Murphy wrote:
> I'm concerned some new users may not understand this behavior of pg_ctl,
> so I wanted to suggest that we add some additional messaging after
> "server starting" - something like:
> 
> $ pg_ctl -D datadir -l logfile start
> server starting
> (to wait for confirmation that server actually started, try pg_ctl again
> with --wait)

Maybe the fix is to make --wait the default?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait

2016-12-20 Thread Ryan Murphy
Hi Postgres Devs,

I had a suggestion regarding the output pg_ctl gives when you use it to
start the postgres server.  At first I was going to write a patch, but then
I decided to just ask you guys first to see what you think.

I had an issue earlier where I was trying to upgrade my postgres database
to a new major version and incidentally a new pg_catalog version, and
therefore the new code could no longer run the existing data directory
without pg_upgrade or pg_dump (I ended up needing pg_dump).  Initially I
was very confused because I tried running "pg_ctl -D datadir -l logfile
start" like normal, and it just said "server starting", yet the server was
not starting.  It took me a while to realize that I needed to use the
"--wait" / "-w" option to actually wait and test whether the server was
really starting, at which point it told me there was a problem and to check
the log.

I'm concerned some new users may not understand this behavior of pg_ctl, so
I wanted to suggest that we add some additional messaging after "server
starting" - something like:

$ pg_ctl -D datadir -l logfile start
server starting
(to wait for confirmation that server actually started, try pg_ctl again
with --wait)

What do you guys think?  Is it important to keep pg_ctl output more terse
than this?  I do think something like this could help new users avoid
frustration.

I'm happy to write a patch for this if it's helpful, though it's such a
simple change that if one of the core devs wants this s/he can probably
more easily just add it themselves.

Cheers,
Ryan


Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-20 Thread Tom Lane
I wrote:
> I've been thinking about how to fix the problem Andreas Seltenreich
> reported at
> https://postgr.es/m/87eg1y2s3x@credativ.de

Attached is a proposed patch that deals with the problems discussed
here and in <26706.1482087...@sss.pgh.pa.us>.  Is anyone interested
in reviewing this, or should I just push it?

BTW, I noticed that ts_headline() seems to not behave all that nicely
for phrase searches, eg

regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '2 <-> 3',
'ShortWord=0');
          ts_headline
--------------------------------
 1 <b>2</b> <b>3</b> 1 <b>3</b>
(1 row)

Highlighting the second "3", which is not a match, seems pretty dubious.
Negative items are even worse, they don't change the results at all:

regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '!2 <-> 3',
'ShortWord=0');
          ts_headline
--------------------------------
 1 <b>2</b> <b>3</b> 1 <b>3</b>
(1 row)

However, the code involved seems unrelated to the present patch, and
it's also about as close to completely uncommented as I've seen anywhere
in the PG code base.  So I'm not excited about touching it.

regards, tom lane

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 67d0c34..464ce83 100644
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
*** SELECT 'fat & rat & ! cat'::tsqu
*** 3959,3973 
  tsquery 
  
   'fat' & 'rat' & !'cat'
- 
- SELECT '(fat | rat) <-> cat'::tsquery;
-   tsquery
- ---
-  'fat' <-> 'cat' | 'rat' <-> 'cat'
  
- 
-  The last example demonstrates that tsquery sometimes
-  rearranges nested operators into a logically equivalent formulation.
  
  
  
--- 3959,3965 
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 2da7595..bc33a70 100644
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
*** SELECT 'fat & cow'::tsquery @@ 'a fa
*** 264,270 
  text, any more than a tsvector is.  A tsquery
  contains search terms, which must be already-normalized lexemes, and
  may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
! (For details see .)  There are
  functions to_tsquery, plainto_tsquery,
  and phraseto_tsquery
  that are helpful in converting user-written text into a proper
--- 264,270 
  text, any more than a tsvector is.  A tsquery
  contains search terms, which must be already-normalized lexemes, and
  may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
! (For syntax details see .)  There are
  functions to_tsquery, plainto_tsquery,
  and phraseto_tsquery
  that are helpful in converting user-written text into a proper
*** text @@ text
*** 323,328 
--- 323,330 
  at least one of its arguments must appear, while the ! (NOT)
  operator specifies that its argument must not appear in
  order to have a match.
+ For example, the query fat & ! rat matches documents that
+ contain fat but not rat.
 
  
 
*** SELECT phraseto_tsquery('the cats ate th
*** 377,382 
--- 379,401 
  then &, then <->,
  and ! most tightly.
 
+ 
+
+ It's worth noticing that the AND/OR/NOT operators mean something subtly
+ different when they are within the arguments of a FOLLOWED BY operator
+ than when they are not, because then the position of the match is
+ significant.  Normally, !x matches only documents that do not
+ contain x anywhere.  But x <-> !y
+ matches x if it is not immediately followed by y;
+ an occurrence of y elsewhere in the document does not prevent
+ a match.  Another example is that x & y normally only
+ requires that x and y both appear somewhere in the
+ document, but (x & y) <-> z requires x
+ and y to match at the same place, immediately before
+ a z.  Thus this query behaves differently from x
+ <-> z & y <-> z, which would match a document
+ containing two separate sequences x z and y z.
+

  

diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index efc111e..3e0a444 100644
*** a/src/backend/utils/adt/tsginidx.c
--- b/src/backend/utils/adt/tsginidx.c
*** checkcondition_gin(void *checkval, Query
*** 212,218 
   * Evaluate tsquery boolean expression using ternary logic.
   */
  static GinTernaryValue
! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem)
  {
  	GinTernaryValue val1,
  val2,
--- 212,218 
   * Evaluate tsquery boolean expression using ternary logic.
   */
  static GinTernaryValue
! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
  {
  	GinTernaryValue val1,
  val2,
*** TS_execute_ternary(GinChkVal *gcv, Query
*** 230,236 
  	switch (curitem->qoperator.oper)
  	{
  		case OP_NOT:
! 			result 

Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane  wrote:
>> I don't understand why we'd make that a system-wide behavior at all,
>> rather than expecting each process to manage its own cache.

> Individual backends don't have a really great way to do time-based
> stuff, do they?  I mean, yes, there is enable_timeout() and friends,
> but I think that requires quite a bit of bookkeeping.

If I thought that "every ten minutes" was an ideal way to manage this,
I might worry about that, but it doesn't really sound promising at all.
Every so many queries would likely work better, or better yet make it
self-adaptive depending on how much is in the local syscache.

The bigger picture here though is that we used to have limits on syscache
size, and we got rid of them (commit 8b9bc234a, see also
https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
not only because of the problem you mentioned about performance falling
off a cliff once the working-set size exceeded the arbitrary limit, but
also because enforcing the limit added significant overhead --- and did so
whether or not you got any benefit from it, ie even if the limit is never
reached.  Maybe the present patch avoids imposing a pile of overhead in
situations where no pruning is needed, but it doesn't really look very
promising from that angle in a quick once-over.

BTW, I don't see the point of the second patch at all?  Surely, if
an object is deleted or updated, we already have code that flushes
related catcache entries.  Otherwise the caches would deliver wrong
data.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2016-12-20 Thread Mithun Cy
On Tue, Dec 20, 2016 at 3:51 AM, Robert Haas  wrote:

> On Fri, Dec 16, 2016 at 5:16 AM, Mithun Cy 
> wrote:
>
>> Shouldn't _hash_doinsert() be using the cache, too
>>>
>>
>> Yes, we have an opportunity there, added same in code. But one difference
>> is at the end at-least once we need to read the meta page to split and/or
>> write. Performance improvement might not be as much as read-only.
>>
>
> Why would you need to read it at least once in one case but not the other?
>

For read-only: in _hash_first, if the target bucket has not been split
after caching the meta page contents, we never need to read the metapage.
But for insert: in _hash_doinsert, at the end we need to modify the meta
page to store the number of tuples.

+  * Write-lock the metapage so we can increment the tuple count. After
+  * incrementing it, check to see if it's time for a split.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ metap->hashm_ntuples += 1;


> Well, I guess that makes it more appealing to cache the whole page at least
> in terms of raw number of bytes, but I suppose my real complaint here is
> that there don't seem to be any clear rules for where, whether, and to what
> extent we can rely on the cache to be valid.

-- Okay, I will revert back to caching the entire meta page.


> Without that, I think this patch is creating an extremely serious
> maintenance hazard for anyone who wants to try to modify this code in the
> future.  A future patch author needs to be able to tell what data they can
> use and what data they can't use, and why.
>


-- I think, if it is okay, I can document for each member of
HashMetaPageData whether it can be read from the cached meta page or must
be read directly from the current meta page. Below I have briefly
commented on each member. If you agree with that approach, I will produce
a neat patch for it.

typedef struct HashMetaPageData
{

1. uint32 hashm_magic; /* magic no. for hash tables */

-- This should always remain the same, so it can be read from the cached
meta page.

2. uint32 hashm_version; /* version ID */

-- This is initialized once and never changed afterwards, so it can be
read from the cached metapage.

3. double hashm_ntuples; /* number of tuples stored in the table */

-- This changes on every insert, so it should not be read from the cached
data.

4. uint16 hashm_ffactor; /* target fill factor (tuples/bucket) */

-- Initialized once, never changed afterwards; can be read from the
cached metapage.

5. uint16 hashm_bsize; /* index page size (bytes) */

-- Initialized once, never changed afterwards; can be read from the
cached metapage.

6. uint16 hashm_bmsize; /* bitmap array size (bytes) - must be a power of 2 */

-- Initialized once, never changed afterwards; can be read from the
cached metapage.

7. uint16 hashm_bmshift; /* log2(bitmap array size in BITS) */

-- Initialized once, never changed afterwards; can be read from the
cached metapage.

8. uint32 hashm_maxbucket; /* ID of maximum bucket in use */
   uint32 hashm_highmask; /* mask to modulo into entire table */
   uint32 hashm_lowmask; /* mask to modulo into lower half of table */

-- If hashm_maxbucket, hashm_highmask and hashm_lowmask are all read and
cached at the same time while the metapage is locked, then the
key-to-bucket-number mapping function _hash_hashkey2bucket() will always
produce the same output. If the bucket is split after caching these
elements (which we can detect because old bucket pages never move once
allocated, and we mark each bucket page's hasho_prevblkno with the
incremented hashm_highmask), we can invalidate them and re-read them
from the meta page. If the intention is not to save a metapage read
while mapping a key to a bucket page, then do not read them from the
cached meta page.

9. uint32 hashm_ovflpoint; /* splitpoint from which ovflpgs being
                            * allocated */

-- Since this is used for allocation of overflow pages, always get the
latest value directly from the meta page.

10. uint32 hashm_firstfree; /* lowest-number free ovflpage (bit#) */

-- Should always be read from the metapage directly.

11. uint32 hashm_nmaps; /* number of bitmap pages */

-- Should always be read from the metapage directly.

12. RegProcedure hashm_procid; /* hash procedure id from pg_proc */

-- Never used so far; when we start using it, we will need to decide on
its policy.

13. uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before
                                                * each splitpoint */

-- The spares before a given splitpoint never change and bucket pages
never move, so when combined with the cached hashm_maxbucket,
hashm_highmask and hashm_lowmask (all read at the same time under the
lock), BUCKET_TO_BLKNO will always produce the same output, pointing to
the right physical block. Should only be used to save a meta page read
when mapping a key to a bucket block, as said above (see the sketch
below the struct).

14. BlockNumber hashm_mapp[HASH_MAX_BITMAPS]; /* blknos of ovfl bitmaps */

-- Always read from the metapage directly.

} HashMetaPageData;
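
To make the intended use concrete, the key-to-block mapping would touch
only the consistently-cached fields, roughly like this (a sketch;
"cached" stands for a local copy of HashMetaPageData):

    bucket = _hash_hashkey2bucket(hashkey,
                                  cached->hashm_maxbucket,
                                  cached->hashm_highmask,
                                  cached->hashm_lowmask);
    blkno = BUCKET_TO_BLKNO(cached, bucket);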



> Any existing bugs should be t

Re: [HACKERS] Migration Docs WAS: Proposal for changes to recovery.conf API

2016-12-20 Thread Josh Berkus
On 12/19/2016 01:29 PM, Peter Eisentraut wrote:
> On 12/16/16 8:52 PM, Robert Haas wrote:
>> > If the explanation is just a few sentences long, I see no reason not
>> > to include it in the release notes.
> As far as I can tell from the latest posted patch, the upgrade
> instructions are
> 
> - To trigger recovery, create an empty file recovery.trigger instead of
> recovery.conf.
> 
> - All parameters formerly in recovery.conf are now regular
> postgresql.conf parameters.  For backward compatibility, recovery.conf
> is read after postgresql.conf, but the parameters can now be put into
> postgresql.conf if desired.

Aren't we changing how some of the parameters work?

> 
> Some of that might be subject to patch review, but it's probably not
> going to be much longer than that.  So I think that will fit well into
> the usual release notes section.

Changed the subject line of this thread because people are becoming
confused about the new topic.

I'm not talking about *just* the recovery.conf changes.  We're making a
lot of changes to 10 which will require user action, and there may be
more before 10 is baked.  For example, dealing with the version
numbering change.  I started a list of the things we're breaking for 10,
but I don't have it with me at the moment.  There's more than 3 things
on it.

And then there's docs for stuff which isn't *required* by upgrading, but
would be a good idea.  For example, we'll eventually want a doc on how
to migrate old-style partitioned tables to new-style partitioned tables.

In any case, Peter's response shows *exactly* why I don't want to put
this documentation into the release notes: because people are going to
complain it's too long and want to truncate it. Writing the docs will be
hard enough; if I (or anyone else) has to argue about whether or not
they're too long, I'm just going to drop the patch and walk away.

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2016-12-20 Thread Alvaro Herrera
Robert Haas wrote:
> Implement table partitioning.

I thought it was odd to use rd_rel->reloftype as a boolean in
ATExecAttachPartition, but apparently we do it elsewhere too, so let's
leave that complaint for another day.

What I also found odd in the same function is that we use
SearchSysCacheCopyAttName() on each attribute and then don't free the
result, and don't ever use the returned tuple either.  A simple fix, I
thought, just remove the "Copy" and add a ReleaseSysCache().  But then I
noticed this whole thing is rather strange -- why not pass a boolean
flag down to CreateInheritance() and from there to
MergeAttributesIntoExisting() to implement the check?  That seems less
duplicative.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstattuple documentation clarification

2016-12-20 Thread Andrew Dunstan



On 12/20/2016 10:01 AM, Tom Lane wrote:

> Andrew Dunstan  writes:
>> Recently a client was confused because there was a substantial
>> difference between the reported table_len of a table and the sum of the
>> corresponding tuple_len, dead_tuple_len and free_space. The docs are
>> fairly silent on this point, and I agree that in the absence of
>> explanation it is confusing, so I propose that we add a clarification
>> note along the lines of:
>>     The table_len will always be greater than the sum of the tuple_len,
>>     dead_tuple_len and free_space. The difference is accounted for by
>>     page overhead and space that is not free but cannot be attributed to
>>     any particular tuple.
>> Or perhaps we should be more explicit and refer to the item pointers on
>> the page.
>
> I find "not free but cannot be attributed to any particular tuple"
> to be entirely useless weasel wording, not to mention wrong with
> respect to item pointers in particular.

Well, the reason I put it like that was that in my experimentation,
after I vacuumed the table after a large delete the item pointer table
didn't seem to shrink (at least according to the pgstattuple output), so
we had a page with 0 dead tuples but some non-live line pointer space.
If that's not what's happening then something is going on that I don't
understand. (Wouldn't be a first.)

> Perhaps we should start counting the item pointers in tuple_len.
> We'd still have to explain about page header overhead, but that
> would be a pretty small and fixed-size discrepancy.

Sure, sounds like a good idea. Meanwhile it would be nice to explain to
people exactly what we currently have. If you have a good formulation
I'm all ears.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane  wrote:
> Craig Ringer  writes:
>> On 20 December 2016 at 21:59, Robert Haas  wrote:
>>> We could implement this by having
>>> some process, like the background writer,
>>> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
>>> every 10 minutes or so.
>
>> ... on a rolling basis.
>
> I don't understand why we'd make that a system-wide behavior at all,
> rather than expecting each process to manage its own cache.

Individual backends don't have a really great way to do time-based
stuff, do they?  I mean, yes, there is enable_timeout() and friends,
but I think that requires quite a bit of bookkeeping.
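
Just to illustrate the sort of bookkeeping I mean, a backend-local
version would look roughly like this (a sketch only -- the handler and
flag names are invented, and the handler runs in signal context, so it
can do nothing more than set a flag):

#include "utils/timeout.h"

static volatile sig_atomic_t housekeeping_pending = false;
static TimeoutId housekeeping_timeout;

static void
housekeeping_timeout_handler(void)
{
    housekeeping_pending = true;    /* signal context: set a flag only */
}

/* once, at backend startup */
housekeeping_timeout = RegisterTimeout(USER_TIMEOUT,
                                       housekeeping_timeout_handler);
enable_timeout_after(housekeeping_timeout, 10 * 60 * 1000);

/* at some safe point, e.g. before going idle */
if (housekeeping_pending)
{
    housekeeping_pending = false;
    /* ... prune caches here ... */
    enable_timeout_after(housekeeping_timeout, 10 * 60 * 1000);
}

Every backend would have to carry the flag, re-arm the timer, and pick
safe points to check it -- that's the bookkeeping.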

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Robert Haas
On Mon, Dec 19, 2016 at 10:59 PM, Robert Haas  wrote:
> On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
>  wrote:
>> Here are updated patches including the additional information.
>
> Thanks.  Committed 0001.  Will have to review the others when I'm less tired.

0002. Can you add a test case for the bug fixed by this patch?

0003. Loses equalTupleDescs() check and various cases where
ExecOpenIndices can be skipped.  Isn't that bad?  Also, "We locked all
the partitions above including the leaf partitions" should say "Caller
must have locked all the partitions including the leaf partitions".

0004. Unnecessary whitespace change in executor.h.  Still don't
understand why we need to hoist RelationGetPartitionQual() into the
caller.

0005. Can you add a test case for the bug fixed by this patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-20 Thread Ants Aasma
I just had a client issue with table bloat that I traced back to a
stale xmin value in a replication slot. The xmin value from hot standby
feedback is stored in the replication slot and used for the vacuum xmin
calculation. If hot standby feedback is turned off while the walreceiver
is active, then the xmin gets reset by an HS feedback message containing
InvalidTransactionId. However, if feedback gets turned off while the
standby is shut down, this message never gets sent and a stale value
gets left behind in the replication slot, holding back vacuum.

The simple fix seems to be to always send out at least one feedback
message on each connect, regardless of the hot_standby_feedback setting.
Patch attached. Looks like this goes back to version 9.4. It could
conceivably cause issues for replication middleware that does not know
how to handle hot standby feedback messages. Not sure if any exist and
whether that is a concern.

A shell script to reproduce the problem is also attached; adjust the
PGPATH variable to point at your postgres install and run it in an
empty directory.

Regards,
Ants Aasma
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..31333ec 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,20 +1194,13 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
+	 *
+	 * If Hot Standby is not yet active we reset the xmin value. Check this
+	 * after the interval has expired to reduce number of calls.
 	 */
-	if (hot_standby_feedback)
+	if (hot_standby_feedback && HotStandbyActive())
 		xmin = GetOldestXmin(NULL, false);
 	else
 		xmin = InvalidTransactionId;


slot-xmin-not-reset-reproduce.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera
>  wrote:
> > Even if we decide to keep the message, I think it's not very good
> > wording anyhow; as a translator I disliked it on sight.  Instead of
> > "skipping scan to validate" I would use "skipping validation scan",
> > except that it's not clear what it is we're validating.  Mentioning
> > partition constraint in errcontext() doesn't seem like a great solution, but
> > I can't think of anything better.
> 
> Maybe something like: partition constraint for table \"%s\" is implied
> by existing constraints

Actually, shouldn't we emit a message if we *don't* skip the check?


-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-20 Thread David Fetter
On Tue, Dec 20, 2016 at 11:11:36AM +0530, amul sul wrote:
> On Tue, Dec 20, 2016 at 12:21 AM, David Fetter  wrote:
> > On Thu, Nov 24, 2016 at 09:16:53AM +0530, amul sul wrote:
> >> Hi All,
> >>
> >> I would like to take over pg_background patch and repost for
> >> discussion and review.
> >
> > This looks great.
> >
> > Sadly, it now breaks initdb when applied atop
> > dd728826c538f000220af98de025c00114ad0631 with:
> >
> > performing post-bootstrap initialization ... TRAP: FailedAssertion("!(((((const Node*)(rinfo))->type) == T_RestrictInfo))", File: "costsize.c", Line: 660)
> > sh: line 1:  2840 Aborted (core dumped) "/home/shackle/pggit/postgresql/tmp_install/home/shackle/10/bin/postgres" --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null
> >
> 
> I've compiled a binary with the --enable-cassert flag and the
> pg_background patch on top of the described commit as well as on latest
> HEAD, but I am not able to reproduce this crash; see this:
> 
> [VM postgresql]$ /home/amul/Public/pg_inst/pg-master/bin/postgres
> --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true
> template1

I haven't managed to reproduce it either.  Sorry about the noise.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread David Fetter
On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote:
> Heikki,
> 
> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > Even if you have a separate "verifier type" column, it's not fully
> > normalized, because there's still a dependency between the
> > verifier and verifier type columns. You will always need to look
> > at the verifier type to make sense of the verifier itself.
> 
> That's true- but you don't need to look at the verifier, or even
> have *access* to the verifier, to look at the verifier type.

Would a view that shows only what's to the left of the first colon
suit this purpose?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera
 wrote:
> Even if we decide to keep the message, I think it's not very good
> wording anyhow; as a translator I disliked it on sight.  Instead of
> "skipping scan to validate" I would use "skipping validation scan",
> except that it's not clear what it is we're validating.  Mentioning
> partition constraint in errcontext() doesn't seem like a great solution, but
> I can't think of anything better.

Maybe something like: partition constraint for table \"%s\" is implied
by existing constraints

> (We have the table_rewrite event trigger, for a very similar use case.)

Hmm, maybe we should see if we can safely support an event trigger here, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-12-20 Thread Ashutosh Sharma
Hi Jesper,

> I was planning to submit a new version next week for CF/January, so I'll
> review your changes with the previous feedback in mind.
>
> Thanks for working on this !

As I had not seen any updates from you for the last month, I thought
of working on it. I have created a commitfest entry for it and have
added your name as the main author.  I am sorry; I think I should have
asked for your opinion before starting to work on it.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Dec 20, 2016 at 4:51 AM, Alvaro Herrera
>  wrote:
> > Amit Langote wrote:
> >
> >> diff --git a/src/backend/commands/tablecmds.c 
> >> b/src/backend/commands/tablecmds.c
> >> index 1c219b03dd..6a179596ce 100644
> >> --- a/src/backend/commands/tablecmds.c
> >> +++ b/src/backend/commands/tablecmds.c
> >> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation 
> >> rel, PartitionCmd *cmd)
> >>   }
> >>   }
> >>
> >> + /* It's safe to skip the validation scan after all */
> >>   if (skip_validate)
> >> - elog(NOTICE, "skipping scan to validate partition 
> >> constraint");
> >> + ereport(INFO,
> >> + (errmsg("skipping scan to validate partition 
> >> constraint")));
> >
> > Why not just remove the message altogether?
> 
> That's certainly an option.  It might be noise in some situations.  On
> the other hand, it affects whether attaching the partition is O(1) or
> O(n), so somebody might well want to know.  Or maybe they might be
> more likely to want a message in the reverse situation, telling them
> that the partition constraint DOES need to be validated.  I'm not sure
> what the best user interface is here; thoughts welcome.

Even if we decide to keep the message, I think it's not very good
wording anyhow; as a translator I disliked it on sight.  Instead of
"skipping scan to validate" I would use "skipping validation scan",
except that it's not clear what it is we're validating.  Mentioning
partition constraint in errcontext() doesn't seem like a great solution, but
I can't think of anything better.

(We have the table_rewrite event trigger, for a very similar use case.)

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-12-20 Thread Simon Riggs
On 20 December 2016 at 15:03, Fujii Masao  wrote:

> API for crash recovery will never be changed. That is, crash recovery needs
> neither recovery.trigger nor standby.trigger. When the server starts a crash
> recovery without any trigger file, any recovery parameter settings in
> postgresql.conf are ignored. Right?

Yes. There are no conceptual changes, just the API.

The goals are: visibility and reloading of recovery parameters,
removal of special case code.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstattuple documentation clarification

2016-12-20 Thread Tom Lane
Andrew Dunstan  writes:
> Recently a client was confused because there was a substantial 
> difference between the reported table_len of a table and the sum of the 
> corresponding tuple_len, dead_tuple_len and free_space. The docs are 
> fairly silent on this point, and I agree that in the absence of 
> explanation it is confusing, so I propose that we add a clarification 
> note along the lines of:

> The table_len will always be greater than the sum of the tuple_len,
> dead_tuple_len and free_space. The difference is accounted for by
> page overhead and space that is not free but cannot be attributed to
> any particular tuple.

> Or perhaps we should be more explicit and refer to the item pointers on 
> the page.

I find "not free but cannot be attributed to any particular tuple"
to be entirely useless weasel wording, not to mention wrong with
respect to item pointers in particular.

Perhaps we should start counting the item pointers in tuple_len.
We'd still have to explain about page header overhead, but that
would be a pretty small and fixed-size discrepancy.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-20 Thread Tom Lane
Robert Haas  writes:
> Well, I'm hoping there is a way to have a fast-path and a slow-path
> without slowing things down too much.

Seems like this discussion has veered way off into the weeds.
I suggest we confine it to the stated topic; if you want to discuss
ways to optimize V1 protocol (or invent a V2, which some of this
sounds like it would need), that should be a distinct thread.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Tom Lane
Craig Ringer  writes:
> On 20 December 2016 at 21:59, Robert Haas  wrote:
>> We could implement this by having
>> some process, like the background writer,
>> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
>> every 10 minutes or so.

> ... on a rolling basis.

I don't understand why we'd make that a system-wide behavior at all,
rather than expecting each process to manage its own cache.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-12-20 Thread Jesper Pedersen

Hi,

On 12/20/2016 05:55 AM, Ashutosh Sharma wrote:

Attached is the revised patch. Please have a look and let me know your
feedback. I have also changed the status for this patch in the
upcoming commitfest to "needs review".  Thanks.



I was planning to submit a new version next week for CF/January, so I'll 
review your changes with the previous feedback in mind.


Thanks for working on this !

Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-12-20 Thread Fujii Masao
On Tue, Dec 20, 2016 at 7:11 AM, Simon Riggs  wrote:
> On 19 December 2016 at 21:29, Peter Eisentraut
>  wrote:
>> On 12/16/16 8:52 PM, Robert Haas wrote:
>>> If the explanation is just a few sentences long, I see no reason not
>>> to include it in the release notes.
>>
>> As far as I can tell from the latest posted patch, the upgrade
>> instructions are
>>
>> - To trigger recovery, create an empty file recovery.trigger instead of
>> recovery.conf.
>>
>> - All parameters formerly in recovery.conf are now regular
>> postgresql.conf parameters.  For backward compatibility, recovery.conf
>> is read after postgresql.conf, but the parameters can now be put into
>> postgresql.conf if desired.
>>
>> Some of that might be subject to patch review, but it's probably not
>> going to be much longer than that.  So I think that will fit well into
>> the usual release notes section.
>
> Although that's right, there is more detail.
>
> The current list is this... based upon discussion in Tokyo dev meeting
>
> * Any use of the earlier recovery.conf or any of the old recovery
> parameter names causes an ERROR, so we have a clear API break
>
> * To trigger recovery, create an empty recovery.trigger file. Same as
> recovery.conf with standby_mode = off
>
> * To trigger standby, create an empty standby.trigger file. Same as
> recovery.conf with standby_mode = on

API for crash recovery will never be changed. That is, crash recovery needs
neither recovery.trigger nor standby.trigger. When the server starts a crash
recovery without any trigger file, any recovery parameter settings in
postgresql.conf are ignored. Right?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-20 Thread Fujii Masao
On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier
 wrote:
> On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada  
> wrote:
>> Do we need to consider the sorting method and the selecting k-th
>> latest LSN method?
>
> Honestly, nah. Tests are showing that there are many more bottlenecks
> before that with just memory allocation and parsing.

I think that it's worth prototyping an alternative algorithm and
measuring the performance of both the alternative and the current
algorithms. This measurement should check not only the bottleneck
but also how much each algorithm increases the time that backends
need to wait before they receive the ack from the walsender.

If it turns out that the current algorithm is efficient enough,
we can just leave the code as it is. OTOH, if not, let's adopt
the alternative one.
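
For concreteness, the simplest alternative I can think of is to sort
the reported flush LSNs and pick the k-th one (a sketch with invented
names; n is the number of candidate standbys, which is small):

static XLogRecPtr
kth_largest_lsn(XLogRecPtr *lsns, int n, int k)
{
    int         i;
    int         j;

    /* insertion sort in descending order; fine for small n */
    for (i = 1; i < n; i++)
    {
        XLogRecPtr  v = lsns[i];

        for (j = i - 1; j >= 0 && lsns[j] < v; j--)
            lsns[j + 1] = lsns[j];
        lsns[j + 1] = v;
    }

    return lsns[k - 1];         /* k is 1-based */
}

Measuring something like this against the current code would tell us
whether the choice of algorithm matters here at all.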

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Craig Ringer
On 20 December 2016 at 21:59, Robert Haas  wrote:

> We could implement this by having
> some process, like the background writer,
> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
> every 10 minutes or so.

... on a rolling basis.

Otherwise that'll be no fun at all, especially with some of those
lovely "we kept getting errors so we raised max_connections to 5000"
systems out there. But also on more sensibly configured ones that're
busy and want nice smooth performance without stalls.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0

2016-12-20 Thread Robert Haas
On Sat, Dec 17, 2016 at 5:46 AM, Amit Kapila  wrote:
> Yeah, but we are planning to add a generic flag in WaitEvent structure
> which can be used for similar purpose.  However, as per your
> suggestion, I have changed it only for Win32 port.  Also, I kept the
> change in ModifyWaitEvent API as that seems to be okay considering
> 'reset' is a generic flag in WaitEvent structure.

Well, we don't really need the change in ModifyWaitEvent if we have
the change in WaitEventSetWaitBlock, right?  I'd be inclined to ditch
the former and keep the latter.  Also, this doesn't look right:

+   for (cur_event = set->events;
+cur_event < (set->events + set->nevents)
+&& returned_events < nevents;
+cur_event++)
+   {
+   if (cur_event->reset)
+   {
+   WaitEventAdjustWin32(set, cur_event);
+   cur_event->reset = false;
+   }
+   }

There's no need to include returned_events < nevents in the loop
condition here because returned_events isn't changing.
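
That is, the loop header can simply be:

for (cur_event = set->events;
     cur_event < (set->events + set->nevents);
     cur_event++)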

I think I'd also guard the reset flag with #ifdef WIN32.  If it's not
properly supported elsewhere it's just a foot-gun, and there seems to
be no reason to write the code to properly support it elsewhere,
whatever that would mean.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-20 Thread Fujii Masao
On Tue, Dec 20, 2016 at 1:44 AM, Alvaro Herrera
 wrote:
> Fujii Masao wrote:
>
>> Regarding this feature, there are some loose ends. We should work on
>> and complete them until the release.
>
> Please list these in https://wiki.postgresql.org/wiki/Open_Items so that we
> don't forget.

Yep, added!

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-12-20 Thread Petr Jelinek
On 20/12/16 10:56, Erik Rijkers wrote:
> On 2016-12-20 10:48, Petr Jelinek wrote:
> 
> Here is another small thing:
> 
> $ psql -d testdb -p 6972
> psql (10devel_logical_replication_20161220_1008_db80acfc9d50)
> Type "help" for help.
> 
> testdb=# drop publication if exists xxx;
> ERROR:  unrecognized object type: 28
> 
> 
> testdb=# drop subscription if exists xxx;
> WARNING:  relcache reference leak: relation "pg_subscription" not closed
> DROP SUBSCRIPTION
> 
> 
> I don't mind but I suppose eventually other messages need to go there
> 

Yep, attached should fix it.

DDL for completely new db objects surely touches a lot of places.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index 61ff8f2..7080c4b 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -441,6 +441,10 @@ does_not_exist_skipping(ObjectType objtype, List *objname, List *objargs)
 }
 			}
 			break;
+		case OBJECT_PUBLICATION:
+			msg = gettext_noop("publication \"%s\" does not exist, skipping");
+			name = NameListToString(objname);
+			break;
 		default:
 			elog(ERROR, "unrecognized object type: %d", (int) objtype);
 			break;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 25c8c34..bd27aac 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -492,6 +492,8 @@ DropSubscription(DropSubscriptionStmt *stmt)
 
 	if (!HeapTupleIsValid(tup))
 	{
+		heap_close(rel, NoLock);
+
 		if (!stmt->missing_ok)
 			ereport(ERROR,
 	(errcode(ERRCODE_UNDEFINED_OBJECT),
@@ -501,6 +503,7 @@ DropSubscription(DropSubscriptionStmt *stmt)
 			ereport(NOTICE,
 	(errmsg("subscription \"%s\" does not exist, skipping",
 			stmt->subname)));
+
 		return;
 	}
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-12-20 Thread Amit Kapila
On Tue, Dec 20, 2016 at 7:44 PM, Robert Haas  wrote:
> On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila  wrote:
>> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas  wrote:
>>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  
>>> wrote:
>>>> We have mainly four actions for the squeeze operation: add tuples to
>>>> the write page, empty the overflow page, unlink the overflow page, and
>>>> make it free by setting the corresponding bit in the overflow page.
>>>> Now, if we don't log the changes to the write page and the freeing of
>>>> the overflow page as one operation, then won't a query on the standby
>>>> either see duplicate tuples or miss the tuples which are freed in the
>>>> overflow page?
>>>
>>> No, I think you could have two operations:
>>>
>>> 1. Move tuples from the "read" page to the "write" page.
>>>
>>> 2. Unlink the overflow page from the chain and mark it free.
>>>
>>> If we fail after step 1, the bucket chain might end with an empty
>>> overflow page, but that's OK.
>>
>> If there is an empty page in the bucket chain, access to that page will
>> give an error.  (In the WAL patch we are initializing the page instead
>> of making it completely empty, so we might not see an error in such a
>> case.)
>
> It wouldn't be a new, uninitialized page.  It would be empty of
> tuples, not all-zeroes.
>

AFAIU we initialize the page as all-zeros, but I think you are
envisioning that we need to change it to an initialized page that is
just empty of tuples.

>> What advantage do you see by splitting the operation?
>
> It's simpler.  The code here is very complicated and trying to merge
> too many things into a single operation may make it even more
> complicated, increasing the risk of bugs and making the code hard to
> maintain without necessarily buying much performance.
>

Sure, if you find that way better, then we can change it, but the
current patch treats it as a single operation.  If, after looking at the
patch, you find it is better to change it into two operations, I will
do so.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 4:51 AM, Alvaro Herrera
 wrote:
> Amit Langote wrote:
>
>> diff --git a/src/backend/commands/tablecmds.c 
>> b/src/backend/commands/tablecmds.c
>> index 1c219b03dd..6a179596ce 100644
>> --- a/src/backend/commands/tablecmds.c
>> +++ b/src/backend/commands/tablecmds.c
>> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, 
>> PartitionCmd *cmd)
>>   }
>>   }
>>
>> + /* It's safe to skip the validation scan after all */
>>   if (skip_validate)
>> - elog(NOTICE, "skipping scan to validate partition constraint");
>> + ereport(INFO,
>> + (errmsg("skipping scan to validate partition 
>> constraint")));
>
> Why not just remove the message altogether?

That's certainly an option.  It might be noise in some situations.  On
the other hand, it affects whether attaching the partition is O(1) or
O(n), so somebody might well want to know.  Or maybe they might be
more likely to want a message in the reverse situation, telling them
that the partition constraint DOES need to be validated.  I'm not sure
what the best user interface is here; thoughts welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila  wrote:
> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas  wrote:
>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  wrote:
>>> We have mainly four actions for the squeeze operation: add tuples to
>>> the write page, empty the overflow page, unlink the overflow page, and
>>> make it free by setting the corresponding bit in the overflow page.
>>> Now, if we don't log the changes to the write page and the freeing of
>>> the overflow page as one operation, then won't a query on the standby
>>> either see duplicate tuples or miss the tuples which are freed in the
>>> overflow page?
>>
>> No, I think you could have two operations:
>>
>> 1. Move tuples from the "read" page to the "write" page.
>>
>> 2. Unlink the overflow page from the chain and mark it free.
>>
>> If we fail after step 1, the bucket chain might end with an empty
>> overflow page, but that's OK.
>
> If there is an empty page in the bucket chain, access to that page will
> give an error.  (In the WAL patch we are initializing the page instead
> of making it completely empty, so we might not see an error in such a
> case.)

It wouldn't be a new, uninitialized page.  It would be empty of
tuples, not all-zeroes.

> What advantage do you see by splitting the operation?

It's simpler.  The code here is very complicated and trying to merge
too many things into a single operation may make it even more
complicated, increasing the risk of bugs and making the code hard to
maintain without necessarily buying much performance.

> Anyway, I think it is better to discuss this in WAL patch thread.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 8:44 AM, Andres Freund  wrote:
>> Advanced Server uses 256, and we had a customer complain recently that
>> 256 wasn't high enough.  So apparently the use case for functions with
>> ridiculous numbers of arguments isn't exactly 0.
>
> Well, there's a cost/benefit analysis involved here, as in a lot of
> other places. There's a lot of things one can conceive a use-case for,
> doesn't mean it's worth catering for all of them.

Clearly true.

>> I mean, there's no reason that we can't use dynamic allocation here.
>> If we put the first, say, 8 arguments in the FunctionCallInfoData
>> itself and then separately allocated space for any extras, the structure
>> could be a lot smaller without needing an arbitrary limit on the
>> argument count.
>
> Except that that'll mean additional overhead during evaluation of
> function arguments, an overhead that everyone will have to pay. Suddenly
> you need two loops that fill in arguments, depending on their
> argument-number (i.e. additional branches in a tight spot).  And not
> being able to pass the null bitmap via a register argument also costs
> everyone.

Well, I'm hoping there is a way to have a fast-path and a slow-path
without slowing things down too much.  You're proposing something like
that anyway, because you're proposing to put the first two arguments
in actual function arguments and the rest someplace else.  I wouldn't
be surprised if 99% of function calls had 2 arguments or less because
most people probably call functions mostly or entirely as operators
and even user-defined functions don't necessarily have lots of
arguments.  But we wouldn't propose restricting all function calls to
2 arguments just so +(int4, int4) can be fast.  There's clearly got to
be a slow path for calls with more than 2 arguments and, if that's the
case, I don't see why there can't be a really-slow path, perhaps
guarded by unlikely(), for cases involving more than 8 or 16 or
whatever.  I find it really hard to believe that in 2016 that we can't
find a way to make simple things fast and complicated things possible.
Saying we can't make SQL-callable functions with large number of
arguments work feels to me like saying that we have to go back to
8-character filenames with 3-character extensions for efficiency - in
the 1980s, that argument might have held some water, but nobody buys
it any more.  Human time is worth more than computer time, and humans
save time when programs don't impose inconveniently-small limits.
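
To sketch the dynamic-allocation idea (hypothetical names and layout,
not a concrete proposal):

#define FUNC_INLINE_ARGS 8

typedef struct FunctionCallInfoSketch
{
    /* ... the existing flinfo/context/resultinfo/collation fields ... */
    short       nargs;
    bool        isnull;
    Datum       arg[FUNC_INLINE_ARGS];      /* common case, no extra palloc */
    bool        argnull[FUNC_INLINE_ARGS];
    Datum      *extra_arg;                  /* palloc'd only if nargs > 8 */
    bool       *extra_argnull;
} FunctionCallInfoSketch;

static inline Datum
FC_GETARG(FunctionCallInfoSketch *fcinfo, int i)
{
    /* fast path for the first 8 arguments, rare slow path beyond that */
    return (i < FUNC_INLINE_ARGS) ? fcinfo->arg[i]
        : fcinfo->extra_arg[i - FUNC_INLINE_ARGS];
}

The extra branch only has to be paid by calls that actually use lots of
arguments.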

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pgstattuple documentation clarification

2016-12-20 Thread Andrew Dunstan


Recently a client was confused because there was a substantial 
difference between the reported table_len of a table and the sum of the 
corresponding tuple_len, dead_tuple_len and free_space. The docs are 
fairly silent on this point, and I agree that in the absence of 
explanation it is confusing, so I propose that we add a clarification 
note along the lines of:



   The table_len will always be greater than the sum of the tuple_len,
   dead_tuple_len and free_space. The difference is accounted for by
   page overhead and space that is not free but cannot be attributed to
   any particular tuple.


Or perhaps we should be more explicit and refer to the item pointers on 
the page.



Thoughts?


cheers


andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-12-20 Thread Amit Kapila
On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas  wrote:
> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  wrote:
>> We have mainly four actions for the squeeze operation: add tuples to
>> the write page, empty the overflow page, unlink the overflow page, and
>> make it free by setting the corresponding bit in the overflow page.
>> Now, if we don't log the changes to the write page and the freeing of
>> the overflow page as one operation, then won't a query on the standby
>> either see duplicate tuples or miss the tuples which are freed in the
>> overflow page?
>
> No, I think you could have two operations:
>
> 1. Move tuples from the "read" page to the "write" page.
>
> 2. Unlink the overflow page from the chain and mark it free.
>
> If we fail after step 1, the bucket chain might end with an empty
> overflow page, but that's OK.
>

If there is an empty page in the bucket chain, access to that page will
give an error.  (In the WAL patch we are initializing the page instead
of making it completely empty, so we might not see an error in such a
case.)  What advantage do you see by splitting the operation?
Anyway, I think it is better to discuss this in the WAL patch thread.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protect syscache from bloating with negative cache entries

2016-12-20 Thread Robert Haas
On Mon, Dec 19, 2016 at 6:15 AM, Kyotaro HORIGUCHI
 wrote:
> Hello, recently one of my customers stumbled over immoderate
> catcache bloat.

This isn't only an issue for negative catcache entries.  A long time
ago, there was a limit on the size of the relcache, which was removed
because if you have a workload where the working set of relations is
just larger than the limit, performance is terrible.  But the problem
now is that backend memory usage can grow without bound, and that's
also bad, especially on systems with hundreds of long-lived backends.
In connection-pooling environments, the problem is worse, because
every connection in the pool eventually caches references to
everything of interest to any client.

Your patches seem to me to have some merit, but I wonder if we should
also consider having a time-based threshold of some kind.  If, say, a
backend hasn't accessed a catcache or relcache entry for many minutes,
it becomes eligible to be flushed.  We could implement this by having
some process, like the background writer,
SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
every 10 minutes or so.  When a process receives this signal, it sets
a flag that is checked before going idle.  When it sees the flag set,
it makes a pass over every catcache and relcache entry.  All the ones
that are unmarked get marked, and all of the ones that are marked get
removed.  Access to an entry clears any mark.  So anything that's not
touched for more than 10 minutes starts dropping out of backend
caches.
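
In pseudo-code, the sweep itself would just be the classic second-chance
scheme (invented names, not actual catcache code):

/* on every cache hit */
entry->marked = false;

/* when the housekeeping flag is noticed before going idle */
for each entry in the cache:
    if (entry->marked)
        remove the entry;       /* untouched for a whole cycle */
    else
        entry->marked = true;   /* candidate for the next sweep */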

Anyway, that would be a much bigger change from what you are proposing
here, and what you are proposing here seems reasonable so I guess I
shouldn't distract from it.  Your email just made me think of it,
because I agree that catcache/relcache bloat is a serious issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-12-20 Thread Peter Eisentraut
On 12/1/16 9:47 PM, Andreas Karlsson wrote:
> I think this patch looks good now so I am setting it to ready for committer.

committed, thanks

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-20 Thread Andres Freund
On 2016-12-20 08:35:05 -0500, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 8:21 AM, Andres Freund  wrote:
> > On 2016-12-20 08:15:07 -0500, Robert Haas wrote:
> >> On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund  wrote:
> >> > I think a more efficient variant would make the function signature look
> >> > something like:
> >> >
> >> > Datum /* directly returned argument */
> >> > pgfunc(
> >> > /* extra information about function call */
> >> > FunctionCallInfo *fcinfo,
> >> > /* bitmap of NULL arguments */
> >> > uint64_t nulls,
> >> > /* first argument */
> >> > Datum arg0,
> >> > /* second argument */
> >> > Datum arg1,
> >> > /* returned NULL */
> >> > bool *isnull
> >> > );
> >>
> >> Yeah, that's kind of nice.  I like the uint64 for nulls, although
> >> FUNC_MAX_ARGS > 64 by default and certainly can be configured that
> >> way.  It wouldn't be a problem for any common cases, of course.
> >
> > Well, I suspect we'd have to change that then. Is anybody seriously
> > interested in supporting FUNC_MAX_ARGS > 64? We don't have to make our
> > life hard by supporting useless features... I'd even question using
> > 64bit for this on 32bit platforms.
>
> Advanced Server uses 256, and we had a customer complain recently that
> 256 wasn't high enough.  So apparently the use case for functions with
> ridiculous numbers of arguments isn't exactly 0.

Well, there's a cost/benefit analysis involved here, as in a lot of
other places. There's a lot of things one can conceive a use-case for,
doesn't mean it's worth catering for all of them.


> I mean, there's no reason that we can't use dynamic allocation here.
> If we put the first, say, 8 arguments in the FunctionCallInfoData
> itself and then separately allocated space for any extras, the structure
> could be a lot smaller without needing an arbitrary limit on the
> argument count.

Except that that'll mean additional overhead during evaluation of
function arguments, an overhead that everyone will have to pay. Suddenly
you need two loops that fill in arguments, depending on their
argument-number (i.e. additional branches in a tight spot).  And not
being able to pass the null bitmap via a register argument also costs
everyone.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  wrote:
> We have mainly four actions for the squeeze operation: add tuples to
> the write page, empty the overflow page, unlink the overflow page, and
> make it free by setting the corresponding bit in the overflow page.
> Now, if we don't log the changes to the write page and the freeing of
> the overflow page as one operation, then won't a query on the standby
> either see duplicate tuples or miss the tuples which are freed in the
> overflow page?

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make pg_basebackup -x stream the default

2016-12-20 Thread Fujii Masao
On Mon, Dec 19, 2016 at 7:51 PM, Vladimir Rusinov  wrote:
>
> On Sat, Dec 17, 2016 at 2:37 PM, Magnus Hagander 
> wrote:
>>
>> Attached is an updated patch that does this. As a bonus it simplifies the
>> code a bit. I also fixed an error message that I missed updating in the
>> previous patch.
>
>
> looks good to me.

Yep, basically looks good to me. Here are some small comments.

> while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:nNzZ:d:c:h:p:U:s:S:wWvP",
> long_options, &option_index)) != -1)

'x' should be removed from the above code.
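
(That is, the option string would then read
"D:F:r:RT:X:l:nNzZ:d:c:h:p:U:s:S:wWvP".)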

> To create a backup of a single-tablespace local database and compress this 
> with bzip2:
>
> $ pg_basebackup -D - -Ft | bzip2 > backup.tar.bz2

The above example in the docs needs to be modified. For example,
add "-X fetch" to the above command so that pg_basebackup
does not exit with an error.
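
For example:

$ pg_basebackup -D - -Ft -X fetch | bzip2 > backup.tar.bz2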

> The server must also be configured with max_wal_senders set high
> enough to leave at least one session available for the backup.

I think that it's better to explain explicitly here that max_wal_senders
should be higher than one by default.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Stephen Frost
Heikki,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> Even if you have a separate "verifier type" column, it's not fully
> normalized, because there's still a dependency between the verifier
> and verifier type columns. You will always need to look at the
> verifier type to make sense of the verifier itself.

That's true -- but you don't need to look at the verifier, or even have
*access* to the verifier, to look at the verifier type.  That is
actually very useful when you start thinking about the downstream side
of this- what about the monitoring tool which will want to check and
make sure there are only certain verifier types being used?  It'll have
to be a superuser, or have access to some superuser-owned security
definer function, and that really sucks.  I'm not saying that we would
necessairly want the verifier type to be publicly visible, but being
able to see it without being a superuser would be good, imv.

> It's more convenient to carry the type information with the verifier
> itself, in backend code, in pg_dump, etc. Sure, you could have a
> separate "transfer" text format that has the prefix, and strip it
> out when the datum enters the system. But it is even simpler to have
> only one format, with the prefix, and use that everywhere.

It's more convenient when you need to look at both -- it's not more
convenient when you only wish to look at the verifier type.  Further, it
means that we have to have a construct that assumes things about the
verifier type and verifier -- what if a verifier type came along that used
a colon?  We'd have to do some special magic to handle that correctly,
and that just sucks, and anyone who is writing code to generically deal
with these fields will end up writing that same code (or forgetting to,
and not handling the case correctly).

> It might make sense to add a separate column, to e.g. make it easier
> to e.g. query for users that have an MD5 verifier. You could do
> "WHERE rolverifiertype = 'md5'", instead of "WHERE rolpassword LIKE
> 'md5%'". It's not a big difference, though. But even if we did that,
> I would still love to have the type information *also* included with
> the verifier itself, for convenience. And if we include it in the
> verifier itself, adding a separate type column seems more trouble
> than it's worth.

I don't agree that it's "not a big difference."  As I argue above, your
approach also assumes that anyone who would like to investigate the
verifier type should have access to the verifier itself, which I do not
agree with.  I also have a hard time buying the argument that it's
really so much more convenient to have the verifier type included in the
same string as the verifier that we should duplicate that information
and then run the risk that we end up with the two not matching or that
we won't ever run into complications down the road when our chosen
separator causes us difficulties.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 8:21 AM, Andres Freund  wrote:
> On 2016-12-20 08:15:07 -0500, Robert Haas wrote:
>> On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund  wrote:
>> > I think a more efficient variant would make the function signature look
>> > something like:
>> >
>> > Datum /* directly returned argument */
>> > pgfunc(
>> > /* extra information about function call */
>> > FunctionCallInfo *fcinfo,
>> > /* bitmap of NULL arguments */
>> > uint64_t nulls,
>> > /* first argument */
>> > Datum arg0,
>> > /* second argument */
>> > Datum arg1,
>> > /* returned NULL */
>> > bool *isnull
>> > );
>>
>> Yeah, that's kind of nice.  I like the uint64 for nulls, although
>> FUNC_MAX_ARGS > 64 by default and certainly can be configured that
>> way.  It wouldn't be a problem for any common cases, of course.
>
> Well, I suspect we'd have to change that then. Is anybody seriously
> interested in supporting FUNC_MAX_ARGS > 64? We don't have to make our
> life hard by supporting useless features... I'd even question using
> 64bit for this on 32bit platforms.

Advanced Server uses 256, and we had a customer complain recently that
256 wasn't high enough.  So apparently the use case for functions with
ridiculous numbers of arguments isn't exactly 0.  For most people it's
not a big problem in practice, but actually I think that it's pretty
lame that we have a limit at all.  I mean, there's no reason that we
can't use dynamic allocation here.  If we put the first, say, 8
arguments in the FunctionCallInfoData itself and then separately
allocated space for any extras, the structure could be a lot smaller
without needing an arbitrary limit on the argument count.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-20 Thread Andres Freund
On 2016-12-20 08:15:07 -0500, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund  wrote:
> > I think a more efficient variant would make the function signature look
> > something like:
> >
> > Datum /* directly returned argument */
> > pgfunc(
> > /* extra information about function call */
> > FunctionCallInfo *fcinfo,
> > /* bitmap of NULL arguments */
> > uint64_t nulls,
> > /* first argument */
> > Datum arg0,
> > /* second argument */
> > Datum arg1,
> > /* returned NULL */
> > bool *isnull
> > );
> 
> Yeah, that's kind of nice.  I like the uint64 for nulls, although
> FUNC_MAX_ARGS > 64 by default and certainly can be configured that
> way.  It wouldn't be a problem for any common cases, of course.

Well, I suspect we'd have to change that then. Is anybody seriously
interested in supporting FUNC_MAX_ARGS > 64? We don't have to make our
life hard by supporting useless features... I'd even question using
64bit for this on 32bit platforms.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-12-20 Thread Andres Freund
On 2016-12-20 08:10:29 -0500, Robert Haas wrote:
> We could use the GUC assign hook to compute a mask and a shift, so
> that this could be written as (CurrPos & mask_variable) == 0.  That
> would avoid the division instruction, though not the memory access.

I suspect that'd be fine.


> I hope this is all in the noise, though.

Could very well be.


> I know this is code is hot but I think it'll be hard to construct a
> test case where the bottleneck is anything other than the speed at
> which the disk can absorb bytes.

I don't think that's really true. Heikki's WAL changes made a *BIG*
difference. And pretty small changes in xlog.c can make noticeable
throughput differences both in single and multi-threaded
workloads. E.g. witnessed by the fact that the crc computation used to
be a major bottleneck (and the crc32c instruction still shows up
noticeably in profiles).  SSDs have become fast enough that it's
increasingly hard to saturate them.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund  wrote:
> I think a more efficient variant would make the function signature look
> something like:
>
> Datum /* directly returned argument */
> pgfunc(
> /* extra information about function call */
> FunctionCallInfo *fcinfo,
> /* bitmap of NULL arguments */
> uint64_t nulls,
> /* first argument */
> Datum arg0,
> /* second argument */
> Datum arg1,
> /* returned NULL */
> bool *isnull
> );

Yeah, that's kind of nice.  I like the uint64 for nulls, although
FUNC_MAX_ARGS > 64 by default and certainly can be configured that
way.  It wouldn't be a problem for any common cases, of course.
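
(One nice property of the bitmap: a null test in the called function
becomes a single bit test, e.g. something like

if (nulls & (UINT64CONST(1) << argno))
    /* argument argno is NULL */

assuming argno < 64, of course.)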

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 3:28 AM, Andres Freund  wrote:
> I do think this has the potential for negative performance
> implications. Consider code like
> /* skip over the page header */
> if (CurrPos % XLogSegSize == 0)
> {
> CurrPos += SizeOfXLogLongPHD;
> currpos += SizeOfXLogLongPHD;
> }
> else
> right now that's doable in an efficient manner, because XLogSegSize is
> constant. If it's a variable and the compiler doesn't even know it's a
> power-of-two, it'll have to do a full "div" - and that's quite easily
> noticeable in a lot of cases.
>
> Now it could entirely be that the costs of this will be swamped by
> everything else, but I'd not want to rely on it.

We could use the GUC assign hook to compute a mask and a shift, so
that this could be written as (CurrPos & mask_variable) == 0.  That
would avoid the division instruction, though not the memory access.  I
hope this is all in the noise, though.  I know this is code is hot but
I think it'll be hard to construct a test case where the bottleneck is
anything other than the speed at which the disk can absorb bytes.  I
suppose we could set fsync=off and put the whole cluster on a RAMDISK
to avoid those bottlenecks, but of course no real storage system
behaves like that.
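
For what it's worth, the assign hook part would be trivial if we
validate the GUC value to be a power of two -- a sketch with invented
names:

static uint64 wal_segsize_mask;     /* == XLogSegSize - 1 */

static void
assign_wal_segment_size(int newval, void *extra)
{
    Assert((newval & (newval - 1)) == 0);   /* power of two */
    wal_segsize_mask = (uint64) newval - 1;
}

/* the test above then becomes: if ((CurrPos & wal_segsize_mask) == 0) */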

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] too low cost of Bitmap index scan

2016-12-20 Thread Pavel Stehule
2016-12-20 13:55 GMT+01:00 Robert Haas :

> On Tue, Dec 20, 2016 at 2:13 AM, Pavel Stehule 
> wrote:
> > 2016-12-19 23:28 GMT+01:00 Robert Haas :
> >> On Sat, Dec 17, 2016 at 3:30 AM, Pavel Stehule  >
> >> wrote:
> >> > ->  Bitmap Heap Scan on "Zasilka"  (cost=5097.39..5670.64 rows=1
> >> > width=12)
> >> > (actual time=62.253..62.400 rows=3 loops=231)
> >> ...
> >> > When I disable bitmap scan, then the query is 6x time faster
> >> 
> >> >->  Index Scan using "Zasilka_idx_Dopravce" on "Zasilka"
> >> > (cost=0.00..30489.04 rows=1 width=12) (actual time=15.651..17.187
> rows=3
> >> > loops=231)
> >> > Index Cond: ("Dopravce" = "Dopravce_Ridic_1"."ID")
> >> >Filter: (("StavDatum" > (now() - '10 days'::interval)) AND
> >> > (("Stav" =
> >> > 34) OR ("Stav" = 29) OR ("Stav" = 180) OR ("Stav" = 213) OR ("Stav" =
> >> > 46) OR
> >> > (("Produkt" = 33) AND ("Stav" = 179)) OR ((("ZpetnaZasilka" =
> >> > '-1'::integer)
> >> > OR ("PrimaZasilka" = '-1'::integer)) AND ("Stav" = 40
> >> > Rows Removed by Filter: 7596
> >>
> >> I'm not sure, but my guess would be that the query planner isn't
> >> getting a very accurate selectivity estimate for that giant filter
> >> condition, and that's why the cost estimate is off.
> >
> > maybe operator cost is too high?
>
> Hmm, seems like you'd be paying the operator cost either way.  No?
>

It looks like this cost is much more significant in the index scan case.

Pavel



>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 6:37 AM, Heikki Linnakangas  wrote:
> It's more convenient to carry the type information with the verifier itself,
> in backend code, in pg_dump, etc. Sure, you could have a separate "transfer"
> text format that has the prefix, and strip it out when the datum enters the
> system. But it is even simpler to have only one format, with the prefix, and
> use that everywhere.

I see your point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] too low cost of Bitmap index scan

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 2:13 AM, Pavel Stehule  wrote:
> 2016-12-19 23:28 GMT+01:00 Robert Haas :
>> On Sat, Dec 17, 2016 at 3:30 AM, Pavel Stehule 
>> wrote:
>> > ->  Bitmap Heap Scan on "Zasilka"  (cost=5097.39..5670.64 rows=1
>> > width=12)
>> > (actual time=62.253..62.400 rows=3 loops=231)
>> ...
>> > When I disable bitmap scan, then the query is 6x time faster
>> 
>> >->  Index Scan using "Zasilka_idx_Dopravce" on "Zasilka"
>> > (cost=0.00..30489.04 rows=1 width=12) (actual time=15.651..17.187 rows=3
>> > loops=231)
>> > Index Cond: ("Dopravce" = "Dopravce_Ridic_1"."ID")
>> >Filter: (("StavDatum" > (now() - '10 days'::interval)) AND
>> > (("Stav" =
>> > 34) OR ("Stav" = 29) OR ("Stav" = 180) OR ("Stav" = 213) OR ("Stav" =
>> > 46) OR
>> > (("Produkt" = 33) AND ("Stav" = 179)) OR ((("ZpetnaZasilka" =
>> > '-1'::integer)
>> > OR ("PrimaZasilka" = '-1'::integer)) AND ("Stav" = 40
>> > Rows Removed by Filter: 7596
>>
>> I'm not sure, but my guess would be that the query planner isn't
>> getting a very accurate selectivity estimate for that giant filter
>> condition, and that's why the cost estimate is off.
>
> maybe operator cost is too high?

Hmm, seems like you'd be paying the operator cost either way.  No?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Heikki Linnakangas

On 12/16/2016 03:31 AM, Michael Paquier wrote:

On Thu, Dec 15, 2016 at 9:48 PM, Heikki Linnakangas  wrote:

The only way to distinguish, is to know about every verifier kind there is,
and check whether rolpassword looks valid as anything else than a plaintext
password. And we already got tripped by a bug-of-omission on that once. If
we add more verifier formats in the future, it's bound to happen again.
Let's nip that source of bugs in the bud. Attached is a patch to implement
what I have in mind.


OK, I had a look at the patch proposed.

-    if (!pg_md5_encrypt(username, username, namelen, encrypted))
-        elog(ERROR, "password encryption failed");
-    if (strcmp(password, encrypted) == 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                 errmsg("password must not contain user name")));

This patch removes the only check on MD5 hashes that passwordcheck has
ever done. It may be fine to remove it, but I would think it is a good
example of what can be done with MD5 hashes, limited though that is. So
it seems to me that this check should also run pg_md5_encrypt on the
username and compare the result with the MD5 hash given by the caller.


Actually, it does still perform that check. There's a new function, 
plain_crypt_verify, that passwordcheck uses now. plain_crypt_verify() is 
intended to work with any hash formats we might introduce in the future 
(including SCRAM), so that passwordcheck doesn't need to know about all 
the hash formats.



A simple ALTER USER role PASSWORD 'foo' causes a crash:


Ah, fixed.


+        case PASSWORD_TYPE_PLAINTEXT:
+            shadow_pass = &shadow_pass[strlen("plain:")];
+            break;
It would be a good idea to have a generic routine able to get the plain
password value. In short, I think that we should reduce the number of
places where the "plain:" prefix is hardcoded.


There is such a function included in the patch, actually: 
get_plain_password(char *shadow_pass). contrib/passwordcheck uses it. I 
figured that in crypt.c itself it's OK to do the above directly, but 
get_plain_password() is intended to be used elsewhere.
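
For example, roughly (untested; whether it returns a palloc'd copy is
an assumption on my part):

/*
 * Sketch of using get_plain_password() outside crypt.c, assuming it
 * returns a palloc'd copy of the password with the "plain:" prefix
 * stripped.
 */
char   *plain = get_plain_password(shadow_pass);

/* ... apply whatever policy checks on the plaintext ... */

pfree(plain);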


Thanks for having a look! Attached is a new version, with that bug fixed.

- Heikki



0001-Use-plain-prefix-for-plaintext-passwords-stored-in-p-2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] invalid combination of options "-D - -F t -X stream" in pg_basebackup

2016-12-20 Thread Magnus Hagander
On Tue, Dec 20, 2016 at 6:56 AM, Fujii Masao  wrote:

> On Tue, Dec 20, 2016 at 1:43 AM, Magnus Hagander 
> wrote:
> >
> >
> > On Mon, Dec 19, 2016 at 5:39 PM, Fujii Masao 
> wrote:
> >>
> >> Hi,
> >>
> >> Isn't it better to forbid the combination of the options "-D -", "-F t"
> >> and
> >> "-X stream" in pg_basebackup? This is obviously invalid setting and the
> >> docs
> >> warns this as follows. But currently users can specify such setting and
> >> pg_basebackup can exit unexpectedly with an error.
> >>
> >> ---
> >> If the value - (dash) is specified as target directory, the tar contents
> >> will
> >> be written to standard output, suitable for piping to for example gzip.
> >> This is only possible if the cluster has no additional tablespaces.
> >> ---
> >
> >
> > Yes, definitely. I'd say it's an oversight in the implementation of
> > stream-to-tar support that this case was not detected.
> >
> > Do you want to provide a patch for it, or should I?
>
> What about the attached patch?
>
> +fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> +progname);
>
> I added the above hint message because other code paths that check
> invalid options print such hints too. But the help output contains no
> additional useful information about valid combinations of options, so
> I'm a bit tempted to remove the hint.
>

Looks good to me.
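
For reference, the guard could look something like this in
pg_basebackup.c (a sketch; the variable names are assumptions, not
necessarily those used in the patch):

/*
 * Sketch of the option check under discussion: refuse to write a tar
 * backup to stdout while also streaming WAL, since the streamed WAL
 * would need a second tar file.
 */
if (format == 't' && streamwal && strcmp(basedir, "-") == 0)
{
    fprintf(stderr,
            _("%s: cannot stream WAL in tar mode to stdout\n"),
            progname);
    fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
            progname);
    exit(1);
}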

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-20 Thread Heikki Linnakangas

On 12/16/2016 05:48 PM, Robert Haas wrote:

On Thu, Dec 15, 2016 at 8:40 AM, Stephen Frost  wrote:

* Heikki Linnakangas (hlinn...@iki.fi) wrote:

On 12/14/2016 04:57 PM, Stephen Frost wrote:

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:

On 12/14/16 5:15 AM, Michael Paquier wrote:

I would be tempted to suggest adding the verifier type as a new column
of pg_authid


Yes please.


This discussion seems to continue to come up and I don't entirely
understand why we keep trying to shove more things into pg_authid, or
worse, into rolpassword.


I understand the relational beauty of having a separate column for
the verifier type, but I don't think it would be practical.


I disagree.


Me, too.  I think the idea of moving everything into a separate table
that allows multiple verifiers is probably not a good thing to do just
right now, because that introduces a bunch of additional issues above
and beyond what we need to do to get SCRAM implemented.  There are
administration and policy decisions to be made there that we should
not conflate with SCRAM proper.

However, Heikki's proposal seems to be that it's reasonable to force
rolpassword to be of the form 'type:verifier' in all cases but not
reasonable to have separate columns for type and verifier.  Eh?


I fear we'll just have to agree to disagree here, but I'll try to 
explain myself one more time.


Even if you have a separate "verifier type" column, it's not fully 
normalized, because there's still a dependency between the verifier and 
verifier type columns. You will always need to look at the verifier type 
to make sense of the verifier itself.


It's more convenient to carry the type information with the verifier 
itself, in backend code, in pg_dump, etc. Sure, you could have a 
separate "transfer" text format that has the prefix, and strip it out 
when the datum enters the system. But it is even simpler to have only 
one format, with the prefix, and use that everywhere.
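
Concretely, any consumer can then dispatch on the prefix alone, along
these lines (a sketch; only PASSWORD_TYPE_PLAINTEXT appears in the
patch hunk quoted earlier, the other names are illustrative):

/*
 * Sketch of prefix-based dispatch on a rolpassword value.  An MD5
 * verifier is "md5" followed by 32 hex digits, hence the length-35
 * test.
 */
if (strncmp(shadow_pass, "plain:", strlen("plain:")) == 0)
    return PASSWORD_TYPE_PLAINTEXT;
if (strncmp(shadow_pass, "md5", 3) == 0 && strlen(shadow_pass) == 35)
    return PASSWORD_TYPE_MD5;
return PASSWORD_TYPE_UNKNOWN;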


It might make sense to add a separate column, e.g. to make it easier to 
query for users that have an MD5 verifier. You could do "WHERE 
rolverifiertype = 'md5'", instead of "WHERE rolpassword LIKE 'md5%'". 
It's not a big difference, though. But even if we did that, I would 
still love to have the type information *also* included with the 
verifier itself, for convenience. And if we include it in the verifier 
itself, adding a separate type column seems more trouble than it's worth.


For comparison, imagine that we added a column to pg_authid for a 
picture of the user, stored as a bytea. The picture can be in JPEG or 
PNG format. Looking at the first few bytes of the image, you can tell 
which one it is. Would it make sense to add a separate "type" column, to 
tell what format the image is in? I think it would be more convenient 
and robust to rely on the first bytes of the image data instead.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

2016-12-20 Thread Fabien COELHO


Hello again pgdevs,

Further investigation shows that the information is lost by the parser.

From a ;-separated string, raw_parser returns a List whose elements are 
directly the statement nodes, with empty statements silently skipped.


The Node entry of high-level statements (*Stmt) does NOT contain any
location information; only some lower-level nodes have it.

A possible fix would be to systematically add the query string location 
information at the head of every high-level *Stmt, so that all of them 
would share both the NodeTag and this information (say, location and 
length). This amounts to a new intermediate kind of node from which all 
these statements would "inherit", say a "ParseNode" structure.
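
A minimal sketch of what that could look like (field names are
illustrative, not a worked-out proposal):

/*
 * Sketch: a common header that every high-level *Stmt would start
 * with, so generic code can recover the statement's position in the
 * original query string.
 */
typedef struct ParseNode
{
    NodeTag     type;       /* the statement's usual node tag */
    int         location;   /* offset of the statement in the query string */
    int         length;     /* length of the statement's text, in bytes */
} ParseNode;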


The actual location information may be filled in at quite a high level in 
the parser, maybe around the "stmtmulti" and "stmt" rules, so the change 
would not necessarily be too intrusive in the overall grammar.


Then, once the information is available, the query string can be cut 
where appropriate so that only the relevant substring is stored in 
pg_stat_statements or above.
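
For instance (a sketch, assuming the ParseNode fields above):

/*
 * Sketch: recover one statement's text from the full query string;
 * parse_node and query_string are illustrative names.
 */
char *stmt_text = pnstrdup(query_string + parse_node->location,
                           parse_node->length);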


Would this approach be acceptable, or is modifying Nodes a no-go area?

If it is acceptable, I can probably put together a patch and submit it.

If not, I suggest updating the documentation to say that 
pg_stat_statements does not work properly with combined queries.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

