Re: [HACKERS] why not parallel seq scan for slow functions

2017-07-10 Thread Dilip Kumar
On Tue, Jul 11, 2017 at 9:02 AM, Jeff Janes  wrote:
> If I have a slow function which is evaluated in a simple seq scan, I do not
> get parallel execution, even though it would be massively useful.  Unless
> force_parallel_mode=ON, then I get a dummy parallel plan with one worker.
>
> explain select aid,slow(abalance) from pgbench_accounts;

After analysing this, I see multiple reasons why the parallel plan is
not getting selected:

1. The query selects all the tuples, so the benefit we get from
parallelism comes from dividing cpu_tuple_cost (0.01) across workers,
but each tuple sent from a worker to the Gather node incurs
parallel_tuple_cost (0.1).  (That per-tuple transfer cost is much
smaller in the aggregate case, since only partial results are sent.)
Maybe you can try selecting with some condition,

like below:
postgres=# explain select slow(abalance) from pgbench_accounts where
abalance > 1;
QUERY PLAN
---
 Gather  (cost=0.00..46602.33 rows=1 width=4)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..46602.33
rows=1 width=4)
 Filter: (abalance > 1)

2. The second problem I am seeing (maybe a code problem) is that the
"slow" function is very costly (COST 1000) and apply_projection_to_path
accounts for this cost.  But I have noticed that for the Gather node we
also add this cost for all the rows, whereas if the lower node is
already doing the projection, the Gather node just needs to send the
tuples out instead of actually applying the projection.

In the function below, we always multiply target->cost.per_tuple with
path->rows, but in the case of Gather it should be multiplied with
subpath->rows:

apply_projection_to_path()


path->startup_cost += target->cost.startup - oldcost.startup;
path->total_cost += target->cost.startup - oldcost.startup +
(target->cost.per_tuple - oldcost.per_tuple) * path->rows;


So because of this high projection cost, the seq scan path and the
parallel path have fuzzily the same cost, and the seq scan path is
winning because it's parallel safe.

>
> CREATE OR REPLACE FUNCTION slow(integer)
>  RETURNS integer
>  LANGUAGE plperl
>  IMMUTABLE PARALLEL SAFE STRICT COST 1000
> AS $function$
>   my $thing=$_[0];
>   foreach (1..1_000_000) {
> $thing = sqrt($thing);
> $thing *= $thing;
>   };
>   return ($thing+0);
> $function$;
>
> The partial path is getting added to the list of paths, it is just not
> getting chosen, even if parallel_*_cost are set to zero.  Why not?
>
> If I do an aggregate, then it does use parallel workers:
>
> explain select sum(slow(abalance)) from pgbench_accounts;
>
> It doesn't use as many as I would like, because there is a limit based on
> the logarithm of the table size (I'm using -s 10 and get 3 parallel
> processes), but at least I know how to start looking into that.
>
> Also, how do you debug stuff like this?  Are there some gdb tricks to make
> this easier to introspect into the plans?
>
> Cheers,
>
> Jeff


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Jul 10, 2017 at 2:15 AM, Amit Langote
>  wrote:

> > Actually, if \d had shown RELKIND_PARTITIONED_TABLE tables as of Type
> > "partitioned table", we wouldn't need a separate flag for marking a table
> > as having partitions.
> 
> I think that is false.  Whether something is partitioned and whether
> it is a partition are independent concerns.

Maybe this discussion is easier if we differentiate "list tables" (\dt,
or \d without a pattern) from "describe table" (\d with a name pattern).

It seems to me that the "describe" command should list partitions --
perhaps only when the + flag is given.  However, the "list tables"
command \dt should definitely IMO not list partitions.  Maybe \dt should
have some flag indicating whether each table is partitioned.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] why not parallel seq scan for slow functions

2017-07-10 Thread Jeff Janes
If I have a slow function which is evaluated in a simple seq scan, I do not
get parallel execution, even though it would be massively useful.  Unless
force_parallel_mode=ON, then I get a dummy parallel plan with one worker.

explain select aid,slow(abalance) from pgbench_accounts;

CREATE OR REPLACE FUNCTION slow(integer)
 RETURNS integer
 LANGUAGE plperl
 IMMUTABLE PARALLEL SAFE STRICT COST 1000
AS $function$
  my $thing=$_[0];
  foreach (1..1_000_000) {
$thing = sqrt($thing);
$thing *= $thing;
  };
  return ($thing+0);
$function$;

The partial path is getting added to the list of paths, it is just not
getting chosen, even if parallel_*_cost are set to zero.  Why not?

If I do an aggregate, then it does use parallel workers:

explain select sum(slow(abalance)) from pgbench_accounts;

It doesn't use as many as I would like, because there is a limit based on
the logarithm of the table size (I'm using -s 10 and get 3 parallel
processes), but at least I know how to start looking into that.

Also, how do you debug stuff like this?  Are there some gdb tricks to make
this easier to introspect into the plans?

Cheers,

Jeff


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-10 Thread Alvaro Herrera
Amit Kapila wrote:
> On Tue, Jul 11, 2017 at 6:51 AM, AP  wrote:
> > On Fri, Jul 07, 2017 at 05:58:25PM +0530, Amit Kapila wrote:

> >> I can understand your concerns.  To address the first concern we need
> >> to work on one or more of the following work items: (a) work on vacuums
> >> that can be triggered on an insert-only workload (it should perform
> >> index vacuum as well); (b) a separate utility statement/function to
> >> squeeze the hash index; (c) the db internally does the squeezing, e.g.
> >> after each split, so that the chances of such a problem are reduced,
> >> but that will be at the cost of a performance reduction in other
> >> workloads, so I'm not sure it is advisable.  Among these, (b) is the
> >> simplest to do but may not be convenient for the user.
> >
> > (a) seems like a good compromise on (c) if it can be done without disruption
> > and in time.
> > (b) seems analogous to the path autovacuum took. Unless I misremember,
> > before autovacuum we had a cronjob to do similar work. It's probably a
> > sane path to take as a first step on the way to (a)
> > (c) may not be worth the effort if it compromises general use, though
> > perhaps it could be used to indicate to (a) that now is a good time to
> > handle this bit?
> 
> Nice summarization!  I think before doing anything of that sort we
> need opinions from others as well.  If some other community members
> also see value in doing one or multiple of above things, then I can
> write a patch.

I haven't read the thread, but in late PG10 autovacuum gained the idea
of "work items" that can be plugged from other parts of the server;
currently BRIN uses it to cause a range to be summarized right after the
next one starts being filled.  This is activated separately for each
index via a reloption.  Perhaps something like that can be used for hash
indexes?  See commit 7526e10224f0792201e99631567bbe44492bbde4.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-10 Thread Amit Kapila
On Tue, Jul 11, 2017 at 6:51 AM, AP  wrote:
> On Fri, Jul 07, 2017 at 05:58:25PM +0530, Amit Kapila wrote:
>> On Fri, Jul 7, 2017 at 8:22 AM, AP  wrote:
>> > On Thu, Jul 06, 2017 at 05:19:59PM +0530, Amit Kapila wrote:
>> >> I think if you are still in development, it is always advisable to create
>> >> indexes after the initial bulk load.  That way it will be faster and will
>> >> take less space, at least in the case of a hash index.
>> >
>> > This is a bit of a pickle, actually:
>> > * if I do have a hash index I'll wind up with a bloated one at some stage
>> >   that refused to allow more inserts until the index is re-created
>> > * if I don't have an index then I'll wind up with a table where I cannot
>> >   create a hash index because it has too many rows for it to handle
>> >
>> > I'm at a bit of a loss as to how to deal with this.
>>
>> I can understand your concerns.  To address the first concern we need
>> to work on one or more of the following work items: (a) work on vacuums
>> that can be triggered on an insert-only workload (it should perform
>> index vacuum as well); (b) a separate utility statement/function to
>> squeeze the hash index; (c) the db internally does the squeezing, e.g.
>> after each split, so that the chances of such a problem are reduced,
>> but that will be at the cost of a performance reduction in other
>> workloads, so I'm not sure it is advisable.  Among these, (b) is the
>> simplest to do but may not be convenient for the user.
>
> (a) seems like a good compromise on (c) if it can be done without disruption
> and in time.
> (b) seems analogous to the path autovacuum took. Unless I misremember, before
> autovacuum we had a cronjob to do similar work. It's probably a sane path
> to take as a first step on the way to (a)
> (c) may not be worth the effort if it compromises general use, though perhaps
> it could be used to indicate to (a) that now is a good time to handle
> this bit?
>

Nice summarization!  I think before doing anything of that sort we
need opinions from others as well.  If some other community members
also see value in doing one or more of the above things, then I can
write a patch.

>> To address your second concern, we need to speed up the creation of
>> hash indexes, which is a relatively big project.  Having said that, I
>> think in your case this is a one-time operation, so spending some more
>> time might be okay.
>
> Yup. Primarily I just wanted the idea out there that this isn't that easy
> to cope with manually and to get it onto a todo list (unless it was an
> easy thing to do given a bit of thought but it appears not).
>
> Out of curiosity, and apologies if you explained it already and I missed
> the significance of the words, how does this bloat happen?
>

You might want to read src/backend/access/hash/README.  During a split
operation, we copy tuples from the old bucket (the bucket being split)
to the new bucket (the bucket being populated), and once all the tuples
are copied and no scan remains that started on the buckets involved
before the split, we remove the tuples from the old bucket.  Now, since
we might need to wait for those scans to finish, we prefer to perform
the removal during vacuum or during the next split from that bucket.
Until the tuples are removed from the old bucket, there will be some
bloat in the system.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Amit Langote
On 2017/07/11 10:34, Paul A Jungwirth wrote:
>> Also, there seems to be at least some preference
>> for excluding partitions by default from the \d listing.
> 
> As another user of partitions I'll chime in and say that would be very
> nice! On the other hand, with pre-10 partitions you do see all the
> child tables with `\d`, so showing declarative partitions seems more
> consistent with the old functionality.

That's one way of looking at it. :)

> On the third hand with pre-10 partitions I can put the child tables
> into a separate schema to de-clutter `\d`, but I don't see any way to
> do that with declarative partitions. Since there is no workaround it
> makes it a bit more important for partitions not to be so noisy.

You can do that with declarative partitions:

create table foo (a int) partition by list (a);

create schema foo_parts;
create table foo_parts.foo1 partition of foo for values in (1);
create table foo_parts.foo2 partition of foo for values in (2);

\d
   List of relations
 Schema | Name | Type  | Owner
+--+---+---
 public | foo  | table | amit
(1 row)

set search_path to foo_parts;

\d
List of relations
  Schema   | Name | Type  | Owner
---+--+---+---
 foo_parts | foo1 | table | amit
 foo_parts | foo2 | table | amit
(2 rows)

Thanks,
Amit





Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-07-10 Thread Peter Geoghegan
On Fri, Jun 9, 2017 at 11:09 AM, Robert Haas  wrote:
>> Isn't that what strxfrm() is?
>
> Yeah, just with bugs.  If ICU has a non-buggy equivalent, then we can
> make this work.

I agree that it probably isn't worth using strxfrm() again, simply
because the glibc implementation is buggy, and glibc as a project is
not at all concerned about how badly that would affect PostgreSQL.

I would like to point out on this thread that the strcmp() tie-breaker
is also a big blocker to implementing normalized keys in B-Tree
indexes (at least, if you want to get them for collated text, which I
think you really need to make the implementation effort worth it).
This is something that is discussed in a section on the normalized
keys wiki page I created recently [1].

[1] 
https://wiki.postgresql.org/wiki/Key_normalization#ICU.2C_text_equality_semantics.2C_and_hashing
-- 
Peter Geoghegan




Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Paul A Jungwirth
> Also, there seems to be at least some preference
> for excluding partitions by default from the \d listing.

As another user of partitions I'll chime in and say that would be very
nice! On the other hand, with pre-10 partitions you do see all the
child tables with `\d`, so showing declarative partitions seems more
consistent with the old functionality.

On the third hand with pre-10 partitions I can put the child tables
into a separate schema to de-clutter `\d`, but I don't see any way to
do that with declarative partitions. Since there is no workaround it
makes it a bit more important for partitions not to be so noisy.

Paul




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-10 Thread Kyotaro HORIGUCHI
At Mon, 10 Jul 2017 18:37:34 +0530, Amit Kapila  wrote 
in 

Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Amit Langote
On 2017/07/11 7:33, Robert Haas wrote:
> On Mon, Jul 10, 2017 at 2:15 AM, Amit Langote  
> wrote:
>> Actually, if \d had shown RELKIND_PARTITIONED_TABLE tables as of Type
>> "partitioned table", we wouldn't need a separate flag for marking a table
>> as having partitions.
> 
> I think that is false.  Whether something is partitioned and whether
> it is a partition are independent concerns.

I meant to speak of RELKIND_PARTITIONED_TABLE tables as having
partitions (although such a table could itself be a partition too).  If,
based on the relkind, we had shown their type as "partitioned table"
(not just "table"), then we wouldn't need a separate flag/column in the
\d output to distinguish partitioned tables from regular tables, as
Craig seemed to be proposing.

Since we are going the route of showing relispartition = true relations
as a different type in the \d listing (as "partition"/"foreign
partition"), we might as well go ahead and spell
RELKIND_PARTITIONED_TABLE tables as "partitioned table".  But I'm
afraid that would be a much bigger change if we don't want to restrict
this terminology change to the \d listing; error messages, for
instance, don't bother to distinguish "partitions" (relispartition =
true) from "partitioned tables" (RELKIND_PARTITIONED_TABLE).

Thanks,
Amit





Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-10 Thread Masahiko Sawada
On Mon, Jul 10, 2017 at 11:56 AM, Michael Paquier
 wrote:
> On Sat, Jul 8, 2017 at 12:50 AM, Masahiko Sawada  
> wrote:
>> On Fri, Jul 7, 2017 at 3:48 PM, Michael Paquier
>>  wrote:
>>> So I would suggest the following things to address this issue:
>>> 1) Generate a backup history file for backups taken at recovery as well.
>>> 2) Archive it if archive_mode = always.
>>> 3) Wait for both the segment of the stop point and the backup history
>>> files to be archived before returning back to the caller of
>>> pg_stop_backup.
>>>
>>> It would be nice to get all that addressed in 10. Thoughts?
>>
>> Yeah, I agree. Attached patch makes it work and deals with the
>> history file issue.
>
> I had a look at the proposed patch. Here are some comments.

Thank you for reviewing the patch!

>
> @@ -11002,10 +11000,10 @@ do_pg_stop_backup(char *labelfile, bool
> waitforarchive, TimeLineID *stoptli_p)
> if (waitforarchive && XLogArchivingActive())
> {
> XLByteToPrevSeg(stoppoint, _logSegNo);
> -   XLogFileName(lastxlogfilename, ThisTimeLineID, _logSegNo);
> +   XLogFileName(lastxlogfilename, stoptli, _logSegNo);
>
> On a standby the wait phase should not happen if archive_mode = on,
> but only if archive_mode = always. So I would suggest to change this
> upper condition a bit, and shuffle a bit the code to make the wait
> phase happen last:
> 1) stoptli_p first.
> 2) Check for XLogArchivingActive or XLogArchivingAlways, then with the
> NOTICE message.
> 3) Do the actual wait.

Thank you for the suggestion. Fixed.

> This way the code doing the wait does not need to be in its long
> lengthy if() branch. I think that we should replace the pg_usleep()
> call with a latch to make this more responsive. That should be a
> future patch.

Yeah, I agree.

> In backup history files generated in standbys, the STOP TIME is not
> set and this results in garbage in the file.

Fixed.

> +If second parameter is true and on standby, pg_stop_backup
> +waits for WAL to be archived without forcibly switching WAL on standby.
> +So enforcing manually a WAL switch on primary needs to happen.
> Here is a reformulation:
> If the second parameter wait_for_archive is true and the backup is
> taken on a standby, pg_stop_backup waits for WAL to be archived when
> archive_mode = always. Manually enforcing a WAL segment switch, for
> example with pg_switch_wal(), may be necessary if the primary has low
> activity, to allow the backup to complete. Using statement_timeout to
> limit the amount of time to wait, or setting wait_for_archive to
> false, will control the wait time, though all the WAL segments
> necessary to recover the backup into a consistent state may not yet
> be archived at the time pg_stop_backup returns its status to the
> caller.

Thanks, fixed.

> The errhint() for the wait phase should be reworked for a standby. I
> would suggest for errmsg() the same thing, aka:
> pg_stop_backup still waiting for all required WAL segments to be
> archived (%d seconds elapsed)
> But the following for a backup started in recovery. That's long but
> things need to be really clear to the user:
> Backup has been taken from a standby; check if the WAL segments needed
> for this backup have been completed, in which case a forced segment
> switch may be needed on the primary. If a segment switch has
> already happened, check that your archive_command is executing
> properly. pg_stop_backup can be canceled safely, but the database
> backup will not be usable without all the WAL segments.

Fixed.

Attached updated version patch. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pg_stop_backup_on_standby_v3.patch
Description: Binary data



Re: [HACKERS] [COMMITTERS] pgsql: Allow multiple hostaddrs to go with multiple hostnames.

2017-07-10 Thread Masahiko Sawada
On Mon, Jul 10, 2017 at 10:01 PM, Heikki Linnakangas  wrote:
> On 07/10/2017 01:27 PM, Masahiko Sawada wrote:
>>
>> This commit seems be cause of the documentation compilation error. I
>> think  is missing.
>>
>> ...
>>
>> Attached small patch fixes this.
>
>
> Thanks, committed!
>
> Strangely, it worked on my system, despite that clear mistake. Looks like
> the 'osx' tool is more strict than what I had installed. I have now also
> installed opensp, so won't make this mistake again.
>

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] More race conditions in logical replication

2017-07-10 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I'll next update this on or before Monday 10th at 19:00 CLT.

I couldn't get to this today as I wanted.  Next update on Wednesday
12th, same time.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Peter Geoghegan
On Mon, Jul 10, 2017 at 4:08 PM, Greg Stark  wrote:
> One thing I would like to see is features like this added to the
> opclasses (or opfamilies?) using standard PG functions that return
> standard PG data types. So if each opclass had a function that took
> the data type in question and returned a bytea then you could
> implement that function using a language you felt like (in theory),
> test it using standard SQL, and possibly even find other uses for it.

That seems like a good goal.

> That kind of abstraction would be more promising for the future than
> having yet another C api that is used for precisely one purpose and
> whose definition is "provide the data needed for this usage".

Perhaps this is obvious, but the advantage of flattening everything
into one universal representation is that it's a very simple API for
opclass authors, that puts control into the hands of the core nbtree
code, where it belongs. For text, say, you can generate fictitious
minimal separator keys with suffix truncation that really are as
short as possible, down to the level of individual bits. If you tried
to do this with the original text representation, you'd have to worry
about the fact that it's probably not even going to be valid UTF-8,
and that encoding-aware truncation is needed, etc. You'd definitely
have to "double check" that the truncated key was greater than the
left half and less than the right half, just in case you didn't end up
with a valid separator due to the vagaries of the collation rules.

That's the kind of complexity that scales poorly, because the
complexity cannot be isolated.

-- 
Peter Geoghegan




Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Greg Stark
On 10 July 2017 at 23:46, David Fetter  wrote:
> On Mon, Jul 10, 2017 at 05:33:34PM -0500, Robert Haas wrote:
>> On Mon, Jul 10, 2017 at 2:15 AM, Amit Langote
>>  wrote:
>> > I posted a patch upthread which makes \d hide partitions
>> > (relispartition = true relations) and include them if the newly
>> > proposed '!' modifier is specified.  The '+' modifier is being
>> > used to show additional detail of relations chosen to be listed at
>> > all, so it seemed like a bad idea to extend its meaning to also
>> > dictate whether partitions are to be listed.
>>
>> +1.  That'd be a mess.
>
> With utmost respect, it's less messy than adding '!' to the already
> way too random and mysterious syntax of psql's \ commands.  What
> should '\det!' mean?  What about '\dT!'?

Fwiw, as (I believe) the first person to make this complaint, I would
be fine with + listing all partitions. Imho adding an orthogonal "!"
would be too much mental overhead for the user.

If you want something different, perhaps we can invent ++ for "even
more information" and list partitions only if two plusses are
provided.  (I don't think the other way around makes sense: you might
need a way to list permissions and comments without listing every
partition on a system with an unmanageable number of partitions, but
you never absolutely need a way to list partitions without the
comments and permissions.)  At least that doesn't require the user to
learn a new flag and how it interacts with everything else.

-- 
greg




Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Greg Stark
On 10 July 2017 at 19:40, Peter Geoghegan  wrote:
> Key normalization means creating a representation for internal page
> items that we always just memcmp(), regardless of the details of the
> underlying datatypes.

One thing I would like to see is features like this added to the
opclasses (or opfamilies?) using standard PG functions that return
standard PG data types. So if each opclass had a function that took
the data type in question and returned a bytea then you could
implement that function using a language you felt like (in theory),
test it using standard SQL, and possibly even find other uses for it.
That kind of abstraction would be more promising for the future than
having yet another C api that is used for precisely one purpose and
whose definition is "provide the data needed for this usage".

-- 
greg




Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread David Fetter
On Mon, Jul 10, 2017 at 05:33:34PM -0500, Robert Haas wrote:
> On Mon, Jul 10, 2017 at 2:15 AM, Amit Langote
>  wrote:
> > I posted a patch upthread which makes \d hide partitions
> > (relispartition = true relations) and include them if the newly
> > proposed '!' modifier is specified.  The '+' modifier is being
> > used to show additional detail of relations chosen to be listed at
> > all, so it seemed like a bad idea to extend its meaning to also
> > dictate whether partitions are to be listed.
> 
> +1.  That'd be a mess.

With utmost respect, it's less messy than adding '!' to the already
way too random and mysterious syntax of psql's \ commands.  What
should '\det!' mean?  What about '\dT!'?

> > Actually, if \d had shown RELKIND_PARTITIONED_TABLE tables as of
> > Type "partitioned table", we wouldn't need a separate flag for
> > marking a table as having partitions.
> 
> I think that is false.  Whether something is partitioned and whether
> it is a partition are independent concerns.

So whatever we land on needs to mention partition_of and
has_partitions.  Is that latter just its immediate partitions?
Recursion all the way down?  Somewhere in between?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Robert Haas
On Mon, Jul 10, 2017 at 2:15 AM, Amit Langote
 wrote:
> I posted a patch upthread which makes \d hide partitions (relispartition =
> true relations) and include them if the newly proposed '!' modifier is
> specified.  The '+' modifier is being used to show additional detail of
> relations chosen to be listed at all, so it seemed like a bad idea to
> extend its meaning to also dictate whether partitions are to be listed.

+1.  That'd be a mess.

> Actually, if \d had shown RELKIND_PARTITIONED_TABLE tables as of Type
> "partitioned table", we wouldn't need a separate flag for marking a table
> as having partitions.

I think that is false.  Whether something is partitioned and whether
it is a partition are independent concerns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add support for tuple routing to foreign partitions

2017-07-10 Thread Robert Haas
On Thu, Jun 29, 2017 at 6:20 AM, Etsuro Fujita
 wrote:
> So, I dropped the COPY part.

Ouch.  I think we should try to figure out how the COPY part will be
handled before we commit to a design.

I have to admit that I'm a little bit fuzzy about why foreign insert
routing requires all of these changes.  I think this patch would
benefit from being accompanied by several paragraphs of explanation
outlining the rationale for each part of the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Re: Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-10 Thread Robert Haas
On Sun, Jul 9, 2017 at 3:06 AM, Noah Misch  wrote:
> On Fri, Jul 07, 2017 at 06:47:26PM +0900, Amit Langote wrote:
>> On 2017/07/06 16:06, Etsuro Fujita wrote:
>> > Looks odd to me because the error message doesn't show any DETAIL info;
>> > since the CTE query, which produces the message, is the same as the above
>> > query, the message should also be the same as the one for the above
>> > query.
>>
>> I agree that the DETAIL should be shown.
>
>> The patch keeps tests that you added in your patch.  Added this to the
>> open items list.
>
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.

The posted patches look OK to me.  Barring developments, I will commit
them on 2017-07-17, or send another update by then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Peter Geoghegan
On Mon, Jul 10, 2017 at 12:53 PM, Claudio Freire  wrote:
> I do have a patch that attacks suffix truncation, heap tid unification
> and prefix compression all at once.

That's great! I'll certainly be able to review it.

> It's on a hiatus ATM, but, as you say, the implementations are highly
> correlated, so attacking them at once makes a lot of sense. Or, at
> least, attacking one with the other in the back of your mind.
>
> Key normalization would simplify prefix compression considerably, for 
> instance.

The stuff that I go into detail about is the stuff I think you'd need
to do up front. Prefix compression would be a "nice to have", but I
think you'd specifically do key normalization, suffix truncation, and
adding the heap TID to the key space all up front.
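With normalized (memcmp-comparable) keys, suffix truncation itself reduces to a byte count: a separator stored in an internal page only needs the bytes of the right-hand key up to and including the first byte that distinguishes it from the left-hand key. A minimal standalone sketch of that idea (invented names, not nbtree code; it assumes the two keys are distinct, which holds once heap TID is part of the key space):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given two normalized keys l < r in memcmp order, return how many bytes
 * of r are needed as a separator.  The resulting prefix p of r satisfies
 * l < p <= r, which is all an internal page item has to guarantee.
 */
static size_t
separator_len(const unsigned char *l, size_t llen,
              const unsigned char *r, size_t rlen)
{
    size_t      i;
    size_t      n = (llen < rlen) ? llen : rlen;

    for (i = 0; i < n; i++)
    {
        if (l[i] != r[i])
            return i + 1;       /* first differing byte already separates */
    }
    return llen + 1;            /* l is a proper prefix of r */
}
```

For example, between "abc" and "abd" the separator needs all 3 bytes, but between "a" and "b" a single byte suffices; everything past that point is the truncatable suffix.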

> A missing optimization is that having tid unification allows VACUUM to
> implement a different strategy when it needs to clean up only a tiny
> fraction of the index. It can do the lookup by key-tid instead of
> scanning the whole index, which can be a win if the index is large and
> the number of index pointers to kill is small.

I do mention that on the Wiki page. I also describe ways that these
techniques have some non-obvious benefits for VACUUM B-Tree page
deletion, which you should get "for free" once you do the 3 things I
mentioned. A lot of the benefits for VACUUM are seen in the worst
case, which is when they're really needed.

-- 
Peter Geoghegan


Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Peter Geoghegan
On Mon, Jul 10, 2017 at 1:23 PM, Alvaro Herrera
 wrote:
> Claudio Freire wrote:
>
>> A missing optimization is that having tid unification allows VACUUM to
>> implement a different strategy when it needs to clean up only a tiny
>> fraction of the index. It can do the lookup by key-tid instead of
>> scanning the whole index, which can be a win if the index is large and
>> the number of index pointers to kill is small.
>
> Doing index cleanup by using keys instead of scanning the whole index
> would be a *huge* win for many use cases.  I don't think that work needs
> to be related to either of these patches.

The problem is that it will perform horribly when there are many
duplicates, unless you really can treat heap TID as a part of the key.
Once you do that, you can be almost certain that you won't have to
visit more than one leaf page per index tuple to kill. And, you can
amortize descending the index among a set of duplicate keyed tuples.
You arrive at the leaf tuple with the lowest sort order, based on
(key, TID), kill that, and then continue an index scan along the leaf
level. VACUUM ticks them off a private "kill list" which is also
ordered by (key, TID). Much less random I/O, and pretty good
guarantees about the worst case.
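The amortization described above can be sketched as a single ordered merge: because both the leaf level and VACUUM's kill list are sorted by (key, TID), each side is walked exactly once, with no per-victim descent from the root. A toy standalone illustration (a sorted array stands in for the leaf level; names are invented, this is not actual nbtree code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy index tuple: key plus heap TID acting as tie-breaker. */
typedef struct
{
    int32_t     key;
    uint32_t    tid;
    int         dead;           /* set when VACUUM kills the tuple */
} ToyIndexTuple;

static int
tuple_cmp(int32_t ak, uint32_t at, int32_t bk, uint32_t bt)
{
    if (ak != bk)
        return (ak < bk) ? -1 : 1;
    if (at != bt)
        return (at < bt) ? -1 : 1;
    return 0;
}

/*
 * One ordered pass over the "leaf level": both the index and the kill
 * list are sorted by (key, TID), so each side is consumed exactly once.
 * Returns the number of tuples killed.
 */
static size_t
kill_by_key_tid(ToyIndexTuple *leaf, size_t nleaf,
                const ToyIndexTuple *kill, size_t nkill)
{
    size_t      killed = 0, i = 0, k = 0;

    while (i < nleaf && k < nkill)
    {
        int c = tuple_cmp(leaf[i].key, leaf[i].tid,
                          kill[k].key, kill[k].tid);

        if (c < 0)
            i++;                /* leaf tuple not on the kill list */
        else if (c > 0)
            k++;                /* victim already gone; skip it */
        else
        {
            leaf[i++].dead = 1; /* exact (key, TID) match: kill it */
            k++;
            killed++;
        }
    }
    return killed;
}
```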

-- 
Peter Geoghegan


Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Alvaro Herrera
Claudio Freire wrote:

> A missing optimization is that having tid unification allows VACUUM to
> implement a different strategy when it needs to clean up only a tiny
> fraction of the index. It can do the lookup by key-tid instead of
> scanning the whole index, which can be a win if the index is large and
> the number of index pointers to kill is small.

Doing index cleanup by using keys instead of scanning the whole index
would be a *huge* win for many use cases.  I don't think that work needs
to be related to either of these patches.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Claudio Freire
On Mon, Jul 10, 2017 at 3:40 PM, Peter Geoghegan  wrote:
> It might appear excessive to talk about several different techniques
> in one place, but that seemed like the best way to me, because there
> are subtle dependencies. If most of the optimizations are pursued as a
> project all at once (say, key normalization, suffix truncation, and
> treating heap TID as a unique-ifier), that may actually be more likely
> to succeed than a project to do just one. The techniques don't appear
> to be related at first, but they really are.

I do have a patch that attacks suffix truncation, heap tid unification
and prefix compression all at once.

It's on a hiatus ATM, but, as you say, the implementations are highly
correlated, so attacking them at once makes a lot of sense. Or, at
least, attacking one with the other in the back of your mind.

Key normalization would simplify prefix compression considerably, for instance.

A missing optimization is that having tid unification allows VACUUM to
implement a different strategy when it needs to clean up only a tiny
fraction of the index. It can do the lookup by key-tid instead of
scanning the whole index, which can be a win if the index is large and
the number of index pointers to kill is small.


[HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Peter Geoghegan
I've created a new Wiki page that describes a scheme for normalizing
internal page items within B-Tree indexes, and the many optimizations
that this can enable:

https://wiki.postgresql.org/wiki/Key_normalization

Key normalization means creating a representation for internal page
items that we always just memcmp(), regardless of the details of the
underlying datatypes.
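As a concrete illustration of the memcmp() property for a single int32 attribute (a standalone sketch, not from the wiki page or from PostgreSQL itself): encoding the value big-endian with the sign bit flipped makes byte-wise comparison agree with numeric comparison. A real design would also have to cover multiple attributes, NULLs, and collated text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode a signed 32-bit key so that plain memcmp() on the encoded
 * bytes agrees with numeric order: big-endian byte order plus a
 * flipped sign bit maps INT32_MIN..INT32_MAX onto an unsigned range
 * that sorts correctly byte by byte.
 */
static void
normalize_int32(int32_t key, unsigned char out[4])
{
    uint32_t    u = (uint32_t) key ^ 0x80000000u;   /* flip the sign bit */

    out[0] = (unsigned char) (u >> 24);             /* most significant first */
    out[1] = (unsigned char) (u >> 16);
    out[2] = (unsigned char) (u >> 8);
    out[3] = (unsigned char) u;
}

/* Compare two int32 keys through their normalized images. */
static int
normalized_cmp(int32_t a, int32_t b)
{
    unsigned char na[4], nb[4];

    normalize_int32(a, na);
    normalize_int32(b, nb);
    return memcmp(na, nb, sizeof(na));
}
```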

My intent in creating this wiki page is to document these techniques
centrally, as well as the problems that they may solve, and to show
how they're all interrelated. It might be that confusion about how one
optimization enables another holds back patch authors.

It might appear excessive to talk about several different techniques
in one place, but that seemed like the best way to me, because there
are subtle dependencies. If most of the optimizations are pursued as a
project all at once (say, key normalization, suffix truncation, and
treating heap TID as a unique-ifier), that may actually be more likely
to succeed than a project to do just one. The techniques don't appear
to be related at first, but they really are.

I'm not planning on working on key normalization or any of the other
techniques as projects myself, but FWIW I have produced minimal
prototypes of a few of the techniques over the past several years,
just to verify my understanding. My theories on this topic seem worth
writing down. I'm happy to explain or clarify any aspect of what I
describe, and to revise the design based on feedback. It is still very
much a work in progress.

-- 
Peter Geoghegan


Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread David Fetter
On Mon, Jul 10, 2017 at 04:15:28PM +0900, Amit Langote wrote:
> On 2017/07/10 15:32, Craig Ringer wrote:
> > On 8 July 2017 at 00:03, David Fetter  wrote:
> > 
> >> On Fri, Jul 07, 2017 at 10:29:26AM +0900, Amit Langote wrote:
> >>> Hi Mark,
> >>>
> >>> On 2017/07/07 9:02, Mark Kirkwood wrote:
>  I've been trying out the new partitioning in version 10. Firstly, I
> >> must
>  say this is excellent - so much nicer than the old inheritance based
> >> method!
> >>>
> >>> Thanks. :)
> >>>
>  My only niggle is the display of partitioned tables via \d etc. e.g:
> 
>  part=# \d
>             List of relations
>   Schema |         Name         | Type  |  Owner
>  --------+----------------------+-------+----------
>   public | date_fact            | table | postgres
>   public | date_fact_201705     | table | postgres
>   public | date_fact_201706     | table | postgres
>   public | date_fact_20170601   | table | postgres
>   public | date_fact_2017060100 | table | postgres
>   public | date_fact_201707     | table | postgres
>   public | date_fact_rest       | table | postgres
>  (7 rows)
> >>
> >> Would showing relispartition=true tables only in \d+ fix this?
> >> 
> >>
> > 
> > I think so.
> 
> I posted a patch upthread which makes \d hide partitions (relispartition =
> true relations) and include them if the newly proposed '!' modifier is
> specified.  The '+' modifier is being used to show additional detail of
> relations chosen to be listed at all, so it seemed like a bad idea to
> extend its meaning to also dictate whether partitions are to be listed.
> We have a separate 'S' modifier to ask to list system objects (which are,
> by default hidden), so it made sense to me to add yet another modifier
> (aforementioned '!') for partitions.

We have already made '+' signal "more detail, unspecified," for a lot
of different cases.  If partitions are just "more detail" about a
table, which is the direction we've decided to go, it makes sense to
list them under the rubric of '+' rather than inventing yet another
hunk of syntax to psql's already confusing \ commands.

> > I'd like to add a flag of some kind to \d column output that marks a table
> > as having partitions, but I can't think of anything narrow enough and still
> > useful.
> 
> Actually, if \d had shown RELKIND_PARTITIONED_TABLE tables as of Type
> "partitioned table", we wouldn't need a separate flag for marking a table
> as having partitions.  But we've avoided using that term ("partitioned
> table") in the error messages and such, so it perhaps wouldn't be a good
> idea to do that here.  But I wonder if we (also) want to distinguish
> partitioned tables from regular tables?  I understood that there is some
> desire for partitions to be distinguished when they are listed in the
> output, either by default or by using a modifier.

+1 for showing that they're a different beast.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: [HACKERS] retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Tom Lane
Noah Misch  writes:
> On Mon, Jul 10, 2017 at 10:46:09AM -0400, Tom Lane wrote:
>> Shall we go for broke and also remove the ASLR-disabling patch in beta2?

> As I mentioned in my message eight hours ago, no.

Ah, sorry, I'd managed to swap out that bit of info already.  However,
I've now gone and read the Symantec paper you pointed to.  It's pretty
interesting, although I think it should not be treated as gospel.
For one thing, it's ten years old, and for another, it's describing
the behavior of Windows Vista; one might reasonably assume that Microsoft
has improved some of this stuff since then.

I noted with particular interest that the PEB (Process Environment Block)
location is stated to be randomized whether or not you have ASLR enabled,
although apparently there are only a small number of possible locations.
I wonder whether that could explain anything.  If it is affecting us
then the retry logic should fix it.

Anyway, the point at hand is Symantec's claim that there's a global
offset that normally only changes at reboot but might change at other
times.  I'm inclined to take that with a really large grain of salt.
For instance, here's a flat denial of that from a Microsoft person:
https://blogs.msdn.microsoft.com/oldnewthing/20170118-00/?p=95205
with some interesting stuff in the comments too, particularly about
how Windows attempts to load DLLs at the same location in all
processes so as to share page table entries.

Looking at Symantec's appendix II, which is a script that claims to
provoke changes in this global offset, what it's doing is creating,
executing, and deleting files over and over.

It seems plausible to me that Windows intends to assign a randomized
address to an executable or DLL once per boot, or at least once per
loading of such a file, and that what Symantec's script is doing is
defeating Windows' recognition that the "same" executable is being
run each time.  Without games like that, the executable image would
persist at the same base address, at least till it got swapped out,
allowing it to appear at the same address across multiple executions.

If things do work more or less like that, then it's likely that it would
have no effect on us, because the continuous existence of the postmaster
process would lock the PG executable as well as all DLLs loaded by the
postmaster into addresses that would not change for the life of the
postmaster.  Under that theory, what might bite us is randomization of the
PEB, stack, or heap (which are stated to be re-done every time); or some
antivirus code that's doing something weird.  It's entirely possible that
security-oriented code might take measures to ensure that it gets mapped
into different addresses in every process even though Windows wouldn't
normally do that.  But all of these cases would presumably yield to retry
of the process launch.

In short, I don't really buy that we should be afraid to enable ASLR
because of this Symantec paper.

regards, tom lane


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Noah Misch
On Mon, Jul 10, 2017 at 10:46:09AM -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > On Jul 10, 2017 16:08, "Tom Lane"  wrote:
> >> Okay, so that leaves us with a decision to make: push it into beta2, or
> >> wait till after wrap?  I find it pretty scary to push a patch with
> >> portability implications so soon before wrap, but a quick look at the
> >> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
> >> wrap, if I do it quickly.  If we wait, then it will hit the field in
> >> production releases with reasonable buildfarm testing but little more.
> >> That's also pretty scary.
> >> On balance I'm tempted to push it now for beta2, but it's a close call.
> >> Thoughts?
> 
> > Given the lack of Windows testing on non-packaged releases I think we
> > should definitely push this for beta2. That will give it much better
> > real-world testing on what is still a beta.
> > If it breaks, it's a lot better to do it in beta2 (and possibly quickly
> > roll a beta3) than in production.
> 
> Shall we go for broke and also remove the ASLR-disabling patch in beta2?
> I do not feel a need to back-patch that one, at least not yet.  But if
> we're going to do it any time soon, a beta seems like the time.

As I mentioned in my message eight hours ago, no.


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Noah Misch
On Mon, Jul 10, 2017 at 10:08:53AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > I recommend pushing your patch so the August back-branch releases have it.
> > One can see by inspection that your patch has negligible effect on systems
> > healthy today.  I have a reasonable suspicion it will help some systems.  If
> > others remain broken after this, that fact will provide a useful clue.
> 
> Okay, so that leaves us with a decision to make: push it into beta2, or
> wait till after wrap?  I find it pretty scary to push a patch with
> portability implications so soon before wrap, but a quick look at the
> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
> wrap, if I do it quickly.  If we wait, then it will hit the field in
> production releases with reasonable buildfarm testing but little more.
> That's also pretty scary.
> 
> On balance I'm tempted to push it now for beta2, but it's a close call.
> Thoughts?

Pushing now for beta2 sounds good.


Re: [HACKERS] retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Tom Lane
Magnus Hagander  writes:
> On Jul 10, 2017 16:08, "Tom Lane"  wrote:
>> Okay, so that leaves us with a decision to make: push it into beta2, or
>> wait till after wrap?  I find it pretty scary to push a patch with
>> portability implications so soon before wrap, but a quick look at the
>> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
>> wrap, if I do it quickly.  If we wait, then it will hit the field in
>> production releases with reasonable buildfarm testing but little more.
>> That's also pretty scary.
>> On balance I'm tempted to push it now for beta2, but it's a close call.
>> Thoughts?

> Given the lack of Windows testing on non-packaged releases I think we
> should definitely push this for beta2. That will give it much better
> real-world testing on what is still a beta.
> If it breaks, it's a lot better to do it in beta2 (and possibly quickly
> roll a beta3) than in production.

Shall we go for broke and also remove the ASLR-disabling patch in beta2?
I do not feel a need to back-patch that one, at least not yet.  But if
we're going to do it any time soon, a beta seems like the time.

regards, tom lane


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Magnus Hagander
On Jul 10, 2017 16:08, "Tom Lane"  wrote:

Noah Misch  writes:
> On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
>> Yeah, being able to reproduce the problem reliably enough to say whether
>> it's fixed or not is definitely the sticking point here.  I have some
>> ideas about that: ...

> I tried this procedure without finding a single failure.

Thanks for testing!  But apparently we still lack some critical part
of the recipe for getting failures in the wild.

> I watched the ensuing memory maps, which led me to these interesting facts:

>   An important result of the ASLR design in Windows Vista is that some address
>   space layout parameters, such as PEB, stack, and heap locations, are
>   selected once per program execution. Other parameters, such as the location
>   of the program code, data segment, BSS segment, and libraries, change only
>   between reboots.
>   ...
>   This offset is selected once per reboot, although we’ve uncovered at least
>   one other way to cause this offset to be reset without a reboot (see
>   Appendix II).
>   -- http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf

That is really interesting, though I'm not quite sure what to make of it.
It suggests though that you might need certain types of antivirus products
to be active, to create more variability in the initial process address
space layout than would happen from Windows ASLR alone.

> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

Okay, so that leaves us with a decision to make: push it into beta2, or
wait till after wrap?  I find it pretty scary to push a patch with
portability implications so soon before wrap, but a quick look at the
buildfarm suggests we'd get runs from 3 or 4 Windows members before the
wrap, if I do it quickly.  If we wait, then it will hit the field in
production releases with reasonable buildfarm testing but little more.
That's also pretty scary.

On balance I'm tempted to push it now for beta2, but it's a close call.
Thoughts?


Given the lack of Windows testing on non-packaged releases I think we
should definitely push this for beta2. That will give it much better
real-world testing on what is still a beta.

If it breaks, it's a lot better to do it in beta2 (and possibly quickly
roll a beta3) than in production.

/Magnus


Re: [HACKERS] retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Tom Lane
Noah Misch  writes:
> On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
>> Yeah, being able to reproduce the problem reliably enough to say whether
>> it's fixed or not is definitely the sticking point here.  I have some
>> ideas about that: ...

> I tried this procedure without finding a single failure.

Thanks for testing!  But apparently we still lack some critical part
of the recipe for getting failures in the wild.

> I watched the ensuing memory maps, which led me to these interesting facts:

>   An important result of the ASLR design in Windows Vista is that some address
>   space layout parameters, such as PEB, stack, and heap locations, are
>   selected once per program execution. Other parameters, such as the location
>   of the program code, data segment, BSS segment, and libraries, change only
>   between reboots.
>   ...
>   This offset is selected once per reboot, although we’ve uncovered at least
>   one other way to cause this offset to be reset without a reboot (see
>   Appendix II).
>   -- 
> http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf
>  

That is really interesting, though I'm not quite sure what to make of it.
It suggests though that you might need certain types of antivirus products
to be active, to create more variability in the initial process address
space layout than would happen from Windows ASLR alone.

> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

Okay, so that leaves us with a decision to make: push it into beta2, or
wait till after wrap?  I find it pretty scary to push a patch with
portability implications so soon before wrap, but a quick look at the
buildfarm suggests we'd get runs from 3 or 4 Windows members before the
wrap, if I do it quickly.  If we wait, then it will hit the field in
production releases with reasonable buildfarm testing but little more.
That's also pretty scary.

On balance I'm tempted to push it now for beta2, but it's a close call.
Thoughts?

regards, tom lane


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-07-10 Thread Marina Polyakova

Hello everyone!

Here is the second version of my patch for pgbench. Now transactions
with serialization and deadlock failures are rolled back and retried
until they end successfully or their number of attempts reaches the maximum.


In detail:
- You can set the maximum number of attempts with the appropriate
benchmarking option (--max-attempts-number). Its default value is 1,
partly because retrying shell commands can produce new errors.
- Statistics of attempts and failures are printed in progress reports, in
transaction / aggregation logs, and at the end with the other results (in
total and for each script). A transaction failure is reported there only
if the last retry of that transaction fails.
- Failures and average numbers of transaction attempts are also printed
per command, together with average latencies, if you use the appropriate
benchmarking option (--report-per-command, -r), which replaces the option
--report-latencies as I was advised here [1]. Average numbers of
transaction attempts are printed only for commands which start
transactions.
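The retry loop at the heart of this can be sketched as follows (a simplified standalone model, not the patch itself; the error codes stand in for SQLSTATEs 40001/40P01, and the callback stands in for executing one run of the script):

```c
#include <assert.h>

#define ERR_NONE           0
#define ERR_SERIALIZATION  1    /* stand-in for SQLSTATE 40001 */
#define ERR_DEADLOCK       2    /* stand-in for SQLSTATE 40P01 */

/*
 * Run one transaction via a caller-supplied callback, retrying on
 * serialization/deadlock failure until it succeeds or max_attempts is
 * exhausted.  Returns the number of attempts made; *failed is set only
 * when the last attempt also ended in an error, matching how the patch
 * counts a transaction as failed.
 */
static int
run_transaction(int (*try_xact)(void *), void *arg,
                int max_attempts, int *failed)
{
    int         attempts = 0;

    for (;;)
    {
        int         rc = try_xact(arg);

        attempts++;
        if (rc == ERR_NONE || attempts >= max_attempts)
        {
            *failed = (rc != ERR_NONE);
            return attempts;
        }
        /* a real client would issue ROLLBACK here before retrying */
    }
}

/* Stub "transaction" that fails with a serialization error twice. */
static int
flaky_xact(void *arg)
{
    int        *calls = arg;

    return (++*calls <= 2) ? ERR_SERIALIZATION : ERR_NONE;
}
```

With max_attempts = 10 the stub succeeds on its third attempt and is not counted as a failure; with the default of 1 it fails immediately.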


As usual: TAP tests for new functionality and changed documentation with 
new examples.


Patch is attached. Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707031321370.3419%40lancre


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From 58f51cdc896af801bcd35e495406655ca03aa6ce Mon Sep 17 00:00:00 2001
From: Marina Polyakova 
Date: Mon, 10 Jul 2017 13:33:41 +0300
Subject: [PATCH v2] Pgbench Retry transactions with serialization or deadlock
 errors

Now transactions with serialization or deadlock failures can be rolled back and
retried again and again until they end successfully or their number of attempts
reaches maximum. You can set the maximum number of attempts by the appropriate
benchmarking option (--max-attempts-number). Its default value is 1. Statistics
of attempts and failures is printed in progress, in transaction / aggregation
logs and in the end with other results (all and for each script). The
transaction failure is reported here only if the last retry of this transaction
fails. Also failures and average numbers of transactions attempts are printed
per-command with average latencies if you use the appropriate benchmarking
option (--report-per-command, -r). Average numbers of transactions attempts are
printed only for commands which start transactions.
---
 doc/src/sgml/ref/pgbench.sgml  | 277 ++--
 src/bin/pgbench/pgbench.c  | 751 ++---
 src/bin/pgbench/t/002_serialization_errors.pl  | 121 
 src/bin/pgbench/t/003_deadlock_errors.pl   | 130 
 src/bin/pgbench/t/004_retry_failed_transactions.pl | 280 
 5 files changed, 1421 insertions(+), 138 deletions(-)
 create mode 100644 src/bin/pgbench/t/002_serialization_errors.pl
 create mode 100644 src/bin/pgbench/t/003_deadlock_errors.pl
 create mode 100644 src/bin/pgbench/t/004_retry_failed_transactions.pl

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 64b043b..dc1daa9 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -49,22 +49,34 @@
 
 
 transaction type: builtin: TPC-B (sort of)
+transaction maximum attempts number: 1
 scaling factor: 10
 query mode: simple
 number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 1/1
+number of transactions with serialization failures: 0 (0.000 %)
+number of transactions with deadlock failures: 0 (0.000 %)
+attempts number average = 1.00
+attempts number stddev = 0.00
 tps = 85.184871 (including connections establishing)
 tps = 85.296346 (excluding connections establishing)
 
 
-  The first six lines report some of the most important parameter
+  The first seven lines report some of the most important parameter
   settings.  The next line reports the number of transactions completed
   and intended (the latter being just the product of number of clients
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In -T mode, only the actual
   number of transactions is printed.)
+  The next four lines report the number of transactions with serialization and
+  deadlock failures, and also the statistics of transactions attempts.  With
+  such errors, transactions are rolled back and are repeated again and again
+  until they end successfully or their number of attempts reaches maximum (to
+  change this maximum see the appropriate benchmarking option
+  --max-attempts-number).  The transaction failure is reported here
+  only if the last retry of this transaction fails.
   The last two lines report the number of transactions per second,
   figured with and without counting the time to start database sessions.
  
@@ -434,24 +446,28 @@ pgbench  options  dbname
   

 

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-10 Thread Amit Kapila
On Mon, Jul 10, 2017 at 3:27 PM, Kyotaro HORIGUCHI
 wrote:
> Hi,
>
> At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila  
> wrote in 

[HACKERS] Re: [COMMITTERS] pgsql: Allow multiple hostaddrs to go with multiple hostnames.

2017-07-10 Thread Heikki Linnakangas

On 07/10/2017 01:27 PM, Masahiko Sawada wrote:

This commit seems to be the cause of the documentation compilation error. I
think  is missing.

...

Attached small patch fixes this.


Thanks, committed!

Strangely, it worked on my system, despite that clear mistake. Looks 
like the 'osx' tool is more strict than what I had installed. I have now 
also installed opensp, so won't make this mistake again.


- Heikki



Re: [HACKERS] Dumping database creation options and ACLs

2017-07-10 Thread Adrien Nayrat
On 07/03/2017 05:16 PM, Rafael Martinez wrote:
> We have a discussion about this some time ago and we created a wiki page
> where we tried to write down some ideas/proposals and links to threads
> discussing the subject:
> 
> https://wiki.postgresql.org/wiki/Pg_dump_improvements

Thanks for this link! I'll look at this.


On 07/03/2017 04:58 PM, Robert Haas wrote:
> Note that some progress has been made on the CURRENT_DATABASE thing:
>
>
https://www.postgresql.org/message-id/caf3+xm+xsswcwqzmp1cjj12gpz8dxhcm9_ft1y-0fvzxi9p...@mail.gmail.com
>
> I tend to favor that approach myself, although one point in favor of
> your suggestion is that adding another flag to pg_dumpall is a heck of
> a lot less work to get to some kind of solution to this issue.

Thanks, I'll look. Even if my approach is simple, the question is "Do we want
another flag in pg_dumpall? Is it the role of pg_dumpall?".


Regards,

-- 
Adrien NAYRAT

http://dalibo.com - http://dalibo.org



Re: [HACKERS] List of hostaddrs not supported

2017-07-10 Thread Heikki Linnakangas

On 07/10/2017 01:47 PM, Arthur Zakirov wrote:

Hello,

2017-07-10 12:30 GMT+03:00 Heikki Linnakangas :



I just remembered that this was still pending. I made the documentation
changes, and committed this patch now.

We're uncomfortably close to wrapping the next beta, later today, but I
think it's better to get this into the hands of people testing this, rather
than wait for the next beta. I think the risk of breaking something that
used to work is small.


I get this warning during compilation using gcc 7.1.1 20170621:


fe-connect.c:1100:61: warning: comparison between pointer and zero
character constant [-Wpointer-compare]

conn->connhost[i].host != NULL && conn->connhost[i].host != '\0')


Thanks, fixed! That check for empty hostname was indeed wrong.
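For reference, the pattern behind the warning can be sketched with a standalone helper (hypothetical name, not the actual fe-connect.c code): since '\0' is just the integer 0, comparing the pointer itself with '\0' merely repeats the NULL check, and the empty-string case needs a dereference.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Correct form of the check the warning pointed at: the pointer is
 * tested against NULL, then the string's first byte is tested against
 * '\0'.  Writing `host != '\0'` would compare the pointer with 0 again
 * and silently accept empty hostnames.
 */
static int
host_is_set(const char *host)
{
    return host != NULL && *host != '\0';   /* note the dereference */
}
```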

- Heikki



[HACKERS] Double shared memory allocation for SLRU LWLocks

2017-07-10 Thread Alexander Korotkov
Hi, all!

It seems to me that we're allocating shared memory for SLRU lwlocks twice,
unless I'm missing something.

SimpleLruShmemSize() calculates total SLRU shared memory size including
lwlocks size.

SimpleLruInit() starts with line

shared = (SlruShared) ShmemInitStruct(name,
  SimpleLruShmemSize(nslots, nlsns),
  &found);

which allocates SLRU shared memory (LWLocks size is included because
SimpleLruShmemSize() is used for size computation).

Following line allocates shared memory for LWLocks again:
shared->buffer_locks = (LWLockPadded *) ShmemAlloc(sizeof(LWLockPadded) *
nslots);

Attached patch fixes that by removing extra ShmemAlloc for SLRU.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


fix-slru-lwlock-shmem-double-allocation.patch
Description: Binary data



Re: [HACKERS] paths in partitions of a dummy partitioned table

2017-07-10 Thread Ashutosh Bapat
On Mon, Jul 10, 2017 at 2:59 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,

Thanks for the review.

>> If a partitioned table is proven dummy, set_rel_pathlist() doesn't mark the
>> partition relations dummy and thus doesn't set any (dummy) paths in the
>> partition relations. The lack of paths in the partitions means that we can
>
> A parent won't be proven dummy directly, but becomes a dummy rel (meaning
> IS_DUMMY_REL(rel) returns true) if no child is attached as the
> result of constraint exclusion.

In a query like SELECT * FROM parent WHERE 1 = 2; the parent is marked
dummy in set_rel_size()
set_rel_size(PlannerInfo *root, RelOptInfo *rel,
             Index rti, RangeTblEntry *rte)
{
    if (rel->reloptkind == RELOPT_BASEREL &&
        relation_excluded_by_constraints(root, rel, rte))
    {
        /*
         * We proved we don't need to scan the rel via constraint exclusion,
         * so set up a single dummy path for it.  Here we only check this for
         * regular baserels; if it's an otherrel, CE was already checked in
         * set_append_rel_size().
         *
         * In this case, we go ahead and set up the relation's path right away
         * instead of leaving it for set_rel_pathlist to do.  This is because
         * we don't have a convention for marking a rel as dummy except by
         * assigning a dummy path to it.
         */
        set_dummy_rel_pathlist(rel);
    }

It's in such cases, that a parent is proven dummy without looking at
the child relations.

>
>> not use partition-wise join while joining this table with some other 
>> similarly
>> partitioned table as the partitions do not have any paths that child-joins 
>> can
>> use. This means that we can not create partition relations for such a join 
>> and
>> thus can not consider the join to be partitioned. This doesn't matter much 
>> when
>> the dummy relation is the outer relation, since the resultant join is also
>> dummy. But when the dummy relation is inner relation, the resultant join is 
>> not
>> dummy and can be considered to be partitioned with same partitioning scheme 
>> as
>> the outer relation to be joined with other similarly partitioned table. Not
>> having paths in the partitions deprives us of this future optimization.
>> Without partition-wise join, not setting dummy paths in the partition 
>> relations
>> makes sense, since those do not have any use. But with partition-wise join 
>> that
>> changes.
>
> Of course for inner-joins and even for outer-joins, there's no
> point in considering the children of a dummy parent as a side of
> a join. For what reason, or in what case does partition-wise join
> need to consider children of a proven-dummy parent?

Consider a join A LEFT JOIN B LEFT JOIN C where B is proven dummy and
A, B and C are all partitioned with the same partitioning scheme. The
result of A LEFT JOIN B is same as appending Ak LEFT JOIN Bk where Ak
and Bk are matching partitions of A and B resp. When we join this with
C, the result would be same as appending Ak LEFT JOIN Bk LEFT JOIN Ck
where Ck is partition matching Ak and Bk. So, even though B is proven
dummy, we can execute A LEFT JOIN B LEFT JOIN C as a partition-wise
join, if we can execute A LEFT JOIN B as a partition-wise join.

Does that answer your question?

>
>> If we always mark the partitions dummy, that effort may get wasted if the
>> partitioned table doesn't participate in the partition-wise join. A possible
>> solution would be to set the partitions dummy when only the partitioned table
>> participates in the join, but that happens during make_join_rel(), much after
>> we have set up the simple relations. So, I am not sure, whether that's a good
>> option. But I am open to suggestions.
>>
>> What if we always mark the partition relations of a dummy partitioned table
>> dummy? I tried attached path on a thousand partition table, the planning time
>> increased by 3-5%. That doesn't look that bad for a thousand partitions.
>>
>> Any other options/comments?
>
> Since I don't understand the meaning of the patch, the comments
> below are just from a micro-viewpoint.
>
> -   if (IS_DUMMY_REL(rel))
> +   if (IS_DUMMY_REL(rel) && !rte->inh)
>
> In set_rel_pathlist, if you want to exclude the inh==true case from
> the first if, just inverting the first and second if clauses
> would be enough, like this.
>
> |  if (rte->inh)
> |set_append_rel_pathlist(root, rel, rti, rte);
> |  else if (IS_DUMMY_REL(rel))
> |/* We already proved the relation empty, so nothing more to do *

Right. Done in the attached patch.

>
> +   if (IS_DUMMY_REL(rel))
> +   set_dummy_rel_pathlist(childrel);
>
> This is equivalent of just returning before looking over
> append_rel_list when rel is a dummy. It doesn't seem to me to add
> marked-as-dummy children to a dummy parent. (I understand that is
> the 

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-10 Thread Fabien COELHO


Hello Alik,

Your description is not very precise. What version of Postgres is used? 
If there is a decline, compared to which version? Is there a link to 
these results?


Benchmarks have been done on master (v10). I am attaching an image with the results.


Ok, thanks.

More precision would be helpful, such as the exact pgbench option used (eg 
how many client per thread in pgbench, how long does it run, prepared 
transactions, ...).


Intuitively, contention should explain a saturation of the tps 
performance, because more clients are not effective at improving tps as they 
wait for other clients, and the latency would degrade.


But it is unclear to me why the tps would be reduced even with lock 
contention, so something seems amiss.


Performance debugging by mail is an uneasy task.

Maybe you could try zipf with unlogged tables, to check whether skipping 
the WAL write does something.


Also Amit advice about the perf report looks useful.

Given the explanations, the random draw mostly hits values at the 
beginning of the interval, so when the number of client goes higher one 
just get locking contention on the updated row?


Yes, exactly.


Ok. The uniform distribution run, if all other parameters are equal, gives 
a hint about the potential performance when the performance bottleneck is 
hit.


On Workload A with uniform distribution PostgreSQL shows better results 
than MongoDB and MySQL (see attachment). Also you can notice that for a 
small number of clients the type of distribution does not affect tps on 
MySQL.


Ok. I assume that you use pgbench for pg and other ad-hoc tools for the 
other targets (mysql & mongodb).


--
Fabien.




Re: [HACKERS] List of hostaddrs not supported

2017-07-10 Thread Arthur Zakirov
Hello,

2017-07-10 12:30 GMT+03:00 Heikki Linnakangas :
>
>
> I just remembered that this was still pending. I made the documentation
> changes, and committed this patch now.
>
> We're uncomfortably close to wrapping the next beta, later today, but I
> think it's better to get this into the hands of people testing this, rather
> than wait for the next beta. I think the risk of breaking something that
> used to work is small.
>
>
I get this warning during compilation using gcc 7.1.1 20170621:

> fe-connect.c:1100:61: warning: comparison between pointer and zero
character constant [-Wpointer-compare]
> conn->connhost[i].host != NULL && conn->connhost[i].host != '\0')

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] COPY vs. transition tables

2017-07-10 Thread Andrew Gierth
> "Andrew" == Andrew Gierth  writes:
> "Thomas" == Thomas Munro  writes:

 Thomas> Here it is.  Added to open items.

 Andrew> On it.

Committed.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] [COMMITTERS] pgsql: Allow multiple hostaddrs to go with multiple hostnames.

2017-07-10 Thread Masahiko Sawada
On Mon, Jul 10, 2017 at 6:30 PM, Heikki Linnakangas
 wrote:
> Allow multiple hostaddrs to go with multiple hostnames.
>
> Also fix two other issues, while we're at it:
>
> * In error message on connection failure, if multiple network addresses
> were given as the host option, as in "host=127.0.0.1,127.0.0.2", the
> error message printed the address twice.
>
> * If there were many more ports than hostnames, the error message would
> always claim that there was one port too many, even if there was more than
> one. For example, if you gave 2 hostnames and 5 ports, the error message
> claimed that you gave 2 hostnames and 3 ports.
>
> Discussion: 
> https://www.postgresql.org/message-id/10badbc6-4d5a-a769-623a-f7ada43e1...@iki.fi
>
> Branch
> --
> master
>
> Details
> ---
> https://git.postgresql.org/pg/commitdiff/7b02ba62e9ffad5b14c24756a0c2aeae839c9d05
>
> Modified Files
> --
> doc/src/sgml/libpq.sgml   |  55 +++-
> src/interfaces/libpq/fe-connect.c | 258 --
> src/interfaces/libpq/libpq-int.h  |   3 +-
> 3 files changed, 219 insertions(+), 97 deletions(-)
>
>
> --
> Sent via pgsql-committers mailing list (pgsql-committ...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-committers

Hi,

This commit seems to be the cause of a documentation compilation error. I
think a closing </sect3> tag is missing.

$ make -C doc/src/sgml/
make: Entering directory `/home/masahiko/pgsql/source/postgresql/doc/src/sgml'
osx -wall -wno-unused-param -wno-empty -wfully-tagged -D . -D . -x
lower postgres.sgml >postgres.xml.tmp
osx:libpq.sgml:926:9:E: end tag for "SECT3" omitted, but OMITTAG NO
was specified
osx:libpq.sgml:891:3: start tag was here
make: *** [postgres.xml] Error 1
make: Leaving directory `/home/masahiko/pgsql/source/postgresql/doc/src/sgml'

Attached small patch fixes this.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 124c21b..98b6938 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -923,6 +923,7 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
host, it is not possible to e.g. specify a different username for 
different hosts.
  
+   </sect3>
   
 
   



Re: [HACKERS] COPY vs. transition tables

2017-07-10 Thread Andrew Gierth
> "Thomas" == Thomas Munro  writes:

 Thomas> Here it is.  Added to open items.

On it.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-10 Thread Kyotaro HORIGUCHI
Hi,

At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila  wrote 
in 

Re: [HACKERS] POC: Sharing record typmods between backends

2017-07-10 Thread Thomas Munro
On Thu, Jun 1, 2017 at 6:29 AM, Andres Freund  wrote:
> On May 31, 2017 11:28:18 AM PDT, Robert Haas  wrote:
>>> On 2017-05-31 13:27:28 -0400, Dilip Kumar wrote:
[ ... various discussion in support of using DHT ... ]

Ok, good.

Here's a new version that introduces a per-session DSM segment to hold
the shared record typmod registry (and maybe more things later).  The
per-session segment is created the first time you run a parallel query
(though there is handling for failure to allocate that allows the
parallel query to continue with no workers) and lives until your
leader backend exits.  When parallel workers start up, they see its
handle in the per-query segment and attach to it, which puts
typcache.c into write-through cache mode so their idea of record
typmods stays in sync with the leader (and each other).

I also noticed that I could delete even more of tqueue.c than before:
it doesn't seem to have any remaining reason to need to know the
TupleDesc.

One way to test this code is to apply just
0003-rip-out-tqueue-remapping-v3.patch and then try the example from
the first message in this thread to see it break, and then try again
with the other two patches applied.  By adding debugging trace you can
see that the worker pushes a bunch of TupleDescs into shmem, they get
pulled out by the leader when it sees the tuples, and then on a second
invocation the (new) worker can reuse them: it finds matches already
in shmem from the first invocation.

I used a DSM segment with a TOC and a DSA area inside that, like the
existing per-query DSM segment, but obviously you could spin it
various different ways.  One example: just have a DSA area and make a
new kind of TOC thing that deals in dsa_pointers.  Better ideas?

I believe combo CIDs should also go in there, to enable parallel
write, but I'm not 100% sure: that's neither per-session nor per-query
data, that's per-transaction.  So perhaps the per-session DSM could
hold a per-session DSA and a per-transaction DSA, where the latter is
reset for each transaction, just like TopTransactionContext (though
dsa.c doesn't have a 'reset thyself' function currently).  That seems
like a good place to store a shared combo CID hash table using DHT.
Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com


shared-record-typmod-registry-v3.patchset.tgz
Description: GNU Zip compressed data



Re: [HACKERS] List of hostaddrs not supported

2017-07-10 Thread Heikki Linnakangas

On 06/09/2017 04:26 PM, Robert Haas wrote:

On Fri, Jun 9, 2017 at 6:36 AM, Heikki Linnakangas  wrote:

Right. I think it's a usability fail as it is; it certainly fooled me. We
could make the error messages and documentation more clear. But even better
to allow multiple host addresses, so that it works as you'd expect.


Sure, I don't have a problem with that.  I guess part of the point of
beta releases is to correct things that don't turn out to be as smart
as we thought they were, and this seems to be an example of that.


I just remembered that this was still pending. I made the documentation 
changes, and committed this patch now.


We're uncomfortably close to wrapping the next beta, later today, but I 
think it's better to get this into the hands of people testing this, 
rather than wait for the next beta. I think the risk of breaking 
something that used to work is small.


- Heikki





Re: [HACKERS] paths in partitions of a dummy partitioned table

2017-07-10 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 6 Jul 2017 21:05:21 +0530, Ashutosh Bapat 
 wrote in 

> If a partitioned table is proven dummy, set_rel_pathlist() doesn't mark the
> partition relations dummy and thus doesn't set any (dummy) paths in the
> partition relations. The lack of paths in the partitions means that we can

A parent won't be proven dummy directly, but becomes a dummy rel (meaning
IS_DUMMY_REL(rel) returns true) if no child is attached as the
result of constraint exclusion.

> not use partition-wise join while joining this table with some other similarly
> partitioned table as the partitions do not have any paths that child-joins can
> use. This means that we can not create partition relations for such a join and
> thus can not consider the join to be partitioned. This doesn't matter much 
> when
> the dummy relation is the outer relation, since the resultant join is also
> dummy. But when the dummy relation is inner relation, the resultant join is 
> not
> dummy and can be considered to be partitioned with same partitioning scheme as
> the outer relation to be joined with other similarly partitioned table. Not
> having paths in the partitions deprives us of this future optimization.
> Without partition-wise join, not setting dummy paths in the partition 
> relations
> makes sense, since those do not have any use. But with partition-wise join 
> that
> changes.

Of course for inner-joins and even for outer-joins, there's no
point in considering the children of a dummy parent as a side of
a join. For what reason, or in what case does partition-wise join
need to consider children of a proven-dummy parent?

> If we always mark the partitions dummy, that effort may get wasted if the
> partitioned table doesn't participate in the partition-wise join. A possible
> solution would be to set the partitions dummy when only the partitioned table
> participates in the join, but that happens during make_join_rel(), much after
> we have set up the simple relations. So, I am not sure, whether that's a good
> option. But I am open to suggestions.
> 
> What if we always mark the partition relations of a dummy partitioned table
> dummy? I tried attached path on a thousand partition table, the planning time
> increased by 3-5%. That doesn't look that bad for a thousand partitions.
> 
> Any other options/comments?

Since I don't understand the meaning of the patch, the comments
below are just from a micro-viewpoint.

-   if (IS_DUMMY_REL(rel))
+   if (IS_DUMMY_REL(rel) && !rte->inh)

In set_rel_pathlist, if you want to exclude the inh==true case from
the first if, just inverting the first and second if clauses
would be enough, like this.

|  if (rte->inh)
|set_append_rel_pathlist(root, rel, rti, rte);
|  else if (IS_DUMMY_REL(rel))
|/* We already proved the relation empty, so nothing more to do */


+   if (IS_DUMMY_REL(rel))
+   set_dummy_rel_pathlist(childrel);

This is equivalent to just returning before looking over
append_rel_list when rel is a dummy. It doesn't seem to me to add
marked-as-dummy children to a dummy parent. (I understand that is
the objective of this patch.)

# Maybe I'm looking a different version of the source code, or
# I'm terriblly misunderstanding something..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-10 Thread Amit Kapila
On Mon, Jul 10, 2017 at 10:39 AM, Kyotaro HORIGUCHI
 wrote:
> At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila  
> wrote in 
>> On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas  wrote:
>> > On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila  
>> > wrote:
>> >> I think we should do that as a separate patch (I can write the same as
>> >> well) because that is not new behavior introduced by this patch, ...
>> >
>> > True, although maybe hash indexes are the only way that happens today?
>> >
>>
>> No, I think it will happen for all other indexes as well.  Basically,
>> we log new page for init forks of other indexes and then while
>> restoring that full page image, it will fall through the same path.
>
> (Sorry, I didn't notice that the hash metapage is not using log_newpage)
>

The bitmappage is also not using log_newpage.

> For example, (bt|gin|gist|spgist|brin)buildempty are using
> log_newpage to log INIT_FORK metapages. I believe that the xlog
> flow from log_newpage to XLogReadBufferForRedoExtended is
> developed so that pages in init forks are surely flushed during
> recovery, so we should fix it instead adding other flushes if the
> path doesn't work. Am I looking wrong place? (I think it is
> working.)
>

You are looking at right place.

> If I'm understanding correctly here, hashbuild and hashbuildempty
> should be refactored using the same structure as other *build
> and *buildempty functions. Specifically metapages initialization
> subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
> or...)  does only on-memory initialization and does not emit WAL,
> then *build and *buildempty emits WAL in their required way.
>

I have considered this way of doing as well, read my first e-mail [1]
in this thread "Another approach to fix the issue ".  It is not
that we can't change the code of hashbuildempty to make it log new
pages for all kind of pages (metapage, bitmappage and datapages), but
I am not sure if there is a value in going in that direction.  If we
have to go in that direction, we need to hard code some of the values
like hashm_nmaps and hashm_mapp in metapage rather than initializing
them after bitmappage creation.  Before going in that direction, I
think we should also take opinion from other people especially some
committer as we might need to maintain two different ways of doing
almost the same thing.

I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.

There is a small downside to always logging the new page, which is that it
will always log a full page image in WAL, which has the potential to
increase WAL volume if there are many unlogged tables and indexes.
Now considering that the init fork generally has very few pages, it is not
a big concern, but still avoiding full page images has some value.

>
>> >>> By the way the comment of the function ReadBufferWithoutRelcache
>> >>> has the following sentense.
>> >>>
>> >>> | * NB: At present, this function may only be used on permanent 
>> >>> relations, which
>> >>> | * is OK, because we only use it during XLOG replay.  If in the future 
>> >>> we
>> >>> | * want to use it on temporary or unlogged relations, we could pass 
>> >>> additional
>> >>> | * parameters.
>> >>>
>> >>> and does
>> >>>
>> >>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, 
>> >>> blockNum,
>> >>>  mode, strategy, 
>> >>> );
>> >>>
>> >>> This surely works since BufferAlloc recognizes INIT_FORK. But
>> >>> some adjustment may be needed around here.
>> >>
>> >> Yeah, it should probably mention that the init fork of an unlogged
>> >> relation is also OK.
>> >>
>> >
>> > I think we should do that as a separate patch (I can write the same as
>> > well) because that is not new behavior introduced by this patch, but
>> > let me know if you think that we should add such a comment in this
>> > patch itself.
>> >
>>
>> Attached a separate patch to adjust the comment atop 
>> ReadBufferWithoutRelcache.
>
> + * NB: At present, this function may only be used on permanent relations and
> + * init fork of an unlogged relation, which is OK, because we only use it
> + * during XLOG replay.  If in the future we want to use it on temporary or
> + * unlogged relations, we could pass additional parameters.
>
> *I* think the target of the function is permanent relations and
> init forks, not unlogged relations. And I'd like to have an
> additional comment about RELPERSISTENCE_PERMANENT. Something like
> the attached.
>

Okay, let's leave this for committer to decide.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-10 Thread Amit Kapila
On Mon, Jul 10, 2017 at 12:19 PM, Alik Khilazhev  wrote:

> Hello, Fabien!
>
> Your description is not very precise. What version of Postgres is used? If
> there is a decline, compared to which version? Is there a link to these
> results?
>
>
> Benchmarks have been done on master (v10). I am attaching an image with the results.
>
>
It will be interesting to see what the profiling data with perf says about
this for PostgreSQL.  Can you try to get the perf report?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Amit Langote
On 2017/07/10 15:32, Craig Ringer wrote:
> On 8 July 2017 at 00:03, David Fetter  wrote:
> 
>> On Fri, Jul 07, 2017 at 10:29:26AM +0900, Amit Langote wrote:
>>> Hi Mark,
>>>
>>> On 2017/07/07 9:02, Mark Kirkwood wrote:
 I've been trying out the new partitioning in version 10. Firstly, I
>> must
 say this is excellent - so much nicer than the old inheritance based
>> method!
>>>
>>> Thanks. :)
>>>
 My only niggle is the display of partitioned tables via \d etc. e.g:

 part=# \d
 List of relations
  Schema | Name | Type  |  Owner
 +--+---+--
  public | date_fact| table | postgres
  public | date_fact_201705 | table | postgres
  public | date_fact_201706 | table | postgres
  public | date_fact_20170601   | table | postgres
  public | date_fact_2017060100 | table | postgres
  public | date_fact_201707 | table | postgres
  public | date_fact_rest   | table | postgres
 (7 rows)
>>
>> Would showing relispartition=true tables only in \d+ fix this?
>> 
>>
> 
> I think so.

I posted a patch upthread which makes \d hide partitions (relispartition =
true relations) and include them if the newly proposed '!' modifier is
specified.  The '+' modifier is being used to show additional detail of
relations chosen to be listed at all, so it seemed like a bad idea to
extend its meaning to also dictate whether partitions are to be listed.
We have a separate 'S' modifier to ask to list system objects (which are,
by default hidden), so it made sense to me to add yet another modifier
(aforementioned '!') for partitions.

> I'd like to add a flag of some kind to \d column output that marks a table
> as having partitions, but I can't think of anything narrow enough and still
> useful.

Actually, if \d had shown RELKIND_PARTITIONED_TABLE tables as of Type
"partitioned table", we wouldn't need a separate flag for marking a table
as having partitions.  But we've avoided using that term ("partitioned
table") in the error messages and such, so it perhaps wouldn't be a good idea
to do that here.  But I wonder if we (also) want to distinguish
partitioned tables from regular tables?  I understood that there is some
desire for partitions to be distinguished when they are listed in the output,
either by default or by using a modifier.

Thanks,
Amit





[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Michael Paquier
On Mon, Jul 10, 2017 at 3:36 PM, Noah Misch  wrote:
> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

+1. Thanks for taking the time to test and look at the patch.
-- 
Michael




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-10 Thread Alik Khilazhev
Hello, Fabien!

> Your description is not very precise. What version of Postgres is used? If 
> there is a decline, compared to which version? Is there a link to these 
> results?

Benchmarks have been done on master (v10). I am attaching an image with the results.

> Indeed, the function computation is over expensive, and the numerical 
> precision of the implementation is doubtful.
> 
> If there is no better way to compute this function, ISTM that it should be 
> summed in reverse order to accumulate small values first, from (1/n)^s + ... 
> + (1/2)^ s. As 1/1 == 1, the corresponding term is 1, no point in calling pow 
> for this one, so it could be:
> 
>   double ans = 0.0;
>   for (i = n; i >= 2; i--)
> ans += pow(1. / i, theta);
>   return 1.0 + ans;

You are right, it’s better to reverse the order.

> If the functions when actually used is likely to be called with different 
> parameters, then some caching beyond the last value would seem in order. 
> Maybe a small fixed size array?
> 
> However, it should be somehow thread safe, which does not seem to be the case 
> with the current implementation. Maybe a per-thread cache? Or use a lock only 
> to update a shared cache? At least it should avoid locking to read values…

Yeah, I forgot about thread-safety. I will implement a per-thread cache with a 
small fixed-size array.

> Given the explanations, the random draw mostly hits values at the beginning 
> of the interval, so when the number of client goes higher one just get 
> locking contention on the updated row?

Yes, exactly. 

> ISTM that also having the tps achieved with a flat distribution would allow 
> to check this hypothesis.

On Workload A with uniform distribution PostgreSQL shows better results than 
MongoDB and MySQL (see attachment). Also you can notice that for a small number 
of clients the type of distribution does not affect tps on MySQL. 



And it’s important to mention that postgres ran with the option 
synchronous_commit=off; to match its durability, MongoDB ran with 
writeConcern w=1, j=false. In this mode there is a possibility of losing all 
changes from the last second. If we run postgres with max durability, MongoDB 
will lag far behind. 
---
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com 
The Russian Postgres Company



[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Noah Misch
On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
> Amit Kapila  writes:
> > Sure.  I think it is slightly tricky because specs don't say clearly
> > how ASLR can impact the behavior of any API and in my last attempt I
> > could not reproduce the issue.
> 
> > I can try to do basic verification with the patch you have proposed,
> > but I fear, to do the actual tests as suggested by you is difficult as
> > it is not reproducible on my machine, but I can still try.
> 
> Yeah, being able to reproduce the problem reliably enough to say whether
> it's fixed or not is definitely the sticking point here.  I have some
> ideas about that:
> 
> 1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
> address space are so small you'd never prove anything.  (And of course
> it has to be a version that has ASLR enabled.)
> 
> 2. Revert 7f3e17b48 so that you have an ASLR-enabled build.
> 
> 3. Crank shared_buffers to the maximum the machine will allow, reducing
> the amount of free address space and improving the odds of a collision.
> 
> 4. Spawn lots of sessions --- pgbench with -C option might be a useful
> testing tool.
> 
> With luck, that will get failures often enough that you can be pretty
> sure whether a patch improves the situation or not.

I tried this procedure without finding a single failure.  Attributes:

- 32-bit build of commit fad7873 w/ 7f3e17b48 reverted
- confirmed ASLR-enabled with "dumpbin /headers postgres.exe"
- OS = 64-bit Windows Server 2016
- compiler = Visual Studio 2015 Express
- no config.pl, so not linked with any optional library
- tried shared_buffers=1100M and shared_buffers=24M
- echo 'select 1;' >pgbench-trivial; pgbench -n -f pgbench-trivial -C -c 30 -j5 -T900
- tried starting as a service at boot time, in addition to manual start

=== postgresql.conf ===
log_connections = on
log_line_prefix = '%p %m '
autovacuum = off
listen_addresses = '127.0.0.1'
log_min_messages = debug1
max_connections = 40
shared_buffers = 24MB
deadlock_timeout = '20ms'
wal_buffers = '16MB'
fsync = off


I watched the ensuing memory maps, which led me to these interesting facts:

  An important result of the ASLR design in Windows Vista is that some address
  space layout parameters, such as PEB, stack, and heap locations, are
  selected once per program execution. Other parameters, such as the location
  of the program code, data segment, BSS segment, and libraries, change only
  between reboots.
  ...
  This offset is selected once per reboot, although we’ve uncovered at least
  one other way to cause this offset to be reset without a reboot (see
  Appendix II).
  -- 
http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf
 

So, reattach failures might be reproducible on some reboots and not others.
While I had a few reboots during the test work, I did not exercise that
dimension systematically.  This information also implies we should not
re-enable ASLR, even if your patch helps, due to addresses that change less
than once per process creation but occasionally more than once per boot.


I did try the combination of your patch and the following to simulate a 95%
failure rate:

--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -410,2 +410,4 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
 										MEM_RESERVE, PAGE_READWRITE);
+	if (random() > MAX_RANDOM_VALUE / 20)
+		address = NULL;
 	if (address == NULL)

This confirmed retries worked.  During a 900s test run, one connection attempt
failed permanently by exhausting its 100 retries.  The run achieved 11
connections per second.  That is an order of magnitude slower than a run
without simulated failures, but most applications would tolerate it.
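For reference, the shape of the retry logic being exercised can be sketched in isolation. This is a simplified stand-in, not the actual win32_shmem.c code: try_reserve, reserve_with_retries, the static buffer, and the 1-in-20 success rate are all illustrative; only MAX_RANDOM_VALUE and the 100-retry budget come from the discussion above.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_RANDOM_VALUE 0x7FFFFFFF		/* assumed to match PostgreSQL's definition */
#define MAX_RETRIES 100

/*
 * Hypothetical stand-in for the address-space reservation call; fails
 * ~95% of the time, mimicking the simulated ASLR collision above.
 */
static void *
try_reserve(void)
{
	static char region[4096];

	if (random() > MAX_RANDOM_VALUE / 20)
		return NULL;			/* simulated collision */
	return region;
}

/*
 * Retry loop analogous to the one the patch adds: keep trying until the
 * reservation succeeds or the retry budget is exhausted.  On return,
 * *attempts holds the number of failed tries before success (or
 * MAX_RETRIES on permanent failure).
 */
static void *
reserve_with_retries(int *attempts)
{
	for (*attempts = 0; *attempts < MAX_RETRIES; (*attempts)++)
	{
		void	   *addr = try_reserve();

		if (addr != NULL)
			return addr;
	}
	return NULL;				/* permanent failure, like the one observed */
}
```

With a 5% per-try success rate, the chance of exhausting all 100 retries is about 0.95^100 ≈ 0.6%, which matches seeing one permanent failure in a 900-second run.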

I recommend pushing your patch so the August back-branch releases have it.
One can see by inspection that your patch has negligible effect on systems
healthy today.  I have a reasonable suspicion it will help some systems.  If
others remain broken after this, that fact will provide a useful clue.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New partitioning - some feedback

2017-07-10 Thread Craig Ringer
On 8 July 2017 at 00:03, David Fetter  wrote:

> On Fri, Jul 07, 2017 at 10:29:26AM +0900, Amit Langote wrote:
> > Hi Mark,
> >
> > On 2017/07/07 9:02, Mark Kirkwood wrote:
> > > I've been trying out the new partitioning in version 10. Firstly, I
> must
> > > say this is excellent - so much nicer than the old inheritance based
> method!
> >
> > Thanks. :)
> >
> > > My only niggle is the display of partitioned tables via \d etc. e.g:
> > >
> > > part=# \d
> > > List of relations
> > >  Schema | Name | Type  |  Owner
> > > +--+---+--
> > >  public | date_fact| table | postgres
> > >  public | date_fact_201705 | table | postgres
> > >  public | date_fact_201706 | table | postgres
> > >  public | date_fact_20170601   | table | postgres
> > >  public | date_fact_2017060100 | table | postgres
> > >  public | date_fact_201707 | table | postgres
> > >  public | date_fact_rest   | table | postgres
> > > (7 rows)
>
> Would showing relispartition=true tables only in \d+ fix this?
> 
>

I think so.

I'd like to add a flag of some kind to \d column output that marks a table
as having partitions, but I can't think of anything narrow enough and still
useful.
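As an aside, the filtering psql would need is straightforward at the catalog level. A sketch of the underlying query (assuming PG 10's pg_class.relispartition column; the schema restriction to 'public' is just for the example) might be:

```sql
-- List ordinary and partitioned tables, hiding relations that are
-- themselves partitions (what a filtered \d could show by default).
SELECT n.nspname AS "Schema",
       c.relname AS "Name",
       CASE c.relkind WHEN 'p' THEN 'partitioned table'
                      ELSE 'table' END AS "Type"
FROM pg_catalog.pg_class c
     JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND NOT c.relispartition
  AND n.nspname = 'public'
ORDER BY 1, 2;
```

Against Mark's example above, this would list only date_fact (as a partitioned table), leaving the seven leaf partitions to \d+.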

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-10 Thread Amit Langote
Fujita-san,

On 2017/07/10 14:15, Etsuro Fujita wrote:
> On 2017/07/07 18:47, Amit Langote wrote:
>> On 2017/07/06 16:06, Etsuro Fujita wrote:
>>> I think this should be fixed.  Attached is a patch for that.
> 
>> How about setting ri_RangeTableIndex of the partition ResultRelInfo
>> correctly in the first place?  If you look at the way InitResultRelInfo()
>> is called in ExecSetupPartitionTupleRouting(), a dummy RT index of 1 is
>> passed.  We could instead pass the correct one.  I think
>> ModifyTable.nominalRelation is that (at least in the INSERT case).
>>
>> The attached patch implements that.  It modifies
>> ExecSetupPartitionTupleRouting to accept one more argument resultRTindex
>> and passes along the same to InitResultRelInfo().  Later when
>> ExecConstraints() wants to build the modifiedCols set, it looks up the
>> correct RTE using the partition ResultRelInfo, which has the appropriate
>> ri_RangeTableIndex set and uses the same.
> 
> Looks good to me.

Thanks for the review.

>> The patch keeps tests that you added in your patch.  Added this to the
>> open items list.
> 
> Another thing I noticed is the error handling in ExecWithCheckOptions; it
> doesn't take any care for partition tables, so the column description in
> the error message for WCO_VIEW_CHECK is built using the partition table's
> rowtype, not the root table's.  But I think it'd be better to build that
> using the root table's rowtype, like ExecConstraints, because that would
> make the column description easier to understand, since the parent view(s)
> (from which the WithCheckOptions evaluated there originate) would have been
> defined on the root table.  This seems independent from the above issue,
> so I created a separate patch, which I'm attaching. What do you think
> about that?

Good catch.  The patch looks spot on to me.  To keep both the patches
together, I'm attaching them here as 0001 which fixes the original issue
you reported on this thread and 0002 which is your patch to fix error
reporting in ExecWithCheckOptions.

Thanks,
Amit
From f9f4cc93d962ff433fe98b36a84953ba4048a725 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 7 Jul 2017 17:24:44 +0900
Subject: [PATCH 1/2] Properly set ri_RangeTableIndex of partition result rels

Previously it was set to a "dummy" value of 1, which would cause
certain code, such as ExecConstraints()'s error handling code, to
look at the unintended range table entry.  Instead, set it to the
index of the RT entry corresponding to the root partitioned table.
---
 src/backend/commands/copy.c|  1 +
 src/backend/executor/execMain.c|  3 ++-
 src/backend/executor/nodeModifyTable.c |  1 +
 src/include/executor/executor.h|  1 +
 src/test/regress/expected/insert.out   | 18 ++
 src/test/regress/sql/insert.sql| 18 ++
 6 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f391828e74..51693791cc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1433,6 +1433,7 @@ BeginCopy(ParseState *pstate,
num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   1,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0f08283f81..c36b5b7392 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3213,6 +3213,7 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+  Index resultRTindex,
 							   PartitionDispatch **pd,
 							   ResultRelInfo **partitions,
 							   TupleConversionMap ***tup_conv_maps,
@@ -3271,7 +3272,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 
InitResultRelInfo(leaf_part_rri,
  partrel,
-					  1,		/* dummy */
+					  resultRTindex,	/* dummy */
  rel,
  0);
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8d17425abe..77ba15dd90 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1914,6 +1914,7 @@