Re: [HACKERS] WIP: Failover Slots

2016-01-21 Thread Craig Ringer
On 22 January 2016 at 00:31, Robert Haas  wrote:

> On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs  wrote:
> > Failover Slots
> > In the current patch, any slot defined on a master will generate WAL,
> > leading to a pending-slot being present on all standby nodes. When a
> standby
> > is promoted, the slot becomes usable and will have the properties as of
> the
> > last fsync on the master.
>
> No objection to the concept, but I think the new behavior needs to be
> optional.  I am guessing that you are thinking along similar lines,
> but it's not explicitly called out here.


Yeah, I think that's the idea too. For one thing we'll want to allow
non-failover slots to continue to be usable from a streaming replica, but
we must ERROR if anyone attempts to attach to and replay from a failover
slot via a replica, since we can't write WAL there. Both kinds are needed.

There's a 'failover' bool member in the slot persistent data for that
reason. It's not (yet) exposed via the UI.

I presume we'll want to:

* add a new default-false argument is_failover_slot or similar to
pg_create_logical_replication_slot and pg_create_physical_replication_slot

* Add a new optional flag argument FAILOVER to CREATE_REPLICATION_SLOT in
both its LOGICAL and PHYSICAL forms.

... and will be adding that to this patch, barring syntax objections etc.
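To make that concrete, here is a rough sketch of what the proposed calls might
look like; the argument position, argument name and FAILOVER keyword are only
the suggestions above and could easily change, and the existing optional
arguments of the slot-creation functions are ignored for brevity:

-- SQL level, with a proposed default-false failover argument
SELECT pg_create_physical_replication_slot('standby1_slot', true);
SELECT pg_create_logical_replication_slot('decoding_slot', 'test_decoding', true);

-- Walsender protocol, with a proposed optional FAILOVER flag
CREATE_REPLICATION_SLOT standby1_slot PHYSICAL FAILOVER
CREATE_REPLICATION_SLOT decoding_slot LOGICAL test_decoding FAILOVER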


It's also going to be necessary to handle what happens when a new failover
slot (physical or logical) created on the master conflicts with the name of a
non-failover physical slot that was created on the replica. In that case I am
inclined to terminate any walsender backend for the replica's slot with a
conflict with recovery, remove its slot and replace it with the failover slot.
The failover slot does not permit replay while in recovery, so if the
booted-off client reconnects it'll get an ERROR saying it can't replay from a
failover slot. Between the conflict with recovery and the failover slot error
it should be pretty clear to the admin what's happening. There could still be
an issue if the client persistently keeps retrying and successfully reconnects
after replica promotion, but I don't see that much can be done about that. The
documentation will need to address the need to avoid name conflicts between
slots created on replicas and failover slots on the master.

I'll be working on those docs, on the name conflict handling, and on the
syntax during my coming flight. Toddler permitting.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] custom function for converting human readable sizes to bytes

2016-01-21 Thread Pavel Stehule
2016-01-22 7:03 GMT+01:00 Vitaly Burovoy :

> On 1/20/16, Pavel Stehule  wrote:
> > ...
> > New version is attached
> >
> > Regards
> > Pavel
>
> Hello!
>
> 1. Why is the function marked as VOLATILE? It depends only on the input
> value and changes nothing in the DB.
> I guess it should be IMMUTABLE so its result can be cached efficiently.
>
> 2.
> >   text*arg = PG_GETARG_TEXT_PP(0);
> ...
> >   str = text_to_cstring(arg);
> >
> Hmm... My question was _why_ you get TEXT and convert it instead of
> getting an argument of type cstring (not about usage of
> VARSIZE_ANY_EXHDR).
> It was enough to give me a link[1] and leave the usage of
> VARSIZE_ANY_EXHDR as is.
>
>
>
> I think all the other remarks below are mostly cosmetic. The main behavior
> is OK (considering it will not be used in heavy queries).
>
> ===
> Documentation:
> Besides the currently added row in the table I think it is worth adding a
> detailed comment after the block "pg_size_pretty can be
> used", similar to the following (though the best way is to get advice on the
> correct phrasing):
> "pg_size_bytes can be used to get a size in bytes as bigint from a
> human-readable format for comparison purposes (it is the opposite of the
> pg_size_pretty function). The format is a number with an optional size
> unit: kB, MB, GB or TB."
> (and delete the unit sizes from the table row).
>
> ===
> Tests:
> Since there was a message explaining why units bigger than
> "TB" can't be used[2], it would be good to add those size units
> ("PB", "EB", "ZB") to the tests, with a note to update the documentation
> when those units are supported.
>
> ===
> Code style:
> 1. When you declare pointers, you must left-align the _names_, i.e. the
> asterisks must be one position to the left of the aligned names (one to
> three spaces instead of a TAB before them).
>
> 2.
> > int   multiplier;
> One more TAB is needed to align it with the name on the next line.
>
> ===
> Code:
>
> > /* allow whitespace between integer and unit */
> Maybe "numeric" instead of "integer"?
>
> > "\"%s\" is not in expected format"
> It is not clear what format is expected.
>
> > /* ignore plus symbol */
> It seems it is not ignored, but copied... ;-)
>
> > to ger hintstr
> s/ger/get/
> > Elsewhere is used space trimmed buffer.
> I'd write it as "Otherwise the trimmed value is used."
> I'm not good at English, but that whole block looks a little odd to me.
> I tried to reword it in the attachment single_buffer.c, but I don't
> think it is enough.
>
> > We allow ending spaces.
> "trailing"?
>
> but this message can be a little unintuitive - better text is "is not
> a valid number"
> It was the block with the comment "don't allow empty string", i.e. the
> absence of a number...
>
> > Automatic memory deallocation doesn't cover all possible situations where
> > the function can be used - for example DirectFunctionCall - so explicit
> > deallocation can decrease the memory requirements when you call these
> > functions from C.
> DirectFunctionCall uses the memory context of the caller. For example,
> you don't free "num" allocated by numeric_in and numeric_mul, nor
> mul_num allocated by int8_numeric...
> I'd ask experienced hackers about the importance of freeing (or leaving)
> "str" and "buffer".
>

Sometimes the memory can be released only by the caller, or the memory usage
is pretty minimal. But when you can release memory, you should do so to reduce
the pressure on the memory allocator.



>
> > postgres=# select pg_size_bytes('+912+ kB');
> > ERROR:  invalid unit: "+ kB"
> > This is difficult - you have to divide the string into two parts: the first
> > word is the number, the second word is the unit.
> Your code looks duplicated (not by lines, but by behavior).
> numeric_in has similar checks; let it do them for you. The goal of
> your function is just to split the mantissa and the size unit, and if the
> latter is found, turn it into an exponent.
> See my example in the attachment check_mantissa_by_numeric_in.c. It
> just searches for the first non-space character that can't be part of a
> numeric, then passes everything before it to numeric_in and lets it check
> anything it should. I even left the leading spaces in, to show the full
> numeric part to the user if an error occurs (just in case).
> The second attachment single_buffer.c is an attempt to avoid
> allocating the second buffer (the first is "str" allocated by
> text_to_cstring) and copying into it. I don't think it gives a big
> improvement, but nevertheless.
>

This code is a little more complex than strictly necessary (only a few lines
more) in order to produce readable error messages. I think it is correct.

I'll look at this patch next week.

Thank you for the review.

Regards

Pavel


>
> ===
> [1] http://www.postgresql.org/message-id/29618.1451882...@sss.pgh.pa.us
> [2]
> http://www.postgresql.org/message-id/CAB7nPqS6Wob4WnZb=dhb3o0pc-nx1v3xjszkn3z9kbexgcq...@mail.gmail.com
>
> --
> Best regards,
> Vitaly Burovoy
>


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2016-01-21 Thread Vitaly Burovoy
On 1/21/16, Tom Lane  wrote:
> Vik Fearing  writes:
>> I looked around for other places where this code should be used and
>> didn't find any.  I am marking this patch Ready for Committer.
>
> I pushed this with some adjustments, mainly to make sure that the
> erroneous-units errors exactly match those that would be thrown in
> the main code line.
>
>   regards, tom lane

Thank you! I didn't pay enough attention to it at that time.
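For the archives, a quick illustration of the behavior after this commit, as I
understand it (illustrative only; exact output formatting may differ):

-- Monotonically increasing fields now yield infinity...
SELECT extract(epoch FROM timestamp 'infinity');   -- Infinity
-- ...while other fields yield NULL instead of the old, meaningless zero
SELECT extract(month FROM timestamp 'infinity');   -- NULL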

-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-01-21 Thread Michael Paquier
On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier
 wrote:
> As this thread is stalling a bit, please find attached a series of
> patch gathering all the pending issues for this thread:
> - 0001, fix config_default.pl for MSVC builds to take into account TAP tests
> - 0002, append a node name in get_new_node (per Noah's request)
> - 0003, the actual recovery test suite
> Hopefully this facilitates future reviews.

Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two
patches still apply and pass cleanly.
-- 
Michael




Re: [HACKERS] custom function for converting human readable sizes to bytes

2016-01-21 Thread Vitaly Burovoy
On 1/20/16, Pavel Stehule  wrote:
> ...
> New version is attached
>
> Regards
> Pavel

Hello!

1. Why is the function marked as VOLATILE? It depends only on the input
value and changes nothing in the DB.
I guess it should be IMMUTABLE so its result can be cached efficiently.

2.
>   text*arg = PG_GETARG_TEXT_PP(0);
...
>   str = text_to_cstring(arg);
>   
Hmm... My question was _why_ you get TEXT and convert it instead of
getting an argument of type cstring (not about usage of
VARSIZE_ANY_EXHDR).
It was enough to give me a link[1] and leave the usage of
VARSIZE_ANY_EXHDR as is.



I think all the other remarks below are mostly cosmetic. The main behavior
is OK (considering it will not be used in heavy queries).

===
Documentation:
Besides the currently added row in the table I think it is worth adding a
detailed comment after the block "pg_size_pretty can be
used", similar to the following (though the best way is to get advice on the
correct phrasing):
"pg_size_bytes can be used to get a size in bytes as bigint from a
human-readable format for comparison purposes (it is the opposite of the
pg_size_pretty function). The format is a number with an optional size
unit: kB, MB, GB or TB."
(and delete the unit sizes from the table row).
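For instance, assuming the function lands with the behavior described above,
the comparison use case would look something like this (the threshold is
illustrative):

-- Find relations larger than a human-readable threshold
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE pg_relation_size(oid) > pg_size_bytes('500 MB')
ORDER BY pg_relation_size(oid) DESC;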

===
Tests:
Since there was a message explaining why units bigger than
"TB" can't be used[2], it would be good to add those size units
("PB", "EB", "ZB") to the tests, with a note to update the documentation
when those units are supported.

===
Code style:
1. When you declare pointers, you must left-align the _names_, i.e. the
asterisks must be one position to the left of the aligned names (one to
three spaces instead of a TAB before them).

2.
> int   multiplier;
One more TAB is needed to align it with the name on the next line.

===
Code:

> /* allow whitespace between integer and unit */
Maybe "numeric" instead of "integer"?

> "\"%s\" is not in expected format"
It is not clear what format is expected.

> /* ignore plus symbol */
It seems it is not ignored, but copied... ;-)

> to ger hintstr
s/ger/get/
> Elsewhere is used space trimmed buffer.
I'd write it as "Otherwise the trimmed value is used."
I'm not good at English, but that whole block looks a little odd to me.
I tried to reword it in the attachment single_buffer.c, but I don't
think it is enough.

> We allow ending spaces.
"trailing"?

> but this message can be a little unintuitive - better text is "is not a
> valid number"
It was the block with the comment "don't allow empty string", i.e. the
absence of a number...

> Automatic memory deallocation doesn't cover all possible situations where
> the function can be used - for example DirectFunctionCall - so explicit
> deallocation can decrease the memory requirements when you call these
> functions from C.
DirectFunctionCall uses the memory context of the caller. For example,
you don't free "num" allocated by numeric_in and numeric_mul, nor
mul_num allocated by int8_numeric...
I'd ask experienced hackers about the importance of freeing (or leaving)
"str" and "buffer".

> postgres=# select pg_size_bytes('+912+ kB');
> ERROR:  invalid unit: "+ kB"
> This is difficult - you have to divide the string into two parts: the first
> word is the number, the second word is the unit.
Your code looks duplicated (not by lines, but by behavior).
numeric_in has similar checks; let it do them for you. The goal of
your function is just to split the mantissa and the size unit, and if the
latter is found, turn it into an exponent.
See my example in the attachment check_mantissa_by_numeric_in.c. It
just searches for the first non-space character that can't be part of a
numeric, then passes everything before it to numeric_in and lets it check
anything it should. I even left the leading spaces in, to show the full
numeric part to the user if an error occurs (just in case).
The second attachment single_buffer.c is an attempt to avoid
allocating the second buffer (the first is "str" allocated by
text_to_cstring) and copying into it. I don't think it gives a big
improvement, but nevertheless.

===
[1] http://www.postgresql.org/message-id/29618.1451882...@sss.pgh.pa.us
[2] 
http://www.postgresql.org/message-id/CAB7nPqS6Wob4WnZb=dhb3o0pc-nx1v3xjszkn3z9kbexgcq...@mail.gmail.com

-- 
Best regards,
Vitaly Burovoy
/*
 * Convert human readable size to long int.
 *
 * Due to support for decimal values and case insensitivity of units
 * a function parse_int cannot be used.
 */
Datum
pg_size_bytes(PG_FUNCTION_ARGS)
{
	text	   *arg = PG_GETARG_TEXT_PP(0);
	char	   *str,
			   *strptr;
	char	   *buffer,
			   *bufptr;
	Numeric		num;
	int64		result;
	size_t		len;

	str = text_to_cstring(arg);
	strptr = str;

	/* Skip leading spaces */
	while (isspace((unsigned char) *strptr))
		strptr++;

	bufptr = strptr;
	for (; *strptr != '\0'; strptr++)
	{
		if (isdigit((unsigned char) *strptr) ||
			isspace((unsigned char) *strptr) ||
			*strptr == '.' || *strptr == '+' || *strptr == '-')
			continue;

		break;
	}

	/* don't allow empty string */
	if (bufptr == strptr)
		ereport(ERROR,
(errcode

Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-21 Thread Michael Paquier
On Tue, Jan 19, 2016 at 4:20 PM, Tomas Vondra
 wrote:
>
>
> On 01/19/2016 08:03 AM, Michael Paquier wrote:
>>
>> On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
>>  wrote:
>>>
>>>
> ...

 Tomas, I am planning to have a look at that, because it seems to be
 important. In case it becomes lost on my radar, do you mind if I add
 it to the 2016-03 CF?
>>>
>>>
>>>
>>> Well, what else can I do? I have to admit I'm quite surprised this is
>>> still
>>> rotting here, considering it addresses a rather serious data loss /
>>> corruption issue on pretty common setup.
>>
>>
>> Well, I think you did what you could. And we need to be sure now that
>> it gets in and that this patch gets a serious lookup. So for now my
>> guess is that not losing track of it would be a good first move. I
>> have added it here to attract more attention:
>> https://commitfest.postgresql.org/9/484/
>
>
> Ah, thanks. I haven't realized it's not added into 2016-1 (I'd swear I added
> it into the CF app).

So, I have been playing with a Linux VM on VMware Fusion, and on ext4
with data=ordered the renames are getting lost if the root folder is
not fsynced. By kill -9'ing the VM I am able to reproduce that really
easily.

Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, as the rename happens
only on Windows.
3) In exitArchiveRecovery, if the rename is not made durable I think it
does not matter much. Even if recovery.conf is still present once
the node restarts, the node would move back into recovery, and actually
we had better go back into recovery in this case, no?
4) StartupXLOG, for the tablespace map. I don't think this one is
needed either. Even if the tablespace map is not removed after a
power loss, the user would get an error saying that the file should be
removed.
5) For the one where haveBackupLabel || haveTblspcMap: if we do the
fsync, we guarantee that there is no need to run recovery again.
But in case of a power loss, isn't it better to do the recovery again?
6) For the one after XLogArchiveNotify() for the last partial segment
of the old timeline, it doesn't really matter if the change is not made
persistent, as this is mainly done because it is useful to identify
that it is a partial segment.
7) In CancelBackup, I think this one is not needed either. We would
surely want to go back to recovery if those files remain after a
power loss.

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here: we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename was
not effective.
2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could still have it though, as an optimization to avoid having to archive
this file again.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has already been
archived, preventing a retry.

In timeline.c:
1) writeTimeLineHistoryFile: it does not matter much. In the worst
case we would just have a dummy temporary file, and the startup process
would come back here (in exitArchiveRecovery() we may similarly finish
with an unnamed backup file).
2) In writeTimeLineHistoryFile, similarly, we don't need to care much;
in WalRcvFetchTimeLineHistoryFiles recovery would take the same path
again.

Thoughts?
-- 
Michael




Re: [HACKERS] Releasing in September

2016-01-21 Thread Noah Misch
On Wed, Jan 20, 2016 at 06:58:24PM +, Simon Riggs wrote:
> The main problem is the length of the integration phase, which is mostly
> where nothing happens.

The open items wiki page saw steady change from 30 April to 28 December[1];
the two longest quiet periods were an eleven-day gap from 21 August and a
nine-day gap from 16 December.  Nineteen distinct contributors were finding,
fixing or otherwise editing data about 9.5 defects.  The integration work is
too slow, yes, but this persistent myth of a dead period is false.

For my part, I spent 1 July to 11 December reviewing 9.5 commits and reporting
or fixing defective 9.5 work.  (I did spoil myself with a few breaks fixing
pre-9.5 bugs.)  I doubt the 9.5 new bugs would have run dry within the next
five months, either, but the release team picked a fair juncture to give up
and call it dot-zero.

nm

[1] 
https://wiki.postgresql.org/index.php?title=PostgreSQL_9.5_Open_Items&limit=500&action=history




Re: [HACKERS] Releasing in September

2016-01-21 Thread Noah Misch
On Wed, Jan 20, 2016 at 12:40:04PM -0500, Tom Lane wrote:
> I think it might also be a good idea if we could somehow distinguish
> "nobody had time for that yet" from "nobody is interested", with an eye
> to flat-out rejecting patches that no one cares enough about to review.
> Maybe we could reduce the workload by inventing some kind of up-front
> filter whereby a patch isn't even a candidate to get reviewed unless, say,
> at least two people besides the author say "this sounds like it's worth
> pursuing".
> 
> One of the other things I do not like about the current CF process is that
> it's created a default assumption that most submitted patches should get
> accepted eventually.  It is *very* hard to reject a patch altogether in
> the CF process: you more or less have to show cause why it should not get
> accepted.  That default is backwards.  Maybe this isn't the process' fault
> exactly, but it sure seems like that's how we approach patches now.

These paragraphs point to different facets of the same problem: community
conventions make rejecting a patch a substantial project in its own right.  +1
for experimenting with one or more ways to expedite that.  There's your
two-ACKs idea above, and we floated[1] a few more at the 2015-06-16 meeting:

  This was followed by discussion of how we arrange the CFs. Noah suggested a
  prioritization system. He suggested a system where various committers can
  provide feedback on stuff as triage. Simon agreed and volunteered. Haas
  suggested a 2-committer veto. Frost said we need a formal way to "nack"
  something. We need to triage stuff earlier in the process. Haas says the big
  problem is the endless arguments because we don't want to say "no" to people
  because we have scarce committer time.

I recommend trying the two-ACKs plan for one CF.  Keep it if it helps
markedly; otherwise, try another of these variants for the CF after that one.

Thanks,
nm

[1] https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Meeting#9.6_Schedule




Re: [HACKERS] Declarative partitioning

2016-01-21 Thread Amit Langote

Hi,

On 2016/01/17 9:47, Corey Huinker wrote:
>> If we have a CREATE statement for each partition, how do we generalize
>> that to partitions at different levels? For example, if we use something
>> like the following to create a partition of parent_name:
>>
>> CREATE PARTITION partition_name OF parent_name FOR VALUES ...
>> WITH ... TABLESPACE ...
>>
>> Do we then say:
>>
>> CREATE PARTITION subpartition_name OF partition_name ...
> 
> That's how I'd want it for partitions created after the initial partitioned
> table is created.
> 
> I'd like to be able to identify the parent partition by it's own
> partitioning parameters rather than name, like the way we can derive the
> name of an index in ON CONFLICT. But I see no clean way to do that, and if
> one did come up, we'd simply allow the user to replace
> 
> with
> table_name PARTITION partition_spec [...PARTITION partition_spec [
> ...PARTITION turtles_all_the_way_down]]).
> 
> Again, totally fine with forcing the maintenance script to know or discover
> the name of the partition to be subpartitioned...for now.

I'm thinking of going with this last option.

So for now, you create an empty partitioned table specifying all the
partition keys, without being able to define any partitions in the same
statement. Partitions (and partitions thereof, if any) will be created
using CREATE PARTITION statements, one for each.
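To make that concrete, a rough sketch of how the proposed commands might fit
together (keywords and clause shapes are still under discussion here, so treat
the names and literals as purely illustrative):

-- Parent with a two-level partition key, created empty
CREATE TABLE measurement (logdate date, region text, value int)
    PARTITION BY RANGE (logdate)
    SUBPARTITION BY LIST (region);

-- A level-1 partition, then a level-2 partition (sub-partition) of it
CREATE PARTITION measurement_y2016 OF measurement
    FOR VALUES LESS THAN ('2017-01-01');
CREATE PARTITION measurement_y2016_eu OF measurement_y2016
    FOR VALUES IN ('eu');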

>> to create a level 2 partition (sub-partition) of parent_name? Obviously,
>> as is readily apparent from the command, it is still a direct partition of
>> partition_name for all internal purposes (consider partition list caching
>> in relcache, recursive tuple routing, etc.) save some others.
>>
>> I ask that also because it's related to the choice of syntax to use to
>> declare the partition key for the multi-level case. I'm considering the
>> SUBPARTITION BY notation and perhaps we could generalize it to more than
>> just 2 levels. So, for the above case, parent_name would have been created
>> as:
>>
>> CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ...
> 
>> Needless to say, when subpartition_name is created with the command we saw
>> a moment ago, the root partitioned table would be locked. In fact, adding
>> a partition anywhere in the hierarchy needs an exclusive lock on the root
>> table. Also, partition rule (the FOR VALUES clause) would be validated
>> against PARTITION BY or SUBPARTITION BY clause at the respective level.
>>
>> Although, I must admit I feel a little uneasy about the inherent asymmetry
>> in using SUBPARTITION BY for key declaration whereas piggybacking CREATE
>> PARTITION for creating sub-partitions. Is there a better way?
> 
> Provided that the syntax allows for N levels of partitioning, I don't care
> if it's
> PARTITION BY.., PARTITION BY..., PARTITION BY ...
> or
> PARTITION BY.., SUBPARTITION BY..., SUBPARTITION BY ...
> 
> The first is probably better for meta-coding purposes, but the second makes
> it clear which partition layer is first.

I prefer the second one too, for clarity. In any case, the PARTITION BY /
SUBPARTITION BY clauses are ordered. That is, successive clauses specify the
partition key for the corresponding successive levels of the hierarchy.

>> There's a great chance that not everyone cares right now about this part
>> of the new partitioning but just want to put it out there. There are more
>> contentious issues like the syntax, partitioning maintenance commands that
>> we plan to support (now or later) and such.
>>
> 
> What I've read so far addresses most of my concerns.
> 
> Still somewhat on my mind:
> 
> 1. ability to describe partition bounds via range types, regardless of
> whether the Automatic Tuple Routing uses those types internally.

Specifying a range partitioning bound as PARTITION FOR RANGE ... sounds
like it offers some flexibility, which can be seen as a good thing.
But it tends to make the internal logic slightly more complicated.

Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is
notationally simpler still and easier to work with internally. Also, there
will be no confusion about exclusivity of the bound if we document it so.

> 2. syntax for splitting a partition in two, merging two adjacent partitions
> (you probably touched on these earlier and I missed it or forgot).

Maintenance operations like split and merge would not be in the first cut.

> 3. ability to swap a partition with a table not currently associated with
> the partitioned table.

EXCHANGE PARTITION-like ability is something on my list. There are a few
hurdles to get there. Especially, the source table would have to have a
physical tuple descriptor (i.e., including dropped columns) compatible with
the partitioned table, given some other parts of the design (I touched on
those in my earlier message).

> 4. The applicability of this syntax to materialized views, allowing us to
> do REFRESH CONCURRENTLY a few parts at a time, or only refreshing the data
> we know needs it.
> 
> Items 2 and 3 don't have to 

Re: [HACKERS] Parallel Aggregate

2016-01-21 Thread Haribabu Kommi
On Fri, Jan 22, 2016 at 7:44 AM, David Rowley
 wrote:
> On 21 January 2016 at 18:26, Haribabu Kommi  wrote:
>> Here I attached update parallel aggregate patch on top of recent commits
>> of combine aggregate and parallel join patch. It still lacks of cost 
>> comparison
>> code to compare both parallel and normal aggregates.
>
> Thanks for the updated patch.
>
> I'm just starting to look over this now.
>
> # create table t1 as select x from generate_series(1,100) x(x);
> # vacuum ANALYZE t1;
> # set max_parallel_degree =8;
> # explain select sum(x) from t1;
>QUERY PLAN
> -
>  Aggregate  (cost=9633.33..9633.34 rows=1 width=4)
>->  Parallel Seq Scan on t1  (cost=0.00..8591.67 rows=416667 width=4)
> (2 rows)
>
> I'm not quite sure what's happening here yet as I've not run it
> through my debugger, but how can we have a Parallel Seq Scan without a
> Gather node? It appears to give correct results, so I can only assume
> it's not actually a parallel scan at all.
>
> Let's check:
>
> # select relname,seq_scan from pg_stat_user_tables where relname ='t1';
>  relname | seq_scan
> -+--
>  t1  |0
> (1 row)
>
> # explain analyze select sum(x) from t1;
> QUERY PLAN
> --
>  Aggregate  (cost=9633.33..9633.34 rows=1 width=4) (actual
> time=161.820..161.821 rows=1 loops=1)
>->  Parallel Seq Scan on t1  (cost=0.00..8591.67 rows=416667
> width=4) (actual time=0.051..85.348 rows=100 loops=1)
>  Planning time: 0.040 ms
>  Execution time: 161.861 ms
> (4 rows)
>
> # select relname,seq_scan from pg_stat_user_tables where relname ='t1';
>  relname | seq_scan
> -+--
>  t1  |1
> (1 row)
>
> Only 1 scan.
>
>
> # explain analyze select * from t1 where x=1;
>QUERY PLAN
> 
>  Gather  (cost=1000.00..10633.43 rows=1 width=4) (actual
> time=0.231..49.105 rows=1 loops=1)
>Number of Workers: 2
>->  Parallel Seq Scan on t1  (cost=0.00..9633.33 rows=0 width=4)
> (actual time=29.060..45.302 rows=0 loops=3)
>  Filter: (x = 1)
>  Rows Removed by Filter: 33
>  Planning time: 0.049 ms
>  Execution time: 51.438 ms
> (7 rows)
>
> # select relname,seq_scan from pg_stat_user_tables where relname ='t1';
>  relname | seq_scan
> -+--
>  t1  |4
> (1 row)
>
> 3 more scans. This one seems to actually be parallel, and makes sense
> based on "Number of Workers: 2"


The problem was that the Gather path generated on the partial path list was
not getting added to the path list, because of which there was a mismatch
between the sorted path and cheapest_path, leading to a wrong plan.

As a temporary fix, I marked the sorted_path and cheapest_path as the same,
and it works fine.


> Also looking at the patch:
>
> +bool
> +aggregates_allow_partial(Node *clause)
> +{
>
> In the latest patch that I sent on the combine aggregates thread:
> http://www.postgresql.org/message-id/CAKJS1f_in9J_ru4gPfygCQLUeB3=rzq3kg6rnpn-fzzhddi...@mail.gmail.com
> I made it so there's 3 possible return values from this function. As
> your patch stands now, if I create an aggregate function with an
> INTERNAL state with a combine function set, then this patch might try
> to parallel aggregate that and pass around the pointer to the internal
> state in the Tuple going from the worker to the main process, when the
> main process dereferences this pointer we'll get a segmentation
> violation. So I'd say you should maybe use a modified version of my
> latest aggregates_allow_partial() and check for PAT_ANY, and only
> parallelise the aggregate if you get that value.  If the use of
> partial aggregate was within a single process then you could be quite
> content with PAT_INTERNAL_ONLY. You'll just need to pull out the logic
> that checks for serial and deserial functions, since that's not in
> yet, and just have it return PAT_INTERNAL_ONLY if INTERNAL aggregates
> are found which have combine functions set.
>

I took the suggested code changes from the combine aggregate patch and
changed the patch accordingly.

Along with these changes, I added a float8 combine function to see how it
works under parallel aggregate. It works fine for float4, but gives a small
data mismatch with the float8 data type.

postgres=# select avg(f3), avg(f4) from tbl;
   avg|   avg
--+--
 1.1002384186 | 100.12344879
(1 row)

postgres=# set enable_parallelagg = true;
SET
postgres=# select avg(f3), avg(f4) from tbl;
   avg|   avg
--+--
 1.1002384186 | 100.12344918
(1 row)

Column 

Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2016-01-21 Thread Tom Lane
Vik Fearing  writes:
> I looked around for other places where this code should be used and
> didn't find any.  I am marking this patch Ready for Committer.

I pushed this with some adjustments, mainly to make sure that the
erroneous-units errors exactly match those that would be thrown in
the main code line.

regards, tom lane





Re: [HACKERS] Re: pglogical_output - a general purpose logical decoding output plugin

2016-01-21 Thread Craig Ringer
On 22 January 2016 at 06:13, Tomasz Rybak  wrote:


> + data stream. The output stream is designed to be compact and fast to
> decode,
> + and the plugin supports upstream filtering of data so that only the
> required
> + information is sent.
>
> plugin supports upstream filtering of data through hooks so that ...
>
>
Ok.


> + subset of that database may be selected for replication, currently based
> on
> + table and on replication origin. Filtering by a WHERE clause can be
> supported
> + easily in future.
>
> Is this filtering by table and replication origin implemented? I haven't
> noticed it in source.
>

That's what the hooks are for.


> + other daemon is required. It's accumulated using
>
> Stream of changes is accumulated...
>

Ok.


> + [the `CREATE_REPLICATION_SLOT ... LOGICAL ...` or `START_REPLICATION
> SLOT ... LOGICAL ...` commands](
> http://www.postgresql.org/docs/current/static/logicaldecoding-walsender.html)
> to start streaming changes. (It can also be used via
> + [SQL level functions](
> http://www.postgresql.org/docs/current/static/logicaldecoding-sql.html)
> + over a non-replication connection, but this is mainly for debugging
> purposes)
>
> Replication slot can also be configured (causing output plugin to be
> loaded) via [SQL level functions]...
>

Covered in the next section. Or at least it is in the SGML docs conversion
I'm still trying to finish off.


> + * Client issues `CREATE_REPLICATION_SLOT slotname LOGICAL 'pglogical'`
> if it's setting up for the first time
>
> * Client issues `CREATE_REPLICATION_SLOT slotname LOGICAL 'pglogical'` to
> setup replication if it's connecting for the first time
>

I disagree. It's entirely possible to do your slot creation/setup manually
or via something else, re-use a slot first created by another node, etc.
Slot creation is part of client setup, not so much connection.


> + Details are in the replication protocol docs.
>
> Add link to file with protocol documentation.
>

Is done in SGML conversion.


> + If your application creates its own slots on first use and hasn't
> previously
> + connected to this database on this system you'll need to create a
> replication
> + slot. This keeps track of the client's replay state even while it's
> disconnected.
>
> If your application hasn't previously connected to this database on this
> system
> it'll need to create and configure replication slot which keeps track of
> the
> client's replay state even while it's disconnected.
>

As above, I don't quite agree.


> + `pglogical`'s output plugin now sends a continuous series of `CopyData`
>
> As this is separate plugin, use 'pglogical_output' plugin now sends...
> (not only here but also in few other places).
>

Thanks. Done.

+ All hooks are called in their own memory context, which lasts for the
> duration
>
> All hooks are called in separate hook memory context, which lasts for the
> duration...
>

I don't see the difference, but OK.

Simplified:

> All hooks are called in a memory context that lasts ...


> + + switched to a longer lived memory context like TopMemoryContext.
> Memory allocated
> + + in the hook context will be automatically when the decoding session
> shuts down.
>
> ...will be automatically freed when the decoding...
>

Fixed, thanks.


> DDL for global object changes must be synchronized via some external means.
>
> Just:
> Global object changes must be synchronized via some external means.
>

Agree, done.


> + determine why an error occurs in a downstream, since you can examine a
> + json-ified representation of the xact. It's necessary to supply a minimal
>
> since you can examine a transaction in json (and not binary) format. It's
> necessary
>

Ok, done.


> + Once you've peeked the stream and know the LSN you want to discard up
> to, you
> + can use `pg_logical_slot_peek_changes`, specifying an `upto_lsn`, to
> consume
>
> Shouldn't it be pg_logical_slot_get_changes? get_changes consumes changes,
> peek_changes leaves them in the stream. Especially as example below
> points that we need to use get_changes.


Yes. It should. Whoops. Thanks.
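In other words, something like this (the slot name and LSN are illustrative,
and the plugin's required startup parameters are omitted for brevity):

-- Peek without consuming, to find the LSN you want to discard up to
SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);

-- Then consume (and so discard) everything up to that point
SELECT * FROM pg_logical_slot_get_changes('my_slot', '0/16B2B80', NULL);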


> DESIGN.md:
>
> + attnos don't necessarily correspond. The column names might, and their
> ordering
> + might even be the same, but any column drop or column type change will
> result
>
> The column names and their ordering might even be the same...
>

I disagree, that has a different meaning. It's also not really user-facing
documentation, so I'm not too worried about it being quite as readable.

README.pglogical_output_plhooks:
>
> + Note that pglogical
> + must already be installed so that its headers can be found.
>
> Note that pglogical_output must already...
>

Thanks.


> + Arguments are the oid of the affected relation, and the change type:
> 'I'nsert,
> + 'U'pdate or 'D'elete. There is no way to access the change data -
> columns changed,
> + new values, etc.
>
> Is it true (no way to access change data)? You added passing change
> to C hooks; from looking at code it looks like it's t

Re: [HACKERS] log_checkpoint's "0 transaction log file(s) added" is extremely misleading

2016-01-21 Thread Andres Freund
On January 22, 2016 3:29:44 AM GMT+01:00, Simon Riggs  
wrote:
>On 22 January 2016 at 01:12, Andres Freund  wrote:
>
>> Hi,
>>
>> While in theory correct, I think $subject is basically meaningless
>> because other backends may have added thousands of new segments. Yes,
>it
>> wasn't the checkpointer, but that's not particularly relevant
>> imo. Additionally, afaics, it will only ever be 0 or 1.
>>
>
>Even better, we could make it add >1

That'd indeed be good, but I don't think it will really address my complaint:
we'd still potentially create new segments outside the prealloc call, including
from within the checkpointer, when flushing WAL in order to write out a page.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] log_checkpoint's "0 transaction log file(s) added" is extremely misleading

2016-01-21 Thread Simon Riggs
On 22 January 2016 at 01:12, Andres Freund  wrote:

> Hi,
>
> While in theory correct, I think $subject is basically meaningless
> because other backends may have added thousands of new segments. Yes, it
> wasn't the checkpointer, but that's not particularly relevant
> imo. Additionally, afaics, it will only ever be 0 or 1.
>

Even better, we could make it add >1


> I think we should either remove that part of the log output, or make it
> display the number of segments added since the beginning of the
> checkpoint.
>

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


[HACKERS] log_checkpoint's "0 transaction log file(s) added" is extremely misleading

2016-01-21 Thread Andres Freund
Hi,

While in theory correct, I think $subject is basically meaningless
because other backends may have added thousands of new segments. Yes, it
wasn't the checkpointer, but that's not particularly relevant
imo. Additionally, afaics, it will only ever be 0 or 1.

I think we should either remove that part of the log output, or make it
display the number of segments added since the beginning of the
checkpoint.

Greetings,

Andres Freund




Re: [HACKERS] Combining Aggregates

2016-01-21 Thread Haribabu Kommi
On Thu, Jan 21, 2016 at 3:52 PM, David Rowley
 wrote:
> On 21 January 2016 at 15:53, Haribabu Kommi 
> wrote:
>>
>> On Thu, Jan 21, 2016 at 1:33 PM, Haribabu Kommi
>>  wrote:
>> >
>> > Here I attached updated patch of parallel aggregate based on the latest
>> > changes in master. Still it lack of cost comparison of normal aggregate
>> > to
>> > parallel aggregate because of difficulty. This cost comparison is
>> > required
>> > in parallel aggregate as this is having some regression when the number
>> > of groups are less in the query plan.
>> >
>>
>> Updated patch is attached after removing a warning in building group
>> aggregate path.
>
>
> Hi,
>
> Thanks for updating the patch. I'd like to look at this with priority, but
> can you post it on the Parallel Agg thread? that way anyone following there
> can chime in over there rather than here.  I've still got a bit of work to
> do (in the not too distant future) on the serial/deserial part, so would be
> better to keep this thread for discussion on that.

Thanks for the details. Sorry for sending the parallel aggregate patch to
this thread; I will take care of it from next time on.


Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] proposal: PL/Pythonu - function ereport

2016-01-21 Thread Pavel Stehule
2016-01-14 17:16 GMT+01:00 Catalin Iacob :

> On Wed, Jan 13, 2016 at 7:40 PM, Jim Nasby 
> wrote:
> > On 1/12/16 11:25 AM, Catalin Iacob wrote:
> >> They're similar but not really the same thing. raise Error and
> >> plpy.error are both ways to call ereport(ERROR, ...) while SPIError is
> >> raised when coming back after calling into Postgres to execute SQL
> >> that itself raises an error. Now indeed, that executed SQL raised an
> >> error itself via another ereport(ERROR, ...) somewhere but that's a
> >> different thing.
> >
> >
> > Why should they be different? An error is an error. You either want to
> trap
> > a specific type of error or you don't. Having two completely different
> ways
> > to do the same thing is just confusing.
>
> With my (indeed limited) understanding, I don't agree they're the same
> thing and I tried to explain above why I think they're quite
> different. You may not agree. If they are different things, using
> different exception types is natural.
>
> Consider this call chain: SQL code 1 -> PL/Python code 1 -> SPI (via
> plpy.execute) -> SQL code 2 -> PL/Python code 2
>
> If I'm writing PL/Python code 1 and I want to raise an error toward
> SQL code 1 I use raise plpy.Error. plpy.SPIError is what I get if I
> call into SQL code 2 and that has an error. That error could indeed
> come from a plpy.Error thrown by PL/Python code 2 or it could come
> from a SQL syntax error or whatever. But the symmetry holds:
> plpy.Error is what you raise to stop the query and return errors to
> your SQL caller and plpy.SPIError is what you get back if you use SPI
> to execute some other piece of SQL which has an error. I *think* (I'm
> an internals newbie though so I could be wrong) that SQL code 1
> doesn't call into PL/Python code 1 via SPI so why would the latter use
> something called SPIError to inform the former about an error?
>

That is not correct: outside PL/Python you get an error (a PostgreSQL error
has no classes), and the raising class (Error or SPIError) is not important.
Inside PL/Python you will get SPIError or one of its successors (based on
the SQL error code).

Currently, if I raise plpy.Error, it is immediately transformed into a
PostgreSQL error, and then into SPIError and its successors.


Re: [HACKERS]WIP: Covering + unique indexes.

2016-01-21 Thread David Rowley
On 20 January 2016 at 06:08, Anastasia Lubennikova
 wrote:
>
>
>
> 18.01.2016 01:02, David Rowley пишет:
>
> On 14 January 2016 at 08:24, David Rowley  
> wrote:
>>
>> I will try to review the omit_opclass_4.0.patch soon.
>
>
> Hi, as promised, here's my review of the omit_opclass_4.0.patch patch.
>
> Thank you again. All mentioned points are fixed and patches are merged.
> I hope it's all right now. Please check comments one more time. I rather 
> doubt that I wrote everything correctly.


Thanks for updating.

+for the searching or ordering of records can defined in the

should be:

+for the searching or ordering of records can be defined in the

but perhaps "defined" should be "included".

The following is still quite wasteful. CopyIndexTuple() does a
palloc() and memcpy(), and then you throw that away if
rel->rd_index->indnatts != rel->rd_index->indnkeyatts. I think you
just need to add an "else" and move the CopyIndexTuple() below the if.

item = (IndexTuple) PageGetItem(lpage, itemid);
  right_item = CopyIndexTuple(item);
+ if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts)
+ right_item = index_reform_tuple(rel, right_item,
rel->rd_index->indnatts, rel->rd_index->indnkeyatts);

Tom also committed
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=65c5fcd353a859da9e61bfb2b92a99f12937de3b
So it looks like you'll need to update your pg_am.h changes. Looks
like you'll need a new struct member in IndexAmRoutine and just
populate that new member in each of the *handler functions listed in
pg_am.h

-#define Natts_pg_am 30
+#define Natts_pg_am 31

Can the following be changed to

-   (At present, only b-tree supports it.)
+   (At present, only b-tree supports it.) Columns included with clause
+   INCLUDING  aren't used to enforce uniqueness.

-   (At present, only b-tree supports it.)
+   (At present, only b-tree supports it.) Columns which are present in the
 INCLUDING clause are not used to enforce uniqueness.
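For anyone skimming the thread, the feature under discussion is roughly of
this shape (table and column names are purely illustrative):

-- Unique only on (id); payload is stored in the index
-- but is not part of the key and does not affect uniqueness
CREATE UNIQUE INDEX orders_id_covering ON orders (id) INCLUDING (payload);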

> Also this makes me think that the name ii_KeyAttrNumbers is now out-of-date, 
> as it contains
> the including columns too by the looks of it. Maybe it just needs to drop the 
> "Key" and become
> "ii_AttrNumbers". It would be interesting to hear what others think of that.
>
> I'm also wondering if indexkeys is still a good name for the IndexOptInfo 
> struct member.
> Including columns are not really keys, but I feel renaming that might cause a 
> fair bit of code churn, so I'd be interested to hear what other's have to say.
>
>
> I agree that KeyAttrNumbers and indexkeys are a bit confusing names, but I'd 
> like to keep them at least in this patch.
> It may be worth doing "index structures refactoring" as a separate patch.


I agree. A separate patch sounds like the best course of action, but
authoring that can wait until after this is committed (I think).

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS]WIP: Covering + unique indexes.

2016-01-21 Thread Jeff Janes
On Tue, Jan 19, 2016 at 9:08 AM, Anastasia Lubennikova
 wrote:
>
>
> 18.01.2016 01:02, David Rowley пишет:
>
> On 14 January 2016 at 08:24, David Rowley 
> wrote:
>>
>> I will try to review the omit_opclass_4.0.patch soon.
>
>
> Hi, as promised, here's my review of the omit_opclass_4.0.patch patch.
>
> Thank you again. All mentioned points are fixed and patches are merged.
> I hope it's all right now. Please check comments one more time. I rather
> doubt that I wrote everything correctly.

Unfortunately there are several merge conflicts between your patch and
this commit:

commit 65c5fcd353a859da9e61bfb2b92a99f12937de3b
Author: Tom Lane 
Date:   Sun Jan 17 19:36:59 2016 -0500

Restructure index access method API to hide most of it at the C level.


Can you rebase past that commit?

Thanks,

Jeff




[HACKERS] Re: pglogical_output - a general purpose logical decoding output plugin

2016-01-21 Thread Tomasz Rybak
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Documentation - although I haven't yet gone through the protocol documentation:

README.md

+ data stream. The output stream is designed to be compact and fast to decode,
+ and the plugin supports upstream filtering of data so that only the required
+ information is sent.

plugin supports upstream filtering of data through hooks so that ...



+ subset of that database may be selected for replication, currently based on
+ table and on replication origin. Filtering by a WHERE clause can be supported
+ easily in future.

Is this filtering by table and replication origin implemented? I haven't
noticed it in source.



+ other daemon is required. It's accumulated using

Stream of changes is accumulated...



+ [the `CREATE_REPLICATION_SLOT ... LOGICAL ...` or `START_REPLICATION SLOT ... 
LOGICAL ...` 
commands](http://www.postgresql.org/docs/current/static/logicaldecoding-walsender.html)
 to start streaming changes. (It can also be used via
+ [SQL level 
functions](http://www.postgresql.org/docs/current/static/logicaldecoding-sql.html)
+ over a non-replication connection, but this is mainly for debugging purposes)

Replication slot can also be configured (causing output plugin to be loaded) 
via [SQL level functions]...




+ The overall flow of client/server interaction is:

The overall flow of client/server interaction is as follows:



+ * Client issues `CREATE_REPLICATION_SLOT slotname LOGICAL 'pglogical'` if 
it's setting up for the first time

* Client issues `CREATE_REPLICATION_SLOT slotname LOGICAL 'pglogical'` to setup 
replication if it's connecting for the first time



+ Details are in the replication protocol docs.

Add link to file with protocol documentation.




+ If your application creates its own slots on first use and hasn't previously
+ connected to this database on this system you'll need to create a replication
+ slot. This keeps track of the client's replay state even while it's 
disconnected.

If your application hasn't previously connected to this database on this system
it'll need to create and configure replication slot which keeps track of the
client's replay state even while it's disconnected.




+ `pglogical`'s output plugin now sends a continuous series of `CopyData`

As this is separate plugin, use 'pglogical_output' plugin now sends...
(not only here but also in few other places).




+ All hooks are called in their own memory context, which lasts for the duration

All hooks are called in separate hook memory context, which lasts for the 
duration...





+ + switched to a longer lived memory context like TopMemoryContext. Memory 
allocated
+ + in the hook context will be automatically when the decoding session shuts 
down.

...will be automatically freed when the decoding...




DDL for global object changes must be synchronized via some external means.

Just:
Global object changes must be synchronized via some external means.





+ determine why an error occurs in a downstream, since you can examine a
+ json-ified representation of the xact. It's necessary to supply a minimal

since you can examine a transaction in json (and not binary) format. It's 
necessary





+ discard up to, as identifed by LSN (log sequence number). See

identified




+ Once you've peeked the stream and know the LSN you want to discard up to, you
+ can use `pg_logical_slot_peek_changes`, specifying an `upto_lsn`, to consume

Shouldn't it be pg_logical_slot_get_changes? get_changes consumes changes,
peek_changes leaves them in the stream. Especially as example below
points that we need to use get_changes.



+ up to but not including that point, i.e. that will be the
+ point at which replay resumes.

IMO it's better to introduce a new sentence:
that point. This will be the point at which replay resumes.



DESIGN.md:

+ attnos don't necessarily correspond. The column names might, and their 
ordering
+ might even be the same, but any column drop or column type change will result

The column names and their ordering might even be the same...









README.pglogical_output_plhooks:

+ Note that pglogical
+ must already be installed so that its headers can be found.

Note that pglogical_output must already...



+ Arguments are the oid of the affected relation, and the change type: 'I'nsert,
+ 'U'pdate or 'D'elete. There is no way to access the change data - columns 
changed,
+ new values, etc.

Is it true (no way to access change data)? You added passing the change
to the C hooks; from looking at the code it looks like it's true, but I want
to be sure.



Re: [HACKERS] Combining Aggregates

2016-01-21 Thread David Rowley
On 22 January 2016 at 06:56, Robert Haas  wrote:
> On Wed, Jan 20, 2016 at 8:32 PM, David Rowley
>  wrote:
>> The other two usages which I have thought of are;
>>
>> 1) Aggregating before UNION ALL, which might be fairly simple after the
>> grouping planner changes, as it may just be a matter of considering another
>> "grouping path" which partially aggregates before the UNION ALL, and
>> performs the final grouping stage after UNION ALL. At this stage it's hard
>> to say how that will work as I'm not sure how far changes to the grouping
>> planner will go. Perhaps Tom can comment?
>
> I hope he will, but in the meantime let me ask how this does us any
> good.  UNION ALL turns into an Append node.  Pushing aggregation
> through an Append doesn't make anything faster AFAICS *unless* you can
> further optimize beginning at that point.  For example, if one or both
> sides of the Append node have a Gather under them, and you can push
> the partial aggregation down into the Gather; or if you hit a Foreign
> Scan and can have it do partial aggregation on the remote side, you
> win.  But if you just have:
>
> Finalize Aggregate
> -> Append
>   -> Partial Aggregate
> -> Thing One
>   -> Partial Aggregate
> -> Thing Two
>
> Instead of:
>
> Aggregate
> -> Append
>   -> Thing One
>   -> Thing Two
>
> ...then I don't see the point, at least not for a single-group
> Aggregate or HashAggregate.  For a GroupAggregate, Thing One and Thing
> Two might need to be sorted, and sorting two or more smaller data sets
> might be faster than sorting one larger data set, but I can't see us
> winning anything very big here.
>
> To be clear, I'm not saying we shouldn't do this.  I just don't think
> it does much *by itself*.  If you combine it with other optimizations
> that let the aggregation be pushed further down the plan tree, it
> could win big.

Yes, I agree, it's not a big win, at least not in the case of a serial
plan. If each branch of the UNION ALL could be processed in parallel,
then that's different.

It's quite simple to test how much of a win it'll be in the serial
case today, and yes, it's not much, but it's a bit.

create table t1 as select x from generate_series(1,100) x(x);
vacuum analyze t1;
select count(*) from (select * from t1 union all select * from t1) t;
  count
-
 200
(1 row)

Time: 185.793 ms

-- Mock up pushed down aggregation by using sum() as a combine
function for count(*)
select sum(c) from (select count(*) c from t1 union all select
count(*) from t1) t;
   sum
-
 200
(1 row)

Time: 162.076 ms

Not particularly incredible, but we don't normally turn our noses up
at a 14% improvement, so let's just see how complex it will be to
implement, once the upper planner changes are done.

But as you mention about lack of ability to make use of pre-sorted
Path for each branch of the UNION ALL; I was really hoping Tom's patch
will improve that part by allowing the planner to choose a pre-sorted
Path and perform a MergeAppend instead of an Append, which would allow
pre-sorted input into a GroupAggregate node.
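For example, a query of this shape is what I have in mind (t2 is hypothetical
here; assume both branches could supply input pre-sorted on x, e.g. via an
index):

-- Each branch could feed a MergeAppend, which in turn could feed a
-- GroupAggregate without an explicit sort of the whole UNION ALL result.
SELECT x, count(*)
FROM (SELECT x FROM t1 UNION ALL SELECT x FROM t2) t
GROUP BY x
ORDER BY x;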

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Set search_path + server-prepared statements = cached plan must not change result type

2016-01-21 Thread Vladimir Sitnikov
Robert>Are you really seeing the same behavior in all versions?

I do not have "pre 9.1" at hand; however, 9.1, 9.2, 9.3, 9.4, and
9.5 are all affected.

9.1 just silently executes "old statement" as if search_path was not
modified at all.
9.2, 9.3, 9.4, and 9.5 all fail with "cached plan must not change
result type" error.

See java-based test in [1], and build logs for 9.1-9.4 in [2]

I do not have a brand new 9.5 at hand; however, I think 9.5rc1 is good enough:
"PostgreSQL 9.5rc1 on x86_64-apple-darwin15.2.0, compiled by Apple
LLVM version 7.0.0 (clang-700.1.76), 64-bit"

Here's my test case:

select version();

create schema customer1;
create table customer1.test(i int4);

create schema customer2;
create table customer2.test(i varchar);

set search_path to customer1,public;
prepare stmt as select * from test;

set search_path to customer2,public;

execute stmt;

--ERROR: cached plan must not change result type
--SQL state: 0A000

[1]: 
https://github.com/pgjdbc/pgjdbc/commit/8fcd07a24666de308419d54e49e2f65f40661e2a#diff-526a72847ed4c9f31f699515d06e508bR188
[2]: https://travis-ci.org/pgjdbc/pgjdbc/builds/103940843

Vladimir




Re: [HACKERS] Parallel Aggregate

2016-01-21 Thread David Rowley
On 21 January 2016 at 18:26, Haribabu Kommi  wrote:
> Here I attached update parallel aggregate patch on top of recent commits
> of combine aggregate and parallel join patch. It still lacks of cost 
> comparison
> code to compare both parallel and normal aggregates.

Thanks for the updated patch.

I'm just starting to look over this now.

# create table t1 as select x from generate_series(1,1000000) x(x);
# vacuum ANALYZE t1;
# set max_parallel_degree =8;
# explain select sum(x) from t1;
                               QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=9633.33..9633.34 rows=1 width=4)
   ->  Parallel Seq Scan on t1  (cost=0.00..8591.67 rows=416667 width=4)
(2 rows)

I'm not quite sure what's happening here yet as I've not run it
through my debugger, but how can we have a Parallel Seq Scan without a
Gather node? It appears to give correct results, so I can only assume
it's not actually a parallel scan at all.

Let's check:

# select relname,seq_scan from pg_stat_user_tables where relname ='t1';
 relname | seq_scan
---------+----------
 t1      |        0
(1 row)

# explain analyze select sum(x) from t1;
QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=9633.33..9633.34 rows=1 width=4) (actual
time=161.820..161.821 rows=1 loops=1)
   ->  Parallel Seq Scan on t1  (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.051..85.348 rows=1000000 loops=1)
 Planning time: 0.040 ms
 Execution time: 161.861 ms
(4 rows)

# select relname,seq_scan from pg_stat_user_tables where relname ='t1';
 relname | seq_scan
---------+----------
 t1      |        1
(1 row)

Only 1 scan.


# explain analyze select * from t1 where x=1;
                            QUERY PLAN
------------------------------------------------------------------
 Gather  (cost=1000.00..10633.43 rows=1 width=4) (actual
time=0.231..49.105 rows=1 loops=1)
   Number of Workers: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..9633.33 rows=0 width=4)
(actual time=29.060..45.302 rows=0 loops=3)
 Filter: (x = 1)
         Rows Removed by Filter: 333333
 Planning time: 0.049 ms
 Execution time: 51.438 ms
(7 rows)

# select relname,seq_scan from pg_stat_user_tables where relname ='t1';
 relname | seq_scan
---------+----------
 t1      |        4
(1 row)

3 more scans. This one seems to actually be parallel, and makes sense
based on "Number of Workers: 2"


Also looking at the patch:

+bool
+aggregates_allow_partial(Node *clause)
+{

In the latest patch that I sent on the combine aggregates thread:
http://www.postgresql.org/message-id/CAKJS1f_in9J_ru4gPfygCQLUeB3=rzq3kg6rnpn-fzzhddi...@mail.gmail.com
I made it so there's 3 possible return values from this function. As
your patch stands now, if I create an aggregate function with an
INTERNAL state with a combine function set, then this patch might try
to parallel aggregate that and pass around the pointer to the internal
state in the Tuple going from the worker to the main process, when the
main process dereferences this pointer we'll get a segmentation
violation. So I'd say you should maybe use a modified version of my
latest aggregates_allow_partial() and check for PAT_ANY, and only
parallelise the aggregate if you get that value.  If the use of
partial aggregate was within a single process then you could be quite
content with PAT_INTERNAL_ONLY. You'll just need to pull out the logic
that checks for serial and deserial functions, since that's not in
yet, and just have it return PAT_INTERNAL_ONLY if INTERNAL aggregates
are found which have combine functions set.
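
To be concrete, the guard I have in mind is little more than this (a
sketch only: aggregates_allow_partial() and the PAT_* values are the ones
from the combine aggregates patch linked above, while the surrounding
names are made up):

	/*
	 * Only consider parallel aggregation when every aggregate can be
	 * combined across processes.  PAT_INTERNAL_ONLY (an INTERNAL trans
	 * state with a combine function but no serialize/deserialize support)
	 * is fine for partial aggregation within one process, but not here:
	 * the worker would hand the leader a pointer into its own memory.
	 */
	if (aggregates_allow_partial((Node *) tlist) != PAT_ANY)
		return;		/* fall back to a normal, non-parallel Aggregate */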


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] count_nulls(VARIADIC "any")

2016-01-21 Thread Pavel Stehule
Hi

2016-01-17 8:43 GMT+01:00 Pavel Stehule :

>
>
> 2016-01-12 17:27 GMT+01:00 Marko Tiikkaja :
>
>> On 03/01/16 22:49, Jim Nasby wrote:
>>
>>> In the unit test, I'd personally prefer just building a table with the
>>> test cases and the expected NULL/NOT NULL results, at least for all the
>>> calls that would fit that paradigm. That should significantly reduce the
>>> size of the test. Not a huge deal though...
>>>
>>
>> I don't really see the point.  "The size of the test" doesn't seem like a
>> worthwhile optimization target, unless the test scripts are somehow really
>> unnecessarily large.
>>
>> Further, if you were developing code related to this, previously you
>> could just copy-paste the defective test case in order to easily reproduce
>> a problem.  But now suddenly you need a ton of different setup.
>>
>> I don't expect to really have a say in this, but I think the tests are
>> now worse than they were before.
>>
>
> the form of regress tests is not pretty significant issue. Jim's design is
> little bit transparent, Marko's is maybe little bit practical. Both has
> sense from my opinion, and any hasn't significant advantage against other.
>

Is there any agreement on how these tests should be designed?

It's a simple patch with simple regression tests, so there is no reason
for a long wait.

Regards

Pavel


> Regards
>
> Pavel
>
>
>>
>>
>> .m
>>
>
>


Re: [HACKERS] WIP: Failover Slots

2016-01-21 Thread Simon Riggs
On 21 January 2016 at 16:31, Robert Haas  wrote:

> On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs  wrote:
> > Failover Slots
> > In the current patch, any slot defined on a master will generate WAL,
> > leading to a pending-slot being present on all standby nodes. When a
> standby
> > is promoted, the slot becomes usable and will have the properties as of
> the
> > last fsync on the master.
>
> No objection to the concept, but I think the new behavior needs to be
> optional.  I am guessing that you are thinking along similar lines,
> but it's not explicitly called out here.
>

I was unsure myself; but making them optional seems reasonable.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] custom function for converting human readable sizes to bytes

2016-01-21 Thread Pavel Stehule
2016-01-21 11:51 GMT+01:00 Vitaly Burovoy :

> On 1/20/16, Pavel Stehule  wrote:
> > ...
> > New version is attached
> >
> > Regards
> > Pavel
>
> I'm sorry I'll do a review only tonight.
>

ook :)

Thank you

Pavel



> --
> Best regards,
> Vitaly Burovoy
>


Re: [HACKERS] Combining Aggregates

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 9:33 PM, Haribabu Kommi
 wrote:
> Here I attached updated patch of parallel aggregate based on the latest
> changes in master. Still it lack of cost comparison of normal aggregate to
> parallel aggregate because of difficulty. This cost comparison is required
> in parallel aggregate as this is having some regression when the number
> of groups are less in the query plan.

Please keep this on its own thread rather than colonizing this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Combining Aggregates

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 8:32 PM, David Rowley
 wrote:
> At the moment I think everything which will use this is queued up behind the
> pathification of the grouping planner which Tom is working on. I think
> naturally Parallel Aggregate makes sense to work on first, given all the
> other parallel stuff in this release. I plan on working on that that by
> either assisting Haribabu, or... whatever else it takes.

Great!

> The other two usages which I have thought of are;
>
> 1) Aggregating before UNION ALL, which might be fairly simple after the
> grouping planner changes, as it may just be a matter of considering another
> "grouping path" which partially aggregates before the UNION ALL, and
> performs the final grouping stage after UNION ALL. At this stage it's hard
> to say how that will work as I'm not sure how far changes to the grouping
> planner will go. Perhaps Tom can comment?

I hope he will, but in the meantime let me ask how this does us any
good.  UNION ALL turns into an Append node.  Pushing aggregation
through an Append doesn't make anything faster AFAICS *unless* you can
further optimize beginning at that point.  For example, if one or both
sides of the Append node have a Gather under them, and you can push
the partial aggregation down into the Gather; or if you hit a Foreign
Scan and can have it do partial aggregation on the remote side, you
win.  But if you just have:

Finalize Aggregate
-> Append
  -> Partial Aggregate
    -> Thing One
  -> Partial Aggregate
    -> Thing Two

Instead of:

Aggregate
-> Append
  -> Thing One
  -> Thing Two

...then I don't see the point, at least not for a single-group
Aggregate or HashAggregate.  For a GroupAggregate, Thing One and Thing
Two might need to be sorted, and sorting two or more smaller data sets
might be faster than sorting one larger data set, but I can't see us
winning anything very big here.

To be clear, I'm not saying we shouldn't do this.  I just don't think
it does much *by itself*.  If you combine it with other optimizations
that let the aggregation be pushed further down the plan tree, it
could win big.

> 2) Group before join. e.g select p.description,sum(s.qty) from sale s inner
> join s.product_id = p.product_id group by s.product_id group by
> p.description;  I have a partial patch which implements this, although I was
> a bit stuck on if I should invent the concept of "GroupingPaths", or just
> inject alternative subquery relations which are already grouped by the
> correct clause.  The problem with "GroupingPaths" was down to the row
> estimates currently come from the RelOptInfo and is set in
> set_baserel_size_estimates() which always assumes the ungrouped number of
> rows, which is not what's needed if the grouping is already performed. I was
> holding off to see how Tom does this in the grouping planner changes.

Your sample query has two GROUP BY clauses; I think you meant to
include only the second one.  This seems like it can win big, because
it's possible that joining first means joining against a much larger
relation, whereas aggregating first might reduce "sale" to a volume of
data that fits in work_mem, after which you can hash join.  On the
other hand, if you add WHERE p.description LIKE '%snuffleupagus%' then
the aggregate-first strategy figures to lose, assuming things are
indexed properly.

On the substantive question you raise, I hope Tom will weigh in here.
However, I'll give you my take.  It seems to me that you are likely to
run into problems not only with the row counts but also with target
lists and width estimates.  All of those things change once you
aggregate.  It seems clear that you are going to need a GroupingPath
here, but it also seems really clear that you can't take a path where
some aggregation has been done and use add_path to toss it on the pile
for the same RelOptInfo as unaggregated paths to which it is not
comparable.  Right now, there's a RelOptInfo corresponding to each
subset of the joinrels for which we build paths; but in this new
world, I think you'll end up with multiple joinrels for each
RelOptInfo, one per

In your example above, you'd end up with RelOptInfos for (1) {s} with
no aggregation, (2) {p} with no aggregation, (3) {s p} with no
aggregation, (4) {s} aggregated by s.product_id and (5) {s p}
aggregated by p.description.  You can generate paths for (5) either by
joining (2) to (4) or by aggregating (3), and the cheapest path for
(5) is the overall winner.  It seems a bit tricky to determine which
aggregations are worth considering, and how to represent that in the
RelOptInfo, but overall that feels like the right idea.

I'm not sure how much this is going to overlap with the work Tom is
already doing.  I think his principal goal is to let the planning of a
subquery return multiple candidate paths.  That's almost certain to
involve creating something like GroupingPath, though I can't predict
the details, and I suspect it's also likely that he'll create some new
RelOptInfo that rep

Re: [HACKERS] Batch update of indexes

2016-01-21 Thread Konstantin Knizhnik



On 21.01.2016 19:09, Anastasia Lubennikova wrote:
What I meant is more like a BRIN-like combination of an index scan and 
heap scan.
Maybe it could be called "deferred inserts" or "temporary read-only 
index"
Maybe it's similar with mysql insert buffer 
http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html

I think it'll be more clear with example. Please don't care about syntax.

CREATE TABLE tbl (c1 int);
CREATE INDEX idx on tbl(c1);

SET enable_deferred_insert(idx) = on;
At this moment, we save the last_indexed_item (its TID) somewhere in 
index metapage.


Since that moment, the data inserted into the table doesn't touch the 
index.
We perform some heavy insert and then go back to the normal index 
behavior.


SET enable_deferred_insert(idx) = off;
This command takes all the data between the last_indexed_item and the 
end of the table, and inserts it into the index at a time.


Of course there are new problems to deal with, but it's really useful 
for the use case to balance irregular heavy write load, isn't it?


BTW, could you explain, what is the reason to copy data into the 
pending list and then copy it again while flushing pending list into 
the index? Why not read this data directly from the table? I feel that 
I've missed something important here.


No, I do not think that inserted data should be placed in the pending list
and then copied to the main table.
It should be stored directly in the main table, and the "pending list" is
just some fast, transient index.


From my point of view there are two possibilities:
1. Preserve strict table-index consistency: query results should not
depend on the presence of the index.
2. Support out-of-date or deferred indexes, which can be updated in the
background.


The second approach is certainly more efficient, and IMHO it is acceptable
for most users.

But we would be violating one of the fundamental properties of RDBMSes...
So I am not sure which approach to choose.

The first case is also harder to implement, because we have to somehow
merge two indexes during an index scan and provide proper recovery of the
main index in case of failure (assuming that the pending list is
maintained in memory and is lost after a crash).



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Batch update of indexes

2016-01-21 Thread Konstantin Knizhnik



On 21.01.2016 10:14, Simon Riggs wrote:
On 21 January 2016 at 06:41, konstantin knizhnik  wrote:


Certainly for B-Tree we can organize insert buffer (or pending
list) as sorted array or also as a tree.
But in both case complexity of search in this buffer will be
O(log(N)), not O(1).


You are right; search within this buffer is O(log(N)). But we can test 
whether we need to search in the buffer at all with O(1).


Only if records are inserted in key-ascending order...
But usually we need to maintain more than one index, and for them it
is not true.
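
For illustration, the O(1) pre-check being discussed could be as simple as
keeping the key range covered by the pending buffer (purely a sketch, no
such code exists in any patch; all the names here are invented):

	/* Fast path: the pending buffer cannot possibly contain this key */
	if (pending->nitems == 0 ||
		key < pending->min_key || key > pending->max_key)
		return search_main_tree(index, key);

	/* Otherwise both the main tree and the pending buffer must be searched */
	return merge_scans(search_main_tree(index, key),
					   search_pending_buffer(pending, key));

And that is exactly where the objection bites: with several indexes in
different key orders, at least one buffer's [min, max] range quickly grows
to cover almost everything, and the fast path stops helping.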




--
Simon Riggs http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Re: pglogical_output - a general purpose logical decoding output plugin

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 12:54 AM, Craig Ringer  wrote:
> The idea here is that we want downwards compatibility as far as possible and
> maintainable but we can't really be upwards compatible for breaking protocol
> revisions. So the output plugin's native protocol version is inherently the
> max protocol version and we don't need a separate MAX.

That idea seems right to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: pglogical_output - a general purpose logical decoding output plugin

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 8:04 PM, Craig Ringer  wrote:
> itself is an abbreviation of its self.

I do not think this is true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] removal of unused argument in ginInsertCleanup()

2016-01-21 Thread Fujii Masao
On Thu, Jan 21, 2016 at 9:15 AM, Michael Paquier
 wrote:
> On Wed, Jan 20, 2016 at 11:32 PM, Fujii Masao  wrote:
>> I found the unused argument "vac_delay" in ginInsertCleanup().
>> I think that we should remove it. Patch attached.
>
> Visibly an oversight of dc943ad, +1 for nuking it.

Ok, pushed!

Regards,

-- 
Fujii Masao




Re: [HACKERS] Set search_path + server-prepared statements = cached plan must not change result type

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 10:23 AM, Vladimir Sitnikov
 wrote:
>>  I believe, and the conclusion was that
>>if you think you need this, you're doing it wrong
>
> So what is the recommended approach to use server-prepared statements
> at the client side (I mean at JDBC driver side)?
>
> Currently "prepare, switch search_path, execute" leads to "cached plan
> must not change result type" error.
> Can one expect the issue to be fixed in subsequent 8.4, 8.5, ..., 9.5 
> versions?

Are you really seeing the same behavior in all versions?  Because I
thought we changed this pretty significantly in this commit:

commit 0d5fbdc157a17abc379052f5099b1c29a33cebe2
Author: Tom Lane 
Date:   Fri Jan 25 14:14:41 2013 -0500

Change plan caching to honor, not resist, changes in search_path.

In the initial implementation of plan caching, we saved the active
search_path when a plan was first cached, then reinstalled that path
anytime we needed to reparse or replan.  The idea of that was to try to
reselect the same referenced objects, in somewhat the same way that views
continue to refer to the same objects in the face of schema or name
changes.  Of course, that analogy doesn't bear close inspection, since
holding the search_path fixed doesn't cope with object drops or renames.
Moreover sticking with the old path seems to create more surprises than
it avoids.  So instead of doing that, consider that the cached plan depends
on search_path, and force reparse/replan if the active search_path is
different than it was when we last saved the plan.

This gets us fairly close to having "transparency" of plan caching, in the
sense that the cached statement acts the same as if you'd just resubmitted
the original query text for another execution.  There are still some corner
cases where this fails though: a new object added in the search path
schema(s) might capture a reference in the query text, but we'd not realize
that and force a reparse.  We might try to fix that in the future, but for
the moment it looks too expensive and complicated.
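
The guts of that change, from memory, is roughly this check in
plancache.c's RevalidateCachedQuery() (a sketch, not the exact code):

	/*
	 * If the cached query is still marked valid, check whether the
	 * search_path saved at plan time matches the active one; if not,
	 * force a reparse/replan from the original query text.
	 */
	if (plansource->is_valid &&
		!OverrideSearchPathMatchesCurrent(plansource->search_path))
		plansource->is_valid = false;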

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2016-01-21 Thread Robert Haas
On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs  wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional.  I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Batch update of indexes

2016-01-21 Thread Anastasia Lubennikova


20.01.2016 17:55, Konstantin Knizhnik:

Hi,


Hi, I'm glad to see that you are interested in that too.
I think this is a good feature and I think it will be very useful to
have.
I have already mentioned some related problems and possible
improvements in my presentation.
http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts

The last two slides concern this thread. Briefly, I've suggested thinking
about an insertion buffer. Something very similar to it is already
implemented in BRIN: it does not index the last data from the heap while
the number of trailing pages is less than pages_per_range.


Do you mean GIN-like usage of an insertion buffer (there it is called the
"pending list")?
So that we have to combine a search in the main tree with a search in the
insert buffer?
Actually this is what I wanted to avoid (because at least in the case of
GIN the pending list causes significant performance degradation,
while an up-to-date state of the full-text index is rarely required).



What I meant is more like a BRIN-like combination of an index scan and 
heap scan.

Maybe it could be called "deferred inserts" or "temporary read-only index"
Maybe it's similar to the MySQL insert buffer:
http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html

I think it'll be clearer with an example. Please don't mind the syntax.

CREATE TABLE tbl (c1 int);
CREATE INDEX idx on tbl(c1);

SET enable_deferred_insert(idx) = on;
At this moment, we save the last_indexed_item (its TID) somewhere in 
index metapage.


Since that moment, the data inserted into the table doesn't touch the index.
We perform some heavy insert and then go back to the normal index behavior.

SET enable_deferred_insert(idx) = off;
This command takes all the data between the last_indexed_item and the
end of the table, and inserts it into the index all at once.


Of course there are new problems to deal with, but it's really useful
for the use case of balancing an irregular heavy write load, isn't it?


BTW, could you explain what the reason is to copy data into the pending
list and then copy it again while flushing the pending list into the index?
Why not read this data directly from the table? I feel that I've missed
something important here.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Expanded Objects and Order By

2016-01-21 Thread Tom Lane
Paul Ramsey  writes:
> Thank the Maker, it is reproduceable: returning an expanded header in the _in 
> function is not appreciated in a very narrow number of cases.

BTW, on further poking around: if you'd had RANDOMIZE_ALLOCATED_MEMORY
enabled, returning an expanded object from an input function would have
failed this test in stringTypeDatum():

#ifdef RANDOMIZE_ALLOCATED_MEMORY

/*
 * For pass-by-reference data types, repeat the conversion to see if the
 * input function leaves any uninitialized bytes in the result.  We can
 * only detect that reliably if RANDOMIZE_ALLOCATED_MEMORY is enabled, so
 * we don't bother testing otherwise.  The reason we don't want any
 * instability in the input function is that comparison of Const nodes
 * relies on bytewise comparison of the datums, so if the input function
 * leaves garbage then subexpressions that should be identical may not get
 * recognized as such.  See pgsql-hackers discussion of 2008-04-04.
 */
if (string && !typform->typbyval)
{
Datumresult2;

result2 = OidInputFunctionCall(typinput, string,
   typioparam, atttypmod);
if (!datumIsEqual(result, result2, typform->typbyval, typform->typlen))
elog(WARNING, "type %s has unstable input conversion for \"%s\"",
 NameStr(typform->typname), string);
}
#endif

The pointer values in the two objects could not be equal so datumIsEqual
would certainly fail.  So one way of looking at this is that the case
indeed should be considered unsupported.

However, what I'm realizing from this discussion is that we need to have a
stronger convention on what we put into Const nodes.  Repeatable results
from the input function are certainly necessary, but we *also* need to
forbid short-header, compressed, or externally toasted values.  Otherwise
we will have Consts that aren't very Const (if the external value goes
away) or fail to compare equal to other Consts that they reasonably should
compare equal to.

Seen in that light, applying pg_detoast_datum() to all varlena values
that are going into Consts is sensible, and it'll take care of the
expanded-object variant of the issue as well.

I've not coded the fix yet, but it looks like the thing to do is to
put "if not-null and typlen==-1 then pg_detoast_datum()" logic into
makeConst() and coerce_type(), which so far as I can find are the only
places that build new Const nodes.  Also, the above-quoted
repeatable-results check has to be applied *after* the detoast step,
else it'll still be a hazard for expanded objects.  However, coerce_type()
is the only caller of stringTypeDatum(), so what we can do is move the
repeatable-results check over to coerce_type() and have it repeat
pg_detoast_datum() as well as the input-function call proper.
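
Concretely, the makeConst() part of that should be little more than this
(a sketch using the existing parameter names, not the final code):

	/*
	 * Flatten any toasted, compressed or expanded varlena into a plain
	 * in-line value before storing it in the Const, so that Consts can be
	 * compared bytewise and never depend on out-of-line toast data.
	 */
	if (!constisnull && constlen == -1)
		constvalue = PointerGetDatum(PG_DETOAST_DATUM(constvalue));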

regards, tom lane




Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-21 Thread Shulgin, Oleksandr
On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas  wrote:

> On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer 
> wrote:
> > It enters COPY BOTH mode before it invokes the startup callback. The
> client
> > has no way to unilaterally terminate COPY BOTH mode and return to the
> normal
> > walsender protocol. The server doesn't do it when there's an ERROR. So
> the
> > only option the client has for recovery is to disconnect and reconnect.
>
> This isn't my understanding of how the protocol works, and it's not
> what the documentation says either.
>
> Per http://www.postgresql.org/docs/current/static/protocol-flow.html:
>
> ===
> In the event of a backend-detected error during copy-both mode, the
> backend will issue an ErrorResponse message, discard frontend messages
> until a Sync message is received, and then issue ReadyForQuery and
> return to normal processing. The frontend should treat receipt of
> ErrorResponse as terminating the copy in both directions; no CopyDone
> should be sent in this case.
> ===
>
> So it's true that the client can't unilaterally terminate COPY BOTH
> mode; it can only send CopyDone.  But an error on the server side
> should do so.
>

Hm, you're right.  Even though the server sends the COPY_BOTH message before
the plugin startup, an attempt by the client to actually read the data
messages results in an ErrorResponse being handled on the client; the client
can then re-submit the corrected START_REPLICATION command and continue
without the need to reconnect.  This can't actually be tested in psql, but I
can verify the behavior with my python scripts.

Then I don't have a preference for the early error reporting in this case.
If the current behavior potentially allows for more flexible error
reporting, I'm for keeping it.

--
Alex


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer  wrote:
> It enters COPY BOTH mode before it invokes the startup callback. The client
> has no way to unilaterally terminate COPY BOTH mode and return to the normal
> walsender protocol. The server doesn't do it when there's an ERROR. So the
> only option the client has for recovery is to disconnect and reconnect.

This isn't my understanding of how the protocol works, and it's not
what the documentation says either.

Per http://www.postgresql.org/docs/current/static/protocol-flow.html:

===
In the event of a backend-detected error during copy-both mode, the
backend will issue an ErrorResponse message, discard frontend messages
until a Sync message is received, and then issue ReadyForQuery and
return to normal processing. The frontend should treat receipt of
ErrorResponse as terminating the copy in both directions; no CopyDone
should be sent in this case.
===

So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone.  But an error on the server side
should do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-21 Thread Anastasia Lubennikova

30.12.2015 16:01, Aleksander Alekseev:

Here is a clean version of the patch. Step 1 is an optimization. Step 2
refactors this:

  HTAB *
  ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size,   /* initial table size */
+ long init_size,   /* initial table size XXX ignored,
  refactor! */


Hi, I found that the patch is still not reviewed, so I've looked it over.
I haven't dived into this subject before, so my comments will be more
general than specific to your investigation.
Sorry if some things seem like nitpicks; I just suppose that a clean patch
would be more convenient for reviewers.


First of all, why not merge both patches into one? They aren't too big
anyway.
Also, this thread has become quite tangled. I think it's worth writing a new
message with the patch, the test script, some results and a brief overview
of how it really works. That will make further review much easier.


+   /* number of entries in hash table */
+   longnentries[NUM_LOCK_PARTITIONS];
+   /* linked list of free elements */
+   HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];

I think comments should be changed, to be more informative here.

+   if (IS_PARTITIONED(hashp->hctl))
+   partitions_number = NUM_LOCK_PARTITIONS;
+   else
+   partitions_number = 1;
+
+   nelem_alloc = ((int) nelem) / partitions_number;
+   if (nelem_alloc == 0)
+   nelem_alloc = 1;

Add a comment here too.

-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  128
-
 /* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS  4
+#define LOG2_NUM_LOCK_PARTITIONS  7
 #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+

I'm sure it was discussed above, but these definitions do not look
obvious at all.

Maybe you should explain this magic number 7 in the comment above?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-01-21 Thread Ashutosh Bapat
> > In such a
> > case, which userid should be stored in UserMapping structure?It might
> look
> > like setting GetUserId() as userid in UserMapping is wise, but not
> really.
> > All the foreign tables might have different effective userids, each
> > different from GetUserId() and what GetUserId() would return might have a
> > user mapping different from the effective userids. What userid should
> > UserMapping structure have in that case? I thought, that's why Hanada-san
> > used invalid userid there, so left as it is.
>
> Well, we kind of need to get it right, not just be guessing.
>
> It looks to me as though GetConnection() only uses the user ID as a
> cache lookup key.  What it's trying to do is ensure that if user A and
> user B need different user mapping options, we don't use the same
> connection for both.  So, actually, passing InvalidOid seems like it's
> not really a problem here.  It's arguably more correct than what we've
> been doing up until now, since it means we cache the connection under
> user OID whose options we used, rather than the user OID that asked
> about the options.
>

That means that, right now, for two different local users which use the
public user mapping we will be creating two different connections to the
foreign server with the same credentials. That doesn't look good. If we
obtained the user mapping by its user mapping OID (which will have an
invalid userid for a public user mapping) and used the userid from that
structure, we would get rid of this problem.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2016-01-21 Thread Vik Fearing
On 01/05/2016 09:07 AM, Vitaly Burovoy wrote:
> On 1/4/16, Alvaro Herrera  wrote:
>> It seems we got majority approval on the design of this patch, and no
>> disagreement; the last submitted version appears to implement that.
>> There's no documentation change in the patch though.  I'm marking it as
>> Waiting on Author; please resubmit with necessary doc changes.
> 
> Thank you!
> Version 3 of the patch with touched documentation in the attachment.
> 
> I decided to mark it as a note, because that separation
> (monotonic/oscillation fields) is not obvious and for most values the
> function "extract" works as expected (e.g. does not give an error)
> until special values are (casually?) passed.

I have reviewed this patch.  It applies and compiles cleanly and
implements the behavior reached by consensus.

The documentation is a little light, but I don't see what else needs to
be said.

The code is clean and well commented.  All extraction options are supported.

Regression tests are present and seemingly complete.

I looked around for other places where this code should be used and
didn't find any.  I am marking this patch Ready for Committer.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-01-21 Thread Ashutosh Bapat
On Thu, Jan 21, 2016 at 3:03 AM, Robert Haas  wrote:

> On Mon, Jan 18, 2016 at 6:47 AM, Ashutosh Bapat
>  wrote:
> > 2. pg_fdw_join_v2.patch: postgres_fdw changes for supporting join
> pushdown
>
> The very first hunk of this patch contains annoying whitespace
> changes.  Even if the result is what pgindent is going to do anyway,
> such changes should be stripped out of patches for ease of review.  In
> this case, though, I'm pretty sure it isn't what pgindent is going to
> do, so it's just useless churn.  Please remove all such changes from
> the patch.
>

Done.


>
> find_var_pos() looks like it should just be inlined into its only caller.
>
> Node *node = (Node *) var;
> TargetListEntry *tle = tlist_member(node, context->outerlist);
> if (tle)
> {
> side = OUTER_ALIAS;
> pos = tle->resno;
> }
> else
> {
> side = INNER_ALIAS;
> tle = tlist_member(node, context->innertlist);
> pos = tle->resno;
> }
>
> Why are we calling the return value "pos" instead of "resno" as we
> typically do elsewhere?
>

I have rewritten deparseJoinVar as
/*
 * Deparse given Var required for a joinrel into buf.
 */
static void
deparseJoinVar(Var *node, deparse_expr_cxt *context)
{
char*side;
TargetEntry *tle;

/* Lookup outer side */
tle = tlist_member((Node *)node, context->outertlist);
if (tle)
side = OUTER_ALIAS;
else
{
/* Not found on outer side; lookup inner */
side = INNER_ALIAS;
tle = tlist_member((Node *)node, context->innertlist);
}

/* The input var should be either on left or right side */
Assert(tle && side);

appendStringInfo(context->buf, "%s.%s%d", side, COL_ALIAS_PREFIX,
tle->resno);
}


> get_jointype_name() would probably be better written as a switch.


Done.


> On
> the flip side, deparseColumnRef() would have a lot less code churn if
> it *weren't* written as a switch.
>

Done.


>
> What this patch does to the naming and calling conventions in
> deparse.c does not good.  Previously, we had deparseTargetList().
> Now, we sometimes use that and sometimes deparseAttrsUsed() for almost
> the same thing.


Previously deparseTargetList deparsed the SELECT or RETURNING clause by
including the list of names of the attributes provided by attrs_used. That's
now done by deparseAttrsUsed(). In the current patch deparseTargetList()
deparses the targetlist, i.e. a list of TargetEntry nodes (right now only
Vars). Although these two functions both return a comma-separated string of
column names, their inputs are different. deparseAttrsUsed() can never be
called for more than one base relation. deparseTargetList(), on the other
hand, can deparse a targetlist with Var nodes from multiple relations. We
need those two functionalities separately. We might convert attrs_used into
a list of TargetEntry nodes using build_tlist_to_deparse() and use
deparseTargetList everywhere. A side effect of that would be separating
retrieved_attrs collection from the deparsing code. I didn't make that
change in this version to avoid large code changes, but I am fine doing it
if that makes the code more readable.

If we have to keep the old deparseTargetList as is (and don't rename it to
deparseAttrsUsed), we can rename the new deparseTargetList to something
different, maybe deparseSelectList. I am fine with that too. But we need
the latter functionality, whatever its name is.


> Previously, we had deparseColumnRef(); now we have
> both that and deparseJoinVar() doing very similar things.  But in each
> case, the function names are not parallel and the calling conventions
> are totally different.  Like here:
>
> +   if (context->foreignrel->reloptkind == RELOPT_JOINREL)
> +   deparseJoinVar(node, context);
> +   else
> +   deparseColumnRef(buf, node->varno,
> node->varattno, context->root);
>
> We pass the buf separately to deparseColumnRef(), but for some reason
> not to deparseJoinVar().I wonder if these functions need to be two
> separate things or if the work done by deparseJoinVar() should
> actually be part of deparseColumnRef().  But even if it needs to be
> separate, I wonder why we can't arrange things so that they get the
> same arguments, more or less.
>

deparseVar() is called for any Var node that's encountered. deparseJoinVar()
is called to deparse a Var from a join relation; it is supposed to output the
INNER or OUTER var's alias as used in the INNER or OUTER subqueries. It does
not output the real column names. deparseColumnRef(), however, is the same
old thing; it deparses a column of a given base relation. They are not
similar things.

deparseJoinVar() gets its buf from the context, which we do not pass to
deparseColumnRef(). Not all callers of deparseColumnRef() have a
deparse_expr_cxt with them. Also, the outertlist and innertlist required by
deparseJoinVar() are passed through deparse_expr_cxt. It doesn't seem worth
creating a context just for the sake of making the function definitions look
similar. So, we need to have these two f

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-21 Thread Dilip Kumar
On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <
a.aleks...@postgrespro.ru> wrote:

>
> increasing number of lock partitions (see columns "no locks", "lwlock"
> and "spinlock array"). Previously it couldn't help us (see "master"
> column) because of a bottleneck.
>
> If you have any other questions regarding this patch please don't
> hesitate to ask.
>

I have done some performance benchmarking for this
patch (dynahash-optimization-v10-step1.patch).

Machine Detail:
cpu   : POWER8
cores: 24 (192 with HT)

Non Default Parameter:

Shared Buffer= 30GB
max_wal_size= 10GB
max_connections=500

Test1:
*schema.sql* and *pgbench.sql* are the same files that Aleksander attached
in the first mail of the thread.

psql -d postgres -f schema.sql
pgbench -c$ -j$ -f pgbench.sql  postgres

 client    base   patch
      1     145     145
      2     262     258
      4     453     472
      8     749     732
     16    1114    1128
     32    1727    1789
     64    2729    2793
    128    3534    3520

With this test I did not see any improvement, though I think it should
improve the performance. Do you have any suggestion for how to reproduce
results like yours?

I have also dumped stacks using a script and observed many stack
dumps like the ones you mentioned at the start of the thread. And after the
patch, I found that the number of lock-waiting stacks is reduced.

Stack Dump:
---
#0  0x7f25842899a7 in semop () from /lib64/libc.so.6
#1  0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at
pg_sema.c:387
#2  0x0078955f in LWLockAcquire (lock=0x7f247698a980,
mode=LW_EXCLUSIVE) at lwlock.c:1028
#3  0x007804c4 in LockAcquireExtended
#4  0x0077fe91 in LockAcquire
#5  0x0077e862 in LockRelationOid
#6  0x0053bc7b in find_inheritance_children
#7  0x0053bd4f in find_all_inheritors
#8  0x006fc0a2 in expand_inherited_rtentry
#9  0x006fbf91 in expand_inherited_tables

I have also tried to analyze with perf; I can see that the amount of time
taken in hash_search_with_hash_value is reduced from 3.86% (master) to
1.72% (patch).

I plan to do further investigation into different scenarios of dynahash.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Releasing in September

2016-01-21 Thread Marcin Mańk
On Wed, Jan 20, 2016 at 4:40 PM, Bruce Momjian  wrote:

> Many people where happy with our consistent releasing major releases in
> September, e.g. 9.0 to 9.3:
>
Not sure why the commitfest process should be synchronized with the
release process. What if, when the release date comes, the currently
running commitfest (if there is one) gets put on hold, cleanup work is
done, the release gets stamped, and then the commitfest gets resumed for
the next release?


Re: [HACKERS] Minor code improvements to create_foreignscan_plan/ExecInitForeignScan

2016-01-21 Thread Etsuro Fujita

On 2016/01/21 7:04, Alvaro Herrera wrote:

Etsuro Fujita wrote:

On second thought, I noticed that detecting whether we see a system column
that way needs more cycles in cases where the reltargetlist and the
restriction clauses don't contain any system columns.  ISTM that such cases
are rather common, so I'm inclined to keep that code as-is.



Ah, I see now what you did there. I was thinking you'd have the
foreach(restrictinfo) loop, then once the loop is complete scan the
bitmapset; not scan the bitmap set scan inside the other loop.


Ah, now I get the point.  I think that is a good idea.  The additional
cycles for the worst case are bounded and negligible.  Please find
attached an updated patch.


Best regards,
Etsuro Fujita
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***
*** 2099,2108  create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
  	RelOptInfo *rel = best_path->path.parent;
  	Index		scan_relid = rel->relid;
  	Oid			rel_oid = InvalidOid;
- 	Bitmapset  *attrs_used = NULL;
  	Plan	   *outer_plan = NULL;
- 	ListCell   *lc;
- 	int			i;
  
  	Assert(rel->fdwroutine != NULL);
  
--- 2099,2105 
***
*** 2171,2206  create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
  	}
  
  	/*
! 	 * Detect whether any system columns are requested from rel.  This is a
! 	 * bit of a kluge and might go away someday, so we intentionally leave it
! 	 * out of the API presented to FDWs.
! 	 *
! 	 * First, examine all the attributes needed for joins or final output.
! 	 * Note: we must look at reltargetlist, not the attr_needed data, because
! 	 * attr_needed isn't computed for inheritance child rels.
  	 */
! 	pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
! 
! 	/* Add all the attributes used by restriction clauses. */
! 	foreach(lc, rel->baserestrictinfo)
  	{
! 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
  
! 		pull_varattnos((Node *) rinfo->clause, rel->relid, &attrs_used);
! 	}
  
! 	/* Now, are any system columns requested from rel? */
! 	scan_plan->fsSystemCol = false;
! 	for (i = FirstLowInvalidHeapAttributeNumber + 1; i < 0; i++)
! 	{
! 		if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
  		{
! 			scan_plan->fsSystemCol = true;
! 			break;
  		}
- 	}
  
! 	bms_free(attrs_used);
  
  	return scan_plan;
  }
--- 2168,2228 
  	}
  
  	/*
! 	 * If rel is a base relation, detect whether any system columns are
! 	 * requested from the rel.  (If rel is a join relation, rel->relid will be
! 	 * 0, but there can be no Var with relid 0 in the reltargetlist or the
! 	 * restriction clauses, so we skip this in that case.  Note that any such
! 	 * columns in base relations that were joined are assumed to be contained
! 	 * in fdw_scan_tlist.)  This is a bit of a kluge and might go away someday,
! 	 * so we intentionally leave it out of the API presented to FDWs.
  	 */
! 	scan_plan->fsSystemCol = false;
! 	if (scan_relid > 0)
  	{
! 		Bitmapset  *attrs_used = NULL;
! 		ListCell   *lc;
! 		int			i;
  
! 		/*
! 		 * First, examine all the attributes needed for joins or final output.
! 		 * Note: we must look at reltargetlist, not the attr_needed data,
! 		 * because attr_needed isn't computed for inheritance child rels.
! 		 */
! 		pull_varattnos((Node *) rel->reltargetlist, scan_relid, &attrs_used);
  
! 		/* Now, are any system columns requested from rel? */
! 		for (i = FirstLowInvalidHeapAttributeNumber + 1; i < 0; i++)
  		{
! 			if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
! 			{
! scan_plan->fsSystemCol = true;
! break;
! 			}
  		}
  
! 		/* Examine all the attributes used by restriction clauses, if needed */
! 		if (!scan_plan->fsSystemCol)
! 		{
! 			foreach(lc, rel->baserestrictinfo)
! 			{
! RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
! 
! pull_varattnos((Node *) rinfo->clause, scan_relid, &attrs_used);
! 			}
! 
! 			/* Now, are any system columns requested from rel? */
! 			for (i = FirstLowInvalidHeapAttributeNumber + 1; i < 0; i++)
! 			{
! if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
! {
! 	scan_plan->fsSystemCol = true;
! 	break;
! }
! 			}
! 		}
! 
! 		bms_free(attrs_used);
! 	}
  
  	return scan_plan;
  }



Re: [HACKERS] custom function for converting human readable sizes to bytes

2016-01-21 Thread Vitaly Burovoy
On 1/20/16, Pavel Stehule  wrote:
> ...
> New version is attached
>
> Regards
> Pavel

I'm sorry I'll do a review only tonight.
-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] [PoC] Asynchronous execution again (which is not parallel)

2016-01-21 Thread Amit Langote

Hi!

On 2016/01/21 18:26, Kyotaro HORIGUCHI wrote:
>>> Then, suppose we add a function bool ExecStartAsync(PlanState *target,
>>> ExecCallback callback, PlanState *cb_planstate, void *cb_context).
>>> For non-async-aware plan nodes, this just returns false.  async-aware
>>> plan nodes should initiate some work, register some callbacks, and
>>> return.  The callback that get registered should arrange in turn to
>>> register the callback passed as an argument when a tuple becomes
>>> available, passing the planstate and context provided by
>>> ExecStartAsync's caller, plus the TupleTableSlot containing the tuple.
>>
>> Although I don't imagine clearly about the case of
>> async-aware-nodes under non-aware-nodes, it seems to have a high
>> affinity with (true) parallel execution framework.
> 
> The ExecStartAsync is similar to ExecStartNode of my old
> patch. One of the most annoying things of that is that it needs
> to walk down to their descendents and in turn it needs garbageous
> corresponding additional codes for all type of nodes which can
> have children.

Unless I am missing something, I wonder if this is where
planstate_tree_walker() introduced by commit 8dd401aa is useful. For
example, I see that it's used in ExecShutdownNode() in a manner that looks
interesting for this discussion.
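
A rough sketch of what I mean (ExecStartAsync and the callback arguments
are the hypothetical API proposed upthread; planstate_tree_walker() is the
only existing piece here):

typedef struct StartAsyncContext
{
	ExecCallback callback;		/* per the upthread proposal */
	PlanState  *cb_planstate;
	void	   *cb_context;
} StartAsyncContext;

static bool
exec_start_async_walker(PlanState *planstate, void *context)
{
	StartAsyncContext *ctx = (StartAsyncContext *) context;

	/* async-aware nodes kick off work and register callbacks; others no-op */
	(void) ExecStartAsync(planstate, ctx->callback,
						  ctx->cb_planstate, ctx->cb_context);

	/* recurse into children without any per-node-type boilerplate */
	return planstate_tree_walker(planstate, exec_start_async_walker, context);
}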

Thanks,
Amit






Re: Odd behavior in foreign table modification (Was: Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW)

2016-01-21 Thread Etsuro Fujita

On 2016/01/19 19:04, Thom Brown wrote:

On 12 January 2016 at 11:49, Etsuro Fujita  wrote:

On 2016/01/12 20:36, Thom Brown wrote:



On 8 January 2016 at 05:08, Etsuro Fujita 
wrote:



On 2016/01/06 20:37, Thom Brown wrote:


I've run into an issue:

*# UPDATE master_customers SET id = 22 WHERE id = 16 RETURNING
tableoid::regclass;
ERROR:
CONTEXT:  Remote SQL command: UPDATE public.customers SET id = 22
WHERE ((id = 16)) RETURNING NULL



While working on this, I noticed that the existing postgres_fdw system
shows
similar behavior, so I changed the subject.

IIUC, the reason for that is when the local query specifies "RETURNING
tableoid::regclass", the FDW has fmstate->has_returning=false while the
remote query executed at ModifyTable has "RETURNING NULL", as shown in
the
above example; that would cause an abnormal exit in executing the remote
query in postgresExecForeignUpdate, since that the FDW would get
PGRES_TUPLES_OK as a result of the query while the FDW would think that
the
right result to get should be PGRES_COMMAND_OK, from the flag
fmstate->has_returning=false.



Attached is a patch to fix that.



I can't apply this patch in tandem with FDW DML pushdown patch (either
v2 or v3).



That patch is for fixing the similar issue in the existing postgres_fdw
system.  So, please apply that patch without the DML pushdown patch.  If
that patch is reasonable as a fix for the issue, I'll update the DML
pushdown patch (v3) on top of that patch.



The patch seems to work for me:

Before:

*# UPDATE master_customers SET id = 22 WHERE id = 1 RETURNING
tableoid::regclass;
ERROR:
CONTEXT:  Remote SQL command: UPDATE public.customers SET id = $2
WHERE ctid = $1 RETURNING NULL

After:

*# UPDATE master_customers SET id = 22 WHERE id = 1 RETURNING
tableoid::regclass;
  tableoid
--
  remote.customers
(1 row)

UPDATE 1


Thanks for the testing!

I updated the DML pushdown patch on top of Robert's version of this 
bugfix patch.  Please see


http://www.postgresql.org/message-id/56a0a9f0.9090...@lab.ntt.co.jp

Best regards,
Etsuro Fujita






Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-01-21 Thread Etsuro Fujita

On 2016/01/20 19:57, Rushabh Lathia wrote:

Overall I am quite done with the review of this patch. Patch is in good
shape and covered most of the things which been discussed earlier
or been mentioned during review process. Patch pass through the
make check and also includes good test coverage.


Thanks for the review!


Here are couple of things which is still open for discussion:



1)
.) When Tom Lane and Stephen Frost suggested getting the core
code involved,
I thought that we can do the mandatory checks into core it self
and making
completely out of dml_is_pushdown_safe(). Please correct me



The reason why I put that function in postgres_fdw.c is Check point 4:

+  * 4. We can't push an UPDATE down, if any expressions to assign
to the target
+  * columns are unsafe to evaluate on the remote server.



Here I was talking about checks related to triggers, or to LIMIT. I think
earlier thread talked about those mandatory check to the core. So may
be we can move those checks into make_modifytable() before calling
the PlanDMLPushdown.

This need to handle by the Owner.


Done.  For that, I modified relation_has_row_triggers a bit, renamed it
to has_row_triggers (a shorter name), and moved it to plancat.c.  And I
merged dml_is_pushdown_safe with postgresPlanDMLPushdown, and revised
that callback routine a bit.  Attached is an updated version of the 
patch created on top of Robert's version of the patch [1], which fixes 
handling of RETURNING tableoid in updating foreign tables.



2) Decision on whether we need the separate new node ForeignUpdate,
ForeignDelete. In my opinion I really don't see the need of this as we
that will add lot of duplicate. Having said that if committer or someone
else feel like that will make code more clean that is also true,

This need more comments from the committer.


I agree with you.

Other changes:

* In previous version, I assumed that PlanDMLPushdown sets fsSystemCol 
to true when rewriting the ForeignScan plan node so as to push down an 
UPDATE/DELETE to the remote server, in order to initialize t_tableOid 
for the scan tuple in ForeignNext.  The reason is that I created the 
patch so that the scan tuple is provided to the local query's RETURNING 
computation, which might see the tableoid column.  In this version, 
however, I modified the patch so that the tableoid value is inserted by 
ModifyTable.  This eliminates the need for postgres_fdw (or any other 
FDW) to set fsSystemCol to true in PlanDMLPushdown.


* Add set_transmission_modes/reset_transmission_modes to 
deparsePushedDownUpdateSql.


* Revise comments a bit further.

* Revise docs, including a fix for a wrong copy-and-paste.

Best regards,
Etsuro Fujita

[1] 
http://www.postgresql.org/message-id/ca+tgmoz40j2uc5ac1nxu03oj4crvolks15xx+ptfp-1u-8z...@mail.gmail.com
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 823,829  deparseTargetList(StringInfo buf,
   *
   * If params is not NULL, it receives a list of Params and other-relation Vars
   * used in the clauses; these values must be transmitted to the remote server
!  * as parameter values.
   *
   * If params is NULL, we're generating the query for EXPLAIN purposes,
   * so Params and other-relation Vars should be replaced by dummy values.
--- 823,829 
   *
   * If params is not NULL, it receives a list of Params and other-relation Vars
   * used in the clauses; these values must be transmitted to the remote server
!  * as parameter values.  Caller is responsible for initializing it to empty.
   *
   * If params is NULL, we're generating the query for EXPLAIN purposes,
   * so Params and other-relation Vars should be replaced by dummy values.
***
*** 840,848  appendWhereClause(StringInfo buf,
  	int			nestlevel;
  	ListCell   *lc;
  
- 	if (params)
- 		*params = NIL;			/* initialize result list to empty */
- 
  	/* Set up context struct for recursion */
  	context.root = root;
  	context.foreignrel = baserel;
--- 840,845 
***
*** 978,983  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
--- 975,1044 
  }
  
  /*
+  * deparse remote UPDATE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparsePushedDownUpdateSql(StringInfo buf, PlannerInfo *root,
+ 		   Index rtindex, Relation rel,
+ 		   List	*targetlist,
+ 		   List *targetAttrs,
+ 		   List	*remote_conds,
+ 		   List **params_list,
+ 		   List *returningList,
+ 		   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	deparse_expr_cxt context;
+ 	int			nestlevel;
+ 	bool		first;
+ 	ListCell   *lc;
+ 
+ 	if (params_list)
+ 		*params_list = NIL;		/* initialize result list to empty */
+ 
+ 	/* Set up context struct for recursion */
+ 

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-21 Thread Victor Wagner
On Tue, 19 Jan 2016 14:34:54 +
Thom Brown  wrote:

> The segfault issue I originally reported now appears to be resolved.
> 
> But now I have another one:
> 
> psql
> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
> -c 'show port'
> 
> Segmentation fault

This segfault appears to be a simple programming error, which comes out only
when we need to retry the address list (which happens when no node is
available at the time of the first scan).

There is code which prevents random hostorder from connecting to the same
node twice in a row:

while (current != NULL)
{
    if (current == conn->addr_cur)
    {
        /* skip the node we just tried, to avoid using it twice in a row */
        current = current->ai_next;
    }
    count++;
    /* pick this node with probability ~1/count */
    if ((rand() & 0xffff) < 0x10000 / count)
    {
        choice = current;
    }
    current = current->ai_next;
}


There should be a continue; after the first current = current->ai_next;
otherwise, when the node being skipped is the last element of the list,
current is advanced to NULL and then dereferenced at the bottom of the loop.
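For reference, here is a minimal sketch of the loop with that fix applied.
It reuses the variables from the snippet above; the hex constants, which the
archive has mangled above, are assumed to be the usual ~1/count selection
threshold, so treat this as an illustration rather than the exact patch code:

/*
 * Sketch of the corrected selection loop.  The added "continue" re-checks
 * the loop condition after skipping the current node, so "current" is
 * never dereferenced once it has been advanced to NULL (presumably the
 * source of the reported segfault).
 */
while (current != NULL)
{
    if (current == conn->addr_cur)
    {
        current = current->ai_next;
        continue;               /* the missing statement */
    }
    count++;
    /* pick this node with probability ~1/count (constants assumed) */
    if ((rand() & 0xffff) < 0x10000 / count)
        choice = current;
    current = current->ai_next;
}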
 
> This is where no nodes are available.  If I remove hostorder=random,
> or replace it with hostorder=sequential, it doesn't segfault.
> However, it then tries to connect to PGHOST on PGPORT repeatedly, even
> if I bring up one of the nodes it should be looking for.  Not only
> this, but it seems to do it forever if failover_timeout is greater
> than 0.

But this turns out to be a more complicated matter, so I'm not ready to
provide the next version of the patch yet.









Re: [HACKERS] [PoC] Asynchronous execution again (which is not parallel)

2016-01-21 Thread Kyotaro HORIGUCHI
Hello,

I have put some thought into, and experimented with, callbacks as a means
of async (early) execution.

> > Suppose we equip each EState with the ability to fire "callbacks".
> > Callbacks have the signature:
> > 
> > typedef bool (*ExecCallback)(PlanState *planstate, TupleTableSlot
> > *slot, void *context);
> > 
> > Executor nodes can register immediate callbacks to be run at the
> > earliest possible opportunity using a function like
> > ExecRegisterCallback(estate, callback, planstate, slot, context).
> > They can register deferred callbacks that will be called when a file
> > descriptor becomes ready for I/O, or when the process latch is set,
> > using a call like ExecRegisterFileCallback(estate, fd, event,
> > callback, planstate, slot, context) or
> > ExecRegisterLatchCallback(estate, callback, planstate, slot, context).

I considered this. The immediate callbacks seem fine, but
using a latch or fds to signal tuple availability doesn't seem to
fit callbacks stored in the estate. They are deferrable until the
parent's tuple request, and that kind of event can be handled at
that time, as ExecGather does now. However, some kind of
synchronization/waiting mechanism like a latch or select() is needed
anyway.

> > To execute callbacks, an executor node can call ExecFireCallbacks(),
> > which will fire immediate callbacks in order of registration, and wait
> > for the file descriptors for which callbacks have been registered and
> > for the process latch when no immediate callbacks remain but there are
> > still deferred callbacks.  It will return when (1) there are no
> > remaining immediate or deferred callbacks or (2) one of the callbacks
> > returns "true".
> 
> Excellent! I unconsciously excluded the case of callbacks because
> I supposed (without solid grounds) that all executor nodes could have a
> chance to benefit from this. Such a callback is a good choice for doing
> what Start*Node did in the latest patch.

The previous code added a large amount of clutter: the async-execution
mechanism included additional code for an ExecStartNode phase in the same
manner as ExecProcNode and ExecEndNode. Most of that additional code is
useless for most node types.

Callbacks are usable for not-so-common, invoked-once-per-event
operations such as error handling. In this case, the operation can be
the async execution of a node, and the event can be "just before
ExecProcNode on the topmost node". The first attached patch allows
async-capable nodes to register callbacks in the Init phase and executes
them just before the Exec phase on the topmost node. As a result, it
greatly reduces the additional code. This is what my first impression of
the word "callbacks" suggested.
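As an illustration of that shape (not the actual patch: the registration and
firing functions follow Robert's proposal quoted above, and the node fields
and placement are hypothetical), an async-capable node might do something
like this:

/* Callback signature taken from the proposal quoted above. */
static bool
dummy_async_cb(PlanState *planstate, TupleTableSlot *slot, void *context)
{
    /* Kick off the node's remote work early, before the topmost ExecProcNode. */
    elog(LOG, "dummy_async_cb is called for %p", (void *) planstate);
    return false;               /* false: let the remaining callbacks fire too */
}

/* In the node's ExecInit* routine (hypothetical placement): */
ExecRegisterCallback(estate, dummy_async_cb, &node->ss.ps, NULL, NULL);

/* Just before the Exec phase on the topmost node (hypothetical placement): */
ExecFireCallbacks(estate);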

The following operations yield LOG messages from the dummy callback
with this patch:

CREATE TABLE t1 (a int, b int);
INSERT INTO t1 (SELECT a, 1 FROM generate_series(0, 99) a);
CREATE TABLE t2 (a int, b int);
INSERT INTO t2 (SELECT a, 2 FROM generate_series(0, 99) a);
CREATE TABLE t3 (a int, b int);
INSERT INTO t3 (SELECT a, 3 FROM generate_series(0, 99) a);
SELECT * FROM t1 UNION ALL SELECT * FROM t2 UNION ALL SELECT * FROM t3;
===
LOG:  dummy_async_cb is called for 0x2783a98
LOG:  dummy_async_cb is called for 0x2784248
LOG:  dummy_async_cb is called for 0x2784ad0

What my previous patch could do is doable by this first patch
with a far smaller diff.

If this design is not bad, I'll do the postgres_fdw part.


Next is discussion about async tuple fetching.

> > Then, suppose we add a function bool ExecStartAsync(PlanState *target,
> > ExecCallback callback, PlanState *cb_planstate, void *cb_context).
> > For non-async-aware plan nodes, this just returns false.  async-aware
> > plan nodes should initiate some work, register some callbacks, and
> > return.  The callbacks that get registered should arrange in turn to
> > register the callback passed as an argument when a tuple becomes
> > available, passing the planstate and context provided by
> > ExecStartAsync's caller, plus the TupleTableSlot containing the tuple.
> 
> Although I can't clearly picture the case of async-aware nodes under
> non-aware nodes, it seems to have a high affinity with a (true)
> parallel execution framework.

ExecStartAsync is similar to ExecStartNode in my old patch. One of the
most annoying things about that approach is that it needs to walk down to
the descendants, which in turn requires corresponding boilerplate code for
every type of node that can have children.

Instead, in the second patch, I modified ExecProcNode to return an async
status in the EState. It will be EXEC_READY or EXEC_EOT (end of table/no
more tuples?) for non-async-capable nodes, and async-capable nodes can set
it to EXEC_NOT_READY, which indicates that there could be more tuples but
none are available yet.
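To sketch the idea (the EXEC_* states come from the description above; the
EState field name and the helper functions are assumptions, not the actual
patch), an async-aware Append might drive its children roughly like this:

/* Async status returned via the EState (field name assumed). */
typedef enum ExecAsyncStatus
{
    EXEC_READY,                 /* a tuple is available in the returned slot */
    EXEC_NOT_READY,             /* more tuples may come, but none yet */
    EXEC_EOT                    /* end of table: no more tuples */
} ExecAsyncStatus;

/* Inside an async-aware Append (sketch; the *_subplan helpers are hypothetical): */
for (;;)
{
    TupleTableSlot *slot = ExecProcNode(node->appendplans[node->as_whichplan]);

    if (estate->es_async_status == EXEC_READY)
        return slot;                    /* got a tuple, hand it up */

    if (estate->es_async_status == EXEC_EOT)
        mark_subplan_done(node, node->as_whichplan);

    if (all_subplans_done(node))
        return ExecClearTuple(node->ps.ps_ResultTupleSlot);

    /*
     * EXEC_NOT_READY: try the next child that is not done yet; ideally
     * wait on a latch/select() here instead of busy-looping.
     */
    node->as_whichplan = next_runnable_subplan(node);
}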

Async-aware nodes such as Append can go to the next child if the
preceding child returned EXEC_NOT_READY. If all non-EXEC_EOT children
returned EXEC_NOT_READY, Append will wait using some signaling mechanism
(it currently busy-loops instead). As an example, th

Re: Odd behavior in foreign table modification (Was: Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW)

2016-01-21 Thread Etsuro Fujita

On 2016/01/21 5:06, Robert Haas wrote:

On Wed, Jan 20, 2016 at 4:50 AM, Etsuro Fujita  wrote:

My concern about that is that it would make the code in deparseTargetList()
complicated.

Essentially, I think your proposal needs a two-pass algorithm for
deparseTargetList: (1) create an integer List of the columns being retrieved
from the given attrs_used (getRetrievedAttrs()), and (2) print those columns
(printRetrievedAttrs()).  How about sharing those two functions between
deparseTargetList and deparseReturningList?



I don't see why we'd need that.  I adjusted the code in postgres_fdw
along the lines I had in mind and am attaching the result.  It doesn't
look complicated to me, and it passes the regression test you wrote.


Thanks for the patch!  From the patch, I now correctly understand what you 
proposed.  Good idea!



By the way, I'm not too sure I understand the need for the core
changes that are part of this patch, and I think that point merits
some discussion.  Whenever you change core like this, you're changing
the contract between the FDW and core; it's not just postgres_fdw that
needs updating, but every FDW.  So we'd better be pretty sure we need
these changes and they are adequately justified before we think about
putting them into the tree.  Are these core changes really needed
here, or can we fix this whole issue in postgres_fdw and leave the
core code alone?


Well, if we think it is the FDW's responsibility to insert a valid value 
for tableoid in the returned slot during ExecForeignInsert, 
ExecForeignUpdate or ExecForeignDelete, we don't need those core changes.  
However, IMO it would be better for that to be done by ModifyTable, in the 
same way that ForeignScan does it in ForeignNext.  That eliminates the need 
for postgres_fdw (or any other FDW) to handle that business in its callback 
routines.
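For context, the ForeignScan side of that is roughly the following logic in
ForeignNext (paraphrased from memory, so treat it as a sketch rather than
the exact source), and the proposed core change would do the equivalent in
ModifyTable using the result relation:

/* Roughly what ForeignNext does when fsSystemCol is set (sketch): */
if (plan->fsSystemCol && !TupIsNull(slot))
{
    HeapTuple   tup = ExecMaterializeSlot(slot);

    tup->t_tableOid = RelationGetRelid(node->ss.ss_currentRelation);
}

/* Sketch of the equivalent in ModifyTable, after the FDW callback returns: */
if (slot != NULL && !TupIsNull(slot))
{
    HeapTuple   tup = ExecMaterializeSlot(slot);

    tup->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
}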


Best regards,
Etsuro Fujita






Re: [HACKERS] Releasing in September

2016-01-21 Thread Amit Kapila
On Thu, Jan 21, 2016 at 10:11 AM, Michael Paquier  wrote:
>
> On Thu, Jan 21, 2016 at 1:32 PM, Michael Paquier  wrote:
> > On Thu, Jan 21, 2016 at 12:59 AM, Bruce Momjian  wrote:
> >> On Wed, Jan 20, 2016 at 10:55:07AM -0500, Robert Haas wrote:
> >>> On Wed, Jan 20, 2016 at 10:48 AM, Andres Freund  wrote:
> >>> > I think this has very little to do with commitfest schedules, and much
> >>> > more with the "early" forking of the new version branch. For both 9.4
> >>> > and 9.5 we essentially spent a couple months twiddling our thumbs.
> >>>
> >>> It's certainly true that we twiddled our thumbs quite a bit about
> >>> getting 9.5 ready to ship.  However, the old process where nobody
> >>> could get anything committed for six months out of the year blew
> >>> chunks, too.  Personally, I think that the solution is to cut off the
> >>> last CommitFest a lot sooner, and then reopen the tree for the next
> >>> release as soon as possible.  But this never works, because there are
> >>> always patches we want to slip in late.
> >>
> >> The bottom line is we can't sustain five commitfests and stay on time
> >> --- we need to go back to four, which I think is what we used to do.  We
> >> twiddled our thumbs back in the September-release years too, but had
> >> consistency because twiddling was built into the schedule.
> >
> > Here I agree. Five commit fests is proving to be a lot, and CF managers
> > are facing severe burnout. Though I guess it sounded like a fine idea
> > at the time this was begun (disclaimer: I was not there). So we may
> > want to go back to 4, and avoid too much overlap between CFs and what
> > would be a stability period without any commit fests going on. It
> > seems to me that the problem with fixes getting committed late is that
> > committers', authors' and reviewers' attention is getting distracted
> > by commit fests and the new features proposed there. Perhaps some
> > people are more interested in implementing new features than working
> > on bugs and would just continue hacking and arguing about new
> > features; a stability period may at least attract more committer
> > attention to actual bug fixes. In short: no new features can be
> > committed until the previous version has reached at least beta2, rc,
> > whatever. This may accelerate the stability process.
>
> Another idea that comes to my mind to fix the CF manager burnout
> and actually motivate people to become CF managers: say that once a
> CF manager is done with a commit fest, folks on -hackers discuss
> whether the CFM has done a good job or not. If yes, he gets paid
> travel to the conference of his choice.
>

I also think there should be some way to give credit to the CFM. If it is
difficult to do anything involving money, then we could ensure that any
patches the CFM submits for the next CF are prioritised.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2016-01-21 Thread Andres Freund
On 2016-01-21 11:33:15 +0530, Amit Kapila wrote:
> On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund  wrote:
> > I don't think it's strongly related - the contention here is on read
> > access to the clog, not on write access.
> 
> Don't reads of the clog contend with parallel writes to the clog?

Sure. But you're not going to beat "no access to the clog at all", courtesy
of hint bits, by making parallel writes slightly better citizens.
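The point about hint bits is that committed-ness gets cached on the tuple
itself, so most visibility checks never read the clog at all; roughly (a
simplified sketch of the usual tqual.c pattern, not the actual source):

if (tuple->t_infomask & HEAP_XMIN_COMMITTED)
{
    /* Hint bit already set: xmin known committed, no clog read at all. */
}
else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
{
    /* Only the first reader pays the clog lookup, then caches it as a hint. */
    SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                HeapTupleHeaderGetXmin(tuple));
}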

