Re: [HACKERS] Proposal : Parallel Merge Join

2017-02-24 Thread Amit Kapila
On Fri, Feb 24, 2017 at 4:49 PM, Dilip Kumar  wrote:
> On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila  wrote:
>> What advantage do you see for considering such a path when the cost of
>> innerpath is more than cheapest_total_inner?  Remember the more paths
>> we try to consider, the more time we spend in the planner.  By any
>> chance are you able to generate any query where it will give benefit
>> by considering costlier innerpath?
>
> Changed
>

It seems you have forgotten to make the change in the below check:

+ if (innerpath != NULL &&
+ innerpath->parallel_safe &&
+ (cheapest_startup_inner == NULL ||
+ cheapest_startup_inner->parallel_safe == false ||
+ compare_path_costs(innerpath, cheapest_startup_inner,
+ STARTUP_COST) < 0))


+ /* Found a cheap (or even-cheaper) sorted parallel safer path */

typo
/safer/safe

Note - Change the patch status in CF app (to Needs Review) whenever
you post a new patch.  I could see that your other patch also needs a
similar update in CF app.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas  wrote:

>
> And there are no bugs, right?  :-)


Yeah yeah absolutely nothing. Just like any other feature committed to
Postgres so far ;-)

I need to polish the chain conversion patch a bit and also add missing
support for redo, hash indexes etc. Support for hash indexes will need
overloading of ip_posid bits in the index tuple (since there are no free
bits left in hash tuples). I plan to work on that next and submit a fully
functional patch, hopefully before the commit-fest starts.

(I have mentioned the idea of overloading ip_posid bits a few times now and
haven't heard any objection so far. Well, that could either mean that
nobody has read those emails seriously or there is general acceptance of
that idea... I am assuming the latter :-))

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
2017-02-25 1:40 GMT+09:00 Claudio Freire :
> On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas  wrote:
>> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
>>> This attached patch re-designed the previous v2 according to Robert's 
>>> comment.
>>> All I had to do was put an entrypoint for ForeignScan/CustomScan at
>>> ExecShutdownNode, because it is already modified to call the child node
>>> first, earlier than ExecShutdownGather, which releases the DSM segment.
>>
>> Aside from the documentation, which needs some work, this looks fine
>> to me on a quick read.
>
> LGTM too.
>
> You may want to clarify in the docs when the hook is called in
> relation to other hooks too.
>
I tried to add a description of how the custom-scan callbacks perform under
the executor, and roughly when they are invoked.
However, it is fundamentally not easy for most people, because it assumes the
FDW/CSP author understands the overall behavior of the optimizer / executor,
not only the API specifications.

Thanks,
-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v4.patch
Description: Binary data



Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-24 Thread Amit Kapila
On Sat, Feb 25, 2017 at 3:40 AM, Peter Geoghegan  wrote:
> On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas  wrote:
>> I think this thread is pretty short on evidence that would let us make
>> a smart decision about what to do here.  I see three possibilities.
>> The first is that this patch is a good idea whether we do something
>> about the issue of half-dead pages or not.  The second is that this
>> patch is a good idea if we do something about the issue of half-dead
>> pages but a bad idea if we don't.  The third is that this patch is a
>> bad idea whether or not we do anything about the issue of half-dead
>> pages.
>

+1.  I think we can track the stats from
IndexBulkDeleteResult->pages_free to see the impact of the patch.


> Half-dead pages are not really relevant to this discussion, AFAICT. I
> think that both you and Simon mean "recyclable" pages.
>

Yes, I think so and I think that is just confusion about terminology.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Oblivious nested loop join algorithm

2017-02-24 Thread Amin Fallahi
Hi all

Is the current implementation of nested loop join in postgres
(memory-trace) oblivious? Why or why not?


Re: [HACKERS] Logical replication existing data copy

2017-02-24 Thread Petr Jelinek
On 25/02/17 00:39, Erik Rijkers wrote:
> On 2017-02-25 00:08, Petr Jelinek wrote:
>>
>> There are now a lot of fixes for existing code that this patch depends
>> on. Hopefully some of the fixes get committed soonish.
> 
> Indeed - could you look over the below list of 8 patches; is it correct
> and in the right (apply) order?
> 
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> 
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
> 

Yes that should be correct.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical replication existing data copy

2017-02-24 Thread Erik Rijkers

On 2017-02-25 00:08, Petr Jelinek wrote:


There are now a lot of fixes for existing code that this patch depends
on. Hopefully some of the fixes get committed soonish.


Indeed - could you look over the below list of 8 patches; is it correct 
and in the right (apply) order?


0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

(they do apply & compile like this...)









Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Andres Freund
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > Concretely, something like the attached.  This passes regression tests
> > but I've not pushed on it any harder than that.
> 
> Heh, I'd just gotten something that didn't immediately crash anymore ;)
> 
> Running your patch against Jeff's test-case, verified before that I
> could easily reproduce the O(N^2) cost.

Oh, that didn't take as long as I was afraid (optimized/non-assert build):

postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 
from foobar t where t.titleid=foobar2.titleid);
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)

Time: 268043.710 ms (04:28.044)

Profile:
  13.49%  postgres      [.] ExecHashTableInsert
  11.16%  postgres      [.] BufFileRead
   9.20%  postgres      [.] ExecHashIncreaseNumBatches
   9.19%  libc-2.24.so  [.] __memmove_avx_unaligned_erms
   6.74%  postgres      [.] dense_alloc.isra.0
   5.53%  postgres      [.] ExecStoreMinimalTuple
   5.14%  postgres      [.] ExecHashJoinGetSavedTuple.isra.0
   4.68%  postgres      [.] AllocSetAlloc
   4.12%  postgres      [.] AllocSetFree
   2.31%  postgres      [.] palloc
   2.06%  [kernel]      [k] free_pcppages_bulk


I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().

- Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Andres Freund
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> Concretely, something like the attached.  This passes regression tests
> but I've not pushed on it any harder than that.

Heh, I'd just gotten something that didn't immediately crash anymore ;)

Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.

- Andres




Re: [HACKERS] Logical replication existing data copy

2017-02-24 Thread Petr Jelinek
On 25/02/17 00:05, Erik Rijkers wrote:
> On 2017-02-24 22:58, Petr Jelinek wrote:
>> On 23/02/17 01:41, Petr Jelinek wrote:
>>> On 23/02/17 01:02, Erik Rijkers wrote:
 On 2017-02-22 18:13, Erik Rijkers wrote:
> On 2017-02-22 14:48, Erik Rijkers wrote:
>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>
>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>
>> It works well now, or at least my particular test case seems now
>> solved.
>
> Cried victory too early, I'm afraid.
>

 I got into a 'hung' state while repeating  pgbench_derail2.sh.

 Below is some state.  I notice that master pg_stat_replication.state is
 'startup'.
 Maybe I should only start the test after that state has changed. Any
 of the
 other possible values (catchup, streaming) would be OK, I would think.

>>>
>>> I think that's a known issue (see comment in tablesync.c about hanging
>>> forever). I think I may have fixed it locally.
>>>
>>> I will submit a patch once I have fixed the other snapshot issue (I managed to
>>> reproduce it as well, although very rarely, so it's rather hard to test).
>>>
>>
>> Hi,
>>
>> Here it is. But check also the snapbuild related thread for updated
>> patches related to that (the issue you had with this not copying all
>> rows is yet another pre-existing Postgres bug).
>>
> 
> 
> The four earlier snapbuild patches apply cleanly, but
> then I get errors while  applying
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch:
> 
> 
> patching file src/test/regress/expected/sanity_check.out
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/expected/subscription.out
> Hunk #2 FAILED at 25.
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/test/regress/expected/subscription.out.rej
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/sql/object_address.sql
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/sql/subscription.sql
> (Stripping trailing CRs from patch.)
> patching file src/test/subscription/t/001_rep_changes.pl
> Hunk #9 succeeded at 175 with fuzz 2.
> Hunk #10 succeeded at 193 (offset -9 lines).
> (Stripping trailing CRs from patch.)
> patching file src/test/subscription/t/002_types.pl
> (Stripping trailing CRs from patch.)
> can't find file to patch at input line 4296
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --
> |diff --git a/src/test/subscription/t/003_constraints.pl
> b/src/test/subscription/t/003_constraints.pl
> |index 17d4565..9543b91 100644
> |--- a/src/test/subscription/t/003_constraints.pl
> |+++ b/src/test/subscription/t/003_constraints.pl
> --
> File to patch:
> 
> 
> Can you have a look?
> 

Hi,

I think you are missing the other fixes (this specific error says that
you didn't apply
0002-Fix-after-trigger-execution-in-logical-replication.patch).

There are now a lot of fixes for existing code that this patch depends
on. Hopefully some of the fixes get committed soonish.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical replication existing data copy

2017-02-24 Thread Erik Rijkers

On 2017-02-24 22:58, Petr Jelinek wrote:

On 23/02/17 01:41, Petr Jelinek wrote:

On 23/02/17 01:02, Erik Rijkers wrote:

On 2017-02-22 18:13, Erik Rijkers wrote:

On 2017-02-22 14:48, Erik Rijkers wrote:

On 2017-02-22 13:03, Petr Jelinek wrote:


0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
0001-Logical-replication-support-for-initial-data-copy-v5.patch


It works well now, or at least my particular test case seems now 
solved.


Cried victory too early, I'm afraid.



I got into a 'hung' state while repeating  pgbench_derail2.sh.

Below is some state.  I notice that master pg_stat_replication.state is
'startup'.
Maybe I should only start the test after that state has changed. Any of the
other possible values (catchup, streaming) would be OK, I would think.




I think that's a known issue (see comment in tablesync.c about hanging
forever). I think I may have fixed it locally.

I will submit a patch once I have fixed the other snapshot issue (I managed
to reproduce it as well, although very rarely, so it's rather hard to test).




Hi,

Here it is. But check also the snapbuild related thread for updated
patches related to that (the issue you had with this not copying all
rows is yet another pre-existing Postgres bug).




The four earlier snapbuild patches apply cleanly, but
then I get errors while  applying
0001-Logical-replication-support-for-initial-data-copy-v6.patch:


patching file src/test/regress/expected/sanity_check.out
(Stripping trailing CRs from patch.)
patching file src/test/regress/expected/subscription.out
Hunk #2 FAILED at 25.
1 out of 2 hunks FAILED -- saving rejects to file 
src/test/regress/expected/subscription.out.rej

(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/object_address.sql
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/subscription.sql
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/001_rep_changes.pl
Hunk #9 succeeded at 175 with fuzz 2.
Hunk #10 succeeded at 193 (offset -9 lines).
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/002_types.pl
(Stripping trailing CRs from patch.)
can't find file to patch at input line 4296
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git a/src/test/subscription/t/003_constraints.pl 
b/src/test/subscription/t/003_constraints.pl

|index 17d4565..9543b91 100644
|--- a/src/test/subscription/t/003_constraints.pl
|+++ b/src/test/subscription/t/003_constraints.pl
--
File to patch:


Can you have a look?

thanks,

Erik Rijkers





Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Tom Lane
Jeff Janes  writes:
> On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane  wrote:
>> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
>> time, you don't need any searching.

> Currently it is walking the chain to identify which block holds the chunk
> in the first place, not just to get the pointer to the previous block.  But
> I see that that could be fixed by pointer arithmetic once there is a reason
> to fix it. Which there isn't a reason to as long as you need to walk the
> chain to get the prev pointer anyway.  Like this:?
> targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

Right.  We're just turning around the existing address calculation.  It'd
be a good idea to provide some cross-check that we found a sane-looking
block header, but that's not that hard --- we know what ought to be in
aset, freeptr, and endptr for a single-chunk block's header, and that
seems like enough of a crosscheck to me.
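
A minimal sketch of what that cross-check could look like in AllocSetFree()
(assuming the usual aset.c definitions of AllocBlock/AllocChunk and the
header-size macros; the attached patch is the authoritative version):

	/* Locate the single-chunk block header by pointer arithmetic. */
	AllocBlock	block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);

	/*
	 * Cross-check that this really is a single-chunk block: it must belong
	 * to this context, and freeptr/endptr must point just past the chunk.
	 */
	if (block->aset != set ||
		block->freeptr != block->endptr ||
		block->freeptr != ((char *) block) +
		(chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
		elog(ERROR, "could not find block containing chunk %p", chunk);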

Concretely, something like the attached.  This passes regression tests
but I've not pushed on it any harder than that.

One argument against this is that it adds some nonzero amount of overhead
to block management; but considering that we are calling malloc or realloc
or free alongside any one of these manipulations, that overhead should be
pretty negligible.  I'm a bit more worried about increasing the alloc
block header size by 8 bytes, but that would only really matter with a
very small allocChunkLimit.

regards, tom lane

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..c250bae 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*** typedef AllocSetContext *AllocSet;
*** 201,207 
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	next;			/* next block in aset's blocks list */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  }	AllocBlockData;
--- 201,208 
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	prev;			/* prev block in aset's blocks list, if any */
! 	AllocBlock	next;			/* next block in aset's blocks list, if any */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  }	AllocBlockData;
*** AllocSetContextCreate(MemoryContext pare
*** 522,528 
--- 523,532 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  		/* Mark block as not to be released at reset time */
  		set->keeper = block;
*** AllocSetReset(MemoryContext context)
*** 604,609 
--- 608,614 
  			VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
  #endif
  			block->freeptr = datastart;
+ 			block->prev = NULL;
  			block->next = NULL;
  		}
  		else
*** AllocSetAlloc(MemoryContext context, Siz
*** 710,725 
  #endif
  
  		/*
! 		 * Stick the new block underneath the active allocation block, so that
! 		 * we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
  			block->next = set->blocks->next;
  			set->blocks->next = block;
  		}
  		else
  		{
  			block->next = NULL;
  			set->blocks = block;
  		}
--- 715,734 
  #endif
  
  		/*
! 		 * Stick the new block underneath the active allocation block, if any,
! 		 * so that we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
+ 			block->prev = set->blocks;
  			block->next = set->blocks->next;
+ 			if (block->next)
+ 				block->next->prev = block;
  			set->blocks->next = block;
  		}
  		else
  		{
+ 			block->prev = NULL;
  			block->next = NULL;
  			set->blocks = block;
  		}
*** AllocSetAlloc(MemoryContext context, Siz
*** 900,906 
--- 909,918 
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  
+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  	}
  
*** AllocSetFree(MemoryContext context, void
*** 960,988 
  	{
  		/*
  		 * Big chunks are certain to have been allocated as single-chunk
! 		 * blocks.  Find the containing block and return it to malloc().
  		 */
! 		AllocBlock	block = set->blocks;
! 		AllocBlock	prevblock = NULL;
  
! 		while (block != NULL)
! 		{
! 			if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
! 				break;
! 			prevblock = block;
! 			block = block->next;
! 		}
! 		if (block == NULL)
  			elog(ERROR, "could not find block containing chunk %p", chunk);
- 		/* let's just make sure chunk is the only one in 

Re: [HACKERS] btree_gin and btree_gist for enums

2017-02-24 Thread Andrew Dunstan


On 02/24/2017 02:55 PM, Andrew Dunstan wrote:
>
> On 02/24/2017 11:02 AM, Tom Lane wrote:
>>> I don't know what to call it either. In my test I used
>>> CallerContextFunctionCall2 - not sure if that's quite right, but should
>>> be close.
>> CallerInfo?  CallerFInfo?  Or we could spell out CallerFmgrInfo but
>> that seems a bit verbose.
>>
>>  
> I'll go with CallerFInfoFunctionCall2 etc.
>
> In the btree_gist system the calls to the routines like enum_cmp are
> buried about three levels deep. I'm thinking I'll just pass the flinfo
> down the stack and just call these routines at the bottom level.
>
>
>


It's occurred to me that we could reduce the code clutter in fmgr.c a
bit by turning the DirectFunctionCall{n}Coll functions into macros
calling these functions and passing NULL as the flinfo param.
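
A sketch of the idea (hypothetical, and assuming CallerFInfoFunctionCall2
ends up with the (func, flinfo, collation, arg1, arg2) signature discussed
upthread), e.g.:

	#define DirectFunctionCall2Coll(func, collation, arg1, arg2) \
		CallerFInfoFunctionCall2(func, NULL, collation, arg1, arg2)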

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-24 Thread Peter Geoghegan
On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas  wrote:
> I think this thread is pretty short on evidence that would let us make
> a smart decision about what to do here.  I see three possibilities.
> The first is that this patch is a good idea whether we do something
> about the issue of half-dead pages or not.  The second is that this
> patch is a good idea if we do something about the issue of half-dead
> pages but a bad idea if we don't.  The third is that this patch is a
> bad idea whether or not we do anything about the issue of half-dead
> pages.

Half-dead pages are not really relevant to this discussion, AFAICT. I
think that both you and Simon mean "recyclable" pages. There are
several levels of indirection involved here, to keep the locking very
granular, so it gets tricky to talk about.

B-Tree page deletion is like a page split in reverse. It has a
symmetry with page splits, which have two phases (atomic operations).
There are also two phases for deletion, the first of which leaves the
target page without a downlink in its parent, and marks it half dead.
By the end of the first phase, there are still sibling pointers, so an
index scan can land on them before the second phase of deletion begins
-- they can visit a half-dead page before such time as the second
phase of deletion begins, where the sibling link goes away. So, the
sibling link isn't stale as such, but the page is still morally dead.
(Second phase is where we remove even the sibling links, and declare
it fully dead.)

Even though there are two phases of deletion, the second still occurs
immediately after the first within VACUUM. The need to have two phases
is hard to explain, so I won't try, but it suffices to say that VACUUM
does not actually ever leave a page half dead unless there is a hard
crash.

Recall that VACUUMing of a B-Tree is performed sequentially, so blocks
can be recycled without needing to be found via a downlink or sibling
link by VACUUM. What is at issue here, then, is VACUUM's degree of
"eagerness" around putting *fully* dead B-Tree pages in the FSM for
recycling. The interlock with RecentGlobalXmin is what makes it
impossible for VACUUM to generally fully delete pages, *as well as*
mark them as recyclable (put them in the FSM) all at once.
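
To make the interlock concrete, the recyclability test is roughly the
following (a paraphrase of nbtpage.c's _bt_page_recyclable(), written from
memory rather than quoted verbatim):

	static bool
	page_is_recyclable(Page page)
	{
		BTPageOpaque opaque;

		/* A freshly-initialized (all-zero) page can always be reused. */
		if (PageIsNew(page))
			return true;

		/*
		 * Otherwise the page must be fully deleted, and deleted long enough
		 * ago that no scan still in flight can hold a link to it.
		 */
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		if (P_ISDELETED(opaque) &&
			TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
			return true;

		return false;
	}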

Maybe you get this already, since, as I said, the terminology is
tricky in this area, but I can't tell.

-- 
Peter Geoghegan




Re: [HACKERS] PATCH: two slab-like memory allocators

2017-02-24 Thread Andres Freund
Hi,


On 2017-02-21 01:43:46 +0100, Tomas Vondra wrote:
> Attached is v9 of this patch series. This addresses most of the points
> raised in the review, namely:

Cool, thanks.

> 3) get rid of the block-level bitmap tracking free chunks
>
> Instead of the bitmap, I've used a simple singly-linked list, using int32
> chunk indexes. Perhaps it could use the slist instead, but I'm not quite
> sure MAXALIGN is guaranteed to be greater than pointer.

I'm pretty sure it's guaranteed to be >= sizeof(void*).  But this seems
ok, so ...


Attached are changes I made on the path to committing the patch. These
look large enough that I wanted to give you a chance to comment:

- The AllocFreeInfo changes aren't correct for 32bit systems with 64bit
  longs (like linux, unlike windows) - using %zu is better (z is the code
  for size_t sized)
- Split off the reorderbuffer changes
- Removed SlabBlock/SlabChunk / renamed SlabBlockData
-
  +#ifdef MEMORY_CONTEXT_CHECKING
  +#define SLAB_CHUNK_USED \
  + (offsetof(SlabChunkData, requested_size) + sizeof(Size))
  +#else
  doesn't work anymore, because requested_size isn't a top-level
  field. I first redefined it as (without surrounding ifdef)
  #define SLAB_CHUNK_USED \
(offsetof(SlabChunkData, header) + sizeof(StandardChunkHeader))
  but I'm not really sure there's a whole lot of point in the define
  rather than just using sizeof() on the whole thing - there's no trailing
  padding?
- SLAB_CHUNK_PUBLIC and SLAB_BLOCKHDRSZ are unused?
- renamed 'set' variables (and comments) to slab.
- used castNode(SlabContext, context) instead of manual casts
- I rephrased
  + *
  + *   We cache indexes of the first empty chunk on each block 
(firstFreeChunk),
  + *   and freelist index for blocks with least free chunks (minFreeChunks), so
  + *   that we don't have to search the freelist and block on every SlabAlloc()
  + *   call, which is quite expensive.
so it's not referencing firstFreeChunk anymore, since that seems to
make less sense now that firstFreeChunk is essentially just the head
of the list of free chunks.
- added typedefs.list entries and pgindented slab.c
- "mark the chunk as unused (zero the bit)" referenced non-existant
   bitmap afaics.
- valgrind was triggering on
block->firstFreeChunk = *(int32 *) SlabChunkGetPointer(chunk);
  because that was previously marked as NOACCESS (via
  wipe_mem). Explicitly marking as DEFINED solves that.
- removed
  * XXX Perhaps we should not be gentle at all and simply fails in all cases,
  * to eliminate the (mostly pointless) uncertainty.
- you'd included MemoryContext tup_context; in 0002, but it's not really
  useful yet (and the comments above it in reorderbuffer.h don't apply)
- SlabFreeInfo/SlabAllocInfo didn't actually compile w/ HAVE_ALLOCINFO
  defined (pre StandardChunkHeader def)
- some minor stuff.

I'm kinda inclined to drop SlabFreeInfo/SlabAllocInfo.

I've not yet looked a lot at the next type of context - I want to get
this much committed first...

I plan to give this another pass sometime this weekend and then push
soon.

- Andres
From cec3f8372137d2392ff7ac0ab1b2db11fc96e8b3 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 23 Feb 2017 22:35:44 -0800
Subject: [PATCH 1/3] Make useful infrastructure from aset.c generally
 available.

An upcoming patch introduces a new type of memory context. To avoid
duplicating debugging infrastructure with aset.c move useful pieces to
memdebug.[ch].

While touching aset.c, fix printf format AllocFree* debug macros.

Author: Tomas Vondra
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/b3b2245c-b37a-e1e5-ebc4-857c914bc...@2ndquadrant.com
---
 src/backend/utils/mmgr/Makefile   |   2 +-
 src/backend/utils/mmgr/aset.c | 115 +-
 src/backend/utils/mmgr/memdebug.c |  93 ++
 src/include/utils/memdebug.h  |  48 
 4 files changed, 144 insertions(+), 114 deletions(-)
 create mode 100644 src/backend/utils/mmgr/memdebug.c

diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index 1842bae386..fc5f793b7f 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = aset.o dsa.o freepage.o mcxt.o portalmem.o
+OBJS = aset.o dsa.o freepage.o mcxt.o memdebug.o portalmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec260..8056c00ae4 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -41,46 +41,6 @@
  *	chunks as chunks.  Anything "large" is passed off to malloc().  Change
  *	the number of freelists to change the small/large boundary.
  *
- *
- *	About CLOBBER_FREED_MEMORY:
- *
- *	If this symbol is defined, all freed memory is overwritten with 0x7F's.
- *	

Re: [HACKERS] snapbuild woes

2017-02-24 Thread Petr Jelinek
On 22/02/17 03:05, Petr Jelinek wrote:
> On 13/12/16 00:38, Petr Jelinek wrote:
>> On 12/12/16 23:33, Andres Freund wrote:
>>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
 On 12/12/16 22:42, Andres Freund wrote:
> Hi,
>
> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>> Hi,
>> First one is outright bug, which has to do with how we track running
>> transactions. What snapbuild basically does while doing initial snapshot
>> is read the xl_running_xacts record, store the list of running txes and
>> then wait until they all finish. The problem with this is that
>> xl_running_xacts does not ensure that it only logs transactions that are
>> actually still running (to avoid locking PGPROC) so there might be xids
>> in xl_running_xacts that already committed before it was logged.
>
> I don't think that's actually true?  Notice how LogStandbySnapshot()
> only releases the lock *after* the LogCurrentRunningXacts() iff
> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
> observed must actually be a bit more complex :(
>

 Hmm, interesting, I did see the transaction commit in the WAL before the
 xl_running_xacts that contained the xid as running. I only seen it on
 production system though, didn't really manage to easily reproduce it
 locally.
>>>
>>> I suspect the reason for that is that RecordTransactionCommit() doesn't
>>> conflict with ProcArrayLock in the first place - only
>>> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
>>> sense, just not the crash-recovery sense...
>>>
>>
>> That looks like a reasonable explanation. BTW I realized my patch needs a
>> bit more work; currently it will break the actual snapshot, as it behaves
>> the same as if xl_running_xacts was empty, which is not correct AFAICS.
>>
> 
> Hi,
> 
> I got to work on this again. Unfortunately I haven't found a solution that
> I would be very happy with. What I did is: in case we read an
> xl_running_xacts in which all the transactions we track have finished, we
> start tracking from that new xl_running_xacts again, with the difference
> that we clean up the running transactions based on previously seen
> committed ones. That means that on a busy server we may wait for multiple
> xl_running_xacts rather than just one, but at least we have a chance to
> finish, unlike with the current coding, which basically waits for an empty
> xl_running_xacts. I also removed the additional locking for logical
> wal_level in LogStandbySnapshot() since it does not work.

Not hearing any opposition to this idea, I decided to polish this and
also optimize it a bit.

That being said, thanks to testing from Erik Rijkers I've identified one
more bug in how we do the initial snapshot. Apparently we don't reserve
the global xmin when we start building the initial exported snapshot for
a slot (we only reserve catalog_xmin, which is fine for logical decoding
but not for the exported snapshot), so VACUUM and heap pruning will
happily delete old versions of rows that are still needed by anybody
trying to use that exported snapshot.

Attached are updated versions of patches:

0001 - Fixes the above mentioned global xmin tracking issues. Needs to
be backported all the way to 9.4

0002 - Removes use of the logical decoding saved snapshots for initial
exported snapshot since those snapshots only work for catalogs and not
user data. Also needs to be backported all the way to 9.4.

0003 - Changes handling of the xl_running_xacts in initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

0004 - Improves performance of initial snapshot building by skipping
catalog snapshot build for transactions that don't do catalog changes.

The 0001 and 0002 are bug fixes, because without them the exported
snapshots are basically corrupted. The 0003 and 0004 are performance
improvements, but since on busy servers the snapshot export might never
happen, they address rather serious performance issues.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 67a44702ff146756b33e8d15e91a02f5d9e86792 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/4] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
 src/backend/replication/logical/logical.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9062244 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
 	 * the slot machinery about the new limit. Once that's done the

[HACKERS] Re: proposal - psql: possibility to specify sort for describe commands, when size is printed

2017-02-24 Thread Pavel Stehule
Hi

2017-02-23 12:17 GMT+01:00 Pavel Stehule :

> Hi
>
> Currently it is not possible to control the sort columns for \d* commands.
> Usually schema and table name are used. A really common task is to find the
> biggest objects in a database. "\dt+, \di+" shows the necessary information,
> but not in a practical order.
>
> Instead of introducing some additional flags for backslash commands, I
> propose a special psql variable that can be used to specify the order
> used when a plus command is used.
>
> something like
>
> set EXTENDED_DESCRIBE_SORT size_desc
> \dt+
> \l+
> \di+
>
> Possible variants: schema_table, table_schema, size_desc, size_asc
>
> Comments, notes?
>

here is a patch

Regards

Pavel


>
> Regards
>
> Pavel
>
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index ae58708aae..b4dfd1f71c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3495,6 +3495,18 @@ bar
   
 
   
+VERBOSE_SORT
+
+
+This variable can be set to the values schema_name,
+name_schema, size_asc, or
+size_desc to control the order of content of 
+describe command.
+
+
+  
+
+  
 VERBOSITY
 
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index e2e4cbcc08..7ae5992b90 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -263,7 +263,18 @@ describeTablespaces(const char *pattern, bool verbose)
 		  NULL, "spcname", NULL,
 		  NULL);
 
-	appendPQExpBufferStr(&buf, "ORDER BY 1;");
+
+	if (verbose && pset.sversion >= 90200)
+	{
+		if (pset.verbose_sort == PSQL_SORT_SIZE_ASC)
+			appendPQExpBufferStr(&buf, "ORDER BY pg_catalog.pg_tablespace_size(oid), 1;");
+		else if (pset.verbose_sort == PSQL_SORT_SIZE_DESC)
+			appendPQExpBufferStr(&buf, "ORDER BY pg_catalog.pg_tablespace_size(oid) DESC, 1;");
+		else
+			appendPQExpBufferStr(&buf, "ORDER BY 1;");
+	}
+	else
+		appendPQExpBufferStr(&buf, "ORDER BY 1;");
 
 	res = PSQLexec(buf.data);
	termPQExpBuffer(&buf);
@@ -822,7 +833,21 @@ listAllDbs(const char *pattern, bool verbose)
		processSQLNamePattern(pset.db, &buf, pattern, false, false,
 			  NULL, "d.datname", NULL, NULL);
 
-	appendPQExpBufferStr(&buf, "ORDER BY 1;");
+	if (verbose && pset.sversion >= 80200)
+	{
+		if (pset.verbose_sort == PSQL_SORT_SIZE_ASC)
+			appendPQExpBuffer(&buf,
+		  "ORDER BY CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')\n"
+		  "  THEN pg_catalog.pg_database_size(d.datname) END ASC, 1;\n");
+		else if (pset.verbose_sort == PSQL_SORT_SIZE_DESC)
+			appendPQExpBuffer(&buf,
+		  "ORDER BY CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')\n"
+		  "  THEN pg_catalog.pg_database_size(d.datname) END DESC, 1;\n");
+		else
+			appendPQExpBufferStr(&buf, "ORDER BY 1;");
+	}
+	else
+		appendPQExpBufferStr(&buf, "ORDER BY 1;");
	res = PSQLexec(buf.data);
	termPQExpBuffer(&buf);
 	if (!res)
@@ -3258,7 +3283,29 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 		  "n.nspname", "c.relname", NULL,
 		  "pg_catalog.pg_table_is_visible(c.oid)");
 
-	appendPQExpBufferStr(&buf, "ORDER BY 1,2;");
+	if (verbose && pset.sversion >= 80100)
+	{
+		if (pset.verbose_sort == PSQL_SORT_SCHEMA_NAME)
+			appendPQExpBufferStr(&buf, "ORDER BY 1,2;");
+		else if (pset.verbose_sort == PSQL_SORT_NAME_SCHEMA)
+			appendPQExpBufferStr(&buf, "ORDER BY 2,1;");
+		else
+		{
+			if (pset.sversion >= 9)
+				appendPQExpBufferStr(&buf,
+					"ORDER BY pg_catalog.pg_table_size(c.oid) ");
+			else
+				appendPQExpBufferStr(&buf,
+					"ORDER BY pg_catalog.pg_relation_size(c.oid) ");
+
+			if (pset.verbose_sort == PSQL_SORT_SIZE_DESC)
+				appendPQExpBufferStr(&buf, "DESC, 1,2;");
+			else
+				appendPQExpBufferStr(&buf, "ASC, 1,2;");
+		}
+	}
+	else
+		appendPQExpBufferStr(&buf, "ORDER BY 1,2;");

	res = PSQLexec(buf.data);
	termPQExpBuffer(&buf);
diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index 3e3cab4941..09c1a49413 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -327,7 +327,7 @@ helpVariables(unsigned short int pager)
 	 * Windows builds currently print one more line than non-Windows builds.
 	 * Using the larger number is fine.
 	 */
-	output = PageOutput(88, pager ? &(pset.popt.topt) : NULL);
+	output = PageOutput(90, pager ? &(pset.popt.topt) : NULL);
 
 	fprintf(output, _("List of specially treated variables\n\n"));
 
@@ -364,6 +364,8 @@ helpVariables(unsigned short int pager)
 	fprintf(output, _("  SINGLESTEP single-step mode (same as -s option)\n"));
 	fprintf(output, _("  USER   the currently connected database user\n"));
 	fprintf(output, _("  VERBOSITY  controls verbosity of error reports [default, verbose, terse]\n"));
+	fprintf(output, _("  VERBOSE_SORT   controls sort of result in verbose mode\n"
+	" [schema_name, name_schema, size_desc, size_asc]\n"));
 
 	fprintf(output, _("\nDisplay 

Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 10:30 PM, Bruce Momjian  wrote:
> On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
>> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian  wrote:
>> > Oh, that's why we will hopefully eventually change the page checksum
>> > algorithm to use the special CRC32 instruction, and set a new checksum
>> > version --- got it.  I assume there is currently no compile-time way to
>> > do this.
>>
>> Using CRC32 as implemented now for the WAL would be significantly
>> slower than what we have now due to instruction latency. Even the best
>> theoretical implementation using the CRC32 instruction would still be
>> about the same speed as what we have now. I haven't seen anybody
>> working on swapping out the current algorithm. And I don't really see
>> a reason to, it would introduce a load of headaches for no real gain.
>
> Uh, I am confused.  I thought you said we were leaving some performance
> on the table.  What is that?   I thought CRC32 was SSE4.1.  Why is CRC32
> good for the WAL but bad for the page checksums?  What about the WAL
> page images?

The page checksum algorithm was designed to take advantage of CPUs
that provide vectorized 32bit integer multiplication. On x86 this was
introduced with SSE4.1 extensions. This means that by default we can't
take advantage of the design. The code is written in a way that
compiler auto-vectorization works on it, so only the appropriate
compilation flags are needed to compile a version that does use vector
instructions. However, to enable it on generic builds, a runtime switch
between different levels of vectorization support is needed. This is
what is leaving the performance on the table.
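
A sketch of what such a runtime switch could look like (the function names
here are made up for illustration, and this is not code from any proposed
patch; checksum_block_sse41() would be the same C loop compiled separately
with -msse4.1, GCC/Clang on x86 only):

	#include <stddef.h>
	#include <stdint.h>

	extern uint32_t checksum_block_scalar(const void *data, size_t len);
	extern uint32_t checksum_block_sse41(const void *data, size_t len);

	typedef uint32_t (*checksum_fn) (const void *data, size_t len);

	static checksum_fn checksum_impl = checksum_block_scalar;

	/* Pick the fastest implementation the CPU supports, once at startup. */
	static void
	choose_checksum_impl(void)
	{
		__builtin_cpu_init();
		if (__builtin_cpu_supports("sse4.1"))
			checksum_impl = checksum_block_sse41;
	}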

The page checksum algorithm we have is extremely fast, memcpy fast.
Even without vectorization it is right up there with Murmurhash3a and
xxHash. With vectorization it's 4x faster. And it works this fast on
most modern CPUs, not only Intel. The downside is that it only works
well for large blocks, and only for fixed power-of-2 sizes with the
current implementation. WAL page images have the page hole removed, so
they can't easily take advantage of this.

That said, I haven't really seen either the hardware-accelerated CRC32
calculation or the non-vectorized page checksum take a noticeable
amount of time on real-world workloads. The benchmarks presented in
this thread seem to corroborate this observation.

Regards,
Ants Aasma




Re: [HACKERS] WIP: About CMake v2

2017-02-24 Thread Bruce Momjian
On Wed, Feb  8, 2017 at 04:52:19PM -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On 2/8/17 6:21 AM, Yuriy Zhuravlev wrote:
> >> Support two build systems it's not big deal really. I have been working
> >> on this past year without any big troubles. 
> >> Also we have second perl build system...
> 
> > The perl/msvc build system pulls in information from the makefiles.  So
> > when you add a file or something basic like that, you don't have to
> > update it.  So it's really more like 1.5 build systems.
> 
> Really it's more like 1.1 build systems, in that the MSVC scripts do that
> just well enough that you *usually* don't have to think about them.  But
> then when they fail, and you have to figure out why, it can be a pain.

If cmake isn't going to be able to query the Makefiles and adjust to
changes we make there, changing our Windows build system from MSVC to
cmake takes us from maintaining 1.1 build systems to two build systems,
and I don't think anyone wants that.

If we go to cmake, I think we need to agree we will eventually _only_
use cmake.  I don't think having two build systems we have to maintain
is better than 1.1 build systems where we can mostly ignore 0.1 of that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Bruce Momjian
On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian  wrote:
> > Oh, that's why we will hopefully eventually change the page checksum
> > algorithm to use the special CRC32 instruction, and set a new checksum
> > version --- got it.  I assume there is currently no compile-time way to
> > do this.
> 
> Using CRC32 as implemented now for the WAL would be significantly
> slower than what we have now due to instruction latency. Even the best
> theoretical implementation using the CRC32 instruction would still be
> about the same speed as what we have now. I haven't seen anybody
> working on swapping out the current algorithm. And I don't really see
> a reason to, it would introduce a load of headaches for no real gain.

Uh, I am confused.  I thought you said we were leaving some performance
on the table.  What is that?   I thought CRC32 was SSE4.1.  Why is CRC32
good for the WAL but bad for the page checksums?  What about the WAL
page images?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-24 Thread Gavin Flower

On 25/02/17 09:02, Jim Nasby wrote:

On 2/24/17 12:28 AM, Robert Haas wrote:

On Thu, Feb 23, 2017 at 6:59 PM, Tom Lane  wrote:

I think these are straw-man arguments, really.  Consider the actual use
case for such a feature: it's for porting some application that was not
written against Postgres to begin with.

I'm not sure that's totally true.  I think at least some requests for
this feature are intended at satisfying somebody's sense of
aesthetics.


If I had $1 for every time I had to chase someone away from using 
camelcase I'd be able to sponsor a key at the next conference. And 
honestly I'd actually like to be able to use camelcase and still get 
easy to read output from \d & co.


IOW, this is definitely NOT driven just by porting efforts. I think 
the only reason we don't hear more requests about it is people 
(grudgingly) just muddle on without it.


I'd love to be able to successfully use camelcase for things like 
variable and table names in pg, without having to quote everything - but 
never felt it worthwhile to ask for it.



Cheers,

Gavin





Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-24 Thread Jim Nasby

On 2/24/17 6:40 AM, Peter Moser wrote:

Do you think it is better to remove the syntax for ranges expressed in
different columns?


It's not that hard to construct a range type on-the-fly from 2 columns, 
so (without having looked at the patch or really followed the thread) I 
would think the answer is yes.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] bytea_output output of base64

2017-02-24 Thread Jim Nasby

On 2/24/17 7:44 AM, Kenneth Marshall wrote:

Like David suggests,
if you want compact, run it through lz4/gzip/lzop...for a much better size
return.


Speaking of which; any bytea where you care about this is likely to live 
in an already compressed state in toast. ISTM it would be valuable if we 
had a way to just spit out the raw compressed data (or a text-safe 
version of that), at least for COPY's purposes...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] FYI: git worktrees as replacement for "rsync the CVSROOT"

2017-02-24 Thread Jim Nasby

On 2/24/17 10:24 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 02/24/2017 02:36 AM, Craig Ringer wrote:

On 16 January 2017 at 05:01, Jim Nasby  wrote:

git worktree add ../9.6 REL9_6_STABLE



Does this do anythng different from the git contrib script
git-new-workdir that I have been using for quite a long while?


I think it's basically a more formally supported version of the contrib
script.  They may have fixed some of the hackier aspects of the contrib
script --- I mind in particular the fact that you need to disable git's
auto-gc activity when you use git-new-workdir, but I don't see any such
restriction in the git-worktree man page.

Haven't tried to switch over myself, but maybe I will at some point.


One thing to be aware of that I discovered: you may not have 2 checkouts 
of the same branch, something that is possible with what's currently 
documented on the wiki. Since the only pain in the wiki workflow is 
setting up a new branch (which I've scripted, attached) I've pretty much 
given up on using worktrees.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
#!/bin/sh

if [ $# -ne 1 ]; then
  echo Error
  exit 1
fi

branch=REL`echo $1 | tr . _`_STABLE

mkdir i/$1
git clone --reference postgresql.git -b $branch git://git.postgresql.org/git/postgresql.git $1

cd $1
ln -s ../i/$1 i

cd .git/info
ln -sf ../../../git-info-exclude exclude



Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 9:49 PM, Jim Nasby  wrote:
> On 2/24/17 12:30 PM, Tomas Vondra wrote:
>>
>> In any case, we can't just build x86-64 packages with compile-time
>> SSE4.1 checks.
>
>
> Dumb question... since we're already discussing llvm for the executor, would
> that potentially be an option here? AIUI that also opens the possibility of
> using the GPU as well.

Just transferring the block to the GPU would be slower than what we
have now. Theoretically LLVM could be used to JIT the checksum
calculation, but just precompiling a couple of versions and switching
between them at runtime would be simpler and would give the same
speedup.

Regards,
Ants Aasma




Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-24 Thread Jim Nasby

On 2/24/17 11:26 AM, Robert Haas wrote:

I think we need to come up with some set of tests to figure out what
actually works well in practice here.  Theories are a good starting
point, but good vacuum behavior is really important, and a patch that
changes it ought to be backed up by at least some experimental
evidence.


I think something else worth considering is that if we had some method 
of mapping heap TIDs back to indexes then a lot (all?) of these problems 
would go away. 10+ years ago the idea of keeping such a mapping would 
probably have been untenable, but with relation forks and how much cheaper 
storage is maybe that's no longer the case.


For btree I think this could be done by keeping a second btree ordered 
by ctid that points either to index entries or even just to whole index 
pages. At ~ 20 bytes per entry, even a 1B row index would take ~20GB.


Page splits are obviously a big issue. Maybe it's safe to update the 
ctid map for every item that gets moved when a split happens.


Would a ctid map work for other indexes as well?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian  wrote:
> Oh, that's why we will hopefully eventually change the page checksum
> algorithm to use the special CRC32 instruction, and set a new checksum
> version --- got it.  I assume there is currently no compile-time way to
> do this.

Using CRC32 as implemented now for the WAL would be significantly
slower than what we have now due to instruction latency. Even the best
theoretical implementation using the CRC32 instruction would still be
about the same speed as what we have now. I haven't seen anybody
working on swapping out the current algorithm. And I don't really see
a reason to, it would introduce a load of headaches for no real gain.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 10:02 PM, Bruce Momjian  wrote:
> Uh, as far as I know, the best you are going to get from llvm is
> standard assembly, while the SSE4.1 instructions use special assembly
> instructions, so they would be faster, and in a way they are a GPU built
> into CPUs.

Both LLVM and GCC are capable of compiling the code that we have to a
vectorized loop using SSE4.1 or AVX2 instructions given the proper
compilation flags. This is exactly what was giving the speedup in the
test I showed in my e-mail.

Regards,
Ants Aasma




Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-24 Thread Jim Nasby

On 2/24/17 12:28 AM, Robert Haas wrote:

On Thu, Feb 23, 2017 at 6:59 PM, Tom Lane  wrote:

I think these are straw-man arguments, really.  Consider the actual use
case for such a feature: it's for porting some application that was not
written against Postgres to begin with.

I'm not sure that's totally true.  I think at least some requests for
this feature are intended at satisfying somebody's sense of
aesthetics.


If I had $1 for every time I had to chase someone away from using 
camelcase I'd be able to sponsor a key at the next conference. And 
honestly I'd actually like to be able to use camelcase and still get easy 
to read output from \d & co.


IOW, this is definitely NOT driven just by porting efforts. I think the 
only reason we don't hear more requests about it is people (grudgingly) 
just muddle on without it.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Bruce Momjian
On Fri, Feb 24, 2017 at 01:49:07PM -0600, Jim Nasby wrote:
> On 2/24/17 12:30 PM, Tomas Vondra wrote:
> >In any case, we can't just build x86-64 packages with compile-time
> >SSE4.1 checks.
> 
> Dumb question... since we're already discussing llvm for the executor, would
> that potentially be an option here? AIUI that also opens the possibility of
> using the GPU as well.

Uh, as far as I know, the best you are going to get from llvm is
standard assembly, while the SSE4.1 instructions use special assembly
instructions, so they would be faster, and in a way they are a GPU built
into CPUs.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Jeff Janes
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane  wrote:

> Jeff Janes  writes:
> > On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane  wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > I don't think that would help.  You would need some heuristic to guess
> > whether the chunk you are looking for is near the front, or near the end.
>
> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
> time, you don't need any searching.  It basically becomes
>   item->prev->next = item->next;
>   item->next->prev = item->prev;
> modulo possible special cases for the head and tail elements.
>

Currently it is walking the chain to identify which block holds the chunk
in the first place, not just to get the pointer to the previous block.  But
I see that that could be fixed by pointer arithmetic once there is a reason
to fix it, and there isn't such a reason as long as you need to walk the
chain to get the prev pointer anyway.  Something like this?

targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);
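
For illustration, here is a minimal, self-contained sketch of how those two
pieces could fit together: deriving the block header from the chunk pointer
and then unlinking the block in O(1) once blocks carry a prev pointer. The
struct and field names below are invented for the sketch and are not the
real aset.c definitions.

#include <stddef.h>

/*
 * Sketch only, NOT the real aset.c structures: a dedicated (single-chunk)
 * block that carries both next and prev pointers, so freeing its chunk can
 * locate the block by pointer arithmetic and unlink it without walking the
 * context's block list.
 */
typedef struct SketchBlock
{
	struct SketchSet   *aset;		/* owning memory context */
	struct SketchBlock *prev;		/* hypothetical back link */
	struct SketchBlock *next;		/* forward link */
	/* the chunk header and its data follow immediately in the same allocation */
} SketchBlock;

typedef struct SketchSet
{
	SketchBlock *blocks;			/* head of the block list */
} SketchSet;

#define SKETCH_BLOCKHDRSZ sizeof(SketchBlock)

static void
sketch_unlink_dedicated_block(SketchSet *set, void *chunk)
{
	/* the chunk sits right after the block header, as in the line above */
	SketchBlock *block = (SketchBlock *) ((char *) chunk - SKETCH_BLOCKHDRSZ);

	if (block->prev != NULL)
		block->prev->next = block->next;
	else
		set->blocks = block->next;	/* block was the list head */

	if (block->next != NULL)
		block->next->prev = block->prev;
	/* caller can now release the whole block in one step */
}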


> >> Although if the
> >> hash code is producing a whole lot of requests that are only a bit
> bigger
> >> than the separate-block threshold, I'd say It's Doing It Wrong.  It
> should
> >> learn to aggregate them into larger requests.
>
> > Right now it is using compiled-in 32KB chunks.  Should it use something
> > like max(32kb,work_mem/128) instead?
>
> I'd say it should double the size of the request each time.  That's what
> we do in most places.
>

I thought the design goal there was that the space in old chunks could get
re-used by the new chunks in a reasonably fine-grained way. If the last
chunk contains just over half of all the data, that couldn't happen.

Cheers,

Jeff


Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-24 Thread Jim Nasby

On 2/24/17 11:34 AM, Joel Jacobson wrote:

SELECT  SomeCol,  OtherCol,   FooCol,   BarCol,   MyCol,   ExtraCol,   LastCol
INTO _SomeCol, _OtherCol, _FooCol, _BarCol, _MyCol, _ExtraCol, _LastCol
FROM Foo
WHERE Bar = 'Baz';

This is to avoid typos that are then visually easy to spot, thanks to
all chars being aligned.


Why not just use a record or the table composite? I'll commonly do stuff 
like:


DECLARE
  r record;
BEGIN
  SELECT INTO STRICT r
      blah, foo, bar, baz
    FROM pirate
  ;

  IF r.blah THEN RAISE 'Yaaar!'; END IF;
...

(Well, to be honest I always try to write pirate apps in plR... ;P)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] btree_gin and btree_gist for enums

2017-02-24 Thread Andrew Dunstan


On 02/24/2017 11:02 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 02/23/2017 04:41 PM, Tom Lane wrote:
>>> The reason this is kind of scary is that it's just blithely assuming
>>> that the function won't look at the *other* fields of the FmgrInfo.
>>> If it did, it would likely get very confused, since those fields
>>> would be describing the GIN support function, not the function we're
>>> calling.
>>>
>>> We could alternatively have this trampoline function set up a fresh, local
>>> FmgrInfo struct that it zeroes except for copying fn_extra and fn_mcxt
>>> from the caller's struct, and then it copies fn_extra back again on the
>>> way out.  That's a few more cycles but it would be safer, I think; if the
>>> function tried to look at the other fields such as fn_oid it would see
>>> obviously bogus data.
>> Do we want one or both of these? I'm prepared to code up a patch to
>> fmgr.[ch] to provide them.
> On reflection I'm not sure that the double-copy approach is all that much
> safer than just passing down the caller's flinfo pointer.  Most of the
> time it would be better, but suppose that the callee updates fn_extra
> and then throws elog(ERROR) --- the outcome would be different, probably
> creating a leak in fn_mcxt.  Maybe this would still be okay, because
> perhaps that FmgrInfo is never used again, but I don't think we can assume
> that for the case at hand.
>
> At this point I'd be inclined to just document that the called function
> should only use fn_extra/fn_mcxt.


fair enough. Will do it that way.


>
>> I don't know what to call it either. In my test I used
>> CallerContextFunctionCall2 - not sure if that's quite right, but should
>> be close.
> CallerInfo?  CallerFInfo?  Or we could spell out CallerFmgrInfo but
> that seems a bit verbose.
>
>   

I'll go with CallerFInfoFunctionCall2 etc.

In the btree_gist system the calls to the routines like enum_cmp are
buried about three levels deep. I'm thinking I'll just pass the flinfo
down the stack and just call these routines at the bottom level.


cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums by default?

2017-02-24 Thread Jim Nasby

On 2/24/17 12:30 PM, Tomas Vondra wrote:

In any case, we can't just build x86-64 packages with compile-time
SSE4.1 checks.


Dumb question... since we're already discussing llvm for the executor, 
would that potentially be an option here? AIUI that also opens the 
possibility of using the GPU as well.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums by default?

2017-02-24 Thread Bruce Momjian
On Fri, Feb 24, 2017 at 08:31:09PM +0200, Ants Aasma wrote:
> >> We looked at that when picking the algorithm. At that point it seemed
> >> that CRC CPU instructions were not universal enough to rely on them.
> >> The algorithm we ended up on was designed to be fast on SIMD hardware.
> >> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
> >> with default compiles there is a lot of performance left on table. A
> >> low hanging fruit would be to do CPU detection like the CRC case and
> >> enable a SSE4.1 optimized variant when those instructions are
> >> available. IIRC it was actually a lot faster than the naive hardware
> >> CRC that is used for WAL and about on par with interleaved CRC.
> >
> > Uh, I thought already did compile-time testing for SSE4.1 and used them
> > if present.  Why do you say "with default compiles there is a lot of
> > performance left on table?"
> 
> Compile time checks don't help because the compiled binary could be
> run on a different host that does not have SSE4.1 (as extremely
> unlikely as it is at this point of time). A runtime check is done for

Right.

> WAL checksums that use a special CRC32 instruction. Block checksums
> predate that and use a different algorithm that was picked because it
> could be accelerated with vectorized execution on non-Intel
> architectures. We just never got around to adding runtime checks for
> the architecture to enable this speedup.

Oh, that's why we will hopefully eventually change the page checksum
algorithm to use the special CRC32 instruction, and set a new checksum
version --- got it.  I assume there is currently no compile-time way to
do this.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Tom Lane
Jeff Janes  writes:
> On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane  wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> I don't think that would help.  You would need some heuristic to guess
> whether the chunk you are looking for is near the front, or near the end.

Uh, what?  In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching.  It basically becomes
  item->prev->next = item->next;
  item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.

>> Although if the
>> hash code is producing a whole lot of requests that are only a bit bigger
>> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
>> learn to aggregate them into larger requests.

> Right now it is using compiled-in 32KB chunks.  Should it use something
> like max(32kb,work_mem/128) instead?

I'd say it should double the size of the request each time.  That's what
we do in most places.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Jeff Janes
On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund  wrote:

> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> > Jeff Janes  writes:
> > > The number of new chunks can be almost as as large as the number of old
> > > chunks, especially if there is a very popular value.  The problem is
> that
> > > every time an old chunk is freed, the code in aset.c around line 968
> has to
> > > walk over all the newly allocated chunks in the linked list before it
> can
> > > find the old one being freed.  This is an N^2 operation, and I think
> it has
> > > horrible CPU cache hit rates as well.
> >
> > Maybe it's time to convert that to a doubly-linked list.
>
> Yes, I do think so. Given that we only have that for full blocks, not
> for small chunks, the cost seems neglegible.
>
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed
> -0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.
>
> Jeff, do you have a handy demonstrator?
>


Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB
of disk and 16GB of RAM, but here is a demonstration.

create table foobar as select CASE when random()<(420.0/540.0) then 1 else
floor(random()*88)::int END as titleid from
generate_series(1,54000);

create table foobar2 as select distinct titleid from foobar ;

analyze;

set work_mem TO "13GB";

select count(*) from foobar2 where not exists (select 1 from foobar t where
t.titleid=foobar2.titleid);

This will run for effectively forever, unless it gets killed/dies due to
OOM first.  If I have other things consuming some of my 16GB of RAM, then
it gets OOM (which is not a complaint: it is as one should expect given
that I told it work_mem was 13GB). If I have no other demands on RAM, then
I can't tell if it would eventually OOM or not because of how long it would
take to get that far.

This is inspired by the thread
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com#cacw4t0p4lzd6vpwptxgpgotmh2dektqbgu7ntaj1+a0prx1...@mail.gmail.com

I ran into this while evaluating Tom's responding patch, but you don't need
to apply that patch to run this example and see the effect.  I don't have
an example of a case that demonstrates the problem in the absence of a
degenerate hash bucket.  I think there should be non-degenerate cases that
trigger it, but I haven't been able to produce one yet.


> > Although if the
> > hash code is producing a whole lot of requests that are only a bit bigger
> > than the separate-block threshold, I'd say It's Doing It Wrong.  It
> should
> > learn to aggregate them into larger requests.
>
> That's probably right, but we can't really address that in the
> back-branches.  And to me this sounds like something we should address
> in the branches, not just in master.  Even if we'd also fix the
> hash-aggregation logic, I think such an O(n^2) behaviour in the
> allocator is a bad idea in general, and we should fix it anyway.
>

I don't know how important a back-patch would be.  This is a toy case for
me.  Presumably not for David Hinkle, except he doesn't want a hash join in
the first place, while I want one that finishes in a reasonable time. It
might depend on what happens to Tom's OOM patch.

It would be great if the allocator was made bullet-proof, but I don't think
adding reverse links (or anything else back-patchable) is going to be
enough to do that.

Cheers,

Jeff


Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian  wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support 
>> > the
>> > old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought already did compile-time testing for SSE4.1 and used them
> if present.  Why do you say "with default compiles there is a lot of
> performance left on table?"

Compile time checks don't help because the compiled binary could be
run on a different host that does not have SSE4.1 (as extremely
unlikely as it is at this point of time). A runtime check is done for
WAL checksums that use a special CRC32 instruction. Block checksums
predate that and use a different algorithm that was picked because it
could be accelerated with vectorized execution on non-Intel
architectures. We just never got around to adding runtime checks for
the architecture to enable this speedup.
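
As a rough illustration of that kind of runtime check (not the actual
PostgreSQL checksum code), GCC 6+ on x86-64 Linux can build several variants
of one function and pick among them at load time based on the CPU it is
running on; the loop below is only a stand-in for the real algorithm:

#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in only: this is NOT PostgreSQL's page checksum algorithm.  The
 * point is the dispatch: target_clones makes GCC emit a default, an SSE4.1
 * and an AVX2 clone of the function plus a resolver that selects one at
 * program startup based on the CPU actually present (requires GCC 6+ and
 * ifunc support, e.g. glibc on Linux).
 */
__attribute__((target_clones("default", "sse4.1", "avx2")))
uint32_t
checksum_standin(const char *data, size_t len)
{
	uint32_t	sums[4] = {0, 0, 0, 0};
	size_t		i;

	/* independent accumulators give the vectorizer something to work with */
	for (i = 0; i + 4 <= len; i += 4)
	{
		sums[0] = sums[0] * 31 + (unsigned char) data[i];
		sums[1] = sums[1] * 31 + (unsigned char) data[i + 1];
		sums[2] = sums[2] * 31 + (unsigned char) data[i + 2];
		sums[3] = sums[3] * 31 + (unsigned char) data[i + 3];
	}
	return sums[0] ^ sums[1] ^ sums[2] ^ sums[3];
}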

The attached test runs 1M iterations of the checksum about 3x faster
when compiled with SSE4.1 and vectorization, 4x if AVX2 is added into
the mix.

Test:
gcc $CFLAGS -Isrc/include -DN=1000000 testchecksum.c -o testchecksum
&& time ./testchecksum

Results:
CFLAGS="-O2": 2.364s
CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
CFLAGS="-O2 -mavx2 -ftree-vectorize": 0.552s

That 0.552s is 15GB/s per core on a 3 year old laptop.

Regards,
Ants Aasma
#include "postgres.h"
#include "storage/checksum_impl.h"

int main() {
	char page[8192] = {0};
	uint32 i, sum = 0;

	/* checksum the same zeroed page N times, varying only the block number */
	for (i = 0; i < N; i++)
		sum ^= pg_checksum_page(page, i);

	/* return the accumulated sum so the loop cannot be optimized away */
	return (int) sum;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums by default?

2017-02-24 Thread Tomas Vondra

On 02/24/2017 06:47 PM, Bruce Momjian wrote:

On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:

It might be worth looking into using the CRC CPU instruction to reduce this
overhead, like we do for the WAL checksums. Since that is a different
algorithm it would be a compatibility break and we would need to support the
old algorithm for upgraded clusters..


We looked at that when picking the algorithm. At that point it seemed
that CRC CPU instructions were not universal enough to rely on them.
The algorithm we ended up on was designed to be fast on SIMD hardware.
Unfortunately on x86-64 that required SSE4.1 integer instructions, so
with default compiles there is a lot of performance left on table. A
low hanging fruit would be to do CPU detection like the CRC case and
enable a SSE4.1 optimized variant when those instructions are
available. IIRC it was actually a lot faster than the naive hardware
CRC that is used for WAL and about on par with interleaved CRC.


Uh, I thought already did compile-time testing for SSE4.1 and used them
if present.  Why do you say "with default compiles there is a lot of
performance left on table?"



Compile-time is not enough - we build binary packages that may then be 
installed on machines without the SSE4.1 instructions available.


On Intel this may not be a huge issue - the first microarchitecture with 
SSE4.1 was "Nehalem", announced in 2007, so we're only left with very 
old boxes based on "Intel Core" (and perhaps the even older P6).


On AMD, it's a bit worse - the first micro-architecture with SSE4.1 was 
Bulldozer (late 2011). So quite a few CPUs out there, even if most 
people use Intel.


In any case, we can't just build x86-64 packages with compile-time 
SSE4.1 checks.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Jeff Janes
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane  wrote:

> Jeff Janes  writes:
> > The number of new chunks can be almost as as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has
> to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed.  This is an N^2 operation, and I think it
> has
> > horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.



I don't think that would help.  You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
And in this case, the desired chunk starts out at the front, and then keeps
moving down the list with each iteration as new things are added to the
front, until near the end of the ExecHashIncreaseNumBatches it is freeing
them from near the end.  But in between, it is freeing them from the
middle, so I don't think a doubly-linked list would change it from N^2,
just lower the constant, even if you always knew which end to start at.  Or
am I misunderstanding what the implications for a doubly-linked-list are?

What would really help here is if it remembered the next pointer of the
just-freed chunk, and started the scan from that location the next time,
cycling around to the head pointer if it doesn't find anything.  But I
don't think that that is a very general solution.

Or, if you could pass a flag when creating the context telling it whether
memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack or
a queue accordingly.


> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> learn to aggregate them into larger requests.
>

Right now it is using compiled-in 32KB chunks.  Should it use something
like max(32kb,work_mem/128) instead?

Cheers,

Jeff


Re: [HACKERS] Checksums by default?

2017-02-24 Thread Bruce Momjian
On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
> > It might be worth looking into using the CRC CPU instruction to reduce this
> > overhead, like we do for the WAL checksums. Since that is a different
> > algorithm it would be a compatibility break and we would need to support the
> > old algorithm for upgraded clusters..
> 
> We looked at that when picking the algorithm. At that point it seemed
> that CRC CPU instructions were not universal enough to rely on them.
> The algorithm we ended up on was designed to be fast on SIMD hardware.
> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
> with default compiles there is a lot of performance left on table. A
> low hanging fruit would be to do CPU detection like the CRC case and
> enable a SSE4.1 optimized variant when those instructions are
> available. IIRC it was actually a lot faster than the naive hardware
> CRC that is used for WAL and about on par with interleaved CRC.

Uh, I thought already did compile-time testing for SSE4.1 and used them
if present.  Why do you say "with default compiles there is a lot of
performance left on table?"

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-24 Thread Joel Jacobson
On Thu, Feb 23, 2017 at 8:04 AM, Robert Haas  wrote:
>
> It doesn't sound like a good solution to me, because there can be SQL
> code inside stored procedures that clients never see.

In our code base, we use CamelCase in all PL/pgSQL functions, both for
columns and variables,
e.g. SELECT UserID INTO _UserID FROM Users WHERE Username = 'foo';

Here, it's not a problem that the column name is e.g. "userid",
since the case-insensitive feature makes it work.

What type of case problem do you foresee for stored procedures?

I've only experienced the case-folding to be a problem outside of SPs,
since the casing *is* preserved in the PL/pgSQL source code
(since it's stored as-is, without any modifications).

What *would* be a problem though, is if in a future PL/pgSQL 3,
a PL/pgSQL query like,
SELECT UserID FROM Users WHERE Username = 'foo';
would automatically export the column "UserID" to the current scope as
a PL/pgSQL 3 variable named "userid",
since then you would actually want the value of the userid column to
be exported to a variable named "UserID".

Such a feature would be nice, since a very common code-pattern in
PL/pgSQL is to just have lots of meaningless identical lists of
columns and then an identical list of variables with the same names as
the columns.
When the list is short, it's not a problem, but when selecting lots of
columns, it gets ugly.

What I usually end up with is to align the columns and variables on
two rows, e.g.:

SELECT  SomeCol,  OtherCol,   FooCol,   BarCol,   MyCol,   ExtraCol,   LastCol
INTO _SomeCol, _OtherCol, _FooCol, _BarCol, _MyCol, _ExtraCol, _LastCol
FROM Foo
WHERE Bar = 'Baz';

This is to avoid typos that are then visually easy to spot, thanks to
all chars being aligned.

Imagine if, thanks to case preservation, you could simply do:
SELECT  SomeCol,  OtherCol,   FooCol,   BarCol,   MyCol,   ExtraCol,   LastCol
FROM Foo
WHERE Bar = 'Baz';

And all the columns would be exported to the variables SomeCol,
OtherCol,   FooCol,   BarCol,   MyCol,   ExtraCol,   LastCol,
instead of somecol, othercol, foocol, barcol, mycol, extracol, lastcol;

This would be a huge win in avoiding unnecessary code repetition.

Then of course, if you want a column Foo to instead be exported to
Bar, then you simply do "SELECT Foo AS Bar".

Thoughts?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 8:49 AM, Amit Kapila  wrote:
>> IIUC, I think that we need to have the number of half-dead pages in meta 
>> page.
>
> Don't you think we need to consider backward compatibility if we want
> to do that?

Yeah, that would be an on-disk format break.

I think this thread is pretty short on evidence that would let us make
a smart decision about what to do here.  I see three possibilities.
The first is that this patch is a good idea whether we do something
about the issue of half-dead pages or not.  The second is that this
patch is a good idea if we do something about the issue of half-dead
pages but a bad idea if we don't.  The third is that this patch is a
bad idea whether or not we do anything about the issue of half-dead
pages.

Unfortunately, we have no evidence at all that would let us figure out
which of those three things is true.  The original post didn't include
any relevant benchmarks or test results.  Simon's reply, which
suggested that the problem of half-dead pages needs to be fixed
first, didn't include any benchmarks or test results.  In fact, in
neither place were any tests
suggested, even hypothetically, which would help us decide what to do.
I had a hunch when I saw this thread that it was a good idea, and
Simon has a hunch that this btree page recycling thing needs to be
fixed first, and he might be right.  Or we might both be wrong.  Or
... who knows, really?

I think we need to come up with some set of tests to figure out what
actually works well in practice here.  Theories are a good starting
point, but good vacuum behavior is really important, and a patch that
changes it ought to be backed up by at least some experimental
evidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make subquery alias optional in FROM clause

2017-02-24 Thread David G. Johnston
 On Fri, Feb 24, 2017 at 9:35 AM, David Fetter  wrote:

>
> > => SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
> > ERROR:  42703: column "?column" does not exist
> > LINE 2: SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
> >^
> > HINT:  Perhaps you meant to reference the column "x.?column?" or the
> > column "x.?column?".
>

This is indirectly pointing out the duplication since the hint is
specifying the exact same name twice...

I don't know how far comparing apples and oranges gets us here...and the
assignment of names to expression columns lacking aliases is a bit smarter
than given credit for here - e.g., it uses the name of the function in a
simple function call expression.

There is no risk of naming conflicts in pre-existing queries.  I say we do
something like:  pg_subquery_n and make it known that the value for "n"
will be chosen independent of names already present in the query.  We've
recently reserved the pg_ prefix for roles we might as well leverage that.
These names need only be available for internal needs; as a user I'd expect
it is be noted as an implementation detail that should not be relied upon.
Whether it needs to get exposed for technical reasons (e.g., dump/restore
and explain) I do not know.

David J.


Re: [HACKERS] Hash support for grouping sets

2017-02-24 Thread Andrew Gierth
> "Thom" == Thom Brown  writes:

 Thom> This doesn't apply cleanly to latest master.  Could you please
 Thom> post a rebased patch?

Sure.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c9e0a3e..480a07e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -996,6 +996,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		pname = "HashAggregate";
 		strategy = "Hashed";
 		break;
+	case AGG_MIXED:
+		pname = "MixedAggregate";
+		strategy = "Mixed";
+		break;
 	default:
 		pname = "Aggregate ???";
 		strategy = "???";
@@ -1924,6 +1928,19 @@ show_grouping_set_keys(PlanState *planstate,
 	ListCell   *lc;
 	List	   *gsets = aggnode->groupingSets;
 	AttrNumber *keycols = aggnode->grpColIdx;
+	const char *keyname;
+	const char *keysetname;
+
+	if (aggnode->aggstrategy == AGG_HASHED || aggnode->aggstrategy == AGG_MIXED)
+	{
+		keyname = "Hash Key";
+		keysetname = "Hash Keys";
+	}
+	else
+	{
+		keyname = "Group Key";
+		keysetname = "Group Keys";
+	}
 
 	ExplainOpenGroup("Grouping Set", NULL, true, es);
 
@@ -1938,7 +1955,7 @@ show_grouping_set_keys(PlanState *planstate,
 			es->indent++;
 	}
 
-	ExplainOpenGroup("Group Keys", "Group Keys", false, es);
+	ExplainOpenGroup(keysetname, keysetname, false, es);
 
 	foreach(lc, gsets)
 	{
@@ -1962,12 +1979,12 @@ show_grouping_set_keys(PlanState *planstate,
 		}
 
 		if (!result && es->format == EXPLAIN_FORMAT_TEXT)
-			ExplainPropertyText("Group Key", "()", es);
+			ExplainPropertyText(keyname, "()", es);
 		else
-			ExplainPropertyListNested("Group Key", result, es);
+			ExplainPropertyListNested(keyname, result, es);
 	}
 
-	ExplainCloseGroup("Group Keys", "Group Keys", false, es);
+	ExplainCloseGroup(keysetname, keysetname, false, es);
 
 	if (sortnode && es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent--;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index aa08152..f059560 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -122,12 +122,19 @@
  *	  specific).
  *
  *	  Where more complex grouping sets are used, we break them down into
- *	  "phases", where each phase has a different sort order.  During each
- *	  phase but the last, the input tuples are additionally stored in a
- *	  tuplesort which is keyed to the next phase's sort order; during each
- *	  phase but the first, the input tuples are drawn from the previously
- *	  sorted data.  (The sorting of the data for the first phase is handled by
- *	  the planner, as it might be satisfied by underlying nodes.)
+ *	  "phases", where each phase has a different sort order (except phase 0
+ *	  which is reserved for hashing).  During each phase but the last, the
+ *	  input tuples are additionally stored in a tuplesort which is keyed to the
+ *	  next phase's sort order; during each phase but the first, the input
+ *	  tuples are drawn from the previously sorted data.  (The sorting of the
+ *	  data for the first phase is handled by the planner, as it might be
+ *	  satisfied by underlying nodes.)
+ *
+ *	  Hashing can be mixed with sorted grouping.  To do this, we have an
+ *	  AGG_MIXED strategy that populates the hashtables during the first sorted
+ *	  phase, and switches to reading them out after completing all sort phases.
+ *	  We can also support AGG_HASHED with multiple hash tables and no sorting
+ *	  at all.
  *
  *	  From the perspective of aggregate transition and final functions, the
  *	  only issue regarding grouping sets is this: a single call site (flinfo)
@@ -139,8 +146,6 @@
  *	  sensitive to the grouping set for which the aggregate function is
  *	  currently being called.
  *
- *	  TODO: AGG_HASHED doesn't support multiple grouping sets yet.
- *
  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -432,6 +437,7 @@ typedef struct AggStatePerGroupData
  */
 typedef struct AggStatePerPhaseData
 {
+	AggStrategy aggstrategy;	/* strategy for this phase */
 	int			numsets;		/* number of grouping sets (or 0) */
 	int		   *gset_lengths;	/* lengths of grouping sets */
 	Bitmapset **grouped_cols;	/* column groupings for rollup */
@@ -440,7 +446,31 @@ typedef struct AggStatePerPhaseData
 	Sort	   *sortnode;		/* Sort node for input ordering for phase */
 }	AggStatePerPhaseData;
 
+/*
+ * AggStatePerHashData - per-hashtable state
+ *
+ * When doing grouping sets with hashing, we have one of these for each
+ * grouping set. (When doing hashing without grouping sets, we have just one of
+ * them.)
+ */
+
+typedef struct AggStatePerHashData
+{
+	TupleHashTable hashtable;	/* hash table with one entry per group */
+	TupleHashIterator hashiter; /* for iterating through hash table */
+	TupleTableSlot *hashslot;	/* slot for loading hash table */
+	FmgrInfo   *hashfunctions;	/* 

Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Bruce Momjian
On Fri, Feb 24, 2017 at 02:14:23PM +0530, Pavan Deolasee wrote:
> 
> 
> On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian  wrote:
> 
> On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> >
> > > As I remember, WARM only allows
> > > a single index-column change in the chain.  Why are you seeing such a
> > > large performance improvement?  I would have thought it would be that
> > > high if we allowed an unlimited number of index changes in the chain.
> >
> > The second update in a chain creates another non-warm-updated tuple, so
> > the third update can be a warm update again, and so on.
> 
> Right, before this patch they would be two independent HOT chains.  It
> still seems like an unexpectedly-high performance win.  Are two
> independent HOT chains that much more expensive than joining them via
> WARM?
> 
> 
> In these tests, there are zero HOT updates, since every update modifies some
> index column. With WARM, we could reduce regular updates to half, even when we
> allow only one WARM update per chain (chain really has a single tuple for this
> discussion). IOW approximately half updates insert new index entry in *every*
> index and half updates 
> insert new index entry *only* in affected index. That itself does a good bit
> for performance.
> 
> So to answer your question: yes, joining two HOT chains via WARM is much
> cheaper because it results in creating new index entries just for affected
> indexes.

OK, all my questions have been answered, including the use of flag bits.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Claudio Freire
On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas  wrote:
> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
>> This attached patch re-designed the previous v2 according to Robert's 
>> comment.
>> All I had to do is putting entrypoint for ForeignScan/CustomScan at
>> ExecShutdownNode,
>> because it is already modified to call child node first, earlier than
>> ExecShutdownGather
>> which releases DSM segment.
>
> Aside from the documentation, which needs some work, this looks fine
> to me on a quick read.

LGTM too.

You may want to clarify in the docs when the hook is called in
relation to other hooks too.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make subquery alias optional in FROM clause

2017-02-24 Thread David Fetter
On Thu, Feb 23, 2017 at 01:27:29PM +, Greg Stark wrote:
> On 22 February 2017 at 15:08, Tom Lane  wrote:
> > Indeed.  When I wrote the comment you're referring to, quite a few years
> > ago now, I thought that popular demand might force us to allow omitted
> > aliases.  But the demand never materialized.  At this point it seems
> > clear to me that there isn't really good reason to exceed the spec here.
> > It just encourages people to write unportable SQL code.
> 
> 
> Oh my. This bothers me all the time. I always assumed the reason it
> was like this was because the grammar would be ambiguous without it
> and it would require extreme measures to hack the grammar to work. If
> it's this easy I would totally be for it.
> 
> Offhand I think there are plenty of solutions for the problem of
> inventing names and I suspect any of them would work fine:
> 
> 1) Don't assign a name -- I would guess this would require some
> adjustments in the rule deparsing (i.e. views).
> 
> 2) Assign a name but add a flag indicating the name is autogenerated
> and shouldn't be used for resolving references and shouldn't be
> dumped. Then it shouldn't really matter if there's a conflict since
> the name is only used for things like error messages, not resolving
> references.
> 
> 3) thumb through all the names in the query and pick one that doesn't 
> conflict.
> 
> For what it's worth, while it wouldn't be a *bad* thing to avoid
> conflicts, I think this is being held to an inconsistent standard here.
> It's not like there aren't similar situations elsewhere in the
> codebase where we just don't worry about this kind of thing:
> 
> => SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
> ERROR:  42703: column "?column" does not exist
> LINE 2: SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
>^
> HINT:  Perhaps you meant to reference the column "x.?column?" or the
> column "x.?column?".

That's because you transposed the two characters after column in your
target list:
               XX
SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
SELECT "?column?" FROM (select 1+1 as "?column?", 1+1) AS x;

This is what you get when you do the second, which I'm assuming is
what you meant to do:

ERROR:  column reference "?column?" is ambiguous
LINE 1: SELECT "?column?" FROM (select 1+1 as "?column?", 1+1) AS x;

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FYI: git worktrees as replacement for "rsync the CVSROOT"

2017-02-24 Thread Tom Lane
Andrew Dunstan  writes:
> On 02/24/2017 02:36 AM, Craig Ringer wrote:
>> On 16 January 2017 at 05:01, Jim Nasby  wrote:
>>> git worktree add ../9.6 REL9_6_STABLE

> Does this do anythng different from the git contrib script
> git-new-workdir that I have been using for quite a long while?

I think it's basically a more formally supported version of the contrib
script.  They may have fixed some of the hackier aspects of the contrib
script --- I mind in particular the fact that you need to disable git's
auto-gc activity when you use git-new-workdir, but I don't see any such
restriction in the git-worktree man page.

Haven't tried to switch over myself, but maybe I will at some point.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
> This attached patch re-designed the previous v2 according to Robert's comment.
> All I had to do is putting entrypoint for ForeignScan/CustomScan at
> ExecShutdownNode,
> because it is already modified to call child node first, earlier than
> ExecShutdownGather
> which releases DSM segment.

Aside from the documentation, which needs some work, this looks fine
to me on a quick read.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] utility commands benefiting from parallel plan

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 11:43 AM, Haribabu Kommi
 wrote:
> Here I attached an implementation patch that allows
> utility statements that have queries underneath such as
> CREATE TABLE AS, CREATE MATERIALIZED VIEW
> and REFRESH commands to benefit from parallel plan.
>
> These write operations are not performed concurrently by the
> parallel workers, but the underlying query that is used by
> these operations is eligible for parallel plans.
>
> Currently the write operations are implemented for the
> tuple dest types DestIntoRel and DestTransientRel.
>
> Currently I am evaluating other write operations that can
> benefit with parallelism without side effects in enabling them.
>
> comments?

I think a lot more work than this will be needed.  See:

https://www.postgresql.org/message-id/CA+TgmoZC5ft_t9uQWSO5_1vU6H8oVyD=zyuLvRnJqTN==fv...@mail.gmail.com

...and the discussion which followed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make subquery alias optional in FROM clause

2017-02-24 Thread Tom Lane
Craig Ringer  writes:
> Personally I think we need to generate one, if nothing else for error
> messages where we try to emit qualified names of columns.

Also for EXPLAIN, where there has to be a way to name everything.

> But I don't see that the name needs to be anything we can refer to
> elsewhere or anything faintly sane to type. Something like:
>   ""

-1.  "Make it ugly as sin and then pretend that nobody could conflict
with it" is neither formally correct nor nice to look at in the contexts
where people have to look at it.

I'm for something along the lines of "subquery_n" where we simply keep
incrementing n until we find a name that is not present in the query
already.  This is basically what ruleutils.c does now when it has to
cons up a unique table alias, which it must do in cases involving
inheritance.
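
A trivial sketch of that naming rule in C (name_in_use is a hypothetical
callback standing in for a scan of the names already present in the query;
this is not actual ruleutils.c code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch: generate "subquery_1", "subquery_2", ... and take the first one
 * that is not already used anywhere in the query.  The name_in_use callback
 * is a placeholder for whatever lookup the real code would do.
 */
static void
choose_subquery_alias(char *buf, size_t buflen,
					  bool (*name_in_use) (const char *name))
{
	int			n;

	for (n = 1;; n++)
	{
		snprintf(buf, buflen, "subquery_%d", n);
		if (!name_in_use(buf))
			break;				/* first unused name wins */
	}
}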

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] safer node casting

2017-02-24 Thread Peter Eisentraut
On 2/24/17 10:54, Tom Lane wrote:
> Andres Freund  writes:
>> Those aren't actually equivalent, because of the !nodeptr. IsA() crashes
>> for NULL pointers, but the new code won't. Which means 9ba8a9ce4548b et
>> al actually weakened some asserts.
> 
>> Should we perhaps have one NULL accepting version (castNodeNull?) and
>> one that separately asserts that ptr != NULL?
> 
> -1 ... if you're going to use something in a way that requires it not to
>  be null, your code will crash quite efficiently on a null, with or
>  without an assert.  I don't think we need the extra cognitive burden of
>  two distinct macros for this.

I think we should just add some Assert(thepointer) where necessary.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 4:06 PM, Pavan Deolasee
 wrote:
>> Wow, OK.  In my view, that makes the chain conversion code pretty much
>> essential, because if you had WARM without chain conversion then the
>> visibility map gets more or less irrevocably less effective over time,
>> which sounds terrible.
>
> Yes. I decided to complete the chain conversion patch when I realised that IOS
> will otherwise become completely useless if a large percentage of rows are
> updated just once. So I agree. It's not an optional patch and should get in
> with the main WARM patch.

Right, and it's not just index-only scans.  VACUUM gets permanently
more expensive, too, which is probably a much worse problem.

>> But it sounds to me like even with the chain
>> conversion, it might take multiple vacuum passes before all visibility
>> map bits are set, which isn't such a great property (thus e.g.
>> fdf9e21196a6f58c6021c967dc5776a16190f295).
>
> The chain conversion algorithm first converts the chains during vacuum and
> then checks if the page can be set all-visible. So I'm not sure why it would
> take multiple vacuums before a page is set all-visible. The commit you quote
> was written to ensure that we make another attempt to set the page
> all-visible after all dead tuples are removed from the page. Similarly, we
> will convert all WARM chains to HOT chains and then check for all-visibility
> of the page.

OK, that sounds good.  And there are no bugs, right?  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] btree_gin and btree_gist for enums

2017-02-24 Thread Tom Lane
Andrew Dunstan  writes:
> On 02/23/2017 04:41 PM, Tom Lane wrote:
>> The reason this is kind of scary is that it's just blithely assuming
>> that the function won't look at the *other* fields of the FmgrInfo.
>> If it did, it would likely get very confused, since those fields
>> would be describing the GIN support function, not the function we're
>> calling.
>> 
>> We could alternatively have this trampoline function set up a fresh, local
>> FmgrInfo struct that it zeroes except for copying fn_extra and fn_mcxt
>> from the caller's struct, and then it copies fn_extra back again on the
>> way out.  That's a few more cycles but it would be safer, I think; if the
>> function tried to look at the other fields such as fn_oid it would see
>> obviously bogus data.

> Do we want one or both of these? I'm prepared to code up a patch to
> fmgr.[ch] to provide them.

On reflection I'm not sure that the double-copy approach is all that much
safer than just passing down the caller's flinfo pointer.  Most of the
time it would be better, but suppose that the callee updates fn_extra
and then throws elog(ERROR) --- the outcome would be different, probably
creating a leak in fn_mcxt.  Maybe this would still be okay, because
perhaps that FmgrInfo is never used again, but I don't think we can assume
that for the case at hand.

At this point I'd be inclined to just document that the called function
should only use fn_extra/fn_mcxt.
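
For reference, the double-copy variant being discussed would look roughly
like the sketch below (written against the pre-v12 fmgr conventions used
elsewhere in this thread, and not committed code).  Note that if the callee
sets fn_extra and then throws elog(ERROR), the copy-back at the bottom never
runs, which is exactly the hazard described above.

/* sketch; would live in fmgr.c alongside the FunctionCallN family */
Datum
CopyFlinfoFunctionCall2(PGFunction func, FmgrInfo *flinfo, Oid collation,
						Datum arg1, Datum arg2)
{
	FmgrInfo	locflinfo;
	FunctionCallInfoData fcinfo;
	Datum		result;

	/* fresh FmgrInfo, zeroed except for the fields the callee may use */
	MemSet(&locflinfo, 0, sizeof(locflinfo));
	locflinfo.fn_extra = flinfo->fn_extra;
	locflinfo.fn_mcxt = flinfo->fn_mcxt;

	InitFunctionCallInfoData(fcinfo, &locflinfo, 2, collation, NULL, NULL);

	fcinfo.arg[0] = arg1;
	fcinfo.arg[1] = arg2;
	fcinfo.argnull[0] = false;
	fcinfo.argnull[1] = false;

	result = (*func) (&fcinfo);

	/* check for null result, since caller is clearly not expecting one */
	if (fcinfo.isnull)
		elog(ERROR, "function %p returned NULL", (void *) func);

	/* propagate any cache the callee hung on fn_extra back to the caller */
	flinfo->fn_extra = locflinfo.fn_extra;

	return result;
}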

> I don't know what to call it either. In my test I used
> CallerContextFunctionCall2 - not sure if that's quite right, but should
> be close.

CallerInfo?  CallerFInfo?  Or we could spell out CallerFmgrInfo but
that seems a bit verbose.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] safer node casting

2017-02-24 Thread Tom Lane
Andres Freund  writes:
> Those aren't actually equivalent, because of the !nodeptr. IsA() crashes
> for NULL pointers, but the new code won't. Which means 9ba8a9ce4548b et
> al actually weakened some asserts.

> Should we perhaps have one NULL accepting version (castNodeNull?) and
> one that separately asserts that ptr != NULL?

-1 ... if you're going to use something in a way that requires it not to
 be null, your code will crash quite efficiently on a null, with or
 without an assert.  I don't think we need the extra cognitive burden of
 two distinct macros for this.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] utility commands benefiting from parallel plan

2017-02-24 Thread Dilip Kumar
On Fri, Feb 24, 2017 at 11:43 AM, Haribabu Kommi
 wrote:
> Here I attached an implementation patch that allows
> utility statements that have queries underneath such as
> CREATE TABLE AS, CREATE MATERIALIZED VIEW
> and REFRESH commands to benefit from parallel plan.
>
> These write operations are not performed concurrently by the
> parallel workers, but the underlying query that is used by
> these operations is eligible for parallel plans.
>
> Currently the write operations are implemented for the
> tuple dest types DestIntoRel and DestTransientRel.
>
> Currently I am evaluating other write operations that can
> benefit with parallelism without side effects in enabling them.

The Idea looks good to me.

Since we are already modifying heap_prepare_insert, I am thinking that
we can as well enable queries like "insert into .. select from .."
with minor modification?

- * For now, parallel operations are required to be strictly read-only.
- * Unlike heap_update() and heap_delete(), an insert should never create a
- * combo CID, so it might be possible to relax this restriction, but not
- * without more thought and testing.
+ * For now, parallel operations are required to be strictly read-only in
+ * parallel worker.
This statement is still not true: we cannot do heap_update in the
leader even though the workers are doing the read-only operation (update
with select).  We can change the comment so that it is more
specific to insert, I think.
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-02-24 Thread David G. Johnston
On Friday, February 24, 2017, Simon Riggs  wrote:
>
> 2. I know that DB2 handles this by having the user specify WITH ROW
> MOVEMENT to explicitly indicate they accept the issue and want update
> to work even with that. We could have an explicit option to allow
> that. This appears to be the only way we could avoid silent errors for
> foreign table partitions.
>
>
This does, however, make the partitioning very non-transparent to every
update query just because it is remotely possible a partition-moving update
might occur concurrently.

I dislike an error.  I'd say that making partitioning "just work" here is
material for another patch.  In this one an update of the partition key can
be documented as shorthand for delete-returning-insert, with all the
limitations that go with that.  If someone acceptably solves the
ctid-following logic later it can be committed - I'm assuming there would
be no complaints about making things just work in a case where they only sorta
worked.

David J.


Re: [HACKERS] btree_gin and btree_gist for enums

2017-02-24 Thread Andrew Dunstan


On 02/23/2017 04:41 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> I'm not entirely sure how I should replace those DirectFunctionCall2 calls.
> Basically what we want is for the called function to be able to use the
> fcinfo->flinfo->fn_extra and fcinfo->flinfo->fn_mcxt fields of the
> FmgrInfo struct that GIN has set up for the btree_gin support function.
>
> The fast but somewhat scary way to do it would just be to pass through
> the flinfo pointer as-is.  Imagine that fmgr.c grows a set of functions
> like
>
> Datum
> DontKnowWhatToCallThisFunctionCall2(PGFunction func,
> FmgrInfo *flinfo, Oid collation,
> Datum arg1, Datum arg2)
> {
> FunctionCallInfoData fcinfo;
> Datum    result;
>
> InitFunctionCallInfoData(fcinfo, flinfo, 2, collation, NULL, NULL);
>
> fcinfo.arg[0] = arg1;
> fcinfo.arg[1] = arg2;
> fcinfo.argnull[0] = false;
> fcinfo.argnull[1] = false;
>
> result = (*func) (&fcinfo);
>
> /* Check for null result, since caller is clearly not expecting one */
> if (fcinfo.isnull)
> elog(ERROR, "function %p returned NULL", (void *) func);
>
> return result;
> }
>
> and then you'd just pass through the fcinfo->flinfo you got.
>
> The reason this is kind of scary is that it's just blithely assuming
> that the function won't look at the *other* fields of the FmgrInfo.
> If it did, it would likely get very confused, since those fields
> would be describing the GIN support function, not the function we're
> calling.
>
> We could alternatively have this trampoline function set up a fresh, local
> FmgrInfo struct that it zeroes except for copying fn_extra and fn_mcxt
> from the caller's struct, and then it copies fn_extra back again on the
> way out.  That's a few more cycles but it would be safer, I think; if the
> function tried to look at the other fields such as fn_oid it would see
> obviously bogus data.
>
>



Do we want one or both of these? I'm prepared to code up a patch to
fmgr.[ch] to provide them.

I don't know what to call it either. In my test I used
CallerContextFunctionCall2 - not sure if that's quite right, but should
be close.

The technique is somewhat similar to what we do in plperl.c where we
fake up a function call for handling DO blocks.

cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums by default?

2017-02-24 Thread Magnus Hagander
On Thu, Feb 23, 2017 at 10:37 PM, Bruce Momjian  wrote:

> On Sat, Jan 21, 2017 at 12:46:05PM -0500, Stephen Frost wrote:
> > * Petr Jelinek (petr.jeli...@2ndquadrant.com) wrote:
> > > As we don't know the performance impact is (there was no benchmark done
> > > on reasonably current code base) I really don't understand how you can
> > > judge if it's worth it or not.
> >
> > Because I see having checksums as, frankly, something we always should
> > have had (as most other databases do, for good reason...) and because
> > they will hopefully prevent data loss.  I'm willing to give us a fair
> > bit to minimize the risk of losing data.
>
> Do these other databases do checksums because they don't do
> full_page_writes?  They just detect torn pages rather than repair them
> like we do?
>

Torn page detection is usually/often done by other means than checksums. I
don't think those are necessarily related.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
Hello,

This attached patch re-designed the previous v2 according to Robert's comment.
All I had to do is putting entrypoint for ForeignScan/CustomScan at
ExecShutdownNode,
because it is already modified to call child node first, earlier than
ExecShutdownGather
which releases DSM segment.

Thanks,

2017-02-20 9:25 GMT+09:00 Kouhei Kaigai :
>> -Original Message-
>> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
>> Sent: Monday, February 20, 2017 2:20 AM
>> To: Kaigai Kouhei(海外 浩平) 
>> Cc: Claudio Freire ; Amit Kapila
>> ; pgsql-hackers 
>> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
>> ExecEndGather)
>>
>> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai 
>> wrote:
>> > The attached patch is revised one.
>> >
>> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to
>> > ExecParallelRetrieveInstrumentation() not to walk on the plan- state
>> > tree twice.
>> > One (hypothetical) downside is, FDW/CSP can retrieve its own run-time
>> > statistics only when query is executed under EXPLAIN ANALYZE.
>> >
>> > This enhancement allows FDW/CSP to collect its specific run- time
>> > statistics more than Instrumentation, then show them as output of
>> > EXPLAIN. My expected examples are GPU's kernel execution time, DMA
>> > transfer ratio and so on. These statistics will never appear in the
>> > Instrumentation structure, however, can be a hot- point of performance
>> > bottleneck if CustomScan works on background workers.
>>
>> Would gather_shutdown_children_first.patch from
>> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
>> b8eb10c-6aywjdxb...@mail.gmail.com
>> help with this problem also?  Suppose we did that, and then also added an
>> ExecShutdownCustom method.  Then you'd definitely be able to get control
>> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
>>
> Ah, yes, I couldn't find any problem around the above approach.
> ExecShutdownGather() can be called by either ExecShutdownNode() or
> ExecEndGather(). This patch allows to have an entrypoint for CSP/FDW
> prior to release of the DSM.
>
> Thanks,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bytea_output output of base64

2017-02-24 Thread Kenneth Marshall
On Thu, Feb 23, 2017 at 03:52:46PM -0800, David Fetter wrote:
> On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote:
> > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
> > > Bruce Momjian  writes:
> > > > Is there a reason we don't support base64 as a bytea_output output
> > > > option, except that no one has implemented it?
> > > 
> > > How about "we already have one too many bytea output formats"?
> > > I don't think forcing code to try to support still another one
> > > is a great thing ... especially not if it couldn't be reliably
> > > distinguished from the hex format.
> > 
> > Is there a reason we chose hex over base64?
> 
> Whether there was or not, there's not a compelling reason now to break
> people's software.  When people want compression, methods a LOT more
> effective than base64 are common.  Gzip, for example.
> 
> Best,
> David.

First, hex encoding is very simple to perform. Second, most applications
have routines to handle it trivially. And third, base64 encoding has some
padding constraints that can complicate its processing. Like David suggests,
if you want something compact, run it through lz4/gzip/lzop for a much better
size reduction.
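
For what it's worth, both representations are already easy to produce from
SQL today, since encode()/decode() understand base64 (quick example against a
stock server; results shown as comments):

    SET bytea_output = 'hex';
    SELECT '\xdeadbeef'::bytea;                    -- \xdeadbeef
    SELECT encode('\xdeadbeef'::bytea, 'base64');  -- 3q2+7w==
    SELECT decode('3q2+7w==', 'base64');           -- \xdeadbeef

So an application that really wants base64 can already get it on its own.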

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FYI: git worktrees as replacement for "rsync the CVSROOT"

2017-02-24 Thread Andrew Dunstan


On 02/24/2017 02:36 AM, Craig Ringer wrote:
> On 16 January 2017 at 05:01, Jim Nasby  wrote:
>> Not sure how many people still use [1], as referenced by our git wiki[2],
>> but it appears git worktrees are a viable replacement for that technique. In
>> short, if you're already in your checkout:
>>
>> git worktree add ../9.6 REL9_6_STABLE
>>
>> would give you a checkout of 9.6 in the ../9.6 directory.
>>
>> BTW, I learned about this from this "git year in review" article[3].
> Looks handy enough to merit adding to the Pg developer FAQ. Please?
>
> It looks cleaner than my current approach of doing a local clone or
> re-cloning from upstream with a local repo as a --reference .
>



Does this do anything different from the git contrib script
git-new-workdir that I have been using for quite a long while?
Essentially it symlinks a bunch of things from the old workdir to the
new one. I copied the technique in the buildfarm's git_use_workdirs feature.
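
(For reference, the contrib script is invoked roughly like this -- from
memory, so check the script itself:

    git-new-workdir /path/to/postgresql ../9.6 REL9_6_STABLE

i.e. source repository, new working directory, optional branch.)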


cheers

andrew

-- 

Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-24 Thread Peter Moser
2017-02-22 19:43 GMT+01:00 Peter Eisentraut :
> On 2/16/17 07:41, Robert Haas wrote:
>> Also, it sounds like all of this is intended to work with ranges that
>> are stored in different columns rather than with PostgreSQL's built-in
>> range types.
>
> Yeah, that should certainly be changed.

Our syntax supports PostgreSQL's built-in range types and ranges that
are stored in different columns.

For instance, for range types an ALIGN query would look like this:
  SELECT * FROM (r ALIGN s ON q WITH (r.t, s.t)) c

... and for ranges in different columns like this:
  SELECT * FROM (r ALIGN s ON q WITH (r.ts, r.te, s.ts, s.te)) c

... where r and s are input relations, q can be any join qualifier, and
r.t, s.t, r.ts, r.te, s.ts, and s.te can be any column name. The
latter represent the valid time intervals, that is time point start,
and time point end of each tuple for each input relation. These can
be defined as four scalars, or two half-open, i.e., [), range typed
values.
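
As a concrete sketch (made-up table definitions), an input relation such as
r could store its valid time either way:

  -- variant 1: a single half-open range column
  CREATE TABLE r (id int, t tstzrange);
  INSERT INTO r VALUES (1, tstzrange('2017-01-01', '2017-02-01', '[)'));

  -- variant 2: separate start and end columns
  CREATE TABLE r2 (id int, ts timestamptz, te timestamptz);
  INSERT INTO r2 VALUES (1, '2017-01-01', '2017-02-01');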




It would reduce the size of our patch and simplify the overall structure
if we removed the possibility of expressing valid-time start and end points
in separate columns.

Do you think it would be better to remove the syntax for ranges expressed in
separate columns?

However, internally we still need to split a range type into two separate
points, because NORMALIZE does not distinguish between start and end
timepoints while grouping; we therefore have only one timepoint attribute
there (i.e., P1), which is the union of start and end timepoints (see
executor/nodeTemporalAdjustment.c). A second constraint is that we currently
support only half-open intervals, that is, any interval definition
(open/closed/half-open) other than PostgreSQL's default, i.e. [), leads to
undefined results.

Best regards,
Anton, Johann, Michael, Peter


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] safer node casting

2017-02-24 Thread Andres Freund
Hi,

I was about to add a few more uses of castNode, which made me think.

You proposed replacing:

On 2016-12-31 12:08:22 -0500, Peter Eisentraut wrote:
> There is a common coding pattern that goes like this:
> 
> RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
> Assert(IsA(rinfo, RestrictInfo));

with

> +#define castNode(_type_,nodeptr) (AssertMacro(!nodeptr || IsA(nodeptr,_type_)), (_type_ *)(nodeptr))
(now an inline function, but that's besides my point)

Those aren't actually equivalent, because of the !nodeptr. IsA() crashes
for NULL pointers, but the new code won't. Which means 9ba8a9ce4548b et
al actually weakened some asserts.

Should we perhaps have one NULL accepting version (castNodeNull?) and
one that separately asserts that ptr != NULL?
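
Roughly like this (just a sketch, names and exact spelling not final):

/* rejects a NULL pointer outright */
#define castNode(_type_, nodeptr) \
	(AssertMacro((nodeptr) != NULL && IsA((nodeptr), _type_)), (_type_ *) (nodeptr))

/* accepts NULL, otherwise checks the node tag */
#define castNodeNull(_type_, nodeptr) \
	(AssertMacro((nodeptr) == NULL || IsA((nodeptr), _type_)), (_type_ *) (nodeptr))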

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-02-24 Thread Kuntal Ghosh
On Fri, Feb 24, 2017 at 12:47 PM, Beena Emerson  wrote:
>
> Hello,
>
> The recent commit  c29aff959dc64f7321062e7f33d8c6ec23db53d has again changed
> the code and the second patch cannot be applied cleanly. Please find
> attached the rebased 02 patch. 01 patch is the same .
>
I've done an initial review of the patch. The objective of the patch
is to make the WAL segment size an initdb-time parameter instead of a
compile-time parameter.

The patch introduces the following three techniques to expose
XLogSegSize to different modules:

1. Directly read XLogSegSize from the control file
This is used by default, e.g., in StartupXLOG(), and looks good to me.

2. Run the SHOW wal_segment_size command to fetch and set the XLogSegSize

+   if (!RetrieveXLogSegSize(conn))
+   disconnect_and_exit(1);
+
You need the same logic in pg_receivewal.c as well.

3. Retrieve the XLogSegSize by reading the file size of WAL files
+   if (private.inpath != NULL)
+   sprintf(full_path, "%s/%s", private.inpath, fname);
+   else
+   strcpy(full_path, fname);
+
+   stat(full_path, &fst);
+
+   if (!IsValidXLogSegSize(fst.st_size))
+   {
+   fprintf(stderr,
+   _("%s: file size %d is invalid \n"),
+   progname, (int) fst.st_size);
+
+   return EXIT_FAILURE;
+
+   }
+
+   XLogSegSize = (int) fst.st_size;
I see couple of issues with this approach:

* You should check the return value of stat() before going ahead.
Something like,
if (stat(filename, &fst) < 0)
error "file doesn't exist"

* You're considering any WAL file whose size is a power of 2 as valid.
Suppose the correct WAL segment size is 64MB. For some reason, the server
generated an invalid 16MB WAL file (maybe it crashed while creating the
WAL file). Your code seems to treat this as a valid file, which I think
is incorrect. Do you agree with that?

Is it possible to unify these different techniques of reading
XLogSegSize into a generalized function, with proper documentation
describing the scope and limitations of each approach?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keep ECPG comment for log_min_duration_statement

2017-02-24 Thread Michael Meskes
> As I said in the first e-mail, there are some restrictions of comment
> position in my implementation. I am concerned that an error will
> occur in .pgc files users made in old version.
> So, this feature should be a new option.

I see, this makes sense.

> When the pre-compiler(ECPG) converts C with embedded SQL to normal C
> code, gram.y is used for syntactic analysis. I need to change gram.y
> for comments in SQL statement. 
> But I do not come up with better idea that gram.y is not affected.
> If you are interested in my implementation in detail, please check
> the [WIP]patch I attached.

I'm not sure we would want to change the backend parser for something
only used in ecpg. Actually I'm pretty sure we don't.

I can see two possible solutions. One would be to replace the parser
rules. Please see parse.pl for details. Some rules are already replaced
by ecpg specific ones. However, the more rules we replace the more
manual syncing effort we need for changes in gram.y.

The other option I can see, albeit without looking into details, is
allowing all comments and then testing it for the right syntax after
parsing. This could potentially also solve the above mentioned option
problem.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : Parallel Merge Join

2017-02-24 Thread Dilip Kumar
On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila  wrote:
> What advantage do you see for considering such a path when the cost of
> innerpath is more than cheapest_total_inner?  Remember the more paths
> we try to consider, the more time we spend in the planner.  By any
> chance are you able to generate any query where it will give benefit
> by considering costlier innerpath?

Changed
>
> 2.
> +static void generate_parallel_mergejoin_paths(PlannerInfo *root,
> +  RelOptInfo *joinrel,
> +  RelOptInfo *innerrel,
> +  Path *outerpath,
> +  JoinType jointype,
> +  JoinPathExtraData *extra,
> +  Path *inner_cheapest_total,
> +  List *merge_pathkeys);
>
> It is better to name this function as
> generate_partial_mergejoin_paths() as we are generating only partial
> paths in this function and accordingly change the comment on top of
> the function.  I see that you might be naming it based on
> consider_parallel_*, however, I think it is better to use partial in
> the name as that is what we are doing inside that function.  Also, I
> think this function has removed/changed some handling related to
> unique outer and full joins, so it is better to mention that in the
> function comments, something like "unlike above function, this
> function doesn't expect to handle join types JOIN_UNIQUE_OUTER or
> JOIN_FULL" and add Assert for both of those types.

Done
>
> 3.
> A test case is still missing.
Added



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_mergejoin_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2017-02-24 Thread Simon Riggs
On 12 January 2017 at 13:34, Peter Eisentraut
 wrote:
> On 1/11/17 5:27 AM, Simon Riggs wrote:
>> The main area of "design doubt" remains the implementation of the
>> recovery_target parameter set. Are we happy with the user interface
>> choices in the patch, given the understanding that the situation was
>> more comple than at first thought?
>
> Could you summarize the current proposal(s)?
>
> Personally, I don't immediately see the need to change anything from the
> parameter names that I currently see in recovery.conf.sample.

New patch version implementing everything you requested, including docs
and TAP tests.

The patch as offered here is what I've been asked to do by everybody
as well as I can do it. I'm very happy to receive comments and to
rework the design based upon further feedback.

I'm not completely convinced this is a great design, so I'm happy to
hear input. pg_basebackup -R is the main wrinkle.

The timeline handling has a bug at present that I'm working on, but
I'm not worried it constitutes a major problem. Obviously it will be
fixed before commit, but the patch needs more discussion
now/yesterday.

All parameters are set at PGC_POSTMASTER for now.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


newRecoveryAPI.v102f.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 3:42 PM, Robert Haas  wrote:

>
>
> Wow, OK.  In my view, that makes the chain conversion code pretty much
> essential, because if you had WARM without chain conversion then the
> visibility map gets more or less irrevocably less effective over time,
> which sounds terrible.


Yes. I decided to complete the chain conversion patch when I realised that IOS
will otherwise become completely useless if a large percentage of rows are
updated just once. So I agree. It's not an optional patch and should get in
with the main WARM patch.


> But it sounds to me like even with the chain
> conversion, it might take multiple vacuum passes before all visibility
> map bits are set, which isn't such a great property (thus e.g.
> fdf9e21196a6f58c6021c967dc5776a16190f295).
>
>
The chain conversion algorithm first converts the chains during vacuum and
then checks if the page can be set all-visible. So I'm not sure why it
would take multiple vacuums before a page is set all-visible. The commit
you quote was written to ensure that we make another attempt to set the
page all-visible after all dead tuples are removed from the page. Similarly,
we will convert all WARM chains to HOT chains and then check for
all-visibility of the page.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Proposal : Parallel Merge Join

2017-02-24 Thread Amit Kapila
On Fri, Feb 24, 2017 at 3:42 PM, Dilip Kumar  wrote:
> On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila  wrote:
>
> Thanks for the comments.
>
>> What advantage do you see for considering such a path when the cost of
>> innerpath is more than cheapest_total_inner?  Remember the more paths
>> we try to consider, the more time we spend in the planner.  By any
>> chance are you able to generate any query where it will give benefit
>> by considering costlier innerpath?
>
> If inner_cheapest_total path is not parallel safe then I am trying to
> find the cheapest parallel safe path and generate partial merge join.
>

I am not sure we can just ignore the cheapest inner path because it is
not parallel-safe.  It is also possible that you pick a costlier inner
path just because it is parallel-safe and then, later on, have to
discard it because the overall cost of the partial merge join is much
higher.

> I agree that it will consider one extra path while considering every
> partial merge join path
>

Hmm, AFAICS, it will consider two extra paths per sort key (the loop
around considering such paths is up to num_sortkeys)

> (I think with this addition we will not get
> parallel merge path for the any more TPCH query).  I feel in general
> we can get some better plan by this especially when gather is the
> cheapest path for inner relation(which is not parallel safe) but at
> the cost of considering some extra paths.

I agree that in some cases it could be better, but I think the benefits are
not completely clear, so we can probably leave it out for now; if someone
later comes up with a clear use case or can demonstrate the benefits of such
a path, then we can include it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make subquery alias optional in FROM clause

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 1:04 PM, Craig Ringer  wrote:
> On 23 February 2017 at 22:20, Tom Lane  wrote:
>>> * Don't force/generate an alias at all.
>>
>>> I've no idea for this yet and Tom already was concerned what this might
>>> break. There are several places in the transform phase where the
>>> refnames are required (e.g. isLockedRefname()).
>>
>> Yeah.  This would be cleaner in some sense but also a lot more delicate.
>> Not sure it's worth the trouble.
>
> Personally I think we need to generate one, if nothing else for error
> messages where we try to emit qualified names of columns.
>
> But I don't see that the name needs to be anything we can refer to
> elsewhere or anything faintly sane to type. Something like:
>
>   ""
>
> in line with our current generation of refcursor names.

Isn't that a terribly unfriendly thing to include in an error message?
 I'd much rather see the column qualified with whatever the alias name
is inside the subquery than see it qualified with some internally
generated name that's completely meaningless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs  wrote:
> I'd give the view that we cannot silently ignore this issue, bearing
> in mind the point that we're expecting partitioned tables to behave
> exactly like normal tables.

At the risk of repeating myself, I don't expect that, and I don't
think it's a reasonable expectation.  It's reasonable to expect
partitioning to be notably better than inheritance (which I think it
already is) and to provide a good base for future work (which I think
it does), but I think getting them to behave exactly like normal
tables (except for the things we want to be different) will take
another ten years of development work.

> In my understanding the issue is that UPDATEs will fail to update a
> row when a valid row exists in the case where a row moved between
> partitions; that behaviour will be different to a standard table.

Right, when at READ COMMITTED and EvalPlanQual would have happened otherwise.

> It is of course very good that we have something ready for this
> release and can make a choice of what to do.
>
> Thoughts
>
> 1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represents almost
> exactly the same thing. An UPDATE which gets to a
> HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
> metadata, or I might be persuaded that in-this-release it is
> acceptable to fail when this occurs with an ERROR and a retryable
> SQLCODE, since the UPDATE will succeed on next execution.

I've got my doubts about whether we can make that bit work that way,
considering that we still support pg_upgrade (possibly in multiple
steps) from old releases that had VACUUM FULL.  We really ought to put
some work into reclaiming those old bits, but there's probably no time
for that in v10.

> 2. I know that DB2 handles this by having the user specify WITH ROW
> MOVEMENT to explicitly indicate they accept the issue and want update
> to work even with that. We could have an explicit option to allow
> that. This appears to be the only way we could avoid silent errors for
> foreign table partitions.

Yeah, that's a thought.  We could give people a choice between (a)
updates that cause rows to move between partitions just fail and (b)
such updates work but with EPQ-related deficiencies.  I had previously
thought that, given those two choices, everybody would like (b) better
than (a), but maybe not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Monitoring roles patch

2017-02-24 Thread Dave Page
Per the discussion at
https://www.postgresql.org/message-id/CA%2BOCxoyYxO%2BJmzv2Micj4uAaQdAi6nq0w25BPQgLLxsrvTmREw%40mail.gmail.com,
attached is a patch that implements the following:

- Adds a default role called pg_monitor
- Gives members of the pg_monitor role full access to:
pg_ls_logdir() and pg_ls_waldir()
pg_stat_* views and functions
pg_tablespace_size() and pg_database_size()
Contrib modules:
pg_buffercache,
pg_freespacemap,
pgrowlocks,
pg_stat_statements,
pgstattuple and
pg_visibility (but NOT pg_truncate_visibility_map() )
- Adds a default role called pg_read_all_gucs
- Allows members of pg_read_all_gucs to, well, read all GUCs
- Grants pg_read_all_gucs to pg_monitor
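
With these in place, giving a monitoring agent access to all of the above
should just be a matter of (role name below is made up):

    CREATE ROLE monitoring_agent LOGIN;
    GRANT pg_monitor TO monitoring_agent;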

Note that updates to contrib modules followed the strategy recently
used in changes to pgstattuple following discussion here, in which the
installation SQL script is left at the prior version, and an update
script is added and default version number bumped to match that of the
upgrade script.

Patch includes doc updates, and is dependent on my pg_ls_logdir() and
pg_ls_waldir() patch
(https://www.postgresql.org/message-id/CA+OCxow-X=D2fWdKy+HP+vQ1LtrgbsYQ=cshzzbqyft5joy...@mail.gmail.com).

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index 497dbeb229..18f7a87452 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -4,8 +4,9 @@ MODULE_big = pg_buffercache
 OBJS = pg_buffercache_pages.o $(WIN32RES)
 
 EXTENSION = pg_buffercache
-DATA = pg_buffercache--1.2.sql pg_buffercache--1.1--1.2.sql \
-   pg_buffercache--1.0--1.1.sql pg_buffercache--unpackaged--1.0.sql
+DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
+   pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
+   pg_buffercache--unpackaged--1.0.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 ifdef USE_PGXS
diff --git a/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql b/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql
new file mode 100644
index 00..b37ef0112e
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql
@@ -0,0 +1,7 @@
+/* contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.3'" to load this file. 
\quit
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index a4d664f3fa..8c060ae9ab 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index 7bc0e9555d..0a2f000ec6 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -4,8 +4,8 @@ MODULE_big = pg_freespacemap
 OBJS = pg_freespacemap.o $(WIN32RES)
 
 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.0--1.1.sql \
-   pg_freespacemap--unpackaged--1.0.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
+   pg_freespacemap--1.0--1.1.sql pg_freespacemap--unpackaged--1.0.sql
 PGFILEDESC = "pg_freespacemap - monitoring of free space map"
 
 ifdef USE_PGXS
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql b/contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql
new file mode 100644
index 00..490bb3bf46
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql
@@ -0,0 +1,7 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.2'" to load this file. 
\quit
+
+GRANT EXECUTE ON FUNCTION  pg_freespace(regclass, bigint) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION  pg_freespace(regclass) TO pg_monitor;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index 764db30d18..ac8fc5050a 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.control
+++ b/contrib/pg_freespacemap/pg_freespacemap.control
@@ -1,5 +1,5 @@
 # pg_freespacemap extension
 comment = 'examine the free space map (FSM)'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/pg_freespacemap'
 relocatable = true
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 

Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 3:31 PM, Pavan Deolasee
 wrote:
> On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas  wrote:
>> I don't immediately see how this will work with index-only scans.  If
>> the tuple is HOT updated several times, HOT-pruned back to a single
>> version, and then the page is all-visible, the index entries are
>> guaranteed to agree with the remaining tuple, so it's fine to believe
>> the data in the index tuple.  But with WARM, that would no longer be
>> true, unless you have some trick for that...
>
> Well the trick is to not allow index-only scans on such pages by not marking
> them all-visible. That's why when a tuple is WARM updated, we carry that
> information in the subsequent versions even when later updates are HOT
> updates. The chain conversion algorithm will handle this by clearing those
> bits and thus allowing index-only scans again.

Wow, OK.  In my view, that makes the chain conversion code pretty much
essential, because if you had WARM without chain conversion then the
visibility map gets more or less irrevocably less effective over time,
which sounds terrible.  But it sounds to me like even with the chain
conversion, it might take multiple vacuum passes before all visibility
map bits are set, which isn't such a great property (thus e.g.
fdf9e21196a6f58c6021c967dc5776a16190f295).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : Parallel Merge Join

2017-02-24 Thread Dilip Kumar
On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila  wrote:

Thanks for the comments.

> What advantage do you see for considering such a path when the cost of
> innerpath is more than cheapest_total_inner?  Remember the more paths
> we try to consider, the more time we spend in the planner.  By any
> chance are you able to generate any query where it will give benefit
> by considering costlier innerpath?

If inner_cheapest_total path is not parallel safe then I am trying to
find the cheapest parallel safe path and generate partial merge join.
I agree that it will consider one extra path while considering every
partial merge join path (I think with this addition we will not get
parallel merge path for the any more TPCH query).  I feel in general
we can get some better plan by this especially when gather is the
cheapest path for inner relation(which is not parallel safe) but at
the cost of considering some extra paths.  What is your thought on
this?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas  wrote:

>
> I don't immediately see how this will work with index-only scans.  If
> the tuple is HOT updated several times, HOT-pruned back to a single
> version, and then the page is all-visible, the index entries are
> guaranteed to agree with the remaining tuple, so it's fine to believe
> the data in the index tuple.  But with WARM, that would no longer be
> true, unless you have some trick for that...
>
>
Well the trick is to not allow index-only scans on such pages by not
marking them all-visible. That's why when a tuple is WARM updated, we carry
that information in the subsequent versions even when later updates are HOT
updates. The chain conversion algorithm will handle this by clearing those
bits and thus allowing index-only scans again.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Making clausesel.c Smarter

2017-02-24 Thread David Rowley
I've stumbled over an interesting query out in the wild where the
query was testing whether a nullable timestamptz column was earlier than
some point in time, and also checking the column IS NOT NULL.

The use pattern here was that records which required processing had
this timestamp set to something other than NULL; a worker would come
along, process those, and UPDATE the timestamp to NULL to mark the
fact that the record was now processed. So what we were left with was a table
with a small number of rows with a value in this timestamp column, and
an ever-growing number of rows with a NULL value.

A highly simplified version of the query was checking for records that
required processing before some date, say:

SELECT * FROM a WHERE ts < '2017-02-24' AND ts IS NOT NULL;

Of course, the ts IS NOT NULL is not required here, but I can
understand how someone might make the mistake of adding this. The
simple solution to the problem was to have that null test removed, but
that seemingly innocent little mistake caused some pain due to very
slow running queries which held on to a nested loop plan 33 times
longer than it should have been doing. Basically what was happening
here is that clauselist_selectivity() was applying the selectivity
with the ~0.97 null_frac from pg_statistic over the top of the already
applied estimate on the number of rows below the constant timestamp.

Since the diagnosis of this problem was not instant, and some amount
of pain was suffered before the solution was found,  I wondered if it
might be worth smartening up the planner a bit in this area.

We're already pretty good at handling conditions like: SELECT * FROM a
WHERE x < 10 and x < 1; where we'll effectively ignore the x < 10
estimate since x < 1 is more restrictive, so if we just build on that
ability a bit we could get enough code to cover the above case.

I've attached a draft patch which improves the situation here.

Given the test case:

create table ts (ts timestamptz);
insert into ts select case when x%1000=0 then '2017-01-01'::date +
(x::text || ' sec')::interval else null end from
generate_series(1,1000000) x(x);
analyze ts;

explain analyze select count(*) from ts where ts <=
'2017-02-01'::timestamptz and ts is not null;

With the patch we get:

 QUERY PLAN
-
 Aggregate  (cost=15938.83..15938.84 rows=1 width=8) (actual
time=101.003..101.003 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=733 width=0) (actual
time=0.184..100.868 rows=1000 loops=1)
 Filter: ((ts IS NOT NULL) AND (ts <= '2017-02-01
00:00:00+13'::timestamp with time zone))
 Rows Removed by Filter: 999000
 Planning time: 0.153 ms
 Execution time: 101.063 ms

Whereas master gives us:

QUERY PLAN
---
 Aggregate  (cost=15937.00..15937.01 rows=1 width=8) (actual
time=119.256..119.256 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=1 width=0) (actual
time=0.172..119.062 rows=1000 loops=1)
 Filter: ((ts IS NOT NULL) AND (ts <= '2017-02-01
00:00:00+13'::timestamp with time zone))
 Rows Removed by Filter: 999000
 Planning time: 0.851 ms
 Execution time: 119.401 ms

A side effect of this is that we're now able to better detect
impossible cases such as:

postgres=# explain analyze select count(*) from ts where ts <=
'2017-02-01'::timestamptz and ts is null;
QUERY PLAN
--
 Aggregate  (cost=15937.00..15937.01 rows=1 width=8) (actual
time=135.012..135.012 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=1 width=0) (actual
time=135.007..135.007 rows=0 loops=1)
 Filter: ((ts IS NULL) AND (ts <= '2017-02-01
00:00:00+13'::timestamp with time zone))
         Rows Removed by Filter: 1000000
 Planning time: 0.067 ms
 Execution time: 135.050 ms
(6 rows)

Master is not able to see the impossibility of this case:

postgres=# explain analyze select count(*) from ts where ts <=
'2017-02-01'::timestamptz and ts is null;
 QUERY PLAN

 Aggregate  (cost=15938.83..15938.84 rows=1 width=8) (actual
time=131.681..131.681 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=733 width=0) (actual
time=131.676..131.676 rows=0 loops=1)
 Filter: ((ts IS NULL) AND (ts <= '2017-02-01
00:00:00+13'::timestamp with time zone))
         Rows Removed by Filter: 1000000
 Planning time: 0.090 ms
 Execution time: 131.719 ms
(6 rows)

Now one thing I was unsure of in the patch was this "Other clauses"
concept 

Re: [HACKERS] UPDATE of partition key

2017-02-24 Thread Simon Riggs
On 24 February 2017 at 07:02, Robert Haas  wrote:
> On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar  
> wrote:
>> I am inclined to at least have some option for the user to decide the
>> behaviour. In the future we can even consider support for walking
>> through the ctid chain across multiple relfilenodes. But till then, we
>> need to decide what default behaviour to keep. My inclination is more
>> towards erroring out in an unfortunate event where there is an UPDATE
>> while the row-movement is happening. One option is to not get into
>> finding whether the DELETE was part of partition row-movement or it
>> was indeed a DELETE, and always error out the UPDATE when
>> heap_update() returns HeapTupleUpdated, but only if the table is a
>> leaf partition. But this obviously will cause annoyance because of
>> chances of getting such errors when there are concurrent updates and
>> deletes in the same partition. But we can keep a table-level option
>> for determining whether to error out or silently lose the UPDATE.
>
> I'm still a fan of the "do nothing and just document that this is a
> weirdness of partitioned tables" approach, because implementing
> something will be complicated, will ensure that this misses this
> release if not the next one, and may not be any better for users.  But
> probably we need to get some more opinions from other people, since I
> can imagine people being pretty unhappy if the consensus happens to be
> at odds with my own preferences.

I'd give the view that we cannot silently ignore this issue, bearing
in mind the point that we're expecting partitioned tables to behave
exactly like normal tables.

In my understanding the issue is that UPDATEs will fail to update a
row when a valid row exists in the case where a row moved between
partitions; that behaviour will be different to a standard table.

It is of course very good that we have something ready for this
release and can make a choice of what to do.

Thoughts

1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represents almost
exactly the same thing. An UPDATE which gets to a
HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
metadata, or I might be persuaded that in-this-release it is
acceptable to fail when this occurs with an ERROR and a retryable
SQLCODE, since the UPDATE will succeed on next execution.

2. I know that DB2 handles this by having the user specify WITH ROW
MOVEMENT to explicitly indicate they accept the issue and want update
to work even with that. We could have an explicit option to allow
that. This appears to be the only way we could avoid silent errors for
foreign table partitions.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 2:42 PM, Pavan Deolasee
 wrote:
> Let's take an example. Say, we have a table (a int, b int, c text) and two
> indexes on first two columns.
>
>                  H                   W                   H
> (1, 100, 'foo') --> (1, 100, 'bar') --> (1, 200, 'bar') --> (1, 200, 'foo')
>
> The first update will be a HOT update, the second update will be a WARM
> update and the third update will again be a HOT update. The first and third
> update do not create any new index entry, though the second update will
> create a new index entry in the second index. Any further WARM updates to
> this chain is not allowed, but further HOT updates are ok.
>
> If all but the last version become DEAD, HOT-prune will remove all of them
> and turn the first line pointer into REDIRECT line pointer.

So, when you do the WARM update, the new index entries still point at
the original root, which they don't match, not the version where that
new value first appeared?

I don't immediately see how this will work with index-only scans.  If
the tuple is HOT updated several times, HOT-pruned back to a single
version, and then the page is all-visible, the index entries are
guaranteed to agree with the remaining tuple, so it's fine to believe
the data in the index tuple.  But with WARM, that would no longer be
true, unless you have some trick for that...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : Parallel Merge Join

2017-02-24 Thread Amit Kapila
On Tue, Feb 14, 2017 at 5:22 PM, Dilip Kumar  wrote:
> On Tue, Feb 14, 2017 at 12:25 PM, Amit Kapila  wrote:
> Apart from this, there was one small problem in an earlier version
> which I have corrected in this.
>
> + /* Consider only parallel safe inner path */
> + if (innerpath != NULL &&
> + innerpath->parallel_safe &&
> + (cheapest_total_inner == NULL ||
> + cheapest_total_inner->parallel_safe == false ||
> + compare_path_costs(innerpath, cheapest_total_inner,
> + TOTAL_COST) < 0))
>
> In this comparison, we were only checking if innerpath is cheaper than
> the cheapest_total_inner then generate path with this new inner path
> as well. Now, I have added one more check if cheapest_total_inner was
> not parallel safe then also consider a path with this new inner
> (provided this inner is parallel safe).
>


What advantage do you see for considering such a path when the cost of
innerpath is more than cheapest_total_inner?  Remember the more paths
we try to consider, the more time we spend in the planner.  By any
chance are you able to generate any query where it will give benefit
by considering costlier innerpath?

2.
+static void generate_parallel_mergejoin_paths(PlannerInfo *root,
+  RelOptInfo *joinrel,
+  RelOptInfo *innerrel,
+  Path *outerpath,
+  JoinType jointype,
+  JoinPathExtraData *extra,
+  Path *inner_cheapest_total,
+  List *merge_pathkeys);

It is better to name this function as
generate_partial_mergejoin_paths() as we are generating only partial
paths in this function and accordingly change the comment on top of
the function.  I see that you might be naming it based on
consider_parallel_*, however, I think it is better to use partial in
the name as that is what we are doing inside that function.  Also, I
think this function has removed/changed some handling related to
unique outer and full joins, so it is better to mention that in the
function comments, something like "unlike above function, this
function doesn't expect to handle join types JOIN_UNIQUE_OUTER or
JOIN_FULL" and add Assert for both of those types.

3.
A test case is still missing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keep ECPG comment for log_min_duration_statement

2017-02-24 Thread Okano, Naoki
Hi, 

Michael wrote:
> The reason for not keeping the comments in the statement was simply to
> make the parser simpler. If you added this feature, do we still see a
> reason for keeping the old version? Or in other words, shouldn't we
> make the "enable-parse-comment" version the default without a new
> option?
Thank you for your feedback! 

As I said in the first e-mail, there are some restrictions on comment position
in my implementation. I am concerned that an error would occur in .pgc files
that users wrote with older versions.
So, this feature should be a new option.

When the pre-compiler (ECPG) converts C with embedded SQL to normal C code,
gram.y is used for syntactic analysis. I need to change gram.y to handle
comments in SQL statements.
But I have not come up with a better idea that leaves gram.y unaffected.
If you are interested in the details of my implementation, please check the
[WIP] patch I attached.
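
To illustrate the kind of input involved, a (made-up) .pgc fragment could
look like this; the exact comment positions my patch accepts may differ:

    EXEC SQL SELECT /* job: nightly-billing */ count(*)
             INTO :cnt FROM orders;

With the new option, the comment would be kept in the statement ECPG
generates, so it also shows up in the server log when
log_min_duration_statement fires.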

I would appreciate any ideas or opinions.

Regards,
Okano Naoki
Fujitsu


[WIP]enable-parse-comment.patch
Description: [WIP]enable-parse-comment.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 12:28 AM, Alvaro Herrera 
wrote:

> Bruce Momjian wrote:
> > On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote:
>
> > > > and potentially trim the first HOT chain as those tuples become
> > > > invisible.
> > >
> > > That can already happen even without WARM, no?
> >
> > Uh, the point is that with WARM those four early tuples can be removed
> > via a prune, rather than requiring a VACUUM. Without WARM, the fourth
> > tuple can't be removed until the index is cleared by VACUUM.
>
> I *think* that the WARM-updated one cannot be pruned either, because
> it's pointed to by at least one index (otherwise it'd have been a HOT
> update).  The ones prior to that can be removed either way.
>
>
No, even the WARM-updated can be pruned and if there are further HOT
updates, those can be pruned too. All indexes and even multiple pointers
from the same index are always pointing to the root of the WARM chain and
that line pointer does not go away unless the entire chain become dead. The
only material difference between HOT and WARM is that since there are two
index pointers from the same index to the same root line pointer, we must
do recheck. But HOT-pruning and all such things remain the same.

Let's take an example. Say, we have a table (a int, b int, c text) and two
indexes on first two columns.

                 H                   W                   H
(1, 100, 'foo') --> (1, 100, 'bar') --> (1, 200, 'bar') --> (1, 200, 'foo')

The first update will be a HOT update, the second update will be a WARM
update and the third update will again be a HOT update. The first and third
update do not create any new index entry, though the second update will
create a new index entry in the second index. Any further WARM updates to
this chain are not allowed, but further HOT updates are ok.
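
In SQL terms, that example is roughly (table and index names made up):

    CREATE TABLE t (a int, b int, c text);
    CREATE INDEX t_a_idx ON t (a);
    CREATE INDEX t_b_idx ON t (b);

    INSERT INTO t VALUES (1, 100, 'foo');
    UPDATE t SET c = 'bar' WHERE a = 1;  -- HOT: no indexed column changes
    UPDATE t SET b = 200 WHERE a = 1;    -- WARM: new entry only in t_b_idx
    UPDATE t SET c = 'foo' WHERE a = 1;  -- HOT again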

If all but the last version become DEAD, HOT-prune will remove all of them
and turn the first line pointer into REDIRECT line pointer. At this point,
the first index has one index pointer and the second index has two index
pointers, but all pointing to the same root line pointer, which has not
become REDIRECT line pointer.

   Redirect
o---> (1, 200, 'foo')

> I think the part you want (be able to prune the WARM updated tuple) is
> part of what Pavan calls "turning the WARM chain into a HOT chain", so
> not part of the initial patch.  Pavan can explain this part better, and
> also set me straight in case I'm wrong in the above :-)
>
>
Umm.. it's a bit different. Without chain conversion, we still don't allow
further WARM updates to the above chain because that might create a third
index pointer and our recheck logic can't cope with duplicate scans. HOT
updates are allowed though.

The latest patch that I proposed will handle this case and convert such
chains into regular HOT-pruned chains. To do that, we must remove the
duplicate (and now wrong) index pointer to the chain. Once we do that and
change the state on the heap tuple, we can once again do WARM update to
this tuple. Note that in this example the chain has just one tuple, which
will be the case typically, but the algorithm can deal with the case where
there are multiple tuples but with matching index keys.

Hope this helps.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Fri, Feb 24, 2017 at 2:13 PM, Robert Haas  wrote:

> On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian  wrote:
> >
>
> > I have what might be a supid question.  As I remember, WARM only allows
> > a single index-column change in the chain.  Why are you seeing such a
> > large performance improvement?  I would have thought it would be that
> > high if we allowed an unlimited number of index changes in the chain.
>
> I'm not sure how the test case is set up.  If the table has multiple
> indexes, each on a different column, and only one of the indexes is
> updated, then you figure to win because now the other indexes need
> less maintenance (and get less bloated).  If you have only a single
> index, then I don't see how WARM can be any better than HOT, but maybe
> I just don't understand the situation.
>
>
That's correct. If you have just one index and the UPDATE modifies the
indexed column, the UPDATE won't be a WARM update and the patch gives you
no benefit. OTOH if the UPDATE doesn't modify any indexed columns, then it
will be a HOT update and again the patch gives you no benefit. It might be
worthwhile to see if the patch causes any regression in these scenarios, though
I think it will be minimal or zero.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian  wrote:

> On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> >
> > > As I remember, WARM only allows
> > > a single index-column change in the chain.  Why are you seeing such a
> > > large performance improvement?  I would have thought it would be that
> > > high if we allowed an unlimited number of index changes in the chain.
> >
> > The second update in a chain creates another non-warm-updated tuple, so
> > the third update can be a warm update again, and so on.
>
> Right, before this patch they would be two independent HOT chains.  It
> still seems like an unexpectedly-high performance win.  Are two
> independent HOT chains that much more expensive than joining them via
> WARM?


In these tests, there are zero HOT updates, since every update modifies
some index column. With WARM, we could reduce regular updates to half, even
when we allow only one WARM update per chain (chain really has a single
tuple for this discussion). IOW approximately half the updates insert a new
index entry in *every* index and half insert a new index entry *only* in the
affected index. That itself does a good
bit for performance.

So to answer your question: yes, joining two HOT chains via WARM is much
cheaper because it results in creating new index entries just for affected
indexes.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Robert Haas
On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian  wrote:
> On Tue, Jan 31, 2017 at 04:52:39PM +0530, Pavan Deolasee wrote:
>> The other critical bug I found, which unfortunately exists in the master too,
>> is the index corruption during CIC. The patch includes the same fix that I've
>> proposed on the other thread. With these changes, WARM stress is running fine
>> for last 24 hours on a decently powerful box. Multiple CREATE/DROP INDEX 
>> cycles
>> and updates via different indexed columns, with a mix of FOR SHARE/UPDATE and
>> rollbacks did not produce any consistency issues. A side note: while
>> performance measurement wasn't a goal of stress tests, WARM has done about 
>> 67%
>> more transaction than master in 24 hour period (95M in master vs 156M in WARM
>> to be precise on a 30GB table including indexes). I believe the numbers would
>> be far better had the test not dropping and recreating the indexes, thus
>> effectively cleaning up all index bloats. Also the table is small enough to 
>> fit
>> in the shared buffers. I'll rerun these tests with much larger scale factor 
>> and
>> without dropping indexes.
>
> Thanks for setting up the test harness.  I know it is hard but
> in this case it has found an existing bug and given good performance
> numbers.  :-)
>
> I have what might be a supid question.  As I remember, WARM only allows
> a single index-column change in the chain.  Why are you seeing such a
> large performance improvement?  I would have thought it would be that
> high if we allowed an unlimited number of index changes in the chain.

I'm not sure how the test case is set up.  If the table has multiple
indexes, each on a different column, and only one of the indexes is
updated, then you figure to win because now the other indexes need
less maintenance (and get less bloated).  If you have only a single
index, then I don't see how WARM can be any better than HOT, but maybe
I just don't understand the situation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-02-24 Thread Pavan Deolasee
On Thu, Feb 23, 2017 at 11:30 PM, Bruce Momjian  wrote:

> On Wed, Feb  1, 2017 at 10:46:45AM +0530, Pavan Deolasee wrote:
> > > contains a WARM tuple. Alternate ideas/suggestions and review of
> the
> > design
> > > are welcome!
> >
> > t_infomask2 contains one last unused bit,
> >
> >
> > Umm, WARM is using 2 unused bits from t_infomask2. You mean there is
> another
> > free bit after that too?
>
> We are obviously going to use several heap or item pointer bits for
> WARM, and once we do that it is going to be hard to undo that.  Pavan,
> are you saying you could do more with WARM if you had more bits?  Are we
> sure we have given you all the bits we can?  Do we want to commit to a
> lesser feature because the bits are not available?
>
>
The btree implementation is as complete as I would like (there are a few
TODOs, but no show stoppers), at least for the first release. There is a
free bit in the btree index tuple header that I could use for chain conversion.
In the heap tuples, I can reuse HEAP_MOVED_OFF because that bit will only
be set along with HEAP_WARM_TUPLE bit. Since none of the upgraded clusters
can have HEAP_WARM_TUPLE bit set, I think we are safe.

WARM currently also supports hash indexes, but there is no free bit left in
hash index tuple header. I think I can work around that by using a bit from
ip_posid (not yet implemented/tested, but seems doable).

IMHO if we can do that i.e. support btree and hash indexes to start with,
we should be good to go for the first release. We can try to support other
popular index AMs in the subsequent release.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Add checklist item for psql completion to commitfest review

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 2:28 AM, Jim Nasby  wrote:
> I've never messed with completion so I don't know how hard it is, but my
> impression is that it gets added after the fact not because of any
> intentional decisions but because people simply forget about it. ISTM it
> would be more efficient of community resources to deal with completion in
> the original patch, unless there's some reason not to.
>
> IOW, no, don't make it a hard requirement, but don't omit it simply through
> forgetfulness.

Because our biggest problem around here is that it's too easy to get
patches committed, so we should add some more requirements?

I think it's great to include tab completion support in initial
patches if people are willing to do that, but I'm not prepared to
insist on it.  Anyway, it's not just a yes-or-no thing; it's pretty
common that people go back later and add more tab completion to things
that already have some tab completion.  So if we adopt the rule you
are proposing, then the next question will be whether a given patch
has ENOUGH tab completion.  Ugh.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-02-24 Thread Robert Haas
On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston
 wrote:
> For my own sanity - the move update would complete successfully and break
> every ctid chain that it touches.  Any update lined up behind it in the lock
> queue would discover their target record has been deleted and would
> experience whatever behavior their isolation level dictates for such a
> situation.  So multi-partition update queries will fail to update some
> records if they happen to move between partitions even if they would
> otherwise match the query's predicate.

Right.  That's the behavior for which I am advocating, on the grounds
that it's the simplest to implement and if we all agree on something
else more complicated later, we can do it then.

> Is there any difference in behavior between this and a SQL writeable CTE
> performing the same thing via delete-returning-insert?

Not to my knowledge.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bytea_output output of base64

2017-02-24 Thread Fabien COELHO


It undoubtedly would make pg_dump smaller, though I'm not sure how much 
that's worth since if you care at all about that you'll gzip it.


But, the other thing it might do is speed up COPY, especially on input. Some 
performance tests of that might be interesting.


For what it is worth:

Ascii85 (aka Base85) is used in PDF by Adobe, with a prefix "<~" and a 
suffix "~>". It codes 4 bytes as 5 ascii characters, i.e. a 25% loss. 
There is also Z85 which avoids some special characters: backslash, single 
quote, double quote.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers