Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-17 Thread Ildus Kurbangaliev
On Mon, 16 Nov 2015 18:55:55 -0500
Robert Haas  wrote:

> On Mon, Nov 16, 2015 at 7:32 AM, Ildus Kurbangaliev
>  wrote:
> > What if we just create a control struct in shared memory, as in other places?
> > BufferDescriptors
> > and BufferBlocks can be kept there along with tranches definitions
> > and lwlocks. Buffer locks that are located in MainLWLockArray by offset
> > can be moved there too.
> 
> Yeah, we could do that, but what's the advantage of it?  The alignment
> of the buffer descriptors is kinda finnicky and matters to
> performance, so it seems better not to prefix them with something that
> might perturb it.  If we just rebase Andres' patch over what I just
> committed and add in something so that the buffer numbers are fed from
> #defines or an enum instead of being random integers, I think we're
> done.

I created a little patch on top of Andres' patch as example.
I see several advantages:

1) We can avoid constants and use standard steps for tranche creation.
2) We have only one global variable (BufferCtl).
3) Tranches are moved to shared memory, so we won't need to do
additional work here.
4) Also, we can keep the buffer locks from MainLWLockArray there (that was
done in the attached patch).

-- 
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..e733377 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -18,9 +18,7 @@
 #include "storage/buf_internals.h"
 
 
-BufferDescPadded *BufferDescriptors;
-char	   *BufferBlocks;
-
+BufferCtlBase	*BufferCtl;
 
 /*
  * Data Structures:
@@ -64,28 +62,49 @@ char	   *BufferBlocks;
 void
 InitBufferPool(void)
 {
-	bool		foundBufs,
-foundDescs;
-
-	/* Align descriptors to a cacheline boundary. */
-	BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
-		ShmemInitStruct("Buffer Descriptors",
-			NBuffers * sizeof(BufferDescPadded) + PG_CACHE_LINE_SIZE,
-			&foundDescs));
+	bool			 foundCtl;
+	int i;
+	BufferCtlData	*ctl;
 
-	BufferBlocks = (char *)
-		ShmemInitStruct("Buffer Blocks",
-		NBuffers * (Size) BLCKSZ, &foundBufs);
+	ctl = (BufferCtlData *) ShmemInitStruct("BufferCtl",
+		sizeof(BufferCtlData), &foundCtl);
+	BufferCtl = (BufferCtlBase *) ctl;
 
-	if (foundDescs || foundBufs)
-	{
-		/* both should be present or neither */
-		Assert(foundDescs && foundBufs);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
+	if (!foundCtl)
 	{
-		int			i;
+		/* Align descriptors to a cacheline boundary. */
+		ctl->base.bufferDescriptors = (void *) CACHELINEALIGN(ShmemAlloc(
+			NBuffers * sizeof(BufferDescPadded) + PG_CACHE_LINE_SIZE));
+
+		ctl->base.bufferBlocks = (char *) ShmemAlloc(NBuffers * (Size) BLCKSZ);
+
+		strncpy(ctl->IOLWLockTrancheName, "buffer_io",
+			BUFMGR_MAX_NAME_LENGTH);
+		strncpy(ctl->contentLWLockTrancheName, "buffer_content",
+			BUFMGR_MAX_NAME_LENGTH);
+		strncpy(ctl->blockLWLockTrancheName, "buffer_blocks",
+			BUFMGR_MAX_NAME_LENGTH);
+
+		ctl->IOLocks = (LWLockMinimallyPadded *) ShmemAlloc(
+			sizeof(LWLockMinimallyPadded) * NBuffers);
+
+		/* Initialize tranche fields */
+		ctl->IOLWLockTranche.array_base = ctl->IOLocks;
+		ctl->IOLWLockTranche.array_stride = sizeof(LWLockMinimallyPadded);
+		ctl->IOLWLockTranche.name = ctl->IOLWLockTrancheName;
+
+		ctl->contentLWLockTranche.array_base =
+			((char *) ctl->base.bufferDescriptors) + offsetof(BufferDesc, content_lock);
+		ctl->contentLWLockTranche.array_stride = sizeof(BufferDescPadded);
+		ctl->contentLWLockTranche.name = ctl->contentLWLockTrancheName;
+
+		ctl->blockLWLockTranche.array_base = &ctl->blockLocks;
+		ctl->blockLWLockTranche.array_stride = sizeof(LWLockPadded);
+		ctl->blockLWLockTranche.name = ctl->blockLWLockTrancheName;
+
+		ctl->blockLWLockTrancheId = LWLockNewTrancheId();
+		ctl->contentLWLockTrancheId = LWLockNewTrancheId();
+		ctl->IOLWLockTrancheId = LWLockNewTrancheId();
 
 		/*
 		 * Initialize all the buffer headers.
@@ -110,16 +129,30 @@ InitBufferPool(void)
 			 */
 			buf->freeNext = i + 1;
 
-			buf->io_in_progress_lock = LWLockAssign();
-			buf->content_lock = LWLockAssign();
+			LWLockInitialize(BufferDescriptorGetContentLock(buf),
+							 ctl->contentLWLockTrancheId);
+			LWLockInitialize(&ctl->IOLocks[i].lock,
+							 ctl->IOLWLockTrancheId);
 		}
 
+		for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+			LWLockInitialize(&ctl->blockLocks[i].lock,
+							 ctl->blockLWLockTrancheId);
+
 		/* Correct last entry of linked list */
 		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 	}
 
+	/* Register tranches in main tranches array */
+	LWLockRegisterTranche(ctl->IOLWLockTrancheId,
+						  &ctl->IOLWLockTranche);
+	LWLockRegisterTranche(ctl->contentLWLockTrancheId,
+						  &ctl->contentLWLockTranche);
+	LWLockRegisterTranche(ctl->blockLWLockTrancheId,
+						  &ctl->blockLWLockTranche);
+
 	/* Init other shared buffer-management 

Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Bert
edit: maybe this is more useful? :)

(gdb) bt full
#0  0x00490b56 in heap_parallelscan_nextpage ()
No symbol table info available.
#1  0x00493fdf in heap_getnext ()
No symbol table info available.
#2  0x005c0733 in SeqNext ()
No symbol table info available.
#3  0x005ac5d9 in ExecScan ()
No symbol table info available.
#4  0x005a5c08 in ExecProcNode ()
No symbol table info available.
#5  0x005b5298 in ExecGather ()
No symbol table info available.
#6  0x005a5aa8 in ExecProcNode ()
No symbol table info available.
#7  0x005b68b9 in MultiExecHash ()
No symbol table info available.
#8  0x005b7256 in ExecHashJoin ()
No symbol table info available.
#9  0x005a5b18 in ExecProcNode ()
No symbol table info available.
#10 0x005b0ac9 in fetch_input_tuple ()
No symbol table info available.
#11 0x005b1eaf in ExecAgg ()
No symbol table info available.
#12 0x005a5ad8 in ExecProcNode ()
No symbol table info available.
#13 0x005c11e1 in ExecSort ()
No symbol table info available.
#14 0x005a5af8 in ExecProcNode ()
No symbol table info available.
#15 0x005ba164 in ExecLimit ()
No symbol table info available.
#16 0x005a5a38 in ExecProcNode ()
No symbol table info available.
#17 0x005a2343 in standard_ExecutorRun ()
No symbol table info available.
#18 0x0069cb08 in PortalRunSelect ()
No symbol table info available.
#19 0x0069de5f in PortalRun ()
No symbol table info available.
#20 0x0069bc16 in PostgresMain ()
No symbol table info available.
#21 0x00466f55 in ServerLoop ()
No symbol table info available.
#22 0x00648436 in PostmasterMain ()
No symbol table info available.
#23 0x004679f0 in main ()
No symbol table info available.


On Tue, Nov 17, 2015 at 12:38 PM, Bert  wrote:

> Hi,
>
> this is the backtrace:
> gdb /var/lib/pgsql/9.6/data/ /var/lib/pgsql/9.6/data/core.7877
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <
> http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> ...
> /var/lib/pgsql/9.6/data/: Success.
> [New LWP 7877]
> Missing separate debuginfo for the main executable file
> Try: yum --enablerepo='*debug*' install
> /usr/lib/debug/.build-id/02/20b77a9ab8f607b0610082794165fccedf210d
> Core was generated by `postgres: postgres tpcds [loca'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00490b56 in ?? ()
> (gdb) bt full
> #0  0x00490b56 in ?? ()
> No symbol table info available.
> #1  0x3668 in ?? ()
> No symbol table info available.
> #2  0x7f956249a008 in ?? ()
> No symbol table info available.
> #3  0x0228c498 in ?? ()
> No symbol table info available.
> #4  0x0001 in ?? ()
> No symbol table info available.
> #5  0x0228ad00 in ?? ()
> No symbol table info available.
> #6  0x00493fdf in ?? ()
> No symbol table info available.
> #7  0x021a8e50 in ?? ()
> No symbol table info available.
> #8  0x in ?? ()
> No symbol table info available.
> (gdb) q
>
> Is there something else I can do?
>
> On Mon, Nov 16, 2015 at 8:59 PM, Robert Haas 
> wrote:
>
>> On Mon, Nov 16, 2015 at 2:51 PM, Bert  wrote:
>> > I've just pulled and compiled the new code.
>> > I'm running a TPC-DS like test on different PostgreSQL installations,
>> but
>> > running (max) 12queries in parallel on a server with 12cores.
>> > I've configured max_parallel_degree to 2, and I get messages that
>> backend
>> > processes crash.
>> > I am running the same test now with 6queries in parallel, and parallel
>> > degree to 2, and they seem to work. for now. :)
>> >
>> > This is the output I get in /var/log/messages
>> > Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at
>> 7fa3437bf104
>> > ip 00490b56 sp 7ffdf2f083a0 error 6 in
>> postgres[40+5b5000]
>> >
>> > Is there something else I should get?
>>
>> Can you enable core dumps e.g. by passing the -c option to pg_ctl
>> start?  If you can get a core file, you can then get a backtrace
>> using:
>>
>> gdb /path/to/postgres /path/to/core
>> bt full
>> q
>>
>> That should be enough to find and fix whatever the bug is.  Thanks for
>> testing.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
>
> --
> Bert Desmet
> 0477/305361
>



-- 
Bert Desmet
0477/305361


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Amit Kapila
On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs  wrote:

> On 17 November 2015 at 11:48, Amit Kapila  wrote:
>
>> On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs 
>> wrote:
>>
>>> On 17 November 2015 at 11:27, Amit Kapila 
>>> wrote:
>>>
>>> We are trying to speed up real cases, not just benchmarks.
>
> So +1 for the concept, patch is going in right direction though lets
> do the full press-up.
>
>
 I have mentioned above the reason for not doing it for sub
 transactions, if
 you think it is viable to reserve space in shared memory for this
 purpose, then
 I can include the optimization for subtransactions as well.

>>>
>>> The number of subxids is unbounded, so as you say, reserving shmem isn't
>>> viable.
>>>
>>> I'm interested in real world cases, so allocating 65 xids per process
>>> isn't needed, but we can say is that the optimization shouldn't break down
>>> abruptly in the presence of a small/reasonable number of subtransactions.
>>>
>>>
>> I think in that case what we can do is if the total number of
>> sub transactions is lesser than equal to 64 (we can find that by
>> overflowed flag in PGXact) , then apply this optimisation, else use
>> the existing flow to update the transaction status.  I think for that we
>> don't even need to reserve any additional memory. Does that sound
>> sensible to you?
>>
>
> I understand you to mean that the leader should look backwards through the
> queue collecting xids while !(PGXACT->overflowed)
>
>
Yes, that is what the above idea is, but the problem with that is that the
leader won't be able to collect the subxids of member procs (from each
member proc's XidCache), as it doesn't have information about which of
those subxids need to be updated as part of the current transaction status
update (for subtransactions on different clog pages, we update the status
in multiple phases).  I think it would only be possible to use the above
idea if all the subtransactions are on the same page, which we can identify
in function TransactionIdSetPageStatus().  Though it looks okay to apply
this optimization when the number of subtransactions is less than 65 and
all exist on the same page, it would still be better if we could apply it
generically for all cases where the number of subtransactions is small
(say 32 or 64).  Does this explanation clarify the problem with the above
idea for handling subtransactions?
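
For illustration, a minimal sketch of that same-page test, mirroring the page
arithmetic used in src/backend/access/transam/clog.c (the macros below copy
clog.c's private definitions; the helper itself is hypothetical):

#include "postgres.h"

/* These mirror the private definitions in clog.c. */
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)

/*
 * Hypothetical helper: true if the main xid and all of its subtransaction
 * ids fall on the same CLOG page, in which case the group-update
 * optimization could be applied in one shot.
 */
static bool
AllXidsOnSameClogPage(TransactionId xid, int nsubxids,
                      const TransactionId *subxids)
{
    int         pageno = TransactionIdToPage(xid);
    int         i;

    for (i = 0; i < nsubxids; i++)
    {
        if (TransactionIdToPage(subxids[i]) != pageno)
            return false;
    }
    return true;
}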


> No additional shmem is required
>
>
If we want to do it for all cases where the number of subtransactions
is small, then we need extra memory.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Amit Kapila
On Tue, Nov 17, 2015 at 2:45 PM, Simon Riggs  wrote:

> On 17 November 2015 at 06:50, Amit Kapila  wrote:
>
>
>> In anycase, I went ahead and tried further reducing the CLogControlLock
>> contention by grouping the transaction status updates.  The basic idea
>> is same as is used to reduce the ProcArrayLock contention [1] which is to
>> allow one of the proc to become leader and update the transaction status
>> for
>> other active transactions in system.  This has helped to reduce the
>> contention
>> around CLOGControlLock.
>>
>
> Sounds good. The technique has proved effective with proc array and makes
> sense to use here also.
>
>
>> Attached patch group_update_clog_v1.patch
>> implements this idea.
>>
>
> I don't think we should be doing this only for transactions that don't
> have subtransactions.
>

The reason for not doing this optimization for subtransactions is that we
need to advertise the information that the group leader needs for updating
the transaction status, and if we want to do it for subtransactions, then
all the subtransaction ids need to be advertised.  The tricky part here is
that the number of subtransactions whose status needs to be updated is
dynamic, so reserving memory for it would be difficult.  However, we can
reserve some space in PGPROC, as we do for XidCache (the cache of
subtransaction ids), and then use that to advertise that many xids at a
time, or just allow this optimization if the number of subtransactions is
less than or equal to the size of this new XidCache.  I am not sure whether
it is a good idea to use the existing XidCache for this purpose, in which
case we would need separate space in PGPROC.  I don't see allocating space
for 64 or so subxids as a problem, but doing it for a bigger number could
be a cause for concern.
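
A rough sketch of the kind of reservation being discussed, assuming a
hypothetical fixed-size per-backend array (the names and size below are
illustrative only, not the actual patch):

#include "postgres.h"

/* Illustrative cap, in the spirit of PGPROC_MAX_CACHED_SUBXIDS (64). */
#define GROUP_UPDATE_MAX_SUBXIDS    64

/*
 * Hypothetical per-backend advertisement area: a member proc fills this in
 * before joining the group, and the group leader reads it when setting
 * CLOG status on the member's behalf.
 */
typedef struct ClogGroupAdvertisement
{
    TransactionId   memberXid;      /* main transaction id */
    int             nMemberSubxids; /* number of advertised subxids */
    TransactionId   memberSubxids[GROUP_UPDATE_MAX_SUBXIDS];
} ClogGroupAdvertisement;

/* Fall back to the existing update path when the cache would overflow. */
static inline bool
CanUseGroupClogUpdate(int nsubxids)
{
    return nsubxids <= GROUP_UPDATE_MAX_SUBXIDS;
}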


> We are trying to speed up real cases, not just benchmarks.
>
> So +1 for the concept, patch is going in right direction though lets do
> the full press-up.
>
>
I have mentioned above the reason for not doing it for subtransactions; if
you think it is viable to reserve space in shared memory for this purpose,
then I can include the optimization for subtransactions as well.



> The above data indicates that contention due to CLogControlLock is
>> reduced by around 50% with this patch.
>>
>> The reasons for remaining contention could be:
>>
>> 1. Readers of clog data (checking transaction status data) can take
>> Exclusive CLOGControlLock when reading the page from disk, this can
>> contend with other Readers (shared lockers of CLogControlLock) and with
>> exclusive locker which updates transaction status. One of the ways to
>> mitigate this contention is to increase the number of CLOG buffers for
>> which
>> patch has been already posted on this thread.
>>
>> 2. Readers of clog data (checking transaction status data) takes shared
>> CLOGControlLock which can contend with exclusive locker (Group leader)
>> which
>> updates transaction status.  I have tried to reduce the amount of work
>> done
>> by group leader, by allowing group leader to just read the Clog page once
>> for all the transactions in the group which updated the same CLOG page
>> (idea similar to what we currently we use for updating the status of
>> transactions
>> having sub-transaction tree), but that hasn't given any further
>> performance boost,
>> so I left it.
>>
>> I think we can use some other ways as well to reduce the contention around
>> CLOGControlLock by doing somewhat major surgery around SLRU like using
>> buffer pools similar to shared buffers, but this idea gives us moderate
>> improvement without much impact on exiting mechanism.
>>
>
> My earlier patch to reduce contention by changing required lock level is
> still valid here. Increasing the number of buffers doesn't do enough to
> remove that.
>
>
I understand that increasing the number of buffers alone is not
enough; that's why I have tried this group leader idea.  However,
if we do something along the lines of what you have described below
(handling page faults), we could avoid the need for increasing buffers.


> I'm working on a patch to use a fast-update area like we use for GIN. If a
> page is not available when we want to record commit, just store it in a
> hash table, when not in crash recovery. I'm experimenting with writing WAL
> for any xids earlier than last checkpoint, though we could also trickle
> writes and/or flush them in batches at checkpoint time - your code would
> help there.
>
> The hash table can also be used for lookups. My thinking is that most
> reads of older xids are caused by long running transactions, so they cause
> a page fault at commit and then other page faults later when people read
> them back in. The hash table works for both kinds of page fault.
>
>

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Bert
Hi,

this is the backtrace:
gdb /var/lib/pgsql/9.6/data/ /var/lib/pgsql/9.6/data/core.7877
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
/var/lib/pgsql/9.6/data/: Success.
[New LWP 7877]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/02/20b77a9ab8f607b0610082794165fccedf210d
Core was generated by `postgres: postgres tpcds [loca'.
Program terminated with signal 11, Segmentation fault.
#0  0x00490b56 in ?? ()
(gdb) bt full
#0  0x00490b56 in ?? ()
No symbol table info available.
#1  0x3668 in ?? ()
No symbol table info available.
#2  0x7f956249a008 in ?? ()
No symbol table info available.
#3  0x0228c498 in ?? ()
No symbol table info available.
#4  0x0001 in ?? ()
No symbol table info available.
#5  0x0228ad00 in ?? ()
No symbol table info available.
#6  0x00493fdf in ?? ()
No symbol table info available.
#7  0x021a8e50 in ?? ()
No symbol table info available.
#8  0x in ?? ()
No symbol table info available.
(gdb) q

Is there something else I can do?

On Mon, Nov 16, 2015 at 8:59 PM, Robert Haas  wrote:

> On Mon, Nov 16, 2015 at 2:51 PM, Bert  wrote:
> > I've just pulled and compiled the new code.
> > I'm running a TPC-DS like test on different PostgreSQL installations, but
> > running (max) 12queries in parallel on a server with 12cores.
> > I've configured max_parallel_degree to 2, and I get messages that backend
> > processes crash.
> > I am running the same test now with 6queries in parallel, and parallel
> > degree to 2, and they seem to work. for now. :)
> >
> > This is the output I get in /var/log/messages
> > Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at
> 7fa3437bf104
> > ip 00490b56 sp 7ffdf2f083a0 error 6 in
> postgres[40+5b5000]
> >
> > Is there something else I should get?
>
> Can you enable core dumps e.g. by passing the -c option to pg_ctl
> start?  If you can get a core file, you can then get a backtrace
> using:
>
> gdb /path/to/postgres /path/to/core
> bt full
> q
>
> That should be enough to find and fix whatever the bug is.  Thanks for
> testing.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Bert Desmet
0477/305361


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-11-17 Thread Vladimir Borodin

> On 14 Nov 2015, at 10:50, Amit Kapila  wrote:
> 
> On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas  wrote:
> > On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov  wrote:
> >
> > >> I think it's reasonable to consider reporting this data in the PGPROC
> > >> using a 4-byte integer rather than reporting it through a singe byte
> > >> in the backend status structure.  I believe that addresses the
> > >> concerns about reporting from auxiliary processes, and it also allows
> > >> a little more data to be reported.  For anything in excess of that, I
> > >> think we should think rather harder.  Most likely, such addition
> > >> detail should be reported only for certain types of wait events, or on
> > >> a delay, or something like that, so that the core mechanism remains
> > >> really, really fast.
> > >
> > > That sounds reasonable. There are many pending questions, but it seems 
> > > like
> > > step forward to me.
> >
> > Great, let's do it.  I think we should probably do the work to
> > separate the non-individual lwlocks into tranches first, though.
> >
> 
> One thing that occurred to me in this context is that if we store the wait
> event information in PGPROC, then can we think of providing the info
> about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or
> any other better name) where we can display wait information about
> all-processes rather than only backends?  This will avoid the confusion
> about breaking the backward compatibility for the current 'waiting' column
> in pg_stat_activity.
> 
> pg_stat_waits can have columns:
> pid  - Process Id
> wait_class_name - Name of the wait class
> wait_class_event  - name of the wait event
> 
> We can extend it later with the information about timing for wait event.
> 
> Also, if we follow this approach, I think we don't need to store this
> information in PgBackendStatus.

Sounds exactly like what was proposed by Ildus in this thread [0].
Great that we are thinking in the same direction. And, by way of advertisement,
I've described using all those views here [1].

[0] http://www.postgresql.org/message-id/559d4729.9080...@postgrespro.ru
[1] https://simply.name/pg-stat-wait.html
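
As a rough illustration of the "4-byte integer in PGPROC" idea that such a
view could decode, here is a minimal packing sketch; the 8-bit class /
24-bit event split and all the names are assumptions for illustration, not
a committed design:

#include <stdint.h>

#define WAIT_EVENT_CLASS_SHIFT  24
#define WAIT_EVENT_ID_MASK      0x00FFFFFFu

/* Pack a wait class and event id into one 32-bit value for PGPROC. */
static inline uint32_t
make_wait_event_info(uint8_t classId, uint32_t eventId)
{
    return ((uint32_t) classId << WAIT_EVENT_CLASS_SHIFT) |
           (eventId & WAIT_EVENT_ID_MASK);
}

/* Decode the class, e.g. to report wait_class_name in pg_stat_waits. */
static inline uint8_t
wait_event_class(uint32_t info)
{
    return (uint8_t) (info >> WAIT_EVENT_CLASS_SHIFT);
}

/* Decode the event id, e.g. to report the wait event name. */
static inline uint32_t
wait_event_id(uint32_t info)
{
    return info & WAIT_EVENT_ID_MASK;
}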

>   
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com 

--
May the force be with you…
https://simply.name



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Amit Kapila
On Tue, Nov 17, 2015 at 5:18 PM, Amit Kapila 
wrote:

> On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs 
> wrote:
>
>> On 17 November 2015 at 11:27, Amit Kapila 
>> wrote:
>>
>> We are trying to speed up real cases, not just benchmarks.

 So +1 for the concept, patch is going in right direction though lets do
 the full press-up.


>>> I have mentioned above the reason for not doing it for sub transactions,
>>> if
>>> you think it is viable to reserve space in shared memory for this
>>> purpose, then
>>> I can include the optimization for subtransactions as well.
>>>
>>
>> The number of subxids is unbounded, so as you say, reserving shmem isn't
>> viable.
>>
>> I'm interested in real world cases, so allocating 65 xids per process
>> isn't needed, but we can say is that the optimization shouldn't break down
>> abruptly in the presence of a small/reasonable number of subtransactions.
>>
>>
> I think in that case what we can do is if the total number of
> sub transactions is lesser than equal to 64 (we can find that by
> overflowed flag in PGXact) , then apply this optimisation, else use
> the existing flow to update the transaction status.  I think for that we
> don't even need to reserve any additional memory.
>

I think this won't work as it is, because subxids in XidCache could be
on different pages, in which case we would either need an additional flag
in the XidCache array or a separate array to indicate which subxids
we want to update the status for.  I don't see any better way to do this
optimization for subtransactions; do you have something else in
mind?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-11-17 Thread Kyotaro HORIGUCHI
Oops. 

At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20151117.194010.17198448.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada  
> wrote in 

Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Thom Brown
On 17 November 2015 at 10:29, Masahiko Sawada  wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas  wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila 
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund 
>>> wrote:
 On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
 > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs 
 > wrote:
 > >
 > > On 1 October 2015 at 23:30, Josh Berkus  wrote:
 > >>
 > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
 > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
 > >> > 
 > wrote:
 > >> >> I wonder how much it's worth renaming only the file extension
 > >> >> while
 > >> >> there are many places where "visibility map" and "vm" are used,
 > >> >> for example, log messages, function names, variables, etc.
 > >> >
 > >> > I'd be inclined to keep calling it the visibility map (vm) even
 > >> > if
 > >> > it
 > >> > also contains freeze information.
 > >> >
 >
 > What is your main worry about changing the name of this map, is it
 > about more code churn or is it about that we might introduce new
 > issues
 > or is it about that people are already accustomed to call this map as
 > visibility map?

 Several:
 * Visibility map is rather descriptive, none of the replacement terms
   imo come close. Few people will know what a 'freeze' map is.
 * It increases the size of the patch considerably
 * It forces tooling that knows about the layout of the database
   directory to change their tools

>>>
>>> All these points are legitimate.
>>>
 On the benefit side the only argument I've heard so far is that it allows
 to disambiguate the format. But, uh, a look at the major version does
 that just as well, for far less trouble.

 > It seems to me quite logical for understanding purpose as well.  Any
 > new
 > person who wants to work in this area or is looking into it will
 > always
 > wonder why this map is named as visibility map even though it contains
 > information about visibility of page as well as frozen state of page.

 Being frozen is about visibility as well.

>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> I understood.
> I'm going to turn the patch back to visibility map, and just add the logic
> of enhancement of VACUUM FREEZE.
> If we want to add the other status not related to visibility into visibility
> map in the future, it would be worth to consider.

Could someone post a TL;DR summary of what the current plan looks
like?  I can see there is a huge amount of discussion to trawl back
through.  I can see it's something to do with the visibility map.  And
does it avoid freezing very large tables like the title originally
sought?

Thanks

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting sorted data from foreign server for merge join

2015-11-17 Thread Ashutosh Bapat
Thanks Robert for your comments. Please see my replies inlined. Updated
patch is attached.

On Fri, Nov 6, 2015 at 10:02 PM, Robert Haas  wrote:

>
> I think this approach is generally reasonable, but I suggested parts
> of it, so may be biased.  I would be interested in hearing the
> opinions of others.
>
> Random notes:
>
> "possibily" is a typo.
>

Fixed.


>
> usable_pklist is confusing because it seems like it might be talking
> about primary keys rather than pathkeys.  Also, I realize now, looking
> at this again, that you're saying "usable" when what I really think
> you mean is "useful". Lots of pathkeys are usable, but only a few of
> those are useful.  I suggest renaming usable_pathkeys to
> query_pathkeys


Done.


> and usable_pklist to useful_pathkeys.


pathkeys is being used to mean a list of pathkey nodes. What we want here
is a list of pathkeys, so I have renamed it useful_pathkeys_list; somewhat
long, but clear.


> Similarly, let's
> rename generate_pathkeys_for_relation() to
> get_useful_pathkeys_for_relation().
>

Done.


>
> Although I'm usually on the side of marking things as extern whenever
> we find it convenient, I'm nervous about doing that to
> make_canonical_pathkey(), because it has side effects.  Searching the
> list of canonical pathkeys for the one we need is reasonable, but is
> it really right to ever think that we might create a new one at this
> stage?  Maybe it is, but if so I'd like to hear a good explanation as
> to why.
>

For a foreign table, no pathkeys are created before creating paths for the
individual foreign table, since there are no indexes. For a regular table,
pathkeys get created for useful indexes. The exception is that if the
expression appears in the ORDER BY clause, the pathkey for it is created
while handling the ORDER BY clause. So we will always need to create a
pathkey for a foreign table unless the corresponding expression appears in
the ORDER BY clause. This can be verified by breaking in
make_canonical_pathkey() while executing "explain verbose select * from ft1
join lt using (val);", where ft1(val int, val2 int) is a foreign table and
lt(val int, val2 int) is a local table. You will hit the first breakpoint
with this stack trace:
Breakpoint 1, make_canonical_pathkey (root=0x231d858, eclass=0x231f030,
opfamily=1976, strategy=1, nulls_first=0 '\000') at pathkeys.c:60
(gdb) where
#0  make_canonical_pathkey (root=0x231d858, eclass=0x231f030,
opfamily=1976, strategy=1, nulls_first=0 '\000') at pathkeys.c:60
#1  0x7f6740077e39 in get_useful_pathkeys_for_relation (root=0x231d858,
rel=0x231e390) at postgres_fdw.c:663
#2  0x7f6740077f34 in postgresGetForeignPaths (root=0x231d858,
baserel=0x231e390, foreigntableid=16393) at postgres_fdw.c:705
#3  0x007079cf in set_foreign_pathlist (root=0x231d858,
rel=0x231e390, rte=0x231c488) at allpaths.c:600
#4  0x00707653 in set_rel_pathlist (root=0x231d858, rel=0x231e390,
rti=1, rte=0x231c488) at allpaths.c:395
#5  0x0070735f in set_base_rel_pathlists (root=0x231d858) at
allpaths.c:277

at this point

(gdb) p root->canon_pathkeys
$1 = (List *) 0x0

so get_useful_pathkeys_for_relation->make_canonical_pathkey() will create
the first pathkey.

I have left the corresponding TODO untouched, if anybody else wants to
review that part of the code. I will remove it in the final version of the
patch.


>
>  Is the comment "Equivalence classes covering relations other than the
> current one are of interest here" missing a "not"?
>

The sentence is correct. We need equivalence classes covering relations
other than the current one, because only such equivalence classes indicate
joins between two relations. If an equivalence class covers only a single
baserel (has only a single member in ec_relids), it indicates equivalence
between columns/expressions of the same table, which cannot result in a
join. I have changed the comment to be more elaborate.


>
> I don't find this comment illuminating:
>
> + * In case of child relation, we need to check that the
> + * equivalence class indicates a join to a relation other than
> + * parents, other children and itself (something similar to
> above).
> + * Otherwise we will end up creating useless paths. The code
> below is
> + * similar to generate_implied_equalities_for_column(), which
> might
> + * give a hint.
>
> That basically just says that we have to do it this way because the
> other way would be wrong.  But it doesn't say WHY the other way would
> be wrong.


There's no "other way" which is wrong. What's the other way you are talking
about?

Anyway, I have updated the comment to be more readable.


> Then a few lines later, you have another comment which says
> the same thing again:
>
> +/*
> + * Ignore equivalence members which correspond to children
> + * or same relation or to parent relations
> + */
>
>
Updated this too.


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 10:57 AM, Tom Lane  wrote:
> Vitaly Burovoy  writes:
>> I suppose behavior of monotonic values (julian, century, decade,
>> isoyear, millennium and year) should be the same as for epoch (which
>> obviously also monotonic value).
>> Proposed patch has that behavior: +/-infinity for epoch, julian,
>> century, decade, isoyear, millennium and year; NULL for other fields.
>
> Works for me.

Same here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2015-11-17 Thread Tom Lane
Jim Nasby  writes:
> On 11/17/15 2:09 AM, Vitaly Burovoy wrote:
>> Proposed patch has that behavior: ±infinity for epoch, julian,
>> century, decade, isoyear, millennium and year; NULL for other fields.

> What's the logic behind NULL here? Infinity is infinity, whether it's 
> minutes or years.

Didn't you follow the upthread discussion?  Fields such as "minutes"
are cyclic, so it's impossible to say either that they converge to
a defined limit or diverge to infinity as x increases.  NULL, in the
sense of "unknown", seems like a reasonable representation of that.
Infinity doesn't.

> My specific fear is that now people will have to do a bunch of IF 
> timestamp IS NOT NULL THEN ... to get the behavior they need.

Considering that the old behavior is to return zero, and we've had
relatively few complaints about that, I doubt very many people are
going to care.
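
A small standalone C sketch of the classification being agreed on here:
monotonic fields follow the sign of the infinity, while cyclic fields have
no defined limit and so come back as SQL NULL. The enum is a simplified
stand-in for the DTK_* field codes used by the real timestamp_part(); it is
illustrative only.

#include <math.h>
#include <stdbool.h>

typedef enum
{
    FIELD_EPOCH, FIELD_JULIAN, FIELD_CENTURY, FIELD_DECADE,
    FIELD_ISOYEAR, FIELD_MILLENNIUM, FIELD_YEAR,
    FIELD_MONTH, FIELD_DAY, FIELD_HOUR, FIELD_MINUTE, FIELD_SECOND
} TsField;

/*
 * Returns true and sets *result for monotonic fields; returning false
 * means the caller should produce SQL NULL (a cyclic field).
 */
static bool
infinite_timestamp_part(TsField field, bool is_negative, double *result)
{
    switch (field)
    {
        case FIELD_EPOCH:
        case FIELD_JULIAN:
        case FIELD_CENTURY:
        case FIELD_DECADE:
        case FIELD_ISOYEAR:
        case FIELD_MILLENNIUM:
        case FIELD_YEAR:
            *result = is_negative ? -INFINITY : INFINITY;
            return true;
        default:
            return false;       /* minutes, seconds, etc. are cyclic */
    }
}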

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] new full text search configurations

2015-11-17 Thread Oleg Bartunov
I checked the new snowball site http://snowballstem.org/ and found several
new stemmers have appeared (as external contributions):


   - Irish and Czech 
   - Object Pascal codegenerator for Snowball
   
   - Two stemmers for Romanian 
   - Hungarian 
   - Turkish 
   - Armenian 
   - Basque (Euskera)
   
   - Catalan 

Some of them we don't have in our list of default configurations. Since
these are external, not official stemmers, it'd be nice if people looked at
and tested them. If they are fine, we can prepare new configurations for 9.6.

 \dF
           List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | turkish    | configuration for turkish language
 public     | english_ns |
(17 rows)


Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-11-17 Thread Tom Lane
Peter Geoghegan  writes:
>>> Might be better to hack a special case right there (ie, embed TIDs into
>>> int8s and sort the int8s) rather than try to change the type's SQL
>>> declaration.

> I suggested to someone else that he take a look at this as a project,
> but I guess he was busy too. I decided to just do it myself. Patch is
> attached.

I think this might do the wrong thing with block numbers above 0x8000
and/or offset numbers above 0x8000.  I'd be more comfortable about it if
+   encoded = ((int64) block << 16) | offset;
were
+   encoded = ((uint64) block << 16) | (uint16) offset;
so that there's no risk of the compiler deciding to sign-extend rather
than zero-fill either input.

Also, I think it'd be a good idea to explicitly set indexcursor = NULL
in the tuplesort_empty case; the previous coding was effectively doing
that.  It's true that the code shouldn't attempt to touch the value,
but it's better not to have dangling pointers lying around.
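
Putting the two points together, a self-contained sketch of the encoding
(with the zero-fill casts) and the matching decode step looks roughly like
this; the function names are illustrative, not the patch's:

#include <stdint.h>

/*
 * Pack a 32-bit block number and a 16-bit offset number into one int64 so
 * TIDs can be sorted with the cheap int8 comparator.  The casts ensure
 * neither input is sign-extended.
 */
static inline int64_t
itemptr_encode(uint32_t block, uint16_t offset)
{
    return (int64_t) (((uint64_t) block << 16) | (uint16_t) offset);
}

/* Recover the original block and offset from the encoded value. */
static inline void
itemptr_decode(int64_t encoded, uint32_t *block, uint16_t *offset)
{
    *block = (uint32_t) (encoded >> 16);
    *offset = (uint16_t) (encoded & 0xFFFF);
}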

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2015-11-17 Thread Tom Lane
Vitaly Burovoy  writes:
> I suppose behavior of monotonic values (julian, century, decade,
> isoyear, millennium and year) should be the same as for epoch (which
> obviously also monotonic value).
> Proposed patch has that behavior: +/-infinity for epoch, julian,
> century, decade, isoyear, millennium and year; NULL for other fields.

Works for me.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Thom Brown
On 17 November 2015 at 15:43, Jim Nasby  wrote:
> On 11/17/15 4:41 AM, Thom Brown wrote:
>>
>> Could someone post a TL;DR summary of what the current plan looks
>> like?  I can see there is a huge amount of discussion to trawl back
>> through.  I can see it's something to do with the visibility map.  And
>> does it avoid freezing very large tables like the title originally
>> sought?
>
>
> Basically, it follows the same pattern that all-visible bits do, except
> instead of indicating a page is all-visible, the bit shows that all tuples
> on the page are frozen. That allows a scan_all vacuum to skip those pages.

So the visibility map is being repurposed?  And if a row on a frozen
page is modified, what happens to the visibility of all other rows on
that page, as the bit will be set back to 0?  I think I'm missing a
critical part of this functionality.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] better systemd integration

2015-11-17 Thread Tom Lane
Peter Eisentraut  writes:
> I have written a couple of patches to improve the integration of the
> postgres daemon with systemd.

Seems like a generally reasonable thing to do.  systemd is probably
not going away (unfortunately IMO, but there it is).

> The second patch improves integration with the system journal managed by
> systemd.  This is a facility that captures a daemon's standard output
> and error and records it in configurable places, including syslog.  The
> patch adds a new log_destination that is like stderr but marks up the
> output so that systemd knows the severity.  With that in place, users
> can choose to do away with the postgres log file management and let
> systemd do it.

One of the benefits of the log collector is that it's able to do something
reasonably sane with random stderr output that might be generated by
libraries linked into PG (starting with glibc...).  If someone sets things
up as you're suggesting, what will systemd do with unlabeled output lines?
Or in other words, how much value-add is there really from this markup?

Also, it looks like the markup is intended to be per-line, but as you've
coded this a possibly-multi-line error report will only have a prefix
on the first line.
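
For reference, sd-daemon.h provides per-line severity prefixes (SD_ERR is
"<3>", SD_WARNING is "<4>", and so on), so a multi-line report would need
the prefix repeated on every line, roughly like this hedged sketch (not the
patch's actual code):

#include <stdio.h>
#include <string.h>
#include <systemd/sd-daemon.h>  /* SD_ERR, SD_WARNING, ... */

/* Emit a possibly multi-line message with the prefix on every line. */
static void
write_journal_lines(const char *prefix, const char *msg)
{
    const char *line = msg;

    while (*line != '\0')
    {
        const char *eol = strchr(line, '\n');
        size_t      len = eol ? (size_t) (eol - line) : strlen(line);

        fprintf(stderr, "%s%.*s\n", prefix, (int) len, line);
        line += len + (eol ? 1 : 0);
    }
}

/* e.g. write_journal_lines(SD_ERR, "multi-line\nerror report"); */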

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] new full text search configurations

2015-11-17 Thread Pavel Stehule
Hi

2015-11-17 17:28 GMT+01:00 Oleg Bartunov :

> I checked new snowball site http://snowballstem.org/ and found several
> new stemmers appeared (as external contributions):
>
>
>- Irish and Czech 
>
> Czech snowball needs a recheck - 5 years ago it was not a success in my tests

Regards

Pavel



>
>- Object Pascal codegenerator for Snowball
>
>- Two stemmers for Romanian
>
>- Hungarian 
>- Turkish 
>- Armenian 
>- Basque (Euskera)
>
>- Catalan 
>
> Some of them we don't have in our list of default configurations. Since
> these are external, not official stemmers, it'd be nice if  people  look
> and test them. If they are fine, we can prepare new configurations for 9.6.
>
>  \dF
>            List of text search configurations
>    Schema   |    Name    |              Description
> ------------+------------+---------------------------------------
>  pg_catalog | danish     | configuration for danish language
>  pg_catalog | dutch      | configuration for dutch language
>  pg_catalog | english    | configuration for english language
>  pg_catalog | finnish    | configuration for finnish language
>  pg_catalog | french     | configuration for french language
>  pg_catalog | german     | configuration for german language
>  pg_catalog | hungarian  | configuration for hungarian language
>  pg_catalog | italian    | configuration for italian language
>  pg_catalog | norwegian  | configuration for norwegian language
>  pg_catalog | portuguese | configuration for portuguese language
>  pg_catalog | romanian   | configuration for romanian language
>  pg_catalog | russian    | configuration for russian language
>  pg_catalog | simple     | simple configuration
>  pg_catalog | spanish    | configuration for spanish language
>  pg_catalog | swedish    | configuration for swedish language
>  pg_catalog | turkish    | configuration for turkish language
>  public     | english_ns |
> (17 rows)
>


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Simon Riggs
On 17 November 2015 at 11:27, Amit Kapila  wrote:


> Attached patch group_update_clog_v1.patch
>>> implements this idea.
>>>
>>
>> I don't think we should be doing this only for transactions that don't
>> have subtransactions.
>>
>
> The reason for not doing this optimization for subtransactions is that we
> need to advertise the information that Group leader needs for updating
> the transaction status and if we want to do it for sub transactions, then
> all the subtransaction id's needs to be advertised.  Now here the tricky
> part is that number of subtransactions for which the status needs to
> be updated is dynamic, so reserving memory for it would be difficult.
> However, we can reserve some space in Proc like we do for XidCache
> (cache of sub transaction ids) and then use that to advertise that many
> Xid's at-a-time or just allow this optimization if number of
> subtransactions
> is lesser than or equal to the size of this new XidCache.  I am not sure
> if it is good idea to use the existing XidCache for this purpose in which
> case we need to have a separate space in PGProc for this purpose.  I
> don't see allocating space for 64 or so subxid's as a problem, however
> doing it for bigger number could be cause of concern.
>
>
>> We are trying to speed up real cases, not just benchmarks.
>>
>> So +1 for the concept, patch is going in right direction though lets do
>> the full press-up.
>>
>>
> I have mentioned above the reason for not doing it for sub transactions, if
> you think it is viable to reserve space in shared memory for this purpose,
> then
> I can include the optimization for subtransactions as well.
>

The number of subxids is unbounded, so as you say, reserving shmem isn't
viable.

I'm interested in real-world cases, so allocating 65 xids per process isn't
needed, but what we can say is that the optimization shouldn't break down
abruptly in the presence of a small/reasonable number of subtransactions.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2015-11-17 Thread Jim Nasby

On 11/17/15 2:09 AM, Vitaly Burovoy wrote:

I suppose behavior of monotonic values (julian, century, decade,
isoyear, millennium and year) should be the same as for epoch (which
obviously also monotonic value).
Proposed patch has that behavior: ±infinity for epoch, julian,
century, decade, isoyear, millennium and year; NULL for other fields.


What's the logic behind NULL here? Infinity is infinity, whether it's 
minutes or years. It's absolutely NOT the same thing as a NULL 
timestamp. I don't see why the normal constraint of minute < 60 should 
apply here; infinity isn't a normal number.


My specific fear is that now people will have to do a bunch of IF 
timestamp IS NOT NULL THEN ... to get the behavior they need.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question concerning XTM (eXtensible Transaction Manager API)

2015-11-17 Thread Kevin Grittner
On Tuesday, November 17, 2015 12:43 AM, konstantin knizhnik 
 wrote:
> On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:

>> If you are saying that DTM tries to roll back a transaction after
>> any participating server has entered the RecordTransactionCommit()
>> critical section, then IMO it is broken.  Full stop.  That can't
>> work with any reasonable semantics as far as I can see.
>
> DTM is not trying to roll back a committed transaction.
> What it tries to do is to hide this commit.
> As I already wrote, the idea was to implement "lightweight" 2PC
> because the prepared transactions mechanism in PostgreSQL adds too much
> overhead and causes some problems with recovery.

The point remains that there must be *some* "point of no return"
beyond which rollback (or "hiding") is not possible.  Until this
point, all heavyweight locks held by the transaction must be
maintained without interruption, data modification of the
transaction must not be visible, and any attempt to update or
delete data updated or deleted by the transaction must block or
throw an error.  It sounds like you are attempting to move the
point at which this "point of no return" is, but it isn't as clear
as I would like.  It seems like all participating nodes are
responsible for notifying the arbiter that they have completed, and
until then the arbiter gets involved in every visibility check,
overriding the "normal" value?

> The transaction is normally committed in xlog, so that it can
> always be recovered in case of node fault.
> But before setting correspondent bit(s) in CLOG and releasing
> locks we first contact arbiter to get global status of transaction.
> If it is successfully locally committed by all nodes, then
> arbiter approves commit and commit of transaction normally
> completed.
> Otherwise arbiter rejects commit. In this case DTM marks
> transaction as aborted in CLOG and returns error to the client.
> XLOG is not changed and in case of failure PostgreSQL will try to
> replay this transaction.
> But during recovery it also tries to restore transaction status
> in CLOG.
> And at this place DTM contacts arbiter to know status of
> transaction.
> If it is marked as aborted in arbiter's CLOG, then it will be
> also marked as aborted in local CLOG.
> And according to PostgreSQL visibility rules no other transaction
> will see changes made by this transaction.

If a node goes through crash and recovery after it has written its
commit information to xlog, how are its heavyweight locks, etc.,
maintained throughout?  For example, does each arbiter node have
the complete set of heavyweight locks?  (Basically, all the
information which can be written to files in pg_twophase must be
held somewhere by all arbiter nodes, and used where appropriate.)

If a participating node is lost after some other nodes have told
the arbiter that they have committed, and the lost node will never
be able to indicate that it is committed or rolled back, what is
the mechanism for resolving that?

>>> We can not just call elog(ERROR,...) in SetTransactionStatus
>>> implementation because inside critical section it cause Postgres
>>> crash with panic message. So we have to remember that transaction is
>>> rejected and report error later after exit from critical section:
>>
>> I don't believe that is a good plan.  You should not enter the
>> critical section for recording that a commit is complete until all
>> the work for the commit is done except for telling the all the
>> servers that all servers are ready.
>
> It is good point.
> May be it is the reason of performance scalability problems we
> have noticed with DTM.

Well, certainly the first phase of two-phase commit can take place
in parallel, and once that is complete then the second phase
(commit or rollback of all the participating prepared transactions)
can take place in parallel.  There is no need to serialize that.
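
As a plain libpq illustration of that shape (this is not DTM code;
connection setup, error handling and GID management are all simplified
assumptions), each phase can be issued to every node before waiting for any
of them:

#include <libpq-fe.h>

/* Send one command to every node, then collect all the results. */
static int
run_on_all_nodes(PGconn **conns, int nnodes, const char *sql)
{
    int         i;
    int         ok = 1;

    for (i = 0; i < nnodes; i++)
    {
        if (!PQsendQuery(conns[i], sql))
            ok = 0;
    }
    for (i = 0; i < nnodes; i++)
    {
        PGresult   *res;

        while ((res = PQgetResult(conns[i])) != NULL)
        {
            if (PQresultStatus(res) != PGRES_COMMAND_OK)
                ok = 0;
            PQclear(res);
        }
    }
    return ok;
}

/*
 * Phase 1 and phase 2, each parallel across nodes:
 *
 *   if (run_on_all_nodes(conns, nnodes, "PREPARE TRANSACTION 'gid1'"))
 *       run_on_all_nodes(conns, nnodes, "COMMIT PREPARED 'gid1'");
 *   else
 *       run_on_all_nodes(conns, nnodes, "ROLLBACK PREPARED 'gid1'");
 */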

> Sorry, some clarification.
> We get 10x slowdown of performance caused by 2pc on very heavy
> load on the IBM system with 256 cores.
> At "normal" servers slowdown of 2pc is smaller - about 2x.

That suggests some contention point, probably on spinlocks.  Were
you able to identify the particular hot spot(s)?


On Tuesday, November 17, 2015 3:09 AM, konstantin knizhnik 
 wrote:
> On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:

>> I think the general idea is that if Commit is WAL logged, then the
>> operation is considered to committed on local node and commit should
>> happen on any node, only once prepare from all nodes is successful.
>> And after that transaction is not supposed to abort.  But I think you are
>> trying to optimize the DTM in some way to not follow that kind of protocol.
>
> DTM is still following 2PC protocol:
> First the transaction is saved in WAL at all nodes, and only after that
> is the commit completed at all nodes.

So, essentially you are treating the traditional commit point as
phase 1 in a new approach to two-phase 

Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Jim Nasby

On 11/17/15 4:41 AM, Thom Brown wrote:

Could someone post a TL;DR summary of what the current plan looks
like?  I can see there is a huge amount of discussion to trawl back
through.  I can see it's something to do with the visibility map.  And
does it avoid freezing very large tables like the title originally
sought?


Basically, it follows the same pattern that all-visible bits do, except 
instead of indicating a page is all-visible, the bit shows that all 
tuples on the page are frozen. That allows a scan_all vacuum to skip 
those pages.
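
A rough sketch of how such a check could look in the vacuum scan loop; the
flag names and the byte of visibility-map bits are assumptions about the
eventual design, not a committed interface:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page bits, alongside the existing all-visible bit. */
#define MAP_ALL_VISIBLE 0x01
#define MAP_ALL_FROZEN  0x02

/*
 * An aggressive (scan_all) vacuum could skip a page only if it is already
 * all-frozen; a normal vacuum keeps skipping all-visible pages as today.
 */
static bool
vacuum_can_skip_page(uint8_t vmbits, bool aggressive)
{
    if (aggressive)
        return (vmbits & MAP_ALL_FROZEN) != 0;
    return (vmbits & MAP_ALL_VISIBLE) != 0;
}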

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] better systemd integration

2015-11-17 Thread Peter Eisentraut
I have written a couple of patches to improve the integration of the
postgres daemon with systemd.

The setup that is shipped with Red Hat- and Debian-family packages at
the moment is just an imitation of the old shell scripts, relying on
polling by pg_ctl for readiness, with various custom pieces of
complexity for handling custom port numbers and such.

In the first patch, my proposal is to use sd_notify() calls from
libsystemd to notify the systemd daemon directly when startup is
completed.  This is the recommended low-overhead solution that is now
being adopted by many other server packages.  It allows us to cut out
pg_ctl completely from the startup configuration and makes the startup
configuration manageable by non-wizards.  An example is included in the
patch.
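
The core of the first patch is essentially a readiness notification; a
minimal sketch of the call involved follows (the exact placement and naming
in postmaster.c are in the attached patch, this only shows the shape):

#ifdef USE_SYSTEMD
#include <systemd/sd-daemon.h>
#endif

/*
 * Tell systemd that startup has finished.  With a unit of Type=notify,
 * systemd waits for this message instead of having pg_ctl poll for
 * readiness.  It is a no-op when the server was not started by systemd.
 */
static void
report_ready_to_systemd(void)
{
#ifdef USE_SYSTEMD
    sd_notify(0, "READY=1");
#endif
}

/* Similarly, "STOPPING=1" can be sent when shutdown begins. */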

The second patch improves integration with the system journal managed by
systemd.  This is a facility that captures a daemon's standard output
and error and records it in configurable places, including syslog.  The
patch adds a new log_destination that is like stderr but marks up the
output so that systemd knows the severity.  With that in place, users
can choose to do away with the postgres log file management and let
systemd do it.

The third patch is technically unrelated but arose while I was working
on this.  It improves error reporting when the data directory is missing.
From c1725cbf51b79cfd78bc7f9358891599c4ffae6c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 17 Nov 2015 06:46:17 -0500
Subject: [PATCH 1/3] Add support for systemd service notifications

Insert sd_notify() calls at server start and stop for integration with
systemd.  This allows the use of systemd service units of type "notify",
which greatly simplifies the systemd configuration.
---
 configure   | 38 +
 configure.in|  9 +
 doc/src/sgml/installation.sgml  | 16 
 doc/src/sgml/runtime.sgml   | 24 +++
 src/Makefile.global.in  |  1 +
 src/backend/Makefile|  4 
 src/backend/postmaster/postmaster.c | 21 
 src/include/pg_config.h.in  |  3 +++
 8 files changed, 116 insertions(+)

diff --git a/configure b/configure
index b771a83..f9f93b3 100755
--- a/configure
+++ b/configure
@@ -709,6 +709,7 @@ with_libxml
 XML2_CONFIG
 UUID_EXTRA_OBJS
 with_uuid
+with_systemd
 with_selinux
 with_openssl
 krb_srvtab
@@ -830,6 +831,7 @@ with_ldap
 with_bonjour
 with_openssl
 with_selinux
+with_systemd
 with_readline
 with_libedit_preferred
 with_uuid
@@ -1518,6 +1520,7 @@ Optional Packages:
   --with-bonjour  build with Bonjour support
   --with-openssl  build with OpenSSL support
   --with-selinux  build with SELinux support
+  --with-systemd  build with systemd support
   --without-readline  do not use GNU Readline nor BSD Libedit for editing
   --with-libedit-preferred
   prefer BSD Libedit over GNU Readline
@@ -5695,6 +5698,41 @@ fi
 $as_echo "$with_selinux" >&6; }
 
 #
+# Systemd
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with systemd support" >&5
+$as_echo_n "checking whether to build with systemd support... " >&6; }
+
+
+
+# Check whether --with-systemd was given.
+if test "${with_systemd+set}" = set; then :
+  withval=$with_systemd;
+  case $withval in
+yes)
+
+$as_echo "#define USE_SYSTEMD 1" >>confdefs.h
+
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-systemd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_systemd=no
+
+fi
+
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_systemd" >&5
+$as_echo "$with_systemd" >&6; }
+
+#
 # Readline
 #
 
diff --git a/configure.in b/configure.in
index b5868b0..47c82cb 100644
--- a/configure.in
+++ b/configure.in
@@ -700,6 +700,15 @@ AC_SUBST(with_selinux)
 AC_MSG_RESULT([$with_selinux])
 
 #
+# Systemd
+#
+AC_MSG_CHECKING([whether to build with systemd support])
+PGAC_ARG_BOOL(with, systemd, no, [build with systemd support],
+  [AC_DEFINE([USE_SYSTEMD], 1, [Define to build with systemd support. (--with-systemd)])])
+AC_SUBST(with_systemd)
+AC_MSG_RESULT([$with_systemd])
+
+#
 # Readline
 #
 PGAC_ARG_BOOL(with, readline, yes,
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 8964834..7b6a389 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -813,6 +813,22 @@ Configuration
   
 
   
+   --with-systemd
+   
+
+ Build with support
+ for systemdsystemd
+ service notifications.  This improves integration if the server binary
+ is started under systemd but has no impact
+ otherwise; see  for more
+ information.  libsystemd and the
+ associated header files need to be installed to be able to use this
+ 

Re: [HACKERS] Bug in numeric multiplication

2015-11-17 Thread Tom Lane
Dean Rasheed  writes:
> I just noticed that div_var_fast() has almost identical code, and so
> in principle it has the same vulnerability, although it obviously only
> affects the transcendental functions.
> I don't actually have a test case that triggers it, but it's basically
> the same algorithm, so logically it needs the same additional headroom
> to avoid a possible overflow.

Hm, good point.  I don't feel a compulsion to have a test case that
proves it's broken before we fix it.  Do you want to send a patch?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tab completion for ALTER COLUMN SET STATISTICS

2015-11-17 Thread Jeff Janes
On Mon, Sep 28, 2015 at 8:48 AM, Robert Haas  wrote:
> On Sat, Sep 26, 2015 at 7:24 AM, Michael Paquier
>  wrote:
>> On Sat, Sep 26, 2015 at 7:18 AM, Jeff Janes  wrote:
>>> If I have "alter table foo alter COLUMN bar SET STATISTICS" in the line
>>> buffer,
>>> it tab completes to add " TO", which is not legal.
>>>
>>> The attached patch makes it not tab complete anything at all, which is at
>>> least not actively misleading.
>>> I thought of having it complete "-1", "" so that it gives a clue
>>> about what is needed, but I didn't see any precedent for non-literal
>>> clue-giving and I did not want to try to create a new precedent.
>>
>> +1 for the way you are doing it in your patch.
>
> Before we take that approach, can I back up and ask whether we
> shouldn't instead narrow the rule that's inserting TO?  I think that
> completion is coming from here:
>
> else if (pg_strcasecmp(prev2_wd, "SET") == 0 &&
>  pg_strcasecmp(prev4_wd, "UPDATE") != 0 &&
>  pg_strcasecmp(prev_wd, "TABLESPACE") != 0 &&
>  pg_strcasecmp(prev_wd, "SCHEMA") != 0 &&
>  prev_wd[strlen(prev_wd) - 1] != ')' &&
>  prev_wd[strlen(prev_wd) - 1] != '=' &&
>  pg_strcasecmp(prev4_wd, "DOMAIN") != 0)
> COMPLETE_WITH_CONST("TO");
>
> Now, that is basically an incredibly broad production: every time the
> second-most recent word is SET, complete with TO.  It looks to me like
> this has already been patched around multiple times to make it
> slightly narrower.  Those exceptions were added by three different
> commits, in 2004, 2010, and 2012.
>
> Maybe it's time to back up and make the whole thing a lot narrower.
> Like, if second-most-recent word of the command is also the FIRST word
> of the command, this is probably right.  And there may be a few other
> situations where it's right.  But in general, SET is used in lots of
> places and the fact that we've seen one recently isn't enough to
> decide on TO.

There are quite a few places where TO is legitimately the 2nd word
after SET.  I don't know how to do an exhaustive survey of them, but
here are the ones I know about:

SET <parameter> TO
ALTER DATABASE <name> SET <parameter> TO
ALTER ROLE <name> SET <parameter> TO
ALTER USER <name> SET <parameter> TO
ALTER FUNCTION <name> ( <arguments> ) SET <parameter> TO

I don't know if coding the non-TO cases as exceptions is easier or
harder than coding the TO cases more tightly.

In the case of "SET SCHEMA", it seems like we might be able to just
move that case higher in the giant list of 'else if' so that it
triggers before getting to the generic SET  code.  But I can't
figure out where in the file that code is, to see if it might be
moved.  And I'm not sure that intentionally relying on order more than we
already do is a better strategy; it seems kind of fragile.

In any case, I'd rather not try re-factor this part of the code while
there is another large refactoring pending.  But I think an
incremental improvement would be acceptable.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-11-17 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada  
wrote in 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Amit Kapila
On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs  wrote:

> On 17 November 2015 at 11:27, Amit Kapila  wrote:
>
> We are trying to speed up real cases, not just benchmarks.
>>>
>>> So +1 for the concept, patch is going in right direction though lets do
>>> the full press-up.
>>>
>>>
>> I have mentioned above the reason for not doing it for sub transactions,
>> if
>> you think it is viable to reserve space in shared memory for this
>> purpose, then
>> I can include the optimization for subtransactions as well.
>>
>
> The number of subxids is unbounded, so as you say, reserving shmem isn't
> viable.
>
> I'm interested in real world cases, so allocating 65 xids per process
> isn't needed, but what we can say is that the optimization shouldn't break down
> abruptly in the presence of a small/reasonable number of subtransactions.
>
>
I think in that case what we can do is: if the total number of
subtransactions is less than or equal to 64 (we can find that via the
overflowed flag in PGXact), then apply this optimisation, else use
the existing flow to update the transaction status.  I think for that we
don't even need to reserve any additional memory.  Does that sound
sensible to you?
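
To illustrate, here is a minimal standalone sketch of the gating I have in
mind; the PGXACT field names (overflowed, nxids) and the threshold of 64 are
assumptions for illustration only, not the final patch:

#include <stdbool.h>
#include <stdio.h>

/* mock of the two PGXACT fields the check would rely on */
typedef struct MockPGXact
{
    bool        overflowed;     /* did the subxid cache overflow? */
    int         nxids;          /* number of cached subtransaction xids */
} MockPGXact;

/* true if the proposed group-update optimisation may be applied */
static bool
can_use_group_update(const MockPGXact *pgxact)
{
    return !pgxact->overflowed && pgxact->nxids <= 64;
}

int
main(void)
{
    MockPGXact few  = { false, 10 };
    MockPGXact many = { true, 0 };  /* overflowed: exact count unknown */

    printf("few subxacts: %s\n",
           can_use_group_update(&few) ? "group update" : "existing flow");
    printf("overflowed:   %s\n",
           can_use_group_update(&many) ? "group update" : "existing flow");
    return 0;
}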


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Simon Riggs
On 17 November 2015 at 11:48, Amit Kapila  wrote:

> On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs 
> wrote:
>
>> On 17 November 2015 at 11:27, Amit Kapila 
>> wrote:
>>
>> We are trying to speed up real cases, not just benchmarks.

 So +1 for the concept, patch is going in right direction though lets do
 the full press-up.


>>> I have mentioned above the reason for not doing it for sub transactions,
>>> if
>>> you think it is viable to reserve space in shared memory for this
>>> purpose, then
>>> I can include the optimization for subtransactions as well.
>>>
>>
>> The number of subxids is unbounded, so as you say, reserving shmem isn't
>> viable.
>>
>> I'm interested in real world cases, so allocating 65 xids per process
>> isn't needed, but what we can say is that the optimization shouldn't break down
>> abruptly in the presence of a small/reasonable number of subtransactions.
>>
>>
> I think in that case what we can do is: if the total number of
> subtransactions is less than or equal to 64 (we can find that via the
> overflowed flag in PGXact), then apply this optimisation, else use
> the existing flow to update the transaction status.  I think for that we
> don't even need to reserve any additional memory.  Does that sound
> sensible to you?
>

I understand you to mean that the leader should look backwards through the
queue collecting xids while !(PGXACT->overflowed)

No additional shmem is required

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Robert Haas
On Thu, Nov 12, 2015 at 12:54 AM, Etsuro Fujita
 wrote:
> Really?  I think there would be not a little burden on an FDW author; when
> postgres_fdw delegates to the subplan to the remote server, for example, it
> would need to create a remote join query by looking at tuples possibly
> fetched and stored in estate->es_epqTuple[], send the query and receive the
> result during the callback routine.  Furthermore, what I'm most concerned
> about is that wouldn't be efficient.  So, my question about that approach is
> whether FDWs really do some thing like that during the callback routine,
> instead of performing a secondary join plan locally.  As I said before, I
> know that KaiGai-san considers that that approach would be useful for custom
> joins.  But I see zero evidence that there is a good use-case for an FDW.

It could do that.  But it could also just invoke a subplan as you are
proposing.  Or at least, I think we should set it up so that such a
thing is possible.  In which case I don't see the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 6:52 AM, Bert  wrote:
> edit: maybe this is more useful? :)

Definitely.  But if you've built with --enable-debug and not stripped
the resulting executable, we ought to get line numbers as well, plus
the arguments to each function on the stack.  That would help a lot
more.  The only things that get dereferenced in that function are
"scan" and "parallel_scan", so it's a good bet that one of those
pointers is pointing off into never-never land.  I can't immediately
guess how that's happening, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor comment improvement to create_foreignscan_plan

2015-11-17 Thread Robert Haas
On Sun, Nov 15, 2015 at 9:25 PM, Etsuro Fujita
 wrote:
> Oops, I've found another one.  I think we should update a comment in
> postgresGetForeignPlan, too; add remote filtering expressions to the list of
> information needed to create a ForeignScan node.

Instead of saying "remote/local", how about saying "remote and local"
or just leaving it out altogether as in the attached?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index cd4ed0c..a6ba672 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -902,8 +902,8 @@ postgresGetForeignPlan(PlannerInfo *root,
 			 retrieved_attrs);
 
 	/*
-	 * Create the ForeignScan node from target list, local filtering
-	 * expressions, remote parameter expressions, and FDW private information.
+	 * Create the ForeignScan node from target list, filtering expressions,
+	 * remote parameter expressions, and FDW private information.
 	 *
 	 * Note that the remote parameter expressions are stored in the fdw_exprs
 	 * field of the finished plan node; we can't keep them in private state

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect patch, for showing tuple data

2015-11-17 Thread Nikolay Shaplov

> >> I still have an opinion that documentation should be more verbose, than
> >> your version, but I can accept your version.
> > 
> > I am not sure that's necessary, pageinspect is for developers.
> > 
> >> Who is going to add heap_page_item_attrs to your patch? me or you?
> > 
> > I recall some parts of the code I still did not like much :) I'll grab
> > some room to have an extra look at it.
> 
> I have added back heap_page_item_attrs the patch attached, fixing at
> the same time some typos found while looking again at the code. If you
> are fine with this version, let's have a committer look at it.

Everything seems to be OK. I've changed only one thing in your version
of the patch: I've renamed split_tuple_data to
tuple_data_split_internal, because when there are split_tuple_data and
tuple_data_split in the same file, nobody will guess what the difference
between these functions is or why they are named in such a strange way.

If it is OK, set the proper status, and let the committer do his job :-)



-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index aec5258..91ab119 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -5,9 +5,9 @@ OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
 		  brinfuncs.o ginfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.3.sql pageinspect--1.2--1.3.sql \
-	pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \
-	pageinspect--unpackaged--1.0.sql
+DATA = pageinspect--1.4.sql pageinspect--1.3--1.4.sql \
+	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
+	pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
 ifdef USE_PGXS
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..546a051 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -27,8 +27,11 @@
 
 #include "access/htup_details.h"
 #include "funcapi.h"
-#include "utils/builtins.h"
+#include "catalog/pg_type.h"
 #include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
 
 
 /*
@@ -55,6 +58,42 @@ bits_to_text(bits8 *bits, int len)
 
 
 /*
+ * text_to_bits
+ *
+ * Converts a c-string representation of bits into a bits8-array. This is
+ * the reverse operation of previous routine.
+ */
+static bits8 *
+text_to_bits(char *str, int len)
+{
+	bits8	   *bits;
+	int			off = 0;
+	char		byte = 0;
+
+	bits = palloc(len + 1);
+
+	while (off < len)
+	{
+		if (off % 8 == 0)
+			byte = 0;
+
+		if ((str[off] == '0') || (str[off] == '1'))
+			byte = byte | ((str[off] - '0') << off % 8);
+		else
+			ereport(ERROR,
+	(errcode(ERRCODE_DATA_CORRUPTED),
+	 errmsg("illegal character '%c' in t_bits string", str[off])));
+
+		if (off % 8 == 7)
+			bits[off / 8] = byte;
+
+		off++;
+	}
+
+	return bits;
+}
+
+/*
  * heap_page_items
  *
  * Allows inspection of line pointers and tuple headers of a heap page.
@@ -122,8 +161,8 @@ heap_page_items(PG_FUNCTION_ARGS)
 		HeapTuple	resultTuple;
 		Datum		result;
 		ItemId		id;
-		Datum		values[13];
-		bool		nulls[13];
+		Datum		values[14];
+		bool		nulls[14];
 		uint16		lp_offset;
 		uint16		lp_flags;
 		uint16		lp_len;
@@ -154,7 +193,8 @@ heap_page_items(PG_FUNCTION_ARGS)
 			lp_offset + lp_len <= raw_page_size)
 		{
 			HeapTupleHeader tuphdr;
-			int			bits_len;
+			bytea		*tuple_data_bytea;
+			int			tuple_data_len;
 
 			/* Extract information from the tuple header */
 
@@ -168,6 +208,14 @@ heap_page_items(PG_FUNCTION_ARGS)
 			values[9] = UInt32GetDatum(tuphdr->t_infomask);
 			values[10] = UInt8GetDatum(tuphdr->t_hoff);
 
+			/* Copy raw tuple data into bytea attribute */
+			tuple_data_len = lp_len - tuphdr->t_hoff;
+			tuple_data_bytea = (bytea *) palloc(tuple_data_len + VARHDRSZ);
+			SET_VARSIZE(tuple_data_bytea, tuple_data_len + VARHDRSZ);
+			memcpy(VARDATA(tuple_data_bytea), (char *) tuphdr + tuphdr->t_hoff,
+	 tuple_data_len);
+			values[13] = PointerGetDatum(tuple_data_bytea);
+
 			/*
 			 * We already checked that the item is completely within the raw
 			 * page passed to us, with the length given in the line pointer.
@@ -180,11 +228,11 @@ heap_page_items(PG_FUNCTION_ARGS)
 			{
 if (tuphdr->t_infomask & HEAP_HASNULL)
 {
-	bits_len = tuphdr->t_hoff -
-		offsetof(HeapTupleHeaderData, t_bits);
+	int	bits_len =
+			((tuphdr->t_infomask2 & HEAP_NATTS_MASK) / 8 + 1) * 8;
 
 	values[11] = CStringGetTextDatum(
- bits_to_text(tuphdr->t_bits, bits_len * 8));
+ bits_to_text(tuphdr->t_bits, bits_len));
 }
 else
 	nulls[11] = true;
@@ -208,7 +256,7 @@ heap_page_items(PG_FUNCTION_ARGS)
 			 */
 			int			i;
 
-			for (i = 4; i <= 12; i++)
+			for (i = 4; i <= 13; i++)
 nulls[i] = true;
 		}
 
@@ -223,3 +271,205 @@ heap_page_items(PG_FUNCTION_ARGS)
 	else
 		

Re: [HACKERS] eXtensible Transaction Manager API

2015-11-17 Thread Robert Haas
On Fri, Nov 13, 2015 at 8:35 AM, Michael Paquier
 wrote:
> As well as there could be FS, OS, network problems... To come back to
> the point, my point is simply that I found surprising the sentence of
> Konstantin upthread saying that if commit fails on some of the nodes
> we should rollback the prepared transaction on all nodes. In the
> example given, in the phase after calling dtm_end_prepare, say we
> perform COMMIT PREPARED correctly on node 1, but then failed it on
> node 2 because a meteor has hit a server, it seems that we cannot
> rollback, instead we had better rolling in a backup and be sure that
> the transaction gets committed. How would you rollback the transaction
> already committed on node 1? But perhaps I missed something...

Right.  In that case, we have to try to eventually get it committed everywhere.

One thing that's a bit confusing about this XTM interface is what
"COMMIT" actually means.  The idea is that on the standby server we
will call some DTM-provided function and pass it a token.  Then we
will begin and commit a transaction.  But presumably the commit does
not actually commit, because if it's a single transaction on all nodes
then the commit can't be completed until all work is done on all nodes.
So my guess is that the COMMIT here is intended to behave more like a
PREPARE, but this is not made explicit.

>> One point I'd like to mention is that it's absolutely critical to
>> design this in a way that minimizes network roundtrips without
>> compromising correctness.  XC's GTM proxy suggests that they failed to
>> do that.  I think we really need to look at what's going to be on the
>> other sides of the proposed APIs and think about whether it's going to
>> be possible to have a strong local caching layer that keeps network
>> roundtrips to a minimum.  We should consider whether the need for such
>> a caching layer has any impact on what the APIs should look like.
>
> At this time, the number of round trips needed particularly for READ
> COMMITTED transactions that need a new snapshot for each query was
> really a performance killer. We used DBT-1 (TPC-W) which is less
> OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
> was quickly reached with 10-20 nodes..

Yeah.  I think this merits a good bit of thought.  Superficially, at
least, it seems that every time you need a snapshot - which in the
case of READ COMMITTED is for every SQL statement - you need a network
roundtrip to the snapshot server.  If multiple backends request a
snapshot in very quick succession, you might be able to do a sort of
"group commit" thing where you send a single request to the server and
they all use the resulting snapshot, but it seems hard to get very far
with such optimization.  For example, if backend 1 sends a snapshot
request and backend 2 then realizes that it also needs a snapshot, it
can't just wait for the reply from backend 1 and use that one.  The
user might have committed a transaction someplace else and then kicked
off a transaction on backend 2 afterward, expecting it to see the work
committed earlier.  But the snapshot returned to backend 1 might have
been taken before that.  So, all in all, this seems rather crippling.

Things are better if the system has a single coordinator node that is
also the arbiter of commits and snapshots.  Then, it can always take a
snapshot locally with no network roundtrip, and when it reaches out to
a shard, it can pass along the snapshot information with the SQL query
(or query plan) it has to send anyway.  But then the single
coordinator seems like it becomes a bottleneck.  As soon as you have
multiple coordinators, one of them has got to be the arbiter of global
ordering, and now all of the other coordinators have to talk to it
constantly.

Maybe I'm thinking of this too narrowly by talking about snapshots;
perhaps there are other ways of ensuring whatever level of transaction
isolation we want to have here.  But I'm not sure it matters that much
- I don't see any way for the sees-the-effects-of relation on the set
of all transactions to be a total ordering without some kind of
central arbiter of the commit ordering.  Except for
perfectly-synchronized timestamps, but I don't think that's really
physically possible anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Robert Haas
On Mon, Nov 16, 2015 at 9:49 PM, Amit Kapila  wrote:
>> I don't understand this part.
>>
>
> The code in question is as below:
>
> tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
> {
> ..
>     if (tqueue->tupledesc != tupledesc ||
>         tqueue->remapinfo->natts != tupledesc->natts)
>     {
>         if (tqueue->remapinfo != NULL)
>             pfree(tqueue->remapinfo);
>         tqueue->remapinfo = BuildRemapInfo(tupledesc);
>     }
> ..
> }
>
> Here the above check always passes as tqueue->tupledesc is not
> set due to which it always try to build remap info.  Is there any reason
> for doing so?

Groan.  The problem here is that tqueue->tupledesc never gets set.  I
think this should be fixed as in the attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 5735acf..d625b0d 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -127,15 +127,15 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 	 * new tupledesc.  This is a strange test both because the executor really
 	 * shouldn't change the tupledesc, and also because it would be unsafe if
 	 * the old tupledesc could be freed and a new one allocated at the same
-	 * address.  But since some very old code in printtup.c uses this test, we
-	 * adopt it here as well.
+	 * address.  But since some very old code in printtup.c uses a similar
+	 * test, we adopt it here as well.
 	 */
-	if (tqueue->tupledesc != tupledesc ||
-		tqueue->remapinfo->natts != tupledesc->natts)
+	if (tqueue->tupledesc != tupledesc)
 	{
 		if (tqueue->remapinfo != NULL)
 			pfree(tqueue->remapinfo);
 		tqueue->remapinfo = BuildRemapInfo(tupledesc);
+		tqueue->tupledesc = tupledesc;
 	}
 
 	tuple = ExecMaterializeSlot(slot);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-17 Thread and...@anarazel.de
On 2015-11-17 14:14:50 +0300, Ildus Kurbangaliev wrote:
> 1) We can avoid constants, and use a standard steps for tranches
> creation.

The constants are actually a bit useful, to easily determine
builtin/non-builtin stuff.

> 2) We have only one global variable (BufferCtl)

Don't see the advantage. This adds another, albeit predictable,
indirection in frequent callsites. There's no architectural advantage in
avoiding these.

> 3) Tranches moved to shared memory, so we won't need to do
> an additional work here.

I can see some benefit, but it also doesn't seem like a huge benefit.


> 4) Also we can kept buffer locks from MainLWLockArray there (that was done
> in attached patch).

That's the 'blockLocks' thing? That's a pretty bad name. These are
buffer mapping locks. And this seems like a separate patch, separately
debated.

> + if (!foundCtl)
>   {
> - int i;
> + /* Align descriptors to a cacheline boundary. */
> + ctl->base.bufferDescriptors = (void *) 
> CACHELINEALIGN(ShmemAlloc(
> + NBuffers * sizeof(BufferDescPadded) + 
> PG_CACHE_LINE_SIZE));
> +
> + ctl->base.bufferBlocks = (char *) ShmemAlloc(NBuffers * (Size) 
> BLCKSZ);

I'm going to entirely veto this.

> + strncpy(ctl->IOLWLockTrancheName, "buffer_io",
> + BUFMGR_MAX_NAME_LENGTH);
> + strncpy(ctl->contentLWLockTrancheName, "buffer_content",
> + BUFMGR_MAX_NAME_LENGTH);
> + strncpy(ctl->blockLWLockTrancheName, "buffer_blocks",
> + BUFMGR_MAX_NAME_LENGTH);
> +
> + ctl->IOLocks = (LWLockMinimallyPadded *) ShmemAlloc(
> + sizeof(LWLockMinimallyPadded) * NBuffers);

This should be cacheline aligned.

> - buf->io_in_progress_lock = LWLockAssign();
> - buf->content_lock = LWLockAssign();
> + LWLockInitialize(BufferDescriptorGetContentLock(buf),
> + ctl->contentLWLockTrancheId);
> + LWLockInitialize(&ctl->IOLocks[i].lock,
> + ctl->IOLWLockTrancheId);

Seems weird to use the macro accessing content locks, but not IO locks.

>  /* Note: these two macros only work on shared buffers, not local ones! */
> -#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) 
> (bufHdr)->buf_id) * BLCKSZ))
> +#define BufHdrGetBlock(bufHdr)   ((Block) (BufferCtl->bufferBlocks + 
> ((Size) (bufHdr)->buf_id) * BLCKSZ))

That's the additional indirection I'm talking about.

> @@ -353,9 +353,6 @@ NumLWLocks(void)
>   /* Predefined LWLocks */
>   numLocks = NUM_FIXED_LWLOCKS;
>  
> - /* bufmgr.c needs two for each shared buffer */
> - numLocks += 2 * NBuffers;
> -
>   /* proc.c needs one for each backend or auxiliary process */
>   numLocks += MaxBackends + NUM_AUXILIARY_PROCS;

Didn't you also move the buffer mapping locks... ?

> diff --git a/src/include/storage/buf_internals.h 
> b/src/include/storage/buf_internals.h
> index 521ee1c..e1795dc 100644
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -95,6 +95,7 @@ typedef struct buftag
>   (a).forkNum == (b).forkNum \
>  )
>  
> +
>  /*

unrelated change.

>   * The shared buffer mapping table is partitioned to reduce contention.
>   * To determine which partition lock a given tag requires, compute the tag's
> @@ -104,10 +105,9 @@ typedef struct buftag
>  #define BufTableHashPartition(hashcode) \
>   ((hashcode) % NUM_BUFFER_PARTITIONS)
>  #define BufMappingPartitionLock(hashcode) \
> - (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
> - BufTableHashPartition(hashcode)].lock)
> + (&((BufferCtlData 
> *)BufferCtl)->blockLocks[BufTableHashPartition(hashcode)].lock)
>  #define BufMappingPartitionLockByIndex(i) \
> - (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)
> + (&((BufferCtlData *)BufferCtl)->blockLocks[i].lock)

Again, unnecessary indirections.

> +/* Maximum length of a bufmgr lock name */
> +#define BUFMGR_MAX_NAME_LENGTH   32

If we were to do this I'd just use NAMEDATALEN.

> +/*
> + * Base control struct for the buffer manager data
> + * Located in shared memory.
> + *
> + * BufferCtl variable points to BufferCtlBase because of only
> + * the backend code knows about BufferDescPadded, LWLock and
> + * others structures and the same time it must be usable in
> + * the frontend code.
> + */
> +typedef struct BufferCtlData
> +{
> + BufferCtlBase base;

Eeek. What's the point here?


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Robert Haas
On Sun, Nov 8, 2015 at 7:26 PM, Kouhei Kaigai  wrote:
> The attached patch is an adjusted version of the previous one.
> Even though it co-exists a new callback and fdw_recheck_quals,
> the callback is kicked first as follows.

This seems excessive to me: why would we need an arbitrary-length list
of plans for an FDW?  I think we should just allow an outer child and
an inner child, which is probably one more than we'll ever need in
practice.

This looks like an independent bug fix:

+   fscan->fdw_recheck_quals = (List *)
+   fix_upper_expr(root,
+  (Node *)
fscan->fdw_recheck_quals,
+  itlist,
+  INDEX_VAR,
+  rtoffset);
pfree(itlist);

If so, it should be committed separately and back-patched to 9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-11-17 Thread Peter Geoghegan
On Tue, Nov 17, 2015 at 7:54 AM, Tom Lane  wrote:
> I think this might do the wrong thing with block numbers above 0x8000
> and/or offset numbers above 0x8000.  I'd be more comfortable about it if
> +   encoded = ((int64) block << 16) | offset;
> were
> +   encoded = ((uint64) block << 16) | (uint16) offset;
> so that there's no risk of the compiler deciding to sign-extend rather
> than zero-fill either input.

I don't have a problem with your alternative, but I don't see any risk
with the original. It is recommended by various coding standards to
only use bitwise operators on unsigned operands, so that's a good
enough reason, I suppose.
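
For what it's worth, the hazard is easy to demonstrate in a standalone
program once an operand is signed; the snippet below uses a signed 16-bit
offset purely for illustration (the real OffsetNumber is unsigned):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t block  = 0x80000001;        /* block number with the high bit set */
    int16_t  offset = (int16_t) 0x8001;  /* signed offset, for illustration only */

    /* unguarded form: offset is promoted and sign-extended, smearing 1-bits
     * across the high 48 bits of the result */
    int64_t  bad  = ((int64_t) block << 16) | offset;

    /* defensive form: force zero-extension of both operands */
    uint64_t good = ((uint64_t) block << 16) | (uint16_t) offset;

    printf("bad  = %016" PRIx64 "\n", (uint64_t) bad);  /* ffffffffffff8001 */
    printf("good = %016" PRIx64 "\n", good);            /* 0000800000018001 */
    return 0;
}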

> Also, I think it'd be a good idea to explicitly set indexcursor = NULL
> in the tuplesort_empty case; the previous coding was effectively doing
> that.  It's true that the code shouldn't attempt to touch the value,
> but it's better not to have dangling pointers lying around.

The code started that way, but I removed the "indexcursor = NULL"
because the previous coding was *not* doing that. tuplesort_getdatum()
was not setting the passed Datum pointer (which points to indexcursor
here) in the event of returning false. Maybe I should have left in the
code setting indexcursor = NULL all the same.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Skip ALTER x SET SCHEMA if the schema didn't change

2015-11-17 Thread Robert Haas
On Mon, Nov 16, 2015 at 4:27 AM, Marti Raudsepp  wrote:
> Thank you so much for the review and patch update. I should have done that
> myself, but I've been really busy for the last few weeks. :(

Maybe I'm having an attack of the stupids today, but the changes to
pg_constraint.c look awfully strange to me.  In
the old code, if object_address_present() returns true, we continue,
skipping the rest of the loop.  In the new code, we instead set
alreadyChanged to true.  That causes both of the following if
statements, as revised, to fall out, so that we skip the rest of the
loop.  Huh?  Wouldn't a one line change to add oldNspId != newNspId to
the criteria for a simple_heap_update be just as good?

Backing up a bit, maybe we should be a bit more vigorous in treating a
same-namespace move as a no-op.  That is, don't worry about calling
the post-alter hook in that case - just have AlterConstraintNamespaces
start by checking whether oldNspId == newNspid right at the top; if
so, return.  The patch seems to have the idea that it is important to
call the post-alter hook even in that case, but I'm not sure whether
that's true.  I'm not sure it's false, but I'm also not sure it's
true.
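
Just to sketch the shape of that early-return idea (illustrative names only,
not the real pg_constraint.c signature):

#include <stdio.h>

typedef unsigned int Oid;           /* stand-in for the real typedef */

/* returns 1 if anything was moved; the real function also takes the owning
 * object, an isType flag and the list of already-moved objects */
static int
mock_alter_constraint_namespaces(Oid oldNspId, Oid newNspId)
{
    if (oldNspId == newNspId)
        return 0;                   /* same-namespace move: nothing to do */

    /* ... scan pg_constraint, update rows, invoke the post-alter hook ... */
    return 1;
}

int
main(void)
{
    printf("moved: %d\n", mock_alter_constraint_namespaces(2200, 2200));   /* 0 */
    printf("moved: %d\n", mock_alter_constraint_namespaces(2200, 16384));  /* 1 */
    return 0;
}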

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Cause TestLib.pm to define $windows_os in all branches.

2015-11-17 Thread Andrew Dunstan



On 11/17/2015 02:15 PM, Tom Lane wrote:

Michael Paquier  writes:

On Tue, Oct 13, 2015 at 10:17 PM, Andrew Dunstan wrote:

In general I think we can be a good deal more liberal about backpatching the
testing regime than we are with production code, where we are always
cautious, and the caution has paid big dividends in our reputation for
stability.

Attached are patches for 9.4 and 9.5 syncing up everything with
master. I tested both of them on OSX, Linux and Windows (MSVC only
though using vcregress tapcheck).

I'm not in a position to double-check that these patches work on Windows,
but I reviewed them through the expedient of diff'ing the patched files
against HEAD.  The only problem I found was you'd left out the
documentation addition about needing IPC::Run in install-windows.sgml.
So I've pushed them with that fix.

Where do we now stand on invoking the TAP tests in the buildfarm?





This has fallen off my pile a bit :-( I will try to take a good look 
over the next 48 hours.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in numeric multiplication

2015-11-17 Thread Tom Lane
I wrote:
> I pushed this, but while looking at it, my attention was drawn to this
> bit down near the end of the loop:

> /*
>  * The dividend digit we are about to replace might still be nonzero.
>  * Fold it into the next digit position.  We don't need to worry about
>  * overflow here since this should nearly cancel with the subtraction
>  * of the divisor.
>  */
> div[qi + 1] += div[qi] * NBASE;

> In the first place, I'm not feeling especially convinced by that handwave
> about there being no risk of overflow.  In the second place, this is
> indisputably failing to consider whether maxdiv might need to increase.
> If I add
> Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1)));
> right after this, the regression tests blow up.  So it looks to me like
> we've got some more work to do.

After thinking some more about what this is doing, it seems to me that
we could avoid changing div[qi + 1] at all here, and instead deal with
leftover dividend digits by shoving them into the floating-point part of
the calculation.  All that we really need to have happen as a consequence
of the above is to inject an additional "div[qi] * NBASE" term into future
calculations of fdividend.  So we can do that out-of-band and avoid
mucking up the maxdiv invariant.

Also, I realized that this bit:

/*
 * All the div[] digits except possibly div[qi] are now in the
 * range 0..NBASE-1.
 */
maxdiv = Abs(newdig) / (NBASE - 1);
maxdiv = Max(maxdiv, 1);

is unduly conservative, because, while indeed div[qi] might be out of
range, there is no need to consider it in maxdiv anymore, because the new
maxdiv value only determines what will happen in future loop iterations
where the current div[qi] will be irrelevant.  So we should just reset
maxdiv to 1 unconditionally here.

So that led me to the attached patch, which passes regression tests fine.
I'm not sure that it's provably okay though.  The loose ends are to show
that div[qi] can't overflow an int during the divisor-subtraction step
and that "outercarry" remains bounded.   Testing suggests that outercarry
can't exceed INT_MAX/NBASE, but I don't see how to prove that.

regards, tom lane

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index dcdc5cf..03a646b 100644
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*** div_var_fast(NumericVar *var1, NumericVa
*** 6186,6192 
  	double		fdividend,
  fdivisor,
  fdivisorinverse,
! fquotient;
  	int			qi;
  	int			i;
  
--- 6186,6193 
  	double		fdividend,
  fdivisor,
  fdivisorinverse,
! fquotient,
! outercarry;
  	int			qi;
  	int			i;
  
*** div_var_fast(NumericVar *var1, NumericVa
*** 6274,6289 
  	 * To avoid overflow in maxdiv itself, it represents the max absolute
  	 * value divided by NBASE-1, ie, at the top of the loop it is known that
  	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
  	 */
  	maxdiv = 1;
  
  	/*
! 	 * Outer loop computes next quotient digit, which will go into div[qi]
  	 */
  	for (qi = 0; qi < div_ndigits; qi++)
  	{
  		/* Approximate the current dividend value */
! 		fdividend = (double) div[qi];
  		for (i = 1; i < 4; i++)
  		{
  			fdividend *= NBASE;
--- 6275,6297 
  	 * To avoid overflow in maxdiv itself, it represents the max absolute
  	 * value divided by NBASE-1, ie, at the top of the loop it is known that
  	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+ 	 *
+ 	 * Note that maxdiv only includes digit positions that are still part of
+ 	 * the dividend, ie, strictly speaking the above holds only for div[i]
+ 	 * where i >= qi, at the top of the loop.
  	 */
  	maxdiv = 1;
  
  	/*
! 	 * Outer loop computes next quotient digit, which will go into div[qi].
! 	 * "outercarry" represents high-order dividend digits carried across
! 	 * iterations.
  	 */
+ 	outercarry = 0;
  	for (qi = 0; qi < div_ndigits; qi++)
  	{
  		/* Approximate the current dividend value */
! 		fdividend = outercarry * NBASE + (double) div[qi];
  		for (i = 1; i < 4; i++)
  		{
  			fdividend *= NBASE;
*** div_var_fast(NumericVar *var1, NumericVa
*** 6320,6340 
  		carry = 0;
  	div[i] = newdig;
  }
! newdig = div[qi] + carry;
! div[qi] = newdig;
  
  /*
   * All the div[] digits except possibly div[qi] are now in the
!  * range 0..NBASE-1.
   */
! maxdiv = Abs(newdig) / (NBASE - 1);
! maxdiv = Max(maxdiv, 1);
  
  /*
   * Recompute the quotient digit since new info may have
   * propagated into the top four dividend digits
   */
! fdividend = (double) div[qi];
  for (i = 1; i < 4; i++)
  {
  	fdividend *= NBASE;
--- 6328,6347 
  		carry = 0;
  	div[i] = newdig;
  }
! 

Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Kouhei Kaigai
> On Sun, Nov 8, 2015 at 7:26 PM, Kouhei Kaigai  wrote:
> > The attached patch is an adjusted version of the previous one.
> > Even though it co-exists a new callback and fdw_recheck_quals,
> > the callback is kicked first as follows.
> 
> This seems excessive to me: why would we need an arbitrary-length list
> of plans for an FDW?  I think we should just allow an outer child and
> an inner child, which is probably one more than we'll ever need in
> practice.
>
It is just intended to keep code symmetry with the custom-scan case, so it is
not a significant reason.
Also, I expect ForeignScan will need multiple sub-plans soon
to support more intelligent push-down, like:
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f4...@bpxm15gp.gisp.nec.co.jp

It is a separate discussion, of course, so I don't have a strong preference
here.

> This looks like an independent bug fix:
> 
> +   fscan->fdw_recheck_quals = (List *)
> +   fix_upper_expr(root,
> +  (Node *)
> fscan->fdw_recheck_quals,
> +  itlist,
> +  INDEX_VAR,
> +  rtoffset);
> pfree(itlist);
> 
> If so, it should be committed separately and back-patched to 9.5.
>
OK, I'll split the patch into two.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Cause TestLib.pm to define $windows_os in all branches.

2015-11-17 Thread Michael Paquier
On Wed, Nov 18, 2015 at 4:15 AM, Tom Lane wrote:
> I'm not in a position to double-check that these patches work on Windows,
> but I reviewed them through the expedient of diff'ing the patched files
> against HEAD.

I'll double-check things a bit and post replies on this thread should
I detect a problem.

> The only problem I found was you'd left out the
> documentation addition about needing IPC::Run in install-windows.sgml.
> So I've pushed them with that fix.

Thanks, I clearly forgot this documentation bit.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Andrew Dunstan



On 11/17/2015 04:13 PM, Tom Lane wrote:

Robert Haas  writes:

A few years ago there was a proposal to not only allow multiple -c
options, but to allow -c and -f to be intermingled.  This seems really
rather useful; I'd like to be able to type psql -c do_this_first -f
script.sql and have that work.  But of course it's kind of hard to
figure out how that should behave given the various differences
between -c and -f.

Hm.  That's actually a good reason for changing -c, I guess.  (Or else
introducing a -C that is compatible with -f, but I agree that that
seems a bit ugly.)

If we made -c handle its input just like it had come from a file,
then what would we need to explain to people that wanted the old
behavior?  I guess we'd need to tell them to use --no-psqlrc and
--single-transaction at least, and "-v ON_ERROR_STOP=1".  And
even then it wouldn't replicate the behavior of discarding all
but the last command output.  (Maybe nobody out there is relying
on that, but I wouldn't bet on it.)  And that's all still assuming
that they don't care about multi-command-per-PQexec in itself, but
only the easily user-visible differences.

So that is kind of looking like a mess, and 90% of the mess is from
not wanting to use a PQexec per -c switch, which if you ask me is
not very much of the value-add here.  AFAICS the only thing that's
really in conflict with -f is the implied --no-psqlrc.  How about
this design:

1. -c no longer implies --no-psqlrc.  That's a backwards incompatibility,
but very easy to explain and very easy to work around.

2. You can have multiple -c and/or -f.  Each -c is processed in
the traditional way, ie, either it's a single backslash command
or it's sent in a single PQexec.  That doesn't seem to me to have
much impact on the behavior of adjacent -c or -f.

3. If you combine -1 with -c and/or -f, you get one BEGIN inserted
at the beginning and one COMMIT at the end.  Nothing else changes.

As long as you put only one SQL command per -c, I don't think that
this definition has any real surprises.  And we can discourage
people from putting more than one, saying that that will invoke
legacy behaviors you probably don't want.





WFM. The only reason I originally suggested -C was to avoid 
compatibility issues. If we're prepared to take that on then I agree 
it's better to do what you suggest.


I assume that we won't have any great difficulty in handling 
intermingled -c and -f options in the correct order. Essentially I think 
we'll have to have a unified list of such arguments.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-11-17 Thread Peter Geoghegan
On Tue, Nov 17, 2015 at 12:53 AM, Simon Riggs  wrote:
> Short and sweet! Looks good.

Thanks.

> I would be inclined to add more comments to explain it, these things have a
> habit of being forgotten.

I'm not sure what additional detail I can add.

I seem to be able to produce these sorting patches at a much greater
rate than they can be committed, in part because Robert is the only
one that ever reviews them, and he is only one person. Since you think
the patch is good work, perhaps you can find the time to formally
review it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-11-17 Thread Michael Paquier
On Wed, Nov 18, 2015 at 8:50 AM, Peter Geoghegan wrote:
> I seem to be able to produce these sorting patches at a much greater
> rate than they can be committed, in part because Robert is the only
> one that ever reviews them, and he is only one person. Since you think
> the patch is good work, perhaps you can find the time to formally
> review it.

Finding reviewing volunteers is a good thing, particularly in these
times where we tend to have more patches than reviews; however, I would
suggest prioritizing the older items by beginning with what is in the
current CF (47 items waiting for review as I write this message), including
3 patches for the sorting work.

FWIW, I think that this series of patches is interesting and has high
value, particularly because I have seen clear improvements
with raw dumps on schemas with many indexes (disclaimer: I am the guy
Peter talked to regarding this patch, though it was not at the top of my
head nor of my TODOs).
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 16, 2015 at 6:05 PM, Andrew Dunstan  wrote:
>> I honestly don't see what's so confusing about it, and if there is any
>> confusion then surely we could make sure what's happening is well
>> documented.

> +1.  I'm actually kind of wondering if we should just back up and
> change the way -c works instead, and allow it to be specified more
> than once.  The current behavior is essentially a crock that has only
> backward compatibility to recommend it, and not having two
> confusingly-similar options is of some non-trivial value.

Well, it's not *entirely* true that it has only backwards compatibility
to recommend it: without -c in its current form, there would be no way
to test multiple-commands-in-one-PQexec cases without hacking up some
custom test infrastructure.  That's not a very strong reason maybe, but
it's a reason.  And backwards compatibility is usually a strong argument
around here anyway.

I've not been following this thread in any detail, but have we considered
the approach of allowing multiple -c and saying that each -c is a separate
PQexec (or backslash command)?  So the semantics of one -c wouldn't change,
but commands submitted through multiple -c switches would behave
relatively unsurprisingly, and we wouldn't need two kinds of switch.

Another issue here is how -1 ought to interact with multiple -c.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Robert Haas
On Thu, Nov 12, 2015 at 10:23 AM, Amit Kapila  wrote:
> Thanks for the report.  The reason for this problem is that instrumentation
> information from workers is getting aggregated multiple times.  In
> ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
> will wait for workers to finish and then accumulate stats from workers.
> Now ExecShutdownGatherWorkers() could be called multiple times
> (once we read all tuples from workers, at end of node) and it should be
> ensured that repeated calls should not try to redo the work done by first
> call.
> The same is ensured for tuplequeues, but not for parallel executor info.
> I think we can safely assume that we need to call ExecParallelFinish() only
> when there are workers started by the Gathers node, so on those lines
> attached patch should fix the problem.

I suggest that we instead fix ExecParallelFinish() to be idempotent.
Add a "bool finished" flag to ParallelExecutorInfo and return at once
if it's already set. Get rid of the exposed
ExecParallelReinitializeTupleQueues() interface and have
ExecParallelReinitialize(pei) instead.  Have that call
ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
true), and set pei->finished = false.  I think that would give us a
slightly cleaner separation of concerns between nodeGather.c and
execParallel.c.
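
Roughly, the shape I'm suggesting, as a standalone mock (the real
ParallelExecutorInfo and functions obviously do real work; names here are
illustrative):

#include <stdbool.h>
#include <stdio.h>

/* mock of the parts of ParallelExecutorInfo relevant to the guard */
typedef struct MockParallelExecutorInfo
{
    bool    finished;       /* has the finish step already run? */
    int     stats_merges;   /* stand-in for "accumulate stats from workers" */
} MockParallelExecutorInfo;

static void
mock_exec_parallel_finish(MockParallelExecutorInfo *pei)
{
    if (pei->finished)      /* repeated calls become harmless no-ops */
        return;
    pei->stats_merges++;    /* wait for workers, merge instrumentation, ... */
    pei->finished = true;
}

static void
mock_exec_parallel_reinitialize(MockParallelExecutorInfo *pei)
{
    /* re-create tuple queues, reinitialize the DSM, then re-arm the guard */
    pei->finished = false;
}

int
main(void)
{
    MockParallelExecutorInfo pei = { false, 0 };

    mock_exec_parallel_finish(&pei);
    mock_exec_parallel_finish(&pei);     /* second call does nothing */
    mock_exec_parallel_reinitialize(&pei);
    mock_exec_parallel_finish(&pei);     /* after a rescan, runs once more */

    printf("stats merged %d times\n", pei.stats_merges);   /* prints 2 */
    return 0;
}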

Your fix seems a little fragile.  You're relying on node->reader !=
NULL to tell you whether the readers need to be cleaned up, but in
fact node->reader is set to a non-NULL value AFTER the pei has been
created.  Granted, we currently always create a reader unless we don't
get any workers, and if we don't get any workers then failing to call
ExecParallelFinish is currently harmless, but nonetheless I think we
should be more explicit about this so it doesn't accidentally get
broken later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Pavel Stehule
> Well, it's not *entirely* true that it has only backwards compatibility
> to recommend it: without -c in its current form, there would be no way
> to test multiple-commands-in-one-PQexec cases without hacking up some
> custom test infrastructure.  That's not a very strong reason maybe, but
> it's a reason.  And backwards compatibility is usually a strong argument
> around here anyway.
>
> I've not been following this thread in any detail, but have we considered
> the approach of allowing multiple -c and saying that each -c is a separate
> PQexec (or backslash command)?  So the semantics of one -c wouldn't change,
> but commands submitted through multiple -c switches would behave
> relatively unsurprisingly, and we wouldn't need two kinds of switch.
>

I believe it can work, but there is a stronger limit with a single PQexec call -
only the result of the last command is displayed.
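
A small libpq illustration of that limit: only the last statement's result
comes back from a multi-statement PQexec (the connection string here is just a
placeholder):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* two statements in one PQexec: only the second one's result is returned */
    res = PQexec(conn, "SELECT 1 AS a; SELECT 2 AS b;");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("%d column(s), first value: %s\n",
               PQnfields(res), PQgetvalue(res, 0, 0));   /* 1 column, value "2" */

    PQclear(res);
    PQfinish(conn);
    return 0;
}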

Regards

Pavel


>
> Another issue here is how -1 ought to interact with multiple -c.
>
> regards, tom lane
>


Re: [HACKERS] Bug in numeric multiplication

2015-11-17 Thread Tom Lane
Dean Rasheed  writes:
> On 17 November 2015 at 14:43, Tom Lane  wrote:
>> Hm, good point.  I don't feel a compulsion to have a test case that
>> proves it's broken before we fix it.  Do you want to send a patch?

> OK, here it is. It's slightly different from mul_var, because maxdiv
> is tracking absolute values and the carry may be positive or negative,
> and its absolute value may be as high as INT_MAX / NBASE + 1 (when
> it's negative), but otherwise the principle is the same.

I pushed this, but while looking at it, my attention was drawn to this
bit down near the end of the loop:

/*
 * The dividend digit we are about to replace might still be nonzero.
 * Fold it into the next digit position.  We don't need to worry about
 * overflow here since this should nearly cancel with the subtraction
 * of the divisor.
 */
div[qi + 1] += div[qi] * NBASE;

In the first place, I'm not feeling especially convinced by that handwave
about there being no risk of overflow.  In the second place, this is
indisputably failing to consider whether maxdiv might need to increase.
If I add

Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1)));

right after this, the regression tests blow up.  So it looks to me like
we've got some more work to do.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Robert Haas
On Mon, Nov 16, 2015 at 6:05 PM, Andrew Dunstan  wrote:
> I honestly don't see what's so confusing about it, and if there is any
> confusion then surely we could make sure what's happening is well
> documented.

+1.  I'm actually kind of wondering if we should just back up and
change the way -c works instead, and allow it to be specified more
than once.  The current behavior is essentially a crock that has only
backward compatibility to recommend it, and not having two
confusingly-similar options is of some non-trivial value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Pavel Stehule
2015-11-17 21:00 GMT+01:00 Robert Haas :

> On Tue, Nov 17, 2015 at 2:25 PM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> On Mon, Nov 16, 2015 at 6:05 PM, Andrew Dunstan 
> wrote:
> >>> I honestly don't see what's so confusing about it, and if there is any
> >>> confusion then surely we could make sure what's happening is well
> >>> documented.
> >
> >> +1.  I'm actually kind of wondering if we should just back up and
> >> change the way -c works instead, and allow it to be specified more
> >> than once.  The current behavior is essentially a crock that has only
> >> backward compatibility to recommend it, and not having two
> >> confusingly-similar options is of some non-trivial value.
> >
> > Well, it's not *entirely* true that it has only backwards compatibility
> > to recommend it: without -c in its current form, there would be no way
> > to test multiple-commands-in-one-PQexec cases without hacking up some
> > custom test infrastructure.  That's not a very strong reason maybe, but
> > it's a reason.
>
> True.  We could have a --no-split-commands option for that case, perhaps.
>
> > And backwards compatibility is usually a strong argument
> > around here anyway.
>
> Yeah, but not - at least in my book - at the expense of being stuck
> with a confusing interface forever.
>
> > I've not been following this thread in any detail, but have we considered
> > the approach of allowing multiple -c and saying that each -c is a
> separate
> > PQexec (or backslash command)?  So the semantics of one -c wouldn't
> change,
> > but commands submitted through multiple -c switches would behave
> > relatively unsurprisingly, and we wouldn't need two kinds of switch.
> >
> > Another issue here is how -1 ought to interact with multiple -c.
>
> On a code level, I think the issue here is that ACT_SINGLE_QUERY
> bypasses a lot of stuff that happens in the ACT_FILE case: directly in
> main, there's process_psqlrc(); inside process_file(), there's the
> single_txn handling; then inside MainLoop, there's splitting of
> commands and cancel handling and various other stuff.  In my
> imagination, it's like this because originally psql wasn't nearly as
> complicated as it is now, and as features got added in various places,
> -c gradually diverged.  I don't know whether that's really what
> happened, but it seems to me that it would be good to bring those
> things back together.
>
> A few years ago there was a proposal to not only allow multiple -c
> options, but to allow -c and -f to be intermingled.  This seems really
> rather useful; I'd like to be able to type psql -c do_this_first -f
> script.sql and have that work.  But of course it's kind of hard to
> figure out how that should behave given the various differences
> between -c and -f.  I think in the long run we'll be better off
> rationalizing the interface; I really doubt how many people, even on
> this mailing list, can even enumerate all the differences between -c
> and -f.  If it's too complicated for hackers to remember, it's
> probably not very good for users either.
>

This shouldn't be hard. With the proposed patch, you can do "-C '\i xxx' -C '\i
'" - that is effectively the same as multiple -f,

and this mix can be really useful (in any order).

Regards

Pavel


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
> Okay, I've tried this patch.

Thanks!

> Yes, it's working!

Woohoo.

> However, the first parallel seq scan shows it getting 170314 rows.
> Another run shows it getting 194165 rows.  The final result is
> correct, but as you can see from the rows on the Append node (59094295
> rows), it doesn't match the number of rows on the Gather node
> (3000).

Is this the same issue reported in
http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
and not yet fixed?  I am inclined to think it probably is.

> And also, for some reason, I can no longer get this using more than 2
> workers, even with max_worker_processes = 16 and max_parallel_degree =
> 12.  I don't know if that's anything to do with this patch though.

The number of workers is limited based on the size of the largest
table involved in the Append.  That probably needs considerable
improvement, of course, but this patch is still a step forward over
not-this-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2015-11-17 Thread Peter Eisentraut
On 11/6/15 3:59 PM, Robert Haas wrote:
> So, I really wonder why we're not happy with the ability to substitute
> out just the host and IP.

One of my concerns is that the proposed URLs are not valid URLs anymore
that can be parsed or composed with a URL library, which would be sad.

The other issue is that I think it's taking the implementation down the
wrong path.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question concerning XTM (eXtensible Transaction Manager API)

2015-11-17 Thread Alvaro Herrera
konstantin knizhnik wrote:

> The transaction is normally committed in xlog, so that it can always be 
> recovered in case of node fault.
> But before setting correspondent bit(s) in CLOG and releasing locks we first 
> contact arbiter to get global status of transaction.
> If it is successfully locally committed by all nodes, then arbiter approves 
> commit and commit of transaction normally completed.
> Otherwise arbiter rejects commit. In this case DTM marks transaction as 
> aborted in CLOG and returns error to the client.
> XLOG is not changed and in case of failure PostgreSQL will try to replay this 
> transaction.
> But during recovery it also tries to restore transaction status in CLOG.
> And at this place DTM contacts arbiter to know the status of the transaction.
> If it is marked as aborted in the arbiter's CLOG, then it will also be marked as 
> aborted in the local CLOG.
> And according to PostgreSQL visibility rules no other transaction will see 
> changes made by this transaction.

One problem I see with this approach is that the WAL replay can happen
long after it was written; for instance you might have saved a
basebackup and WAL stream and replay it all several days or weeks later,
when the arbiter no longer has information about the XID.  Later
transactions might (will) depend on the aborted state of the transaction
in question, so this effectively corrupts the database.

In other words, while it's reasonable to require that the arbiter can
always be contacted for transaction commit/abort at run time, it's
not reasonable to contact the arbiter during WAL replay.

I think this merits more explanation:

> The transaction is normally committed in xlog, so that it can always be 
> recovered in case of node fault.

Why would anyone want to "recover" a transaction that was aborted?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog: spurious error message connecting to 9.3

2015-11-17 Thread Robert Haas
On Tue, Nov 10, 2015 at 1:35 AM, Craig Ringer  wrote:
> On 10 November 2015 at 01:47, Marco Nenciarini
>  wrote:
>
>> I've attached a little patch that removes the errors when connected to 9.3.
>
> Looks good to me. No point confusing users.
>
> The other callers of RunIdentifySystem are pg_basebackup and
> pg_receivelogical.
>
> pg_basebackup doesn't ask for the db_name (passes null).
>
> pg_receivelogical handles it being null already (and if it didn't,
> it'd die with or without this patch).
>
> pg_receivexlog expects it to be null and fails gracefully if it isn't.
>
> So this change just removes some pointless noise.

The fprintf(stderr, ...) does not cause a non-local exit, so the
"else" just after it should be deleted.  Otherwise, when that branch
is taken, *db_name doesn't get initialized at all.

Actually, I'd suggest doing it like the attached instead, which seems
a bit tighter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 2c963b6..d7b4499 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -294,15 +294,13 @@ RunIdentifySystem(PGconn *conn, char **sysid, TimeLineID *starttli,
 	/* Get database name, only available in 9.4 and newer versions */
 	if (db_name != NULL)
 	{
-		if (PQnfields(res) < 4)
+		*db_name = NULL;
+		if (PQnfields(res) >= 4 && !PQgetisnull(res, 0, 3))
+			*db_name = pg_strdup(PQgetvalue(res, 0, 3));
+		else if (PQserverVersion(conn) >= 90400)
 			fprintf(stderr,
 	_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d or more fields\n"),
 	progname, PQntuples(res), PQnfields(res), 1, 4);
-
-		if (PQgetisnull(res, 0, 3))
-			*db_name = NULL;
-		else
-			*db_name = pg_strdup(PQgetvalue(res, 0, 3));
 	}
 
 	PQclear(res);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch (3): Implement failover on libpq connect level.

2015-11-17 Thread Thom Brown
On 26 October 2015 at 07:58, Victor Wagner  wrote:
> On 2015.10.14 at 13:41:51 +0300, Victor Wagner wrote:
>
>> Attached patch which implements client library failover and
>> loadbalancing as was described in the proposal
>> <20150818041850.ga5...@wagner.pp.ru>.
>
> New version of patch
>
> 1. Handles replication connections correctly (i.e doesn't attempt to
> check readwrite mode if replication option is on)
> 2. Some minor improvement recommended by Korry Douglas  in
> <562a9259.4060...@enterprisedb.com>

This patch doesn't apply.  On line 636, this appears:

<<< BEGIN MERGE CONFLICT: local copy shown first <<<
+   if (value[0] ==  't')
=== COMMON ANCESTOR content follows 
+   if (strcmp(value, "true") == 0)
=== MERGED IN content follows ==
+   if (value[0]=='t')
>>> END MERGE CONFLICT >

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in numeric multiplication

2015-11-17 Thread Dean Rasheed
On 17 November 2015 at 14:43, Tom Lane  wrote:
> Dean Rasheed  writes:
>> I just noticed that div_var_fast() has almost identical code, and so
>> in principle it has the same vulnerability, although it obviously only
>> affects the transcendental functions.
>> I don't actually have a test case that triggers it, but it's basically
>> the same algorithm, so logically it needs the same additional headroom
>> to avoid a possible overflow.
>
> Hm, good point.  I don't feel a compulsion to have a test case that
> proves it's broken before we fix it.  Do you want to send a patch?
>

OK, here it is. It's slightly different from mul_var, because maxdiv
is tracking absolute values and the carry may be positive or negative,
and its absolute value may be as high as INT_MAX / NBASE + 1 (when
it's negative), but otherwise the principle is the same. (With the
tightened threshold, each div[] entry stays at or below
INT_MAX - INT_MAX/NBASE - 1 before carries are propagated, so adding a
carry of absolute value at most INT_MAX/NBASE + 1 cannot exceed INT_MAX.)

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index fe9f3f7..eaafcd9
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6266,8 +6266,15 @@ div_var_fast(NumericVar *var1, NumericVa
 	/*
 	 * maxdiv tracks the maximum possible absolute value of any div[] entry;
 	 * when this threatens to exceed INT_MAX, we take the time to propagate
-	 * carries.  To avoid overflow in maxdiv itself, it actually represents
-	 * the max possible abs. value divided by NBASE-1.
+	 * carries.  Furthermore, we need to ensure that overflow doesn't occur
+	 * during the carry propagation passes either.  The carry values may have
+	 * an absolute value as high as INT_MAX/NBASE + 1, so really we must
+	 * normalize when digits threaten to exceed INT_MAX - INT_MAX/NBASE - 1.
+	 *
+	 * To avoid overflow in maxdiv itself, it actually represents the max
+	 * possible absolute value divided by NBASE-1, ie, at the top of the loop
+	 * it is known that no dig[] entry has an absolute value exceeding
+	 * maxdiv * (NBASE-1).
 	 */
 	maxdiv = 1;
 
@@ -6293,7 +6300,7 @@ div_var_fast(NumericVar *var1, NumericVa
 		{
 			/* Do we need to normalize now? */
 			maxdiv += Abs(qdigit);
-			if (maxdiv > INT_MAX / (NBASE - 1))
+			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
 			{
 /* Yes, do it */
 carry = 0;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 2:25 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Nov 16, 2015 at 6:05 PM, Andrew Dunstan  wrote:
>>> I honestly don't see what's so confusing about it, and if there is any
>>> confusion then surely we could make sure what's happening is well
>>> documented.
>
>> +1.  I'm actually kind of wondering if we should just back up and
>> change the way -c works instead, and allow it to be specified more
>> than once.  The current behavior is essentially a crock that has only
>> backward compatibility to recommend it, and not having two
>> confusingly-similar options is of some non-trivial value.
>
> Well, it's not *entirely* true that it has only backwards compatibility
> to recommend it: without -c in its current form, there would be no way
> to test multiple-commands-in-one-PQexec cases without hacking up some
> custom test infrastructure.  That's not a very strong reason maybe, but
> it's a reason.

True.  We could have a --no-split-commands option for that case, perhaps.

> And backwards compatibility is usually a strong argument
> around here anyway.

Yeah, but not - at least in my book - at the expense of being stuck
with a confusing interface forever.

> I've not been following this thread in any detail, but have we considered
> the approach of allowing multiple -c and saying that each -c is a separate
> PQexec (or backslash command)?  So the semantics of one -c wouldn't change,
> but commands submitted through multiple -c switches would behave
> relatively unsurprisingly, and we wouldn't need two kinds of switch.
>
> Another issue here is how -1 ought to interact with multiple -c.

On a code level, I think the issue here is that ACT_SINGLE_QUERY
bypasses a lot of stuff that happens in the ACT_FILE case: directly in
main, there's process_psqlrc(); inside process_file(), there's the
single_txn handling; then inside MainLoop, there's splitting of
commands and cancel handling and various other stuff.  In my
imagination, it's like this because originally psql wasn't nearly as
complicated as it is now, and as features got added in various places,
-c gradually diverged.  I don't know whether that's really what
happened, but it seems to me that it would be good to bring those
things back together.

A few years ago there was a proposal to not only allow multiple -c
options, but to allow -c and -f to be intermingled.  This seems really
rather useful; I'd like to be able to type psql -c do_this_first -f
script.sql and have that work.  But of course it's kind of hard to
figure out how that should behave given the various differences
between -c and -f.  I think in the long run we'll be better off
rationalizing the interface; I really doubt how many people, even on
this mailing list, can even enumerate all the differences between -c
and -f.  If it's too complicated for hackers to remember, it's
probably not very good for users either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-11-17 Thread Peter Eisentraut
On 11/17/15 2:16 AM, Jim Nasby wrote:
> On 11/15/15 10:56 PM, dinesh kumar wrote:
>> So, shall we make this pg_report_log TO pg_write_log OR pg_ereport OR
>>  from you.
> 
> Why not pg_raise to mirror plpgsql? (The function does have the same
> semantics, right? It's not doing something like only sending to the log
> and not the client?)

I think the "raise" terminology is specific to plpgsql, as it actually
raises an exception in that language.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Masahiko Sawada
On Wed, Nov 18, 2015 at 12:56 AM, Thom Brown  wrote:
> On 17 November 2015 at 15:43, Jim Nasby  wrote:
>> On 11/17/15 4:41 AM, Thom Brown wrote:
>>>
>>> Could someone post a TL;DR summary of what the current plan looks
>>> like?  I can see there is a huge amount of discussion to trawl back
>>> through.  I can see it's something to do with the visibility map.  And
>>> does it avoid freezing very large tables like the title originally
>>> sought?
>>
>>
>> Basically, it follows the same pattern that all-visible bits do, except
>> instead of indicating a page is all-visible, the bit shows that all tuples
>> on the page are frozen. That allows a scan_all vacuum to skip those pages.
>
> So the visibility map is being repurposed?

My proposal is to add one additional bit, indicating that all tuples on
the page are completely frozen, to the visibility map.
That is, the visibility map will become a bitmap with two bits
(all-visible, all-frozen) per page.

> And if a row on a frozen
> page is modified, what happens to the visibility of all other rows on
> that page, as the bit will be set back to 0?

In this case, both of the corresponding VM bits are cleared.
Such behaviour is almost the same as what PostgreSQL does today.
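
A minimal sketch of how such a two-bits-per-page map could be addressed (the
macro and function names and the "4 blocks per byte" figure are illustrative
assumptions, not the actual patch):

#include <stdint.h>

#define BITS_PER_HEAPBLOCK  2
#define FLAG_ALL_VISIBLE    0x01
#define FLAG_ALL_FROZEN     0x02

/* Return the two flag bits recorded for heap block "blk" in the map. */
uint8_t
map_get_flags(const uint8_t *map, uint32_t blk)
{
	uint32_t	byteno = blk / 4;	/* 4 heap blocks per map byte */
	uint32_t	shift = (blk % 4) * BITS_PER_HEAPBLOCK;

	return (map[byteno] >> shift) & (FLAG_ALL_VISIBLE | FLAG_ALL_FROZEN);
}

Clearing a page's state on modification would simply store zero into that
two-bit slot.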


Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Kouhei Kaigai
> > On Sun, Nov 8, 2015 at 7:26 PM, Kouhei Kaigai  wrote:
> > > The attached patch is an adjusted version of the previous one.
> > > Even though it co-exists a new callback and fdw_recheck_quals,
> > > the callback is kicked first as follows.
> >
> > This seems excessive to me: why would we need an arbitrary-length list
> > of plans for an FDW?  I think we should just allow an outer child and
> > an inner child, which is probably one more than we'll ever need in
> > practice.
> >
> It just intends to keep code symmetry with custom-scan case, so not
> a significant reason.
> And, I expected ForeignScan will also need multiple sub-plans soon
> to support more intelligent push-down like:
> http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8010F47D
> a...@bpxm15gp.gisp.nec.co.jp
> 
> It is a separate discussion, of course, so I don't have strong preference
> here.
> 
> > This looks like an independent bug fix:
> >
> > +   fscan->fdw_recheck_quals = (List *)
> > +   fix_upper_expr(root,
> > +  (Node *)
> > fscan->fdw_recheck_quals,
> > +  itlist,
> > +  INDEX_VAR,
> > +  rtoffset);
> > pfree(itlist);
> >
> > If so, it should be committed separately and back-patched to 9.5.
> >
> OK, I'll split the patch into two.
>
The attached patch is the portion cut from the previous EPQ recheck
patch.

Regarding fdw_plans vs. fdw_plan, I'll follow your suggestion.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-bugfix-fdw_recheck_quals-on-setrefs.patch
Description: pgsql-bugfix-fdw_recheck_quals-on-setrefs.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor comment improvement to create_foreignscan_plan

2015-11-17 Thread Etsuro Fujita

On 2015/11/18 2:57, Robert Haas wrote:

On Sun, Nov 15, 2015 at 9:25 PM, Etsuro Fujita
 wrote:

Oops, I've found another one.  I think we should update a comment in
postgresGetForeignPlan, too; add remote filtering expressions to the list of
information needed to create a ForeignScan node.



Instead of saying "remote/local", how about saying "remote and local"
or just leaving it out altogether as in the attached?


+1 for your patch.

Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Etsuro Fujita

On 2015/11/18 3:19, Robert Haas wrote:

On Thu, Nov 12, 2015 at 12:54 AM, Etsuro Fujita
 wrote:

Really?  I think there would be no small burden on an FDW author; when
postgres_fdw delegates the subplan to the remote server, for example, it
would need to create a remote join query by looking at tuples possibly
fetched and stored in estate->es_epqTuple[], then send the query and receive the
result during the callback routine.  Furthermore, what I'm most concerned
about is that that wouldn't be efficient.  So, my question about that approach is
whether FDWs would really do something like that during the callback routine,
instead of performing a secondary join plan locally.  As I said before, I
know that KaiGai-san considers that approach would be useful for custom
joins.  But I see zero evidence that there is a good use-case for an FDW.



It could do that.  But it could also just invoke a subplan as you are
proposing.  Or at least, I think we should set it up so that such a
thing is possible.  In which case I don't see the problem.


I suppose you (and KaiGai-san) are probably right, but I really fail to 
see it actually doing that.


Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Skip ALTER x SET SCHEMA if the schema didn't change

2015-11-17 Thread Haribabu Kommi
On Wed, Nov 18, 2015 at 6:02 AM, Robert Haas  wrote:
> On Mon, Nov 16, 2015 at 4:27 AM, Marti Raudsepp  wrote:
>> Thank you so much for the review and patch update. I should have done that
>> myself, but I've been really busy for the last few weeks. :(
>
> Maybe I'm having an attack of the stupids today, but it looks to me
> like the changes to pg_constraint.c look awfully strange to me.  In
> the old code, if object_address_present() returns true, we continue,
> skipping the rest of the loop.  In the new code, we instead set
> alreadyChanged to true.  That causes both of the following if
> statements, as revised, to fall out, so that we skip the rest of the
> loop.  Huh?  Wouldn't a one line change to add oldNspId != newNspId to
> the criteria for a simple_heap_update be just as good?

Yes, that's correct, the above change can be written as you suggested.
Updated patch attached with correction.

> Backing up a bit, maybe we should be a bit more vigorous in treating a
> same-namespace move as a no-op.  That is, don't worry about calling
> the post-alter hook in that case - just have AlterConstraintNamespaces
> start by checking whether oldNspId == newNspid right at the top; if
> so, return.  The patch seems to have the idea that it is important to
> call the post-alter hook even in that case, but I'm not sure whether
> that's true.  I'm not sure it's false, but I'm also not sure it's
> true.

I am also not sure whether calling the post-alter hook is necessarily
required in the constraint case, but it was being done for other objects, so I
suggested it that way.

Regards,
Hari Babu
Fujitsu Australia


0001-Skip-ALTER-x-SET-SCHEMA-if-the-schema-didn-t-change_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Amit Kapila
On Wed, Nov 18, 2015 at 12:59 AM, Robert Haas  wrote:
>
> On Thu, Nov 12, 2015 at 10:23 AM, Amit Kapila 
wrote:
> > Thanks for the report.  The reason for this problem is that
instrumentation
> > information from workers is getting aggregated multiple times.  In
> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
> > will wait for workers to finish and then accumulate stats from workers.
> > Now ExecShutdownGatherWorkers() could be called multiple times
> > (once we read all tuples from workers, at end of node) and it should be
> > ensured that repeated calls should not try to redo the work done by
first
> > call.
> > The same is ensured for tuplequeues, but not for parallel executor info.
> > I think we can safely assume that we need to call ExecParallelFinish()
only
> > when there are workers started by the Gathers node, so on those lines
> > attached patch should fix the problem.
>
> I suggest that we instead fix ExecParallelFinish() to be idempotent.
> Add a "bool finished" flag to ParallelExecutorInfo and return at once
> if it's already set. Get rid of the exposed
> ExecParallelReinitializeTupleQueues() interface and have
> ExecParallelReinitialize(pei) instead.  Have that call
> ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
> true), and set pei->finished = false.  I think that would give us a
> slightly cleaner separation of concerns between nodeGather.c and
> execParallel.c.
>

Okay, attached patch fixes the issue as per above suggestion.
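
For reference, a rough sketch of the idempotent-shutdown idea (the flag name
comes from the suggestion above; the rest is illustrative only, not the
committed code):

#include <stdbool.h>

typedef struct ParallelExecutorInfo
{
	/* ... existing members elided ... */
	bool		finished;		/* worker stats already accumulated? */
} ParallelExecutorInfo;

void
ExecParallelFinish(ParallelExecutorInfo *pei)
{
	if (pei->finished)
		return;					/* repeated calls become harmless no-ops */

	/* wait for workers to exit and accumulate their instrumentation here */

	pei->finished = true;
}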


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_finish_parallel_executor_info_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 4:51 PM, Bert  wrote:

> Hey Robert,
>
> Thank you for the help. As you might (not) know, I'm quite new to the
> community, but I'm learning. with the help from people like you.
> anyhow, find attached a third attempt to a valid backtrace file.
>
> This run is compiled from commit 5f10b7a604c87fc61a2c20a56552301f74c9bd5f
> and your latest patch attached in this mail thread.
>

Thanks.  This is great.  Can you also run these commands:

frame 1
p *scan

The first command should select the heap_parallelscan_nextpage frame.  The
second command should print the contents of the scan object.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Amit Kapila
On Tue, Nov 17, 2015 at 11:22 PM, Robert Haas  wrote:
>
> On Mon, Nov 16, 2015 at 9:49 PM, Amit Kapila 
wrote:
> >> I don't understand this part.
> >>
> >
> > Here the above check always passes as tqueue->tupledesc is not
> > set, due to which it always tries to build remap info.  Is there any reason
> > for doing so?
>
> Groan.  The problem here is that tqueue->tupledesc never gets set.

Yes that was the problem.

>  I
> think this should be fixed as in the attached.
>

Works for me!


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Conversion error of floating point numbers in pl/pgsql

2015-11-17 Thread Merlin Moncure
On Tue, Nov 17, 2015 at 9:00 PM, Robert Haas  wrote:
> On Mon, Nov 16, 2015 at 9:49 AM, Tom Lane  wrote:
>> Kyotaro HORIGUCHI  writes:
>>> Hello. I found that 9.5 has an undocumented difference from 9.4
>>> in type cast in pl/pgsql and I think it might better be mentioned
>>> as a change of behavior in release notes.
>>
>>> Whether do you think it is worth mentioning or not in release notes?
>>
>> This seems unnecessarily alarmist to me.  Anybody who's in the habit
>> of converting between float4 and float8 will already be used to this
>> behavior, because it is what has always happened everywhere else in
>> the system.
>
> Sure, but that doesn't mean nobody's functions will start behaving
> differently.  It seems worth mentioning as a backward compatibility
> issue to me, because if something breaks, it may not be immediately
> obvious why it has gotten broken.

Agreed, but the note should be followed by another one warning against
any expectations of floating point behavior below the precision
threshold :-).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 6:51 PM, Kouhei Kaigai  wrote:
> It just intends to keep code symmetry with custom-scan case, so not
> a significant reason.
> And, I expected ForeignScan will also need multiple sub-plans soon
> to support more intelligent push-down like:
> http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f4...@bpxm15gp.gisp.nec.co.jp

I might be missing something, but why would that require multiple child plans?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Conversion error of floating point numbers in pl/pgsql

2015-11-17 Thread Robert Haas
On Mon, Nov 16, 2015 at 9:49 AM, Tom Lane  wrote:
> Kyotaro HORIGUCHI  writes:
>> Hello. I found that 9.5 has an undocumented difference from 9.4
>> in type cast in pl/pgsql and I think it might better be mentioned
>> as a change of behavior in release notes.
>
>> Whether do you think it is worth mentioning or not in release notes?
>
> This seems unnecessarily alarmist to me.  Anybody who's in the habit
> of converting between float4 and float8 will already be used to this
> behavior, because it is what has always happened everywhere else in
> the system.

Sure, but that doesn't mean nobody's functions will start behaving
differently.  It seems worth mentioning as a backward compatibility
issue to me, because if something breaks, it may not be immediately
obvious why it has gotten broken.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-17 Thread Kouhei Kaigai
> On Tue, Nov 17, 2015 at 6:51 PM, Kouhei Kaigai  wrote:
> > It just intends to keep code symmetry with custom-scan case, so not
> > a significant reason.
> > And, I expected ForeignScan will also need multiple sub-plans soon
> > to support more intelligent push-down like:
> >
> http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8010F47D
> a...@bpxm15gp.gisp.nec.co.jp
> 
> I might be missing something, but why would that require multiple child plans?
>
Apart from EPQ rechecks, the above aggressive push-down idea allows sending
the contents of multiple relations to the remote side. In this case, ForeignScan
needs to have multiple sub-plans.

For example, assume there are three relations; tbl_A and tbl_B are
local and small, and tbl_F is remote and large.
In the case where both (tbl_A JOIN tbl_F) and (tbl_B JOIN tbl_F) produce
a large number of rows, and thus consume a considerable amount of network traffic, but
(tbl_A JOIN tbl_B JOIN tbl_F) produces a small number of rows, the optimal
strategy is to send the local contents to the remote side once and then run
a remote query there to produce the relatively smaller result.
At the implementation level, ForeignScan shall represent (tbl_A JOIN tbl_B
JOIN tbl_F) and return a bunch of joined tuples. Its remote query
contains VALUES(...) clauses to pack the contents of tbl_A and tbl_B; thus,
it needs to be capable of executing the underlying multiple scan plans and fetching
tuples prior to remote query execution.
So, ForeignScan may also have multiple sub-plans.

Of course, it is a feature independent of the EPQ rechecks.
It is not a problem even if we extend this field later.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-11-17 Thread Corey Huinker
On Tue, Nov 17, 2015 at 7:28 PM, Michael Paquier 
wrote:

> On Wed, Nov 18, 2015 at 8:50 AM, Peter Geoghegan wrote:
> > I seem to be able to produce these sorting patches at a much greater
> > rate than they can be committed, in part because Robert is the only
> > one that ever reviews them, and he is only one person. Since you think
> > the patch is good work, perhaps you can find the time to formally
> > review it.
>
> Finding reviewing volunteers is a good thing, particularly in these
> times where we tend to have more patches than reviews; however, I would
> suggest prioritizing the older items by beginning with what is in the
> current CF (47 items waiting for review as I write this message), 3 of
> which are patches for the sorting work.
>
> FWIW, I think that this series of patches is interesting and has high
> value, particularly because I have seen clear improvements
> with raw dumps on schemas with many indexes (disclaimer: I am the guy
> Peter talked to regarding this patch, though this was not at the top of
> my head nor on my TODO list).
> --
> Michael
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

I'm willing, but I'm too new to the codebase to be an effective reviewer
(without guidance). The one thing I can offer in the mean time is this: my
company/client nearly always has a few spare AWS machines on the largish
side where I can compile uncommitted patches and benchmark stuff for y'all.


Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2015-11-17 Thread Vitaly Burovoy
On 11/9/15, Robert Haas  wrote:
> On Sat, Nov 7, 2015 at 9:47 AM, Vitaly Burovoy 
> wrote:
>> I'd like to raise a topic about extracting fields from infinite
>> timestamps, so much more that it is mentioned in the TODO list:
>> "Determine how to represent date/time field extraction on infinite
>> timestamps".
>>
>> Currently extracting any field from 'infinity'::TIMESTAMP[TZ] gives
>> result "0" as a mark it has "special" input value.
>>
>> The most confusing case is 'epoch' field: returning "0" from
>> "infinity" means the same thing as returning "0" from "1970-01-01+00".
>>
>> Returning zero in most other cases is only slightly less confusing
>> (may be because for me they are less often used).
>> For example, what about "SELECT EXTRACT(DOW FROM TIMESTAMP
>> 'Infinity')" with result 0, as if it is Sunday?
>> The same thing with fields: decade, hour, minute, seconds,
>> microseconds, milliseconds, timezone, timezone_hour, timezone_minute.
>> Also for "millennium" and "year" (with the note "Keep in mind there is
>> no 0 AD") current returning value is _between_ allowed values, but
>> disallowed.
>> http://www.postgresql.org/docs/9.5/static/functions-datetime.html
>>
>>
>> There was a discussion ended in nothing. It began at:
>> http://www.postgresql.org/message-id/ca+mi_8bda-fnev9ixeubnqhvacwzbyhhkwoxpqfbca9edpp...@mail.gmail.com
>>
>> Discussants agreed change is necessary, but couldn't decide what
>> behavior is preferred: throwing an error or returning NULL, NaN or +/-
>> infinity.
>>
>> My thoughts about that cases:
>> * Throwing an error: prefer to avoid it according to
>> http://www.postgresql.org/message-id/73a5666e-2d40-457e-9dff-248895db7...@gmail.com
>> * NULL: it is "absence of any value", i.e. it could be returned iff
>> input value is NULL (in the other case it is not better than returning
>> 0).
>> * NaN: it could be returned if value is outside current axe (like
>> complex value), but it is not the case.
>>
>> In a parallel discussion ("converting between infinity timestamp and
>> float8 (epoch)")
>> http://www.postgresql.org/message-id/cadakt-icuesh16ulocxbr-dkpcvwtuje4jwxnkdajaawp6j...@mail.gmail.com
>> There was interesting thought to make difference between monotonic
>> values (century, decade, epoch, isoyear, millennium and year) and
>> oscillating values (day, dow, doy, hour, isodow, microseconds,
>> milliseconds, minute, month, quarter, second and week).
>> An argument is for monotonic values +/- infinity has a sense, but not
>> for oscillating ones.
>> But for oscillating values NULL was proposed, that (IMHO) is not a
>> good idea (see above).
>> I think changing current mark "input value is not finite" allows an
>> app layer (which knows which field it tries to fetch from
>> timestamp[tz]) to handle extracted value correctly. For oscillating
>> values there can be the same values as for monotonic values, because
>> you can't mix them up.
>> The end of the parallel discussion (with the most important thoughts)
>> at
>> http://www.postgresql.org/message-id/4efcfd1c.8040...@archidevsys.co.nz
>>
>> So I think +/- infinity is the best returning value for all fields.
>>
>> The attached patch contains changes in timestamp_part and
>> timestamptz_part and tests for them.
>>
>> I doubt whether it can be backpatched (according to team's rules) or
>> not, but the patch can be applied down to 9.2 without conflicts and
>> passes tests.
>> Unfortunately, on 9.1 proposed test fails because "SELECT
>> EXTRACT(EPOCH FROM DATE '1970-01-01')" gives "28800" instead of "0".
>> Before 9.2 it was time zone-related.
>
> We're definitely not going to back-patch this.  Let's tally up the
> votes on that other thread:
>
> Danielle Varrazzo: infinity
> Bruce Momjian: infinity
> Robert Haas: not sure we want to change anything, but if so let's
> definitely NOT throw an error
> Alvaro Herrera: infinity for epoch, but what about other things?
> Brendan Jurd: infinity for epoch, error for other things
> Tom Lane: infinity for epoch, error or NaN for other things
> Josh Berkus: definitely change something, current behavior sucks
>
> That doesn't seem like enough consensus to commit this patch, which
> would change everything to +/-infinity.  That particular choice
> wouldn't bother me much, but it sounds like other people aren't sold.
> I think we need to try to hash that out a little more rather than
> rushing into a backward-incompatible change.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

I apologize for the late answer: I was very sick last week.

So, summarizing the answers in a table:

                  | Inf | NULL | NaN | Err
Danielle Varrazzo |  +  |      |     |
Bruce Momjian     |  +  |      |     |
Robert Haas       |     |      |     |  -
Alvaro Herrera    |     |      |     |
Brendan Jurd      |     |      |     |  +
Tom Lane          |     |      |  +  |  +
Josh Berkus       |     |      |     |

Kevin Grittner    |     |

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Amit Kapila
On Tue, Nov 17, 2015 at 1:32 PM, Amit Kapila 
wrote:
>
> On Mon, Sep 21, 2015 at 6:34 AM, Peter Geoghegan  wrote:
> >
> > On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila 
wrote:
> > > Increasing CLOG buffers to 64 helps in reducing the contention due to
second
> > > reason.  Experiments revealed that increasing CLOG buffers only helps
> > > once the contention around ProcArrayLock is reduced.
> >
>
> Overall this idea sounds promising, but I think the work involved is more
> than the benefit I am expecting for the current optimization we are
> discussing.
>

Sorry, I think the last line is slightly confusing; let me try to write it
again:

Overall this idea sounds promising, but I think the work involved is more
than the benefit expected from the current optimization we are
discussing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Amit Kapila
On Mon, Sep 21, 2015 at 6:34 AM, Peter Geoghegan  wrote:
>
> On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila 
wrote:
> > Increasing CLOG buffers to 64 helps in reducing the contention due to
second
> > reason.  Experiments revealed that increasing CLOG buffers only helps
> > once the contention around ProcArrayLock is reduced.
>
> There has been a lot of research on bitmap compression, more or less
> for the benefit of bitmap index access methods.
>
> Simple techniques like run length encoding are effective for some
> things. If the need to map the bitmap into memory to access the status
> of transactions is a concern, there has been work done on that, too.
> Byte-aligned bitmap compression is a technique that might offer a good
> trade-off between compressing clog and decompression overhead -- I
> think that there basically is no decompression overhead, because set
> operations can be performed on the "compressed" representation
> directly. There are other techniques, too.
>

I could see benefits of doing compression for CLOG, but I think it won't
be straightforward.  Apart from handling compression and decompression,
the current code relies on the transaction id to find the clog page; that will
not work after compression, or we would need to change that mapping
to make it work.  Also, I think it could avoid the increase of clog buffers,
which can help readers, but it won't help much with the contention around clog
updates for transaction status.
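
For reference, the fixed mapping being referred to looks roughly like the
following (simplified from clog.c, with BLCKSZ assumed to be 8kB); a
compressed layout would break this direct arithmetic from xid to page and byte:

#define CLOG_BITS_PER_XACT	2
#define CLOG_XACTS_PER_BYTE	4
#define CLOG_XACTS_PER_PAGE	(8192 * CLOG_XACTS_PER_BYTE)	/* BLCKSZ assumed 8kB */

#define TransactionIdToPage(xid)	((xid) / CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid)	(((xid) % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE)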

Overall this idea sounds promising, but I think the work involved is more
than the benefit I am expecting for the current optimization we are
discussing.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Masahiko Sawada
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas > wrote:
> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila > wrote:
>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund > wrote:
>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs >
>>> > wrote:
>>> > >
>>> > > On 1 October 2015 at 23:30, Josh Berkus > wrote:
>>> > >>
>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <
masao.fu...@gmail.com >
>>> > wrote:
>>> > >> >> I wonder how much it's worth renaming only the file extension
>>> > >> >> while
>>> > >> >> there are many places where "visibility map" and "vm" are used,
>>> > >> >> for example, log messages, function names, variables, etc.
>>> > >> >
>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even
if
>>> > >> > it
>>> > >> > also contains freeze information.
>>> > >> >
>>> >
>>> > What is your main worry about changing the name of this map, is it
>>> > about more code churn or is it about that we might introduce new
issues
>>> > or is it about that people are already accustomed to call this map as
>>> > visibility map?
>>>
>>> Several:
>>> * Visibility map is rather descriptive, none of the replacement terms
>>>   imo come close. Few people will know what a 'freeze' map is.
>>> * It increases the size of the patch considerably
>>> * It forces tooling that knows about the layout of the database
>>>   directory to change their tools
>>>
>>
>> All these points are legitimate.
>>
>>> On the benefit side the only argument I've heard so far is that it allows
>>> to disambiguate the format. But, uh, a look at the major version does
>>> that just as well, for far less trouble.
>>>
>>> > It seems to me quite logical for understanding purpose as well.  Any
new
>>> > person who wants to work in this area or is looking into it will
always
>>> > wonder why this map is named as visibility map even though it contains
>>> > information about visibility of page as well as frozen state of page.
>>>
>>> Being frozen is about visibility as well.
>>>
>>
>> OTOH being visible doesn't mean page is frozen.  I understand that
frozen is
>> related to visibility, but still it is a separate state of page and used
for
>> different
>> purpose.  I think this is a subjective point and we could go either way,
it
>> is
>> just a matter in which way more people are comfortable.
>
> I'm stickin' with what I said before, and what I think Andres is
> saying too: renaming the map is a horrible idea.  It produces lots of
> code churn for no real benefit.  We usually avoid such changes, and I
> think we should do so here, too.

Understood.
I'm going to turn the patch back to using the visibility map, and just add the
logic for the VACUUM FREEZE enhancement.
If we want to add other status bits not related to visibility to the
visibility map in the future, it would be worth considering then.

Regards,

--
Masahiko Sawada


-- 
Regards,

--
Masahiko Sawada


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-11-17 Thread Masahiko Sawada
On Tue, Nov 17, 2015 at 9:57 AM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Tue, 17 Nov 2015 01:09:57 +0900, Masahiko Sawada  
> wrote in 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Thom Brown
On 13 November 2015 at 22:09, Robert Haas  wrote:
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai  wrote:
>> I'm now designing the parallel feature of Append...
>>
>> Here is one challenge. How do we determine whether each sub-plan
>> allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.  Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
>
> Append
> -> Gather
>   -> Parallel Seq Scan
> -> Gather
>   -> Parallel Seq Scan
> -> Seq Scan
>
> This is a crappy plan.  Each Gather node will try to launch its own
> bunch of workers, which sucks.  The attached patch lets you get this
> plan instead:
>
> Gather
> -> Append
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>
> That's a much better plan.
>
> To make this work, this plan introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan.  Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path; or we can use those partial paths
> to build more partial paths.  The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel.  But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path for form a partial path for the joinrel.
>
> This requires some way of figuring out how many workers to request for
> the append-path, so this patch also adds a parallel_degree field to
> the path object, which is 0 for non-parallel things and the number of
> workers that the path believes to be ideal otherwise.  Right now, it
> just percolates the highest worker count of any child up to the
> appendrel, which might not be right, especially once the append node
> knows how to balance workers, but it seems like a reasonable place to
> start.
>
>> Type-A) It can be performed on background worker context and
>>picked up by multiple worker processes concurrently.
>>   (e.g: Parallel SeqScan)
>> Type-B) It can be performed on background worker context but
>>   cannot be picked up by multiple worker processes.
>>   (e.g: non-parallel aware node)
>> Type-C) It should not be performed on background worker context.
>>(e.g: plan/path node with volatile functions)
>
> At the time that we're forming an AppendPath, we can identify what
> you're calling type-A paths very easily: they're the ones that are
> coming from the partial_pathlist.  We can distinguish between type-B
> and type-C paths coming from the childrel's pathlist based on the
> childrel's consider_parallel flag: if it's false, it's type-C, else
> type-B.  At some point, we might need a per-path flag to distinguish
> cases where a particular path is type-C even though some other plan
> for that relation might be type-B.  But right now that case doesn't
> arise.
>
> Now, of course, it's not good enough to have this information
> available when we're generating the AppendPath; it has to survive
> until execution time.  But that doesn't mean the paths need to be
> self-identifying.  We could, of course, decide to make them so, and
> maybe that's the best design, but I'm thinking about another approach:
> suppose the append node itself, instead of having one list of child
> plans, has a list of type-A plans, a list of type-B plans, and a list
> of type-C plans.  This actually seems quite convenient, because at
> execution time, you presumably want the leader to prioritize type-C
> plans over any of the others, since it's the only one that can execute
> them, and the workers to prioritize type-B plans, since they can only
> take one worker each and are thus prone to be the limiting factor on
> finishing the whole Append.  Having the plans divided in advance
> between these three lists (say, restricted_plans, safe_plans,
> parallel_plans) makes that easy to implement.
>
> Incidentally, I think it's subtly wrong to think of the parallel_aware
> flag as telling you whether the plan can absorb multiple workers.
> That's not really what it's for.  It's to tell you whether the plan is
> doing *something* parallel aware - that is, whether its Estimate,
> InitializeDSM, and InitializeWorker callbacks should do anything.  For
> SeqScan, flipping parallel_aware actually does split the input among
> all the workers, but for Append it's probably just load balances and
> for other nodes it might be something else again.  The term I'm using
> to indicate a path/plan that returns only a subset of the results in

Re: [HACKERS] Making tab-complete.c easier to maintain

2015-11-17 Thread Kyotaro HORIGUCHI
Hello, I tried to implement the mini-language, which is a
simplified regular expression for this specific use.

As an ultra-POC, the attached patch has a very ad-hoc preprocessing
function and does on-the-fly preprocessing, compilation and then
execution of the regular expression. And it is applied to only the
first ten or so matchings in psql_completion().

The first attachment is the same as that of the previous patchset.

Every matching line looks like the following,

> else if (RM("ALTER {AGGREGATE|FUNCTION} [#id](.."))
>   COMPLETE_WITH_FUNCTION_ARG(CAPTURE(1));

As a temporary design, "{}" means grouping, "|" means alternatives,
"[]" means capture and '#id' means any identifier.

The largest problem with this would be its computation cost :( This
in turn might be too slow if about three hundred matches run...

I see no straightforward way to preprocess these strings. A
possible solution would be to extract these strings and then
auto-generate a C source file to preprocess them. And when running,
RM() would retrieve the compiled regular expression using the string
as the key.
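
To make that concrete, here is a minimal sketch of such a preprocessing step;
the translation rules and the identifier pattern are assumptions based on the
description above, and escaping of other regex metacharacters is omitted:

#include <stdio.h>
#include <string.h>

static void
preprocess_pattern(const char *src, char *dst, size_t dstlen)
{
	const char *idpat = "[A-Za-z_][A-Za-z0-9_]*";
	size_t		n = 0;

	while (*src && n + strlen(idpat) + 2 < dstlen)
	{
		if (*src == '{' || *src == '[')
			dst[n++] = '(';
		else if (*src == '}' || *src == ']')
			dst[n++] = ')';
		else if (strncmp(src, "#id", 3) == 0)
		{
			memcpy(dst + n, idpat, strlen(idpat));
			n += strlen(idpat);
			src += 2;			/* the loop's src++ consumes the final 'd' */
		}
		else
			dst[n++] = *src;	/* '|' and literal words pass through */
		src++;
	}
	dst[n] = '\0';
}

int
main(void)
{
	char		buf[256];

	preprocess_pattern("ALTER {AGGREGATE|FUNCTION} [#id]", buf, sizeof(buf));
	puts(buf);	/* ALTER (AGGREGATE|FUNCTION) ([A-Za-z_][A-Za-z0-9_]*) */
	return 0;
}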


At Tue, 17 Nov 2015 15:35:43 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20151117.153543.183036803.horiguchi.kyot...@lab.ntt.co.jp>
> At Mon, 16 Nov 2015 12:16:07 -0300, Alvaro Herrera  
> wrote in <20151116151606.GW614468@alvherre.pgsql>
> > I don't think this is an improvement.  It seems to me that a lot more
> > work is required to maintain these regular expressions, which while
> > straightforward are not entirely trivial (harder to read).
> > 
> > If we could come up with a reasonable format that is pre-processed to a
> > regexp, that might be better.  I think Thomas' proposed patch is closer
> > to that than Horiguchi-san's.
> 
> I aimed that matching mechanism can handle optional elements in
> syntexes more robustly. But as all you are saying, bare regular
> expression is too complex here.

At Tue, 17 Nov 2015 16:09:25 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20151117.160925.45883793.horiguchi.kyot...@lab.ntt.co.jp>
> if (Match("^,ALTER,TABLE,\id,$") ||
> Match("^,ALTER,TABLE,IF,EXISTS,\id,$"))...
> 
> Interpreting this kind of mini-language into regular expressions
> could be doable..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From cdc0b9cce43e38463af0b2b7ad4a0181f41995a2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 30 Oct 2015 18:03:18 +0900
Subject: [PATCH 1/2] Allow regex module to be used outside server.

To make regular expression available on frontend, enable pg_regex
module and the files included from it to be detached from the features
not available when used outside backend.
---
 src/backend/regex/regc_pg_locale.c |  7 +++
 src/backend/regex/regcomp.c| 12 
 src/include/regex/regcustom.h  |  6 ++
 src/include/utils/pg_locale.h  |  7 +--
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index b707b06..a631ac2 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -220,6 +220,13 @@ static const unsigned char pg_char_properties[128] = {
 };
 
 
+/* Get rid of using server-side feature elsewhere of server. */
+#ifdef FRONTEND
+#define lc_ctype_is_c(col) (true)
+#define GetDatabaseEncoding() PG_UTF8
+#define ereport(x,...) exit(1)
+#endif
+
 /*
  * pg_set_regex_collation: set collation for these functions to obey
  *
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index a165b3b..b35ccb4 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -2040,11 +2040,17 @@ rfree(regex_t *re)
  * The current implementation is Postgres-specific.  If we ever get around
  * to splitting the regex code out as a standalone library, there will need
  * to be some API to let applications define a callback function for this.
+ *
+ * This check is available only on server side.
  */
 static int
 rcancelrequested(void)
 {
+#ifndef FRONTEND
 	return InterruptPending && (QueryCancelPending || ProcDiePending);
+#else
+	return 0;
+#endif
 }
 
 /*
@@ -2056,11 +2062,17 @@ rcancelrequested(void)
  * The current implementation is Postgres-specific.  If we ever get around
  * to splitting the regex code out as a standalone library, there will need
  * to be some API to let applications define a callback function for this.
+ *
+ * This check is available only on server side.
  */
 static int
 rstacktoodeep(void)
 {
+#ifndef FRONTEND
 	return stack_is_too_deep();
+#else
+	return 0;
+#endif
 }
 
 #ifdef REG_DEBUG
diff --git a/src/include/regex/regcustom.h b/src/include/regex/regcustom.h
index dbb461a..ffc4031 100644
--- a/src/include/regex/regcustom.h
+++ b/src/include/regex/regcustom.h
@@ -29,7 +29,11 @@
  */
 
 /* headers if any */
+#ifdef FRONTEND
+#include 

Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-11-17 Thread Simon Riggs
On 17 November 2015 at 00:52, Peter Geoghegan  wrote:


> This patch does seem like a slam dunk, even if I do say so myself


Short and sweet! Looks good.

I would be inclined to add more comments to explain it, these things have a
habit of being forgotten.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-17 Thread Simon Riggs
On 17 November 2015 at 06:50, Amit Kapila  wrote:


> In anycase, I went ahead and tried further reducing the CLogControlLock
> contention by grouping the transaction status updates.  The basic idea
> is same as is used to reduce the ProcArrayLock contention [1] which is to
> allow one of the proc to become leader and update the transaction status
> for
> other active transactions in system.  This has helped to reduce the
> contention
> around CLOGControlLock.
>

Sounds good. The technique has proved effective with proc array and makes
sense to use here also.


> Attached patch group_update_clog_v1.patch
> implements this idea.
>

I don't think we should be doing this only for transactions that don't have
subtransactions. We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the
full press-up.

The above data indicates that contention due to CLogControlLock is
> reduced by around 50% with this patch.
>
> The reasons for remaining contention could be:
>
> 1. Readers of clog data (checking transaction status data) can take
> Exclusive CLOGControlLock when reading the page from disk, this can
> contend with other Readers (shared lockers of CLogControlLock) and with
> exclusive locker which updates transaction status. One of the ways to
> mitigate this contention is to increase the number of CLOG buffers for
> which
> patch has been already posted on this thread.
>
> 2. Readers of clog data (checking transaction status data) takes shared
> CLOGControlLock which can contend with exclusive locker (Group leader)
> which
> updates transaction status.  I have tried to reduce the amount of work done
> by group leader, by allowing group leader to just read the Clog page once
> for all the transactions in the group which updated the same CLOG page
> (idea similar to what we currently we use for updating the status of
> transactions
> having sub-transaction tree), but that hasn't given any further
> performance boost,
> so I left it.
>
> I think we can use some other ways as well to reduce the contention around
> CLOGControlLock by doing somewhat major surgery around SLRU like using
> buffer pools similar to shared buffers, but this idea gives us moderate
> improvement without much impact on exiting mechanism.
>

My earlier patch to reduce contention by changing required lock level is
still valid here. Increasing the number of buffers doesn't do enough to
remove that.

I'm working on a patch to use a fast-update area like we use for GIN. If a
page is not available when we want to record commit, just store it in a
hash table, when not in crash recovery. I'm experimenting with writing WAL
for any xids earlier than last checkpoint, though we could also trickle
writes and/or flush them in batches at checkpoint time - your code would
help there.

The hash table can also be used for lookups. My thinking is that most reads
of older xids are caused by long running transactions, so they cause a page
fault at commit and then other page faults later when people read them back
in. The hash table works for both kinds of page fault.
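
A toy sketch of that fast-update idea follows; the table shape, sizes and
names are purely illustrative assumptions, not the actual work in progress:

#include <stdbool.h>
#include <stdint.h>

#define FASTUPDATE_SLOTS 1024

typedef struct PendingStatus
{
	uint32_t	xid;		/* 0 means the slot is empty */
	int			status;		/* commit/abort status awaiting its clog page */
} PendingStatus;

static PendingStatus pending[FASTUPDATE_SLOTS];

/* Record a status without faulting the clog page in; false means the table
 * is full and the caller must fall back to reading the page after all. */
bool
fastupdate_put(uint32_t xid, int status)
{
	for (uint32_t i = 0; i < FASTUPDATE_SLOTS; i++)
	{
		uint32_t	probe = (xid + i) % FASTUPDATE_SLOTS;

		if (pending[probe].xid == 0 || pending[probe].xid == xid)
		{
			pending[probe].xid = xid;
			pending[probe].status = status;
			return true;
		}
	}
	return false;
}

/* Readers consult the same table before paging the clog in. */
bool
fastupdate_get(uint32_t xid, int *status)
{
	for (uint32_t i = 0; i < FASTUPDATE_SLOTS; i++)
	{
		uint32_t	probe = (xid + i) % FASTUPDATE_SLOTS;

		if (pending[probe].xid == 0)
			break;
		if (pending[probe].xid == xid)
		{
			*status = pending[probe].status;
			return true;
		}
	}
	return false;
}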

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Question concerning XTM (eXtensible Transaction Manager API)

2015-11-17 Thread konstantin knizhnik

On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:

> 
> I think the general idea is that if Commit is WAL logged, then the
> operation is considered to committed on local node and commit should
> happen on any node, only once prepare from all nodes is successful.
> And after that transaction is not supposed to abort.  But I think you are
> trying to optimize the DTM in some way to not follow that kind of protocol.

DTM is still following the 2PC protocol:
First the transaction is saved in the WAL at all nodes, and only after that is the
commit completed at all nodes.
We try to avoid maintaining separate log files for 2PC (as is done now for prepared
transactions)
and do not want to change the logic of working with the WAL.

The DTM approach is based on the assumption that PostgreSQL CLOG and visibility
rules allow us to "hide" a transaction even if it is committed in the WAL.


> By the way, how will arbiter does the recovery in a scenario where it
> crashes, won't it need to contact all nodes for the status of in-progress or
> prepared transactions? 

The current answer is that the arbiter cannot crash. To provide fault tolerance we
spawn replicas of the arbiter which are managed using the Raft protocol.
If the master crashes or the network is partitioned, then a new master is chosen.
PostgreSQL backends have a list of possible arbiter addresses. Once the connection
with the arbiter is broken, the backend tries to reestablish the connection using
the alternative addresses.
But only the master accepts incoming connections.
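
As a hedged illustration of that reconnection loop (the connection type and
helper below are hypothetical, not pg_dtm's actual API):

#include <stddef.h>

typedef struct ArbiterConn ArbiterConn;

/* hypothetical helper: returns NULL if this address refuses the connection */
extern ArbiterConn *arbiter_try_connect(const char *addr);

ArbiterConn *
arbiter_reconnect(const char *addrs[], size_t naddrs)
{
	/*
	 * Walk the configured addresses until the current Raft master accepts;
	 * the followers simply refuse incoming connections.
	 */
	for (size_t i = 0; i < naddrs; i++)
	{
		ArbiterConn *conn = arbiter_try_connect(addrs[i]);

		if (conn != NULL)
			return conn;
	}
	return NULL;	/* no master reachable; the caller retries later */
}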


> I think it would be better if more detailed design of DTM with respect to
> transaction management and recovery could be updated on wiki for having
> discussion on this topic.  I have seen that you have already updated many
> details of the system, but still the complete picture of DTM is not clear.

I agree.
But please notice that pg_dtm is just one of the possible implementations of
distributed transaction management.
We are also experimenting with other implementations, for example pg_tsdtm, based on
timestamps. It doesn't require a central arbiter and so shows much better (almost
linear) scalability.
But recovery in the case of pg_tsdtm is even more obscure.
Also, the performance of pg_tsdtm greatly depends on system clock synchronization
and network delays. We got about 70k TPS on a cluster with 12 nodes connected
with a 10Gbit network.
But when we run the same test on hosts located in different geographic regions
(several thousand km apart), performance falls to 15 TPS.
 


> 
> 
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Bug in numeric multiplication

2015-11-17 Thread Dean Rasheed
On 21 September 2015 at 17:14, Tom Lane  wrote:
> Dean Rasheed  writes:
>> On 21 September 2015 at 16:09, Tom Lane  wrote:
>>> After trying to rework the comment to explain what maxdig really meant
>>> after your changes, I came to the conclusion that it'd be better to do
>>> it as per attached.  Does this look sane to you?
>
>> Yes that looks better. It's still the same amount of extra headroom
>> (21), but I think it's clearer your way.
>
> OK, pushed (after further hacking on the comment ...)
>
> regards, tom lane

I just noticed that div_var_fast() has almost identical code, and so
in principle it has the same vulnerability, although it obviously only
affects the transcendental functions.

I don't actually have a test case that triggers it, but it's basically
the same algorithm, so logically it needs the same additional headroom
to avoid a possible overflow.
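
As a quick back-of-the-envelope check (my own arithmetic, not code taken from
numeric.c): with NBASE = 10000, a single digit product can be as large as
(NBASE - 1)^2 = 99980001, so a 32-bit int accumulator can only absorb about 21
such products before carries must be propagated, which lines up with the
headroom figure of 21 mentioned upthread.

#include <stdio.h>
#include <limits.h>

int
main(void)
{
	long	nbase = 10000;						/* NBASE used by numeric.c */
	long	maxprod = (nbase - 1) * (nbase - 1);	/* largest digit product */

	/* how many full-magnitude products fit in a 32-bit accumulator: 21 */
	printf("%ld\n", (long) INT_MAX / maxprod);
	return 0;
}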

Regards,
Dean




Re: [HACKERS] pageinspect patch, for showing tuple data

2015-11-17 Thread Guillaume Lelarge
Hi,

On 12 Nov 2015 1:05 AM, "Michael Paquier" wrote:
>
> On Thu, Nov 12, 2015 at 12:41 AM, Nikolay Shaplov
>  wrote:
> > On 28 October 2015 16:57:36, Michael Paquier wrote:
> >> On Sat, Oct 17, 2015 at 1:48 AM, Michael Paquier wrote:
> >> > On Sat, Oct 17, 2015 at 5:15 AM, Nikolay Shaplov wrote:
> >> >> Or it's ready to commit, and just not marked this way?
> >> >
> >> > No, I don't think we have reached this state yet.
> >> >
> >> >> I am going to make a report based on this patch in Vienna. It would be
> >> >> nice to have it committed by then :)
> >> >
> >> > This is registered in this November's CF and the current one is not
> >> > completely wrapped up, so I haven't rushed into looking at it.
> >>
> >> So, I have finally been able to look at this patch in more details,
> >> resulting in the attached. I noticed a couple of things that should be
> >> addressed, mainly:
> >> - addition of a new routine text_to_bits to perform the reverse
> >> operation of bits_to_text. This was previously part of
> >> tuple_data_split, I think that it deserves its own function.
> >> - split_tuple_data should be static
> >> - t_bits_str should not be a text, rather a char* fetched using
> >> text_to_cstring(PG_GETARG_TEXT_PP(4)). This way there is no need to
> >> perform extra calculations with VARSIZE and VARHDRSZ
> >> - split_tuple_data can directly use the relation OID instead of the
> >> tuple descriptor of the relation
> >> - t_bits was leaking memory. For correctness I think that it is better
> >> to free it after calling split_tuple_data.
> >> - PG_DETOAST_DATUM_COPY allocates some memory, this was leaking as
> >> well in raw_attr actually. I refactored the code such as a bytea* is
> >> used and always freed when allocated to avoid leaks. Removing raw_attr
> >> actually simplified the code a bit.
> >> - I simplified the docs, that was largely too verbose in my opinion.
> >> - Instead of using VARATT_IS_1B_E and VARTAG_EXTERNAL, using
> >> VARATT_IS_EXTERNAL and VARATT_IS_EXTERNAL_ONDISK seems more adapted to
> >> me, those other ones are much more low-level and not really spread in
> >> the backend code.
> >> - Found some typos in the code, the docs and some comments. I reworked
> >> the error messages as well.
> >> - The code format was not really in line with the project guidelines.
> >> I fixed that as well.
> >> - I removed heap_page_item_attrs for now to get this patch in a
> >> bare-bone state. Though I would not mind if this is re-added (I
> >> personally don't think that's much necessary in the module
> >> actually...).
> >> - The calculation of the length of t_bits using HEAP_NATTS_MASK is
> >> more correct as you mentioned earlier, so I let it in its state.
> >> That's actually more accurate for error handling as well.
> >> That's everything I recall I have. How does this look?
> > You've completely rewritten everything ;-)
> >
> > Let everything be the way you wrote. This code is better than mine.
> >
> > With one exception. I really need the heap_page_item_attrs function. I
> > used it in my Tuples Internals presentation
> > https://github.com/dhyannataraj/tuple-internals-presentation
> > and I am 100% sure that this function is needed for educational
> > purposes, and this function should be as simple as possible, so students
> > can use it without extra effort.
>
> Fine. That's your patch after all.
>
> > I still think the documentation should be more verbose than your
> > version, but I can accept your version.
>
> I am not sure that's necessary, pageinspect is for developers.
>

FWIW, I agree that pageinspect is mostly for devs. Still, as I said to
Nikolay after his talk at pgconf.eu, it's a nice tool for people who like
to know what's going on deep inside PostgreSQL.

So +1 for that nice feature.

> > Who is going to add heap_page_item_attrs to your patch? me or you?
>
> I recall some parts of the code I still did not like much :) I'll grab
> some room to have an extra look at it.


Re: [HACKERS] Erroneous cost estimation for nested loop join

2015-11-17 Thread Robert Haas
On Mon, Nov 16, 2015 at 6:50 PM, Jeff Janes  wrote:
> On Mon, Nov 9, 2015 at 6:37 AM, Tom Lane  wrote:
>> kawami...@tkl.iis.u-tokyo.ac.jp writes:
>>>   - cost parameter calibration: random_page_cost = 92.89
>>
>> TBH, you lost me there already.  I know of no hardware on which that would
>> be a sane depiction of reality, so I think you've probably overfitted the
>> model to some particular case it was already inaccurate on.
>
> I can easily get a ratio of random to sequential of 50, and my RAID is
> nothing special.  I don't see why a high-end RAID couldn't justify 100
> or more, as long as they know the cache-hit is very low. (The default
> value of 4 seems to bake in the notion that 90% of even random IO is
> going to be satisfied from the cache, which might be a good estimate
> if you have frequently used smallish lookup tables that always get
> joined to the RAM-busters, but some people aren't going to have that
> type of database queries as their main load).

I agree.  What I've been thinking about is:

- If we're sequential scanning a small table, let's say less than 1/4
of shared_buffers, which is the point where synchronized scans kick
in, then assume the data is coming from shared_buffers.
- If we're scanning a medium-sized table, let's say less than
effective_cache_size, then assume the data is coming from the OS
cache.  Maybe this is the same cost as the previous case, or maybe
it's slightly more.
- Otherwise, assume that the first effective_cache_size pages are
coming from cache and the rest has to be read from disk.  This is
perhaps unrealistic, but we don't want the cost curve to be
discontinuous.

The effect of this would be that sequential scanning a small table
would look cheaper per page than sequential scanning a large table,
which is a good idea, because it's likely to be true.  Similar
adaptations could be made for index scans, index-only scans, bitmap
index scans, and bitmap heap scans.  The default value of
random_page_cost would go up, but there would be a new
cached_page_cost GUC that would apply in some cases where
seq_page_cost and/or random_page_cost apply today.
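
To make that concrete, here is a toy sketch of the kind of per-page cost
function I have in mind (not a patch; cached_page_cost and the
1/4-of-shared_buffers cutoff are just the assumptions stated above).  As a
side note on Jeff's point, a true random:sequential ratio of 40 with a 90%
cache-hit assumption works out to 0.1 * 40 = 4, which is the current default.

#include <stdio.h>

/*
 * Toy model only: per-page cost of a sequential scan of a relation of
 * rel_pages pages, given how much of it we guess is cached.  All sizes are
 * in pages; cached_page_cost is a hypothetical new GUC.
 */
static double
toy_seq_page_cost(double rel_pages, double shared_buffers,
				  double effective_cache_size,
				  double cached_page_cost, double seq_page_cost)
{
	double		cached_fraction;

	if (rel_pages <= shared_buffers / 4)
		return cached_page_cost;	/* small: assume it is in shared_buffers */
	if (rel_pages <= effective_cache_size)
		return cached_page_cost;	/* medium: assume it is in the OS cache */

	/* large: first effective_cache_size pages cached, the rest from disk */
	cached_fraction = effective_cache_size / rel_pages;
	return cached_fraction * cached_page_cost +
		   (1.0 - cached_fraction) * seq_page_cost;
}

int
main(void)
{
	/* 1GB shared_buffers, 4GB effective_cache_size, 8kB pages */
	double		shared_buffers = 131072.0;
	double		effective_cache_size = 524288.0;
	double		sizes[] = {10000.0, 300000.0, 5000000.0};
	int			i;

	for (i = 0; i < 3; i++)
		printf("%.0f pages -> %.3f per page\n", sizes[i],
			   toy_seq_page_cost(sizes[i], shared_buffers,
								 effective_cache_size, 0.1, 1.0));
	return 0;
}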

Previous attempts to improve the modeling of cache effects have
faltered because we don't know what will be cached at execution time,
and even if we did it can change very quickly and we don't want plan
instability.  But it seems to me that guessing based on the size of
the relation is pretty reasonable - essentially we'd be trying to
refine a model which says that every page is equally likely to be
cached, and that shouldn't be too high a bar to clear.

A problem with this sort of thing, of course, is that it's really hard
to test a proposed change broadly enough to be certain how it will
play out in the real world.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [COMMITTERS] pgsql: Cause TestLib.pm to define $windows_os in all branches.

2015-11-17 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Oct 13, 2015 at 10:17 PM, Andrew Dunstan wrote:
>> In general I think we can be a good deal more liberal about backpatching the
>> testing regime than we are with production code, where we are always
>> cautious, and the caution has paid big dividends in our reputation for
>> stability.

> Attached are patches for 9.4 and 9.5 syncing up everything with
> master. I tested both of them on OSX, Linux and Windows (MSVC only
> though using vcregress tapcheck).

I'm not in a position to double-check that these patches work on Windows,
but I reviewed them through the expedient of diff'ing the patched files
against HEAD.  The only problem I found was you'd left out the
documentation addition about needing IPC::Run in install-windows.sgml.
So I've pushed them with that fix.

Where do we now stand on invoking the TAP tests in the buildfarm?

regards, tom lane




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2015-11-17 Thread Peter Eisentraut
On 11/5/15 10:12 AM, Victor Wagner wrote:
> 2. You are suggesting that it should be possible to specify all options 
> separately for each of the fallback hosts. 

I'm not necessarily suggesting that.  I think it could very well be

conn="postgresql://foo postgresql://bar order=random"





Re: [HACKERS] Default Roles (was: Additional role attributes)

2015-11-17 Thread Stephen Frost
Michael,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> Will there be any work on this patch for this commit fest or not? This
> is being carried around for a couple of months now with not much
> progress. This thread has been idle for 4 months now.

This thread and the other one kind of merged.  The last update on the
overall discussion is here: 

http://www.postgresql.org/message-id/2015093020.gm3...@tamriel.snowman.net

Which was closer to 1.5 months ago and was the requested split of the
patch, which mainly needs to get review and/or buy-in from others on the
reservation of the role prefix 'pg_' (which is the first patch).

I'm happy to update the patches if they don't apply, of course, but
they're relatively straight-forward and we just need to agree that
reservation of the prefix is acceptable and then I can just commit
them.

Thanks!

Stephen




Re: [HACKERS] Parallel Seq Scan

2015-11-17 Thread Bert
Hey Robert,

Thank you for the help. As you might (not) know, I'm quite new to the
community, but I'm learning, with the help of people like you.
Anyhow, find attached a third attempt at a valid backtrace file.

This run is compiled from commit 5f10b7a604c87fc61a2c20a56552301f74c9bd5f
and your latest patch attached in this mail thread.


cheers,
Bert

(attachment: full_backtrace.log)

On Tue, Nov 17, 2015 at 6:55 PM, Robert Haas  wrote:

> On Tue, Nov 17, 2015 at 6:52 AM, Bert  wrote:
> > edit: maybe this is more useful? :)
>
> Definitely.  But if you've built with --enable-debug and not stripped
> the resulting executable, we ought to get line numbers as well, plus
> the arguments to each function on the stack.  That would help a lot
> more.  The only things that get dereferenced in that function are
> "scan" and "parallel_scan", so it's a good bet that one of those
> pointers is pointing off into never-never land.  I can't immediately
> guess how that's happening, though.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Bert Desmet
0477/305361


Re: [HACKERS] proposal: multiple psql option -c

2015-11-17 Thread Tom Lane
Robert Haas  writes:
> A few years ago there was a proposal to not only allow multiple -c
> options, but to allow -c and -f to be intermingled.  This seems really
> rather useful; I'd like to be able to type psql -c do_this_first -f
> script.sql and have that work.  But of course it's kind of hard to
> figure out how that should behave given the various differences
> between -c and -f.

Hm.  That's actually a good reason for changing -c, I guess.  (Or else
introducing a -C that is compatible with -f, but I agree that that
seems a bit ugly.)

If we made -c handle its input just like it had come from a file,
then what would we need to explain to people that wanted the old
behavior?  I guess we'd need to tell them to use --no-psqlrc and
--single-transaction at least, and "-v ON_ERROR_STOP=1".  And
even then it wouldn't replicate the behavior of discarding all
but the last command output.  (Maybe nobody out there is relying
on that, but I wouldn't bet on it.)  And that's all still assuming
that they don't care about multi-command-per-PQexec in itself, but
only the easily user-visible differences.

So that is kind of looking like a mess, and 90% of the mess is from
not wanting to use a PQexec per -c switch, which if you ask me is
not very much of the value-add here.  AFAICS the only thing that's
really in conflict with -f is the implied --no-psqlrc.  How about
this design:

1. -c no longer implies --no-psqlrc.  That's a backwards incompatibility,
but very easy to explain and very easy to work around.

2. You can have multiple -c and/or -f.  Each -c is processed in
the traditional way, ie, either it's a single backslash command
or it's sent in a single PQexec.  That doesn't seem to me to have
much impact on the behavior of adjacent -c or -f.

3. If you combine -1 with -c and/or -f, you get one BEGIN inserted
at the beginning and one COMMIT at the end.  Nothing else changes.

As long as you put only one SQL command per -c, I don't think that
this definition has any real surprises.  And we can discourage
people from putting more than one, saying that that will invoke
legacy behaviors you probably don't want.
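
So, to illustrate with a hypothetical invocation: under these rules
"psql -1 -c 'SET search_path = app' -f load.sql" would read ~/.psqlrc,
send the SET in its own PQexec, run load.sql as usual, and wrap the whole
sequence in a single BEGIN/COMMIT.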

regards, tom lane




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Thom Brown
On 17 November 2015 at 20:08, Robert Haas  wrote:
> On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
>
>> However, the first parallel seq scan shows it getting 170314 rows.
>> Another run shows it getting 194165 rows.  The final result is
>> correct, but as you can see from the rows on the Append node (59094295
>> rows), it doesn't match the number of rows on the Gather node
>> (3000).
>
> Is this the same issue reported in
> http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
> and not yet fixed?  I am inclined to think it probably is.

Yes, that seems to be the same issue.

>> And also, for some reason, I can no longer get this using more than 2
>> workers, even with max_worker_processes = 16 and max_parallel_degree =
>> 12.  I don't know if that's anything to do with this patch though.
>
> The number of workers is limited based on the size of the largest
> table involved in the Append.  That probably needs considerable
> improvement, of course, but this patch is still a step forward over
> not-this-patch.

Ah, okay.  I wasn't aware of this.  I'll bear that in mind in future.

Thom




Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Masahiko Sawada
On Tue, Nov 17, 2015 at 7:29 PM, Masahiko Sawada  wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas  wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila 
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund 
>>> wrote:
 On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
 > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs 
 > wrote:
 > >
 > > On 1 October 2015 at 23:30, Josh Berkus  wrote:
 > >>
 > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
 > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
 > >> > 
 > wrote:
 > >> >> I wonder how much it's worth renaming only the file extension
 > >> >> while
 > >> >> there are many places where "visibility map" and "vm" are used,
 > >> >> for example, log messages, function names, variables, etc.
 > >> >
 > >> > I'd be inclined to keep calling it the visibility map (vm) even
 > >> > if
 > >> > it
 > >> > also contains freeze information.
 > >> >
 >
 > What is your main worry about changing the name of this map, is it
 > about more code churn or is it about that we might introduce new
 > issues
 > or is it about that people are already accustomed to call this map as
 > visibility map?

 Several:
 * Visibility map is rather descriptive, none of the replacement terms
   imo come close. Few people will know what a 'freeze' map is.
 * It increases the size of the patch considerably
 * It forces tooling that knows about the layout of the database
   directory to change their tools

>>>
>>> All these points are legitimate.
>>>
 On the benefit side the only argument I've heard so far is that it allows
 to disambiguate the format. But, uh, a look at the major version does
 that just as well, for far less trouble.

 > It seems to me quite logical for understanding purpose as well.  Any
 > new
 > person who wants to work in this area or is looking into it will
 > always
 > wonder why this map is named as visibility map even though it contains
 > information about visibility of page as well as frozen state of page.

 Being frozen is about visibility as well.

>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> Understood.
> I'm going to turn the patch back to the visibility map, and just add the
> logic for the VACUUM FREEZE enhancement.

Attached is the latest v24 patch.
I've changed the patch so that it just adds a frozen bit to the visibility map,
so the patch is almost half the size of the previous one.

Please review it.

Regards,

--
Masahiko Sawada
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 		 * If the page has only visible tuples, then we can find out the free
 		 * space from the FSM and move on.
 		 */
-		if (visibilitymap_test(rel, blkno, ))
+		if (VM_ALL_VISIBLE(rel, blkno, ))
 		{
 			freespace = GetRecordedFreeSpace(rel, blkno);
 			stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs eager freezing if the table's
 pg_class.relfrozenxid field has reached
 the age specified by this setting.  The default is 150 million
 transactions.  Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs eager freezing if the table's
 pg_class.relminmxid field has reached
 the age specified by this setting.  The default is 150 million multixacts.
 Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- 

Re: [HACKERS] pageinspect patch, for showing tuple data

2015-11-17 Thread Michael Paquier
On Wed, Nov 18, 2015 at 3:10 AM, Nikolay Shaplov wrote:
> Everything seems to be ok. I've changed only one thing in your version
> of the patch. I've renamed split_tuple_data to
> tuple_data_split_internal, because when there are split_tuple_data and
> tuple_data_split in the same file, nobody will guess what the difference
> between these functions is or why they are named in such a strange way.

Yep, that sounds better this way.

> If it is ok, set the proper status, and let the committer do his job :-)

OK. I have switched the status of this patch to "Ready for committer"
(please, committer-san, double-check the area around
tuple_data_split_internal when fetching data for each attribute, I
think that we got that right but I may be missing something as well).
-- 
Michael




Re: [HACKERS] Default Roles (was: Additional role attributes)

2015-11-17 Thread Michael Paquier
On Wed, Nov 18, 2015 at 5:36 AM, Stephen Frost  wrote:
> Michael,
>
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> Will there be any work on this patch for this commit fest or not? This
>> is being carried around for a couple of months now with not much
>> progress. This thread has been idle for 4 months now.
>
> This thread and the other one kind of merged.  The last update on the
> overall discussion is here:
>
> http://www.postgresql.org/message-id/2015093020.gm3...@tamriel.snowman.net

Right. The CF entry links to two threads, and I somehow missed the first one.

> Which was closer to 1.5 months ago and was the requested split of the
> patch, which mainly needs to get review and/or buy-in from others on the
> reservation of the role prefix 'pg_' (which is the first patch).
> I'm happy to update the patches if they don't apply, of course, but
> they're relatively straight-forward and we just need to agree that
> reservation of the prefix is acceptable and then I can just commit
> them.

I'll reply directly on the other thread; there is no point in mixing
things up here.
-- 
Michael




Re: [HACKERS] Patch (3): Implement failover on libpq connect level.

2015-11-17 Thread Victor Wagner
On Tue, 17 Nov 2015 19:42:42 +
Thom Brown  wrote:

> 
> This patch doesn't apply.  On line 636, this appears:


It doesn't apply to the current state of the master branch due
to changes in fe-connect.c made by Tom Lane in commit c40591885.
Unfortunately those changes overlap my changes line by line.
I'm attaching here an updated version of the patch,
applicable to master later than 11-12-2015.
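
For anyone who wants to try it, here is a minimal test client (my own
example, not part of the patch; the node1/node2/node3 host names are
placeholders). It only relies on the multi-host URI syntax and the
hostorder/readonly parameters that this patch adds:

/* cc failover_test.c -o failover_test -lpq */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* try node1 first, then node2:5433, then node3; reject standbys */
	PGconn	   *conn = PQconnectdb(
		"postgresql://app@node1,node2:5433,node3/mydb"
		"?hostorder=sequential&readonly=0&connect_timeout=5");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "all hosts failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	printf("connected to %s:%s\n", PQhost(conn), PQport(conn));
	PQfinish(conn);
	return 0;
}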


diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 0ee018e..ec12fce 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -792,7 +792,7 @@ host=localhost port=5432 dbname=mydb connect_timeout=10

The general form for a connection URI is:
 
-postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1...]
+postgresql://[user[:password]@][netloc][:port][,netloc[:port]...][/dbname][?param1=value1...]
 

 
@@ -809,6 +809,7 @@ postgresql://localhost/mydb
 postgresql://user@localhost
 postgresql://user:secret@localhost
 postgresql://other@localhost/otherdb?connect_timeout=10application_name=myapp
+postgresql://node1,node2:5433,node3:4432,node4/mydb?hostorder=random&readonly=1
 
 Components of the hierarchical part of the URI can also
 be given as parameters.  For example:
@@ -831,7 +832,9 @@ postgresql:///mydb?host=localhostport=5433

 For improved compatibility with JDBC connection URIs,
 instances of parameter ssl=true are translated into
-sslmode=require.
+sslmode=require and
+loadBalanceHosts=true  into
+hostorder=random.

 

@@ -841,6 +844,10 @@ postgresql:///mydb?host=localhostport=5433
 postgresql://[2001:db8::1234]/database
 

+   
+   There can be several host specifications, optionally accompanied
+   by port numbers, separated by commas.
+   
 

 The host component is interpreted as described for the parameter PostgreSQL was built). On machines without
 Unix-domain sockets, the default is to connect to localhost.

+   
+   There can be more than one host parameter in
+   the connect string. In this case these hosts are considered
+   alternate entries for the same database, and if connecting to the first one
+   fails, the library tries to connect to the second one, etc. This can be used
+   for a high-availability cluster or for load balancing. See the
+parameter.
+   
+   
+   A network host name can be accompanied by a port number, separated by a
+   colon. If so, this port number is used only when connecting to
+   this host. If there is no port number, the port specified in the
+parameter is used.
+   
   
  
 
@@ -942,8 +963,44 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname


   
-
+
+  
+  hostorder
+  
+  
+  Specifies how to choose a host from the list of alternate hosts
+  specified in the  parameter.
+  
+  
+  If the value of this argument is sequential (the
+  default), the library connects to the hosts in the order they are
+  specified, and tries the second one only if the connection to the first
+  fails.
+  
+  
+  If the value is random, the host to connect to is randomly
+  picked from the list. This allows the load to be balanced between several
+  cluster nodes. However, PostgreSQL currently doesn't support
+  multimaster clusters, so, without third-party products,
+  only read-only connections can take advantage of the
+  load balancing. See 
+  
+  
+  
+  
+  readonly
+  
+  
+  If this parameter is 0 (the default), then upon successful connection the
+  library checks whether the host is in recovery state, and if so,
+  tries the next host in the connect string. If this parameter is
+  non-zero, connections to warm standby nodes are considered
+  successful.
+  
+  
+  
   
+
port


@@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
  
   connect_timeout
   
@@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
+ 
+ failover_timeout
+ 
+ 
+ Maximum time to cyclically retry all the hosts in the connect string
+ (as a decimal integer number of seconds). If not specified,
+ the hosts are tried just once.
+ 
+ 
+ In a replicated cluster, if the master node fails, it might
+ take some time to promote one of the standby nodes to be the new master.
+ So clients which notice that connecting to the master fails might
+ give up trying to reestablish a connection before the new master
+ becomes available. 
+ 
+ 
+ Setting this parameter to a reasonable time makes the library retry
+ all the hosts cyclically until the new master appears.
+ 
+ 
+ 
  
   client_encoding
   
@@ -7161,6 +7237,18 @@ user=admin
An example file is provided at
share/pg_service.conf.sample.