Re: [HACKERS] Why doesn't src/backend/port/win32/socket.c implement bind()?

2016-01-10 Thread Amit Kapila
On Sun, Jan 10, 2016 at 11:55 PM, Tom Lane  wrote:
>
> Some of the Windows buildfarm members occasionally fail like this:
>
> LOG:  could not bind IPv4 socket: No error
> HINT:  Is another postmaster already running on port 64470? If not, wait
a few seconds and retry.
> WARNING:  could not create listen socket for "127.0.0.1"
> FATAL:  could not create any TCP/IP sockets
>
> (bowerbird, in particular, has a few recent examples)
>
> I think the reason why we're getting "No error" instead of a useful
> strerror report is that socket.c doesn't provide an implementation
> of bind() that includes TranslateSocketError().
>

listen() also doesn't have such an implementation, and probably a few
others don't either.

>  Why is that?
>

Not sure, but I can see that bind and listen don't have equivalent
WinSock APIs (checked in winsock2.h), and while googling on the same,
I found that there are reasons [1] why Windows Sockets doesn't have
equivalents for some of the socket APIs.

I think we should add win32 wrappers over the bind and listen
APIs which ensure that TranslateSocketError() is called in
error cases.
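
Something like the following is what I have in mind (just a sketch,
modeled on the existing pgwin32_* wrappers in socket.c; the wrapper
name and exact shape are my assumptions):

int
pgwin32_bind(SOCKET s, struct sockaddr *addr, int addrlen)
{
    int         res;

    res = bind(s, addr, addrlen);
    if (res < 0)
        TranslateSocketError();     /* map WSAGetLastError() to errno */
    return res;
}

plus an equivalent wrapper for listen(), presumably hooked up via a
#define in win32.h the way the other socket wrappers are.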


[1] -
http://stackoverflow.com/questions/3255899/why-are-there-wsa-pendants-for-socket-connect-send-and-so-on-but-not-fo

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] PATCH: add pg_current_xlog_flush_location function

2016-01-10 Thread Michael Paquier
On Mon, Jan 11, 2016 at 1:15 PM, Amit Kapila  wrote:
> On Mon, Jan 11, 2016 at 3:29 AM, Tomas Vondra 
> wrote:
>>
>> Hi,
>>
>> On 12/13/2015 08:38 PM, Tomas Vondra wrote:
>>>
>>> Hi,
>>>
>>> On 12/13/2015 06:13 AM, Amit Kapila wrote:
>>>>
>>>> ...
>>>>
>>>> Is there a reason why you can't use existing function
>>>> GetFlushRecPtr() in xlog.c?
>>>
>>>
>>> No, not really. I think I somehow missed that function when writing
>>> the initial version of the patch. Will fix in v2 of the patch.
>>
>>
>> Hmm, so I've been looking at this, and I've realized that I've written it
>> like this because that's pretty much what pg_current_xlog_location() does.
>> It calls GetXLogWriteRecPtr which does this:
>>
>> /*
>>  * Get latest WAL write pointer
>>  */
>> XLogRecPtr
>> GetXLogWriteRecPtr(void)
>> {
>> SpinLockAcquire(&XLogCtl->info_lck);
>> LogwrtResult = XLogCtl->LogwrtResult;
>> SpinLockRelease(&XLogCtl->info_lck);
>>
>> return LogwrtResult.Write;
>> }
>>
>> so the patch does the same thing, except that I've returned "Flush".
>>
>> OTOH GetFlushRecPtr does this:
>>
>> XLogRecPtr
>> GetFlushRecPtr(void)
>> {
>> XLogRecPtr  recptr;
>>
>> SpinLockAcquire(&XLogCtl->info_lck);
>> recptr = XLogCtl->LogwrtResult.Flush;
>> SpinLockRelease(&XLogCtl->info_lck);
>>
>> return recptr;
>> }
>>
>> i.e. it does not update LogwrtResult, the local private copy. Not sure
>> what's appropriate here ...
>>
>
> I think that for the purpose of exposing the new API
> pg_current_xlog_flush_location(), there is no reason why it has to
> update the local variable LogwrtResult, although doing it either way
> seems to be okay.  However, introducing the new function
> GetXLogFlushRecPtr() seems redundant.  The internal function
> (GetXLogInsertRecPtr()) used for the API pg_current_xlog_insert_location()
> doesn't update the local variable either, which indicates that calling
> the existing function GetFlushRecPtr() is sufficient for
> pg_current_xlog_flush_location().

Updating LogwrtResult directly when calling your new function
GetXLogFlushRecPtr() does not strike me as a particularly good idea,
because of this early-exit test in XLogFlush():

    /* Quick exit if already known flushed */
    if (record <= LogwrtResult.Flush)
        return;

The same goes for GetXLogWriteRecPtr(): we had better read the shared
value into an independent variable, because there are checks that rely
on the cached LogwrtResult.  For now it does not matter much for the
write position, because all the code paths doing those checks
explicitly update the pointer again before looking at it, but it seems
to me that using an independent variable would make the code more
robust.
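
To be clear, what I am suggesting would look something like this (a
sketch only, not a tested patch):

XLogRecPtr
GetXLogWriteRecPtr(void)
{
    XLogRecPtr  recptr;

    SpinLockAcquire(&XLogCtl->info_lck);
    recptr = XLogCtl->LogwrtResult.Write;
    SpinLockRelease(&XLogCtl->info_lck);

    return recptr;
}

i.e. mirror GetFlushRecPtr() and leave the backend-local LogwrtResult
cache alone.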
-- 
Michael




Re: [HACKERS] No Issue Tracker - Say it Ain't So!

2016-01-10 Thread Robert Haas
On Fri, Jan 8, 2016 at 3:37 AM, Magnus Hagander  wrote:
>> /me waves hand
>>
>> There are quite a few contributing companies that likely have people that
>> could help out with this in an educated fashion but aren't coders.
>
> Sort of like how they could also have helped out with cf management?

I agree with this sentiment.  But let me also say this: managing a
CommitFest well is a Hard Job.  It takes a very sizable amount of
time, a fair amount of technical knowledge, and an awful lot of
emotional energy.  It's a completely thankless task: if you do it
well, some people are unhappy because their patches got bumped; if you
do it poorly, other people (or the same ones) are unhappy because the
CommitFest lasted too long.  If you use the wrong tone in writing an
email to somebody, they will flame you as if you were trying to hurt
them rather than, as is actually the case, trying to help the
community.  And you can't make anybody do anything: if not enough
reviewers turn up, or not enough committers turn up, you will fail,
even if you do everything right.  Something under half of the attempts
to do this over the years have succeeded brilliantly (at least, IMHO);
many others have been indifferent successes or else failures, despite
the attempts having been made by community members of long standing.
It's just a really tough problem - you've got to be a combination
motivational speaker, technical whiz, diplomat, and organizational
guru to succeed.

It's unclear to me whether administering the bug tracker would be an
easier job or not.  I think it would probably be somewhat easier.
There's probably not a lot that's really subjective about deciding which
bugs we've fixed and which ones we're not going to fix.  I think the biggest
problem would likely be that we might occasionally have cases where
somebody submits something and somebody from the community replies to
say "we don't really care" and then the bug gets closed and the
submitter gets an email and goes ballistic, and unproductive arguing
ensues.  It will be important to have an understanding that the person
updating the tracker is merely trying to make it reflect the expressed
will of the community, not trying to determine that will.  If somebody
disagrees with the status applied to the bug, they should argue with
the community members who said "we don't care", not the person who set
the status to won't-fix.

If we had an "issue" tracker rather than a bug tracker, I'd expect a
lot more unproductive bickering.  The consensus around which features
are worth having changes frequently, and a lot of features that nobody
desperately objects to are nevertheless awfully low-value and not
worth tracking until somebody comes up with a patch.  Development of
feature X often changes the perceived value of feature Y, sometimes in
a positive way and sometimes in a negative way.  I don't really have
any use for a system that's just a random collection of things
somebody wanted at some point, which is basically what the TODO list
is.  A list of unfixed bugs, though, sounds like it might have real
utility, particularly if we got some volunteers to apply tags to them
so we could search for "multixact" or whatever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Apparently deprecated code in planner.c

2016-01-10 Thread Robert Haas
On Sun, Jan 10, 2016 at 5:28 PM, Dickson S. Guedes  wrote:
> Hi all,
>
> I'm wondering whether the #ifdef FORCE_PARALLEL_MODE code [1] was deprecated:
>
>  *
>  * FIXME: It's assumed that code further down will set parallelModeNeeded
>  * to true if a parallel path is actually chosen.  Since the core
>  * parallelism code isn't committed yet, this currently never happens.
>  */
> #ifdef FORCE_PARALLEL_MODE
> glob->parallelModeNeeded = glob->parallelModeOK;
> #else
> glob->parallelModeNeeded = false;
> #endif

That comment is obsolete, but defining FORCE_PARALLEL_MODE is still a
useful thing to do for testing purposes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] No Issue Tracker - Say it Ain't So!

2016-01-10 Thread Joshua D. Drake

On 01/10/2016 06:19 PM, Robert Haas wrote:

> [snip]

No arguments with what was written above.

> If we had an "issue" tracker rather than a bug tracker, I'd expect a
> lot more unproductive bickering.

This I disagree with. It wouldn't be any worse than what we have now. An
issue tracker (if it is a good one) becomes the forum, whether you use
an email or a web interface. So expect at least as much banter as there
is on the lists, but with a much easier management interface.


JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Improved tab completion for FDW DDL

2016-01-10 Thread Peter Eisentraut
The second part is not necessary, because there is already code that
completes FDW names after "FOREIGN DATA WRAPPER".  So this already works.

The other two parts are valid improvements, but they should be done
consistently across commands.  So please

- Also complete RENAME TO in ALTER FOREIGN DATA WRAPPER.

- Also complete OPTIONS in FOREIGN DATA WRAPPER and SERVER commands.




Re: [HACKERS] [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

2016-01-10 Thread Robert Haas
On Sun, Jan 10, 2016 at 3:50 PM, Simon Riggs  wrote:
> On 10 January 2016 at 16:32, Robert Haas  wrote:
>>
>> On Sat, Jan 9, 2016 at 5:13 AM, Simon Riggs  wrote:
>> > Avoid pin scan for replay of XLOG_BTREE_VACUUM
>> > Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
>> > require
>> > complex interlocking that matched the requirements on the master. This
>> > required
>> > an O(N) operation that became a significant problem with large indexes,
>> > causing
>> > replication delays of seconds or in some cases minutes while the
>> > XLOG_BTREE_VACUUM was replayed.
>> >
>> > This commit skips the “pin scan” that was previously required, by
>> > observing in
>> > detail when and how it is safe to do so, with full documentation. The
>> > pin scan
>> > is skipped only in replay; the VACUUM code path on master is not touched
>> > here.
>> >
>> > The current commit still performs the pin scan for toast indexes, though
>> > this
>> > can also be avoided if we recheck scans on toast indexes. Later patch
>> > will
>> > address this.
>> >
>> > No tests included. Manual tests using an additional patch to view WAL
>> > records
>> > and their timing have shown the change in WAL records and their handling
>> > has
>> > successfully reduced replication delay.
>>
>> I suspect I might be missing something here, but I don't see how a
>> test against rel->rd_rel->relnamespace can work during recovery.
>> Won't the relcache entry we're looking at here be one created by
>> CreateFakeRelcacheEntry(), and thus that field won't be valid?
>
> The test isn't made during recovery, it's made on the master.

/me crawls into a hole.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ExecGather() + nworkers

2016-01-10 Thread Robert Haas
On Sun, Jan 10, 2016 at 4:44 PM, Peter Geoghegan  wrote:
>> I don't really understand why this should be so.  I thought the idea
>> of parallel sort is (roughly) that each worker should read data until
>> it fills work_mem, sort that data, and write a tape.  Repeat until no
>> data remains.  Then, merge the tapes.  I don't see any reason at all
>> why this shouldn't work just fine with a leader and 1 worker.
>
> It will work fine with a leader and 1 worker -- the code will be
> correct, and without any special cases. But it will be a suboptimal
> use of resources. From the caller's point of view, there is no reason
> to think it will be faster, and some reason to think it will be
> slower. A particular concern for parallel sort is that the sort might
> not use enough memory to need to be an external sort, but you
> effectively force it to be one by making it a parallel sort (that is
> not ideal in the long term, but it's a good compromise for 9.6's
> parallel index build stuff). You're also consuming a
> BackgroundWorkerSlot for the duration of the sort, in an environment
> where evidently those are in short supply.

Well, in general, the parallel sort code doesn't really get to pick
whether or not a BackgroundWorkerSlot gets used.  Whoever creates the
parallel context decides how many workers to request, and then the
context gets as many of those as it can.  It then does arbitrary
computation, which at some point in the middle involves one or more
parallel sorts.  You can't just have one of those workers up
and exit in the middle.  Now, in the specific case of parallel index
build, you probably can do that, if you want to.  But to be honest,
I'd be inclined not to include that in the first version.  If you get
fewer workers than you asked for, just use the number you got.  Let's
see how that actually works out before we decide that we need a lot
more mechanism here.  You may find that it's surprisingly effective to
do it this way.

> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).

I am surprised that this is not straightforward.  I don't see why it
shouldn't be, and it worries me that you think it isn't.

> More importantly, I have other, entirely general concerns. Other major
> RDBMSs have settings that are very similar to max_parallel_degree,
> with a setting of 1 effectively disabling all parallelism. Both Oracle
> and SQL Server have settings that they both call the "maximum degree
> of parallelism". I think it's a bit odd that with Postgres,
> max_parallel_degree = 1 can still use parallelism at all. I have to
> wonder: are we conflating controlling the resources used by parallel
> operations with how shared memory is doled out?

We could redefine things so that max_parallel_degree = N means use
N - 1 workers, with a minimum value of 1 rather than 0, if there's a
consensus that that's better.  Personally, I prefer it the way we've
got it: it's real darned clear in my mind that max_parallel_degree=0
means "not parallel".  But I won't cry into my beer if a consensus
emerges that the other way would be better.

> I could actually "give back" my parallel worker slots early if I
> really wanted to (that would be messy, but the then-quiesced workers
> do nothing for the duration of the merge beyond conceptually owning
> the shared tape temp files). I don't think releasing the slots early
> makes sense, because I tend to think that hanging on to the workers
> helps the DBA in managing the server's resources. The still-serial
> merge phase is likely to become a big bottleneck with parallel sort.

Like I say, the sort code better not know anything about this
directly, or it's going to break when embedded in a query.

> With parallel sequential scan, a max_parallel_degree of 8 could result
> in 16 processes scanning in parallel. That's a concern, and not least
> because it happens only sometimes, when things are timed just right.
> The fact that only half of those processes are "genuine" workers seems
> to me like a distinction without a difference.

This seems dead wrong.  A max_parallel_degree of 8 means you have a
leader and 8 workers.  Where are the other 7 processes coming from?
What you should have is 8 processes each of which is participating in
both the parallel seq scan and the parallel sort, not 8 processes
scanning and 8 separate processes sorting.

>> I think that's probably over-engineered.  I mean, it wouldn't be that
>> hard to have the workers just exit if you decide you don't want them,
>> and I don't really want to make the signaling here more complicated
>> than it really needs to be.

Re: [HACKERS] POC, WIP: OR-clause support for indexes

2016-01-10 Thread David Rowley
On 27 December 2015 at 07:04, Teodor Sigaev  wrote:

> I'd like to present OR-clause support for indexes. Although OR-clauses
> could be supported by bitmapOR index scan it isn't very effective and such
> scan lost any order existing in index. We (with Alexander Korotkov)
> presented results on Vienna's conference this year. In short, it provides
> performance improvement:
>
> EXPLAIN ANALYZE
> SELECT count(*) FROM tst WHERE id = 5 OR id = 500 OR id = 5000;
>  ...
>  ->  Bitmap Heap Scan on tst  (cost=...) (actual
> time=0.080..0.267 rows=173 loops=1)
>  Recheck Cond: ((id = 5) OR (id = 500) OR (id = 5000))
>  Heap Blocks: exact=172
>  ->  Bitmap Index Scan on idx_gin  (cost=0.00..57.50 rows=15000
> width=0) (actual time=0.059..0.059 rows=147 loops=1)
>  Index Cond: ((id = 5) OR (id = 500) OR (id = 5000))
>  Planning time: 0.077 ms
>  Execution time: 0.308 ms   <---
> QUERY PLAN
>
> ---
>  Aggregate  (cost=51180.53..51180.54 rows=1 width=0) (actual
> time=796.766..796.766 rows=1 loops=1)
>->  Index Only Scan using idx_btree on tst  (cost=0.42..51180.40
> rows=55 width=0) (actual time=0.444..796.736 rows=173 loops=1)
>  Filter: ((id = 5) OR (id = 500) OR (id = 5000))
>  Rows Removed by Filter: 999829
>  Heap Fetches: 102
>  Planning time: 0.087 ms
>  Execution time: 796.798 ms  <--
> QUERY PLAN
>
> -
>  Aggregate  (cost=21925.63..21925.64 rows=1 width=0) (actual
> time=160.412..160.412 rows=1 loops=1)
>->  Seq Scan on tst  (cost=0.00..21925.03 rows=237 width=0) (actual
> time=0.535..160.362 rows=175 loops=1)
>  Filter: ((id = 5) OR (id = 500) OR (id = 5000))
>  Rows Removed by Filter: 999827
>  Planning time: 0.459 ms
>  Execution time: 160.451 ms
>
>
> It also could work together with KNN feature of GiST and in this case
> performance improvement could be up to several orders of magnitude, in
> artificial example it was 37000 times faster.
>
> Not all indexes can support OR-clauses; the patch adds support to GIN,
> GiST and BRIN indexes. The pg_am table is extended with a new
> amcanorclause column which indicates that the index can execute an
> OR-clause.
>
> indexqual and indexqualorig no longer contain an implicitly-ANDed list
> of index qual expressions; those lists can now contain an OR
> RestrictInfo. Actually, the patch just tries to convert a BitmapOr node
> to an IndexScan or IndexOnlyScan. That significantly simplifies the
> logic for finding the possible clause list for an index.
> The index always gets an array of ScanKeys, but for indexes which
> support OR-clauses the array of ScanKeys is actually an execution tree
> in reverse Polish notation form. The transformation is done in
> ExecInitIndexScan().
>
> The problems on the way which I see for now:
> 1 Calculating cost. Right now it's just a simple transformation of the
> costs computed for the BitmapOr path. I'd like to hope that's possible,
> so the indexes' estimation functions could be left untouched and could
> keep believing that all clauses are implicitly ANDed.
> 2 I'd like to add such support to btree, but it seems that it should be
> a separate patch. The btree search algorithm doesn't use any kind of
> stack of pages, and the algorithm to walk over the btree isn't clear to
> me yet.
> 3 I could have missed some places which still assume an implicitly-ANDed
> list of clauses, although the regression tests pass fine.
>
> I hope hackers will not have any strong objections to doing that. But
> obviously the patch requires further work, and I'd like to see comments,
> suggestions and recommendations. Thank you.


Hi,

I'd like to see comments too! but more so in the code. :) I've had a
look over this, and it seems like a great area in which we could
improve, and your reported performance improvements are certainly very
interesting too. However, I'm finding the code rather hard to follow,
which might be partly down to my lack of familiarity with the index
code, but more likely it's the lack of comments to explain what's going
on. Let's just take one function as an example:

Here there's not a single comment, so I'm just going to try to work out
what's going on based on the code.

+static void
+compileScanKeys(IndexScanDesc scan)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ int *stack,
+ stackPos = -1,
+ i;
+
+ if (scan->numberOfKeys <= 1 || so->useExec == false)
+ return;
+
+ Assert(scan->numberOfKeys >=3);

Why can numberOfKeys never be 2? I looked at what calls this and I can't
really work it out. I'm also really not sure what useExec means, as
there's no comment on that struct member; and what if numberOfKeys == 1
and useExec == false, won't this Assert() fail? If that's not a possible
situation, then why not?
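
For anyone else trying to follow along, here's a toy, self-contained
sketch (mine, purely illustrative; it is not the patch's code) of how
an array in reverse Polish notation form can encode and evaluate an
AND/OR tree over scan key results, which is what I understand the patch
to be doing with its ScanKey array:

#include <stdbool.h>
#include <stdio.h>

/* toy opcodes; a real ScanKey carries much more than this */
typedef enum { KEY, OP_AND, OP_OR } ItemKind;

typedef struct
{
    ItemKind    kind;
    bool        keyresult;      /* for KEY: did this scan key match? */
} RpnItem;

/* evaluate an RPN-encoded boolean tree with an explicit stack */
static bool
eval_rpn(const RpnItem *items, int nitems)
{
    bool        stack[64];      /* toy fixed-size stack */
    int         top = -1;

    for (int i = 0; i < nitems; i++)
    {
        if (items[i].kind == KEY)
            stack[++top] = items[i].keyresult;
        else
        {
            bool        b = stack[top--];
            bool        a = stack[top--];

            stack[++top] = (items[i].kind == OP_AND) ? (a && b) : (a || b);
        }
    }
    return stack[0];
}

int
main(void)
{
    /* (k1 OR k2) AND k3, in postfix: k1 k2 OR k3 AND */
    RpnItem     expr[] = {
        {KEY, false}, {KEY, true}, {OP_OR, false},
        {KEY, true}, {OP_AND, false}
    };

    printf("%d\n", eval_rpn(expr, 5));  /* prints 1 */
    return 0;
}

If the encoding works like that, it would at least explain the Assert:
a well-formed postfix expression built from binary operators always has
an odd number of items (one more operand than operators), so after the
numberOfKeys <= 1 early return the next possible size is 3.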

Re: [HACKERS] ExecGather() + nworkers

2016-01-10 Thread Amit Kapila
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan  wrote:
>
> On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas 
wrote:
> >> I'm not sure why the test for nworkers following the
> >> LaunchParallelWorkers() call doesn't look like this, though:
> >>
> >> /* Set up tuple queue readers to read the results. */
> >> if (pcxt->nworkers_launched > 0)
> >> {
> >> ...
> >> }
> >
> > Hmm, yeah, I guess it could do that.
>
> That would make it clearer as an example.
>
> >> But going to this additional trouble (detecting no workers launched on
> >> the basis of !nworkers_launched) suggests that simply testing
> >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> >> do that, and in so doing also totally remove the "for" loop shown
> >> here?
> >
> > I don't see how the for loop goes away.
>
> I meant that some code in the "for" loop goes away. Not all of it.
> Just the more obscure code. As I said, I'm mostly pointing this out
> out of concern for making it clearer as example code.
>

Right, I can write a patch to do it the way you are suggesting, if you
are not planning to do it yourself.

>
> >> In the case of parallel sequential scan, it looks like one worker can
> >> be helpful, because then the gather node (leader process) can run the
> >> plan itself to some degree, and so there are effectively 2 processes
> >> scanning at a minimum (unless 0 workers could be allocated to begin
> >> with). How useful is it to have a parallel scan when this happens,
> >> though?
> >
> > Empirically, that's really quite useful.  When you have 3 or 4
> > workers, the leader really doesn't make a significant contribution to
> > the work, but what I've seen in my testing is that 1 worker often runs
> > almost twice as fast as 0 workers.
>
> I suppose that makes sense, given that parallel sequential scan works
> best when most tuples are eliminated in workers; there ought not to be
> many tuples filling the single worker's queue anyway.
>
> > I don't really understand why this should be so.  I thought the idea
> > of parallel sort is (roughly) that each worker should read data until
> > it fills work_mem, sort that data, and write a tape.  Repeat until no
> > data remains.  Then, merge the tapes.  I don't see any reason at all
> > why this shouldn't work just fine with a leader and 1 worker.
>
> It will work fine with a leader and 1 worker -- the code will be
> correct, and without any special cases. But it will be a suboptimal
> use of resources. From the caller's point of view, there is no reason
> to think it will be faster, and some reason to think it will be
> slower. A particular concern for parallel sort is that the sort might
> not use enough memory to need to be an external sort, but you
> effectively force it to be one by making it a parallel sort (that is
> not ideal in the long term, but it's a good compromise for 9.6's
> parallel index build stuff). You're also consuming a
> BackgroundWorkerSlot for the duration of the sort, in an environment
> where evidently those are in short supply.
>
> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).
>

If I understand correctly, you are worried about the case where, if the
leader is not able to launch the minimum required number of workers,
a parallel index build will be slower than a serial index build.
I think it is genuine to worry about such cases, but it won't be
difficult to just make parallel execution behave as serial execution
(basically, you need to get all the work done by the leader).  Now, one
could worry that there will be some overhead of setting up and
destroying workers in this case, but I think that could be treated as
a limitation of the initial version of the implementation, and if such
a case turns out to be common in general usage, then we could add some
mechanism to reserve the workers and start parallelism only when
the leader is able to reserve the required number of workers.
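
As a very rough sketch of the kind of fallback I mean (the surrounding
structure is an assumption; only LaunchParallelWorkers(),
DestroyParallelContext() and nworkers_launched are taken from the real
API):

LaunchParallelWorkers(pcxt);

if (pcxt->nworkers_launched == 0)
{
    /*
     * No workers could be started.  Tear down the parallel context
     * and do all of the work in the leader, exactly as a serial
     * index build would.
     */
    DestroyParallelContext(pcxt);
    ... perform the serial build ...
}
else
{
    ... proceed with the parallel build; the leader participates ...
}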


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] PATCH: add pg_current_xlog_flush_location function

2016-01-10 Thread Amit Kapila
On Mon, Jan 11, 2016 at 3:29 AM, Tomas Vondra 
wrote:

> Hi,
>
> On 12/13/2015 08:38 PM, Tomas Vondra wrote:
>
>> Hi,
>>
>> On 12/13/2015 06:13 AM, Amit Kapila wrote:
>>>
>>> ...
>>>
>>> Is there a reason why you can't use existing function
>>> GetFlushRecPtr() in xlog.c?
>>>
>>
>> No, not really. I think I somehow missed that function when writing
>> the initial version of the patch. Will fix in v2 of the patch.
>>
>
> Hmm, so I've been looking at this, and I've realized that I've written it
> like this because that's pretty much what pg_current_xlog_location() does.
> It calls GetXLogWriteRecPtr which does this:
>
> /*
>  * Get latest WAL write pointer
>  */
> XLogRecPtr
> GetXLogWriteRecPtr(void)
> {
> SpinLockAcquire(&XLogCtl->info_lck);
> LogwrtResult = XLogCtl->LogwrtResult;
> SpinLockRelease(&XLogCtl->info_lck);
>
> return LogwrtResult.Write;
> }
>
> so the patch does the same thing, except that I've returned "Flush".
>
> OTOH GetFlushRecPtr does this:
>
> XLogRecPtr
> GetFlushRecPtr(void)
> {
> XLogRecPtr  recptr;
>
> SpinLockAcquire(&XLogCtl->info_lck);
> recptr = XLogCtl->LogwrtResult.Flush;
> SpinLockRelease(&XLogCtl->info_lck);
>
> return recptr;
> }
>
> i.e. it does not update LogwrtResult, the local private copy. Not sure
> what's appropriate here ...
>
>
I think that for the purpose of exposing the new API
pg_current_xlog_flush_location(), there is no reason why it has to
update the local variable LogwrtResult, although doing it either way
seems to be okay.  However, introducing the new function
GetXLogFlushRecPtr() seems redundant.  The internal function
(GetXLogInsertRecPtr()) used for the API pg_current_xlog_insert_location()
doesn't update the local variable either, which indicates that calling
the existing function GetFlushRecPtr() is sufficient
for pg_current_xlog_flush_location().


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2016-01-10 Thread Amit Kapila
On Sat, Jan 9, 2016 at 7:10 PM, Andres Freund  wrote:
>
> On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
> > Right that won't be acceptable, however I think with your latest
> > proposal [1]
>
> Sure, that'd address that problem.
>
>
> > [...] think that idea will help to mitigate the problem of backend and
> > bgwriter writes as well.  In that, can't we do it with the help of
> > existing infrastructure of *pendingOpsTable* and
> > *CheckpointerShmem->requests[]*, as already the flush requests are
> > remembered in those structures, we can use those to apply your idea to
> > issue flush requests.
>
> Hm, that might be possible. But that might have some bigger implications
> - we currently can issue thousands of flush requests a second, without
> much chance of merging. I'm not sure it's a good idea to overlay that
> into the lower frequency pendingOpsTable.
>

In that case, we could have a unified structure to remember flush
requests, rather than backends and the bgwriter noting that information
in CheckpointerShmem and the checkpointer in pendingOpsTable.  I
understand there are some benefits to having pendingOpsTable, but having
a common structure seems more beneficial, in particular because it can
be used for the purpose of flush hints.

Now, I am sure we can invent a new way of tracking the flush
requests for flush hints, but I think we might want to consider why
we can't have one unified way of tracking flush requests which
can be used both for *flushes* and *flush hints*.
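
To make that concrete, a unified request could look something like this
(purely speculative on my part; only RelFileNode, ForkNumber and
BlockNumber are existing types, the rest is made up for illustration):

typedef enum FlushRequestKind
{
    FLUSH_REQUEST_FSYNC,        /* must be fsync'd by the checkpointer */
    FLUSH_REQUEST_HINT          /* advisory writeback hint; may be lost */
} FlushRequestKind;

typedef struct FlushRequest
{
    FlushRequestKind kind;
    RelFileNode rnode;          /* which relation file */
    ForkNumber  forknum;
    BlockNumber offset;         /* start of the dirty range */
    BlockNumber nblocks;        /* length of the dirty range */
} FlushRequest;

Backends, the bgwriter and the checkpointer would all append to one
queue of these, and only the FLUSH_REQUEST_FSYNC entries would have
the must-not-be-lost semantics.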

> Backends having to issue
> fsyncs because the pending fsync queue is full is darn expensive. In
> contrast to that a 'flush hint' request getting lost doesn't cost that
> much.
>

In general, I think the cases where backends have to do the flush
themselves should be rare, as the size of the fsync queue is NBuffers
and we take care of absorbing duplicate fsync requests for the same
buffer.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] ExecGather() + nworkers

2016-01-10 Thread Pavel Stehule
>
> > More importantly, I have other, entirely general concerns. Other major
> > RDBMSs have settings that are very similar to max_parallel_degree,
> > with a setting of 1 effectively disabling all parallelism. Both Oracle
> > and SQL Server have settings that they both call the "maximum degree
> > of parallelism". I think it's a bit odd that with Postgres,
> > max_parallel_degree = 1 can still use parallelism at all. I have to
> > wonder: are we conflating controlling the resources used by parallel
> > operations with how shared memory is doled out?
>
> We could redefine things so that max_parallel_degree = N means use
> N - 1 workers, with a minimum value of 1 rather than 0, if there's a
> consensus that that's better.  Personally, I prefer it the way we've
> got it: it's real darned clear in my mind that max_parallel_degree=0
> means "not parallel".  But I won't cry into my beer if a consensus
> emerges that the other way would be better.
>
>
If max_parallel_degree were renamed to max_query_workers or something
similar, then the new metric would make sense, and could be more
intuitive.

Regards

Pavel



> > I could actually "give back" my parallel worker slots early if I
> > really wanted to (that would be messy, but the then-quiesced workers
> > do nothing for the duration of the merge beyond conceptually owning
> > the shared tape temp files). I don't think releasing the slots early
> > makes sense, because I tend to think that hanging on to the workers
> > helps the DBA in managing the server's resources. The still-serial
> > merge phase is likely to become a big bottleneck with parallel sort.
>
> Like I say, the sort code better not know anything about this
> directly, or it's going to break when embedded in a query.
>
> > With parallel sequential scan, a max_parallel_degree of 8 could result
> > in 16 processes scanning in parallel. That's a concern, and not least
> > because it happens only sometimes, when things are timed just right.
> > The fact that only half of those processes are "genuine" workers seems
> > to me like a distinction without a difference.
>
> This seems dead wrong.  A max_parallel_degree of 8 means you have a
> leader and 8 workers.  Where are the other 7 processes coming from?
> What you should have is 8 processes each of which is participating in
> both the parallel seq scan and the parallel sort, not 8 processes
> scanning and 8 separate processes sorting.
>
> >> I think that's probably over-engineered.  I mean, it wouldn't be that
> >> hard to have the workers just exit if you decide you don't want them,
> >> and I don't really want to make the signaling here more complicated
> >> than it really needs to be.
> >
> > I worry about the additional overhead of constantly starting and
> > stopping a single worker in some cases (not so much with parallel
> > index build, but other uses of parallel sort beyond 9.6). Furthermore,
> > the coordination between worker and leader processes to make this
> > happen seems messy -- you actually have the postmaster launch
> > processes, but they must immediately get permission to do anything.
> >
> > It wouldn't be that hard to offer a general way of doing this, so why
> not?
>
> Well, if these things become actual problems, fine, we can fix them.
> But let's not decide to add the API before we're agreed that we need
> it to solve an actual problem that we both agree we have.  We are not
> there yet.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


Re: [HACKERS] [COMMITTERS] pgsql: Blind attempt at a Cygwin fix

2016-01-10 Thread Michael Paquier
On Sun, Jan 10, 2016 at 8:10 AM, Michael Paquier
 wrote:
> On Sun, Jan 10, 2016 at 2:00 AM, Andrew Dunstan  wrote:
>> I downloaded the official Cygwin packages into a Cygwin instance and checked
>> how they do things. As I rather expected, they do not use pg_ctl at all to
>> install or run as a service. Rather, they use the standard Cygwin service
>> utility cygrunsrv. This is all managed via a SYSV style init script.
>
> Thanks for the investigation!
>
>> So if anything I'd be inclined to disable all the service-related code in
>> pg_ctl for Cygwin, and treat it just as we treat Unix.
>
> We had better do the same for back branches then. Need of a patch?

OK, here is a patch to disable all the service-related code in pg_ctl
for Cygwin. This time it is not a blind shot, and this compiles
correctly. Changing the option layer is fine with me if this is
HEAD-only. For back branches, I would suggest doing nothing: the
service-related code paths are not going to run anyway, with any output
going to stderr.
-- 
Michael
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 919d764..192f587 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -105,7 +105,7 @@ static char backup_file[MAXPGPATH];
 static char recovery_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
 
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 static DWORD pgctl_start_type = SERVICE_AUTO_START;
 static SERVICE_STATUS status;
 static SERVICE_STATUS_HANDLE hStatus = (SERVICE_STATUS_HANDLE) 0;
@@ -133,7 +133,7 @@ static void do_kill(pgpid_t pid);
 static void print_msg(const char *msg);
 static void adjust_data_dir(void);
 
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 #if (_MSC_VER >= 1800)
 #include 
 #else
@@ -165,7 +165,7 @@ static void unlimit_core_size(void);
 #endif
 
 
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 static void
 write_eventlog(int level, const char *line)
 {
@@ -207,20 +207,11 @@ write_stderr(const char *fmt,...)
 	va_list		ap;
 
 	va_start(ap, fmt);
-#if !defined(WIN32) && !defined(__CYGWIN__)
+#ifndef WIN32
 	/* On Unix, we just fprintf to stderr */
 	vfprintf(stderr, fmt, ap);
 #else
 
-/*
- * On Cygwin, we don't yet have a reliable mechanism to detect when
- * we're being run as a service, so fall back to the old (and broken)
- * stderr test.
- */
-#ifdef __CYGWIN__
-#define	pgwin32_is_service()	(isatty(fileno(stderr)))
-#endif
-
 	/*
 	 * On Win32, we print to stderr if running on a console, or write to
 	 * eventlog if running as a service
@@ -718,7 +709,7 @@ test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint)
 #endif
 
 		/* No response, or startup still in process; wait */
-#if defined(WIN32)
+#ifdef WIN32
 		if (do_checkpoint)
 		{
 			/*
@@ -1342,7 +1333,7 @@ do_kill(pgpid_t pid)
 	}
 }
 
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 
 #if (_MSC_VER < 1800)
 static bool
@@ -1408,20 +1399,6 @@ pgwin32_CommandLine(bool registration)
 		}
 	}
 
-#ifdef __CYGWIN__
-	/* need to convert to windows path */
-	{
-		char		buf[MAXPGPATH];
-
-#if CYGWIN_VERSION_DLL_MAJOR >= 1007
-		cygwin_conv_path(CCP_POSIX_TO_WIN_A, cmdPath, buf, sizeof(buf));
-#else
-		cygwin_conv_to_full_win32_path(cmdPath, buf);
-#endif
-		strcpy(cmdPath, buf);
-	}
-#endif
-
 	/* if path does not end in .exe, append it */
 	if (strlen(cmdPath) < 4 ||
 		pg_strcasecmp(cmdPath + strlen(cmdPath) - 4, ".exe") != 0)
@@ -1926,7 +1903,7 @@ CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_ser
 	 */
 	return r;
 }
-#endif   /* defined(WIN32) || defined(__CYGWIN__) */
+#endif   /* WIN32 */
 
 static void
 do_advice(void)
@@ -1950,7 +1927,7 @@ do_help(void)
 	printf(_("  %s status  [-D DATADIR]\n"), progname);
 	printf(_("  %s promote [-D DATADIR] [-s]\n"), progname);
 	printf(_("  %s killSIGNALNAME PID\n"), progname);
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 	printf(_("  %s register   [-N SERVICENAME] [-U USERNAME] [-P PASSWORD] [-D DATADIR]\n"
 			 "[-S START-TYPE] [-w] [-t SECS] [-o \"OPTIONS\"]\n"), progname);
 	printf(_("  %s unregister [-N SERVICENAME]\n"), progname);
@@ -1958,7 +1935,7 @@ do_help(void)
 
 	printf(_("\nCommon options:\n"));
 	printf(_("  -D, --pgdata=DATADIR   location of the database storage area\n"));
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 	printf(_("  -e SOURCE  event source for logging when running as a service\n"));
 #endif
 	printf(_("  -s, --silent   only print errors, no informational messages\n"));
@@ -1991,7 +1968,7 @@ do_help(void)
 	printf(_("\nAllowed signal names for kill:\n"));
 	printf("  ABRT HUP INT QUIT TERM USR1 USR2\n");
 
-#if defined(WIN32) || defined(__CYGWIN__)
+#ifdef WIN32
 	printf(_("\nOptions for register and unregister:\n"));
 	printf(_("  -N SERVICENAME  service name with which to register PostgreSQL server\n"));
 	printf(_("  -P PASSWORD password of account to 

Re: [HACKERS] Speedup twophase transactions

2016-01-10 Thread Simon Riggs
On 9 January 2016 at 20:28, Stas Kelvich  wrote:

> Thanks a lot for your edits, now the patch is much cleaner.
>
> > Your comments say
> >
> >   "In case of crash replay will move data from xlog to files, if that
> hasn't happened before."
> >
> > but I don't see that in code. Can you show me where that happens?
>
> xact.c calls RecreateTwoPhaseFile in xact_redo() function (xact.c:5596)


So we've only optimized half the usage? We're still going to cause
replication delays.

Sounds like we should be fixing both.

We can either

1) Skip the fsync in RecreateTwoPhaseFile and fsync during restartpoints
instead

2) Copy the contents to shmem and write them out at restartpoint, as we
do for checkpoints (preferred)


> > On 09 Jan 2016, at 18:29, Simon Riggs  wrote:
> >
> > Hmm, I was just preparing this for commit.
> >
> > Please have a look at my mild edits and extended comments.
>
>
> One concern that came to mind while reading the updated
> patch is about creating an extra bool field in the GlobalTransactionData
> structure. While this improves readability, it
> also increases the size of that structure, and that size has an impact
> on performance on systems with many cores
> (say, 60-80). Probably one byte will not make a measurable difference,
> but I think it is a good idea to keep
> GXact as small as possible. As far as I understand, the same logic was
> behind the split of
> PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)


I think padding will negate the effects of the additional bool.
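
To illustrate the padding point with a toy example (not the actual
GlobalTransactionData layout):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct Before
{
    int64_t     a;              /* 8 bytes */
    int32_t     b;              /* 4 bytes + 4 bytes trailing padding */
} Before;

typedef struct After
{
    int64_t     a;
    int32_t     b;
    bool        valid;          /* lands in the former padding */
} After;

int
main(void)
{
    /* on typical 64-bit ABIs this prints "16 16" */
    printf("%zu %zu\n", sizeof(Before), sizeof(After));
    return 0;
}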

If we want to reduce the size of the array: GIDSIZE is currently 200,
but XA says the maximum is 128 bytes.

Does anybody know why it is set to 200?

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] GIN pending list clean up exposure to SQL

2016-01-10 Thread Julien Rouhaud
On 29/12/2015 00:30, Jeff Janes wrote:
> On Wed, Nov 25, 2015 at 9:29 AM, Jeff Janes  wrote:
>>
>> I'll prepare a patch for core for the January commitfest, and see if
>> it flies.  If not, there is always the extension to fall back to.
> 
> Here is an updated patch.  I've added type and permission checks,
> docs, and some regression tests.
> 

I just reviewed it. The patch applies cleanly and everything works as
intended, including the regression tests.

I think the function should be declared strict.

Also, there is some trailing whitespace in the documentation diff.

Regards.

> Cheers,
> 
> Jeff


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [HACKERS] Spelling corrections

2016-01-10 Thread David Rowley
On 10 January 2016 at 19:34, Peter Geoghegan  wrote:

> Attached patch fixes a couple of misspellings.
>

Snap!
http://www.postgresql.org/message-id/CAKJS1f-AGgQaurTwhY=wkJjspgDcmDUE8Yx03XTXCDz=kae...@mail.gmail.com


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

2016-01-10 Thread Robert Haas
On Sat, Jan 9, 2016 at 5:13 AM, Simon Riggs  wrote:
> Avoid pin scan for replay of XLOG_BTREE_VACUUM
> Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to 
> require
> complex interlocking that matched the requirements on the master. This 
> required
> an O(N) operation that became a significant problem with large indexes, 
> causing
> replication delays of seconds or in some cases minutes while the
> XLOG_BTREE_VACUUM was replayed.
>
> This commit skips the “pin scan” that was previously required, by observing in
> detail when and how it is safe to do so, with full documentation. The pin scan
> is skipped only in replay; the VACUUM code path on master is not touched here.
>
> The current commit still performs the pin scan for toast indexes, though this
> can also be avoided if we recheck scans on toast indexes. Later patch will
> address this.
>
> No tests included. Manual tests using an additional patch to view WAL records
> and their timing have shown the change in WAL records and their handling has
> successfully reduced replication delay.

I suspect I might be missing something here, but I don't see how a
test against rel->rd_rel->relnamespace can work during recovery.
Won't the relcache entry we're looking at here be one created by
CreateFakeRelcacheEntry(), and thus that field won't be valid?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ExecGather() + nworkers

2016-01-10 Thread Robert Haas
On Sun, Jan 10, 2016 at 12:29 AM, Peter Geoghegan  wrote:
> The Gather node executor function ExecGather() does this:
> [ code ]
> I'm not sure why the test for nworkers following the
> LaunchParallelWorkers() call doesn't look like this, though:
>
> /* Set up tuple queue readers to read the results. */
> if (pcxt->nworkers_launched > 0)
> {
> ...
> }

Hmm, yeah, I guess it could do that.

> But going to this additional trouble (detecting no workers launched on
> the basis of !nworkers_launched) suggests that simply testing
> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> do that, and in so doing also totally remove the "for" loop shown
> here?

I don't see how the for loop goes away.

> In the case of parallel sequential scan, it looks like one worker can
> be helpful, because then the gather node (leader process) can run the
> plan itself to some degree, and so there are effectively 2 processes
> scanning at a minimum (unless 0 workers could be allocated to begin
> with). How useful is it to have a parallel scan when this happens,
> though?

Empirically, that's really quite useful.  When you have 3 or 4
workers, the leader really doesn't make a significant contribution to
the work, but what I've seen in my testing is that 1 worker often runs
almost twice as fast as 0 workers.

> I guess it isn't obvious to me how to reliably back out of not being
> able to launch at least 2 workers in the case of my parallel index
> build patch, because I suspect 2 workers (plus the leader process) are
> the minimum number that will make index builds faster. Right now, it
> looks like I'll have to check nworkers_launched in the leader (which
> will be the only process to access the ParallelContext, since it's in
> its local memory). Then, having established that there are at least
> the minimum useful number of worker processes launched for sorting,
> the leader can "permit" worker processes to "really" start based on
> changing some state in the TOC/segment in common use. Otherwise, the
> leader must call the whole thing off and do a conventional, serial
> index build, even though technically the main worker process function
> has started execution in worker processes.

I don't really understand why this should be so.  I thought the idea
of parallel sort is (roughly) that each worker should read data until
it fills work_mem, sort that data, and write a tape.  Repeat until no
data remains.  Then, merge the tapes.  I don't see any reason at all
why this shouldn't work just fine with a leader and 1 worker.

> I think what might be better is a general solution to my problem,
> which I imagine will crop up again and again as new clients are added.
> I would like an API that lets callers of LaunchParallelWorkers() only
> actually launch *any* worker on the basis of having been able to
> launch some minimum sensible number (typically 2). Otherwise, indicate
> failure, allowing callers to call the whole thing off in a general
> way, without the postmaster having actually launched anything, and
> without custom "call it all off" code for parallel index builds. This
> would probably involve introducing a distinction between a
> BackgroundWorkerSlot being "reserved" rather than "in_use", lest the
> postmaster accidentally launch 1 worker process before we established
> definitively that launching any is really a good idea.

I think that's probably over-engineered.  I mean, it wouldn't be that
hard to have the workers just exit if you decide you don't want them,
and I don't really want to make the signaling here more complicated
than it really needs to be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] strange CREATE INDEX tab completion cases

2016-01-10 Thread Peter Eisentraut
On 12/13/15 9:16 AM, Michael Paquier wrote:
> Please see the attached to address those things (and others) with
> extra fixes for a couple of comments.

I have ported these changes to the new world order and divided them up
into more logical changes that are more clearly documented.  Please
check that this matches what you had in mind.
From bce311ab08053aa6d21b31625a1be85a33138300 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 10 Jan 2016 11:24:51 -0500
Subject: [PATCH 1/3] psql: Update tab completion comment

This just updates a comment to match the code.

from Michael Paquier
---
 src/bin/psql/tab-complete.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 4d2bee1..dcbe515 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1999,7 +1999,8 @@ psql_completion(const char *text, int start, int end)
 	/* First off we complete CREATE UNIQUE with "INDEX" */
 	else if (TailMatches2("CREATE", "UNIQUE"))
 		COMPLETE_WITH_CONST("INDEX");
-	/* If we have CREATE|UNIQUE INDEX, then add "ON" and existing indexes */
+	/* If we have CREATE|UNIQUE INDEX, then add "ON", "CONCURRENTLY",
+	   and existing indexes */
 	else if (TailMatches2("CREATE|UNIQUE", "INDEX"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes,
    " UNION SELECT 'ON'"
-- 
2.7.0

From d61d232b429e911d73598d9615fc73397fb44bda Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 10 Jan 2016 11:43:27 -0500
Subject: [PATCH 2/3] psql: Fix CREATE INDEX tab completion

The previous code supported a syntax like CREATE INDEX name
CONCURRENTLY, which never existed.  Mistake introduced in commit
37ec19a15ce452ee94f32ebc3d6a9a45868e82fd.  Remove the addition of
CONCURRENTLY at that point.
---
 src/bin/psql/tab-complete.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index dcbe515..52336aa 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2009,13 +2009,12 @@ psql_completion(const char *text, int start, int end)
 	else if (TailMatches3("INDEX", MatchAny, "ON") ||
 			 TailMatches2("INDEX|CONCURRENTLY", "ON"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tm, NULL);
-	/* If we have CREATE|UNIQUE INDEX <name> CONCURRENTLY, then add "ON" */
-	else if (TailMatches3("INDEX", MatchAny, "CONCURRENTLY") ||
-			 TailMatches2("INDEX", "CONCURRENTLY"))
+	/* If we have CREATE|UNIQUE INDEX CONCURRENTLY, then add "ON" */
+	else if (TailMatches2("INDEX", "CONCURRENTLY"))
 		COMPLETE_WITH_CONST("ON");
-	/* If we have CREATE|UNIQUE INDEX <name>, then add "ON" or "CONCURRENTLY" */
+	/* If we have CREATE|UNIQUE INDEX <name>, then add "ON" */
 	else if (TailMatches3("CREATE|UNIQUE", "INDEX", MatchAny))
-		COMPLETE_WITH_LIST2("CONCURRENTLY", "ON");
+		COMPLETE_WITH_CONST("ON");
 
 	/*
	 * Complete INDEX <name> ON <table> with a list of table columns (which
-- 
2.7.0

From 84725481c29ae2d3e7af0ca7b4e3b58de48c56e4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 10 Jan 2016 12:04:28 -0500
Subject: [PATCH 3/3] psql: Improve CREATE INDEX CONCURRENTLY tab completion

The completion of CREATE INDEX CONCURRENTLY was lacking in several ways
compared to a plain CREATE INDEX command:

- CREATE INDEX <name> ON completes table names, but didn't with
  CONCURRENTLY.

- CREATE INDEX completes ON and existing index names, but with
  CONCURRENTLY it only completed ON.

- CREATE INDEX <name> completes ON, but didn't with CONCURRENTLY.

These are now all fixed.
---
 src/bin/psql/tab-complete.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 52336aa..97580b9 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2005,15 +2005,18 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes,
    " UNION SELECT 'ON'"
    " UNION SELECT 'CONCURRENTLY'");
-	/* Complete ... INDEX [<name>] ON with a list of tables  */
-	else if (TailMatches3("INDEX", MatchAny, "ON") ||
+	/* Complete ... INDEX|CONCURRENTLY [<name>] ON with a list of tables  */
+	else if (TailMatches3("INDEX|CONCURRENTLY", MatchAny, "ON") ||
 			 TailMatches2("INDEX|CONCURRENTLY", "ON"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tm, NULL);
-	/* If we have CREATE|UNIQUE INDEX CONCURRENTLY, then add "ON" */
+	/* If we have CREATE|UNIQUE INDEX CONCURRENTLY, then add "ON" and existing
+	   indexes */
 	else if (TailMatches2("INDEX", "CONCURRENTLY"))
-		COMPLETE_WITH_CONST("ON");
-	/* If we have CREATE|UNIQUE INDEX <name>, then add "ON" */
-	else if (TailMatches3("CREATE|UNIQUE", "INDEX", MatchAny))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes,
+   " UNION SELECT 'ON'");
+	/* If we have CREATE|UNIQUE INDEX [CONCURRENTLY] <name>, then add "ON" */
+	else if 

Re: [HACKERS] Fwd: Re: [DOCS] Document Upper Limit for NAMEDATELEN in pgsql 9.5+

2016-01-10 Thread Robert Haas
On Fri, Jan 8, 2016 at 3:05 PM, Tom Lane  wrote:
> [ redirecting from -docs, which isn't the best place to discuss code fixes ]
>
> Whoever thought this was a good idea:
>
> StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
>  "Levenshtein hinting mechanism restricts NAMEDATALEN");
>
> is nuts.
>
> A minimal fix that would not put stumbling blocks in the way of changes
> to NAMEDATALEN is to redefine MAX_LEVENSHTEIN_STRLEN as
> Max(255, NAMEDATALEN).  But I wonder why we have to have an arbitrary limit
> in this code at all.  I trust nobody here believes this is the only O(N^2)
> algorithm in the backend; the others don't have this sort of thing in
> front of them.

The arbitrary limit in that code is not new, but the Assert is.  I
didn't foresee the consequence that it would become harder to change
NAMEDATALEN, and I agree that needs to be fixed.  Ripping the
arbitrary limit out does not seem like a particularly good idea to me,
even if we put CHECK_FOR_INTERRUPTS() into that path.  Just because
there are other ways to make the backend burn insane amounts of CPU
time doesn't mean that we should try to make it easier.

Actually, though, varstr_levenshtein_less_equal() never gets called
from parse_relation.c with a distance bound of more than four, so it
can't actually go quadratic.  There's another call in that file to
varstr_levenshtein(), which in retrospect looks silly, but I'm pretty
sure that could also be changed to use a distance bound of 4 (well,
MAX_FUZZY_DISTANCE + 1) without changing the behavior at all.  Given a
constant distance bound, the algorithm is, I believe, only O(max(m,
n)) not O(mn).  Knowing that, we could let the internal code just
bypass the length check and only enforce the length restriction when
the code is called from SQL via fuzzystrmatch.

(There are of course other ways to solve the problem, too; this is
just one that came to mind.)
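
For illustration, here is a self-contained sketch of the bounded
computation (not the actual varstr_levenshtein_less_equal() code): with
a constant bound d, only a band of at most 2*d + 1 cells per row is
evaluated, so the work is O(d * max(m, n)) rather than O(m * n), and
the function can return d + 1 as soon as the bound is provably
exceeded.

#include <limits.h>
#include <stdlib.h>
#include <string.h>

static int
levenshtein_less_equal(const char *a, const char *b, int d)
{
    int         m = (int) strlen(a);
    int         n = (int) strlen(b);
    int         result;
    int        *prev,
               *curr;

    if (abs(m - n) > d)
        return d + 1;           /* length difference alone exceeds d */
    if (n == 0)
        return m;               /* m <= d is guaranteed by the test above */

    prev = malloc((n + 2) * sizeof(int));
    curr = malloc((n + 2) * sizeof(int));

    for (int j = 0; j <= n; j++)
        prev[j] = j;
    prev[n + 1] = d + 1;        /* sentinel just past the first row */

    for (int i = 1; i <= m; i++)
    {
        int         lo = (i - d > 1) ? i - d : 1;
        int         hi = (i + d < n) ? i + d : n;
        int         rowmin = INT_MAX;
        int        *tmp;

        /* the cell left of the band counts as already over the bound */
        curr[lo - 1] = (lo == 1) ? i : d + 1;

        for (int j = lo; j <= hi; j++)
        {
            int         ins = prev[j] + 1;
            int         del = curr[j - 1] + 1;
            int         sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            int         best = ins < del ? ins : del;

            if (sub < best)
                best = sub;
            curr[j] = best;
            if (best < rowmin)
                rowmin = best;
        }
        curr[hi + 1] = d + 1;   /* sentinel right of the band */

        if (rowmin > d)
        {
            free(prev);
            free(curr);
            return d + 1;       /* every cell in the band exceeds d */
        }

        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = (prev[n] <= d) ? prev[n] : d + 1;
    free(prev);
    free(curr);
    return result;
}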

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

2016-01-10 Thread Simon Riggs
On 10 January 2016 at 16:32, Robert Haas  wrote:

> On Sat, Jan 9, 2016 at 5:13 AM, Simon Riggs  wrote:
> > Avoid pin scan for replay of XLOG_BTREE_VACUUM
> > Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
> require
> > complex interlocking that matched the requirements on the master. This
> required
> > an O(N) operation that became a significant problem with large indexes,
> causing
> > replication delays of seconds or in some cases minutes while the
> > XLOG_BTREE_VACUUM was replayed.
> >
> > This commit skips the “pin scan” that was previously required, by
> observing in
> > detail when and how it is safe to do so, with full documentation. The
> pin scan
> > is skipped only in replay; the VACUUM code path on master is not touched
> here.
> >
> > The current commit still performs the pin scan for toast indexes, though
> this
> > can also be avoided if we recheck scans on toast indexes. Later patch
> will
> > address this.
> >
> > No tests included. Manual tests using an additional patch to view WAL
> records
> > and their timing have shown the change in WAL records and their handling
> has
> > successfully reduced replication delay.
>
> I suspect I might be missing something here, but I don't see how a
> test against rel->rd_rel->relnamespace can work during recovery.
> Won't the relcache entry we're looking at here be one created by
> CreateFakeRelcacheEntry(), and thus that field won't be valid?
>

The test isn't made during recovery, it's made on the master.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-10 Thread Tomas Vondra

Hi,

On 01/10/2016 01:08 AM, Peter Geoghegan wrote:

On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra wrote:

So, this seems to bring reasonable speedup, as long as the selectivity is
below 50%, and the data set is sufficiently large.


What about semijoins? Apparently they can use bloom filters
particularly effectively. Have you considered them as a special case?


You mean to handle them in a special way in the code, or just to 
benchmark semijoins (and not just regular joins)?



Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:

regression=# set enable_mergejoin = off;
SET
regression=# explain analyze select * from tenk1 o join tenk2 t on
o.twenty = t.twenty and t.hundred = o.hundred;
QUERY PLAN
──
  Hash Join  (cost=595.00..4103.00 rows=5 width=488) (actual
time=12.086..1026.194 rows=100 loops=1)
Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=1 width=244)
(actual time=0.017..4.212 rows=1 loops=1)
->  Hash  (cost=445.00..445.00 rows=1 width=244) (actual
time=12.023..12.023 rows=1 loops=1)
  Buckets: 16384  Batches: 1  Memory Usage: 2824kB
  ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=1
width=244) (actual time=0.006..3.453 rows=1 loops=1)
  Planning time: 0.567 ms
  Execution time: 1116.094 ms
(8 rows)

(Note that while the optimizer has a slight preference for a merge
join in this case, the plan I show here is a bit faster on my
machine).


The patch I posted actually does not build the bloom filter on the values 
directly, but on the hashvalue we already use. So it handles hashjoins 
with an arbitrary number of attributes just fine.


Or perhaps you're thinking about some special optimization that might 
improve such cases (I can't think of any)?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-10 Thread Tomas Vondra

Hi,

On 01/10/2016 04:03 AM, Peter Geoghegan wrote:

On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan  wrote:

Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:


Sorry, accidentally fat-fingered my enter key before I was finished
drafting that mail. That example isn't useful, because there is no
early/cheap elimination of tuples while scanning the outer relation.

My point about bloom filters on only one attribute (or perhaps
multiple bloom filters on multiple attributes) is that you might be
able to make the bloom filter more effective, particularly relative
to the amount of memory available, by only building it for one
attribute where that distinction (one attribute vs multiple) happens
to exist.


I'm not sure what you mean. The patch builds the bloom filter on the 
hashvalue, not on the attribute values directly. I find this very 
efficient as we don't have to hash the original values again and instead 
work with just a short 32-bit hashvalue.
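
For the archives, here is a minimal sketch (mine, not the patch's actual
code) of what "building on the hashvalue" means -- the k probe positions
are derived from the single 32-bit hash via double hashing, so the
original datums are never re-hashed:

#include <stdbool.h>
#include <stdint.h>

typedef struct BloomFilter
{
	uint64_t   *bits;
	uint32_t	nbits;			/* number of bits, a power of two */
	int			nhashes;		/* k */
} BloomFilter;

static void
bloom_add(BloomFilter *bf, uint32_t hashvalue)
{
	uint32_t	h1 = hashvalue;
	uint32_t	h2 = (hashvalue * 0x9e3779b9) | 1;	/* odd second hash */
	int			i;

	for (i = 0; i < bf->nhashes; i++)
	{
		uint32_t	pos = (h1 + i * h2) & (bf->nbits - 1);

		bf->bits[pos / 64] |= UINT64_C(1) << (pos % 64);
	}
}

static bool
bloom_maybe_contains(const BloomFilter *bf, uint32_t hashvalue)
{
	uint32_t	h1 = hashvalue;
	uint32_t	h2 = (hashvalue * 0x9e3779b9) | 1;
	int			i;

	for (i = 0; i < bf->nhashes; i++)
	{
		uint32_t	pos = (h1 + i * h2) & (bf->nbits - 1);

		if ((bf->bits[pos / 64] & (UINT64_C(1) << (pos % 64))) == 0)
			return false;		/* definitely not in the hash table */
	}
	return true;				/* maybe present; do the real lookup */
}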


I really doubt doing some magic by only building the bloom filter on one 
of the attributes is useful - it's way more complicated, requires 
estimating the ndistinct for each attribute, and also somehow estimating 
the impact on the accuracy of the bloom filter. That seems rather 
complex and I don't have a good idea how to do that.




Simon said: "So the objective must be to get a Bloom Filter that is
small enough that it lives in a higher/faster level of cache than the
main Hash table". I agree that that's really important here, although
I did note that Tomas wasn't so sure, emphasizing the importance of
avoiding serialization and deserialization overhead -- and certainly,
that's why the patch helps some cases a lot.


I think the impact of the bloom filter size really depends on the size 
of the hash join. I see three cases, as indicated in the benchmarks I posted


(a) single batch (1M:10M)

- fitting into CPU cache really important (otherwise almost no
  improvement)

- impossible to speed up for very small hash tables (that fit into
  L3 cache)

(b) multiple batches w. medium data set (10M:100M)

- important, but not as much as for (a), as we still eliminate the
  (de)serialization overhead

(c) large data set (5M:250M)

- not important, it's mostly about eliminating (de)serialization
  overhead and/or I/O overhead

> But Simon's point applies more to the worst case than the best or
> average cases, and evidently we have plenty of worrying about the
> worst case left to do here.

So what do you consider to be the worst case?


So, one attribute could have relatively low cardinality, and thus
fewer distinct items in the bloom filter's set, making it far slower
to degrade (i.e. have increased probability of false positives) as
compared to a composite of two or more attributes, and yet perhaps
not significantly less useful in terms of its ability to eliminate
(de)serialization overhead. Or, with two bloom filters (one per
attribute), one bloom filter could degrade far faster than the other
(due to having more distinct values), often very unpredictably, and
yet it wouldn't matter because we'd discard the bad one (maybe
really early, during the scan of the inner relation, using HLL).


I don't think so.

Firstly, I don't know what you mean by "slower to degrade" as the false 
positive rate of a bloom filter does not depend on the cardinality of 
the column used to build the bloom filter, but rather on the "load 
factor" of the bloom filter (i.e. number of items added to the filter 
compared to the initial capacity). Not to mention that you still have to 
estimate the cardinality of the columns, which is tricky (and one of the 
main issues of the current patch, IMHO).
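
(For reference: with m bits, k hash functions and n distinct items 
inserted, the textbook approximation for the false positive rate is 
p ~= (1 - e^(-k*n/m))^k, i.e. a function of the fill ratio n/m and k 
only. Column cardinality only enters because duplicates add nothing new, 
so n is the number of distinct values actually inserted -- which is 
exactly why the ndistinct estimate matters for sizing the filter.)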


Secondly, by splitting the "composite" bloom filter into per-column 
filters you make it impossible to track correlation between the columns.


For example let's say we have two attributes (a,b), 'a' having 1000 
distinct values while 'b' only has 10. But let's assume that there's a 
correlation between 'a' and 'b' so that (mod(a,10) = b). If you build 
the bloom filter on the columns separately, it's going to be entirely 
useless. Sure, this is a rather trivial (and perhaps naive) example, but 
it shows how easy it is to break it.


I'd also like to point out that the vast majority of joins are on a single 
column, so perhaps we should try to solve those first, and then perhaps 
improve the multi-column case if possible.




I notice that all these example queries involve less-than operators
(inner relation tuples are thereby prevented from being entered into
the hash table). It might not be a terrible idea to hash abbreviated
keys (or even truncated abbreviated keys) for the bloom filter in
certain cases, in particular when there is a correlation in the inner
relation attributes (equi-joined attribute, and attribute that is
other part of qual). There might not be a correlation, of course, 

Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-10 Thread Tomas Vondra

Hi,

On 01/10/2016 05:11 AM, Peter Geoghegan wrote:

On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra wrote:

Which means the "dim.r" column has 100 different values (0-99) with uniform
distribution. So e.g. "WHERE r < 15" matches 15%.


I think that the use of a uniform distribution to demonstrate this
patch is a bad idea, unless you want to have a conversation about the
worst case.

Look at the use cases for bloom filters in general. They're almost
all some variation on the same theme: checking a bloom filter
inexpensively, to avoid an expensive cache miss (of some fashion).


Right.


The Google Chrome web browser uses a bloom filter for malicious URLs.


FWIW I don't think Chrome is using a bloom filter for this purpose 
anymore. Chromium certainly does not (it's using PrefixSet instead).



It usually avoids consulting Google's servers about whether or not
any given URL that is visited is malicious by using the bloom filter.
This is effective in large part because the top 100 websites ranked
by popularity have a huge falloff in popularity as you go down the
list. It looks like a Zipfian distribution, and so I imagine they get
pretty far with a very small bloom filter, even though in theory the
chances of any given valid URL resulting in a cache miss is very
high.


I'm not familiar with how Chrome used bloom filters, but I'd expect them 
to be very careful about false positives (and also false negatives, as 
the bloom filter can't contain all malicious URLs).


My assumption is that they've been able to make that work because they 
do have detailed stats about how frequently people visit those URLs, and 
use that when building the bloom filter. But we don't have such 
information in hashjoin.




Generally, uniform distributions are rare in a non-canonical list of
things, like a hash join outer relation's attribute.


Well, I'm not claiming testing uniform distribution is enough, but 
surely it's one of the cases we should handle just fine.


The problem with non-uniform cases is that it really depends on the 
outer side of the join.


For example let's say the hash table contains 1000 values, and the bloom 
filter has a 1% false positive rate. But let's assume that the outer side 
has a value that triggers a false positive, and that this value makes up 
99% of the outer table. Suddenly, you have a 99% false positive rate, 
rendering the bloom filter pointless.


I don't think this is fixable while creating the bloom filter. All we 
can do is watch the bloom filter lookups and disable the bloom filter 
once the false positive rate reaches some threshold.
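
Something along these lines, say -- a hypothetical sketch, not part of
the posted patch, with made-up counter names; a "false positive" here is
a probe that passed the filter but then missed the actual hash table:

#include <stdbool.h>
#include <stdint.h>

typedef struct BloomFilterStats
{
	uint64_t	nlookups;		/* probes against the bloom filter */
	uint64_t	nfalsepos;		/* probes that passed but found no match */
	bool		enabled;
} BloomFilterStats;

#define BLOOM_MIN_LOOKUPS		10000	/* don't judge too early */
#define BLOOM_FPR_THRESHOLD		0.35	/* give up above this rate */

static void
bloom_check_usefulness(BloomFilterStats *stats)
{
	if (stats->enabled &&
		stats->nlookups >= BLOOM_MIN_LOOKUPS &&
		(double) stats->nfalsepos / stats->nlookups > BLOOM_FPR_THRESHOLD)
		stats->enabled = false;	/* bypass the filter from now on */
}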


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Why doesn't src/backend/port/win32/socket.c implement bind()?

2016-01-10 Thread Tom Lane
Some of the Windows buildfarm members occasionally fail like this:

LOG:  could not bind IPv4 socket: No error
HINT:  Is another postmaster already running on port 64470? If not, wait a few 
seconds and retry.
WARNING:  could not create listen socket for "127.0.0.1"
FATAL:  could not create any TCP/IP sockets

(bowerbird, in particular, has a few recent examples)

I think the reason why we're getting "No error" instead of a useful
strerror report is that socket.c doesn't provide an implementation
of bind() that includes TranslateSocketError().  Why is that?
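
(For illustration only -- an untested sketch of the kind of wrapper
socket.c provides for other calls, applied to bind(); pgwin32_bind is a
name I'm making up here:)

int
pgwin32_bind(SOCKET s, struct sockaddr *addr, int addrlen)
{
	int			res;

	res = bind(s, addr, addrlen);
	if (res < 0)
		TranslateSocketError();	/* map WSAGetLastError() to errno */
	return res;
}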

regards, tom lane




Re: [HACKERS] pglogical - logical replication contrib module

2016-01-10 Thread Steve Singer

On 01/09/2016 01:30 PM, Steve Singer wrote:

On 12/31/2015 06:34 PM, Petr Jelinek wrote:



I'm not really sure what to do to 'recover' my cluster at this point 
so I'll send this off and rebuild my cluster and start over.





I had a setup test1--->test2 (with 2 tables in the default set)

I then created a third database (all three hosted on the same PG cluster)

In the third database (test3)
test3=# create extension pglogical;
CREATE EXTENSION
test3=# select pglogical.create_node(node_name:='test3', 
dsn:='host=localhost dbname=test3 port=5436');

 create_node
-------------
  2001662995
(1 row)

test3=# select 
pglogical.create_subscription(subscription_name:='defaultsub',provider_dsn:='host=localhost 
dbname=test2 port=5436');

 create_subscription
---------------------
  2974019075

It copied the schema over but not the data (if I use test2 as the 
provider_dsn then it does copy the data).


I then tried inserting a row into a table on test1.  Things crashed and 
after crash recovery I keep getting


2016-01-10 13:03:15 EST LOG:  database system is ready to accept connections
2016-01-10 13:03:15 EST LOG:  autovacuum launcher started
2016-01-10 13:03:15 EST LOG:  starting apply for subscription defaultsub
2016-01-10 13:03:15 EST LOG:  starting apply for subscription defaultsub
2016-01-10 13:03:15 EST test2LOG:  starting logical decoding for slot "pgl_test3_test2_defaultsub"
2016-01-10 13:03:15 EST test2DETAIL:  streaming transactions committing after 0/18292D8, reading WAL from 0/18292D8
2016-01-10 13:03:15 EST test2LOG:  logical decoding found consistent point at 0/18292D8
2016-01-10 13:03:15 EST test2DETAIL:  Logical decoding will begin using saved snapshot.
TRAP: FailedAssertion("!(IsTransactionState())", File: "catcache.c", Line: 1127)

2016-01-10 13:03:15 EST test2LOG:  unexpected EOF on standby connection
2016-01-10 13:03:15 EST LOG:  worker process: pglogical apply 17016:2974019075 (PID 24746) was terminated by signal 6: Aborted

The stack trace is

#3  0x007b83af in SearchCatCache (cache=0xe27d18, v1=15015784,
v2=v2@entry=0, v3=v3@entry=0, v4=v4@entry=0) at catcache.c:1127
#4  0x007c503e in SearchSysCache (cacheId=cacheId@entry=47,
key1=<optimized out>, key2=key2@entry=0, key3=key3@entry=0,
key4=key4@entry=0) at syscache.c:981
#5  0x006996d4 in replorigin_by_name (
roname=0xe51f30 "pgl_test2_test1_defaultsub",
missing_ok=missing_ok@entry=0 '\000') at origin.c:216
#6  0x7fdb54a908d3 in handle_origin (s=0x7ffd873f6da0)
at pglogical_apply.c:235
#7  replication_handler (s=0x7ffd873f6da0) at pglogical_apply.c:1031
#8  apply_work (streamConn=streamConn@entry=0xe84fb0) at pglogical_apply.c:1309
#9  0x7fdb54a911cc in pglogical_apply_main (main_arg=<optimized out>)
    at pglogical_apply.c:1691
#10 0x00674912 in StartBackgroundWorker () at bgworker.c:726
#11 0x0067f7e2 in do_start_bgworker (rw=0xe03890) at postmaster.c:5501
#12 maybe_start_bgworker () at postmaster.c:5676
#13 0x00680206 in sigusr1_handler (postgres_signal_arg=<optimized out>)
    at postmaster.c:4937
#14 <signal handler called>
#15 0x7fdb54fa2293 in __select_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#16 0x00468285 in ServerLoop () at postmaster.c:1648
#17 0x0068161e in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0xddede0) at postmaster.c:1292
#18 0x0046979d in main (argc=3, argv=0xddede0) at main.c:223



I tried dropping the subscription and re-adding it.  I keep getting

2016-01-10 13:21:48 EST test1LOG:  logical decoding found consistent point at 0/1830080
2016-01-10 13:21:48 EST test1DETAIL:  There are no running transactions.
2016-01-10 13:21:48 EST test1LOG:  exported logical decoding snapshot: "04DE-1" with 0 transaction IDs

2016-01-10 13:21:48 EST test3ERROR:  relation "a" already exists
2016-01-10 13:21:48 EST test3STATEMENT:  CREATE TABLE a (
a integer NOT NULL,
b integer
);



pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 182; 1259 16700 TABLE a ssinger
pg_restore: [archiver (db)] could not execute query: ERROR:  relation "a" already exists

Command was: CREATE TABLE a (
a integer NOT NULL,
b integer
);



2016-01-10 13:21:48 EST ERROR:  could not execute command "/usr/local/pgsql96gitlogical/bin/pg_restore --section="pre-data" --exit-on-error -1 -d "host=localhost dbname=test3 port=5436" "/tmp/pglogical-28079.dump""
2016-01-10 13:21:48 EST test1LOG:  unexpected EOF on client connection with an open transaction
2016-01-10 13:21:48 EST LOG:  worker process: pglogical apply 17016:844915593 (PID 28079) exited with exit code 1
2016-01-10 13:21:48 EST ERROR:  subscriber defaultsub4 initialization failed during nonrecoverable step (s), please try the setup again


Which is probably also the cause of the error I reported yesterday (that 
I tried creating a subscription without dropping the 

Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-10 Thread David Rowley
On 11 January 2016 at 09:30, Tomas Vondra wrote:

> Hi,
>
> On 01/10/2016 04:03 AM, Peter Geoghegan wrote:
>
>> On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan  wrote:
>
> Also, are you aware of this?
>>
>>
>> http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
>>
>> It talks about bloom filters for hash joins in PostgreSQL
>> specifically. Interestingly, they talk about specific TPC-H queries.
>>
>
> Interesting. The way that paper uses bloom filters is very different from
> what I do in the patch. They build the bloom filters and then propagate
> them into the scan nodes to eliminate the tuples early.
>
>
That does sound interesting, but unless I'm somehow mistaken, I guess to do
that you'd have to abandon the more efficient hashing of the hash value
that you're doing in the current patch, and hash the complete value in the
scan node, then hash them again if they make it into the hash join node.
That does not sound like it would be a win if hashing longer varlena values.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Close handle leak in SSPI auth

2016-01-10 Thread Christian Ullrich

Hello,

here's a one-line patch to close a handle leak in pg_SSPI_recvauth().

According to the docs, the token retrieved with 
QuerySecurityContextToken() must be closed.


--
Christian

diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
new file mode 100644
index 0131bfd..57c2f48
*** a/src/backend/libpq/auth.c
--- b/src/backend/libpq/auth.c
*** pg_SSPI_recvauth(Port *port)
*** 1253,1258 ****
--- 1253,1260 ----
  				(errmsg_internal("could not get user token: error code %lu",
  								 GetLastError())));
  
+ 	CloseHandle(token);
+ 
  	if (!LookupAccountSid(NULL, tokenuser->User.Sid, accountname, &accountnamesize,
  						  domainname, &domainnamesize, &accountnameuse))
  		ereport(ERROR,



Re: [HACKERS] ExecGather() + nworkers

2016-01-10 Thread Peter Geoghegan
On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas  wrote:
>> I'm not sure why the test for nworkers following the
>> LaunchParallelWorkers() call doesn't look like this, though:
>>
>> /* Set up tuple queue readers to read the results. */
>> if (pcxt->nworkers_launched > 0)
>> {
>> ...
>> }
>
> Hmm, yeah, I guess it could do that.

That would make it clearer as an example.
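
Roughly this shape, that is (paraphrasing the surrounding ExecGather()
code from memory, untested; node, pcxt, fslot and i all exist in that
function already):

LaunchParallelWorkers(pcxt);

/* Set up tuple queue readers to read the results. */
if (pcxt->nworkers_launched > 0)
{
	node->nreaders = 0;
	node->reader = palloc(pcxt->nworkers_launched *
						  sizeof(TupleQueueReader *));

	for (i = 0; i < pcxt->nworkers_launched; ++i)
	{
		shm_mq_set_handle(node->pei->tqueue[i],
						  pcxt->worker[i].bgwhandle);
		node->reader[node->nreaders++] =
			CreateTupleQueueReader(node->pei->tqueue[i],
								   fslot->tts_tupleDescriptor);
	}
}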

>> But going to this additional trouble (detecting no workers launched on
>> the basis of !nworkers_launched) suggests that simply testing
>> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
>> do that, and in so doing also totally remove the "for" loop shown
>> here?
>
> I don't see how the for loop goes away.

I meant that some code in the "for" loop goes away. Not all of it.
Just the more obscure code. As I said, I'm mostly pointing this out
out of concern for making it clearer as example code.

>> In the case of parallel sequential scan, it looks like one worker can
>> be helpful, because then the gather node (leader process) can run the
>> plan itself to some degree, and so there are effectively 2 processes
>> scanning at a minimum (unless 0 workers could be allocated to begin
>> with). How useful is it to have a parallel scan when this happens,
>> though?
>
> Empirically, that's really quite useful.  When you have 3 or 4
> workers, the leader really doesn't make a significant contribution to
> the work, but what I've seen in my testing is that 1 worker often runs
> almost twice as fast as 0 workers.

I suppose that makes sense, given that parallel sequential scan works
best when most tuples are eliminated in workers; there ought not to be
many tuples filling the single worker's queue anyway.

> I don't really understand why this should be so.  I thought the idea
> of parallel sort is (roughly) that each worker should read data until
> it fills work_mem, sort that data, and write a tape.  Repeat until no
> data remains.  Then, merge the tapes.  I don't see any reason at all
> why this shouldn't work just fine with a leader and 1 worker.

It will work fine with a leader and 1 worker -- the code will be
correct, and without any special cases. But it will be a suboptimal
use of resources. From the caller's point of view, there is no reason
to think it will be faster, and some reason to think it will be
slower. A particular concern for parallel sort is that the sort might
not use enough memory to need to be an external sort, but you
effectively force it to be one by making it a parallel sort (that is
not ideal in the long term, but it's a good compromise for 9.6's
parallel index build stuff). You're also consuming a
BackgroundWorkerSlot for the duration of the sort, in an environment
where evidently those are in short supply.

Now, you might wonder why it is that the leader cannot also sort runs,
just as a worker would. It's possible, but it isn't exactly
straightforward. You have to have special cases in several places,
even though it probably is going to be uncommon to only have one
BackgroundWorkerSlot available in practice. It's simpler to just
opt-out, and seems better given that max_parallel_degree is a way of
resource limiting based on available cores (it's certainly not about
the availability of shared memory for the BackgroundWorkerSlot array).

More importantly, I have other, entirely general concerns. Other major
RDBMSs have settings that are very similar to max_parallel_degree,
with a setting of 1 effectively disabling all parallelism. Both Oracle
and SQL Server have settings that they both call the "maximum degree
or parallelism". I think it's a bit odd that with Postgres,
max_parallel_degree = 1 can still use parallelism at all. I have to
wonder: are we conflating controlling the resources used by parallel
operations with how shared memory is doled out?

I could actually "give back" my parallel worker slots early if I
really wanted to (that would be messy, but the then-acquiesced workers
do nothing for the duration of the merge beyond conceptually owning
the shared tape temp files). I don't think releasing the slots early
makes sense, because I tend to think that hanging on to the workers
helps the DBA in managing the server's resources. The still-serial
merge phase is likely to become a big bottleneck with parallel sort.

With parallel sequential scan, a max_parallel_degree of 8 could result
in 16 processes scanning in parallel. That's a concern, and not least
because it happens only sometimes, when things are timed just right.
The fact that only half of those processes are "genuine" workers seems
to me like a distinction without a difference.

> I think that's probably over-engineered.  I mean, it wouldn't be that
> hard to have the workers just exit if you decide you don't want them,
> and I don't really want to make the signaling here more complicated
> than it really needs to be.

I worry about the additional overhead of constantly starting and
stopping 

Re: [HACKERS] COPY (... tab completion

2016-01-10 Thread Peter Eisentraut
I think this would be a useful addition.  A couple of problems:

This change in the comment doesn't make sense to me and doesn't seem to
match the code:

-   /* If we have COPY [BINARY] <sth>, complete it with "TO" or "FROM" */
+   /* If we have COPY|BINARY <sth>, complete it with "TO" or "FROM" */

The list of commands to allow as the "query" inside the parentheses is
documented to be: SELECT, VALUES, INSERT, UPDATE or DELETE; and actually
TABLE should also work.  Your list doesn't include all of those.  So
please adjust that.




Re: [HACKERS] ExecGather() + nworkers

2016-01-10 Thread Peter Geoghegan
On Sun, Jan 10, 2016 at 1:44 PM, Peter Geoghegan  wrote:
> With parallel sequential scan, a max_parallel_degree of 8 could result
> in 16 processes scanning in parallel.

I meant a max_worker_processes setting, which of course is different.
Nevertheless, I find it surprising that max_parallel_degree = 1 does
not prevent parallel operations, and that max_parallel_degree is
defined in terms of the availability of worker processes (in the
strict sense of worker processes that are launched by
LaunchParallelWorkers(), and not a broader and more practical
definition).

-- 
Peter Geoghegan




[HACKERS] Apparently deprecated code in planner.c

2016-01-10 Thread Dickson S. Guedes
Hi all,

I'm wondering whether the #ifdef FORCE_PARALLEL_MODE code [1] is deprecated:

 *
 * FIXME: It's assumed that code further down will set parallelModeNeeded
 * to true if a parallel path is actually chosen.  Since the core
 * parallelism code isn't committed yet, this currently never happens.
 */
#ifdef FORCE_PARALLEL_MODE
glob->parallelModeNeeded = glob->parallelModeOK;
#else
glob->parallelModeNeeded = false;
#endif


[1] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/plan/planner.c;h=147c4deef3bb708ebb32b6781330f6ed980fc90c;hb=HEAD#l245

[]s
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://github.com/guedes - http://guedesoft.net
http://www.postgresql.org.br




Re: [HACKERS] No Issue Tracker - Say it Ain't So!

2016-01-10 Thread Merlin Moncure
On Fri, Jan 8, 2016 at 7:12 AM, Greg Stark  wrote:
> This really sounds like you're looking for leverage to fix one problem
> by finding other problems that you hope to solve with the same hammer.
> That's a recipe for a tool that solves neither problem well and gets
> ignored by both sets of users.

+1

merlin




Re: [HACKERS] Copy-pasto in logical decoding docs

2016-01-10 Thread Peter Eisentraut
On 12/31/15 10:45 PM, Petr Jelinek wrote:
> I noticed that the filter callback is documented as
> LogicalDecodeChangeCB in the logical decoding docs. Here is one-line
> patch to fix it.

applied to master and 9.5





Re: [HACKERS] COPY (... tab completion

2016-01-10 Thread Tom Lane
Peter Eisentraut  writes:
> I think this would be a useful addition.  A couple of problems:
> This change in the comment doesn't make sense to me and doesn't seem to
> match the code:

> - /* If we have COPY [BINARY] <sth>, complete it with "TO" or "FROM" */
> + /* If we have COPY|BINARY <sth>, complete it with "TO" or "FROM" */

Offhand, that looks like an accidental reversion of a change made in
9b181b0363deb65b15a9feaf3eb74f86707498a9.

regards, tom lane




Re: [HACKERS] PATCH: add pg_current_xlog_flush_location function

2016-01-10 Thread Tomas Vondra

Hi,

On 12/13/2015 08:38 PM, Tomas Vondra wrote:

Hi,

On 12/13/2015 06:13 AM, Amit Kapila wrote:
 >

...

 >

Is there a reason why you can't use existing function
GetFlushRecPtr() in xlog.c?


No, not really. I think I somehow missed that function when writing
the initial version of the patch. Will fix in v2 of the patch.


Hmm, so I've been looking at this, and I've realized that I've written 
it like this because that's pretty much what pg_current_xlog_location() 
does. It calls GetXLogWriteRecPtr which does this:


/*
 * Get latest WAL write pointer
 */
XLogRecPtr
GetXLogWriteRecPtr(void)
{
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);

return LogwrtResult.Write;
}

so the patch does the same thing, except that I've returned "Flush".

OTOH GetFlushRecPtr does this:

XLogRecPtr
GetFlushRecPtr(void)
{
XLogRecPtr  recptr;

SpinLockAcquire(&XLogCtl->info_lck);
recptr = XLogCtl->LogwrtResult.Flush;
SpinLockRelease(&XLogCtl->info_lck);

return recptr;
}

i.e. it does not update LogwrtResult, the local private copy. Not sure 
what's appropriate here ...
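
FWIW if we settle on GetFlushRecPtr(), the SQL-callable part could simply 
mirror pg_current_xlog_location() -- an untested sketch:

Datum
pg_current_xlog_flush_location(PG_FUNCTION_ARGS)
{
	XLogRecPtr	current_recptr;

	if (RecoveryInProgress())
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("recovery is in progress"),
				 errhint("WAL control functions cannot be executed during recovery.")));

	current_recptr = GetFlushRecPtr();

	PG_RETURN_LSN(current_recptr);
}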


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] COPY (... tab completion

2016-01-10 Thread Peter Eisentraut
On 1/10/16 8:01 PM, Peter Eisentraut wrote:
> The list of commands to allow as the "query" inside the parentheses is
> documented to be: SELECT, VALUES, INSERT, UPDATE or DELETE; and actually
> TABLE should also work.  Your list doesn't include all of those.

To be fair, this is actually a recent new feature.

