Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Amit Kapila
On Wed, Nov 11, 2015 at 1:08 PM, Andres Freund  wrote:
>
> On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> > Here is a v13, which is just a rebase after 1aba62ec.
>
>
> 3) I found that latency wasn't improved much for workloads that are
>significantly bigger than shared buffers. The problem here is that
>neither bgwriter nor the backends have, so far, done
>sync_file_range() calls. That meant that the old problem of having
>gigabytes of dirty data that periodically get flushed out, still
>exists. Having these do flushes mostly attacks that problem.
>
>
> Benchmarking revealed that for workloads where the hot data set mostly
> fits into shared buffers flushing and sorting is anywhere from a small
> to a massive improvement, both in throughput and latency. Even without
> the patch from 2), although fixing that improves things further.
>
>
>
> What I did not expect, and what confounded me for a long while, is that
> for workloads where the hot data set does *NOT* fit into shared buffers,
> sorting often led to a noticeable reduction in throughput. Up to
> 30%. The performance was still much more regular than before, i.e. no
> more multi-second periods without any transactions happening.
>
> By now I think I know what's going on: Before the sorting portion of the
> patch the write-loop in BufferSync() starts at the current clock hand,
> by using StrategySyncStart(). But after the sorting that obviously
> doesn't happen anymore - buffers are accessed in their sort order. By
> starting at the current clock hand and moving on from there the
> checkpointer basically makes it less likely that victim buffers
> need to be written either by the backends themselves or by
> bgwriter. That means that the sorted checkpoint writes can, indirectly,
> increase the number of unsorted writes by other processes :(
>

That sounds like a tricky problem.  I think the way to improve the current
situation is to change the buffer allocation algorithm such that instead of
the backend issuing the write for a dirty buffer, it will just continue to
find the next free buffer when it finds that the selected buffer is dirty,
and if it cannot find a non-dirty buffer after a certain number of attempts,
it will signal bgwriter to write out some buffers.  Now the writing
algorithm of bgwriter has to be such that it picks the buffers in chunks
from a checkpoint-list, sorts them and then writes them.  Checkpoint also
uses the same checkpoint-list to flush the dirty buffers.  This will ensure
that the writes will always be sorted writes irrespective of which process
does the writes.  There could be multiple ways to form this checkpoint-list,
and one of them could be that MarkBufferDirty() adds the buffer to such a
list.  I think following such a mechanism could solve the problem of
unsorted writes in the system, but it raises a question: what kind of
latency could such a mechanism introduce for a backend which signals
bgwriter after not finding a non-dirty buffer for a certain number of
attempts?  I think if we sense this could be a problematic case, then we
can make both bgwriter and checkpoint always start from the next victim
buffer and then traverse the checkpoint-list.
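
To make that concrete, here is a minimal sketch of the proposed backend-side
loop (StrategyGetNextVictim, SignalBgWriter and DIRTY_SKIP_LIMIT are all
invented names for illustration; this is not an actual patch):

    /* Sketch: skip dirty victims instead of writing them out ourselves. */
    static BufferDesc *
    GetVictimBufferSketch(void)
    {
        int attempts = 0;

        for (;;)
        {
            BufferDesc *buf = StrategyGetNextVictim();  /* hypothetical */

            if (!(buf->flags & BM_DIRTY))
                return buf;             /* found a non-dirty buffer */

            if (++attempts >= DIRTY_SKIP_LIMIT)         /* hypothetical */
            {
                /* Ask bgwriter to write out a sorted chunk of the
                 * checkpoint-list rather than issuing the write here. */
                SignalBgWriter();                       /* hypothetical */
                attempts = 0;
            }
        }
    }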

>
> My benchmarking suggests that that effect is larger, the shorter the
> checkpoint timeout is. That seems to intuitively make sense, given the
> above explanation attempt. If the checkpoint takes longer, the clock hand
> will almost certainly soon overtake the checkpoint's 'implicit' hand.
>
> I'm not sure if we can really do anything about this problem. While I'm
> pretty jet lagged, I still spent a fair amount of time thinking about
> it. Seems to suggest that we need to bring back the setting to
> enable/disable sorting :(
>
>
> What I think needs to happen next with the patch is:
> 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
>handling the issue of smgr invalidations.
> 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
>later can contain multiple elements like checkpoint, bgwriter,
>backends, ddl, bulk-writes. That seems better than adding GUCs for
>these separately. Then make the flush locations in the patch
>configurable using that.
> 3) I think we should remove the sort timing from the checkpoint logging
>before commit. It'll always be pretty short.
>

It seems that for now you have left out the Windows-specific implementation
in pg_flush_data().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-11-12 Thread Robert Haas
On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila  wrote:
> The number of shared buffers hit could be different across different runs
> because the read sequence of the parallel workers can't be guaranteed; I
> don't think the same is even guaranteed for a Seq Scan node,

The number of hits could be different.  However, it seems like any
sequential scan, parallel or not, should have a number of accesses
(hit + read) equal to the size of the relation.  Not sure if that's
what is happening here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-12 Thread Kyotaro HORIGUCHI
Hello, I am also uncertain about what exactly the blocker is..

At Fri, 13 Nov 2015 02:31:53 +, Kouhei Kaigai  wrote 
in <9a28c8860f777e439aa12e8aea7694f80116f...@bpxm15gp.gisp.nec.co.jp>
> > Sorry, my explanation was not enough.  The reason for that is that in
> > the above postgres_fdw case for example, the overhead in sending the
> > query to the remote end and transferring the result to the local end
> > would not be negligible.  Yeah, we might be able to apply a special
> > handling for the improved efficiency when using early row locking, but
> > otherwise can we do the same thing?
> >
> It is a trade-off. Late locking semantics allow locking a relatively
> smaller number of remote rows, but it will incur extra latency.
> Also, it became clear we have a challenge to pull a joined tuple at once.

Late row locking anyway needs to send a query to the remote side
and to generate the joined row on either side of the
connection. Early row locking on an FDW doesn't need that, since the
necessary tuples are already in our hands. Is there any
performance issue in this? Unfortunately I have not comprehended what
the problem is :(

Or are you, Fujita-san, thinking about bulk late row locking or
some such?  If so, it is a matter for the future, like update/insert
pushdown, I suppose.

> > I didn't say that my approach is *valuable* either.  What I think is, I
> > see zero evidence that there is a good use-case for an FDW to do
> > something other than doing an ExecProcNode in the callback routine, as I
> > said below, so I don't see the need to add such a routine while that
> > would cause maybe not a large, but not a little burden for writing such
> > a routine on FDW authors.
> >
> It is quite natural because we cannot predict what kind of extension
> is implemented on the FDW interface. You might know the initial version of
> PG-Strom was implemented on FDW (about 4 years ago...). If I had
> continued to stick with FDW, it would have become an FDW driver with its
> own join engine. (cstore_fdw may potentially support its own join logic
> on top of its columnar storage, for instance?)
> 
> From the standpoint of interface design, if we would not admit flexibility
> of implementation unless the community sees a working example, a reasonable
> tactic *for an extension author* is to follow the interface restriction even
> if it is not the best approach from his standpoint.
> It does not mean the approach of the majority is also best for the minority.
> It just requires a compromise from the minority.

Or try to open the way to introduce the feature he/she wants. If a
workable postgres_fdw with join pushdown based on this API, to any
extent, were shown here, we could investigate the problems
there. But perhaps the deadline is just upon us..

> > I did the same thing in an earlier version of the patch I posted.
> > Although I agreed on Robert's comment "The Plan tree and the PlanState
> > tree should be mirror images of each other; breaking that equivalence
> > will cause confusion, at least.", I think that that would make code much
> > simpler, especially the code for setting chgParam for inner/outer
> > subplans.

I see that Kaigai-san's patch doesn't put nodes different
from the paths during plan creation; in other words, it doesn't break
the coherence between paths and plans from the core's point of
view. Fujita-san's patch mentioned above altered a node in the
core's sight. I understand that that is the most significant
difference between them..

> >  But one thing I'm concerned about is enable both inner and
> > outer plans, because I think that that would make the planner
> > postprocessing complicated, depending on what the foreign scans do by
> > the inner/outer subplans.  Is it worth doing so?  Maybe I'm missing
> > something, though.

Is this the discussion about late row locking? Join pushdown itself
is a kind of complicated process. And since it fools the planner in
one aspect, the additional feature would inevitably be
complex to some extent. We could discuss that after some
specific problem comes into our sight.

> If you persuade other person who has different opinion, you need to
> explain why was it complicated, how much complicated and what was
> the solution you tried at that time.
> The "complicated" is a subjectively-based term. At least, we don't
> share your experience, so it is hard to understand the how complexity.

Me too.  It surely might be complicated (though the extent is
mainly in an individual's mind..), but I also don't see how
Fujita-san's patch resolves that "problem".

> I guess it is similar to what built-in logic is usually doing, thus,
> it should not be a problem we cannot solve. A utility routine FDW
> driver can call will solve the issue (even if it is not supported
> on v9.5 yet).
>
> > >>> (2) Add a recheck callback.
> > >>>
> > >>> If the foreign data wrapper wants to adopt the solution you're
> > >>> proposing, the recheck callback can call
> > >>> ExecProcNode(outerPlanState(node)).  I don't think this should end up
> > >>> being more than

Re: [HACKERS] Freeze avoidance of very large table.

2015-11-12 Thread Amit Kapila
On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada 
wrote:
>
>
> Thank you for reviewing the patch.
>
> I changed the patch so that the visibility map becomes the page info
> map, in the source code and documentation.
>

One thing to notice is that this almost doubles the patch size, which
might make it slightly more difficult to review, but on the other hand, if
nobody opposes such a change, this seems to be the right direction.

> And fixed review comments I received.
> Attached v22 patch.
>
> > I think both the above cases could happen for frozen state
> > as well, unless you think otherwise, we need similar handling
> > for frozen bit.
>
> The situation where a page is all-frozen but not all-visible cannot
> happen; the bits of the visibility map are cleared at the same time, and
> the page flags are as well.
> So I think it's enough to handle only the all-visible situation. Am I
> missing something?
>

No, I think you are right, as the information for both is cleared together
and all-visible is a superset of all-frozen (meaning if all-frozen is set,
then all-visible must be set), so it is sufficient to check the visibility
info in the above situation.  But I feel we can update the comment to
indicate the same and add an Assert to ensure that if all-frozen is set,
all-visible must be set.
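
Something along these lines, at the places where the frozen bit is set or
tested (a sketch only; the flag names follow the visibility-map style and
may not match the patch exactly):

    /* all-frozen must imply all-visible */
    Assert(!(flags & VISIBILITYMAP_ALL_FROZEN) ||
           (flags & VISIBILITYMAP_ALL_VISIBLE));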


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2015-11-12 Thread Kouhei Kaigai
> > On Wed, Nov 11, 2015 at 3:29 PM, Andres Freund  wrote:
> > > On 2015-11-11 14:59:33 -0500, Robert Haas wrote:
> > >> I don't see this as being a particularly good idea.  The same issue
> > >> exists for FDWs, and we're just living with it in that case.
> > >
> > > It's absolutely horrible there. I don't see why that's a justification
> > > for much.  To deal with the lack of extensible copy/out/readfuncs I've
> > > just had to copy the entirety of readfuncs.c into an extension. Or you
> > > build replacements for those (as e.g. postgres_fdw essentially has
> > > done).
> > >
> > >> If we do want to improve it, I'm not sure this is the way to go,
> > >> either.  I think there could be other designs where we focus on making
> > >> the serialization and deserialization options better, rather than
> > >> letting people tack stuff onto the struct.
> > >
> > > Just better serialization doesn't actually help all that much. Being
> > > able to conveniently access data directly, i.e. as fields in a struct,
> > > makes code rather more readable. And in many cases more
> > > efficient. Having to serialize internal datastructures unconditionally,
> > > just so copyfuncs.c works if actually used, makes for a fair amount of
> > > inefficiency (forced deserialization, even when not copying) and uglier
> > > code.
> >
> > Fair enough, but I'm still not very interested in having something for
> > CustomScan only.
> >
> I agree that we have no reason why only custom-scan should be allowed to
> have serialize/deserialize capability. I can implement equivalent stuff
> for foreign-scan also, and it is helpful for extension authors, especially
> those who try to implement a remote join feature, because the FDW driver
> has to keep a larger number of private fields than a plain scan does.
>
I tried to make a patch to support serialization/deserialization on both
ForeignScan and CustomScan, but have not tested it yet.

One exceptional case for ForeignScan is ForeignPath in outfuncs.c, because
ForeignPath itself does not have an identifier with which to get the
callback functions (it is kept in RelOptInfo; that is sufficient in the
planning phase), thus we have no way to write it out if the ForeignPath is
part of a larger structure.
We ought to ignore it at this point. What is your opinion?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-foreign-custom-serialization.v1.patch
Description: pgsql-v9.6-foreign-custom-serialization.v1.patch



Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-11-12 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 13 Nov 2015 09:07:01 +0900, Masahiko Sawada  
wrote in 
> On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao  wrote:
> > On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada  
> > wrote:
...
> >> This patch is a tool for discussion, so I'm not going to fix this bug
> >> until getting consensus.
> >>
> >> We are still in discussion to find a solution that can get consensus.
> >> I felt that it's difficult to select between the two approaches within
> >> this development cycle, and there would not be time to implement such a
> >> big feature even if we selected one.
> >> But this feature is obviously needed by many users.
> >> So I'm considering a simpler and more extensible solution; the
> >> idea I posted is one of them.
> >> Another approach worth considering is just specifying the
> >> number of sync standbys. It can also cover the main use cases.
> >
> > Yes, it covers the main and simple use case like "I want to have multiple
> > synchronous replicas!". Even if we miss quorum commit in the first
> > version, the feature is still very useful.

+1

> It can cover not only the case you mentioned but also the main use case
> Michael mentioned, by setting the same application_name.
> And that first version patch is almost implemented, so it just needs to
> be reviewed.
> 
> I think that it would be good to implement the simple feature as the
> first version, and then refine the design based on opinions and
> feedback from more users and use-cases.

Yeah, I agree. And I have two proposals in this direction.

- Notation

 synchronous_standby_names, plus synchronous_replication_method as
 a variable to select other syntaxes, would probably draw no argument
 except over its name. But I feel synchronous_standby_num looks a bit
 too specific.

 I'd like to propose the following, if it doesn't reprise the argument on
 notation for replication definitions :p

 The following two GUCs would be enough to bear future expansion
 of the notation syntax and/or method.

 synchronous_standby_names :  as it is

 synchronous_replication_method:

   default is "1-priority", which means the same with the current
   meaning.  possible additional values so far would be,

"n-priority": the format of s_s_names is "n, , , ...",
  where n is the number of required acknowledges.

"n-quorum":   the format of s_s_names is the same as above, but
  it is read in quorum context.

 These can be expanded, for example, as follows, but that is for the future.

"complex" : Michael's format.
"json": JSON?
"json-ext": specify JSON in an external file.

Even after we have complex notations, I suppose that many use
cases are covered by the first three notations.
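
 For instance, under the proposed "n-priority" method, requiring
 acknowledgement from two of three standbys might be written as follows
 (a sketch of the proposed notation only, not an existing feature):

   synchronous_standby_names = '2, nodeA, nodeB, nodeC'
   synchronous_replication_method = 'n-priority'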


- Internal design

 What should be done in SyncRepReleaseWaiters() is to calculate a
 pair of LSNs that can be regarded as synced and decide whether
 *this* walsender has advanced the LSN pair, then try to
 release backends that wait for the LSNs *if* this walsender has
 advanced them.

 From that point of view, the proposed patch will make redundant
 attempts to release backends.

 In addition to that, the patch looks to be a mixture of the current
 implementation and the new feature. These are for the same objective,
 so they cannot coexist with each other, I think. As a result, code
 for both quorum and priority judgement appears at multiple levels in
 the call tree. This would be an obstacle for future (possible)
 expansion.

 So, I think this feature should be implemented as follows:

 SyncRepInitConfig reads the configuration and stores the resulting
 structure somewhere such as WalSnd->syncrepset_definition,
 instead of WalSnd->sync_standby_priority, which should be
 removed. Nothing would be stored if the current walsender is
 not a member of the defined replication set. Storing a pointer
 to a matching function there would increase the flexibility, but
 such an implementation would in contrast make the code difficult
 to read.. (I often have to look for the entity of
 xlogreader->read_page() ;)

 Then SyncRepSyncedLsnAdvancedTo(), instead of
 SyncRepGetSynchronousStandbys(), returns an LSN pair that can be
 regarded as 'synced' according to the specified definition of the
 replication set, and whether this walsender has advanced the
 LSNs.

 Finally, SyncRepReleaseWaiters() uses it to release backends if
 needed.

 The differences among quorum/priority and others are confined to
 SyncRepSyncedLsnAdvancedTo(). As a result,
 SyncRepReleaseWaiters() would look like the following.

 | SyncRepReleaseWaiters(void)
 | {
 |   if (MyWalSnd->syncrepset_definition == NULL || ...)
 |  return;
 |   ...
 |   if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
 |   {
 | /* I haven't advanced the synced LSNs */
 | LWLockRelease(SyncRepLock);
 | return;
 |   }
 |   /* Set the lsn first so that when we wake backends they will release...

 I have not thought concretely about what SyncRepSyncedLsnAdvancedTo()
 does, but perhaps we can do it :p in an effective manner..

 What do you think about this?


regards,

-- 

Re: [HACKERS] Parallel Seq Scan

2015-11-12 Thread Amit Kapila
On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown  wrote:
>
> On 12 November 2015 at 15:23, Amit Kapila  wrote:
> > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule 
> > wrote:
> >>
> >> Hi
> >>
> >> I have a first query
> >>
> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
> >> different
> >>
> >
> > Thanks for the report.  The reason for this problem is that
> > instrumentation information from workers is getting aggregated multiple
> > times.  In ExecShutdownGatherWorkers(), we call ExecParallelFinish where
> > it will wait for workers to finish and then accumulate stats from
> > workers.  Now ExecShutdownGatherWorkers() could be called multiple times
> > (once we read all tuples from workers, at end of node) and it should be
> > ensured that repeated calls do not redo the work done by the first call.
> > The same is ensured for tuplequeues, but not for the parallel executor
> > info.  I think we can safely assume that we need to call
> > ExecParallelFinish() only when there are workers started by the Gather
> > node, so on those lines the attached patch should fix the problem.
>
> That fixes the count issue for me, although not the number of buffers
> hit,
>

The number of shared buffers hit could be different across different runs
because the read sequence of the parallel workers can't be guaranteed; I
don't think the same is even guaranteed for a Seq Scan node, as the other
operations running in parallel could lead to a different number.  However,
the actual problem was that in one of the plans shown by you [1], the
buffers hit at the Gather node (175288) is less than the buffers hit at
the Parallel Seq Scan node (241201).  Do you still (after applying the
above patch) see that the Gather node is showing fewer hit buffers than
the Parallel Seq Scan node?

[1]
 # explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';
                                 QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual
time=382.987..382.987 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175288
   ->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual
time=382.983..382.983 rows=0 loops=1)
 Output: content
 Number of Workers: 1
 Buffers: shared hit=175288
 ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24
rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
   Output: content
   Filter: (((js.content -> 'tags'::text) ->>
'title'::text) ~~ '%design%'::text)
   Rows Removed by Filter: 1724810
   Buffers: shared hit=241201
 Planning time: 0.104 ms
 Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

> or the actual time taken.
>

Exactly which time are you referring to here, the Execution Time or the
actual time shown on the Parallel Seq Scan node, and what problem do you
see with the reported time?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-12 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Thursday, November 12, 2015 6:54 PM
> To: Kaigai Kouhei(海外 浩平); Robert Haas
> Cc: Tom Lane; Kyotaro HORIGUCHI; pgsql-hackers@postgresql.org; Shigeru Hanada
> Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> 
> Robert and Kaigai-san,
> 
> Sorry, I sent in an unfinished email.
> 
> On 2015/11/12 15:30, Kouhei Kaigai wrote:
> >> On 2015/11/12 2:53, Robert Haas wrote:
> >>> On Sun, Nov 8, 2015 at 11:13 PM, Etsuro Fujita
> >>>  wrote:
>  To test this change, I think we should update the postgres_fdw patch
>  so as to add the RecheckForeignScan.
> 
>  Having said that, as I said previously, I don't see much value in
>  adding the callback routine, to be honest.  I know KaiGai-san considers
>  that that would be useful for custom joins, but I don't think that that
>  would be useful even for foreign joins, because I think that in case of
>  foreign joins, the practical implementation of that routine in FDWs
>  would be to create a secondary plan and execute that plan by performing
>  ExecProcNode, as my patch does [1].  Maybe I'm missing something,
>  though.
> 
> >>> I really don't see why you're fighting on this point.  Making this a
> >>> generic feature will require only a few extra lines of code for FDW
> >>> authors.  If this were going to cause some great inconvenience for FDW
> >>> authors, then I'd agree it isn't worth it.  But I see zero evidence
> >>> that this is actually the case.
> 
> >> Really?  I think there would be not a little burden on an FDW author;
> >> when postgres_fdw delegates the subplan to the remote server, for
> >> example, it would need to create a remote join query by looking at
> >> tuples possibly fetched and stored in estate->es_epqTuple[], send the
> >> query and receive the result during the callback routine.
> 
> > I cannot understand why it is the only solution.
> 
> I didn't say that.
> 
> >> Furthermore,
> >> what I'm most concerned about is that wouldn't be efficient. So, my
> 
> > You have to add a "because ..." sentence here, because Robert and I
> > think a little inefficiency is not a problem.
> 
> Sorry, my explanation was not enough.  The reason for that is that in
> the above postgres_fdw case for example, the overhead in sending the
> query to the remote end and transferring the result to the local end
> would not be negligible.  Yeah, we might be able to apply a special
> handling for the improved efficiency when using early row locking, but
> otherwise can we do the same thing?
>
It is a trade-off. Late locking semantics allow locking a relatively
smaller number of remote rows, but it will incur extra latency.
Also, it became clear we have a challenge to pull a joined tuple at once.

> > Please don't start the sentence with "I think ...". We all know
> > your opinion, but what I've wanted to see is "the reason why my
> > approach is valuable is ...".
> 
> I didn't say that my approach is *valuable* either.  What I think is, I
> see zero evidence that there is a good use-case for an FDW to do
> something other than doing an ExecProcNode in the callback routine, as I
> said below, so I don't see the need to add such a routine while that
> would cause maybe not a large, but not a little burden for writing such
> a routine on FDW authors.
>
It is quite natural because we cannot predict what kind of extension
is implemented on the FDW interface. You might know the initial version of
PG-Strom was implemented on FDW (about 4 years ago...). If I had
continued to stick with FDW, it would have become an FDW driver with its
own join engine. (cstore_fdw may potentially support its own join logic
on top of its columnar storage, for instance?)

From the standpoint of interface design, if we would not admit flexibility
of implementation unless the community sees a working example, a reasonable
tactic *for an extension author* is to follow the interface restriction even
if it is not the best approach from his standpoint.
It does not mean the approach of the majority is also best for the minority.
It just requires a compromise from the minority.

> > Nobody prohibits postgres_fdw from performing a secondary join here.
> > All you need to do is pick up a sub-plan tree from the FDW's private
> > field and then call ExecProcNode() inside the callback.
> 
> >> As I said before, I know that KaiGai-san considers that
> >> that approach would be useful for custom joins.  But I see zero evidence
> >> that there is a good use-case for an FDW.
> 
> >>>  From my point of view I'm now
> >>> thinking this solution has two parts:
> >>>
> >>> (1) Let foreign scans have inner and outer subplans.  For this
> >>> purpose, we only need one, but it's no more work to enable both, so we
> >>> may as well.  If we had some reason, we could add a list of subplans
> >>> of arbitrary length, but there doesn't seem to be an urgent need for
> >>> that.
> 
> I did the same thing in an

[HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2015-11-12 Thread Peter Geoghegan
We lack SortSupport for many character-like-type cases. In full, the
cases within the core system are:

* char(n) opfamily (bpchar_ops).

* text_pattern_ops opfamily (includes text and varchar "pattern"
opclasses, which are generally recommended for accelerating LIKE
operator queries).

* bpchar_pattern_ops -- the operator family/class for char(n), used
where "pattern" style indexing is required for the char(n) type.

* bytea default opclass. This is a type that, like the others, shares
its representation with text (a varlena header and some data bytes --
a string, essentially). Its comparator already behaves identically to
that of the text comparator when the "C" collation is used. I've
actually seen a specific request for this [1].

These cases do matter. For one thing, even if they're less important
than the default text/varchar opclasses, having such large
inconsistencies in character type sort performance is a fairly major
POLA violation; opclasses like text_pattern_ops are *supposed* to be
faster though less capable alternatives to the defaults. For another,
char(n) is in the SQL standard, which may be why all TPC benchmarks
use char(n) for columns that are sorted on or used for grouping.
char(n) sorting can be made about 8x faster with
SortSupport/abbreviation, making it the best candidate for
abbreviation optimization that I've ever seen. It would be regrettable
if we accidentally lost a benchmark due to not having char(n)
SortSupport.

Attached patch adds SortSupport for all of the cases listed above.
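
To illustrate the general idea (this is not code from the patch): under the
"C" locale, an abbreviated key for a string can simply be its first
sizeof(Datum) bytes packed big-endian, so that comparing two keys as
unsigned integers agrees with memcmp() on the original bytes:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch only: zero-pad short strings, pack big-endian.
     * Equal abbreviated keys mean "fall back to comparing full strings". */
    static uint64_t
    abbrev_key_c_locale(const char *s, size_t len)
    {
        unsigned char buf[8] = {0};
        uint64_t    key = 0;
        size_t      n = len < sizeof(buf) ? len : sizeof(buf);
        int         i;

        memcpy(buf, s, n);
        for (i = 0; i < 8; i++)
            key = (key << 8) | buf[i];
        return key;
    }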

I did not bother doing anything with contrib/citext, because I think
it's not worth it. I think that we should definitely invest in case
insensitive collations, and retire contrib/citext. The fact that the
*default* collation (and not the input collation) is passed by
citextcmp()'s call to str_tolower() suggests to me that the only way
to make citext do the right thing is to basically introduce case
insensitive collations to Postgres. Of course, doing so would make
contrib/citext obsolete.

I also didn't bother extending the name type's SortSupport (something
that has existed since 9.2) to perform abbreviation, although that
wouldn't be very hard. I saw no point.

I think I might also get around to adding abbreviated key support for
the network address types during the 9.6 cycle. That seems like
something a less experienced contributor could easily pick up, though
-- if anyone wants to take it off my hands, please do so.

[1] https://lwn.net/Articles/653721/
-- 
Peter Geoghegan
From 4b6cfc423ab2d586af987a410b2706cdfc02d9fb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 8 Nov 2015 15:34:32 -0800
Subject: [PATCH] Generalize SortSupport for text

Expose interface that allows char(n) and bytea types to piggyback on a
now-generalized SortSupport for text.  This pushes a little more
knowledge of the bpchar/char(n) type into varlena.c than might be
preferred, but that seems like the approach that creates least friction.

Also, accelerate index builds that use the text_pattern_ops opfamily
(text_pattern_ops and varchar_pattern_ops opclasses), and the
bpchar_pattern_ops opfamily (which has one opclass, also named
bpchar_pattern_ops) by introducing SortSupport.  These work just like
the bytea caller of the generalized SortSupport for text interface --
the "C" locale is forced.
---
 doc/src/sgml/datatype.sgml  |   3 +-
 src/backend/utils/adt/varchar.c |  56 ++-
 src/backend/utils/adt/varlena.c | 357 ++--
 src/include/catalog/pg_amproc.h |   4 +
 src/include/catalog/pg_proc.h   |   8 +
 src/include/utils/builtins.h|   6 +
 src/include/utils/bytea.h   |   1 +
 7 files changed, 303 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index b5191f4..0ee358b 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1140,8 +1140,7 @@ SELECT '52093.89'::money::numeric::float8;
  advantages in some other database systems, there is no such advantage in
  PostgreSQL; in fact
  character(n) is usually the slowest of
- the three because of its additional storage costs and slower
- sorting.  In most situations
+ the three because of its additional storage costs.  In most situations
  text or character varying should be used
  instead.
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index df9a2d7..b3a1cdf 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -17,10 +17,12 @@
 
 #include "access/hash.h"
 #include "access/tuptoaster.h"
+#include "catalog/pg_collation.h"
 #include "libpq/pqformat.h"
 #include "nodes/nodeFuncs.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/sortsupport.h"
 #include "mb/pg_wchar.h"
 
 
@@ -649,14 +651,21 @@ varchartypmodout(PG_FUNCTION_ARGS)
  */
 
 /* "True" length

Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-11-12 Thread Masahiko Sawada
On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao  wrote:
> On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada  
> wrote:
>> On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson  
>> wrote:
>>>
>>> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada 
>>> wrote:


 Hi,

 Attached patch is a rough patch which supports multi sync replication
 by another approach I sent before.

 The new GUC parameters are:
 * synchronous_standby_num, which specifies the number of standby
 servers using sync rep. (default is 0)
 * synchronous_replication_method, which specifies replication method;
 priority or quorum. (default is priority)

 The behaviour of 'priority' and 'quorum' is the same as what we've been
 discussing, but I'll write an overview of them here again.

 [Priority Method]
 Each standby server has a different priority, and the active standby
 servers having the top N priorities become sync standby servers.
 If synchronous_standby_names = '*', all active standby servers
 would be sync standby servers.
 If you want to set up a standby like 9.5 or before, you can set
 synchronous_standby_num = 1.

>>>
>>>
>>> I used the following setting with 2 servers A and D connected:
>>>
>>> synchronous_standby_names = 'A,B,C,D'
>>> synchronous_standby_num = 2
>>> synchronous_replication_method = 'priority'
>>>
>>> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused
>>> segmentation fault.
>>>
>>
>> Thank you for taking a look!
>> This patch is a tool for discussion, so I'm not going to fix this bug
>> until getting consensus.
>>
>> We are still in discussion to find a solution that can get consensus.
>> I felt that it's difficult to select between the two approaches within
>> this development cycle, and there would not be time to implement such a
>> big feature even if we selected one.
>> But this feature is obviously needed by many users.
>> So I'm considering a simpler and more extensible solution; the
>> idea I posted is one of them.
>> Another approach worth considering is just specifying the
>> number of sync standbys. It can also cover the main use cases.
>
> Yes, it covers the main and simple use case like "I want to have multiple
> synchronous replicas!". Even if we miss quorum commit in the first
> version, the feature is still very useful.

It can cover not only the case you mentioned but also the main use case
Michael mentioned, by setting the same application_name.
And that first version patch is almost implemented, so it just needs to
be reviewed.

I think that it would be good to implement the simple feature as the
first version, and then refine the design based on opinions and
feedback from more users and use-cases.

Regards,

--
Masahiko Sawada




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Tom Lane
Peter Eisentraut  writes:
> Plus we already have \pset numericlocale as a similar feature in psql.

But \pset numericlocale is also a crock.  It doesn't affect COPY output
for instance, and its ability to identify which data types it should apply
to is really shaky.  And it's virtually unused, as demonstrated by the
fact that serious bugs in it went undetected for many years (cf 4778a0bda,
77130fc14).  That's a really poor advertisement for the usefulness of the
proposed feature.

regards, tom lane




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-12 Thread Tom Lane
Dean Rasheed  writes:
> These results are based on the attached, updated patch which includes
> a few minor improvements.

I started to look at this patch, and was immediately bemused by the
comment in estimate_ln_weight:

/*
 * 0.9 <= var <= 1.1
 *
 * Its logarithm has a negative weight (possibly very large). Estimate
 * it using ln(var) = ln(1+x) = x + O(x^2) ~= x.
 */

This comment's nonsense of course: ln(0.9) is about -0.105, and ln(1.1) is
about 0.095, which is not even negative much less "very large".  We could
just replace that with "For values reasonably close to 1, we can estimate
the result using ln(var) = ln(1+x) ~= x."  I am wondering what's the point
though: why not flush this entire branch in favor of always using the
generic case?  I rather doubt that on any modern machine two uses of
cmp_var plus init_var/sub_var/free_var are going to be cheaper than the
log() call you save.  You would need a different way to special-case 1.0,
but that's not expensive.

Larger questions are

(1) why the Abs() in the specification of estimate_ln_weight --- that
doesn't comport with the text about "Estimate the weight of the most
significant digit".  (The branch I'm proposing you remove fails to
honor that anyway.)

(2) what should the behavior be for input == 1, and why?  The code
is returning zero, but that seems inconsistent or at least
underdocumented.

(3) if it's throwing an error for zero input, why not for negative
input?  I'm not sure that either behavior is within the purview of
this function, anyway.  Seems like returning zero might be fine.

regards, tom lane




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Peter Eisentraut
On 10/29/15 9:50 AM, Tom Lane wrote:
> The really key argument that hasn't been addressed here is why does such
> a behavior belong in psql, rather than elsewhere?

Because it is the job of the client to mangle the data so that it suits
the purposes of the client.  What comes over the wire is part of the
protocol, not part of the client display.  Language drivers do this sort
of mangling all the time.

> Surely legibility problems aren't unique to psql users.

But the way to deal with it is specific to psql.  In a graphical
environment you might want a checkbox instead, say.  Different clients
will have different requirements and preferences.

> Moreover, there are exactly
> parallel facilities for other datatypes on the server side: think
> DateStyle or bytea_output.

Surely DateStyle is legacy, and bytea_output was actually a performance
feature.  And clients will still mangle either type into their own formats.

Plus we already have \pset numericlocale as a similar feature in psql.





Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread David G. Johnston
On Thu, Nov 12, 2015 at 1:04 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Thu, Oct 29, 2015 at 6:50 AM, Tom Lane  wrote:
>
>> Marko Tiikkaja  writes:
>> > On 10/29/15 11:51 AM, Daniel Verite wrote:
>> >> Personally I think it would be worth having, but how about
>> >> booleans inside ROW() or composite types ?
>>
>> > There's not enough information sent over to do that in the client.
>> > Note that this works the same way as  \pset null  with  SELECT
>> > ROW(NULL), so I don't consider it a show stopper for the patch.
>>
>> The problem with that argument is that \pset null is already a kluge
>> (but at least a datatype-independent one).  Now you've added a datatype
>> specific kluge of the same ilk.  It might be useful, it might be short,
>> but that doesn't make it not a kluge.
>>
>> The really key argument that hasn't been addressed here is why does such
>> a behavior belong in psql, rather than elsewhere?  Surely legibility
>> problems aren't unique to psql users.  Moreover, there are exactly
>> parallel facilities for other datatypes on the server side: think
>> DateStyle
>
>
> Which provides a finite set of possible values.
>
>
>
>> or bytea_output.
>
>
> Wasn't this added mostly for performance as opposed to dealing with
> "locale/style" considerations?
>
> So if you were trying to follow precedent
>> rather than invent a kluge, you'd have submitted a patch to create a GUC
>> that changes the output of boolout().
>>
>>
> I'm leaning toward doing this in the client if it's offered at all.  An
> unobtrusive usability enhancement - even if limited to non-embedded
> situations - that seems like little effort for a measurable gain.
>
>
That said, whatever is done really wants to be able to interact with at
least "psql \copy" and probably "SQL COPY" as well... again, even if just
for non-composite outputs.

David J.


Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread David G. Johnston
On Thu, Oct 29, 2015 at 6:50 AM, Tom Lane  wrote:

> Marko Tiikkaja  writes:
> > On 10/29/15 11:51 AM, Daniel Verite wrote:
> >> Personally I think it would be worth having, but how about
> >> booleans inside ROW() or composite types ?
>
> > There's not enough information sent over to do that in the client.
> > Note that this works the same way as  \pset null  with  SELECT
> > ROW(NULL), so I don't consider it a show stopper for the patch.
>
> The problem with that argument is that \pset null is already a kluge
> (but at least a datatype-independent one).  Now you've added a datatype
> specific kluge of the same ilk.  It might be useful, it might be short,
> but that doesn't make it not a kluge.
>
> The really key argument that hasn't been addressed here is why does such
> a behavior belong in psql, rather than elsewhere?  Surely legibility
> problems aren't unique to psql users.  Moreover, there are exactly
> parallel facilities for other datatypes on the server side: think
> DateStyle


Which provides a finite set of possible values.


> or bytea_output.


Wasn't this added mostly for performance as opposed to dealing with
"locale/style" considerations?

So if you were trying to follow precedent
> rather than invent a kluge, you'd have submitted a patch to create a GUC
> that changes the output of boolout().
>
>
I'm leaning toward doing this in the client if it's offered at all.  An
unobtrusive usability enhancement - even if limited to non-embedded
situations - that seems like little effort for a measurable gain.

David J.


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-12 Thread Robert Haas
On Wed, Nov 11, 2015 at 6:50 AM, Ildus Kurbangaliev
 wrote:
> Attached a new version of the patch that moves SLRU tranches and LWLocks to
> SLRU control structs.
>
> The `buffer_locks` field now contains the LWLocks themselves, so we save
> some memory here. `pstrdup` was removed in SimpleLruInit. I didn't
> change the names from the previous patch yet, but I don't mind if they
> are changed.

I've committed this with some modifications.  I did a little bit of
renaming, and I stored a copy of the SLRU name in shared memory.
Otherwise, we're relying on the fact that a static string will be at
the same backend-private address in every process, which is a
dangerous assumption at best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread David G. Johnston
On Thu, Oct 29, 2015 at 5:28 AM, Matthijs van der Vleuten 
wrote:

> I have had exactly this situation a week ago. I was testing the output of
> an algorithm that is supposed to have exactly one true value per input id.
>

If this is particularly important I would add something like (and(col1,
col2, col3, ...) = true) and/or NULLIF((not(col1)::int + not(col2)::int
..., 0) to the grid and check/test those instead of visually scanning the
output.

If the pretty presentation is destined for final output I'd say you really
want to output text and write a function so that the logic is embedded in
the query and not a side-effect of a specific environment.

David J.


Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Daniel Verite
Matthijs van der Vleuten wrote:

> -1 for changing boolout(). It will break anything that receives 
> boolean values from the server.

Not if the default output is still 't' or 'f' like now.
Nobody seems to suggest a gratuitous compatibility break.

>  How a client is going to display values (of any
> type) is logic that should belong in the client, not in the protocol.

Maybe in general, but consider two problems here:

#1: postgres type extensibility implies that clients
display values from types they know nothing about.
It makes sense that they just use the text representation
that comes from the server, unless they have a specific
reason not to.

#2: in the case of built-in types, like boolean, it's not sufficient
to change how the base type gets displayed.
With Marko's patch, array[true,false] is still displayed as {t,f}
when setting alternatives through the proposed \pset
feature. The same problem occurs inside composite types, as mentioned
upthread.

The issue with the patch's approach is that it falls short of
identifying bools in all situations, so the results are inconsistent.
Solving that in psql is hard or possibly unrealistic, because
the reliance on the text representation for complex types goes deep.

By contrast, making the server-side representation configurable seems
easier. Plus it might be of interest for other consumers of resultsets,
outside of psql.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Fabien COELHO



>>> Basically yes, I'm suggesting a mutex in the VFD struct.
>>
>> I can't see that being ok. I mean what would that thing even do? VFD
>> isn't shared between processes, and if we get a smgr flush we have to
>> apply it, or risk breaking other things.
>
> Probably something is eluding my comprehension:-)
>
> My basic assumption is that the fopen & fd is per process, so we just have
> to deal with the one in the checkpointer process, so it is enough that the
> checkpointer does not close the file while it is flushing things to it?


Hmmm...

Maybe I'm a little bit too optimistic here, because it seems that I'm
suggesting creating a deadlock if the checkpointer has buffers to flush
waiting on a file and wishes to close that very same file.

So on wanting to close the file, the checkpointer should rather flush the
outstanding flushes in waiting and then close the fd, which suggests some
global variable to hold the flush context so that this can be done.


Hmmm.

--
Fabien.




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-12 Thread Thomas Munro
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs  wrote:

> On 11 November 2015 at 09:22, Thomas Munro 
> wrote:
>
>
>> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
>> BerkeleyDB does in "read-your-writes" mode.  It means that application
>> developers have the responsibility for correctly identifying transactions
>> with causal dependencies and dealing with LSNs (or whatever equivalent
>> tokens), potentially even passing them to other processes where the
>> transactions are causally dependent but run by multiple communicating
>> clients (for example, communicating microservices).  This makes it
>> difficult to retrofit load balancing to pre-existing applications and (like
>> anything involving concurrency) difficult to reason about as applications
>> grow in size and complexity.  It is efficient if done correctly, but it is
>> a tax on application complexity.
>>
>
> Agreed. This works if you have a single transaction connected thru a pool
> that does statement-level load balancing, so it works in both session and
> transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more
> thoughts now.
>
> We must also consider the need for serialization across sessions or
> transactions.
>
> In transaction pooling mode, an application could get assigned a different
> session, so a token would be much harder to pass around.
>

Sorry for the double reply, I just wanted to add a couple more thoughts.

As discussed elsewhere in the thread, I think it makes absolute sense to
offer some kind of support for causality tokens, I don't see that on its
own as enough for most users.  (At the least, it would be good to have
pg_wait_for_xlog_replay_location(lsn, timeout), but perhaps explicit BEGIN
syntax as suggested by Heikki, or a new field in the libpq protocol which
can be attached to any statement, and likewise for the commit LSN of
results).

It's true that a pooling system/middleware could spy on your sessions and
insert causality token handling imposing a global ordering of visibility
for you, so that naive users don't have to deal with them.  Whenever it
sees a COMMIT result (assuming they are taught to return LSNs), it could
update a highest-LSN-seen variable, and transparently insert a wait for
that LSN into every transaction that it sees beginning.  But then you would
have to push all your queries through a single point that can see
everything across all Postgres servers, and maintain this global high LSN.
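
A sketch of what such a middleware might do (purely hypothetical; no such
pooler exists today, and it assumes COMMIT results are taught to carry an
LSN as suggested above):

    /* One global visibility horizon across all Postgres servers. */
    static XLogRecPtr highest_lsn_seen = 0;

    static void
    on_commit_result(XLogRecPtr commit_lsn)
    {
        if (commit_lsn > highest_lsn_seen)
            highest_lsn_seen = commit_lsn;
    }

    static void
    on_transaction_begin(Connection *standby)
    {
        /* Hypothetical call: block the session until the standby has
         * replayed everything any client has seen committed. */
        send_wait_for_replay(standby, highest_lsn_seen);
    }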

In contrast, my writer-waits proposal makes different trade-offs to provide
causal reads as a built-in feature without an external single point
observer of all transactions.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Erroneous cost estimation for nested loop join

2015-11-12 Thread Jeff Janes
On Mon, Nov 9, 2015 at 2:42 AM, Simon Riggs  wrote:
> On 9 November 2015 at 10:08,  wrote:
>>
>>
>> We guessed the cause of this error would be in the cost model of Postgres,
>> and investigated the source code of optimizer, and we found the cause of
>> this problem. It was in the index cost estimation process. On scanning inner
>> table, if loop count is greater than 1, its I/O cost is counted as random
>> access. In the case of Query2, in one loop (i.e. one inner table scan) ,
>> much data is read sequentially with clustered index, so it seems to be wrong
>> that optimizer thinks its I/O workload is random access.
>
>
> We don't have a clustered index in Postgres. We do store correlation stats
> for the index, which is used in some places to reduce cost.
>
> Could you look some more at this and see if a model change that uses the
> correlation can improve this?

That is already happening.  min_IO_cost is set on the assumption of
perfect correlation, max_IO_cost is set on the assumption of no
correlation, and then later in the code it interpolates between these
two based on the observed correlation:

run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

But the min_IO_cost is very pessimistic in this particular case.  So I
think that their proposed change is already exactly what you are
asking them to try.
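
(To illustrate with made-up numbers: with max_IO_cost = 1000, min_IO_cost =
10 and an index correlation of 0.9, csquared = 0.81 and the interpolated
cost is 1000 + 0.81 * (10 - 1000), which is about 198 -- still roughly 20x
the perfectly-correlated estimate even at high correlation.)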

Cheers,

Jeff




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-12 Thread Thomas Munro
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs  wrote:

> On 11 November 2015 at 09:22, Thomas Munro 
> wrote:
>
>
>> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
>> BerkeleyDB does in "read-your-writes" mode.  It means that application
>> developers have the responsibility for correctly identifying transactions
>> with causal dependencies and dealing with LSNs (or whatever equivalent
>> tokens), potentially even passing them to other processes where the
>> transactions are causally dependent but run by multiple communicating
>> clients (for example, communicating microservices).  This makes it
>> difficult to retrofit load balancing to pre-existing applications and (like
>> anything involving concurrency) difficult to reason about as applications
>> grow in size and complexity.  It is efficient if done correctly, but it is
>> a tax on application complexity.
>>
>
> Agreed. This works if you have a single transaction connected thru a pool
> that does statement-level load balancing, so it works in both session and
> transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more
> thoughts now.
>
> We must also consider the need for serialization across sessions or
> transactions.
>
> In transaction pooling mode, an application could get assigned a different
> session, so a token would be much harder to pass around.
>
> 2.  Reader waits for a conservatively chosen LSN.  This is roughly what
>> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
>> 1" modes.  Read transactions would start off by finding the current end
>> of WAL on the primary, since that must be later than any commit that
>> already completed, and then waiting for that to apply locally.  That means
>> every read transaction waits for a complete replication lag period,
>> potentially unnecessarily.  This is tax on readers with unnecessary waiting.
>>
>
> This tries to make it easier for users by forcing all users to experience
> a causality delay. Given the whole purpose of multi-node load balancing is
> performance, referencing the master again simply defeats any performance
> gain, so you couldn't ever use it for all sessions. It could be a USERSET
> parameter, so could be turned off in most cases that didn't need it.  But
> its easier to use than (1).
>
> Though this should be implemented in the pooler.
>
> 3.  Writer waits, as proposed.  In this model, there is no tax on readers
>> (they have zero overhead, aside from the added complexity of dealing with
>> the possibility of transactions being rejected when a standby falls behind
>> and is dropped from 'available' status; but database clients must already
>> deal with certain types of rare rejected queries/failures such as
>> deadlocks, serialization failures, server restarts etc).  This is a tax on
>> writers.
>>
>
> This would seem to require that all readers must first check with the
> master as to which standbys are now considered available, so it looks like
> (2).
>

No -- in (3), that is this proposal, standbys don't check with the primary
when you run a transaction.  Instead, the primary sends a constant stream
of authorizations (in the form of keepalives sent every
causal_reads_timeout / 2 in the current patch) to the standby, allowing it
to consider itself available for a short time into the future (currently
now + causal_reads_timeout - max_tolerable_clock_skew to be specific -- I
can elaborate on that logic in a separate email).  At the start of a
transaction in causal reads mode (the first call to GetTransaction to be
specific), the standby knows immediately without communicating with the
primary whether it can proceed or must raise the error.  In the happy case,
the reader simply compares the most recently received authorization's
expiry time with the system clock and proceeds.  In the worst case, when
contact is lost between primary and standby, the primary must stall
causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall).
Doing that makes sure that no causal reads commit can return (see
CausalReadsCommitCanReturn) before the lost standby has definitely started
raising the error for causal_reads queries (because its most recent
authorization has expired), in case it is still alive and handling requests
from clients.
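
In other words, the standby-side test at transaction start reduces to
something like this sketch (variable names are illustrative, not the
patch's; the primary-side logic lives in CausalReadsBeginStall /
CausalReadsCommitCanReturn):

    /* Proceed only while we hold an unexpired authorization ("lease")
     * from the primary; otherwise reject the causal reads transaction. */
    if (causal_reads && GetCurrentTimestamp() >= last_authorization_expiry)
        ereport(ERROR,
                (errmsg("standby is not available for causal reads")));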

It is not at all like (2), which introduces a conservative wait at the
start of every read transaction, slowing all readers down.  In (3), readers
don't wait, they run (or are rejected) as fast as possible, but instead the
primary has to do extra things.  Hence my categorization of (2) as a 'tax
on readers', and of (3) as a 'tax on writers'.  The idea is that a site
with a high ratio of reads to writes would prefer zero-overhead reads.
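
For contrast, a minimal sketch of what (2) amounts to on the standby side,
once the reader has fetched the primary's current end-of-WAL LSN (the helper
is hypothetical; GetXLogReplayRecPtr is the replay-position function of that
era's tree):

/* Hypothetical helper: block until replay reaches target_lsn. */
static void
wait_for_lsn(XLogRecPtr target_lsn)
{
  /* Every read pays up to a full replication lag period here. */
  while (GetXLogReplayRecPtr(NULL) < target_lsn)
    pg_usleep(1000L);
}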


> The alternative is that we simply send readers to any standby and allow
> the pool to work out separately whether the standby is still available,
> which mostly works, but it doesn't handle sporadic slowdowns on particular
> standbys v

Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Fabien COELHO


Hello,


Basically yes, I'm suggesting a mutex in the vfd struct.


I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.


Probably something is eluding my comprehension:-)

My basic assumption is that the fopen & fd are per process, so we just have
to deal with the one in the checkpointer process; is it then enough that the
checkpointer does not close the file while it is flushing things to it?



* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
shared_buffers. Tried checkpoint timeouts from 60 to 300s.


Hmmm. This is quite short.


Indeed. I'd never do that in a production scenario myself. But
nonetheless it showcases a problem.


I would say that it would render sorting ineffective because all the
rewriting is done by the bgwriter or the workers, but that does not totally
explain why the throughput would be worse than before; I would expect it to
be as bad as before...



Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.


Maybe the sorting could be shared with others so that everybody uses the
same order?

That would suggest having one global sort order of buffers, maybe maintained
by the checkpointer, which could be used by all processes that need to scan
the buffers (in file order), instead of scanning them in memory order.


Uh. Cache replacement is based on an approximated LRU, you can't just
remove that without serious regressions.


I understand that, but there is a balance to find. Generating random I/Os 
is very bad for performance, so the decision process must combine LRU/LFU 
heuristics with considering things in some order as well.



Hmmm. The shorter the timeout, the less likely the sorting is to be
effective


You mean, as evidenced by the results, or is that what you'd actually
expect?


What I would expect...


I don't see why then? If you write lots of data very quickly, the OS
will continuously flush dirty data to the disk, in which case sorting is
rather important?


What I have in mind is: the shorter the timeout, the fewer neighboring
buffers will be touched, so the fewer nice sequential writes will be found
by sorting, and the smaller the positive impact on performance...


--
Fabien.




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Pavel Stehule
2015-11-12 17:41 GMT+01:00 Matthijs van der Vleuten :

>
> On 12 Nov 2015, at 14:21, Brendan Jurd  wrote:
>
> On Fri, 30 Oct 2015 at 00:51 Tom Lane  wrote:
>
>> The really key argument that hasn't been addressed here is why does such
>> a behavior belong in psql, rather than elsewhere?  Surely legibility
>> problems aren't unique to psql users.  Moreover, there are exactly
>> parallel facilities for other datatypes on the server side: think
>> DateStyle or bytea_output.  So if you were trying to follow precedent
>> rather than invent a kluge, you'd have submitted a patch to create a GUC
>> that changes the output of boolout().
>>
>
> I find Tom's analogy to datestyle and bytea_output convincing.
>
> +1 for a GUC that changes the behaviour of boolout.
>
>
> -1 for changing boolout(). It will break anything that receives boolean
> values from the server. How a client is going to display values (of any
> type) is logic that should belong in the client, not in the protocol.
>

This is not fully valid, because the transformation from binary to text is
processed on the server side. And client-side transformation is easy only
for scalar values.

Regards

Pavel


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-11-12 Thread dinesh kumar
On Thu, Nov 12, 2015 at 4:35 AM, Michael Paquier 
wrote:

> On Sat, Oct 3, 2015 at 1:43 PM, dinesh kumar wrote:
> > We can also use "PROGRAM 'cat > Output.csv' " to achieve this "NO READ
> > ACCESS", since the program is always running as a instance owner.
> > Let me know your inputs and thoughts.
>
> That's one way. And as PROGRAM has the advantage of opening the file
> before the COPY has begun processing any tuples, the umask would be
> set before any data is written to it, so why complicate COPY with a
> new option when we already have something to apply restrictions to
> the output file? It seems that this patch is just an unnecessary
> complication, and I doubt that we would want to change the default
> behavior of 644 that has been here for ages.
>

Hi Michael,

Thanks for your inputs.

I am also against changing the default behavior, since I see some
advantages to having 644.

As pg_file_write and 'PROGRAM' use the instance owner's umask for the
output file, I believe we should have the same umask for COPY too.
I could be wrong here, but I don't see "PROGRAM" as an alternative to
this, since we have a separate TO clause which takes care of dumping
data into files.

Let me know if I'm wrong here.

--
> Michael
>



-- 

Regards,
Dinesh
manojadinesh.blogspot.com


Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Andres Freund
On 2015-11-12 17:44:40 +0100, Fabien COELHO wrote:
> 
> >>To fix it, ISTM that it is enough to hold a "do not close lock" on the file
> >>while a flush is in progress (a short time) that would prevent mdclose from
> >>doing its stuff.
> >
> >Could you expand a bit more on this? You're suggesting something like a
> >boolean in the vfd struct?
> 
> Basically yes, I'm suggesting a mutex in the vfd struct.

I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.

> >* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
> > shared_buffers. Tried checkpoint timeouts from 60 to 300s.
> 
> Hmmm. This is quite short.

Indeed. I'd never do that in a production scenario myself. But
nonetheless it showcases a problem.


> >Well, you can't easily sort bgwriter/backend writes stemming from cache
> >replacement. Unless your access patterns are entirely sequential the
> >data in shared buffers will be laid out in a nearly entirely random
> >order.  We could try sorting the data, but with any reasonable window,
> >for many workloads the likelihood of actually achieving much with that
> >seems low.
> 
> Maybe the sorting could be shared with others so that everybody uses the
> same order?
> 
> That would suggest having one global sort order of buffers, maybe maintained
> by the checkpointer, which could be used by all processes that need to scan
> the buffers (in file order), instead of scanning them in memory order.

Uh. Cache replacement is based on an approximated LRU, you can't just
remove that without serious regressions.


> >>Hmmm. The shorter the timeout, the less likely the sorting is to be
> >>effective
> >
> >You mean, as evidenced by the results, or is that what you'd actually
> >expect?
> 
> What I would expect...

I don't see why then? If you write lots of data very quickly, the OS
will continuously flush dirty data to the disk, in which case sorting is
rather important?


Greetings,

Andres Freund




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Vik Fearing
On 11/12/2015 05:41 PM, Matthijs van der Vleuten wrote:
> 
>> On 12 Nov 2015, at 14:21, Brendan Jurd  wrote:
>>
>> On Fri, 30 Oct 2015 at 00:51 Tom Lane wrote:
>> The really key argument that hasn't been addressed here is why does such
>> a behavior belong in psql, rather than elsewhere?  Surely legibility
>> problems aren't unique to psql users.  Moreover, there are exactly
>> parallel facilities for other datatypes on the server side: think
>> DateStyle or bytea_output.  So if you were trying to follow precedent
>> rather than invent a kluge, you'd have submitted a patch to create a GUC
>> that changes the output of boolout().
>>
>> I find Tom's analogy to datestyle and bytea_output convincing.
>>
>> +1 for a GUC that changes the behaviour of boolout. 
> 
> -1 for changing boolout(). It will break anything that receives boolean 
> values from the server. How a client is going to display values (of any type) 
> is logic that should belong in the client, not in the protocol.

I fully agree.  This is something I feel should happen in the client.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Fabien COELHO



To fix it, ISTM that it is enough to hold a "do not close lock" on the file
while a flush is in progress (a short time) that would prevent mdclose from
doing its stuff.


Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct?


Basically yes, I'm suggesting a mutex in the vfd struct.


If that, how would you deal with FileClose() being called?


Just wait for the mutex, which would be held while flushes are accumulated
into the flush context and released once the flush is performed and the
fd is no longer necessary for this purpose. That window is expected to be
short (at worst between the wake & sleep of the checkpointer, and just one
file at a time).
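
As a rough sketch of the idea (purely illustrative, and whether it can work
at all is exactly what is being debated here), the per-process Vfd could
carry a flag that FileClose() has to honour:

typedef struct vfd
{
  int   fd;                  /* existing fields elided */
  bool  flush_in_progress;   /* hypothetical "do not close" marker */
} Vfd;

/* FileClose() would wait until flush_in_progress is cleared, or defer
 * the actual close() until the flush context is done with the file. */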



I'm conscious that the patch only addresses *checkpointer* writes, not those
from the bgwriter or the backends. I agree that these should be
addressed at some point as well, but given the time to get a patch through,
the more complex the slower (sort propositions are 10 years old), I think
this should be postponed for later.


I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over checkpointer's
design as well.


Hmmm. See below.


What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to be a noticeable reduction in throughput. Up to
30%.


I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance regression?
(available memory, shared buffers, db size, ...).



I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from
occurring.

I'll show actual numbers at some point yes. I tried three different systems:

* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
 shared_buffers. Tried checkpoint timeouts from 60 to 300s.


Hmmm. This is quite short. I tend to do tests with much larger timeouts. I
would advise against a short timeout, especially in a high-throughput
system: the whole point of the checkpointer is to accumulate as many
changes as possible.


I'll look into that.


This explanation seems to suggest that if bgwriter/worker writes were sorted
and/or coordinated with the checkpointer somehow then all would be well?


Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.


Maybe the sorting could be shared with others so that everybody uses the 
same order?


That would suggest having one global sort order of buffers, maybe maintained
by the checkpointer, which could be used by all processes that need to
scan the buffers (in file order), instead of scanning them in memory
order.


For this purpose, I think that the initial index-based sorting would
suffice. It could be re-sorted periodically, with the delay maintained in a
GUC, or when significant buffer changes have occurred (reads & writes).



ISTM that this explanation could be checked by looking whether
bgwriter/workers writes are especially large compared to checkpointer writes
in those cases with reduced throughput? The data is in the log.


What do you mean by "large"? Numerous?


I mean that the number of buffers written by the bgwriter/workers is greater
than what is written by the checkpointer. If everything fits in shared
buffers, the bgwriter/workers mostly do not need to write anything and the
checkpointer does all the writes.


The larger the memory needed, the more likely workers/bgwriter will have
to kick in and generate random I/Os because nothing sensible is currently
done, so this is consistent with your findings, although I'm surprised
that it would have a large effect on throughput, as already said.



Hmmm. The shorter the timeout, the less likely the sorting is to be
effective


You mean, as evidenced by the results, or is that what you'd actually
expect?


What I would expect...

--
Fabien.




Re: [HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-12 Thread Tom Lane
Greg Stark  writes:
> On Thu, Nov 12, 2015 at 3:51 PM, Tom Lane  wrote:
>> I think we could fix the immediate issue by redeclaring numeric
>> headers as arrays of (u)int16 rather than structs.  I'm not
>> very excited about the packed-header case.

> That would require giving up the pretense that the code supports base
> 10 and base 100 I suppose.

No, not really.  If we redefine NumericVar as a uint16 array,
then we'd have n_header or n_sign_dscale as array[0],
n_weight as (int16) array[1], and n_data as (NumericDigit *) &array[1]
or (NumericDigit *) &array[2] depending.  Doesn't matter which
way NumericDigit is declared.
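
A rough sketch of that layout, with a hypothetical accessor name just to
make the indexing concrete:

uint16 *arr = (uint16 *) NUMERIC_HEADER_PTR(num);  /* hypothetical */
uint16  n_header = arr[0];            /* n_sign_dscale in the long format */
int16   n_weight = (int16) arr[1];    /* long format only */
NumericDigit *digits = NUMERIC_HEADER_IS_SHORT(num)
  ? (NumericDigit *) &arr[1]          /* short: weight bits live in arr[0] */
  : (NumericDigit *) &arr[2];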

regards, tom lane




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Matthijs van der Vleuten

> On 12 Nov 2015, at 14:21, Brendan Jurd  wrote:
> 
> On Fri, 30 Oct 2015 at 00:51 Tom Lane wrote:
> The really key argument that hasn't been addressed here is why does such
> a behavior belong in psql, rather than elsewhere?  Surely legibility
> problems aren't unique to psql users.  Moreover, there are exactly
> parallel facilities for other datatypes on the server side: think
> DateStyle or bytea_output.  So if you were trying to follow precedent
> rather than invent a kluge, you'd have submitted a patch to create a GUC
> that changes the output of boolout().
> 
> I find Tom's analogy to datestyle and bytea_output convincing.
> 
> +1 for a GUC that changes the behaviour of boolout. 

-1 for changing boolout(). It will break anything that receives boolean values 
from the server. How a client is going to display values (of any type) is logic 
that should belong in the client, not in the protocol.

Re: [HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-12 Thread Greg Stark
On Thu, Nov 12, 2015 at 3:51 PM, Tom Lane  wrote:
> Meh.  The palloc to create an aligned array of digits would eat up
> any possible performance win --- it'd be just about as expensive
> as the existing unpack operation.

I suppose we would only need to palloc the digits if we found they
were unaligned which would only happen if the varlena header was
packed. So if the varlena header wasn't packed (or if we were just
lucky with the packed alignment which would happen half the time) we
could share the bytes from the packed varlena in the buffer directly
in the var.


> I think we could fix the immediate issue by redeclaring numeric
> headers as arrays of (u)int16 rather than structs.  I'm not
> very excited about the packed-header case.

That would require giving up the pretense that the code supports base
10 and base 100 I suppose. And would still be doing a palloc/memcpy
for data smaller than 128 bytes.

-- 
greg




Re: [HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-12 Thread Tom Lane
Greg Stark  writes:
> On Thu, Nov 12, 2015 at 2:39 PM, Tom Lane  wrote:
>> Either that's a reportable compiler bug, or someplace nearby we've
>> casted the pointer to something that would require a 4-byte struct.
>> I'm not sure which code you're looking at exactly, but maybe we're
>> using "union NumericChoice" prematurely?

> It triggers on the line with the NUMERIC_WEIGHT() macro call in
> init_var_from_num():

Hmm.  I'd argue that that is a compiler bug, but I dunno if the LLVM
guys would see it that way.  The existence of other union members in
the declaration shouldn't license them to assume that the particular
instance we're accessing is large enough for any possible member.

> I think it wouldn't be too hard to just give up on the structs
> and unions and use a char * as the underlying type. We could access
> the meta information directly using byte accesses and memcpy the
> digits to an aligned array of digits when setting up the var.

Meh.  The palloc to create an aligned array of digits would eat up
any possible performance win --- it'd be just about as expensive
as the existing unpack operation.

I think we could fix the immediate issue by redeclaring numeric
headers as arrays of (u)int16 rather than structs.  I'm not
very excited about the packed-header case.

regards, tom lane




Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Andres Freund
Hi,

On 2015-11-12 15:31:41 +0100, Fabien COELHO wrote:
> >A lot of details have changed, I unfortunately don't remember them all.
> >But there are more important things than the details of the patch.
> >
> >I've played *a lot* with this patch. I found a bunch of issues:
> >
> >1) The FileFlushContext context infrastructure isn't actually
> >  correct. There's two problems: First, using the actual 'fd' number to
> >  reference a to-be-flushed file isn't meaningful. If there  are lots
> >  of files open, fds get reused within fd.c.
> 
> Hmm.
> 
> My assumption is that a file being used (i.e. with modified pages, being used
> for writes...) would not be closed before everything is cleared...

That's likely, but far from guaranteed.

> After some poking in the code, I think that this issue may indeed be there,
> although the probability of hitting it is close to 0, but alas not 0:-)

I did hit it...

> To fix it, ISTM that it is enough to hold a "do not close lock" on the file
> while a flush is in progress (a short time) that would prevent mdclose from
> doing its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct? If that, how would you deal with FileClose()
being called?


> >3) I found that latency wasn't improved much for workloads that are
> >  significantly bigger than shared buffers. The problem here is that
> >  neither bgwriter nor the backends have, so far, done
> >  sync_file_range() calls. That meant that the old problem of having
> >  gigabytes of dirty data that periodically get flushed out, still
> >  exists. Having these do flushes mostly attacks that problem.
> 
> I'm conscious that the patch only addresses *checkpointer* writes, not those
> from the bgwriter or the backends. I agree that these should be
> addressed at some point as well, but given the time to get a patch through,
> the more complex the slower (sort propositions are 10 years old), I think
> this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over checkpointer's
design as well.

> >What I did not expect, and what confounded me for a long while, is that
> >for workloads where the hot data set does *NOT* fit into shared buffers,
> >sorting often led to be a noticeable reduction in throughput. Up to
> >30%.
> 
> I did not see such behavior in the many tests I ran. Could you share more
> precise details so that I can try to reproduce this performance regression?
> (available memory, shared buffers, db size, ...).


I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from
occurring.
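
For reference, that is a single line in postgresql.conf:

autovacuum_analyze_threshold = 2147483647   # INT_MAX: effectively never analyze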

I'll show actual numbers at some point yes. I tried three different systems:

* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
  shared_buffers. Tried checkpoint timeouts from 60 to 300s.  I could
  see issues in workloads ranging from scale 300 to 5000. Throughput
  regressions are visible for both sync_commit on/off workloads. Here
  the largest regressions were visible.

* my workstation: 24GB Ram, 2x E5520, a) Raid 10 of of 4 4TB, 7.2krpm
  devices b) Raid 1 of 2 m4 512GB SSDs. One of the latter was killed
  during the test.  Both showed regressions, but smaller.

* EC2 d2.8xlarge, 244 GB RAM, 24 x 2000 HDD, 64GB shared_buffers. I
  tried scale 3000,8000,15000. Here sorting, without flushing, didn't
  lead much to regressions.


I think generally the regressions were visible with a) noticeable shared
buffers, b) workload not fitting into shared buffers, c) significant
throughput, leading to high cache replacement ratios.


Another thing that's worthwhile to mention, while not surprising, is
that the benefits of this patch are massively smaller when WAL and data
are separated onto different disks.  For workloads fitting into
shared_buffers I saw no performance difference - not particularly
surprising. I guess if you'd construct a case where the data, not WAL,
is the bottleneck that'd be different.  Also worthwhile to mention that
the separate-disk setups were noticeably faster.

> >The performance was still much more regular than before, i.e. no
> >more multi-second periods without any transactions happening.
> >
> >By now I think I know what's going on: Before the sorting portion of the
> >patch the write-loop in BufferSync() starts at the current clock hand,
> >by using StrategySyncStart(). But after the sorting that obviously
> >doesn't happen anymore - buffers are accessed in their sort order. By
> >starting at the current clock hand and moving on from there the
> >checkpointer b

Re: [HACKERS] Parallel Seq Scan

2015-11-12 Thread Thom Brown
On 12 November 2015 at 15:23, Amit Kapila  wrote:
> On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule 
> wrote:
>>
>> Hi
>>
>> I have a first query
>>
>> I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
>> different
>>
>
> Thanks for the report.  The reason for this problem is that instrumentation
> information from workers is getting aggregated multiple times.  In
> ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
> will wait for workers to finish and then accumulate stats from workers.
> Now ExecShutdownGatherWorkers() could be called multiple times
> (once after we read all tuples from workers, and again at the end of the
> node), and it should be ensured that repeated calls do not redo the work
> done by the first call.
> The same is ensured for tuplequeues, but not for parallel executor info.
> I think we can safely assume that we need to call ExecParallelFinish() only
> when there are workers started by the Gather node, so on those lines
> the attached patch should fix the problem.

That fixes the count issue for me, although not the number of buffers
hit, or the actual time taken.

Thom




Re: [HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-12 Thread Greg Stark
On Thu, Nov 12, 2015 at 2:39 PM, Tom Lane  wrote:
> Either that's a reportable compiler bug, or someplace nearby we've
> casted the pointer to something that would require a 4-byte struct.
> I'm not sure which code you're looking at exactly, but maybe we're
> using "union NumericChoice" prematurely?

I don't see how these structs could ever be used without casting to
NumericChoice to get the scale and weight even when there are no
digits. I think it wouldn't be too hard to just give up on the structs
and unions and use a char * as the underlying type. We could access
the meta information directly using byte accesses and memcpy the
digits to an aligned array of digits when setting up the var. I think
the code would be simpler in the end and make it easy to support
packed varlenas.

It triggers on the line with the NUMERIC_WEIGHT() macro call in
init_var_from_num():

#define NUMERIC_WEIGHT(n) (NUMERIC_HEADER_IS_SHORT((n)) ? \
(((n)->choice.n_short.n_header & NUMERIC_SHORT_WEIGHT_SIGN_MASK ? \
  ~NUMERIC_SHORT_WEIGHT_MASK : 0) \
 | ((n)->choice.n_short.n_header & NUMERIC_SHORT_WEIGHT_MASK)) \
: ((n)->choice.n_long.n_weight))

static void
init_var_from_num(Numeric num, NumericVar *dest)
{
  dump_numeric("init_var_from_num", num);

  dest->ndigits = NUMERIC_NDIGITS(num);
  dest->weight = NUMERIC_WEIGHT(num);
  dest->sign = NUMERIC_SIGN(num);
  dest->dscale = NUMERIC_DSCALE(num);
  dest->digits = NUMERIC_DIGITS(num);
  dest->buf = NULL; /* digits array is not palloc'd */
}


I think the

-- 
greg




Re: [HACKERS] Parallel Seq Scan

2015-11-12 Thread Amit Kapila
On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule 
wrote:
>
> Hi
>
> I have a first query
>
> I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
> different
>

Thanks for the report.  The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times.  In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once after we read all tuples from workers, and again at the end of the
node), and it should be ensured that repeated calls do not redo the work
done by the first call.
The same is ensured for tuplequeues, but not for parallel executor info.
I think we can safely assume that we need to call ExecParallelFinish() only
when there are workers started by the Gather node, so on those lines
the attached patch should fix the problem.
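
Roughly, the guard has this shape (a sketch of the approach, not
necessarily the attached patch verbatim):

static void
ExecShutdownGatherWorkers(GatherState *node)
{
  /* Only wait for and accumulate worker stats if workers were launched. */
  if (node->pei != NULL)
    ExecParallelFinish(node->pei);

  /* ... tuple queue shutdown already guards against repetition ... */
}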



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_agg_instr_issue_v1.patch
Description: Binary data



Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Vik Fearing
On 10/28/2015 10:03 AM, Marko Tiikkaja wrote:
> Hello hello,
> 
> Since the default t/f output for booleans is not very user friendly,
> attached is a patch which enables you to do for example the following:
> 
> =# \pset true TRUE
> Boolean TRUE display is "TRUE".
> =# \pset false FALSE
> Boolean FALSE display is "FALSE".
> =# select true, false;
>   bool | bool
> --+---
>   TRUE | FALSE
> (1 row)

I think the battle is already lost, but I'm casting my vote in favor of
this patch anyway.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-12 Thread Tom Lane
Greg Stark  writes:
> In numeric.c we have the short numeric headers that have one uint16
> (in addition to the varlena header) followed by digits. When compiling
> with -O2 on x86-64 LLVM now seems to use a 4-byte access.

Either that's a reportable compiler bug, or someplace nearby we've
casted the pointer to something that would require a 4-byte struct.
I'm not sure which code you're looking at exactly, but maybe we're
using "union NumericChoice" prematurely?

regards, tom lane




Re: [HACKERS] checkpointer continuous flushing

2015-11-12 Thread Fabien COELHO


Hello Andres,


And here's v14. It's not something entirely ready.


I'm going to have a careful look at it.

A lot of details have changed, I unfortunately don't remember them all. 
But there are more important things than the details of the patch.


I've played *a lot* with this patch. I found a bunch of issues:

1) The FileFlushContext context infrastructure isn't actually
  correct. There's two problems: First, using the actual 'fd' number to
  reference a to-be-flushed file isn't meaningful. If there  are lots
  of files open, fds get reused within fd.c.


Hmm.

My assumption is that a file being used (i.e. with modified pages, being
used for writes...) would not be closed before everything is cleared...


After some poking in the code, I think that this issue may indeed be 
there, although the probability of hitting it is close to 0, but alas not 
0:-)


To fix it, ISTM that it is enough to hold a "do not close lock" on the
file while a flush is in progress (a short time) that would prevent
mdclose from doing its stuff.


That part is fixed well enough by referencing the File instead of the fd.
The bigger problem is that the infrastructure doesn't deal with files being
closed. There can be smgr invalidations (which aren't that hard to trigger)
causing the smgr handle, and thus the file, to be closed.


I think this means that the entire flushing infrastructure actually
needs to be hoisted up, onto the smgr/md level.


Hmmm. I'm not sure that it is necessary, see my suggestion above.


2) I noticed that sync_file_range() blocked far more often than I'd
  expected. Reading the kernel code that turned out to be caused by a
  pessimization in the kernel introduced years ago - in many situation
  SFR_WRITE waited for the writes. A fix for this will be in the 4.4
  kernel.


Alas, Pg cannot help issues in the kernel.


3) I found that latency wasn't improved much for workloads that are
  significantly bigger than shared buffers. The problem here is that
  neither bgwriter nor the backends have, so far, done
  sync_file_range() calls. That meant that the old problem of having
  gigabytes of dirty data that periodically get flushed out, still
  exists. Having these do flushes mostly attacks that problem.


I'm conscious that the patch only addresses *checkpointer* writes, not
those from the bgwriter or the backends. I agree that these should be
addressed at some point as well, but given the time to get a patch
through, the more complex the slower (sort propositions are 10 years old),
I think this should be postponed for later.



Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency. Even without
the patch from 2), although fixing that improves things further.


This is consistent with my experiments: sorting improves things, and 
flushing on top of sorting improves things further.



What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to be a noticeable reduction in throughput. Up to
30%.


I did not see such behavior in the many tests I ran. Could you share more 
precise details so that I can try to reproduce this performance 
regression? (available memory, shared buffers, db size, ...).



The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there the
checkpointer basically makes it more less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(


I'm quite surprised at such a large effect on throughput, though.

This explanation seems to suggest that if bgwriter/worker writes were
sorted and/or coordinated with the checkpointer somehow then all would be
well?


ISTM that this explanation could be checked by looking whether 
bgwriter/workers writes are especially large compared to checkpointer 
writes in those cases with reduced throughput? The data is in the log.
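
For reference, the pre-sorting write loop being discussed looks roughly
like this (a simplified sketch; the two helpers are hypothetical stand-ins
for the real BM_CHECKPOINT_NEEDED check and the actual write):

buf_id = StrategySyncStart(NULL, NULL);       /* start at the clock hand */
num_to_scan = NBuffers;
while (num_to_scan-- > 0)
{
  if (buffer_needs_checkpoint_write(buf_id))  /* hypothetical check */
    write_one_buffer(buf_id);                 /* hypothetical write */
  if (++buf_id >= NBuffers)
    buf_id = 0;                               /* wrap around */
}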





My benchmarking suggest that that effect is the larger, the shorter the
checkpoint timeout is.


Hmmm. The shorter the timeout, the less likely the sorting is to be
effective, the more likely we go back to random I/Os, and maybe we would
see some effect of the sync strategy stuff.


That seems to intuitively make sense, given the above explanation
attempt. If the checkpoint takes longer the clock hand will almost
certainly soo

Re: [HACKERS] eXtensible Transaction Manager API

2015-11-12 Thread Robert Haas
On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs  wrote:
> On 9 November 2015 at 18:46, Robert Haas  wrote:
>> > I am aware of the fact
>> > that by definition PREPARE TRANSACTION ensures that a transaction will
>> > be committed with COMMIT PREPARED, just trying to see any corner cases
>> > with the approach proposed. The DTM approach is actually rather close
>> > to what a GTM in Postgres-XC does :)
>>
>> Yes.  I think that we should try to learn as much as possible from the
>> XC experience, but that doesn't mean we should incorporate XC's fuzzy
>> thinking about 2PC into PG.  We should not.
>
> Fuzzy thinking. Please explain.

Multiple people who have worked on XC have argued to me that we need
to defend against the case where PREPARE TRANSACTION succeeds and
COMMIT PREPARED fails, say because somebody manually mucked with the
files in the data directory.  I think that if people are manually
mucking with files in the data directory, we have no hope of achieving
sane behavior, and there's not much point in expending code on
defending against any individual way that could happen.

>> One point I'd like to mention is that it's absolutely critical to
>> design this in a way that minimizes network roundtrips without
>> compromising correctness.  XC's GTM proxy suggests that they failed to
>> do that.  I think we really need to look at what's going to be on the
>> other sides of the proposed APIs and think about whether it's going to
>> be possible to have a strong local caching layer that keeps network
>> roundtrips to a minimum.  We should consider whether the need for such
>> a caching layer has any impact on what the APIs should look like.
> You mean the caching layer that already exists in XL/XC?

I don't think that's what I mean, no.  If you have an external GTM
Proxy, then you have missed a trick, because whatever caching it does
could be done better inside the process that sent the request to the
proxy.

>> For example, consider a 10-node cluster where each node has 32 cores
>> and 32 clients, and each client is running lots of short-running SQL
>> statements.  The demand for snapshots will be intense.  If every
>> backend separately requests a snapshot for every SQL statement from
>> the coordinator, that's probably going to be terrible.  We can make it
>> the problem of the stuff behind the DTM API to figure out a way to
>> avoid that, but maybe that's going to result in every DTM needing to
>> solve the same problems.
>
> The whole purpose of that XTM API is to allow different solutions for that
> to be created. Konstantin has made a very good case for such an API to
> exist, based around 3 markedly different approaches.

I'm not arguing with that.

> Whether we allow the API into core to be accessible via extensions is a
> different issue, but it looks fine for its purpose.

I'm not attacking the API.  I'm trying to have a discussion about the
important design issues in this area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Dangling Client Backend Process

2015-11-12 Thread Robert Haas
On Wed, Nov 11, 2015 at 1:55 AM, Michael Paquier
 wrote:
> On Wed, Nov 4, 2015 at 2:18 AM, Robert Haas wrote:
>> The second conclusion does not appear to be correct.  parseInput()
>> will call pqParseInput3() or pqParseInput2(), either of which will
>> handle an error as if it were a notice - i.e. by printing it out.
>
> Right per pqGetErrorNotice3 when the connection is in PGASYNC_IDLE state.
>
>> Here's a patch based on that analysis, addressing just that one
>> function, not any of the other changes talked about on this thread.
>> Does this make sense?  Would we want to back-patch it, and if so how
>> far, or just adjust master?  My gut is just master, but I don't know
>> why this issue wouldn't also affect Hot Standby kills and maybe other
>> kinds of connection termination situations, so maybe there's an
>> argument for back-patching.  On the third hand, failing to read the
>> error message off of a just-terminated connection isn't exactly a
>> crisis of the first order either.
>
> Looks sane to me. As the connection is in PGASYNC_IDLE state when
> crossing the path of pqHandleSendFailure(), we would finish eating up
> all the error messages received from the server and print an internal
> notice for the rest with "message type blah received from server while
> idle". Based on the lack of complaints regarding libpq on this side, I
> would just go for master, as for 9.5 it is pretty late in the game to
> put some dust on it before a potential backpatch.

OK, committed just to master then, plus the change which is the
purpose of this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] CustomScan support on readfuncs.c

2015-11-12 Thread Kouhei Kaigai
> On Thu, Nov 12, 2015 at 7:59 AM, Kouhei Kaigai  wrote:
> >> On Mon, Nov 9, 2015 at 9:46 PM, Kouhei Kaigai  wrote:
> >> > The attached main patch (custom-scan-on-readfuncs.v3.patch) once
> >> > removes TextOutCustomScan, thus all the serialized tokens become
> >> > known to the core backend, and add _readCustomScan() at readfuncs.c.
> >> > It deserializes CustomScan nodes from cstring tokens, especially,
> >> > reloads the pointer of CustomScanMethods tables using a pair of
> >> > library name and symbol name.
> >> > Thus, it also requires custom scan providers to keep the methods
> >> > table visible from external objects.
> >>
> >> Committed with some kibitzing.
> >>
> > Thanks,
> >
> > How do we deal with TextOutCustomScan in the v9.5 tree?
> > I think we ought to remove this callback also prior to v9.5 release.
> 
> I'll do that if there's a clear consensus in favor.  I wasn't sure it
> really made sense so close to release.
>
Unless the "CustomScan in a larger structure" thread reaches a different
consensus, the TextOutCustomScan callback becomes obsolete in the v9.6
release.

So I think this callback should be dropped.
Let's wait a few days for other people's opinions?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Documentation tweak for row-valued expressions and null

2015-11-12 Thread Robert Haas
On Tue, Nov 10, 2015 at 11:10 AM, Jim Nasby  wrote:
> On 7/26/15 8:40 PM, Thomas Munro wrote:
>>
>> I wonder if it might be worth adding a tiny note to the manual to
>> point out that the special logic for " IS [ NOT
>> ] NULL" doesn't apply anywhere else that we handle nulls or talk about
>> [non]-null values in the manual.  See attached.
>
> Yes. This is a common point of confusion.

Agreed.  But I actually don't know whether Thomas's proposed language
matches the actual behavior in every case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Brendan Jurd
On Fri, 30 Oct 2015 at 00:51 Tom Lane  wrote:

> The really key argument that hasn't been addressed here is why does such
> a behavior belong in psql, rather than elsewhere?  Surely legibility
> problems aren't unique to psql users.  Moreover, there are exactly
> parallel facilities for other datatypes on the server side: think
> DateStyle or bytea_output.  So if you were trying to follow precedent
> rather than invent a kluge, you'd have submitted a patch to create a GUC
> that changes the output of boolout().
>

I find Tom's analogy to datestyle and bytea_output convincing.

+1 for a GUC that changes the behaviour of boolout.

And also +1 for doing anything at all to improve on the t/f output.  Those
glyphs are way too similar to each other.

I think U+2713 and U+2717 (✓ and ✗) are the obvious choices for a boolean,
but if we have a GUC we can set this to taste.

Cheers,
BJ


Re: [HACKERS] CustomScan support on readfuncs.c

2015-11-12 Thread Robert Haas
On Thu, Nov 12, 2015 at 7:59 AM, Kouhei Kaigai  wrote:
>> On Mon, Nov 9, 2015 at 9:46 PM, Kouhei Kaigai  wrote:
>> > The attached main patch (custom-scan-on-readfuncs.v3.patch) once
>> > removes TextOutCustomScan, thus all the serialized tokens become
>> > known to the core backend, and add _readCustomScan() at readfuncs.c.
>> > It deserializes CustomScan nodes from cstring tokens, especially,
>> > reloads the pointer of CustomScanMethods tables using a pair of
>> > library name and symbol name.
>> > Thus, it also requires custom scan providers to keep the methods
>> > table visible from external objects.
>>
>> Committed with some kibitzing.
>>
> Thanks,
>
> How do we deal with TextOutCustomScan in the v9.5 tree?
> I think we ought to remove this callback also prior to v9.5 release.

I'll do that if there's a clear consensus in favor.  I wasn't sure it
really made sense so close to release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Multixid hindsight design

2015-11-12 Thread Robert Haas
On Mon, Nov 9, 2015 at 9:57 PM, Craig Ringer  wrote:
> On 10 November 2015 at 02:26, Robert Haas  wrote:
>>> I'd like to see, say, python and the 'unittest' module added, and
>>> to see acceptance of tests using psycopg2 to stream and decode WAL
>>> from a slot.
>>
>> I actually kind of hate this sort of thing.
>
> For what it's worth, I don't actually like it either. However, I
> suspect that the current restricted tools are a significant impediment
> to test tooling and infrastructure improvement.

But that's exactly the argument that was made to justify adding the
TAP testing framework.  It didn't work?

>> Now I'm not dead set against changing anything at all here, but I want
>> to point out that we just added the TAP framework and already there
>> are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
>> for some tests, which a lot of people aren't going to have
>
> The alternative seems to be driving psql -qAtz as a coprocess over a
> pipeline as a poor man's database driver. I really doubt that's any
> kind of improvement when it comes to the complexity of maintaining
> tests, fixing them when they break, etc. I'd rather just write tests
> in C against libpq despite the verbosity, annoyance of maintaining
> them, etc.

I thought about an idea like this: have some sort of dummy program,
and then have a module written in Perl that provides an interface to
it, so that we don't have to depend on DBD::Pg.  I'm not at all
convinced that's a bad plan.

> Part of why I'm somewhat interested in seeing python and psycopg2
> added is that I'd personally quite like to see widely used client
> drivers covered by the buildfarm to some extent specifically to see if
> we break them. Though really, it'd make sense to add the most
> important drivers like psqlODBC and PgJDBC first if doing that and
> that'd be a whole new barrel of fun.

EnterpriseDB has something like this, and it does catch some bugs, but
it is an unbelievable amount of work to keep it working.  When
something fails, it's unlikely that any individual developer has
exactly the right configuration on their machine to reproduce the
failure.  Now maybe we could come up with something that provides
similar test coverage but is less work to maintain - I'm certainly not
going to rule out the possibility that it can be done better than
EnterpriseDB has done it.  However, I bet it will still be a lot of
work, and unlike at EnterpriseDB where we've got entire teams of people
who spend major amounts of time on this kind of stuff, PostgreSQL
relies on its committers to keep the buildfarm green.

And if I commit something and it breaks the Win32 version of some test
that is written in Python and relies on psqlODBC, there is no way in
heck I'm committing myself to reproducing that environment.  Even if
it's Linux64 and JDBC I'm not doing it.  If you want to build such a
thing and maintain it in perpetuity, that is fine, and you don't need
my permission, and I and I'm sure many others will greatly appreciate
both the effort and any bugs that get found and fixed that way.  But
when you talk about putting this kind of thing to core you are asking
the committers to be responsible for it, and I'm still not very
confident we've got the TAP tests in a state where maintaining them is
going to be a reasonable level of effort.  Those tests failed on my
machine for months, and the only reason they eventually got fixed is
because I knew enough Perl to track the problem down to some screwy
coding in one of the Perl modules MacPorts installs.  If the same
thing happens with Python, it would be extremely difficult for me to
find it, because I don't actually know Python.

>> Accepting more tools also increases the breadth of knowledge expected
>> of committers and patch authors.
>
> That's the real problem IMO. The cognitive and knowledge load.

That's definitely a big part of it, but I think you're selling short
the amount of havoc that build dependencies cause.  On a Linux system
where you can install whatever you need via 'yum install', sure, it's
easy.  It's not so easy on Windows.  It's less easy on other UNIX
systems without modern package managers.  But that's just the tip of
the iceberg.  Installing your operating system's DBD::Pg package is
going to install a version of DBD::Pg compiled against the system's
libpq, not the latest PostgreSQL libpq that you just built.  Now
perhaps, you will say, that's OK: that way we test cross-version libpq
compatibility, too!  And on a certain level I can't argue with that.
But now you could have a situation where the tests pass for one user
and fail for another user even though both are compiling from the same
PostgreSQL commit and the same DBD::Pg version, and that kind of stuff
is a lot of work to track down.  And it may not really be a bug: maybe
the person who added the test intended the test to exercise some new
libpq feature which is indeed supported by DBD::Pg, but your build of
DBD::Pg doesn't have that be

Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Marko Tiikkaja

On 11/12/15 1:50 PM, Michael Paquier wrote:

FWIW, I am -1 on the concept of enforcing output values for particular
datatypes because that's not the job of psql


In my view, the job of psql is to make working with a postgres database 
easy for us human beings.  That means (among other things) formatting 
and aligning query output for readability.  I don't see how reformatting 
unreadable boolean values meant for computers is that big of a stretch.



.m




Re: [HACKERS] CustomScan support on readfuncs.c

2015-11-12 Thread Kouhei Kaigai
> On Mon, Nov 9, 2015 at 9:46 PM, Kouhei Kaigai  wrote:
> > The attached main patch (custom-scan-on-readfuncs.v3.patch) once
> > removes TextOutCustomScan, thus all the serialized tokens become
> > known to the core backend, and add _readCustomScan() at readfuncs.c.
> > It deserializes CustomScan nodes from cstring tokens, especially,
> > reloads the pointer of CustomScanMethods tables using a pair of
> > library name and symbol name.
> > Thus, it also requires custom scan providers to keep the methods
> > table visible from external objects.
> 
> Committed with some kibitzing.
>
Thanks,

How do we deal with TextOutCustomScan in the v9.5 tree?
I think we ought to remove this callback also prior to v9.5 release.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] psql: add \pset true/false

2015-11-12 Thread Michael Paquier
On Thu, Oct 29, 2015 at 10:50 PM, Tom Lane  wrote:
> Marko Tiikkaja  writes:
>> On 10/29/15 11:51 AM, Daniel Verite wrote:
>>> Personally I think it would be worth having, but how about
>>> booleans inside ROW() or composite types ?
>
>> There's not enough information sent over to do that in the client.
>> Note that this works the same way as  \pset null  with  SELECT
>> ROW(NULL), so I don't consider it a show stopper for the patch.
>
> The problem with that argument is that \pset null is already a kluge
> (but at least a datatype-independent one).  Now you've added a datatype
> specific kluge of the same ilk.  It might be useful, it might be short,
> but that doesn't make it not a kluge.

FWIW, I am -1 on the concept of enforcing output values for particular
datatypes because that's not the job of psql, and it is easy to do
that at query level (no comment about the existing \pset NULL stuff).

+fprintf(output, _("  true   set the string to be
prined in place of a TRUE value\n"));
+fprintf(output, _("  false  set the string to be
prined in place of a FALSE value\n"));
s/prined/printed/
-- 
Michael




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-12 Thread Kouhei Kaigai
> On 2015/11/12 14:09, Kouhei Kaigai wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
> >
> > The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added
> > 'parallel_aware' flag on Path and Plan structure.
> > It tells us whether the plan/path-node can perform by multiple
> > background worker processes concurrently, but also tells us
> > nothing whether the plan/path-node are safe to run on background
> > worker process context.
> 
> When I was looking at the recent parallelism related commits, I noticed a
> RelOptInfo.consider_parallel flag. That and the function
> set_rel_consider_parallel() may be of interest in this regard.
> set_append_rel_size() passes the parent's state of this flag down to child
> relations but I guess that's not  what you're after.
>
Thanks for this information. Indeed, it tells us which base
relations are valid for parallel execution.
In the case of parallel append, we can give up parallelism if any of the
underlying base relations is not parallel aware. We can potentially use
the same logic for join relations.

A challenge is how to count the optimal number of background worker
processes. I assume the number of workers for parallel append should be
the sum of the workers required by the sub-plans, capped at
max_parallel_degree.
Probably we need a pathnode_tree_walker() to count the number of workers
required by the sub-plans, as sketched below.
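
Something like this (hypothetical helper names, just to fix the idea):

static int
count_parallel_append_workers(List *subpaths)
{
  int       nworkers = 0;
  ListCell *lc;

  /* Sum the workers each sub-plan would want... */
  foreach(lc, subpaths)
    nworkers += required_workers((Path *) lfirst(lc));  /* hypothetical */

  /* ...but never exceed the configured ceiling. */
  return Min(nworkers, max_parallel_degree);
}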

BTW, is the idea of a consider_parallel flag in RelOptInfo workable for
join relations also? In the case of A LEFT JOIN B, for example, the join
can be parallel aware if the join path has A as the outer input, but it
cannot be if B is the outer input.
I suspect that this kind of information belongs to Path, not RelOptInfo.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] CustomScan support on readfuncs.c

2015-11-12 Thread Robert Haas
On Mon, Nov 9, 2015 at 9:46 PM, Kouhei Kaigai  wrote:
> The attached main patch (custom-scan-on-readfuncs.v3.patch) once
> removes TextOutCustomScan, thus all the serialized tokens become
> known to the core backend, and add _readCustomScan() at readfuncs.c.
> It deserializes CustomScan nodes from cstring tokens, especially,
> reloads the pointer of CustomScanMethods tables using a pair of
> library name and symbol name.
> Thus, it also requires custom scan providers to keep the methods
> table visible from external objects.

Committed with some kibitzing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-12 Thread Greg Stark
I've been using LLVM's sanitizers and asan turned up a new bit of
compiler behaviour that I hadn't had before. I don't see it in 3.7 or
before, only in their HEAD so I don't know if it's a bug or
intentional.

In numeric.c we have the short numeric headers that have one uint16
(in addition to the varlena header) followed by digits. When compiling
with -O2 on x86-64 LLVM now seems to use a 4-byte access. When there
are no digits (because the number is 0) that overruns the palloced
block.

init_var_from_num: NUMERIC (short) w=0 d=0 POS 6 bytes: 18 00 00 00 00 80
=
==30523==ERROR: AddressSanitizer: unknown-crash on address
0x6250002c7edc at pc 0x0144a6d6 bp 0x7ffcfbaadeb0 sp
0x7ffcfbaadea8
READ of size 4 at 0x6250002c7edc thread T0
#0 0x144a6d5 in init_var_from_num
/home/stark/src/pg/postgresql-master/src/backend/utils/adt/numeric.c:4764:17

Shadow bytes around the buggy address:
=>0x0c4a80050fd0: 00 00 00 00 00 00 f7 f7 00 00 f7[06]00 00 f7 00

   0x0144a6d6 <+1078>: mov    $0x2e02680,%edi
   0x0144a6db <+1083>: mov    %r14,0x10(%rbx)
   0x0144a6df <+1087>: mov    %rsi,%r14

Now, in Postgres this is currently safe for the same reason that the
FormData struct accesses are safe: numerics are always 4-byte aligned
when allocated in memory. They won't be aligned on disk if they have a
short varlena header, but we don't use the GETARG_PP macros in
numeric.c, so they always get reallocated to aligned memory in the
GETARG macro.
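
Concretely, the macro chain that guarantees the aligned copy, as a
paraphrase of fmgr.h/numeric.h rather than a verbatim excerpt:

/* Paraphrase of the fmgr.h / numeric.h macro chain; not verbatim. */
#define PG_GETARG_NUMERIC(n)   DatumGetNumeric(PG_GETARG_DATUM(n))
#define DatumGetNumeric(X)     ((Numeric) PG_DETOAST_DATUM(X))

/*
 * PG_DETOAST_DATUM() ends up in pg_detoast_datum(), which, on seeing a
 * short varlena header, pallocs a fresh (and therefore maxaligned)
 * copy with an expanded 4-byte header -- that copy is why the wide
 * access is safe today.  The _PP ("packed") variants skip the copy and
 * would hand back a potentially unaligned pointer.
 */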

This does represent a hazard, though, in case we ever add support for
unaligned packed varlenas to numeric.c. In fact it's extremely
tempting to do so, since we never do much work on the varlena form
anyway; we just convert it to a NumericVar. It's awfully annoying that
we have to palloc and memcpy into that intermediate form just to do it
again to get the usable internal form. If we did that, we would have
to deal with unaligned accesses, so we would probably have to change
this code to use a plain char * buffer and skip the union and struct
entirely. Doing so while maintaining binary on-disk compatibility
could be tricky.


-- 
greg




Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-11-12 Thread Michael Paquier
On Sat, Oct 3, 2015 at 1:43 PM, dinesh kumar wrote:
> We can also use "PROGRAM 'cat > Output.csv' " to achieve this "NO READ
> ACCESS", since the program always runs as the instance owner.
> Let me know your inputs and thoughts.

That's one way. And as PROGRAM has the advantage of opening the file
before COPY has begun processing any tuples, the umask would be set
before any data is written to it. So why complicate COPY with a new
option when we already have a way to apply restrictions to the output
file? This patch seems like an unnecessary complication, and I doubt
we would want to change the default permission of 644 that has been
here for ages.
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-12 Thread Simon Riggs
On 11 November 2015 at 09:22, Thomas Munro wrote:


> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
> BerkeleyDB does in "read-your-writes" mode.  It means that application
> developers have the responsibility for correctly identifying transactions
> with causal dependencies and dealing with LSNs (or whatever equivalent
> tokens), potentially even passing them to other processes where the
> transactions are causally dependent but run by multiple communicating
> clients (for example, communicating microservices).  This makes it
> difficult to retrofit load balancing to pre-existing applications and (like
> anything involving concurrency) difficult to reason about as applications
> grow in size and complexity.  It is efficient if done correctly, but it is
> a tax on application complexity.
>

Agreed. This works if you have a single transaction connected through a
pool that does statement-level load balancing, so it works in both
session and transaction mode.

I was in favour of a scheme like this myself, earlier, but have more
thoughts now.

We must also consider the need for serialization across sessions or
transactions.

In transaction pooling mode, an application could get assigned a different
session, so a token would be much harder to pass around.
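
To make (1) concrete, here is a minimal libpq sketch of the token
dance. It is illustrative only: connection setup and most error
handling are trimmed, and the token itself would come from
"SELECT pg_current_xlog_location()" on the primary after the write
commits:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

/*
 * Poll a standby until it has replayed past the causality token
 * captured on the primary; after that it is safe to read from it.
 */
static void
wait_for_token(PGconn *standby, const char *token_lsn)
{
    for (;;)
    {
        char        query[128];
        PGresult   *res;
        int         caught_up;

        snprintf(query, sizeof(query),
                 "SELECT pg_last_xlog_replay_location() >= '%s'::pg_lsn",
                 token_lsn);
        res = PQexec(standby, query);
        caught_up = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                     strcmp(PQgetvalue(res, 0, 0), "t") == 0);
        PQclear(res);
        if (caught_up)
            return;
        usleep(10 * 1000);      /* wait 10 ms between polls */
    }
}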

> 2.  Reader waits for a conservatively chosen LSN.  This is roughly what
> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
> 1" modes.  Read transactions would start off by finding the current end
> of WAL on the primary, since that must be later than any commit that
> already completed, and then waiting for that to apply locally.  That means
> every read transaction waits for a complete replication lag period,
> potentially unnecessarily.  This is a tax on readers: unnecessary waiting.
>

This tries to make it easier for users by forcing all users to experience a
causality delay. Given that the whole purpose of multi-node load balancing
is performance, referencing the master again simply defeats any performance
gain, so you couldn't ever use it for all sessions. It could be a USERSET
parameter, so it could be turned off in most cases that didn't need it.  But
it's easier to use than (1).

Though this should be implemented in the pooler.

> 3.  Writer waits, as proposed.  In this model, there is no tax on readers
> (they have zero overhead, aside from the added complexity of dealing with
> the possibility of transactions being rejected when a standby falls behind
> and is dropped from 'available' status; but database clients must already
> deal with certain types of rare rejected queries/failures such as
> deadlocks, serialization failures, server restarts etc).  This is a tax on
> writers.
>

This would seem to require that all readers must first check with the
master as to which standbys are now considered available, so it looks like
(2).

The alternative is that we simply send readers to any standby and allow the
pool to work out separately whether the standby is still available, which
mostly works, but it doesn't handle sporadic slowdowns on particular
standbys very well (if at all).

I think we need to look at whether this does actually give us anything, or
whether we are missing the underlying Heisenberg reality.

More later.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-12 Thread Etsuro Fujita

Horiguchi-san,

On 2015/11/12 16:10, Kyotaro HORIGUCHI wrote:

I really don't see why you're fighting on this point.  Making this a
generic feature will require only a few extra lines of code for FDW
authors.  If this were going to cause some great inconvenience for FDW
authors, then I'd agree it isn't worth it.  But I see zero evidence
that this is actually the case.



Really?  I think there would be no small burden on an FDW author; when
postgres_fdw delegates the subplan processing to the remote server, for
example, it would need to create a remote join query by looking at the
tuples possibly fetched and stored in estate->es_epqTuple[], then send
the query and receive the result during the callback routine.



Are you concerned that an FDW cannot generate a plan that builds a
tuple from the epqTuples and then applies the fdw_quals using
predefined executor nodes?


No.  Please see my previous email.  Sorry for my unfinished email.

Best regards,
Etsuro Fujita





Re: [HACKERS] Minor comment improvement to create_foreignscan_plan

2015-11-12 Thread Etsuro Fujita

On 2015/11/10 3:53, Robert Haas wrote:

On Mon, Nov 9, 2015 at 5:34 AM, Etsuro Fujita wrote:

Here is a small patch to update a comment in create_foreignscan_plan,
adding fdw_recheck_quals to the list of expressions that need
replace_nestloop_params processing.  I should have updated the comment
when I proposed the patch for fdw_recheck_quals.



OK, not a big deal, but thanks.  Committed.


Thanks!

Best regards,
Etsuro Fujita





Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-12 Thread Etsuro Fujita

Robert and Kaigai-san,

Sorry, I sent in an unfinished email.

On 2015/11/12 15:30, Kouhei Kaigai wrote:

On 2015/11/12 2:53, Robert Haas wrote:

On Sun, Nov 8, 2015 at 11:13 PM, Etsuro Fujita wrote:

To test this change, I think we should update the postgres_fdw patch so as
to add the RecheckForeignScan callback.

Having said that, as I said previously, I don't see much value in adding
the callback routine, to be honest.  I know KaiGai-san considers that it
would be useful for custom joins, but I don't think it would be useful even
for foreign joins, because I think that in the case of foreign joins, the
practical implementation of that routine in FDWs would be to create a
secondary plan and execute that plan by performing ExecProcNode, as my
patch does [1].  Maybe I'm missing something, though.



I really don't see why you're fighting on this point.  Making this a
generic feature will require only a few extra lines of code for FDW
authors.  If this were going to cause some great inconvenience for FDW
authors, then I'd agree it isn't worth it.  But I see zero evidence
that this is actually the case.



Really?  I think there would be no small burden on an FDW author; when
postgres_fdw delegates the subplan processing to the remote server, for
example, it would need to create a remote join query by looking at the
tuples possibly fetched and stored in estate->es_epqTuple[], then send
the query and receive the result during the callback routine.



I cannot understand why it is the only solution.


I didn't say that.


Furthermore,
what I'm most concerned about is that that wouldn't be efficient. So, my



You have to add a "because ..." sentence here, because Robert and I
think a little inefficiency is not a problem.


Sorry, my explanation was not sufficient.  The reason is that in the
above postgres_fdw case, for example, the overhead of sending the query
to the remote end and transferring the result back to the local end
would not be negligible.  Yes, we might be able to apply special
handling for improved efficiency when using early row locking, but
could we do the same thing otherwise?



Please don't start the sentence with "I think ...".  We all know
your opinion; what I've wanted to see is "the reason why my
approach is valuable is ...".


I didn't say that my approach is *valuable* either.  What I think is
this: I see zero evidence that there is a good use case for an FDW to
do something other than an ExecProcNode call in the callback routine,
as I said below, so I don't see the need to add such a routine when it
would place a perhaps not large, but certainly not negligible, burden
on the FDW authors who have to write one.



Nobody prohibits postgres_fdw from performing a secondary join here.
All you need to do is pick up a sub-plan tree from the FDW's private
field and then call ExecProcNode() inside the callback.



As I said before, I know that KaiGai-san considers that approach
useful for custom joins.  But I see zero evidence that there is a good
use case for an FDW.



From my point of view I'm now
thinking this solution has two parts:

(1) Let foreign scans have inner and outer subplans.  For this
purpose, we only need one, but it's no more work to enable both, so we
may as well.  If we had some reason, we could add a list of subplans
of arbitrary length, but there doesn't seem to be an urgent need for
that.


I did the same thing in an earlier version of the patch I posted.
Although I agreed with Robert's comment ("The Plan tree and the
PlanState tree should be mirror images of each other; breaking that
equivalence will cause confusion, at least."), I think that approach
would make the code much simpler, especially the code for setting
chgParam for the inner/outer subplans.  But one thing I'm concerned
about is enabling both inner and outer plans, because I think that
would complicate the planner postprocessing, depending on what the
foreign scans do with the inner/outer subplans.  Is it worth doing?
Maybe I'm missing something, though.



(2) Add a recheck callback.

If the foreign data wrapper wants to adopt the solution you're
proposing, the recheck callback can call
ExecProcNode(outerPlanState(node)).  I don't think this should end up
being more than a few lines of code, although of course we should
verify that.


Yeah, I think FDWs would probably need to create a subplan accordingly
at planning time, and then initialize/close the plan at execution
time.  We could facilitate subplan creation by providing helper
functions for that, though.
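
For what it's worth, a minimal sketch of such a recheck callback,
assuming the signature being discussed and assuming the FDW attached a
locally executable alternative join plan as the foreign scan's outer
subplan at plan-creation time (the names are illustrative, not from
any posted patch):

#include "postgres.h"
#include "executor/executor.h"
#include "foreign/fdwapi.h"

static bool
exampleRecheckForeignScan(ForeignScanState *node, TupleTableSlot *slot)
{
    PlanState  *outerPlan = outerPlanState(node);
    TupleTableSlot *result;

    /* Re-run the local subplan against the EPQ tuples. */
    result = ExecProcNode(outerPlan);
    if (TupIsNull(result))
        return false;           /* the joined tuple no longer qualifies */

    /* Hand the rechecked tuple back in the executor-supplied slot. */
    ExecCopySlot(slot, result);
    return true;
}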


Best regards,
Etsuro Fujita





Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

2015-11-12 Thread Amit Langote

On 2015/11/02 23:36, Craig Ringer wrote:
> On 2 November 2015 at 20:17, Craig Ringer  wrote:
>> Hi all
>>
>> I'd like to submit pglogical_output for inclusion in the 9.6 series as
>> a contrib.
> 
> Here's the protocol documentation discussed in the README. It's
> asciidoc at the moment, so it can be formatted into something with
> readable tables.

Kudos! From someone who doesn't really read wire protocol docs a lot, this
was such an enlightening read.

Thanks,
Amit





Re: [HACKERS] Erroneous cost estimation for nested loop join

2015-11-12 Thread KAWAMICHI Ryoji

 wrote:
>>
>> We guessed that the cause of this error would be in the cost model of
>> Postgres, investigated the optimizer source code, and found the cause of
>> the problem. It is in the index cost estimation process: when scanning the
>> inner table, if the loop count is greater than 1, its I/O cost is counted
>> as random access. In the case of Query2, in one loop (i.e. one inner-table
>> scan) much data is read sequentially via the clustered index, so it seems
>> wrong for the optimizer to assume that its I/O workload is random access.
>>
> 
> We don't have a clustered index in Postgres. We do store correlation stats
> for the index, which are used in some places to reduce cost.

Yes, as you pointed out, Postgres does not have a clustered index. By
“clustered index” I meant an index whose correlation is 1.0. In this case,
the index is the primary key (records are physically ordered by it), and the
index was created just after the whole data set was loaded. We've been
assuming an OLAP workload for our experiments, so I think correlation = 1.0
is the basic case for them.

 
> Could you look some more at this and see if a model change that uses the
> correlation can improve this?

I don't quite understand the question, so let me clarify: did you mean that
if I read the optimizer code further, I would find that the correlation is
already used to improve the cost estimate?

Of course I have read it, and I know that the correlation is used to
interpolate between the min cost and the max cost. (The min cost is the
best-case cost, i.e. correlation is 1.0, and the max cost is the worst-case
cost, i.e. correlation is 0; see the sketch below.)

But in both cases, the I/O cost is counted as random access when scanning
the inner table. I think the I/O cost should be counted as sequential
access when the correlation is 1.0, so I tried to modify the code as in my
previous mail. But that modification is just one example of a solution; I'm
not so familiar with the optimizer code yet, so I wonder whether it is the
right way to fix this.
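
For reference, a paraphrase (not a verbatim excerpt) of the
interpolation in cost_index() in costsize.c:

/*
 * The I/O cost slides between the worst case (zero correlation) and
 * the best case (perfect correlation) as the square of the index
 * correlation.  The problem discussed above is that with a loop count
 * greater than 1, both bounds are derived from random_page_cost, so
 * even correlation = 1.0 still yields random-access pricing.
 */
static double
interpolated_io_cost(double min_IO_cost, double max_IO_cost,
                     double indexCorrelation)
{
    double      csquared = indexCorrelation * indexCorrelation;

    return max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
}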

Thank you for your comment.
Ryoji




Re: [HACKERS] Erroneous cost estimation for nested loop join

2015-11-12 Thread KAWAMICHI Ryoji


 wrote:
> 
> More knowledgeable people are sure to reply in more detail!
> 
> However, they would probably appreciate it if you could run with 9.4.5
> (the latest released version).  Running it with the beta of 9.5 would be
> a bonus!
> 
> Note that I don't know enough to say for sure that later versions would
> make any difference in this case, but at least using later versions
> would kill lots of red herrings!  :-)
> 

The difference between minor versions does not matter in this case. I
confirmed that the cost calculation logic relevant to this problem was not
changed between Postgres 9.4.1 and 9.4.5.

I have heard there are some performance improvements in 9.5 that might
change something regarding this problem, so I will try these queries on 9.5!

Thanks
Ryoji




Re: [HACKERS] Erroneous cost estimation for nested loop join

2015-11-12 Thread KAWAMICHI Ryoji

 wrote:
>>
>>   - cost parameter calibration: random_page_cost = 92.89
>>
> 
> This demands some explanation and raises the question of the value of the
> seq_page_cost parameter -- I don't see anything about it in your mail.

seq_page_cost was set to 1.0 (the default), and I explained the reasoning
behind the random_page_cost value in my reply to Tom; please see that
message.

Thanks
Ryoji

