Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Tomas Vondra

Hi,

On 09/03/2015 05:02 AM, Amit Kapila wrote:

On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian wrote:
 >
 > On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
 > > >Can you explain why logical replication is better than binary
 > > >replication for this use-case?
 > > >
 > >
 > > Selectivity?
 >

I was assuming you would just create identical slaves to handle
failure, rather than moving selected data around.

 >

Yes, I also think so; otherwise, when a shard goes down and its
replica has to take the place of the shard, it will take more time to
make the replica available, as it won't have all the data the original
shard had.


Not really, the idea is that you don't need to create the replica 
immediately. The system recognizes that the primary shard location is 
unavailable and redirects the tasks to the "replicas." So the time to 
recreate the failed node is not that critical.


It needs to be done in a smart way to prevent some typical issues, like 
suddenly doubling the load on the replicas due to a failure of the primary 
location. By using a different group of nodes for each "data segment" you 
can eliminate this, because the group of nodes handling the additional 
load will be larger.


The other issue, of course, is that the groups of nodes must not be 
entirely random, otherwise the cluster would suffer data loss in case of 
an outage of an arbitrary group of K nodes (where K is the number of 
replicas of each piece of data).


It's also non-trivial to do this when you have to consider racks, data 
centers, etc.


With regular slaves you can't do any of this -- no matter what you do, 
you have to load-balance the additional load only across the slaves.
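
To illustrate the placement idea, here is a trivial sketch (chained 
placement, purely for illustration, not from any existing implementation): 
each data segment gets its own deterministic group of K nodes, so the 
segments of a failed node are picked up by several distinct neighbours 
instead of a single mirror, and the set of K-node outages that can lose 
data stays small and known. A real placement function would of course 
also have to account for racks and data centers, as noted above.

/*
 * Illustrative sketch only: replicas of segment s live on nodes
 * (s + i) mod N for i = 0..K-1.  When one node fails, its segments are
 * re-served by several different groups of neighbours, so no single node
 * absorbs the whole additional load, and the groups are deterministic
 * rather than random, which bounds the combinations of K nodes whose
 * simultaneous loss can cause data loss.
 */
#include <stdio.h>

#define N_NODES    8            /* cluster size (example value) */
#define K_REPLICAS 3            /* replicas per data segment (example value) */

static void
segment_group(int segment, int nodes[K_REPLICAS])
{
	int			i;

	for (i = 0; i < K_REPLICAS; i++)
		nodes[i] = (segment + i) % N_NODES;
}

int
main(void)
{
	int			group[K_REPLICAS];
	int			segment;

	for (segment = 0; segment < N_NODES; segment++)
	{
		segment_group(segment, group);
		printf("segment %d -> nodes %d %d %d\n",
			   segment, group[0], group[1], group[2]);
	}
	return 0;
}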


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Bruce Momjian
On Thu, Sep  3, 2015 at 03:40:40PM +0200, Tomas Vondra wrote:
> Not really, the idea is that you don't need to create the replica
> immediately. The system recognizes that the primary shard location is
> unavailable and redirects the tasks to the "replicas." So the time
> to recreate the failed node is not that critical.
> 
> It needs to be done in a smart way to prevent some typical issues,
> like suddenly doubling the load on the replicas due to a failure of
> the primary location. By using a different group of nodes for each
> "data segment" you can eliminate this, because the group of nodes
> handling the additional load will be larger.
> 
> The other issue, of course, is that the groups of nodes must not be
> entirely random, otherwise the cluster would suffer data loss in
> case of an outage of an arbitrary group of K nodes (where K is the
> number of replicas of each piece of data).
> 
> It's also non-trivial to do this when you have to consider racks,
> data centers, etc.
> 
> With regular slaves you can't do any of this -- no matter what you
> do, you have to load-balance the additional load only across the slaves.

Yes, and imagine doing this with FDW's, updating the catalog table
location of the FDW as part of the failover process --- interesting.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Can pg_dump make use of CURRENT/SESSION_USER

2015-09-03 Thread Fabrízio de Royes Mello
On Thu, Sep 3, 2015 at 8:16 AM, Andres Freund  wrote:
>
> On 2015-08-25 22:01:51 +0900, Michael Paquier wrote:
> > Seeing no activity in the last couple of months for this patch that
> > had reviews, it is now marked as returned with feedback.
>
> Fabrizio, after the above you moved the patch to the next commitfest,
> without a new patch or additional discussion. Why? Just dragging
> patches through the commitfest increases the work for a number of people
> (this round, me) without a benefit, so that's really not a good idea.
>

My intention was to move it to the next commitfest and finish the "rework",
but I didn't finish it due to some other professional priorities.

Sorry for the noise.


> I'm marking the patch as returned with feedback again. Once there's a
> new patch we can deal with it in the appropriate commitfest.
>

Ok.


Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.

2015-09-03 Thread Anastasia Lubennikova



01.09.2015 21:23, Peter Geoghegan:

On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
 wrote:

Currently a new B-tree index tuple must be inserted for each table row that
we index.
This can possibly cause a page split. Because of MVCC, even a unique index
could contain duplicates.
Storing duplicates in a posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.

Thank you)

So it seems to be a very useful improvement. Of course, it requires a lot of
changes in the B-tree implementation, so I need approval from the community.

1. Compatibility.
It's important to preserve compatibility with older index versions.
I'm going to change BTREE_VERSION to 3, and use the new (posting) features
for v3 while keeping the old implementation for v2.
Any objections?

It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.

Hmm... if we are talking about storing duplicates in posting lists (and 
trees) as in GIN, I don't see how to apply it to some pages while not 
applying it to others. See some notes below.



2. There are several tricks to handle non-unique keys in the B-tree.
More info is in the btree README (chapter "Differences to the Lehman & Yao
algorithm").
In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.

See also my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.


Yes, thanks for the extensive explanation.
I mean tricks such as moving right in _bt_findinsertloc(), for example.

/*--
 * If we will need to split the page to put the item on this page,
 * check whether we can put the tuple somewhere to the right,
 * instead.  Keep scanning right until we
 *(a) find a page with enough free space,
 *(b) reach the last page where the tuple can legally go, or
 *(c) get tired of searching.
 * (c) is not flippant; it is important because if there are many
 * pages' worth of equal keys, it's better to split one of the early
 * pages than to scan all the way to the end of the run of equal keys
 * on every insert.  We implement "get tired" as a random choice,
 * since stopping after scanning a fixed number of pages wouldn't work
 * well (we'd never reach the right-hand side of previously split
 * pages).  Currently the probability of moving right is set at 0.99,
 * which may seem too high to change the behavior much, but it does an
 * excellent job of preventing O(N^2) behavior with many equal keys.
 *--
 */

If there are no duplicate tuples with the same key, we shouldn't care 
about this at all. It would be possible to skip these steps in the "effective 
B-tree implementation". That's why I want to change BTREE_VERSION.



  So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.


You're right, those are two different techniques.
1. Effective storage of duplicates, which I propose, works with equal 
keys and allows us to eliminate the repeats.

Index tuples are stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) | 
IndexTupleData + Attrs (key)


If all Attrs are equal, it seems reasonable not to repeat them. So we 
can store them in the following structure:


MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

This is a posting list. It doesn't require significant changes to the index 
page layout, because we can use an ordinary IndexTupleData for the meta 
information. Each IndexTupleData has a fixed size, so it's easy to 
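
As a rough back-of-the-envelope illustration of the space savings (my own
numbers, not from the proposal): a bare IndexTupleData is 8 bytes (a 6-byte
heap pointer plus 2 bytes of t_info), so for n duplicates of a k-byte key
the current layout needs about n*(8+k) bytes, while the posting-list layout
needs about (8+k) + n*8 bytes:

/*
 * Rough illustration only, not code from the patch: compare the space
 * used by n duplicates under the current layout (header + key repeated
 * per heap row) and under a posting-list layout (header + key stored
 * once, then one bare 8-byte IndexTupleData per heap row).  Alignment
 * padding is ignored.
 */
#include <stdio.h>

#define TUPLE_HEADER 8			/* bare IndexTupleData: 6-byte TID + t_info */

int
main(void)
{
	int			key_size = 16;	/* assumed key width in bytes */
	int			ndups = 1000;	/* assumed number of duplicates */
	long		current = (long) ndups * (TUPLE_HEADER + key_size);
	long		posting = (TUPLE_HEADER + key_size) + (long) ndups * TUPLE_HEADER;

	printf("current layout: %ld bytes, posting list: %ld bytes\n",
		   current, posting);
	return 0;
}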

Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Sep 3, 2015 at 4:00 AM, Shulgin, Oleksandr
>  wrote:
> > I believe that having a floating IP for the master is much more practical
> > approach and it doesn't require any patch to libpq or modification of the
> > client connection settings.
> 
> I think that's a great approach if all the machines are on the same
> subnet.  If they are in different datacenters, it doesn't work.

Anycast could technically be used to address that issue, but there's a
whole host of reasons why that would be quite painful for a PG
connection.

> I think it would be far better to progress to talking about what
> design we'd be comfortable with, rather than kidding ourselves that a
> feature that everyone else has and which somebody has taken the time
> to implement (thus, obviously it has value for them) and which has
> been discussed to general approval at PGCon developer meetings and
> which has been endorsed on this thread by three committers is somehow
> something that nobody really needs.  Seriously?

Agreed.  For my part, I like the JDBC configuration approach and
definitely would ask that we support 'host:port' options since not all
servers will be on the same port.  I don't agree with Tom's concern
regarding the simultaneous connection to all servers at once (yes, it's
a bit unfriendly, but I don't see that as a reason to not provide that
choice and there's a lot of reasons why you'd want it).

What would be nice is a better way to configure these more complicated
options than the single string or even the current very simple
pg_service.conf file.  For example, a service name which could define
*other* service names to try along with a plan for how to connect to
them (round robin, simultaneously, read/write only, etc) and perhaps
also support specifying multiple service names to the 'service'
parameter.  I'd prefer that we support all different configuration
options through the 'single string' method also, but I'm not convinced
that's a hard requirement.
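
To make that concrete, here is a hypothetical example of what client code
might look like with such a multi-host option. The "hostlist" keyword and
the host:port syntax are purely illustrative -- nothing like that exists in
libpq today -- while the surrounding calls are plain, existing libpq API.

/*
 * Hypothetical example only: "hostlist" and the host:port pairs are part
 * of the proposal under discussion, not of today's libpq.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	const char *conninfo =
		"hostlist=pg1.example.com:5432,pg2.example.com:5433 "
		"dbname=app user=app connect_timeout=5";
	PGconn	   *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return EXIT_FAILURE;
	}

	printf("connected to %s:%s\n", PQhost(conn), PQport(conn));
	PQfinish(conn);
	return EXIT_SUCCESS;
}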

Thanks!

Stephen




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Kevin Grittner
Josh Berkus  wrote:

>>> You'd also need a way to let the connection nodes know when a replica
>>> has fallen behind so that they can be taken out of
>>> load-balancing/sharding for read queries.  For the synchronous model,
>>> that would be "fallen behind at all"; for asynchronous it would be
>>> "fallen more than ### behind".
>>
>> How is that different from the previous thing?  Just that we'd treat
>> "lagging" as "down" beyond some threshold?  That doesn't seem like a
>> mandatory feature.
>
> It's a mandatory feature if you want to load-balance reads.  We have to
> know which nodes not to send reads to because they are out of sync.

There is another approach to this that we should consider how (or
whether) we are going to cover: database affinity.  I have seen cases
where there are multiple databases which are targets of asynchronous
replication, with a web application load balancing among them.  The
application kept track of which copy each connection was using, so
that when the copies were not exactly in sync the user never saw "time
moving backward".  Two different users might see versions of the
data from different points in time, but that generally doesn't
matter, especially if the difference is just a few minutes.  If one
copy got too far behind for some reason, they would load-shift to
the other servers (time still moves forward, only with a "jump"
forward at the shift).  This would allow the tardy database to be
dedicated to catching up again.

Bottom line is that this very smooth behavior required two features
-- the ability for the application to control database affinity,
and the ability to shift that affinity gracefully (with no down
time).
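
As a rough sketch of the first of those (my illustration only, not
production code): the application pins each session to one copy and only
shifts it to another copy once that copy's apparent replication lag is
below some threshold, so time never jumps backward for the user.

/*
 * Minimal sketch of application-side database affinity, for illustration
 * only.  pg_last_xact_replay_timestamp() returns NULL on a primary, which
 * the COALESCE turns into zero lag.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

/* apparent replay lag of a replica in seconds, or a huge value on error */
double
replica_lag_seconds(PGconn *conn)
{
	double		lag = 1e9;
	PGresult   *res = PQexec(conn,
							 "SELECT COALESCE(EXTRACT(EPOCH FROM now() - "
							 "pg_last_xact_replay_timestamp()), 0)");

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		lag = atof(PQgetvalue(res, 0, 0));
	PQclear(res);
	return lag;
}

/* shift a session's affinity only if the candidate copy has caught up */
int
can_shift_session_to(PGconn *candidate, double max_lag_seconds)
{
	return replica_lag_seconds(candidate) <= max_lag_seconds;
}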

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-03 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:47 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane  wrote:
>>> But if you have in mind that typical FDWs would actually create join paths
>>> at that point, consider that
>>>
>>> 1. The FDW would have to find all the combinations of its supplied
>>> relations (unless you are only intending to generate one path for the
>>> union of all such rels, which seems pretty narrow-minded from here).
>
>> Well, if the remote end is another database server, presumably we can
>> leave it to optimize the query, so why would we need more than one
>> path?
>
> If you have say 5 relations in the query, 3 of which are foreign, it might
> make sense to join all 3 at the remote end, or maybe you should only join
> 2 of them remotely because it's better to then join to one of the local
> rels before joining the last remote rel.

True.  But that's not the problem I'm concerned about.  Suppose the
query looks like this:

SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x = ft2.x LEFT JOIN t1 ON ft2.y
= t1.y LEFT JOIN ft3 ON ft1.z = ft3.z LEFT JOIN t2 ON ft1.w = t2.w;

Now, no matter where we put the hooks, we'll consider foreign join
paths for all of the various combinations of tables that we could push
down.  We'll decide between those various options based on cost, which
is fine.  But let's consider just one joinrel, the one that includes
(ft1 ft2 ft3).  Assume that the remote tables have the same names as
the local tables.  The path that implements a pushed-down join of all
three tables will send one of these two queries to the remote server:

SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x = ft2.x LEFT JOIN ft3 ON ft1.z = ft3.z;
SELECT * FROM ft1 LEFT JOIN ft3 ON ft1.z = ft3.z LEFT JOIN ft2 ON
ft1.x = ft2.x;

We need to generate one of those two queries, and we need to figure
out what the remote server thinks it will cost to execute.  We
presumably do not want to cost both of them, because if it's legal to
commute the joins, the remote server can and will do that itself.  It
would be stupid to cost both possible queries if the remote server is
going to pick the same plan either way.  However - and this is the key
point - the one we choose to generate *must represent a legal join
order*.  If the ft1-ft2 join were a FULL JOIN instead of a LEFT JOIN,
the second query wouldn't be a legal thing to send to the remote
server.  So, the problem I'm worried about is: given that we know we
want to at least consider the path that pushes the whole join to the
remote server, how do we construct an SQL query that embodies a legal
join order of the relations being pushed down?

> Even if you claim that that
> would never make sense from a cost standpoint (a claim easily seen to be
> silly), there might not be any legal way to join all 3 directly because of
> join order constraints.
>
> The larger point is that we can't expect the remote server to be fully
> responsible for optimizing, because it will know nothing of what's being
> done on our end.

No argument with any of that.

>> So, the problem is that I don't think this entirely skirts the
>> join_is_legal issues, which are a principal point of concern for me.
>> Say this is a joinrel between (A B) and (C D E).  We need to generate
>> an SQL query for (A B C D E).  We know that the outermost syntactic
>> join can be (A B) to (C D E).  But how do we know which join orders
>> are legal as among (C D E)?  Maybe there's a simple way to handle this
>> that I'm not seeing.
>
> Well, if the joins get built up in the way I think should happen, we'd
> have already considered (C D E), and we could have recorded the legal join
> orders within that at the time.  (I imagine that we should allow FDWs to
> store some data within RelOptInfo structs that represent foreign joins
> belonging entirely to them, so that there'd be a handy place to keep that
> data till later.)

Yes, that would help.  Can fdw_private serve that purpose, or do we
need something else?

> Or we could trawl through the paths associated with the
> child joinrel, which will presumably include instances of every reasonable
> sub-join combination.  Or the FDW could look at the SpecialJoinInfo data
> and determine things for itself (or more likely, ask join_is_legal about
> that).

Yeah, this is the part I'm worried will be complex, which accounts for
the current hook placement.  I'm worried that trawling through that
SpecialJoinInfo data will end up needing to duplicate much of
make_join_rel and add_paths_to_joinrel.  For example, consider:

SELECT * FROM verysmall v JOIN (bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r;

The best path for this plan is presumably something like this:

Nested Loop
-> Seq Scan on verysmall v
-> Foreign Scan on bigft1 and bigft2
Remote SQL: SELECT * FROM bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Now, how is the 

[HACKERS] Too many duplicated condition query return wrong value

2015-09-03 Thread Atsushi Yoshida
Hi.

I came across a strange result.
I executed the following query.

> SELECT "attend"."lid", "attend"."status" FROM "attend" WHERE "attend"."sid" = 
> 325 AND "attend"."lid" IN ('ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 
> 'ABF0010', 'ABF0010', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 
> 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 'ABF0020', 

Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Dave Page
On Thu, Sep 3, 2015 at 3:56 PM, Robert Haas  wrote:
> On Thu, Sep 3, 2015 at 4:00 AM, Shulgin, Oleksandr
>  wrote:
>> I believe that having a floating IP for the master is much more practical
>> approach and it doesn't require any patch to libpq or modification of the
>> client connection settings.
>
> I think that's a great approach if all the machines are on the same
> subnet.  If they are in different datacenters, it doesn't work.
>
> The amount of opposition to this feature is remarkable considering
> that it's available in Oracle, SQL Server, MongoDB, Cassandra, and
> MySQL.  See for example:
>
> http://docs.mongodb.org/manual/reference/connection-string/
> https://datastax.github.io/python-driver/getting_started.html
>
> This is a small patch with minimal to no downside implementing a
> feature that is present in most or all of the major competing
> products.  We're really doing ourselves a disservice if we reject it.
> I think it would be far better to progress to talking about what
> design we'd be comfortable with, rather than kidding ourselves that a
> feature that everyone else has and which somebody has taken the time
> to implement (thus, obviously it has value for them) and which has
> been discussed to general approval at PGCon developer meetings and
> which has been endorsed on this thread by three committers is somehow
> something that nobody really needs.  Seriously?

+100

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Michael Paquier
On Thu, Sep 3, 2015 at 1:46 AM, Jim Nasby  wrote:
> The double set of exceptions seems to be critical; if instead of calling
> tf.get(invoice) (which recursively does tf.get(customer)) I define the
> cursor to call tf.get(customer) directly there's no assert.

Finally I have been able to track down the problem; it can be
reproduced with the following test case, for example:
BEGIN;
CREATE OR REPLACE FUNCTION create_self_tab() RETURNS text
LANGUAGE plpgsql AS $$
BEGIN
  CREATE TEMP TABLE new_table ON COMMIT DROP AS SELECT create_self_tab();
  RETURN 'foo';
END $$;
DECLARE h CURSOR FOR SELECT create_self_tab();
SAVEPOINT self_tab_point;
FETCH h; -- error
COMMIT;

When fetching the first tuple, the transaction state is cleaned up back to
the savepoint because the call actually fails in the second loop, as the
temporary relation already exists; but the temporary table that has been
created is cleaned up while it is still referenced, causing the assertion
failure. So that's not related to the use of the exception blocks. What
pointed me at the SAVEPOINT causing the issue is the use of
ON_ERROR_ROLLBACK in the test case Jim proposed. I don't have a patch at
hand yet, but things are now easier to test.
-- 
Michael


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 4:00 AM, Shulgin, Oleksandr
 wrote:
> I believe that having a floating IP for the master is much more practical
> approach and it doesn't require any patch to libpq or modification of the
> client connection settings.

I think that's a great approach if all the machines are on the same
subnet.  If they are in different datacenters, it doesn't work.

The amount of opposition to this feature is remarkable considering
that it's available in Oracle, SQL Server, MongoDB, Cassandra, and
MySQL.  See for example:

http://docs.mongodb.org/manual/reference/connection-string/
https://datastax.github.io/python-driver/getting_started.html

This is a small patch with minimal to no downside implementing a
feature that is present in most or all of the major competing
products.  We're really doing ourselves a disservice if we reject it.
I think it would be far better to progress to talking about what
design we'd be comfortable with, rather than kidding ourselves that a
feature that everyone else has and which somebody has taken the time
to implement (thus, obviously it has value for them) and which has
been discussed to general approval at PGCon developer meetings and
which has been endorsed on this thread by three committers is somehow
something that nobody really needs.  Seriously?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 11:42 AM, Stephen Frost  wrote:
>> > I believe that having a floating IP for the master is much more practical
>> > approach and it doesn't require any patch to libpq or modification of the
>> > client connection settings.
>>
>> I think that's a great approach if all the machines are on the same
>> subnet.  If they are in different datacenters, it doesn't work.
>
> Anycast could technically be used to address that issue, but there's a
> whole host of reasons why that would be quite painful for a PG
> connection.

/me rolls eyes.

>> I think it would be far better to progress to talking about what
>> design we'd be comfortable with, rather than kidding ourselves that a
>> feature that everyone else has and which somebody has taken the time
>> to implement (thus, obviously it has value for them) and which has
>> been discussed to general approval at PGCon developer meetings and
>> which has been endorsed on this thread by three committers is somehow
>> something that nobody really needs.  Seriously?
>
> Agreed.  For my part, I like the JDBC configuration approach and
> definitely would ask that we support 'host:port' options since not all
> servers will be on the same port.  I don't agree with Tom's concern
> regarding the simultaneous connection to all servers at once (yes, it's
> a bit unfriendly, but I don't see that as a reason to not provide that
> choice and there's a lot of reasons why you'd want it).

Yep.  And it can even be configurable behavior, as I suggested upthread.

> What would be nice is a better way to configure these more complicated
> options than the single string or even the current very simple
> pg_service.conf file.  For example, a service name which could define
> *other* service names to try along with a plan for how to connect to
> them (round robin, simultaneously, read/write only, etc) and perhaps
> also support specifying multiple service names to the 'service'
> parameter.  I'd prefer that we support all different configuration
> options through the 'single string' method also, but I'm not convinced
> that's a hard requirement.

Maybe someday we should have all that, but I think for right now
that's complicating things unnecessarily.  I think the best proposal
so far is to allow the host=X option to be repeated multiple times.
If you repeat the host=X option N times, you can also repeat the
port=X option exactly N times, or else you can specify it just once.
Done.

Alternatively, leave the host=X option alone and add a new option
hostlist=X, allowing a comma-separated list of names or IPs, with each
hostname or IP allowed an optional :port suffix.  If host=X parameter
is omitted or the connection to that machine fails, try everybody in
the hostlist concurrently, or with some configurable (and presumably
short) delay between one and the next.  Again, done.

Alternatively, change the rules for parsing the existing host=X
parameter so that we split it on some separator that isn't a valid
hostname character, and then strip off an optional :port syntax from
each entry; that value, if present, overrides port=X for that entry.
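
The parsing part of that last alternative really is trivial -- something
along these lines (a sketch only, using ',' as the separator for
illustration and ignoring socket-directory paths that contain ':'; this is
obviously not what libpq does today):

/*
 * Sketch of the parsing rule only; libpq has no such behavior today.
 * Split the host string on commas and let an optional ":port" suffix on
 * each entry override the global port.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
parse_hosts(const char *hostspec, const char *default_port)
{
	char	   *copy = strdup(hostspec);
	char	   *saveptr = NULL;
	char	   *entry;

	for (entry = strtok_r(copy, ",", &saveptr);
		 entry != NULL;
		 entry = strtok_r(NULL, ",", &saveptr))
	{
		char	   *colon = strrchr(entry, ':');
		const char *port = default_port;

		if (colon != NULL)
		{
			*colon = '\0';
			port = colon + 1;
		}
		printf("candidate host \"%s\" port %s\n", entry, port);
	}
	free(copy);
}

int
main(void)
{
	parse_hosts("pg1.example.com:5432,pg2.example.com,192.0.2.10:5433", "5432");
	return 0;
}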

I think we're really tying ourselves in knots about problems that
really aren't very hard to solve here.  I'm sure some of these
proposals are better than others and the ideal thing may be something
else again.  But if NASA can send a space probe 7.5 billion kilometers
to a frigid spheroid in the outer solar system without crashing into
anything or having any catastrophic software or hardware failures, I
bet we can come up with a convenient way to specify multiple IP
addresses.  I'd like the story of this feature to resemble a work by
e.e. cummings more than it does one by Robert Jordan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] [PROPOSAL] Inputs on forcing VACUUM VERBOSE to write timestamp

2015-09-03 Thread dinesh kumar
Hi All,

Greetings for the day.

I would like to know your inputs/suggestions on the proposal below. Kindly
let me know if it has already been taken care of.

*Proposal*

Forcing VACUUM VERBOSE to write a timestamp for each "INFO" entry. This was
raised already in an earlier thread, and I wanted to discuss it again. I
didn't see any replies on that old thread.

*Justification*


Recording the timestamp as well gives an idea of how much time each VACUUM
activity took.

Regards,
Dinesh
manojadinesh.blogspot.com


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Masahiko Sawada
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada  wrote:
> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>  wrote:
>> Jim Nasby wrote:
>>
>>> I think things like pageinspect are very different; I really can't see any
>>> use for those beyond debugging (and debugging by an expert at that).
>>
>> I don't think that necessarily means it must continue to be in contrib.
>> Quite the contrary, I think it is a tool critical enough that it should
>> not be relegated to be a second-class citizen as it is now (let's face
>> it, being in contrib *is* second-class citizenship).
>>
>
> Attached patch is latest patch.

The previous patch lacked some files for the regression tests.
Attached is the fixed v12 patch.


Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v12.patch
Description: Binary data



Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 12:57 PM, Shulgin, Oleksandr
 wrote:
> On Thu, Sep 3, 2015 at 6:02 PM, Robert Haas  wrote:
>> Maybe someday we should have all that, but I think for right now
>> that's complicating things unnecessarily.  I think the best proposal
>> so far is to allow the host=X option to be repeated multiple times.
>> If you repeat the host=X option N times, you can also repeat the
>> port=X option exactly N times, or else you can specify it just once.
>> Done.
>
> But this already breaks backwards-compatibility with any clients who believe
> that whatever value is specified last takes precedence.  I'm not arguing
> that there are such use cases in the wild or that it's an entirely sane thing
> to do, but still.

Yep.  If we care about backward compatibility, there can be a new
option that must be specified to get the new behavior.  We can also
decide not to care about this case.

> More importantly, this will break any code that tries to parse the conninfo
> string and produce a hashmap from it for modification.

That is true, but I am not sure I agree that it is important.  Switch
to a hashmap whose values are arrays.

>> Alternatively, leave the host=X option alone and add a new option
>> hostlist=X, allowing a comma-separated list of names or IPs, with each
>> hostname or IP allowed an optional :port suffix.  If host=X parameter
>> is omitted or the connection to that machine fails, try everybody in
>> the hostlist concurrently, or with some configurable (and presumably
>> short) delay between one and the next.  Again, done.
>
> The exact behavior when both host/port and hostlist are specified
> becomes really tricky then.  It's already tricky enough, if you recall the
> service files -- how are they going to come into play here?

It doesn't seem that tricky to me, but maybe I'm biased by having just
invented it 5 minutes ago.

> I believe the fewer implicit workings there are in the way libpq connects,
> the better.

I don't disagree with that as a general rule - only when it keeps us
from implementing useful features.

>> Alternatively, change the rules for parsing the existing host=X
>> parameter so that we split it on some separator that isn't a valid
>> hostname character, and then strip off an optional :port syntax from
>> each entry; that value, if present, overrides port=X for that entry.
>
> It's tempting to use ':' as the separator here, but it's still valid for
> directory names and host can be one in case of UN*X sockets.

The directory name is only likely to contain : on Windows, and Windows
doesn't support UNIX sockets.

All of these objections seem pretty thin to me.  I'd accept any of
them as a reason for preferring one alternative over another, but I
don't accept that the presence of a few problems of this magnitude
means we should give up on the feature.  It's a good enough feature
that it is worth the possibility of slightly inconveniencing someone
running in an unusual configuration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Robert Haas
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
 wrote:
> Jim Nasby wrote:
>> I think things like pageinspect are very different; I really can't see any
>> use for those beyond debugging (and debugging by an expert at that).
>
> I don't think that necessarily means it must continue to be in contrib.
> Quite the contrary, I think it is a tool critical enough that it should
> not be relegated to be a second-class citizen as it is now (let's face
> it, being in contrib *is* second-class citizenship).

I have resisted that principle for years and will continue to do so.
It is entirely reasonable for some DBAs to want certain functionality
(debugging tools, crypto) to not be installed on their machines.
Folding everything into core is not a good policy, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Shulgin, Oleksandr
On Thu, Sep 3, 2015 at 6:02 PM, Robert Haas  wrote:

>
> Maybe someday we should have all that, but I think for right now
> that's complicating things unnecessarily.  I think the best proposal
> so far is to allow the host=X option to be repeated multiple times.
> If you repeat the host=X option N times, you can also repeat the
> port=X option exactly N times, or else you can specify it just once.
> Done.
>

But this already breaks backwards-compatibility with any clients who believe
that whatever value is specified last takes precedence.  I'm not arguing
that there are such use cases in the wild or that it's an entirely sane thing
to do, but still.

More importantly, this will break any code that tries to parse the conninfo
string and produce a hashmap from it for modification.

> Alternatively, leave the host=X option alone and add a new option
> hostlist=X, allowing a comma-separated list of names or IPs, with each
> hostname or IP allowed an optional :port suffix.  If host=X parameter
> is omitted or the connection to that machine fails, try everybody in
> the hostlist concurrently, or with some configurable (and presumably
> short) delay between one and the next.  Again, done.
>

The exact behavior when both host/port and hostlist are specified
becomes really tricky then.  It's already tricky enough, if you recall the
service files -- how are they going to come into play here?

I believe the fewer implicit workings there are in the way libpq connects,
the better.

> Alternatively, change the rules for parsing the existing host=X
> parameter so that we split it on some separator that isn't a valid
> hostname character, and then strip off an optional :port syntax from
> each entry; that value, if present, overrides port=X for that entry.
>

It's tempting to use ':' as the separator here, but it's still valid for
directory names and host can be one in case of UN*X sockets.

--
Alex


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> Shulgin, Oleksandr wrote:
> 
> > > Alternatively, change the rules for parsing the existing host=X
> > > parameter so that we split it on some separator that isn't a valid
> > > hostname character, and then strip off an optional :port syntax from
> > > each entry; that value, if present, overrides port=X for that entry.
> > 
> > It's tempting to use ':' as the separator here, but it's still valid for
> > directory names and host can be one in case of UN*X sockets.
> 
> I think that's rare enough that we could just say that if you want to
> have a : in a directory name used for local connections, you have to
> escape the : character.  This is going to be pretty easy to detect as a
> problem because of the obvious error message ("cannot parse "pg" in
> /usr/sockets:pg as a port number"), except in the even rarer case that
> the only stuff after the colon is digits.

If we really want to worry about this, we could simply check if the
directory exists with the ':5433' or whatever at the end and, if it
does, use whatever the port specification is.  If that directory doesn't
exist, and one without the ':5433' does, then we try that directory and
that port.
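
In code that check is only a couple of stat() calls -- roughly like the
sketch below (just an illustration of the heuristic, not existing libpq
behavior):

/*
 * Sketch of the disambiguation heuristic only.  If the literal string,
 * colon and all, names an existing directory, treat it as the socket
 * directory; otherwise strip the ":port" suffix and, if the remainder is
 * a directory, use the suffix as the port.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

static void
resolve_socket_dir(const char *spec, const char *default_port)
{
	struct stat st;
	char	   *dir = strdup(spec);
	const char *port = default_port;
	char	   *colon = strrchr(dir, ':');

	if (stat(dir, &st) == 0 && S_ISDIR(st.st_mode))
	{
		/* the whole string, colon included, really is a directory */
	}
	else if (colon != NULL)
	{
		*colon = '\0';
		if (stat(dir, &st) == 0 && S_ISDIR(st.st_mode))
			port = colon + 1;	/* e.g. "/tmp" exists, ":5433" is the port */
	}
	printf("socket directory \"%s\", port %s\n", dir, port);
	free(dir);
}

int
main(void)
{
	resolve_socket_dir("/tmp:5433", "5432");
	return 0;
}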

Personally, I agree with Alvaro that it's really just overkill to worry
about though.

Thanks!

Stephen




Re: [HACKERS] Is this a bug?

2015-09-03 Thread Robert Haas
On Wed, Aug 26, 2015 at 3:31 PM, Alvaro Herrera
 wrote:
> Fabrízio de Royes Mello wrote:
>
>> Why this patch was reverted one day after applied [1]? I didn't see any
>> discussion around it.
>
> Contributors whose patches are getting committed should really subscribe
> to pgsql-committers.

I would have thought discussion of committed patches should be moved
to -hackers.  The description for the -committers list says:

"Notification of git commits are sent to this list. Do not post here!"

So, it's understandable that people would not expect other traffic there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PG_CATCH used without PG_RETHROW

2015-09-03 Thread Tom Lane
Greg Stark  writes:
> My understanding is that PG_TRY/PG_CATCH doesn't save enough state to
> avoid rethrowing errors and if you want to actually continue the
> transaction you must use a subtransaction. As a result I was under the
> impression it was mandatory to PG_RETHROW as a result.

> If that's the case then I think I just came across a bug in
> utils/adt/xml.c where there's no PG_RETHROW:

The reason we think that's OK is that we assume libxml2 does not call back
into the general backend code, so there is no PG state we'd have to undo.
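
(For reference, the conventional shape when there *is* backend state to
undo looks roughly like the sketch below -- clean up locally and re-throw,
or use a subtransaction as Greg says.  The macro is actually spelled
PG_RE_THROW(); risky_call() and do_local_cleanup() are placeholders for
illustration.)

#include "postgres.h"			/* PG_TRY/PG_CATCH/PG_RE_THROW come via elog.h */

extern void risky_call(void);
extern void do_local_cleanup(void);

static void
call_with_cleanup(void)
{
	PG_TRY();
	{
		risky_call();			/* may elog(ERROR) */
	}
	PG_CATCH();
	{
		do_local_cleanup();		/* undo whatever backend state we changed */
		PG_RE_THROW();			/* then propagate the error */
	}
	PG_END_TRY();
}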

regards, tom lane




Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 2:26 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
>>  wrote:
>
>> > I don't think that necessarily means it must continue to be in contrib.
>> > Quite the contrary, I think it is a tool critical enough that it should
>> > not be relegated to be a second-class citizen as it is now (let's face
>> > it, being in contrib *is* second-class citizenship).
>>
>> I have resisted that principle for years and will continue to do so.
>> It is entirely reasonable for some DBAs to want certain functionality
>> (debugging tools, crypto) to not be installed on their machines.
>> Folding everything into core is not a good policy, IMHO.
>
> I don't understand.  I'm just proposing that the source code for the
> extension live in src/extensions/, and have the shared library
> installed by toplevel make install; I'm not suggesting that the
> extension is installed automatically.  For that, you still need a
> superuser to run CREATE EXTENSION.

Oh.  Well, that's different.  I don't particularly support that
proposal, but I'm not prepared to fight over it either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PROPOSAL] Inputs on forcing VACUUM VERBOSE to write timestamp

2015-09-03 Thread Jeff Janes
On Thu, Sep 3, 2015 at 9:18 AM, Andres Freund  wrote:

> Hi,
>
> On 2015-09-03 21:45:52 +0530, dinesh kumar wrote:
> > Forcing VACUUM VERBOSE to write timestamp, for each "INFO" entry. This
> was
> > raised already in this
> > <
> http://www.postgresql.org/message-id/20031110162349.65542.qm...@web21408.mail.yahoo.com
> >
> > thread, and wanted to discuss it again. I didn't see any replies on this
> > old thread.
>
> Unconvinced - sounds like you're just re-inventing log_line_prefix.
>

Many times I've wanted a client_log_line_prefix.  If someone wants to
invent that, I'd second it.

Cheers,

Jeff


Re: [HACKERS] September 2015 Commitfest

2015-09-03 Thread Tomas Vondra

On 09/03/2015 03:06 PM, Andres Freund wrote:

Hi,

Two days ago the September Commitfest started. I'm going to be your
host^W commitfest manager this time round.

To start off I went through all entries and tried to make sure their
state is accurate. Right now we have:

Needs review:   47
Waiting on Author:  24
Ready for Committer: 7
Committed:   7
Rejected:3
Returned with Feedback:  3
Total:   91

Of the open entries 41 currently have no assigned reviewer(s).


I've just noticed that we're not tracking the hashjoin alloc bugfix, 
which was reported back in June. Can you please add it to this CF? This 
is the last message in the thread:


http://www.postgresql.org/message-id/55e78322.10...@2ndquadrant.com

I've also noticed the 'multivariate stats' patch was returned with 
feedback (as agreed on 25/8), but it was not added to this CF. Maybe I 
missed something, but I assumed that happens automatically. I plan to 
submit a new version, reflecting the discussion. Can we add it?


BTW is there some place tracking these commitfest rules?


My impression from the last commitfests and the current state of
this fest is that several authors with, in some cases numerous or
large, patches are not doing the corresponding amount of review on
other patches. Let's change that!


Given the size of the multivariate stats patch, I guess I'm one of those 
slackers. Will fix.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> I don't like adding a couple of seconds of test runtime for the benefit of
>> very slow platforms.

> Me either.  This is the first time I've seen an indication that the
> start_postmaster change mentioned in the comment is actually important
> for production use, rather than just being cleanup.  I think we ought
> to just fix it.  I'm willing to take care of the Unix side if someone
> will explain how to change the Windows side.

Attached is a draft patch for this.  I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows.  The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster.  If so, I do not know how to fix that, or whether
it's fixable at all.

Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.

If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.

regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6a36d29..f0025d7 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***
*** 28,33 
--- 28,34 
  #include 
  #include 
  #include 
+ #include 
  #include 
  
  #ifdef HAVE_SYS_RESOURCE_H
*** static int	CreateRestrictedProcess(char 
*** 153,162 
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static int	start_postmaster(void);
  static void read_post_opts(void);
  
! static PGPing test_postmaster_connection(bool);
  static bool postmaster_is_alive(pid_t pid);
  
  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 154,163 
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static pgpid_t start_postmaster(void);
  static void read_post_opts(void);
  
! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);
  
  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*** free_readfile(char **optlines)
*** 419,454 
   * start/test/stop routines
   */
  
! static int
  start_postmaster(void)
  {
  	char		cmd[MAXPGPATH];
  
  #ifndef WIN32
  
  	/*
  	 * Since there might be quotes to handle here, it is easier simply to pass
  	 * everything to a shell to process them.
- 	 *
- 	 * XXX it would be better to fork and exec so that we would know the child
- 	 * postmaster's PID directly; then test_postmaster_connection could use
- 	 * the PID without having to rely on reading it back from the pidfile.
  	 */
  	if (log_file != NULL)
! 		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
   exec_path, pgdata_opt, post_opts,
   DEVNULL, log_file);
  	else
! 		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
   exec_path, pgdata_opt, post_opts, DEVNULL);
  
! 	return system(cmd);
  #else			/* WIN32 */
  
  	/*
! 	 * On win32 we don't use system(). So we don't need to use & (which would
! 	 * be START /B on win32). However, we still call the shell (CMD.EXE) with
! 	 * it to handle redirection etc.
  	 */
  	PROCESS_INFORMATION pi;
  
--- 420,479 
   * start/test/stop routines
   */
  
! static pgpid_t
  start_postmaster(void)
  {
  	char		cmd[MAXPGPATH];
  
  #ifndef WIN32
+ 	pgpid_t		pm_pid;
+ 
+ 	/* Flush stdio channels just before fork, to avoid double-output problems */
+ 	fflush(stdout);
+ 	fflush(stderr);
+ 
+ 	pm_pid = fork();
+ 	if (pm_pid < 0)
+ 	{
+ 		/* fork failed */
+ 		write_stderr(_("%s: could not start server: %s\n"),
+ 	 progname, strerror(errno));
+ 		exit(1);
+ 	}
+ 	if (pm_pid > 0)
+ 	{
+ 		/* fork succeeded, in parent */
+ 		return pm_pid;
+ 	}
+ 
+ 	/* fork succeeded, in child */
  
  	/*
  	 * Since there might be quotes to handle here, it is easier simply to pass
  	 * everything to a shell to process them.
  	 */
  	if (log_file != NULL)
! 		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
   exec_path, pgdata_opt, post_opts,
   DEVNULL, log_file);
  	else
! 		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
   exec_path, pgdata_opt, post_opts, DEVNULL);
  
! 	(void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
! 
! 	/* exec failed */
! 	write_stderr(_("%s: could not start server: %s\n"),
!  progname, strerror(errno));
! 	exit(1);
! 
! 	return 0;	/* keep dumb compilers quiet */
! 
  #else			/* WIN32 */
  
  	/*
! 	 * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
! 	 * handle redirection etc.
  	 */
  	PROCESS_INFORMATION pi;
  
*** start_postmaster(void)
*** 

Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
>  wrote:

> > I don't think that necessarily means it must continue to be in contrib.
> > Quite the contrary, I think it is a tool critical enough that it should
> > not be relegated to be a second-class citizen as it is now (let's face
> > it, being in contrib *is* second-class citizenship).
> 
> I have resisted that principle for years and will continue to do so.
> It is entirely reasonable for some DBAs to want certain functionality
> (debugging tools, crypto) to not be installed on their machines.
> Folding everything into core is not a good policy, IMHO.

I don't understand.  I'm just proposing that the source code for the
extension live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically.  For that, you still need a
superuser to run CREATE EXTENSION.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Is this a bug?

2015-09-03 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Aug 26, 2015 at 3:31 PM, Alvaro Herrera
>  wrote:
> > Fabrízio de Royes Mello wrote:
> >
> >> Why this patch was reverted one day after applied [1]? I didn't see any
> >> discussion around it.
> >
> > Contributors whose patches are getting committed should really subscribe
> > to pgsql-committers.
> 
> I would have thought discussion of committed patches should be moved
> to -hackers.

I agree, but it happens quite frequently anyway.  Recently I have made
a point of changing the CC from -committers to -hackers, but due to
force of habit I often forget.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] PG_CATCH used without PG_RETHROW

2015-09-03 Thread Greg Stark
My understanding is that PG_TRY/PG_CATCH doesn't save enough state to
avoid rethrowing errors, and that if you want to actually continue the
transaction you must use a subtransaction.  So I was under the
impression it was mandatory to PG_RETHROW.

If that's the case then I think I just came across a bug in
utils/adt/xml.c where there's no PG_RETHROW:

/*
 * Functions for checking well-formed-ness
 */

#ifdef USE_LIBXML
static bool
wellformed_xml(text *data, XmlOptionType xmloption_arg)
{
bool result;
volatile xmlDocPtr doc = NULL;

/* We want to catch any exceptions and return false */
PG_TRY();
{
doc = xml_parse(data, xmloption_arg, true, GetDatabaseEncoding());
result = true;
}
PG_CATCH();
{
FlushErrorState();
result = false;
}
PG_END_TRY();

if (doc)
xmlFreeDoc(doc);

return result;
}
#endif

-- 
greg




Re: [HACKERS] pgbench stats per script & other stuff

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 3:26 AM, Fabien COELHO  wrote:
> I've left out:
>  - removing -N/-S upward compatibility shorthands
>but I will not cry if they are removed

I see no particular merit to breaking backward compatibility here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian  wrote:
> Yes, I assumed that.  Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.  However,
> of course, it is synchronous in being durable, not synchronous in terms
> of applying the WAL.  This is true of binary and logical replication.

But, Thomas Munro is fixing it!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-09-03 Thread Robert Haas
On Tue, Aug 25, 2015 at 1:25 AM, David Rowley
 wrote:
> If that's the case then why do we not enable verbose for all of the non-text
> outputs?
> It seems strange to start making exceptions on a case-by-case basis.

+1.  FORMAT and VERBOSE are separate options, and each one should
control what the name suggests, not the other thing.

Also: very nice performance results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Inputs on forcing VACUUM VERBOSE to write timestamp

2015-09-03 Thread Andres Freund
Hi,

On 2015-09-03 21:45:52 +0530, dinesh kumar wrote:
> Forcing VACUUM VERBOSE to write a timestamp for each "INFO" entry.  This
> was raised already in this thread, and I wanted to discuss it again.  I
> didn't see any replies on that old thread.

Unconvinced - sounds like you're just re-inventing log_line_prefix.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Potential GIN vacuum bug

2015-09-03 Thread Jeff Janes
On Mon, Aug 31, 2015 at 12:10 AM, Jeff Janes  wrote:

> On Sun, Aug 30, 2015 at 3:57 PM, Tom Lane  wrote:
>
>> Jeff Janes  writes:
>> > On Sun, Aug 30, 2015 at 11:11 AM, Tom Lane  wrote:
>>
> > But we would still have to deal with the
>> > fact that unconditional acquire attempt by the backends will cause a
>> vacuum
>> > to cancel itself, which is undesirable.
>>
>> Good point.
>>
>> > If we define a new namespace for
>> > this lock (like the relation extension lock has its own namespace) then
>> > perhaps the cancellation code could be made to not cancel on that
>> > condition.  But that too seems like a lot of work to backpatch.
>>
>> We could possibly teach the autocancel logic to distinguish this lock type
>> from others without using a new namespace.  That seems a bit klugy, but
>> maybe better than adding a new namespace.  (For example, there are
>> probably only a couple of modes in which we take page-level locks at
>> present.  Choosing a currently unused, but self-exclusive, mode for taking
>> such a lock might serve.)
>>
>
> Like the attached?  (The conditional variant for user backends was
> unceremoniously yanked out.)
>

A problem here is that now we have the user backends waiting on vacuum to
do the cleanup, but vacuum is using throttled IO and so taking its sweet
time at it.  Under the old code, the user backend could pass the vacuum
while it was sleeping.

Maybe we could have the vacuum detect when someone is waiting on it, and
temporarily suspend throttling so it just runs at full speed.  I don't
believe there is any precedent for that, but I do think there are other
places where such a technique could be useful.  That is kind of a scary
change to backpatch, though.
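
To make that concrete, here is a very rough sketch.  The waiter check is a
placeholder -- nothing like it exists for page-level locks today -- while
VacuumCostActive/VacuumCostBalance are the real knobs consulted by
vacuum_delay_point():

static void
maybe_stop_throttling(void)
{
	/* hypothetical check: is any backend queued behind our page lock? */
	if (VacuumCostActive && gin_pending_lock_has_waiters())
	{
		VacuumCostActive = false;	/* vacuum_delay_point() stops sleeping */
		VacuumCostBalance = 0;
	}
}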

I am running out of ideas.

Cheers,

Jeff


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Tom Lane
Michael Paquier  writes:
> Finally I have been able to crack down the problem, and it can be
> reproduced with the following test case for example:

Hm.  It looks like the problem is that we're executing the function
within the Portal created for cursor "h", and that Portal is older
than the subtransaction created by the savepoint, so when we abort the
subtransaction we do not clean up the Portal.  That means that resources
held by that Portal haven't been released --- in particular it still has
a relcache pin on the "new_table" --- when subtransaction abort gets to
the point of wanting to drop relcache entries created during the
subtransaction.  So the Assert has caught an actual problem: we'd have
flushed the relcache entry while there was still an open reference to it.

I'm inclined to think that the only real fix for this type of problem
is that at subtransaction abort, we have to prevent further execution
of any Portals that have executed at all within the subxact (ie force
them into PORTAL_FAILED state and clean out their resources, see
MarkPortalFailed).  It's not good enough to kill the one(s) that are
actively executing at the instant of failure, because anything that's
run at all since the subxact started might be holding a reference to an
object we're about to delete.

I'm a little worried that that will break usage patterns that work today,
though.
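
To illustrate the idea (a sketch only, not a patch): AtSubAbort_Portals()
could walk the portal hash and fail anything that has run since the subxact
started.  "activeSubid" here is a field that would have to be added to
track the subtransaction of the portal's last execution:

	HASH_SEQ_STATUS status;
	PortalHashEnt *hentry;

	hash_seq_init(&status, PortalHashTable);
	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
	{
		Portal		portal = hentry->portal;

		if (portal->activeSubid == mySubid &&
			portal->status == PORTAL_READY)
			MarkPortalFailed(portal);	/* runs the portal's cleanup hook */
	}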

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Too many duplicated condition query return wrong value

2015-09-03 Thread Jeff Janes
On Thu, Sep 3, 2015 at 5:14 AM, Atsushi Yoshida 
wrote:

> Hi.
>
> I cought a strange result.
> I execute such query.
>
> > SELECT "attend"."lid", "attend"."status" FROM "attend" WHERE
> "attend"."sid" = 325 AND "attend"."lid" IN ('ABF0010', 'ABF0010',
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010',
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010',
> 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010', 'ABF0010',

...

> 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060',
> 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060',
> 'ABF0060', 'ABF0060', 'ABF0060', 'ABF0060');
>



>
> it return
>
> >lid   | status
> > -+
> >  ABF0050 |  9
> >  ABF0040 |  9
> >  ABF0020 |  9
> >  ABF0010 |  9
> >  ABF0060 |  9
> > (5 rows)
>
> This IN condition to be unique and execute it like this.
>
> > arcvideo=> SELECT "attend"."lid", "attend"."status" FROM "attend" WHERE
> "attend"."sid" = 325 AND "attend"."lid" IN ('ABF0010', 'ABF0020',
> 'ABF0030', 'ABF0040', 'ABF0050', 'ABF0060’);
>
> It return
>
> >lid   | status
> > -+
> >  ABF0010 |  9
> >  ABF0020 |  9
> >  ABF0030 |  9
> >  ABF0040 |  9
> >  ABF0050 |  9
> >  ABF0060 |  9
> > (6 rows)
>
> First query and second query are same meaning I think, but the result is
> different.
>

Can you give an "explain (analyze, buffers)"  for each query?  Maybe you
have a corrupted index, and one query uses the index and the other does not.

Cheers,

Jeff


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-03 Thread Fabien COELHO



I've left out:
 - removing -N/-S upward compatibility shorthands
   but I will not cry if they are removed


I see no particular merit to breaking backward compatibility here.


I agree, but I would not fight for this. I think there is a good argument 
*NOT* to add more if new builtin scripts are added later.


Currently the builtin script can be selected with "-b t" (t for tpcb-like), 
"-b s" (s for simple-update) and "-b se" (se for select-only).


I've reused their current names for the option selector, and it takes the 
first matching prefix.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> Alternatively, change the rules for parsing the existing host=X
> parameter so that we split it on some separator that isn't a valid
> hostname character, and then strip off an optional :port syntax from
> each entry; that value, if present, overrides port=X for that entry.

Using a ':' should work just fine for that.  Having only one option for
how all the connections are done (concurrently, round robin, etc) and an
option for the timeout works for now.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Alvaro Herrera
Shulgin, Oleksandr wrote:

> > Alternatively, change the rules for parsing the existing host=X
> > parameter so that we split it on some separator that isn't a valid
> > hostname character, and then strip off an optional :port syntax from
> > each entry; that value, if present, overrides port=X for that entry.
> 
> It's tempting to use ':' as the separator here, but it's still valid for
> directory names and host can be one in case of UN*X sockets.

I think that's rare enough that we could just say that if you want to
have a : in a directory name used for local connections, you have to
escape the : character.  This is going to be pretty easy to detect as a
problem because of the obvious error message ("cannot parse "pg" in
/usr/sockets:pg as a port number"), except in the even rarer case that
the only stuff after the colon is digits.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Josh Berkus
On 09/03/2015 03:57 AM, Bruce Momjian wrote:
>> > 
>> > Yes, the logical replication has similar syncrep properties as the
>> > binary one (feedback works same way).

Oh?  What does UDR/BDR currently support for sync?

> Yes, I assumed that.  Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.  However,
> of course, it is synchronous in being durable, not synchronous in terms
> of applying the WAL.  This is true of binary and logical replication.

Well, there's no such thing as simultaneity in scalable architectures.
But users are already used to that ... anybody who load-balances to read
slaves knows about lag.  The only way* to ensure near-simultaneity is to
have some kind of single-node, single-process GTM for the cluster, and
then your actual scalability goes bye-bye.

The bigger issue we'll need to address with this is the fight between
lag and load-balancing, which would become a much worse issue with
read-load-balanced shards which are transparent to the user.  They'd see
the effects of lag, without having actually chosen to use this or that
replica.  This is the other reason to look at logical replication;
presumably with logrep, we can be more discriminating about what
activities cause lag (for one thing, vacuum won't).

Also:
On 09/03/2015 07:00 AM, Kevin Grittner wrote:
> There is another approach to this that we should consider how (if?)
> we are going to cover: database affinity.  I have seen cases where
> there are multiple databases which are targets of asynchronous
> replication, with a web application load balancing among them.  The
> application kept track of which copy each connection was using, so
> that when they were not exactly in sync the user never saw "time
> moving backward".  Two different users might see versions of the
> data from different points in time, but that generally doesn't
> matter, especially if the difference is just a few minutes.  If one
> copy got too far behind for some reason, they would load-shift to
> the other servers (time still moves forward, only there is a "jump"
> forward at the shift).  This would allow the tardy database to be
> dedicated to catching up again.
>
> Bottom line is that this very smooth behavior required two features
> -- the ability for the application to control database affinity,
> and the ability to shift that affinity gracefully (with no down
> time).

Yes.  Frankly, it would be *easier* to code things so that the same
session always gets its requests load balanced to the same copies;
making that a feature, too, is nice.


(* there are actually other ways to come close to simultaneity, but they
are much more complicated)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-03 Thread Shulgin, Oleksandr
On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule 
> wrote:
>>
>>
>>> Well, maybe I'm missing something, but sh_mq_create() will just
>>> overwrite the contents of the struct, so it doesn't care about
>>> sender/receiver: only sh_mq_set_sender/receiver() do.
>>>
>>
>> if you create sh_mq from scratch, then you can reuse structure.
>>
>
Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans
nested deeply in the call stack.  Not sure about using the executor hook,
since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and the
message queue, the IO synchronization is handled by the message queue
itself.  After some testing with concurrent pgbench and intentionally deep
recursive plpgsql functions (up to 700 plpgsql stack frames) I think this
approach can work.  Unless there's some theoretical problem I'm just not
aware of. :-)

Comments welcome!
--
Alex
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..40db40d 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMDSTATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ce4bdaf..5d5df58 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoRequested)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 000..5f31a2d
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,567 @@
+/*-
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_COMMAND_TAG
+} CmdStatusInfoResultCode;
+
+typedef struct {
+	LWLock	   *lock;
+	pid_t		target_pid;
+	pid_t		sender_pid;
+	CmdStatusInfoRequestType	request_type;
+	CmdStatusInfoResultCode		result_code;
+	char		buffer[FLEXIBLE_ARRAY_MEMBER];
+} CmdStatusInfo;
+
+#define BUFFER_SIZE			8192
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+
+bool CmdStatusInfoEnabled 

Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Christopher Browne
On 3 September 2015 at 12:57, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote
>
> On Thu, Sep 3, 2015 at 6:02 PM, Robert Haas  wrote:
>>
>>
>> Maybe someday we should have all that, but I think for right now
>> that's complicating things unnecessarily.  I think the best proposal
>> so far is to allow the host=X option to be repeated multiple times.
>> If you repeat the host=X option N times, you can also repeat the
>> port=X option exactly N times, or else you can specify it just once.
>> Done.
>
>
> But this already breaks backwards compatibility with any clients who
> believe that whichever value is specified last takes precedence.  I'm not
> arguing that there are such use cases in the wild or that it's an entirely
> sane thing to do, but still.
>
> More importantly, this will break any code that tries to parse the
> conninfo string and produce a hashmap from it for modification.

The notion of an "ordered hashmap" makes me break out in hives...

>> Alternatively, leave the host=X option alone and add a new option
>> hostlist=X, allowing a comma-separated list of names or IPs, with each
>> hostname or IP allowed an optional :port suffix.  If host=X parameter
>> is omitted or the connection to that machine fails, try everybody in
>> the hostlist concurrently, or with some configurable (and presumably
>> short) delay between one and then next.  Again, done.
>
>
> The exact behavior in case both host/port and hostlist are specified
> becomes really tricky then.  It's already tricky enough, if you recall the
> service files -- how are they going to come into play here?
>
> I believe the fewer implicit workings there are in the way libpq connects,
> the better.

In that case, let's have a New Option, and expressly break with the
implicit bits.

The new option ONLY accepts URIs, but may be given multiple times.

psql --uri postgresql://postgres@favehost:5432/some_db_name --uri
postgresql://postgres@favehost:5432/another_db_name --uri
postgresql://postgres@favehost:5432/third_db_name --uri
postgresql://postgres@favehost:5432/fourth_backup_db

Parsing conninfo strings is no fun.  I'm finding I prefer using URIs.  They
may even be easier to parse, not that I have thus far cared; using URIs
tends to mean I don't *need* to parse anything.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Alvaro Herrera
Robert Haas wrote:

> Alternatively, change the rules for parsing the existing host=X
> parameter so that we split it on some separator that isn't a valid
> hostname character, and then strip off an optional :port syntax from
> each entry; that value, if present, overrides port=X for that entry.

: is not a valid character in hostnames, so I think this makes sense.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Andres Freund
On 2015-09-03 12:10:08 -0400, Robert Haas wrote:
> On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian  wrote:
> > Yes, I assumed that.  Logical replication uses WAL, so if you are
> > synchronous with WAL, logical replication is synchronous too.  However,
> > of course, it is synchronous in being durable, not synchronous in terms
> > of applying the WAL.  This is true of binary and logical replication.

Actually that's not really true - it's just a question of which LSNs you
return.  For UDR/BDR the relevant LSN is the LSN of the last durably
committed transaction, and thus they wait for apply, not anything else.

> But, Thomas Munro is fixing it!

+ many to that effort.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Inputs on forcing VACUUM VERBOSE to write timestamp

2015-09-03 Thread dinesh kumar
On Thu, Sep 3, 2015 at 9:48 PM, Andres Freund  wrote:

> Hi,
>
> On 2015-09-03 21:45:52 +0530, dinesh kumar wrote:
> > Forcing VACUUM VERBOSE to write timestamp, for each "INFO" entry. This
> was
> > raised already in this
> > <
> http://www.postgresql.org/message-id/20031110162349.65542.qm...@web21408.mail.yahoo.com
> >
> > thread, and wanted to discuss it again. I didn't see any replies on this
> > old thread.
>
> Unconvinced - sounds like you're just re-inventing log_line_prefix.
>
Thanks for your inputs.

I thought it would be helpful to have this without modifying the
log_min_messages parameter.  Otherwise, we have to run VACUUM VERBOSE after
raising log_min_messages at the session level, so that the output gets
recorded in pg_log.

If we allow VERBOSE to log a timestamp, we can have it on stdout, which can
be redirected to a log file as needed.  This would be useful for the
vacuumdb binary too.

Let me know your thoughts.

Regards,
Dinesh
manojadinesh.blogspot.com


> Greetings,
>
> Andres Freund
>


Re: [HACKERS] Reduce ProcArrayLock contention

2015-09-03 Thread Robert Haas
On Tue, Aug 25, 2015 at 7:51 AM, Amit Kapila  wrote:
> [ new patch ]

Andres pinged me about this patch today.  It contained a couple of
very strange whitespace anomalies, but otherwise looked OK to me, so I
committed it after fixing those issues and tweaking the comments
a bit.  Hopefully we are in good shape now, but if there are any
remaining concerns, more review is welcome.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-09-03 Thread Masahiko Sawada
On Thu, Sep 3, 2015 at 3:57 AM, Robert Haas  wrote:
> On Mon, Aug 10, 2015 at 1:16 PM, Jeff Janes  wrote:
>> I have a simple test case that inserts an array of 101 md5 digests into each
>> row.  With 10_000 of these rows inserted into an already indexed table, I
>> get 40MB for the table and 80MB for the index unpatched.  With the patch, I
>> get 7.3 MB for the index.
>
> Uh, wow.  Seems like we should do something about this.
>

I looked into this patch.
The patch applies cleanly and compiles without error.
I think it works as we expected, but the patch has some unnecessary whitespace.
Also, the current documentation says that the pending list is cleaned up
only by vacuum; do we need to document this change?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-03 Thread Pavel Stehule
Hi




2015-09-03 18:30 GMT+02:00 Shulgin, Oleksandr 
:

> On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule 
>> wrote:
>>>
>>>
 Well, maybe I'm missing something, but sh_mq_create() will just
 overwrite the contents of the struct, so it doesn't care about
 sender/receiver: only sh_mq_set_sender/receiver() do.

>>>
>>> if you create sh_mq from scratch, then you can reuse structure.
>>>
>>
> Please find attached a v3.
>
> It uses a shared memory queue and also has the ability to capture plans
> nested deeply in the call stack.  Not sure about using the executor hook,
> since this is not an extension...
>
> The LWLock is used around initializing/cleaning the shared struct and the
> message queue, the IO synchronization is handled by the message queue
> itself.  After some testing with concurrent pgbench and intentionally deep
> recursive plpgsql functions (up to 700 plpgsql stack frames) I think this
> approach can work.  Unless there's some theoretical problem I'm just not
> aware of. :-)
>
>
Comments welcome!
>

I am not entirely happy with this design.  Only one EXPLAIN PID/GET STATUS
can be executed per server at a time - I remember lots of queries that
don't handle CANCEL well ~ don't handle interrupts well, and this can be
unfriendly.  I cannot say whether it is good enough for a first iteration.
This is functionality that will be used for diagnostics when you have an
overloaded server, and this risk looks too high (to me).  The idea of a
receive slot could address this risk well.  The difference from this code
should not be too big - although it is not trivial - it needs work with
PGPROC.



Other smaller issues:

* sending the output line by line is probably unnecessary - shm_mq_send can
pass bigger chunks when nowait = false (see the sketch after this list)
* pg_usleep(1000L); - it is tied to a single-point resource

Some ideas:

* this code shares some important parts with auto_explain (the query stack) -
and because it should be in core (due to signal handling, if I remember
well), it could be a first step toward integrating auto_explain into core.
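
A sketch for the shm_mq_send point above, assuming the whole EXPLAIN text
has already been collected in the ExplainState's output buffer (es->str)
and mqh is an already-attached shm_mq_handle (both names assumed here):

	shm_mq_result	res;

	res = shm_mq_send(mqh, es->str->len + 1, es->str->data, false);
	if (res != SHM_MQ_SUCCESS)
		elog(WARNING, "could not send EXPLAIN output to requesting backend");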



> --
> Alex
>
>


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Tom Lane
Andrew Dunstan  writes:
> There is no equivalent of execl, nor a cmd.exe equivalent of the 
> shell's exec. But surely the equivalent of the fork/execl you're doing 
> here would be a simple CreateProcess(). I don't see why you need a shell 
> in the middle on Windows at all.

The problem is to get the output redirection to work.  I imagine it could
be rewritten without involving the shell, but it'd be less than trivial
(or at least, I'm not going to do it).

If we did get that to work, it could probably be applied to the
service-start case as well, which currently just does a CreateProcess and
pays no attention to what happens to postmaster stderr.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Tom Lane
I wrote:
> I'm inclined to think that the only real fix for this type of problem
> is that at subtransaction abort, we have to prevent further execution
> of any Portals that have executed at all within the subxact (ie force
> them into PORTAL_FAILED state and clean out their resources, see
> MarkPortalFailed).  It's not good enough to kill the one(s) that are
> actively executing at the instant of failure, because anything that's
> run at all since the subxact started might be holding a reference to an
> object we're about to delete.

> I'm a little worried that that will break usage patterns that work today,
> though.

I experimented with the attached patch.  It seems to work to stop the
crash Michael exhibited (I've not tried to duplicate Jim's original
example, though).  However, it also causes a regression test failure,
because transactions.sql does this:

BEGIN;
DECLARE c CURSOR FOR SELECT unique2 FROM tenk1 ORDER BY unique2;
SAVEPOINT one;
FETCH 10 FROM c;
ROLLBACK TO SAVEPOINT one;
FETCH 10 FROM c;

which is exactly the case we're trying to reject now.  So that fills
me with fear that this approach would break existing applications.
On the other hand, I do not really see a good alternative :-(.

I thought about trying to detect whether the Portal actually had any
references to "new in subtransaction" objects to decide whether to
kill it or not, but that seems awfully fragile.

Ideas?

regards, tom lane

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 2794537..327b2a5 100644
*** a/src/backend/commands/portalcmds.c
--- b/src/backend/commands/portalcmds.c
*** PersistHoldablePortal(Portal portal)
*** 335,345 
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
!  errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;
  
  	/*
  	 * Set up global portal context pointers.
--- 335,341 
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);
  
  	/*
  	 * Set up global portal context pointers.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0df86a2 100644
*** a/src/backend/tcop/pquery.c
--- b/src/backend/tcop/pquery.c
*** PortalRun(Portal portal, long count, boo
*** 734,744 
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
!  errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;
  
  	/*
  	 * Set up global portal context pointers.
--- 734,740 
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);
  
  	/*
  	 * Set up global portal context pointers.
*** PortalRunFetch(Portal portal,
*** 1398,1408 
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
!  errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;
  
  	/*
  	 * Set up global portal context pointers.
--- 1394,1400 
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);
  
  	/*
  	 * Set up global portal context pointers.
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 7530498..74f5089 100644
*** a/src/backend/utils/mmgr/portalmem.c
--- b/src/backend/utils/mmgr/portalmem.c
*** CreatePortal(const char *name, bool allo
*** 232,237 
--- 232,238 
  	portal->status = PORTAL_NEW;
  	portal->cleanup = PortalCleanup;
  	portal->createSubid = GetCurrentSubTransactionId();
+ 	portal->activeSubid = portal->createSubid;
  	portal->strategy = PORTAL_MULTI_QUERY;
  	portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
  	portal->atStart = true;
*** UnpinPortal(Portal portal)
*** 403,408 
--- 404,428 
  }
  
  /*
+  * MarkPortalActive
+  *		Transition a portal from READY to ACTIVE state.
+  *
+  * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+  */
+ void
+ MarkPortalActive(Portal portal)
+ {
+ 	/* For safety, this is a runtime test not just an Assert */
+ 	if (portal->status != PORTAL_READY)
+ 		ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+  errmsg("portal \"%s\" cannot be run", portal->name)));
+ 	/* Perform the state transition */
+ 	portal->status = PORTAL_ACTIVE;
+ 	portal->activeSubid = GetCurrentSubTransactionId();
+ }
+ 
+ /*
   * MarkPortalDone
   *		Transition a portal from ACTIVE to DONE state.
   *
*** 

Re: [HACKERS] strange test in psql:startup.c

2015-09-03 Thread Robert Haas
On Wed, Aug 26, 2015 at 8:18 AM, Pavel Stehule  wrote:
> if (options.single_txn && options.action != ACT_FILE &&
> options.action == ACT_NOTHING)
> {
> fprintf(stderr, _("%s: -1 can only be used in
> non-interactive mode\n"), pset.progname);
> exit(EXIT_FAILURE);
> }
>
> the expression should probably be only:
>
> options.single_txn && options.action == ACT_NOTHING)

It seems this was changed by this commit:

commit c3c86ae2aff67676a49ec84240f1d6a482f359cb
Author: Peter Eisentraut 
Date:   Mon Jun 17 21:53:33 2013 -0400

psql: Re-allow -1 together with -c or -l

I guess the idea was that we wanted to allow -1 with -c or -l even
though it will have no effect in that case.  So your suggested change
looks right to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autonomous Transaction is back

2015-09-03 Thread Robert Haas
On Sat, Aug 22, 2015 at 2:23 AM, Noah Misch  wrote:
>> > Can you get away with only looking at tuples though?  For example,
>> > what about advisory locks?  Table locks?
>>
>> Well, that's an interesting question.  Can we get away with regarding
>> those things as non-conflicting, as between the parent and child
>> transactions?
>
> For system lock types, no.  While one could define advisory locks to work
> differently, we should assume that today's advisory lockers have expectations
> like those of system lockers.  An autonomous transaction should not bypass any
> lock that a transaction of another backend could not bypass.

Why?

Suppose you do this:

BEGIN;
DECLARE CURSOR foo FOR SELECT * FROM foo;
BEGIN AUTONOMOUS TRANSACTION;
ALTER TABLE foo ALTER bar TYPE int;

This has got to fail for safety reasons, but CheckTableNotInUse() is
on it.  Suppose you do this:

BEGIN;
LOCK foo;
BEGIN AUTONOMOUS TRANSACTION;
INSERT INTO foo VALUES ('spelunk');

How will making this fail improve anything?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Alvaro Herrera
Tom Lane wrote:
> Andrew Dunstan  writes:
> > There is no equivalent of execl, nor a cmd.exe equivalent of the 
> > shell's exec. But surely the equivalent of the fork/execl you're doing 
> > here would be a simple CreateProcess(). I don't see why you need a shell 
> > in the middle on Windows at all.
> 
> The problem is to get the output redirection to work.  I imagine it could
> be rewritten without involving the shell, but it'd be less than trivial
> (or at least, I'm not going to do it).

In the CreateProcess model, I think you can use startupInfo.hStdOutput,
see

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Petr Jelinek

On 2015-09-03 20:26, Alvaro Herrera wrote:

> Robert Haas wrote:
>> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
>>  wrote:
>>> I don't think that necessarily means it must continue to be in contrib.
>>> Quite the contrary, I think it is a tool critical enough that it should
>>> not be relegated to be a second-class citizen as it is now (let's face
>>> it, being in contrib *is* second-class citizenship).
>>
>> I have resisted that principle for years and will continue to do so.
>> It is entirely reasonable for some DBAs to want certain functionality
>> (debugging tools, crypto) to not be installed on their machines.
>> Folding everything into core is not a good policy, IMHO.
>
> I don't understand.  I'm just proposing that the source code for the
> extension live in src/extensions/, and that the shared library be
> installed by the toplevel make install; I'm not suggesting that the
> extension be installed automatically.  For that, you still need a
> superuser to run CREATE EXTENSION.

+1 for this

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] FETCH limited by bytes.

2015-09-03 Thread Corey Huinker
On Wed, Sep 2, 2015 at 9:08 AM, Andres Freund  wrote:

> On 2015-02-27 13:50:22 -0500, Corey Huinker wrote:
> > +static DefElem*
> > +get_option(List *options, char *optname)
> > +{
> > + ListCell *lc;
> > +
> > + foreach(lc, options)
> > + {
> > + DefElem *def = (DefElem *) lfirst(lc);
> > +
> > + if (strcmp(def->defname, optname) == 0)
> > + return def;
> > + }
> > + return NULL;
> > +}
>
>
> >   /*
> >* Do nothing in EXPLAIN (no ANALYZE) case.  node->fdw_state stays
> NULL.
> > @@ -915,6 +933,23 @@ postgresBeginForeignScan(ForeignScanState *node,
> int eflags)
> >   server = GetForeignServer(table->serverid);
> >   user = GetUserMapping(userid, server->serverid);
> >
> > + /* Reading table options */
> > + fsstate->fetch_size = -1;
> > +
> > + def = get_option(table->options, "fetch_size");
> > + if (!def)
> > + def = get_option(server->options, "fetch_size");
> > +
> > + if (def)
> > + {
> > + fsstate->fetch_size = strtod(defGetString(def), NULL);
> > + if (fsstate->fetch_size < 0)
> > + elog(ERROR, "invalid fetch size for foreign table
> \"%s\"",
> > +  get_rel_name(table->relid));
> > + }
> > + else
> > + fsstate->fetch_size = 100;
>
> I don't think it's a good idea to make such checks at runtime - and
> either way it's somethign that should be reported back using an
> ereport(), not an elog.


> Also, it seems somewhat wrong to determine this at execution
> time. Shouldn't this rather be done when creating the foreign scan node?
> And be a part of the scan state?
>

I agree, that was my original plan, but I wasn't familiar enough with the
FDW architecture to know where the table struct and the scan struct were
both exposed in the same function.

What I submitted incorporated some of Kyotaro's feedback (and working
patch) to my original incomplete patch.


>
> Have you thought about how this option should cooperate with join
> pushdown once implemented?
>

I hadn't until now, but I think the only sensible thing would be to
disregard table-specific settings once a second foreign table is detected,
and instead consider only the server-level setting.

I suppose one could argue that if ALL the tables in the join had the same
table-level setting, we should go with that, but I think that would be
complicated, expensive, and generally a good argument for changing the
server setting instead.
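
Coming back to the plan-time suggestion, here is a rough, untested sketch
of what I would try: resolve the option once while planning (e.g. called
from postgresGetForeignRelSize and stashed in the planning/scan state),
with the table-level option overriding the server-level one:

static int
get_fetch_size(ForeignTable *table, ForeignServer *server)
{
	ListCell   *lc;
	int			fetch_size = 100;	/* the current hard-coded default */

	/* look at server options first, then let table options override */
	foreach(lc, list_concat(list_copy(server->options), table->options))
	{
		DefElem    *def = (DefElem *) lfirst(lc);

		if (strcmp(def->defname, "fetch_size") == 0)
			fetch_size = strtol(defGetString(def), NULL, 10);
	}

	if (fetch_size <= 0)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("invalid fetch_size for foreign table \"%s\"",
						get_rel_name(table->relid))));

	return fetch_size;
}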


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Andrew Dunstan



On 09/03/2015 03:31 PM, Tom Lane wrote:

> I wrote:
>> Andres Freund  writes:
>>> I don't like adding a couple seconds of test runtime for the benefit of
>>> very slow platforms.
>
>> Me either.  This is the first time I've seen an indication that the
>> start_postmaster change mentioned in the comment is actually important
>> for production use, rather than just being cleanup.  I think we ought
>> to just fix it.  I'm willing to take care of the Unix side if someone
>> will explain how to change the Windows side.
>
> Attached is a draft patch for this.  I think it's fine for Unix (unless
> someone wants to object to relying on "/bin/sh -c"), but I have no idea
> whether it works for Windows.  The main risk is that if CMD.EXE runs
> the postmaster as a subprocess rather than overlaying itself a la shell
> "exec", the PID we'd get back would apply only to CMD.EXE not to the
> eventual postmaster.  If so, I do not know how to fix that, or whether
> it's fixable at all.
>
> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.
>
> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.




There is no equivalent of execl, nor a cmd.exe equivalent of the 
shell's exec. But surely the equivalent of the fork/execl you're doing 
here would be a simple CreateProcess(). I don't see why you need a shell 
in the middle on Windows at all.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-03 Thread Pavel Stehule
2015-09-03 22:06 GMT+02:00 Pavel Stehule :

> Hi
>
>
>
>
> 2015-09-03 18:30 GMT+02:00 Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de>:
>
>> On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule 
>>> wrote:


> Well, maybe I'm missing something, but sh_mq_create() will just
> overwrite the contents of the struct, so it doesn't care about
> sender/receiver: only sh_mq_set_sender/receiver() do.
>

 if you create sh_mq from scratch, then you can reuse structure.

>>>
>> Please find attached a v3.
>>
>> It uses a shared memory queue and also has the ability to capture plans
>> nested deeply in the call stack.  Not sure about using the executor hook,
>> since this is not an extension...
>>
>> The LWLock is used around initializing/cleaning the shared struct and the
>> message queue, the IO synchronization is handled by the message queue
>> itself.  After some testing with concurrent pgbench and intentionally deep
>> recursive plpgsql functions (up to 700 plpgsql stack frames) I think this
>> approach can work.  Unless there's some theoretical problem I'm just not
>> aware of. :-)
>>
>>
> Comments welcome!
>>
>
> I am not entirely happy with this design. Only one EXPLAIN PID/GET STATUS
> can be executed per server at a time - I remember lots of queries that
> don't handle CANCEL well ~ don't handle interrupts well, and this can be
> unfriendly. I cannot say whether it is good enough for a first iteration.
> This is functionality that will be used for diagnostics when you have an
> overloaded server, and this risk looks too high (to me). The idea of a
> receive slot could address this risk well (and can be used elsewhere). The
> difference from this code should not be too big - although it is not
> trivial - it needs work with PGPROC. The opinion of our multiprocess
> experts would be interesting. Maybe I am too careful.
>



>
>
>
> Other smaller issues:
>
> * probably sending line by line is useless - shm_mq_send can pass bigger
> data when nowait = false
> * pg_usleep(1000L); - it is related to single point resource
>
> Some ideas:
>
> * this code share some important parts with auto_explain (query stack) -
> and because it should be in core (due handling signal if I remember well),
> it can be first step of integration auto_explain to core.
>
>
>
>> --
>> Alex
>>
>>
>


[HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Noah Misch
My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception.  It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill.  The animals share a machine with
each other and with dozens of other developers.  I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods.  Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout.  Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice.  If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.


The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it.  The timing dependency comes from the pg_ctl "slop" time:

			/*
			 * Make sanity checks.  If it's for a standalone backend
			 * (negative PID), or the recorded start time is before
			 * pg_ctl started, then either we are looking at the wrong
			 * data directory, or this is a pre-existing pidfile that
			 * hasn't (yet?) been overwritten by our child postmaster.
			 * Allow 2 seconds slop for possible cross-process clock
			 * skew.
			 */

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:

All of this could be improved if we rewrote start_postmaster() so that it
could report the child postmaster's PID, so that we'd know a-priori the
correct PID to test with postmaster_is_alive().  That looks like a bit too
much change for so late in the 9.1 development cycle, unfortunately.

I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-   'second pg_ctl start succeeds');
+sleep 3;# bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+   'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
'pg_ctl stop -w');


Alternately, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Petr Jelinek
On 2015-09-02 19:57, Josh Berkus wrote:
> On 09/01/2015 04:14 PM, Petr Jelinek wrote:
>> On 2015-09-02 00:09, Josh Berkus wrote:
>>> Not really, the mechanism is different and the behavior is different.
>>> One critical deficiency in using binary syncrep is that you can't do
>>> round-robin redundancy at all; every redundant node has to be an exact
>>> mirror of another node.  In a good HA distributed system, you want
>>> multiple shards per node, and you want each shard to be replicated to a
>>> different node, so that in the event of node failure you're not dumping
>>> the full load on one other server.
>>>
>>
>> This assumes that we use binary replication, but we can reasonably use
>> logical replication which can quite easily do filtering of what's
>> replicated where.
>
> Is there a way to do logical synchronous replication?  I didn't think
> there was.
>

Yes, the logical replication has similar syncrep properties as the
binary one (feedback works same way).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-03 Thread Andres Freund
On 2015-09-02 20:20:45 +0200, Fabien COELHO wrote:
> >>+static SQLScript sql_script[MAX_SCRIPTS];
> >>
> >>+static struct {
> >>+   char *name;   /* very short name for -b ...*/
> >>+   char *desc;   /* short description */
> >>+   char *script; /* actual pgbench script */
> >>+} builtin_script[]
> >
> >Can't we put these in the same array?
> 
> I do not understand.

Right now builtins and user defined scripts are stored in different data
structures. I'd rather see them in the same.
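
Roughly something like the following shape, where the builtins are just
pre-filled entries (field names made up, not a concrete proposal):

typedef struct ParsedScript
{
	const char *name;		/* builtin name, or file name for -f scripts */
	const char *desc;		/* one-line description, NULL for custom */
	int			weight;		/* relative drawing weight */
	Command   **commands;	/* parsed commands of the script */
} ParsedScript;

static ParsedScript sql_script[MAX_SCRIPTS];	/* builtins and -f alike */
static int	num_scripts = 0;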

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Shulgin, Oleksandr
On Wed, Sep 2, 2015 at 9:00 PM, Robert Haas  wrote:

> On Wed, Sep 2, 2015 at 4:52 AM, Shulgin, Oleksandr
>  wrote:
> > On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund 
> wrote:
> >>
> >> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
> >> > But I think it's quite wrong to assume that the infrastructure for
> >> > this is available and usable everywhere, because in my experience,
> >> > that's far from the case.
> >>
> >> Especially when the alternative is a rather short patch implementing an
> >> otherwise widely available feature.
> >
> > But that won't actually help in the case described by Robert: if the
> master
> > server A failed, the client has no idea if B or C would become the new
> > master.
>
> Sure it does.  You just need to ensure that whichever of those is the
> new master accepts connections, and the other one doesn't.  There are
> lots of ways to do this; e.g. give the machine a second IP that
> accepts connections only when the machine is the designated master,
> and have read-write clients connect to that IP, and read-only clients
> connect to the machine's main IP.
>

Well, I see how that can help, but it still sounds like a lot of hassle.

What if you have six servers, A..F, listed in the client's connection
settings in that order, and after failing over from A, F is now the new
master (for whatever reason: I don't think it would be realistic to assume
that you can and always will fail over to the next host in the list)?  So
suddenly, the read-write clients need to make five failed connection
attempts before arriving at the master (add name resolution to the picture
for even more latency).  Connection pooling can probably mitigate this to
some degree, of course.

I believe that having a floating IP for the master is much more practical
approach and it doesn't require any patch to libpq or modification of the
client connection settings.

Andres's point is the same as mine: we ought to accept this feature,
> in some form, because it's really quite useful.
>

Even if someone is keen on implementing the multiple-connection-strings
approach, nothing stops them from doing that on top of libpq, and I really
think it will be even more flexible than anything we can build into libpq
itself.
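
For example, a trivial client-side loop over candidate conninfo strings
already gives you failover today (just a sketch; real code would also want
to report each host's PQerrorMessage):

static PGconn *
connect_first_available(const char *const *conninfos, int n)
{
	int			i;

	for (i = 0; i < n; i++)
	{
		PGconn	   *conn = PQconnectdb(conninfos[i]);

		if (conn && PQstatus(conn) == CONNECTION_OK)
			return conn;

		if (conn)
			PQfinish(conn);		/* this one failed -- try the next */
	}
	return NULL;
}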

--
Alex


[HACKERS] BRIN INDEX value

2015-09-03 Thread Tatsuo Ishii
When creating a BRIN index, I see some interesting behavior when it is used
with VACUUM.

First, I created a brin index after inserting data.
===
DROP TABLE t1;
DROP TABLE
CREATE TABLE t1(i int);
CREATE TABLE
INSERT INTO t1 VALUES (generate_series(1, 10));
INSERT 0 10
CREATE INDEX brinidx ON t1 USING brin (i);
CREATE INDEX
SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
'(0,0)'::tid;
 pages 
---
 (2,1)
 (2,2)
 (2,3)
 (2,4)
(4 rows)

SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx');
 itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |       value
------------+--------+--------+----------+----------+-------------+-------------------
          1 |      0 |      1 | f        | f        | f           | {1 .. 28928}
          2 |    128 |      1 | f        | f        | f           | {28929 .. 57856}
          3 |    256 |      1 | f        | f        | f           | {57857 .. 86784}
          4 |    384 |      1 | f        | f        | f           | {86785 .. 10}
(4 rows)

SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
'(0,0)'::tid;
 pages 
---
 (2,1)
 (2,2)
 (2,3)
 (2,4)
(4 rows)

VACUUM;
VACUUM
SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
'(0,0)'::tid;
 pages 
---
 (2,1)
 (2,2)
 (2,3)
 (2,4)
(4 rows)

SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx');
 itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |       value
------------+--------+--------+----------+----------+-------------+-------------------
          1 |      0 |      1 | f        | f        | f           | {1 .. 28928}
          2 |    128 |      1 | f        | f        | f           | {28929 .. 57856}
          3 |    256 |      1 | f        | f        | f           | {57857 .. 86784}
          4 |    384 |      1 | f        | f        | f           | {86785 .. 10}
(4 rows)
===

As you can see, the BRIN index value for block 384 and onward is {86785 .. 10}.  Good.

However, when I insert the data *after* creating the index, the value is
different.
===
psql -e -f test.sql test
Pager usage is off.
DROP TABLE t1;
DROP TABLE
CREATE TABLE t1(i int);
CREATE TABLE
CREATE INDEX brinidx ON t1 USING brin (i);
CREATE INDEX
SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
'(0,0)'::tid;
 pages 
---
 (2,1)
(1 row)

SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx');
 itemoffset | blknum | attnum | allnulls | hasnulls | placeholder | value
------------+--------+--------+----------+----------+-------------+-------
          1 |      0 |      1 | t        | f        | f           |
(1 row)

INSERT INTO t1 VALUES (generate_series(1, 10));
INSERT 0 10
SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
'(0,0)'::tid;
 pages 
---
 (2,1)
(1 row)

VACUUM;
VACUUM
SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
'(0,0)'::tid;
 pages 
---
 (2,1)
 (2,2)
 (2,3)
 (2,4)
(4 rows)

SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx');
 itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |       value
------------+--------+--------+----------+----------+-------------+------------------
          1 |      0 |      1 | f        | f        | f           | {1 .. 28928}
          2 |    128 |      1 | f        | f        | f           | {28929 .. 57856}
          3 |    256 |      1 | f        | f        | f           | {57857 .. 86784}
          4 |    384 |      1 | f        | f        | f           | {1 .. 10}
(4 rows)
===

How could the index value for block 384 be {1 .. 10}?

I have tested with 9.5 alpha2 and 9.5-stable head as of today.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-03 Thread Fabien COELHO


Hello Andres,

This v9:
 - adds "-b list" to show the list of builtins
 - removes the explicit --per-scripts-stats option, which is instead
   automatically set when several scripts are run or with per-command
   latencies (-r)
 - counts scripts from 1 instead of 0 in the output

I've left out:
 - removing the -N/-S upward-compatibility shorthands,
   but I will not cry if they are removed
 - requiring percents instead of integer weights, because
   it is too constrained
 - your "array" remark, as I did not understand it

Thanks to the restructuring and sharing of stats code, the patch does not 
change the loc count, although a few features are added.


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2517a3a..3edd65a 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -261,6 +261,25 @@ pgbench  options  dbname
 benchmarking arguments:
 
 
+ 
+  -b scriptname[@weight]
+  --builtin scriptname[@weight]
+  
+   
+Add the specified builtin script to the list of executed scripts.
+An optional integer weight after @ allows to adjust the
+probability of drawing the test.
+Available builtin scripts are: tpcb-like,
+simple-update and select-only.
+The provided scriptname needs only to be a prefix
+of the builtin name, hence simp would be enough to select
+simple-update.
+With special name list, show the list of builtin scripts
+and exit immediately.
+   
+  
+ 
+
 
  
   -c clients
@@ -307,14 +326,15 @@ pgbench  options  dbname
  
 
  
-  -f filename
-  --file=filename
+  -f filename[@weight]
+  --file=filename[@weight]
   

-Read transaction script from filename.
+Add a transaction script read from filename to
+the list of executed scripts.
+An optional integer weight after @ allows to adjust the
+probability of drawing the test.
 See below for details.
--N, -S, and -f
-are mutually exclusive.

   
  
@@ -404,10 +424,7 @@ pgbench  options  dbname
   --skip-some-updates
   

-Do not update pgbench_tellers and
-pgbench_branches.
-This will avoid update contention on these tables, but
-it makes the test case even less like TPC-B.
+Shorthand for -b simple-update@1.

   
  
@@ -499,9 +516,9 @@ pgbench  options  dbname
 Report the specified scale factor in pgbench's
 output.  With the built-in tests, this is not necessary; the
 correct scale factor will be detected by counting the number of
-rows in the pgbench_branches table.  However, when testing
-custom benchmarks (-f option), the scale factor
-will be reported as 1 unless this option is used.
+rows in the pgbench_branches table.
+However, when testing only custom benchmarks (-f option),
+the scale factor will be reported as 1 unless this option is used.

   
  
@@ -511,7 +528,7 @@ pgbench  options  dbname
   --select-only
   

-Perform select-only transactions instead of TPC-B-like test.
+Shorthand for -b select-only@1.

   
  
@@ -661,7 +678,20 @@ pgbench  options  dbname
   What is the Transaction Actually Performed in pgbench?
 
   
-   The default transaction script issues seven commands per transaction:
+   Pgbench executes test scripts chosen randomly from a specified list.
+   They include built-in scripts with -b and
+   user-provided custom scripts with -f.
+   Each script may be given a relative weight specified after a
+   @ so as to change its drawing probability.
+   The default weight is 1.
+ 
+
+  
+   The default builtin transaction script (also invoked with -b tpcb-like)
+   issues seven commands per transaction over randomly chosen aid,
+   tid, bid and balance.
+   The scenario is inspired by the TPC-B benchmark, but is not actually TPC-B,
+   hence the name.
   
 
   
@@ -675,9 +705,15 @@ pgbench  options  dbname
   
 
   
-   If you specify -N, steps 4 and 5 aren't included in the
-   transaction.  If you specify -S, only the SELECT is
-   issued.
+   If you select the simple-update builtin (also -N),
+   steps 4 and 5 aren't included in the transaction.
+   This will avoid update contention on these tables, but
+   it makes the test case even less like TPC-B.
+  
+
+  
+   If you select the select-only builtin (also -S),
+   only the SELECT is issued.
   
  
 
@@ -689,10 +725,7 @@ pgbench  options  dbname
benchmark scenarios by replacing the default transaction script
(described above) with a transaction script read from a file
(-f option).  In this case a transaction
-   counts as one execution of a script file.  You can even specify
-   multiple scripts (multiple -f options), in which
-   case a random one of the scripts 

Re: [HACKERS] [PATCH] pg_upgrade fails when postgres/template1 isn't in default tablespace

2015-09-03 Thread Andres Freund
Hi Bruce,

Are you actually waiting for review in this thread? Or is the CF entry
more of a reminder?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Andres Freund
On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
> --- a/src/bin/pg_ctl/t/001_start_stop.pl
> +++ b/src/bin/pg_ctl/t/001_start_stop.pl
> @@ -35,6 +35,7 @@ close CONF;
>  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>   'pg_ctl start -w');
> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
> - 'second pg_ctl start succeeds');
> +sleep 3;# bridge test_postmaster_connection() slop threshold
> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
> + 'second pg_ctl start fails');
>  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
>   'pg_ctl stop -w');

I don't like adding a couple of seconds of test runtime for the benefit of
very slow platforms.

The second pg_ctl start doesn't seem to test anything very
interesting, so I'm inclined to just remove it. I'm not sufficiently
caffeinated, but AFAICS that ought to sidestep the issue, as stop
doesn't depend on the slop time?

> crake failed the same way, once:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake=2015-07-07%2016%3A35%3A06

Sounds like an actual production hazard too.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-03 Thread Fabien COELHO




Right now builtins and user defined scripts are stored in different data
structures. I'd rather see them in the same.


They are already in the same array (sql_script) once pre-processed and
executed; there is no distinction beyond initialization.


The builtin_script array contains the equivalent of the external custom 
file (name, lines of code), so that they can be processed by 
process_builtin and addScript to build the SQLScript ready for execution, 
while for external files it relies on process_file and addScript.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] assessing parallel-safety

2015-09-03 Thread Amit Kapila
On Wed, Jul 15, 2015 at 11:20 AM, Amit Kapila 
wrote:
>
>
> I went ahead and completed this patch with respect to adding new clause
> in CREATE/ALTER FUNCTION which can allow users to define the
> parallel property for User defined functions.
>

Attached, find the rebased patch, which can be used to review/test the latest
version of the parallel_seqscan patch that I am going to post on the parallel
seq. scan thread soonish.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


assess-parallel-safety-v8.tar
Description: Unix tar archive

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-03 Thread Andres Freund
On 2015-09-03 16:28:40 +1200, David Rowley wrote:
> > Wouldn't it be better to just normalize fsec to an integer in that case?
> > Afaics that's the only remaining reason for the alternative path?
> >
> A special case would need to exist somewhere, unless we just modified
> timestamp2tm() to multiply fsec by 1 million when HAVE_INT64_TIMESTAMP is
> not defined, but that changes the function signature.

Sure, but that's a one line #ifdef?

> > We could alternatively handle this by special-casing INT_MIN, probably
> > resulting in a bit less duplicated code.
> >
> 
> Those special cases seem to do some weird stuff to the performance
> characteristics of those functions. I find my method to handle negative
> numbers to produce much faster code.

That's a bit odd.

> I experimented with finding the fastest way to convert an int to a string
> at the weekend, I did happen to be testing with int64's but likely int32
> will be the same.
> 
> This is the time taken to convert 30 into "30" 2 billion times.
> 
> The closest implementation I'm using in the patch is pg_int64tostr() one.
> The pg_int64tostr_memcpy() only performs better with bigger numbers /
> longer strings.
> 
> david@ubuntu:~/C$ gcc5.2 tostring.c -o tostring -O2
> david@ubuntu:~/C$ ./tostring
> pg_int64tostr_memcpy in 13.653252 seconds
> pg_int64tostr in 2.042616 seconds
> pg_int64tostr_new in 2.056688 seconds
> pg_lltoa in 13.604653 seconds
> 
> david@ubuntu:~/C$ clang tostring.c -o tostring_clang -O2
> david@ubuntu:~/C$ ./tostring_clang
> pg_int64tostr_memcpy in 0.04 seconds
> pg_int64tostr in 2.063335 seconds
> pg_int64tostr_new in 2.102068 seconds
> pg_lltoa in 3.424630 seconds

Are you sure this isn't an optimization artifact where the actual work
is optimized away? Doing 2 billion conversions in 2s kinda implies that
the string conversion is done in ~4 cycles. That seems unrealistically
fast, even for a pipelined cpu.
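
FWIW, a minimal standalone harness along these lines keeps the result live,
so the compiler cannot simply throw the conversion away (my_int64tostr is a
stand-in I made up for the sketch, not the pg_int64tostr from the patch):

#include <stdio.h>
#include <stdint.h>

static int
my_int64tostr(char *dst, int64_t value)
{
    return sprintf(dst, "%lld", (long long) value);
}

int
main(void)
{
    char        buf[32];
    uint64_t    sink = 0;

    for (int64_t i = 0; i < 2000000000LL; i++)
    {
        int len = my_int64tostr(buf, 30);

        sink += (uint64_t) len + (uint8_t) buf[0];      /* keep the result live */
    }
    printf("sink = %llu\n", (unsigned long long) sink); /* force use of sink */
    return 0;
}

Summing the returned length and the first output byte makes every call
observable; varying the input per iteration would additionally guard against
the call being hoisted out of the loop.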


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Bruce Momjian
On Thu, Sep  3, 2015 at 10:33:12AM +0200, Petr Jelinek wrote:
> On 2015-09-02 19:57, Josh Berkus wrote:
> > On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> >> On 2015-09-02 00:09, Josh Berkus wrote:
> >>> Not really, the mechanism is different and the behavior is different.
> >>> One critical deficiency in using binary syncrep is that you can't do
> >>> round-robin redundancy at all; every redundant node has to be an exact
> >>> mirror of another node.  In a good HA distributed system, you want
> >>> multiple shards per node, and you want each shard to be replicated to a
> >>> different node, so that in the event of node failure you're not dumping
> >>> the full load on one other server.
> >>>
> >>
> >> This assumes that we use binary replication, but we can reasonably use
> >> logical replication which can quite easily do filtering of what's
> >> replicated where.
> >
> > Is there a way to do logical synchronous replication?  I didn't think
> > there was.
> >
> 
> Yes, the logical replication has similar syncrep properties as the
> binary one (feedback works same way).

Yes, I assumed that.  Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too.  However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL.  This is true of binary and logical replication.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending clean up is not interruptable

2015-09-03 Thread Andres Freund
On 2015-09-03 12:45:34 +0900, Fujii Masao wrote:
> On Thu, Sep 3, 2015 at 2:15 AM, Andres Freund  wrote:
> > On 2015-08-12 11:59:48 -0700, Jeff Janes wrote:
> >> Attached patch does it that way.  There was also a free-standing
> >> CHECK_FOR_INTERRUPTS() which had no reason that I could see not be a
> >> vacuum_delay_point, so I changed that one as well.
> 
> -if (vac_delay)
> -vacuum_delay_point();
> +vacuum_delay_point();
> 
> If vac_delay is false, e.g., ginInsertCleanup() is called by the backend,
> vacuum_delay_point() should not be called. No?

No, that's the whole point of the change, we need a
CHECK_FOR_INTERRUPTS() even when called by backends. I personally think
it's rather ugly to rely on the one in vacuum_delay_point, but Jeff
and Tom think it's better, and I can live with that.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-03 Thread Etsuro Fujita

On 2015/09/03 14:22, Etsuro Fujita wrote:

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either.  Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.



Another idea about that code is to call that code in eg, ExecProcNode,
instead of calling ExecForeignScan there.  I think that that might be
much cleaner and resolve the naming problem below.


I gave it another thought; the following changes to ExecInitNode would 
make the patch much simpler, i.e., we would no longer need to call the new 
code in ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and 
ExecReScanForeignScan.  I think that would also resolve the naming problem.


*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
*************** ExecInitNode(Plan *node, EState *estate, int eflags)
*** 247,254 ****
  			break;
  
  		case T_ForeignScan:
! 			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 													   estate, eflags);
  			break;
  
  		case T_CustomScan:
--- 247,269 ----
  			break;
  
  		case T_ForeignScan:
! 			{
! 				Index		scanrelid = ((ForeignScan *) node)->scan.scanrelid;
! 
! 				if (estate->es_epqTuple != NULL && scanrelid == 0)
! 				{
! 					/*
! 					 * We are in foreign join inside an EvalPlanQual recheck.
! 					 * Initialize local join execution plan, instead.
! 					 */
! 					Plan	   *subplan = ((ForeignScan *) node)->fs_subplan;
! 
! 					result = ExecInitNode(subplan, estate, eflags);
! 				}
! 				else
! 					result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 															   estate, eflags);
! 			}
  			break;

Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-09-03 Thread Andres Freund
On 2015-08-23 17:46:36 +0200, Pavel Stehule wrote:
> here is the patch
> 
> It is really trivial - all heavy work was done done before.

This seems to entirely disregard the comments in
http://archives.postgresql.org/message-id/29030.1440030152%40sss.pgh.pa.us
about the fact that this approach only works for a few object types?

Also, for the umpteenth time: Start actually quoting in a sane manner.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Can pg_dump make use of CURRENT/SESSION_USER

2015-09-03 Thread Andres Freund
On 2015-08-25 22:01:51 +0900, Michael Paquier wrote:
> Seeing no activity in the last couple of months for this patch that
> had reviews, it is now marked as returned with feedback.

Fabrizio, after the above you moved the patch to the next commitfest,
without a new patch or additional discussion. Why? Dragging
patches through the commitfest just increases the work for a number of people
(this round, me) without any benefit, so that's really not a good idea.

I'm marking the patch as returned with feedback again. Once there's a
new patch we can deal with it in the appropriate commitfest.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-09-03 Thread Andres Freund
On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
> pgbench setup
> 
> scale factor - 300
> Data is on magnetic disk and WAL on ssd.
> pgbench -M prepared tpc-b
> 
> HEAD - commit 0e141c0f
> Patch-1 - increase_clog_bufs_v1
> 
> Client Count/Patch_ver     1     8     16     32     64    128    256
> HEAD                      911  5695   9886  18028  27851  28654  25714
> Patch-1                   954  5568   9898  18450  29313  31108  28213
> 
> 
> This data shows that there is an increase of ~5% at 64 client-count
> and 8~10% at more higher clients without degradation at lower client-
> count. In above data, there is some fluctuation seen at 8-client-count,
> but I attribute that to run-to-run variation, however if anybody has doubts
> I can again re-verify the data at lower client counts.

> Now if we try to further increase the number of CLOG buffers to 128,
> no improvement is seen.
> 
> I have also verified that this improvement can be seen only after the
> contention around ProcArrayLock is reduced.  Below is the data with
> Commit before the ProcArrayLock reduction patch.  Setup and test
> is same as mentioned for previous test.

The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is at which it starts to hurt.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

There are two reasons I'd like to see that: first, I'd like to avoid a
regression; second, I'd like to avoid having to bump the maximum number
of buffers in small increments after every hardware generation...

>  /*
>   * Number of shared CLOG buffers.
>   *
> - * Testing during the PostgreSQL 9.2 development cycle revealed that on a
> + * Testing during the PostgreSQL 9.6 development cycle revealed that on a
>   * large multi-processor system, it was possible to have more CLOG page
> - * requests in flight at one time than the number of CLOG buffers which 
> existed
> - * at that time, which was hardcoded to 8.  Further testing revealed that
> - * performance dropped off with more than 32 CLOG buffers, possibly because
> - * the linear buffer search algorithm doesn't scale well.
> + * requests in flight at one time than the number of CLOG buffers which
> + * existed at that time, which was 32 assuming there are enough 
> shared_buffers.
> + * Further testing revealed that either performance stayed same or dropped 
> off
> + * with more than 64 CLOG buffers, possibly because the linear buffer search
> + * algorithm doesn't scale well or some other locking bottlenecks in the
> + * system mask the improvement.
>   *
> - * Unconditionally increasing the number of CLOG buffers to 32 did not seem
> + * Unconditionally increasing the number of CLOG buffers to 64 did not seem
>   * like a good idea, because it would increase the minimum amount of shared
>   * memory required to start, which could be a problem for people running very
>   * small configurations.  The following formula seems to represent a 
> reasonable
>   * compromise: people with very low values for shared_buffers will get fewer
> - * CLOG buffers as well, and everyone else will get 32.
> + * CLOG buffers as well, and everyone else will get 64.
>   *
>   * It is likely that some further work will be needed here in future 
> releases;
>   * for example, on a 64-core server, the maximum number of CLOG requests that
>   * can be simultaneously in flight will be even larger.  But that will
>   * apparently require more than just changing the formula, so for now we take
> - * the easy way out.
> + * the easy way out.  It could also happen that after removing other locking
> + * bottlenecks, further increase in CLOG buffers can help, but that's not the
> + * case now.
>   */

I think the comment should be more drastically rephrased to not
reference individual versions and numbers.
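
For reference, a tiny standalone sketch of how a shared_buffers-based cap
plays out at a few sizes; the Min(32, Max(4, NBuffers / 512)) shape below is
quoted from memory and should be checked against clog.c, it is not text taken
from the patch:

#include <stdio.h>

#define Max(a,b) ((a) > (b) ? (a) : (b))
#define Min(a,b) ((a) < (b) ? (a) : (b))

/* assumed formula: cap at 'cap', floor of 4, otherwise NBuffers / 512 */
static int
clog_buffers(int nbuffers, int cap)
{
    return Min(cap, Max(4, nbuffers / 512));
}

int
main(void)
{
    /* shared_buffers in 8kB pages: 1MB, 8MB, 128MB, 1GB, 8GB */
    int settings[] = {128, 1024, 16384, 131072, 1048576};

    for (int i = 0; i < 5; i++)
        printf("shared_buffers = %7d pages: %2d buffers (cap 32) vs %2d (cap 64)\n",
               settings[i],
               clog_buffers(settings[i], 32),
               clog_buffers(settings[i], 64));
    return 0;
}

The cap only matters once shared_buffers passes roughly 128MB, which is why
small configurations are unaffected by raising it.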

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Tom Lane
Andres Freund  writes:
> On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
>> --- a/src/bin/pg_ctl/t/001_start_stop.pl
>> +++ b/src/bin/pg_ctl/t/001_start_stop.pl
>> @@ -35,6 +35,7 @@ close CONF;
>> command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> 'pg_ctl start -w');
>> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> -'second pg_ctl start succeeds');
>> +sleep 3;# bridge test_postmaster_connection() slop threshold
>> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> +'second pg_ctl start fails');
>> command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
>> 'pg_ctl stop -w');

> I'don't like adding a couple seconds of test runtime for the benefit of
> very slow platforms.

Me either.  This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.  I think we ought
to just fix it.  I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Amit Langote

On 9/3/2015 5:49 PM, Tatsuo Ishii wrote:
> 
> However I inserted data *after* creating index, the value is
> different.
> VACUUM;
> VACUUM
> SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) WHERE pages != 
> '(0,0)'::tid;
>  pages 
> ---
>  (2,1)
>  (2,2)
>  (2,3)
>  (2,4)
> (4 rows)
> 
> SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx');
>  itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  
> value   
> +++--+--+-+--
>   1 |  0 |  1 | f| f| f   | {1 .. 
> 28928}
>   2 |128 |  1 | f| f| f   | {28929 .. 
> 57856}
>   3 |256 |  1 | f| f| f   | {57857 .. 
> 86784}
>   4 |384 |  1 | f| f| f   | {1 .. 
> 10}
> (4 rows)
> ===
> 
> How the index value for block 384 could be {1 .. 10}?
> 

The summarization during VACUUM invokes IndexBuildHeapRangeScan() which is
passed scanStartBlock and scanNumBlocks. If scanStartBlock + scanNumBlocks >
heapTotalBlocks, further down the line, heapgettup() may start returning
tuples from the beginning given the following code in it:

  page++;
  if (page >= scan->rs_nblocks)
  page = 0;

  finished = (page == scan->rs_startblock) ||
   (scan->rs_numblocks != InvalidBlockNumber ?
 --scan->rs_numblocks == 0 :
 false);

Where finished indicates whether it thinks the end of heap is reached.

In this case, scan->rs_startblock is 384 set by IndexBuildHeapRangeScan()
using heap_setscanlimits(). One can imagine how the above heap finish
criteria might not work as expected.

That helps explain why 1 becomes the min for that brin tuple.
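
To make that concrete, here is a toy standalone re-implementation of just the
termination logic quoted above; the heap size, start block and pages-per-range
below are illustrative assumptions, not the values from Tatsuo's table:

#include <stdio.h>
#include <stdbool.h>

#define InvalidBlockNumber 0xFFFFFFFFu

int
main(void)
{
    unsigned    rs_nblocks = 400;       /* total heap pages (assumed) */
    unsigned    rs_startblock = 384;    /* set via heap_setscanlimits() */
    unsigned    rs_numblocks = 128;     /* pagesPerRange, overshoots the heap end */
    unsigned    page = rs_startblock;
    unsigned    lowest = page;
    unsigned    scanned = 0;
    bool        finished = false;

    while (!finished)
    {
        if (page < lowest)
            lowest = page;              /* this block's tuples feed the summary */
        scanned++;

        page++;
        if (page >= rs_nblocks)
            page = 0;                   /* wraps back to the start of the heap */

        finished = (page == rs_startblock) ||
            (rs_numblocks != InvalidBlockNumber ? --rs_numblocks == 0 : false);
    }

    printf("blocks scanned: %u, lowest block included: %u\n", scanned, lowest);
    return 0;
}

With these numbers the scan wraps at block 399 and keeps going from block 0
until the 128-block budget is exhausted, so blocks 0-111 end up summarized
into the last range, which is how a minimum of 1 gets in there.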

Attached hack fixes the symptom but perhaps not the correct fix for this.

Thanks,
Amit
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..9c7fafd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2268,7 +2268,11 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 
 	/* set our scan endpoints */
 	if (!allow_sync)
+	{
+		numblocks = (start_blockno + numblocks <= scan->rs_nblocks)
+		? numblocks : scan->rs_nblocks - start_blockno;
 		heap_setscanlimits(scan, start_blockno, numblocks);
+	}
 	else
 	{
 		/* syncscan can only be requested on whole relation */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending clean up is not interruptable

2015-09-03 Thread Fujii Masao
On Thu, Sep 3, 2015 at 7:18 PM, Andres Freund  wrote:
> On 2015-09-03 12:45:34 +0900, Fujii Masao wrote:
>> On Thu, Sep 3, 2015 at 2:15 AM, Andres Freund  wrote:
>> > On 2015-08-12 11:59:48 -0700, Jeff Janes wrote:
>> >> Attached patch does it that way.  There was also a free-standing
>> >> CHECK_FOR_INTERRUPTS() which had no reason that I could see not be a
>> >> vacuum_delay_point, so I changed that one as well.
>>
>> -if (vac_delay)
>> -vacuum_delay_point();
>> +vacuum_delay_point();
>>
>> If vac_delay is false, e.g., ginInsertCleanup() is called by the backend,
>> vacuum_delay_point() should not be called. No?
>
> No, that's the whole point of the change, we need a
> CHECK_FOR_INTERRUPTS() even when called by backends. I personally think
> it's rather ugly to rely on the the one in vacuum_delay_point,

Same here.

> but Jeff
> and Tom think it's better, and I can live with that.

OK, probably I can live with that, too.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Tatsuo Ishii
> The summarization during VACUUM invokes IndexBuildHeapRangeScan() which is
> passed scanStartBlock and scanNumBlocks. If scanStartBlock + scanNumBlocks
>> heapTotalBlocks, further down the line, heapgettup() may start returning
> tuples from the beginning given the following code in it:
> 
>   page++;
>   if (page >= scan->rs_nblocks)
>   page = 0;
> 
>   finished = (page == scan->rs_startblock) ||
>(scan->rs_numblocks != InvalidBlockNumber ?
>  --scan->rs_numblocks == 0 :
>  false);
> 
> Where finished indicates whether it thinks the end of heap is reached.
> 
> In this case, scan->rs_startblock is 384 set by IndexBuildHeapRangeScan()
> using heap_setscanlimits(). One can imagine how the above heap finish
> criteria might not work as expected.
> 
> That helps explain why 1 becomes the min for that brin tuple.
> 
> Attached hack fixes the symptom but perhaps not the correct fix for this.

Why can't we fix summarize_range() in brin.c:

IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
heapBlk, 
state->bs_pagesPerRange,
brinbuildCallback, 
(void *) state);

This currently passes state->bs_pagesPerRange as scanNumBlocks without any
clamping. Shouldn't we change this so that (scanStartBlock + scanNumBlocks)
does not exceed scan->rs_nblocks?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pg_upgrade fails when postgres/template1 isn't in default tablespace

2015-09-03 Thread Bruce Momjian
On Thu, Sep  3, 2015 at 01:03:30PM +0200, Andres Freund wrote:
> Hi Bruce,
> 
> Are you actually waiting for review in this thread? Or is the CF entry
> more of a reminder?

Oh, I have a commit-fest entry for this?  Well, whoever did that was
doing the right thing so things are not forgotten.  Yeah!  :-)

Anyway, I was going to work on this once I had read my email backlog,
but because of the commit-fest entry, I worked on it today instead.

I have developed the attached patch which fixes the pg_upgrade problem,
and the pg_dumpall problem with postgres and template1 in non-default
tablespaces.  I modified the pg_dumpall patch with the following changes:

*  It is a general pg_dumpall bug, not specific to binary upgrade mode,
so the new code will be used in non-binary upgrade mode too.

* I hard-coded the "connect template1" string, as it is a constant (no
need for %s and fmtId()).

*  You can't mix fprintf() and appendPQExpBuffer() and get the output in
the specified order, i.e. fprintf will come out first.  Now, in your
case, it didn't matter, but it isn't clean either.

*  I wanted the original database connection to be restored right after
the switch.

The new output looks like this:

\connect postgres
ALTER DATABASE template1 SET TABLESPACE tt;
\connect template1

Based on our previous policy, these are both bugs and cause either
errors or inaccurate restores, so I plan to apply the attached patch to
all back branches.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
new file mode 100644
index c4b6ae8..a597304
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
*** dumpCreateDB(PGconn *conn)
*** 1412,1417 
--- 1412,1435 
  
  			appendPQExpBufferStr(buf, ";\n");
  		}
+ 		else if (strcmp(dbtablespace, "pg_default") != 0 && !no_tablespaces)
+ 		{
+ 			/*
+ 			 * Cannot change tablespace of the database we're connected to.
+ 			 * To move "postgres" to another tablespace, we connect to
+ 			 * "template1" and vice versa.
+ 			 */
+ 			if (strcmp(dbname, "postgres") == 0)
+ appendPQExpBuffer(buf, "\\connect template1\n");
+ 			else
+ appendPQExpBuffer(buf, "\\connect postgres\n");
+ 
+ 			appendPQExpBuffer(buf, "ALTER DATABASE %s SET TABLESPACE %s;\n",
+ 			  fdbname, fmtId(dbtablespace));
+ 
+ 			/* connect to original database */
+ 			appendPQExpBuffer(buf, "\\connect %s\n", fdbname);
+ 		}
  
  		if (binary_upgrade)
  		{
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
new file mode 100644
index e158c9f..390fc62
*** a/src/bin/pg_upgrade/info.c
--- b/src/bin/pg_upgrade/info.c
*** create_rel_filename_map(const char *old_
*** 140,145 
--- 140,146 
  		const RelInfo *old_rel, const RelInfo *new_rel,
  		FileNameMap *map)
  {
+ 	/* Someday the old/new tablespaces might not match, so handle it. */
  	if (strlen(old_rel->tablespace) == 0)
  	{
  		/*
*** create_rel_filename_map(const char *old_
*** 147,162 
  		 * exist in the data directories.
  		 */
  		map->old_tablespace = old_data;
- 		map->new_tablespace = new_data;
  		map->old_tablespace_suffix = "/base";
- 		map->new_tablespace_suffix = "/base";
  	}
  	else
  	{
  		/* relation belongs to a tablespace, so use the tablespace location */
  		map->old_tablespace = old_rel->tablespace;
- 		map->new_tablespace = new_rel->tablespace;
  		map->old_tablespace_suffix = old_cluster.tablespace_suffix;
  		map->new_tablespace_suffix = new_cluster.tablespace_suffix;
  	}
  
--- 148,171 
  		 * exist in the data directories.
  		 */
  		map->old_tablespace = old_data;
  		map->old_tablespace_suffix = "/base";
  	}
  	else
  	{
  		/* relation belongs to a tablespace, so use the tablespace location */
  		map->old_tablespace = old_rel->tablespace;
  		map->old_tablespace_suffix = old_cluster.tablespace_suffix;
+ 	}
+ 
+ 	/* Do the same for new tablespaces */
+ 	if (strlen(new_rel->tablespace) == 0)
+ 	{
+ 		map->new_tablespace = new_data;
+ 		map->new_tablespace_suffix = "/base";
+ 	}
+ 	else
+ 	{
+ 		map->new_tablespace = new_rel->tablespace;
  		map->new_tablespace_suffix = new_cluster.tablespace_suffix;
  	}
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Michael Paquier
On Fri, Sep 4, 2015 at 6:51 AM, Tom Lane  wrote:

> I experimented with the attached patch.  It seems to work to stop the
> crash Michael exhibited (I've not tried to duplicate Jim's original
> example, though).  However, it also causes a regression test failure,
> because transactions.sql does this:
>

Neat patch; it would have taken me longer to figure that out... I have
tried with Jim's test case and the patch is working.


> which is exactly the case we're trying to reject now.  So that fills
> me with fear that this approach would break existing applications.
> On the other hand, I do not really see a good alternative :-(.
>

This behavior has been present since 2004 with fe548629, so that's a real concern
to me, especially if there are drivers around relying on it.
There are, for example, some code patterns around Postgres ODBC that could be
impacted; I am not sure which ones, but I guess that some internal error handling
is not going to like that.

> I thought about trying to detect whether the Portal actually had any
> references to "new in subtransaction" objects to decide whether to
> kill it or not, but that seems awfully fragile.
>
> Ideas?
>

Yes. This diff on top of your patch:
@@ -922,8 +922,7 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 				 * must be forced into FAILED state, for the same reasons
 				 * discussed below.
 				 */
-				if (portal->status == PORTAL_READY ||
-					portal->status == PORTAL_ACTIVE)
+				if (portal->status == PORTAL_ACTIVE)
 					MarkPortalFailed(portal);
This way, only the active portals are marked as failed. The regression tests
that fail with your patch applied visibly do not activate the portal
they use, just marking it as ready to be used. This seems to be the safest
approach to me on stable branches as well as on master; this way we are
sure that resources of the failed subxacts are cleaned up correctly and
that existing applications are not impacted.

I am attaching a new patch with what I changed and a test case based on my
previous one.
Regards,
-- 
Michael
diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 2794537..327b2a5 100644
--- a/src/backend/commands/portalcmds.c
+++ b/src/backend/commands/portalcmds.c
@@ -335,11 +335,7 @@ PersistHoldablePortal(Portal portal)
 	/*
 	 * Check for improper portal use, and mark portal active.
 	 */
-	if (portal->status != PORTAL_READY)
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("portal \"%s\" cannot be run", portal->name)));
-	portal->status = PORTAL_ACTIVE;
+	MarkPortalActive(portal);
 
 	/*
 	 * Set up global portal context pointers.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0df86a2 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -734,11 +734,7 @@ PortalRun(Portal portal, long count, bool isTopLevel,
 	/*
 	 * Check for improper portal use, and mark portal active.
 	 */
-	if (portal->status != PORTAL_READY)
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("portal \"%s\" cannot be run", portal->name)));
-	portal->status = PORTAL_ACTIVE;
+	MarkPortalActive(portal);
 
 	/*
 	 * Set up global portal context pointers.
@@ -1398,11 +1394,7 @@ PortalRunFetch(Portal portal,
 	/*
 	 * Check for improper portal use, and mark portal active.
 	 */
-	if (portal->status != PORTAL_READY)
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("portal \"%s\" cannot be run", portal->name)));
-	portal->status = PORTAL_ACTIVE;
+	MarkPortalActive(portal);
 
 	/*
 	 * Set up global portal context pointers.
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 7530498..639c2b3 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -232,6 +232,7 @@ CreatePortal(const char *name, bool allowDup, bool dupSilent)
 	portal->status = PORTAL_NEW;
 	portal->cleanup = PortalCleanup;
 	portal->createSubid = GetCurrentSubTransactionId();
+	portal->activeSubid = portal->createSubid;
 	portal->strategy = PORTAL_MULTI_QUERY;
 	portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
 	portal->atStart = true;
@@ -403,6 +404,25 @@ UnpinPortal(Portal portal)
 }
 
 /*
+ * MarkPortalActive
+ *		Transition a portal from READY to ACTIVE state.
+ *
+ * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+ */
+void
+MarkPortalActive(Portal portal)
+{
+	/* For safety, this is a runtime test not just an Assert */
+	if (portal->status != PORTAL_READY)
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("portal \"%s\" cannot be run", portal->name)));
+	/* Perform the state transition */
+	

Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-03 Thread David Rowley
On 3 September 2015 at 16:50, Peter Geoghegan  wrote:

> On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
>  wrote:
> > Peter, would you be able to share the test case which you saw the speedup
> > on. So far I've been unable to see much of an improvement.
>
> The case I tested was an internal sort CREATE INDEX. I don't recall
> the exact details, but testing showed it to be a fairly robust
> speedup. It was not a very brief CREATE INDEX operation, or a very
> lengthily one. trace_sort output made it quite visible that there was
> a significant saving after the sort is "performed", but before it is
> "done". It wasn't hard to see an improvement on a variety of other
> cases, although the Intel vTune tool made the difference particularly
> obvious.
>
> The only thing that definitely won't be helped is pass-by-value datum
> sort cases. In case it matters, I used GCC 4.8.
>

My test cases are:

set work_mem ='1GB';
create table t1 as select md5(random()::text) from
generate_series(1,1000);

Times are in milliseconds. Median and average over 10 runs.

Test 1
select count(distinct md5) from t1;

Master Patched
Median 10,965.77 10,986.30 (99.81%)
Average 10,983.63 11,013.55 (99.73%)

Test 2
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

Master Patched
Median 12,499.03 12,465.21 (100.27%)
Average 12,504.87 12,468.91 (100.29%)

Test 3
create index t1_md5_idx on t1(md5);
Master Patched
Median 12,981.47 12,888.11 (100.72%)
Average 13,416.23 13,249.32 (101.26%)

gcc version 4.8.3
Intel(R) Xeon(R) CPU E5-2630 v3

Are you seeing any speedup from any of these on your hardware?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Small patch to fix an O(N^2) problem in foreign keys

2015-09-03 Thread Jan Wieck

Attached is a small patch and a script to reproduce the issue.

The problem is a cache introduced in commit 45ba4247 that improves 
foreign key lookups during bulk updates when the FK value does not 
change. When restoring a schema dump from a database with many (say 
100,000) foreign keys, this cache grows very big and every ALTER 
TABLE command causes an InvalidateConstraintCacheCallBack(), which 
does a sequential hash table scan.


The patch uses a heuristic method of detecting when the hash table 
should be destroyed and recreated. InvalidateConstraintCacheCallBack() 
adds the current size of the hash table to a counter. When that sum 
reaches 1,000,000, the hash table is flushed. This improves the schema 
restore of a database with 100,000 foreign keys by a factor of 3.
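
To illustrate why the heuristic bounds the work, here is a toy standalone
model of the callback cost with and without the flush; only the 100,000
constraints and the 1,000,000 threshold are taken from above, everything
else is a deliberate simplification of what ri_triggers.c really does:

#include <stdio.h>

int
main(void)
{
    const long long nconstraints = 100000;
    const long long threshold = 1000000;
    long long   cache_size = 0;
    long long   seq_count = 0;
    long long   scanned_without_flush = 0;
    long long   scanned_with_flush = 0;

    for (long long i = 0; i < nconstraints; i++)
    {
        cache_size++;                       /* one more constraint cached */

        /* without the heuristic, the callback scans every current entry */
        scanned_without_flush += i + 1;

        /* with the heuristic, the whole cache is dropped past the threshold */
        scanned_with_flush += cache_size;
        seq_count += cache_size;
        if (seq_count > threshold)
        {
            cache_size = 0;
            seq_count = 0;
        }
    }

    printf("entries visited without flush: %lld\n", scanned_without_flush);
    printf("entries visited with flush:    %lld\n", scanned_with_flush);
    return 0;
}

In this model the total number of cache entries visited drops from about
5 billion to well under 100 million, which is the same O(N^2)-to-roughly-linear
effect the patch is after.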


According to my tests, the patch does not interfere with the bulk 
updates that the original feature was supposed to improve.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 61edde9..d7023ce 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -183,6 +183,7 @@ typedef struct RI_CompareHashEntry
  * --
  */
 static HTAB *ri_constraint_cache = NULL;
+static long  ri_constraint_cache_seq_count = 0;
 static HTAB *ri_query_cache = NULL;
 static HTAB *ri_compare_cache = NULL;
 
@@ -2945,6 +2946,27 @@ InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
 
 	Assert(ri_constraint_cache != NULL);
 
+	/*
+	 * Prevent an O(N^2) problem when creating large amounts of foreign
+	 * key constraints with ALTER TABLE, like it happens at the end of
+	 * a pg_dump with hundred-thousands of tables having references.
+	 */
+	ri_constraint_cache_seq_count += hash_get_num_entries(ri_constraint_cache);
+	if (ri_constraint_cache_seq_count > 100)
+	{
+		HASHCTL		ctl;
+
+		hash_destroy(ri_constraint_cache);
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(RI_ConstraintInfo);
+		ri_constraint_cache = hash_create("RI constraint cache",
+		  RI_INIT_CONSTRAINTHASHSIZE,
+		  &ctl, HASH_ELEM | HASH_BLOBS);
+		ri_constraint_cache_seq_count = 0;
+		return;
+	}
+
 	hash_seq_init(&status, ri_constraint_cache);
 	while ((hentry = (RI_ConstraintInfo *) hash_seq_search(&status)) != NULL)
 	{


t1.sh
Description: application/shellscript

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Gavin Flower

On 04/09/15 12:11, Bruce Momjian wrote:

On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:

I don't understand.  I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically.  For that, you still need a
superuser to run CREATE EXTENSION.


+! for this

OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
but it looks interesting.)

It obviously signifies a Good Move that involved a check - at least, 
that is what it would mean when annotating a Chess Game!  :-)



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Bruce Momjian
On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
> >I don't understand.  I'm just proposing that the source code for the
> >extension to live in src/extensions/, and have the shared library
> >installed by toplevel make install; I'm not suggesting that the
> >extension is installed automatically.  For that, you still need a
> >superuser to run CREATE EXTENSION.
> >
> 
> +! for this

OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
but it looks interesting.)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-03 Thread Josh Berkus
On 09/03/2015 05:11 PM, Bruce Momjian wrote:
> On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
>>> I don't understand.  I'm just proposing that the source code for the
>>> extension to live in src/extensions/, and have the shared library
>>> installed by toplevel make install; I'm not suggesting that the
>>> extension is installed automatically.  For that, you still need a
>>> superuser to run CREATE EXTENSION.
>>>
>>
>> +! for this
> 
> OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
> but it looks interesting.)

Add the next factorial value?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Tatsuo Ishii
> The summarization during VACUUM invokes IndexBuildHeapRangeScan() which is
> passed scanStartBlock and scanNumBlocks. If scanStartBlock + scanNumBlocks
>> heapTotalBlocks, further down the line, heapgettup() may start returning
> tuples from the beginning given the following code in it:
> 
>   page++;
>   if (page >= scan->rs_nblocks)
>   page = 0;
> 
>   finished = (page == scan->rs_startblock) ||
>(scan->rs_numblocks != InvalidBlockNumber ?
>  --scan->rs_numblocks == 0 :
>  false);
> 
> Where finished indicates whether it thinks the end of heap is reached.
> 
> In this case, scan->rs_startblock is 384 set by IndexBuildHeapRangeScan()
> using heap_setscanlimits(). One can imagine how the above heap finish
> criteria might not work as expected.

What scares me is:

1) the bug will not be found unless someone inspects the internal data
   of BRIN. A regression test is useless here.

2) the bug effectively causes vacuum to scan the heap *twice*, which
   will produce lots of I/O if the heap is not small.

3) I wonder whether any other index type suffers from this type of bug.

I vaguely recall that a more internally-inspecting BRIN regression
test was removed because it depends on contrib modules. I am not sure
whether that decision was appropriate or not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is this a bug?

2015-09-03 Thread Michael Paquier
On Fri, Sep 4, 2015 at 3:52 AM, Alvaro Herrera 
wrote:

> Robert Haas wrote:
> > On Wed, Aug 26, 2015 at 3:31 PM, Alvaro Herrera
> >  wrote:
> > > Fabrízio de Royes Mello wrote:
> > >
> > >> Why this patch was reverted one day after applied [1]? I didn't see
> any
> > >> discussion around it.
> > >
> > > Contributors whose patches are getting committed should really
> subscribe
> > > to pgsql-committers.
> >
> > I would have thought discussion of committed patches should be moved
> > to -hackers.
>
> I agree, but it happens anyway quite frequently.  Since recently, I make
> a point of changing the CC from -committers to -hackers, but due to
> force of habit I often forget.
>

Noted. I usually don't do that.
-- 
Michael


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Sep 4, 2015 at 6:51 AM, Tom Lane  wrote:
>> Ideas?

> Yes. This diff on top of your patch:
> @@ -922,8 +922,7 @@ AtSubAbort_Portals(SubTransactionId mySubid,
>  * must be forced into FAILED state, for
> the same reasons
>  * discussed below.
>  */
> -   if (portal->status == PORTAL_READY ||
> -   portal->status == PORTAL_ACTIVE)
> +   if (portal->status == PORTAL_ACTIVE)
> MarkPortalFailed(portal);

> This way only the active portals are marked as failed.

Hmm.  I am not feeling especially comfortable about this: it's not clear
that there's anything preventing a suspended portal from containing a
dangerous reference.  However, a quick trial did not show that it was
possible to break it --- in the cases I tried, we detected that a cached
plan was no longer valid, tried to rebuild it, and noticed the missing
object at that point.  So maybe it's OK.

On reflection I think that the tracking of activeSubid in my patch is
probably overkill if we're attacking it this way.  We can just have
AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level,
which is pretty analogous to what AtAbort_Portals has done for a long
time.

Let me work on this and see if I can get to a simpler patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-03 Thread Shulgin, Oleksandr
On Sep 3, 2015 10:14 PM, "Pavel Stehule"  wrote:
>>>
>>> Please find attached a v3.
>>>
>>> It uses a shared memory queue and also has the ability to capture plans
nested deeply in the call stack.  Not sure about using the executor hook,
since this is not an extension...
>>>
>>> The LWLock is used around initializing/cleaning the shared struct and
the message queue, the IO synchronization is handled by the message queue
itself.
>>
>> I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS
in one time can be executed per server - I remember lot of queries that
doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be
unfriendly. Cannot to say if it is good enough for first iteration. This is
functionality that can be used for diagnostic when you have overloaded
server and this risk looks too high (for me). The idea of receive slot can
to solve this risk well (and can be used elsewhere). The difference from
this code should not be too big - although it is not trivial - needs work
with PGPROC. The opinion of our multiprocess experts can be interesting.
Maybe I am too careful.

Sorry, but I still don't see how the slots help this issue - could you
please elaborate?

>> Other smaller issues:
>>
>> * probably sending line by line is useless - shm_mq_send can pass bigger
data when nowait = false

I'm not sending it like that because of the message size - I just find it
more convenient. If you think it can be problematic, it's easy to do this as
before, by splitting lines on the receiving side.

>> * pg_usleep(1000L); - it is related to single point resource

But not a highly concurrent one.

-
Alex


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-03 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 2, 2015 at 1:47 PM, Tom Lane  wrote:
>> Offhand I think that the
>> most likely way to build that text will be to examine the query's jointree
>> to see where c,d,e appear in it.  But in any case, that's a separate issue
>> and I fail to see how plopping the join search problem into the FDW's lap
>> would make it any easier.

> Yeah, I am not advocating for putting the hook in
> standard_join_search.  I'm explaining why I put it in
> add_paths_to_joinrel instead of, as I believe you were advocating, in
> make_join_rel prior to the big switch.

If you had a solution to the how-to-build-the-query-text problem,
and it depended on that hook placement, then your argument might
make some sense.  As is, you've entirely failed to convince me
that this placement is not wrong, wasteful, and likely to create
unnecessary API breaks for FDWs.

(Also, per my last message on the subject, *after* the switch
is what I think makes sense.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Michael Paquier
On Fri, Sep 4, 2015 at 12:04 PM, Tom Lane  wrote:

> I wrote:
> > Hmm.  I am not feeling especially comfortable about this: it's not clear
> > that there's anything preventing a suspended portal from containing a
> > dangerous reference.  However, a quick trial did not show that it was
> > possible to break it --- in the cases I tried, we detected that a cached
> > plan was no longer valid, tried to rebuild it, and noticed the missing
> > object at that point.  So maybe it's OK.
>
> After further thought I concluded that this probably is safe.  The
> portal's original query was created and planned when it was opened,
> so it cannot contain any references to objects of the failed
> subtransaction.  We have a hazard from queries within functions,
> but if the portal is suspended then no such functions are in progress.
>
> Attached is an updated patch with comments to this effect and some
> minor other code cleanup (mainly, not assuming that CurrentResourceOwner
> is the right thing to use in AtSubAbort_Portals).
>

Thanks! This looks good to me.
-- 
Michael


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Amit Langote
On 9/4/2015 12:01 PM, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 9/4/2015 9:22 AM, Tatsuo Ishii wrote:
>>>
>>> 3) I wonder if other index type is suffered by this type of bug.
>>>
>>
>> About 3, it seems unlikely. Both the IndexBuildHeapRangeScan() and
>> heap_setscanlimits() were introduced by the BRIN patch and I didn't find
>> anything else using it, yet.
> 
> As I recall, there's some patch that Robert Haas or Amit Kapila that
> wants to use that function.  Maybe something in the parallel seqscan
> patch series?

I just checked Amit Kapila's latest[1] patch, and found they have since
abandoned the heap_setscanlimits() approach.

Thanks,
Amit

[1]
http://www.postgresql.org/message-id/caa4ek1j6sqoeakxidoxp7kqx2_d-xnspx4rq4ao17p+sqky...@mail.gmail.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-03 Thread Shulgin, Oleksandr
On Sep 3, 2015 7:30 PM, "Robert Haas"  wrote:
>
> All of these objections seem pretty thin to me.  I'd accept any of
> them as a reason for preferring one alternative over another, but I
> don't accept that the presence of a few problems of this magnitude
> means we should give up on the feature.  It's a good enough feature
> that it is worth the possibility of slightly inconveniencing someone
> running in an unusual configuration.

I give up.

Though I still don't see any compelling reason for this to be in libpq
itself. By the way, what about mixing conninfo strings and URIs - should this
not be allowed?

-
Alex


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-03 Thread Pavel Stehule
2015-09-04 5:50 GMT+02:00 Shulgin, Oleksandr :

> On Sep 3, 2015 10:14 PM, "Pavel Stehule"  wrote:
> >>>
> >>> Please find attached a v3.
> >>>
> >>> It uses a shared memory queue and also has the ability to capture
> plans nested deeply in the call stack.  Not sure about using the executor
> hook, since this is not an extension...
> >>>
> >>> The LWLock is used around initializing/cleaning the shared struct and
> the message queue, the IO synchronization is handled by the message queue
> itself.
> >>
> >> I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS
> in one time can be executed per server - I remember lot of queries that
> doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be
> unfriendly. Cannot to say if it is good enough for first iteration. This is
> functionality that can be used for diagnostic when you have overloaded
> server and this risk looks too high (for me). The idea of receive slot can
> to solve this risk well (and can be used elsewhere). The difference from
> this code should not be too big - although it is not trivial - needs work
> with PGPROC. The opinion of our multiprocess experts can be interesting.
> Maybe I am too careful.
>
> Sorry, but I still don't see how the slots help this issue - could you
> please elaborate?
>
With a slot (or something similar) there is no globally locked resource. If I
have time at the weekend, I'll try to write a prototype.


> >> Other smaller issues:
> >>
> >> * probably sending line by line is useless - shm_mq_send can pass
> bigger data when nowait = false
>
> I'm not sending it like that because of the message size - I just find it
> more convenient. If you think it can be problematic, its easy to do this as
> before, by splitting lines on the receiving side.
>
Yes, the shm queue sends data immediately - so slicing on the sender side
generates more interprocess communication.


> >> * pg_usleep(1000L); - it is related to single point resource
>
> But not a highly concurrent one.
>
I believe it is not necessary - the waiting (sleeping) can be done deeper,
when reading from the queue - the code will be cleaner.

Regard

Pavel


> -
> Alex
>


Re: [HACKERS] proposal: function parse_ident

2015-09-03 Thread Pavel Stehule
2015-09-03 13:11 GMT+02:00 Andres Freund :

> On 2015-08-23 17:46:36 +0200, Pavel Stehule wrote:
> > here is the patch
> >
> > It is really trivial - all heavy work was done done before.
>
> This seems to entirely disregard the comments in
> http://archives.postgresql.org/message-id/29030.1440030152%40sss.pgh.pa.us
> about the fact that this approach only works for a few object types?
>
>
The algorithm for parsing identifiers is the same - the differences are in
the names of the levels and in the ending symbols.

I don't want to write a totally generic parser - moreover, without access to
the system catalog or without an external hint, you cannot decide whether an
identifier is schema.table or table.column. But the rules for parsing are
exactly the same.

The function can be redesigned a little bit:

FUNCTION parse_ident(OUT level1 text,OUT level2 text,OUT level3 text,OUT
specific text)

so it can parse a function reference like myschema.myfunc(xx int):

level1: NULL
level2: myschema
level3: myfunc
specific: (xx int)

Is it acceptable?

Regards

Pavel





> Also, for the umpteenst time: Start actually quoting in a sane manner.
>

> Greetings,
>
> Andres Freund
>


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Tatsuo Ishii
> On 9/4/2015 8:28 AM, Tatsuo Ishii wrote:
>>>
>>> Attached hack fixes the symptom but perhaps not the correct fix for this.
>> 
>> Why can't we fix summarize_range() in brin.c:
>> 
>>  IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
>>  heapBlk, 
>> state->bs_pagesPerRange,
>>  brinbuildCallback, 
>> (void *) state);
>> 
>> This currently thoughtlessly passes scannumblocks as
>> state->bs_pagesPerRange. Shouldn't we change this so that
>> (scanStartBlock + scanNumBlocks) does not exceed scan->rs_nblocks?
>> 
> 
> Ah, it did cross my mind to the fix it in brin.c but was not sure. I did
> it that way in the attached patch.

Amit,

I have looked into your patch and am a little bit worried because it
calls RelationGetNumberOfBlocks(), which is an expensive function.
I think there are two ways to avoid it.

1) fall back to your former patch. However, it may break existing
   behavior, as you said. I think there's very little chance that
   actually happens, because IndexBuildHeapRangeScan() is only used by
   brin (I confirmed this). But Alvaro said some patches may be its
   customers. Robert, Amit, can you please confirm this?

2) change the signature of IndexBuildHeapRangeScan() so that it can be
   taught to limit the target block number to not exceed the last
   page of the relation. Again, this may break someone's patch and needs
   confirmation.

What do you think?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Amit Langote
On 9/4/2015 1:33 PM, Tatsuo Ishii wrote:
>> Ah, it did cross my mind to fix it in brin.c but was not sure. I did
>> it that way in the attached patch.
> 
> Amit,
> 
> I have looked into your patch and am a little bit worried because it
> calls RelationGetNumberOfBlocks() which is an expensive function.
> I think there are two ways to avoid it.
> 
> 1) Fall back to your former patch. However, it may break existing
>    behavior, as you said. I think there's very little chance that
>    actually happens, because IndexBuildHeapRangeScan() is only used by
>    brin (I confirmed this). But Alvaro said some pending patches may be
>    its customers. Robert, Amit, can you please confirm this?
> 
> 2) Change the signature of IndexBuildHeapRangeScan() so that it can be
>    taught to limit the target block number so it does not exceed the
>    last page of the relation. Again, this may break someone's patch and
>    needs confirmation.
> 
> What do you think?

One thing I imagine we could do is to change the signature of
summarize_range() to also include heapNumBlks, which its (only) caller
brinsummarize() already computes. It would look like:

static void summarize_range(IndexInfo *indexInfo, BrinBuildState *state,
                            Relation heapRel,
                            BlockNumber heapBlk,
                            BlockNumber heapNumBlks);

I'd think changing summarize_range()'s signature would be relatively
easier/safer.
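
For illustration, a rough sketch of how summarize_range() could then clamp
the range using that extra argument (just a sketch of the idea discussed
here, not an actual patch):

  static void
  summarize_range(IndexInfo *indexInfo, BrinBuildState *state,
                  Relation heapRel, BlockNumber heapBlk,
                  BlockNumber heapNumBlks)
  {
      /* never ask the scan to go past the last block of the heap */
      BlockNumber scanNumBlks = Min(state->bs_pagesPerRange,
                                    heapNumBlks - heapBlk);

      IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
                              heapBlk, scanNumBlks,
                              brinbuildCallback, (void *) state);

      /* ... rest of the function unchanged ... */
  }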

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN INDEX value

2015-09-03 Thread Tatsuo Ishii
> One thing I imagine we could do is to change the signature of
> summarize_range() to also include heapNumBlks, which its (only) caller
> brinsummarize() already computes. It would look like:
> 
> static void summarize_range(IndexInfo *indexInfo, BrinBuildState *state,
>                             Relation heapRel,
>                             BlockNumber heapBlk,
>                             BlockNumber heapNumBlks);
> 
> I'd think changing summarize_range()'s signature would be relatively
> easier/safer.

Yeah, sounds good.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow backslash-continuations in custom scripts

2015-09-03 Thread Kyotaro HORIGUCHI
Hello, thank you for registering this in CF-Sep.
I missed the registration window - it ended earlier than usual.

Most of my troubles are gone and I'll be back next week.

> The remaining work is eliminating the double format of the Command
> struct.

This is now done as an additional fourth patch, not merged into the
previous ones, to show what has changed in the way commands are
stored.

I am reposting on this thread the new version of this patch set,
including this change and what was posted before. It is rebased onto
current master.

The changes in behavior brought by this patch were described in the
previous mail as follows:

> Hmm. psqlscan.l handles multi-statements naturally.
> I worked on that and the attached patch set does:
> 
>  - backslash continuation for pgbench metacommands.
> 
>    set variable \
>
> 
>  - SQL statement natural continuation lines.
> 
>    SELECT :foo
>      FROM :bar;
> 
>  - SQL multi-statement.
> 
>    SELECT 1; SELECT 2;

Each of the four patches does the following things:

1. 0001-Prepare-to-share-psqlscan-with-pgbench.patch

  The global variable pset, VariableSpace and psql's backslash syntax
  are the obstacles that keep psqlscan.l from being used by pgbench.
  This patch eliminates direct references to pset, masks the
  VariableSpace feature (which looks ugly...), and allows the backslash
  syntax in psqlscan.l to be hidden from code outside psql by defining
  the symbol OUTSIDE_PSQL.

  No behavioral changes to psql are introduced by this patch.

2. 0002-Make-use-of-psqlscan-for-parsing-of-custom-script.patch

  This is the core of the patch set; it makes pgbench use psqlscan.l
  and enables multi-statements, multi-line SQL statements and backslash
  continuation of metacommands.

  The struct Command is modified so that it can be chained, in order to
  convey several statements from one line (see the sketch after this
  list). But the commands are still stored in an array of Command just
  outside process_commands, as before. This double format will be
  removed by the fourth patch.

  psqlscan.c is compiled as a part of mainloop.c, for reasons described
  at the end of that file. I haven't confirmed that the same
  considerations apply to pgbench, but I did the same thing for
  pgbench.c.

  Compilation will fail on Windows as of this patch.

3. 0003-Change-MSVC-Build-script.patch

  Changes the build script for the Windows platform. It certainly
  works, but might look a bit odd because of the anomaly in the way
  psqlscan.l is compiled.

4. 0004-Change-the-way-to-hold-command-list.patch

  Changes the way the command list is held from an array to a linked
  list, to remove the double format of the Command list introduced by
  the second patch. As a side effect, this removes the explicit limit
  on the number of commands in a script.
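
A rough sketch of the chained Command form described in patches 2 and 4
(the field names here are illustrative assumptions, not the actual patch):

  /* Sketch only: one input line may parse into several chained Commands. */
  typedef struct Command
  {
      char           *line;       /* raw text of the command */
      int             type;       /* SQL_COMMAND or META_COMMAND */
      int             argc;       /* number of arguments (metacommands) */
      char           *argv[10];   /* the arguments themselves */
      struct Command *next;       /* next command parsed from the same line */
  } Command;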


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 6842b81ba19f779244e4cad3ab3ba69db537b3ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 23 Jul 2015 20:44:37 +0900
Subject: [PATCH 1/4] Prepare to share psqlscan with pgbench.

Eliminate direct usage of pset variables and enable parts unnecessary
for other than psql to be disabled by defining OUTSIDE_PSQL.
---
 src/bin/psql/mainloop.c |  6 ++--
 src/bin/psql/psqlscan.h | 14 +
 src/bin/psql/psqlscan.l | 79 -
 src/bin/psql/startup.c  |  4 +--
 4 files changed, 67 insertions(+), 36 deletions(-)

diff --git a/src/bin/psql/mainloop.c b/src/bin/psql/mainloop.c
index b6cef94..e98cb94 100644
--- a/src/bin/psql/mainloop.c
+++ b/src/bin/psql/mainloop.c
@@ -233,7 +233,8 @@ MainLoop(FILE *source)
 		/*
 		 * Parse line, looking for command separators.
 		 */
-		psql_scan_setup(scan_state, line, strlen(line));
+		psql_scan_setup(scan_state, line, strlen(line),
+		pset.db, pset.vars, pset.encoding);
 		success = true;
 		line_saved_in_history = false;
 
@@ -373,7 +374,8 @@ MainLoop(FILE *source)
 	resetPQExpBuffer(query_buf);
 	/* reset parsing state since we are rescanning whole line */
 	psql_scan_reset(scan_state);
-	psql_scan_setup(scan_state, line, strlen(line));
+	psql_scan_setup(scan_state, line, strlen(line),
+	pset.db, pset.vars, pset.encoding);
 	line_saved_in_history = false;
 	prompt_status = PROMPT_READY;
 }
diff --git a/src/bin/psql/psqlscan.h b/src/bin/psql/psqlscan.h
index 55070ca..4bf8dcb 100644
--- a/src/bin/psql/psqlscan.h
+++ b/src/bin/psql/psqlscan.h
@@ -11,7 +11,11 @@
 #include "pqexpbuffer.h"
 
 #include "prompt.h"
-
+#if !defined OUTSIDE_PSQL
+#include "variables.h"
+#else
+typedef int * VariableSpace;
+#endif
 
 /* Abstract type for lexer's internal state */
 typedef struct PsqlScanStateData *PsqlScanState;
@@ -36,12 +40,11 @@ enum slash_option_type
 	OT_NO_EVAL	/* no expansion of backticks or variables */
 };
 
-
 extern PsqlScanState psql_scan_create(void);
 extern void psql_scan_destroy(PsqlScanState state);
 
-extern void psql_scan_setup(PsqlScanState state,
-const char *line, int line_len);

Re: [HACKERS] Allow replication roles to use file access functions

2015-09-03 Thread Michael Paquier
On Thu, Sep 3, 2015 at 9:53 PM, Stephen Frost  wrote:
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> On Thu, Sep 3, 2015 at 11:20 AM, Stephen Frost wrote:
>> >> Not only, +clog, configuration files, etc.
>> >
>> > Configuration files?  Perhaps you could elaborate?
>>
>> Sure. Sorry for being unclear. It copies everything that is not a
>> relation file, a kind of base backup without the relation files then.
>
> How does that work on systems where the configuration files aren't
> stored under PGDATA (Debian and derivatives, at least)?

When a file is outside PGDATA, it is not fetched. Symlinks inside PGDATA
have their targets' contents fetched as well, if I recall correctly.

> I guess I don't
> quite see why it's necessary for pg_rewind to copy the configuration
> files in the first place; it doesn't have the same role as
> pg_basebackup, at least as I understand it.

Of course, it's not mandatory to fetch them. It is also not worth the
complication of applying a filter to skip a portion of the files, and
I think that's why Heikki took the approach of fetching everything in
PGDATA (except relation files): it was simply easier to implement, and
filtering would have brought little gain.

>> I guess that what you are suggesting instead is an approach where the
>> caller sends something like this through the replication protocol, with
>> a relation OID and a block list:
>> BLOCK_DIFF relation_oid BLOCK_LIST m,n,[o, ...]
>
> Right, something along those lines is what I had been thinking.  We
> would probably need to provide independent commands for the different
> file types, with the parameters expressed in terms appropriate for each
> kind of file (block numbers for heap, XIDs for WAL and CLOG?).
> Essentially, whatever API would be both simple for pg_rewind and general
> enough to be useful for other clients in the future.  At least, I
> imagine that pg_rewind would be a bit simpler if it could communicate
> with the backend in the 'language of PG' rather than having to specify
> file names and paths.

I guess that makes sense if we want to remove the superuser-only
barrier; still, this would require inventing new commands each time a
new file type is added, as such a file may be needed in the rewound
node as well (imagine a pg_clog2, for example). I would rather take
the other approach of applying an exclude list to an existing command,
rather than an include list or a new set of commands.

> Other clients that might find such an interface useful are incremental
> pg_basebackup or possibly parallel pg_basebackup.

Possible.

>> We would also need to extend BASE_BACKUP so that it does not include
>> relation files for this use case, though.
>
> ... huh?  I'm not following this comment at all.  We might need to
> provide explicit start/stop backup commands and/or extend BASE_BACKUP
> for things like parallel pg_basebackup, but I'm not following why we
> would need to change it for pg_rewind.  Further BASE_BACKUP clearly does
> include relation files today..

The whole point of pg_rewind is not to have to fetch relation files,
so my idea would basically be to extend BASE_BACKUP a bit so that it
accepts a set of client-supplied regular expressions used to filter
out files that should not be included in the base backup.
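
Just to make the shape of that concrete, a client could then ask for a
filtered backup along these lines - the EXCLUDE option and its regex syntax
are purely hypothetical, sketching the extension discussed above, not an
existing command form:

  #include <libpq-fe.h>

  /* Hypothetical only: BASE_BACKUP currently has no EXCLUDE option. */
  static PGresult *
  request_filtered_basebackup(PGconn *conn)
  {
      return PQexec(conn,
                    "BASE_BACKUP LABEL 'pg_rewind' "
                    "EXCLUDE 'base/.*' EXCLUDE 'pg_tblspc/.*'");
  }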

Still, for now it seems that this patch is taking the wrong approach
and that the general consensus would be to use the replication
protocol instead, so I am marking this patch as returned with
feedback.
Thanks!
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reduce ProcArrayLock contention

2015-09-03 Thread Amit Kapila
On Thu, Sep 3, 2015 at 11:07 PM, Robert Haas  wrote:

> On Tue, Aug 25, 2015 at 7:51 AM, Amit Kapila 
> wrote:
> > [ new patch ]
>
> Andres pinged me about this patch today.  It contained a couple of
> very strange whitespace anomalies, but otherwise looked OK to me, so I
> committed it after fixing those issues and tweaking the comments
> a bit.


Thanks for looking into it.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-03 Thread Michael Paquier
On Fri, Sep 4, 2015 at 9:57 AM, Tom Lane  wrote:

> On reflection I think that the tracking of activeSubid in my patch is
> probably overkill if we're attacking it this way.  We can just have
> AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level,
> which is pretty analogous to what AtAbort_Portals has done for a long
> time.
>

Tracking the subxact that activated the portal looked neat from my
side; it's more consistent with what is done when the portal is marked
as ready, particularly with the new routine introduced.
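
For reference, a minimal sketch of the simplification quoted above (not the
committed change) - inside AtSubAbort_Portals(), fail any portal that is
still running, regardless of which subtransaction activated it:

  /* Sketch only: walk the portal hash table and fail active portals. */
  HASH_SEQ_STATUS status;
  PortalHashEnt  *hentry;

  hash_seq_init(&status, PortalHashTable);
  while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
  {
      Portal      portal = hentry->portal;

      if (portal->status == PORTAL_ACTIVE)
          MarkPortalFailed(portal);

      /* ... existing subtransaction-ownership handling continues ... */
  }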


> Let me work on this and see if I can get to a simpler patch.
>

Oh, OK. Thanks!
-- 
Michael

