Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Amit Langote
On 2019/02/20 13:54, Tom Lane wrote:
> Amit Langote  writes:
>> On 2019/02/20 6:48, Tom Lane wrote:
>>> What if we dropped that idea, and instead defined the plan tree as
>>> returning only the columns that are updated by SET, plus the row
>>> identity?  It would then be the ModifyTable node's job to fetch the
>>> original tuple using the row identity (which it must do anyway) and
>>> form the new tuple by combining the updated columns from the plan
>>> output with the non-updated columns from the original tuple.
> 
>> Regarding child target relations that are foreign tables, the
>> expand-target-inheritance-at-the-bottom approach perhaps leaves no way to
>> allow pushing the update (possibly with joins) to the remote side?
> 
> That's something we'd need to think about.  Obviously, anything
> along this line breaks the existing FDW update APIs, but let's assume
> that's acceptable.  Is it impossible, or even hard, for an FDW to
> support this definition of UPDATE rather than the existing one?
> I don't think so --- it seems like it's just different --- but
> I might well be missing something.

IIUC, in the new approach, only the root of the inheritance tree (target
table specified in the query) will appear in the query's join tree, not
the child target tables, so pushing updates with joins to the remote side
seems a bit hard, because we're not going to consider child joins.  Maybe
I'm missing something though.

Thanks,
Amit




Re: Row Level Security − leakproof-ness and performance implications

2019-02-19 Thread Pierre Ducroquet
On Wednesday, February 20, 2019 12:43:50 AM CET Laurenz Albe wrote:
> Pierre Ducroquet wrote:
> > In order to increase our security, we have started deploying row-level
> > security in order to add another safety net if any issue was to happen in
> > our applications.
> > After a careful monitoring of our databases, we found out that a lot of
> > queries started to go south, going extremely slow.
> > The root of these slowdowns is that a lot of the PostgreSQL functions are
> > not marked as leakproof, especially the ones used for operators.
> > In current git master, the following query returns 258 functions that are
> > used by operators returning booleans and not marked leakproof:
> > 
> > SELECT proname FROM pg_proc
> > 
> > WHERE exists(select 1 from pg_operator where oprcode::name =
> > proname)
> > AND NOT(proleakproof)
> > AND prorettype = 16;
> > 
> > Among these, if you have for instance a table like:
> > create table demo_enum (
> > 
> > username text,
> > flag my_enum
> > 
> > );
> > 
> > With an index on (username, flag), the index will only be used on username
> > because flag is an enum and the enum_eq function is not marked leakproof.
> > 
> > For simple functions like enum_eq/enum_ne, marking them leakproof is an
> > obvious fix (patch attached to this email, including also textin/textout).
> > And
> The enum_eq part looks totally safe, but the text functions allocate memory,
> so you could create memory pressure, wait for error messages that tell you
> the size of the allocation that failed and this way learn about the data.
> 
> Is this a paranoid idea?

This is not paranoid; it depends on your threat model.
In the model we implemented, the biggest threat we consider from a user's point
of view is IDOR (Insecure Direct Object Reference): a developer forgetting to
check that their input is sane and matches the other parts of the URL or the
current session user. Exploiting the leak you mentioned in such a situation is
almost impossible, and to be honest the attacker has much easier targets if they
get to that level…
Maybe leakproof is too simple? Should we be able to specify a 'paranoid-level' 
to allow some leaks depending on our threat model?

> > if we had a 'RLS-enabled' context on functions, a way to make a lot of
> > built- in functions leakproof would simply be to prevent them from
> > logging errors containing values.
> > 
> > For others, like arraycontains, it's much trickier :[...]
> 
> I feel a little out of my depth here, so I won't comment.
> 
> > A third issue we noticed is the usage of functional indexes. If you create
> > an index on, for instance, (username, leaky_function(my_field)), and then
> > search on leaky_functon(my_field) = 42, the functional index can be used
> > only if leaky_function is marked leakproof, even if it is never going to
> > be executed on invalid rows thanks to the index. After a quick look at
> > the source code, it also seems complicated to implement since the
> > decision to reject potential dangerous calls to leaky_function is done
> > really early in the process, before the optimizer starts.
> 
> If there is a bitmap index scan, the condition will be rechecked, so the
> function will be executed.

That's exactly my point: using a bitmap index scan would be dangerous and thus
should not be allowed, but an index scan works fine. Alternatively, the recheck on a
bitmap scan would first have to check the RLS condition before doing its own check.

> > I am willing to spend some time on these issues, but I have no specific
> > knowledge of the planner and optimizer, and I fear proper fixes would
> > require much digging there. If you think it would be appropriate for
> > functions like arraycontains or range_contains to require the 'inner'
> > comparison function to be leakproof, or just keep looking at the other
> > functions in pg_proc and leakproof the ones that can be, I would be happy
> > to write the corresponding patches.
> 
> Thanks, and I think that every function that can safely be marked leakproof
> is a progress!

Thank you for your comments.






Re: shared-memory based stats collector

2019-02-19 Thread Kyotaro HORIGUCHI
At Fri, 15 Feb 2019 15:53:28 -0300, Alvaro Herrera  
wrote in <20190215185328.GA29663@alvherre.pgsql>
> On 2019-Feb-15, Andres Freund wrote:
> 
> > It's fine to do such renames, just do them as separate patches. It's
> > hard enough to review changes this big...
> 
> Talk about moving the whole file to another subdir ...

Sounds reasonable. It was a separate patch at first, but it has since been
melded in, bloating the patch. I'm going to revert the
separation/move.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH][PROPOSAL] Add enum releation option type

2019-02-19 Thread Nikolay Shaplov
In a message of Wednesday, February 20, 2019 15:08:32 MSK, Michael
Paquier wrote:
> On Sun, Jan 06, 2019 at 06:28:11PM +0300, Nikolay Shaplov wrote:
> > Actually I am not satisfied with it either... That is why I started that
> > bigger reloptions refactoring work. So for now I would offer to keep
> > initialization as it was for other types: in initialize_reloptions using
> > [type]RelOpts[] arrays. And then I will offer a patch that changes it for
> > all types and removes the difference between built-in reloptions and
> > reloptions in extensions.
> Should this be switched as waiting on author or just marked as
> returned with feedback then?
Actually I would prefer "Committed" ;-)

And speaking seriously... Anyway, this test module in src/test/modules should
be a separate patch, as it would test that all types of options properly
reach the index extension, not only enum.

From my point of view, it can be committed without any dependency on the enum
reloption. Either we first commit this patch, and then the test module will
test the enum support, or we first commit the test module without testing
enum support, and add those tests in this enum patch.

I would prefer to commit enum first, and I promise that I will add this
module to the top of my dev TODO list. If that is not OK for you, then please
tell me and set this patch to "Waiting on Author", and I will do the
test patch first.

> > 2.5  Maybe this src/test/modules dummy index is a subject for another patch.
> > So I will start working on it right now, but we will do this work not
> > dependent on any other patches, and just add there what we need when it
> > is ready. I would actually add there some code that checks all types of
> > options, because we actually never check that an option value from reloptions
> > successfully reaches the extension code. I did these checks manually when I
> > wrote that bigger reloption patch, but there are no facilities to regularly
> > check this via the test suite. Creating a dummy index will create such
> > facilities.
> 
> bloom is a user-facing extension, while src/test/modules are normally
> not installed (there is an exception for MSVC anyway..), so stressing
> add_enum_reloption in src/test/modules looks more appropriate to me,
> except if you can define an option which is useful for the user and is
> an enum.
Thank you for pointing me in the right direction. I've been waiting for it. Now I can
go on :)) So let it be a src/test/modules module...





RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-19 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Hm.  Putting a list header for a purely-local data structure into shared
> memory seems quite ugly.  Isn't there a better place to keep that?

Agreed.  I put it in the global variable.


> Do we really want a dlist here at all?  I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved.  This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

To delete the LOCALLOCK in RemoveLocalLock(), we need a dlist.  slist requires
the list iterator to be passed in from callers.
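
For illustration, here is a minimal sketch (not the actual patch) of why a dlist
fits this case: dlist_delete() can unlink an element given only its own node,
because every node carries a prev pointer, whereas an slist would force the caller
to track the preceding element. It assumes PostgreSQL's lib/ilist.h API;
"LocalLockEntry" and its fields are stand-ins invented for the sketch, not the
real LOCALLOCK layout.

#include "postgres.h"
#include "lib/ilist.h"

typedef struct LocalLockEntry
{
	int			lockid;			/* placeholder payload */
	dlist_node	chain;			/* links the entry into the local list */
} LocalLockEntry;

static dlist_head local_locks = DLIST_STATIC_INIT(local_locks);

/* Remember a newly acquired local lock. */
static void
remember_local_lock(LocalLockEntry *entry)
{
	dlist_push_tail(&local_locks, &entry->chain);
}

/* Forget it again: O(1), no list scan, no predecessor needed. */
static void
forget_local_lock(LocalLockEntry *entry)
{
	dlist_delete(&entry->chain);
}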


From: Andres Freund [mailto:and...@anarazel.de]
> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.
> 
> I don't find

Thanks, I've done it.  


From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Can we use our knowledge of the structure of locks, i.e. partition locks
> are all children of the partitioned table, to do a better job?

I couldn't come up with an idea.


Regards
Takayuki Tsunakawa



faster-locallock-scan_v2.patch
Description: faster-locallock-scan_v2.patch


Re: Prepared transaction releasing locks before deregistering its GID

2019-02-19 Thread Masahiko Sawada
On Tue, Feb 19, 2019 at 12:27 PM Michael Paquier  wrote:
>
> On Mon, Feb 18, 2019 at 05:05:13PM +0100, Oleksii Kliukin wrote:
> > That looks like a race condition to me. What happens is that another
> > transaction with the name identical to the running one can start and proceed
> > to the prepare phase while the original one commits, failing at last instead
> > of waiting for the original one to finish.
>
> It took me 50 clients and a bit more than 20 seconds, but I have been
> able to reproduce the problem with one error.  Thanks for the
> reproducer.  This is indeed a race condition with 2PC.
>
> > By looking at the source code of FinishPreparedTransaction() I can see the
> > RemoveGXact() call, which removes the prepared transaction from
> > TwoPhaseState->prepXacts. It is placed at the very end of the function,
> > after the post-commit callbacks that clear out the locks held by the
> > transaction. Those callbacks are not guarded by the TwoPhaseStateLock,
> > resulting in a possibility for a concurrent session to proceed with
> > MarkAsPreparing after acquiring the locks released by them.
>
> Hm, I see.  Taking a breakpoint just after ProcessRecords() or putting
> a sleep there makes the issue plain.  The same issue can happen with
> predicate locks.
>
> > I couldn’t find any documentation on the expected outcome in cases like
> > this, so I assume it might not be a bug, but an undocumented behavior.
>
> If you run two transactions in parallel using your script, the second
> transaction would wait at LOCK time until the first transaction
> releases its locks with the COMMIT PREPARED.
>
> > Should I go about and produce a patch to put a note in the description of
> > commit prepared, or is there any interest in changing the behavior to avoid
> > such conflicts altogether (perhaps by holding the lock throughout the
> > cleanup phase)?
>
> That's a bug, let's fix it.  I agree with your suggestion that we had
> better not release the 2PC lock using the callbacks for COMMIT
> PREPARED or ROLLBACK PREPARED until the shared memory state is
> cleared.  At the same time, I think that we should be smarter in the
> ordering of the actions: we care about predicate locks here, but not
> about the on-disk file removal and the stat counters.  One trick is
> that the post-commit callback calls TwoPhaseGetDummyProc(), which would
> try to take TwoPhaseStateLock while it is already held, so we need
> some extra logic to not take the lock in this case.  From what I can see
> this is older than 9.4 as the removal of the GXACT entry in shared
> memory and the post-commit hooks are out of sync for a long time :(
>
> Attached is a patch that fixes the problem for me.  Please note the
> safety net in TwoPhaseGetGXact().

Thank you for working on this.

@@ -811,6 +811,9 @@ TwoPhaseGetGXact(TransactionId xid)
static TransactionId cached_xid = InvalidTransactionId;
static GlobalTransaction cached_gxact = NULL;

+   Assert(!lock_held ||
+  LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
+
/*
 * During a recovery, COMMIT PREPARED, or ABORT PREPARED,
we'll be called
 * repeatedly for the same XID.  We can save work with a simple cache.
@@ -818,7 +821,8 @@ TwoPhaseGetGXact(TransactionId xid)
if (xid == cached_xid)
return cached_gxact;

-   LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+   if (!lock_held)
+   LWLockAcquire(TwoPhaseStateLock, LW_SHARED);

for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{

It seems strange to me: why do we always require an exclusive lock
here, even though the function acquires only a shared lock?

-
@@ -854,7 +859,7 @@ TwoPhaseGetGXact(TransactionId xid)
 BackendId
 TwoPhaseGetDummyBackendId(TransactionId xid)
 {
-   GlobalTransaction gxact = TwoPhaseGetGXact(xid);
+   GlobalTransaction gxact = TwoPhaseGetGXact(xid, false);

return gxact->dummyBackendId;
 }

TwoPhaseGetDummyBackendId() is called by
multixact_twophase_postcommit() which is called while holding
TwoPhaseStateLock in exclusive mode in FinishPreparedTransaction().
Since we cache the global transaction when we call
lock_twophase_commit(), I couldn't reproduce the issue, but it seems to have
the potential for a locking problem.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



RE: Speeding up creating UPDATE/DELETE generic plan for partitioned table into a lot

2019-02-19 Thread Kato, Sho
Before addressing the speed-up of creating generic plans for UPDATE/DELETE, I will
begin with speeding up creation of the SELECT plan.

I will explain the background again, since some time has passed.
Because a generic plan contains plans for all partitions and is cached, we can skip
planning and expect performance improvements.
But when a table is partitioned into thousands of partitions, executing a
generic plan for the first time takes a long time, because the planner creates plans
for all child tables, including child tables that may never be accessed.

Therefore, I would like to develop a method that creates child table plans
gradually, instead of creating and caching all child table plans at once at EXECUTE.
I came up with a mechanism that caches the information necessary to create the plan
(something like PlannerInfo) together with the plan itself, and then adds the access
plan of a child table to the cached plan as needed.

However, I'm not sure whether this can be realized, or whether it is the right
approach, so I would like opinions.
Also, I'd like advice on whether it would be better to create a new path type for
partitioning, like "Partition Scan Path" or "Partition Index Scan Path".

regards,
Sho Kato

> -Original Message-
> From: Kato, Sho [mailto:kato-...@jp.fujitsu.com]
> Sent: Friday, February 1, 2019 5:16 PM
> To: Tsunakawa, Takayuki/綱川 貴之 ;
> 'David Rowley' ; Amit Langote
> 
> Cc: Pg Hackers 
> Subject: Speeding up creating UPDATE/DELETE generic plan for partitioned
> table into a lot
> 
> Sorry, I lost previous mail[1].
> 
> On Fri, 28 Dec 2018 at 20:36, Tsunakawa, Takayuki
>  wrote:
> > Although I may say the same thing as you, I think a natural idea would
> be to create a generic plan gradually.  The starting simple question is
> "why do we have to touch all partitions at first?"  That is, can we behave
> like this:
> 
> I also think creating a generic plan gradually is a better idea because
> planner should create a plan when it is needed.
> Any ideas?
> 
> On 2018-12-31 08:57:04, David Rowley wrote
> >I imagine the place to start looking would be around why planning is
> so slow for that many partitions.
> 
> As you may already know, FLATCOPY at range_table_mutator is a large
> bottleneck.
> When executing UPDATE, about npart-squared RangeTblEntry nodes are copied.
> When I execute UPDATE on a table with 100 partitions, FLATCOPY takes about 100
> * 0.067 ms while the total planning time is 12.689 ms.
> 
> On 2018-12-31 08:57:04, David Rowley wrote
> >Another possible interesting idea would be to, instead of creating
> >large Append/MergeAppend plans for partition scanning, invent some
> >"Partition Seq Scan" and "Partition Index Scan" nodes that are able to
> >build plans more similar to scanning a normal table. Likely such nodes
> >would need to be programmed with a list of Oids that they're to scan
> >during their execution. They'd also need to take care of their own
> >tuple mapping for when partitions had their columns in varying orders.
> 
> Inventing some "Partition Seq Scan" and "Partition Index Scan" nodes is
> interesting.
> It seems easy to add Scan nodes to each partition gradually.
> 
> [1]:CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=VsYipw@mail.gmail.c
> om
> 
> regards,
> 
> 




RE: Timeout parameters

2019-02-19 Thread Jamison, Kirk
Hi,

I tried to re-read the whole thread.
Based on what I read, there are two proposed timeout parameters,
which I think can be discussed and committed separately:
(1) tcp_user_timeout 
(2) tcp_socket_timeout (or suggested client_statement_timeout,
send_timeout/recv_timeout)

Regarding the use-case of each parameter, Tsunakawa-san briefly
explained them above. Quoting again:
> * TCP keepalive and TCP user (retransmission) timeout: for network problems
> * statement_timeout: for long-running queries
> * socket_timeout (or send/recv_timeout): for saturated servers

The already-existing statement_timeout mainly limits how long
each statement may run. [1]
However, even if statement_timeout is configured, it does not
handle cases where a network failure occurs,
so the application would not recover from the error.
Therefore, there's a need for these features, to cover the cases
that statement_timeout currently does not handle.
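
As a concrete (hedged) illustration of that layering, here is a minimal libpq
sketch: statement_timeout limits how long a query may run on the server, TCP
keepalives detect a dead peer, and the tcp_user_timeout connection parameter
proposed in this thread would bound how long unacknowledged data may stay in
flight. The tcp_user_timeout parameter is an assumption here (it is exactly what
is under discussion), and the host name and values are made up.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* hypothetical connection string; tcp_user_timeout is the proposed parameter */
	const char *conninfo =
		"host=db.example.com dbname=app "
		"keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=3 "
		"tcp_user_timeout=60000 "				/* proposed: ms to wait for unacked data */
		"options='-c statement_timeout=30000'";	/* server-side query runtime limit, ms */

	PGconn	   *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

	PQfinish(conn);
	return 0;
}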

1) tcp_user_timeout parameter
As for user_timeout param, there seems to be a common agreement
with regards to its need.

Just a minor nitpick:
+   char   *tcp_user_timeout;   /* TCP USER TIMEOUT */
I think that's unnecessary capitalization in the user timeout part.

The latest tcp_user_timeout patch seems to be almost in good shape;
the feedback about the docs (clearer description from Andrei)
and the code (whitespace, C style, parse_int_param, handling old kernel
versions) has been addressed.

I think this can be "committed" separately when it's finalized.


2) tcp_socket_timeout parameter
On the other hand, there needs to be further discussion and design
improvement with regards to the implementation of socket_timeout:
- whether (a) it should abort the connection from pqWait() or
  other means, or (b) cancel the statement similar to how psql
  does it, as suggested by Fabien
- the proper parameter name
 
Based on your description below, I agree with Fabien that it's essentially
an application/client-side query timeout:
> "socket_timeout" is the application layer timeout parameter from when 
> frontend issues SQL query to when frontend receives the execution result 
> from backend. When this parameter is active and timeout occurs, frontend 
> close the socket. It is a merit for client to set the maximum time to 
> wait for SQL.  

In PgJDBC, it serves two purposes though: query timeout and network problem
detection. The use of socketTimeout aborts the connection. [2]
> The timeout value used for socket read operations. If reading from 
> the server takes longer than this value, the connection is closed. 
> This can be used as both a brute force global query timeout and a
> method of detecting network problems. The timeout is specified in 
> seconds and a value of zero means that it is disabled.
 
Perhaps you could also clarify a bit more, through documentation, how
socket_timeout handles timeouts differently from statement_timeout
and tcp_user_timeout.
Then we can decide which parameter name is better once the
implementation becomes clearer.

[1] https://www.postgresql.org/docs/devel/runtime-config-client.html
[2] 
https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters


Regards,
Kirk Jamison



Re: [PATCH][PROPOSAL] Add enum releation option type

2019-02-19 Thread Michael Paquier
On Sun, Jan 06, 2019 at 06:28:11PM +0300, Nikolay Shaplov wrote:
> Actually I am not satisfied with it either... That is why I started that bigger
> reloptions refactoring work. So for now I would offer to keep initialization
> as it was for other types: in initialize_reloptions using [type]RelOpts[]
> arrays. And then I will offer a patch that changes it for all types and removes
> the difference between built-in reloptions and reloptions in extensions.

Should this be switched as waiting on author or just marked as
returned with feedback then?

> 2.5  Maybe this src/test/modules dummy index is a subject for another patch. So
> I will start working on it right now, but we will do this work not dependent
> on any other patches, and just add there what we need when it is ready. I
> would actually add there some code that checks all types of options, because
> we actually never check that an option value from reloptions successfully reaches
> the extension code. I did these checks manually when I wrote that bigger reloption
> patch, but there are no facilities to regularly check this via the test suite.
> Creating a dummy index will create such facilities.

bloom is a user-facing extension, while src/test/modules are normally
not installed (there is an exception for MSVC anyway..), so stressing
add_enum_reloption in src/test/modules looks more appropriate to me,
except if you can define an option which is useful for the user and is
an enum.
--
Michael


signature.asc
Description: PGP signature


Re: speeding up planning with partitions

2019-02-19 Thread Amit Langote
On 2019/02/20 5:57, Tom Lane wrote:
> I wrote:
>> OK, I'll make another pass over 0001 today.
> 
> So I started the day with high hopes for this, but the more I looked at
> it the less happy I got, and finally I ran into something that looks to
> be a complete show-stopper.  Namely, that the patch does not account
> for the possibility of an inherited target rel being the outside for a
> parameterized path to some other rel.  Like this example in the
> regression database:
> 
> explain update a set aa = aa + 1
> from tenk1 t where a.aa = t.unique2;
> 
> With HEAD, you get a perfectly nice plan that consists of an append
> of per-child plans like this:
> 
>->  Nested Loop  (cost=0.29..8.31 rows=1 width=16)
>  ->  Seq Scan on a  (cost=0.00..0.00 rows=1 width=10)
>  ->  Index Scan using tenk1_unique2 on tenk1 t  (cost=0.29..8.30 
> rows=1 
> width=10)
>Index Cond: (unique2 = a.aa)
> 
> With the 0001 patch, this gets an Assert during set_base_rel_pathlists,
>
> because indxpath.c tries to make a parameterized path for tenk1
> with "a" as the outer rel.  Since tenk1's joinlist hasn't been
> touched, it's still referencing the inheritance parent, and the
> code notices that we haven't made a rowcount estimate for that.

Hmm, yeah.  It wouldn't have crashed with an earlier version of the patch,
because with it, we were setting the parent relation's rows to a dummy
value of 1 at the end of set_inherited_target_rel_sizes, which I removed
in the latest patch after your comment upthread.

> Even if we had, we'd generate a Path referencing Vars of the parent
> rel, which would not work.
>
> Conceivably, such a Path could be fixed up later (say by applying
> adjust_appendrel_attrs to it during createplan.c),

Actually, reparameterize_path_by_child (invented by the partitionwise-join
commit) seems to take care of fixing up the Path to have child attributes,
so the plan comes out exactly as on HEAD.  But to be honest, that means
this new approach to inherited update join planning only appears to work
by accident.

> but that is not
> going to fix the fundamental problem: the cost estimate for such a
> Path should vary depending on how large we think the outer rel is,
> and we don't have a reasonable way to set that if we're trying to
> make a one-size-fits-all Path for something that's being joined to
> an inheritance tree with a widely varying set of relation sizes.

What if we set the parent target relation's rows to the average of the child
target relations' rows, instead of setting it to the dummy value of 1 that
previous versions of the patch were using?

Thanks,
Amit




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-19 Thread Tsunakawa, Takayuki
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> > Imai-san confirmed performance improvement with this patch:
> >
> > https://commitfest.postgresql.org/22/1993/
> >
> 
> Can you quantify the effects? That is, how much slower/faster does it get?

Ugh, sorry, I wrote a wrong URL.  The correct page is:

https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512787EC%40g01jpexmbkw24

The quoted figures are:

[v20 + faster-locallock-scan.patch]
auto:   9,069 TPS
custom: 9,015 TPS

[v20]
auto:   8,037 TPS
custom: 8,933 TPS


In the original problematic case, plan_cache_mode = auto (the default), we can see
about a 13% improvement.


Regards
Takayuki Tsunakawa





Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Tom Lane
Pavan Deolasee  writes:
> We will need to consider how this affects EvalPlanQual which currently
> doesn't have to do anything special for partitioned tables. I solved that
> via tracking the expanded-at-the-bottom child in a separate
> mergeTargetRelation, but that approach has been criticised. May be Tom's
> idea doesn't have the same problem or most likely he will have a far better
> approach to address that.

I did spend a few seconds thinking about that, and my gut says that
this wouldn't change anything interesting for EPQ.  But the devil
is in the details as always, so maybe working out the patch would
find problems ...

regards, tom lane



Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Pavan Deolasee
On Wed, Feb 20, 2019 at 4:23 AM David Rowley 
wrote:

> On Wed, 20 Feb 2019 at 10:49, Tom Lane  wrote:
> > What if we dropped that idea, and instead defined the plan tree as
> > returning only the columns that are updated by SET, plus the row
> > identity?  It would then be the ModifyTable node's job to fetch the
> > original tuple using the row identity (which it must do anyway) and
> > form the new tuple by combining the updated columns from the plan
> > output with the non-updated columns from the original tuple.
> >
> > DELETE would be even simpler, since it only needs the row identity
> > and nothing else.
>
> While I didn't look at the patch in great detail, I think this is how
> Pavan must have made MERGE work for partitioned targets. I recall
> seeing the tableoid being added to the target list and a lookup of the
> ResultRelInfo by tableoid.
>
> Maybe Pavan can provide more useful details than I can.
>

Yes, that's the approach I took in MERGE, primarily because of the hurdles
I faced in handling partitioned tables, which take an entirely different route
for UPDATE/DELETE vs INSERT, and in MERGE we had to do all three together.
But the approach also showed significant performance improvements.
UPDATE/DELETE via MERGE is far quicker than regular UPDATE/DELETE
when there is a non-trivial number of partitions. That's also a reason why I
recommended doing the same for regular UPDATE/DELETE, but that got lost in
the MERGE discussions. So +1 for the approach.

We will need to consider how this affects EvalPlanQual which currently
doesn't have to do anything special for partitioned tables. I solved that
via tracking the expanded-at-the-bottom child in a separate
mergeTargetRelation, but that approach has been criticised. May be Tom's
idea doesn't have the same problem or most likely he will have a far better
approach to address that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Tom Lane
Amit Langote  writes:
> On 2019/02/20 6:48, Tom Lane wrote:
>> What if we dropped that idea, and instead defined the plan tree as
>> returning only the columns that are updated by SET, plus the row
>> identity?  It would then be the ModifyTable node's job to fetch the
>> original tuple using the row identity (which it must do anyway) and
>> form the new tuple by combining the updated columns from the plan
>> output with the non-updated columns from the original tuple.

> Regarding child target relations that are foreign tables, the
> expand-target-inheritance-at-the-bottom approach perhaps leaves no way to
> allow pushing the update (possibly with joins) to the remote side?

That's something we'd need to think about.  Obviously, anything
along this line breaks the existing FDW update APIs, but let's assume
that's acceptable.  Is it impossible, or even hard, for an FDW to
support this definition of UPDATE rather than the existing one?
I don't think so --- it seems like it's just different --- but
I might well be missing something.

regards, tom lane



Re: pg_basebackup ignores the existing data directory permissions

2019-02-19 Thread Michael Paquier
On Wed, Feb 20, 2019 at 03:16:48PM +1100, Haribabu Kommi wrote:
> Given the lack of complaints from users, how about making this change only in HEAD?

Fine by me.  I would tend to patch pg_basebackup so that the target
folder gets the correct umask, instead of nuking the chmod call in
initdb, so I think that we are on the same page.  Let's see what the
others think.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] kNN for btree

2019-02-19 Thread Thomas Munro
On Wed, Feb 20, 2019 at 2:18 PM Nikita Glukhov  wrote:
> On 04.02.2019 8:35, Michael Paquier wrote:
> This patch set needs a rebase because of conflicts caused by the
> recent patches for pluggable storage.

Hi Nikita,
From the department of trivialities: according to cfbot the
documentation doesn't build.  Looks like you have some cases of ,
but these days you have to write out  (or whatever) in full.

-- 
Thomas Munro
https://enterprisedb.com



Re: pg_basebackup ignores the existing data directory permissions

2019-02-19 Thread Haribabu Kommi
On Fri, Feb 15, 2019 at 10:15 AM Michael Paquier 
wrote:

> On Thu, Feb 14, 2019 at 11:21:19PM +1100, Haribabu Kommi wrote:
> > On Thu, Feb 14, 2019 at 8:57 PM Magnus Hagander 
> wrote:
> >> I think it could be argued that neither initdb *or* pg_basebackup should
> >> change the permissions on an existing directory, because the admin may
> have
> >> done that intentionally. But when they do create the directory, they
> should
> >> follow the same patterns.
> >
> > Hmm, even if the administrator set some specific permissions to the data
> > directory, PostgreSQL server doesn't allow server to start if the
> > permissions are not (0700) for versions less than 11 and (0700 or
> > 0750) for version 11 or later.
>
> Yes, particularly with pg_basebackup -R this adds an extra step in the
> user flow.
>
> > To let the user to use the PostgreSQL server, user must change the
> > permissions of the data directory. So, I don't see a problem in
> > changing the permissions by these tools.
>
> I certainly agree with the point of Magnus that both tools should
> behave consistently, and I cannot actually imagine why it would be
> useful for an admin to keep a more permissive data folder while all
> the contents already have umasks set at the same level as the primary
> (or what initdb has been told to use), but perhaps I lack imagination.
> If we doubt about potential user impact, the usual, best, answer is to
> let back-branches behave the way they do now, and only do something on
> HEAD.
>

I also agree that both initdb and pg_basebackup should behave the same.
Our main concern is that a standby data directory that doesn't follow
the primary data directory's permissions can lead to failures when the standby
gets promoted.

Given the lack of complaints from users, how about making this change only in HEAD?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Protect syscache from bloating with negative cache entries

2019-02-19 Thread Kyotaro HORIGUCHI
At Thu, 14 Feb 2019 00:40:10 -0800, Andres Freund  wrote in 
<20190214084010.bdn6tmba2j7sz...@alap3.anarazel.de>
> Hi,
> 
> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> > Instead, I added an accounting(?) interface function.
> > 
> > | MemoryContextGettConsumption(MemoryContext cxt);
> > 
> > The API returns the current consumption in this memory
> > context. This allows "real" memory accounting almost without
> > overhead.
> 
> That's definitely *NOT* almost without overhead. This adds additional
> instructions to one postgres' hottest set of codepaths.

I'm not sure how much the two instructions in AllocSetAlloc
actually impact, but I agree that it is doubtful that the
size-limit feature is worth the possible slowdown to any extent.

# I faintly remember that I tried the same thing before..

> I think you're not working incrementally enough here. I strongly suggest
> solving the negative cache entry problem, and then incrementally go from
> there after that's committed. The likelihood of this patch ever getting
> merged otherwise seems extremely small.

Mmm. Scoping this down to the negative cache entry problem, my very first patch,
posted two years ago, did that based on invalidation for pg_statistic
and pg_class, as I think Tom has suggested somewhere in this
thread.

https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyot...@lab.ntt.co.jp

That is a completely different approach from the current shape, and
it would be useless after pruning is introduced. So I'd like to
go for generic pruning by age.

Differences from v15:

  Removed the AllocSet accounting stuff. We use an approximate memory
  size for the catcache.

  Removed the prune-by-number (or size) stuff.

  Addressed comments from Tsunakawa-san and Ideriha-san.

  Separated out the catcache monitoring feature. (Removed from this set.)
  (But it is crucial for checking this feature...)


Is this small enough?

regards.  

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
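
The first patch attached below adds dlist_move_tail(). As a hypothetical
illustration (not part of the patch) of how it supports pruning by age, a cache
could keep entries in access order: every hit moves the entry to the tail, so the
head end accumulates the least recently used entries for a pruning pass to remove.
The "CacheEntry" type and field names here are made up for the sketch.

#include "postgres.h"
#include "lib/ilist.h"

typedef struct CacheEntry
{
	dlist_node	lru_node;		/* position in the LRU list */
	long		last_access;	/* e.g. a coarse clock reading */
} CacheEntry;

static dlist_head lru_list = DLIST_STATIC_INIT(lru_list);

/* Call on every cache hit: the entry becomes the most recently used one. */
static void
touch_entry(CacheEntry *entry, long now)
{
	entry->last_access = now;
	dlist_move_tail(&lru_list, &entry->lru_node);
}
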
From 191496e02abd4d7b261705e8d2a0ef4aed5827c7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/2] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
 	dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+	/* fast path if it's already at the tail */
+	if (head->head.prev == node)
+		return;
+
+	dlist_delete(node);
+	dlist_push_tail(head, node);
+
+	dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From 59f53da08abb70398611b33f635b46bda87a7534 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 2/2] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.

This also can put a hard limit on the number of catcache entries.
---
 doc/src/sgml/config.sgml  |  40 +
 src/backend/tcop/postgres.c   |  13 ++
 src/backend/utils/cache/catcache.c| 243 --
 src/backend/utils/init/globals.c  |   1 +
 src/backend/utils/init/postinit.c |  11 ++
 src/backend/utils/misc/guc.c  |  23 +++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/miscadmin.h   |   1 +
 src/include/utils/catcache.h  |  43 -
 src/include/utils/timeout.h   |   1 +
 10 files changed, 364 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..7a93aef659 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1661,6 +1661,46 @@ include_dir 'conf.d'
   
  
 
+ 
+  catalog_cache_prune_min_age (integer)
+  
+   catalog_cache_prune_min_age configuration
+   parameter
+  
+  
+  
+   
+Specifies the minimum amount of unused time in seconds at which a
+system catalog cache entry is removed. -1 indicates that this feature
+is disabled entirely. The value defaults to 300 seconds (5
+minutes). Catalog cache entries that are not used for
+this duration can be removed to prevent them from 

Re: Copy function for logical replication slots

2019-02-19 Thread Masahiko Sawada
On Wed, Feb 20, 2019 at 12:26 PM Michael Paquier  wrote:
>
> On Tue, Feb 19, 2019 at 05:09:33PM +0900, Masahiko Sawada wrote:
> > On Tue, Feb 19, 2019 at 1:28 AM Andres Freund  wrote:
> >> Well, I'd not thought we'd do it without acquiring the other slot. But
> >> that still seems to be easy enough to address, we just need to recheck
> >> whether the slot still exists (with the right name) the second time we
> >> acquire the spinlock?
> >
> > Yeah, I think that would work. The attached patch takes this
> > direction. Please review it.
>
> + if (XLogRecPtrIsInvalid(copy_restart_lsn) ||
> +  copy_restart_lsn < src_restart_lsn ||
> +  src_islogical != copy_islogical ||
> +  strcmp(copy_name, NameStr(*src_name)) != 0)
> +  ereport(ERROR,
> +  (errmsg("could not copy logical replication slot \"%s\"",
> +   NameStr(*src_name)),
> +   errdetail("The source replication slot has been dropped during copy")));
> +
> +  /* Install copied values again */
> +  SpinLockAcquire(&MyReplicationSlot->mutex);
>
> Worth worrying about this window not reduced to zero?  If the slot is
> dropped between both then the same issue would arise.

Did you mean that the destination slot (i.e. MyReplicationSlot) could be
dropped before acquiring its lock? Since we're holding the new slot, it
cannot be dropped.

BTW, in "XLogRecPtrIsInvalid(copy_restart_lsn) || copy_restart_lsn <
src_restart_lsn", the former condition is redundant and should be removed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: libpq debug log

2019-02-19 Thread Robert Haas
On Mon, Feb 18, 2019 at 10:06 PM Jamison, Kirk  wrote:
> It sounds more logical to me if there is a parameter that switches on/off the 
> logging
> similar to other postgres logs. I suggest trace_log as the parameter name.

I'm not really sure that I like the design of this patch in any way.
But leaving that aside, trace_log seems like a poor choice of name,
because it's got two words telling you what kind of thing we are doing
(tracing, logging) and zero words describing the nature of the thing
being traced or logged (the wire protocol communication with the
client).  A tracing facility for the planner, or the executor, or any
other part of the system would have an equally good claim on the same
name.  That can't be right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

2019-02-19 Thread Haribabu Kommi
On Tue, Nov 27, 2018 at 4:59 PM Amit Langote 
wrote:

> Hi,
>
> On 2018/11/02 9:17, Haribabu Kommi wrote:
> > Here I attached the cumulative fixes of the patches, new API additions
> for
> > zheap and
> > basic outline of the documentation.
>
> I've read the documentation patch while also looking at the code and here
> are some comments.
>

Thanks for the review, and apologies for the delay.
I have taken care of most of your comments in the latest version of the
doc patches.


>
> +  
> +
> +TupleTableSlotOps *
> +slot_callbacks (Relation relation);
> +
> +   API to access the slot specific methods;
> +   Following methods are available;
> +   TTSOpsVirtual,
> +   TTSOpsHeapTuple,
> +   TTSOpsMinimalTuple,
> +   TTSOpsBufferTuple,
> +  
>
> Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
> its relations to the TableAmRoutine abstraction, I think the text
> description could better be written as:
>
> "API to get the slot operations struct for a given table access method"
>
> It's not clear to me why the various TTSOps* structs are listed here?  Is the
> point that different AMs may choose one of the listed alternatives?  For
> example, I see that the heap AM implementation returns TTSOpsBufferTuple, so it
> manipulates slots containing buffered tuples, right?  Other AMs are free
> to return any one of these?  For example, some AMs may never use the buffer
> manager and hence not use TTSOpsBufferTuple.  Is that understanding correct?
>

Yes, the AM can decide what type of slot method it wants to use.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Pluggable Storage - Andres's take

2019-02-19 Thread Haribabu Kommi
On Mon, Feb 4, 2019 at 2:31 PM Haribabu Kommi 
wrote:

>
> On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi 
> wrote:
>
>>
>>
>> OK. I will work on the doc changes.
>>
>
> Sorry for the delay.
>
> Attached is a draft patch of the doc and comment changes that I worked on.
> Currently I added comments to the callbacks that are present in the
> TableAmRoutine
> structure and copied them into the docs. I am not sure whether that is a
> good approach or not.
> I have yet to add descriptions for each parameter of the callbacks for
> easier understanding.
>
> Or, should I describe each callback in the docs, dividing
> the callbacks
> according to how they are divided in the TableAmRoutine structure? Currently
> the following divisions
> are available.
> 1. Table scan
> 2. Parallel table scan
> 3. Index scan
> 4. Manipulation of physical tuples
> 5. Non-modifying operations on individual tuples
> 6. DDL
> 7. Planner
> 8. Executor
>
> Suggestions?
>

Here I have attached the doc patches for pluggable storage. I divided the
APIs into the groups
specified above and explained them in the docs. I can add more
details if the approach
seems fine.

Regards,
Haribabu Kommi
Fujitsu Australia


0008-Table-access-method-API-explanation.patch
Description: Binary data


0001-Docs-of-default_table_access_method-GUC.patch
Description: Binary data


0002-Rename-indexam.sgml-to-am.sgml.patch
Description: Binary data


0003-Reorganize-am-as-both-table-and-index.patch
Description: Binary data


0004-Doc-update-of-Create-access-method-type-table.patch
Description: Binary data


0005-Doc-update-of-create-materialized-view-.-USING-synta.patch
Description: Binary data


0006-Doc-update-of-CREATE-TABLE-.-USING-syntax.patch
Description: Binary data


0007-Doc-of-CREATE-TABLE-AS-.-USING-syntax.patch
Description: Binary data


Re: Incorrect visibility test function assigned to snapshot

2019-02-19 Thread Michael Paquier
On Mon, Feb 18, 2019 at 10:45:38AM -0300, Alvaro Herrera wrote:
> None here.

Thanks.  And committed.
--
Michael


signature.asc
Description: PGP signature


Re: Compressed TOAST Slicing

2019-02-19 Thread Юрий Соколов
Some time ago I posted a PoC patch with an alternative TOAST compression scheme:
instead of "compress-then-chunk" I suggested "chunk-then-compress". It
compresses somewhat less effectively, but allows efficient arbitrary slicing.
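
As a rough, self-contained sketch of that idea (not the PoC patch itself): each
fixed-size piece of the original datum is compressed on its own, so reading bytes
[off, off+len) only needs the chunks that overlap that range. decompress_chunk()
below is a stand-in stub for a real decompressor such as pglz, and the struct is
invented for the sketch.

#include <string.h>

#define RAW_CHUNK_SIZE 8192

typedef struct CompressedChunk
{
	int		rawlen;				/* uncompressed length of this chunk */
	char   *data;				/* compressed bytes of one chunk */
} CompressedChunk;

/* Stand-in decompressor: a real one would run pglz/LZ4 on chunk->data. */
static int
decompress_chunk(const CompressedChunk *chunk, char *out)
{
	memcpy(out, chunk->data, chunk->rawlen);	/* pretend data is stored raw */
	return chunk->rawlen;
}

/* Copy bytes [off, off + len) of the original datum into dst. */
static void
fetch_slice(const CompressedChunk *chunks, char *dst, int off, int len)
{
	int		first = off / RAW_CHUNK_SIZE;
	int		last = (off + len - 1) / RAW_CHUNK_SIZE;

	for (int c = first; c <= last; c++)
	{
		char	raw[RAW_CHUNK_SIZE];
		int		rawlen = decompress_chunk(&chunks[c], raw);
		int		start = (c == first) ? off - c * RAW_CHUNK_SIZE : 0;
		int		stop = (c == last) ? off + len - c * RAW_CHUNK_SIZE : rawlen;

		memcpy(dst, raw + start, stop - start);
		dst += stop - start;
	}
}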

Wed, Feb 20, 2019, 2:09 Paul Ramsey pram...@cleverelephant.ca:

> On Sat, Feb 16, 2019 at 7:25 AM Simon Riggs  wrote:
>
> > Could we get an similarly optimized implementation of -> operator for
> JSONB as well?
> > Are there any other potential uses? Best to fix em all up at once and
> then move on to other things. Thanks.
>
> Oddly enough, I couldn't find many/any things that were sensitive to
> left-end decompression. The only exception is "LIKE this%" which
> clearly would be helped, but unfortunately wouldn't be a quick
> drop-in, but a rather major reorganization of the regex handling.
>
> I had a look at "->" and I couldn't see how a slice could be used to
> make it faster. We don't know a priori how big a slice would give us
> what we want. This again makes Stephen's case for an iterator, but of
> course all the iterator benefits only come when the actual function at
> the top (in this case the json parser) are also updated to be
> iterative.
>
> Committing this little change doesn't preclude an iterator, or even
> make doing one more complicated... :)
>
> P.
>
>


Re: Copy function for logical replication slots

2019-02-19 Thread Michael Paquier
On Tue, Feb 19, 2019 at 05:09:33PM +0900, Masahiko Sawada wrote:
> On Tue, Feb 19, 2019 at 1:28 AM Andres Freund  wrote:
>> Well, I'd not thought we'd do it without acquiring the other slot. But
>> that still seems to be easy enough to address, we just need to recheck
>> whether the slot still exists (with the right name) the second time we
>> acquire the spinlock?
> 
> Yeah, I think that would work. The attached patch takes this
> direction. Please review it.

+ if (XLogRecPtrIsInvalid(copy_restart_lsn) ||
+  copy_restart_lsn < src_restart_lsn ||
+  src_islogical != copy_islogical ||
+  strcmp(copy_name, NameStr(*src_name)) != 0)
+  ereport(ERROR,
+  (errmsg("could not copy logical replication slot \"%s\"",
+   NameStr(*src_name)),
+   errdetail("The source replication slot has been dropped during copy")));
+
+  /* Install copied values again */
+  SpinLockAcquire(&MyReplicationSlot->mutex);

Worth worrying about this window not reduced to zero?  If the slot is
dropped between both then the same issue would arise.
--
Michael


signature.asc
Description: PGP signature


Re: Delay locking partitions during INSERT and UPDATE

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 4:07 PM David Rowley
 wrote:
> I'd say that here we should only discuss what this patch is doing, not
> anything else that's in flight that you're concerned will conflict
> with the ATTACH/DETACH PARTITION CONCURRENTLY patch.
>
> During INSERT and UPDATE, not all partitions will always be locked
> before executor startup. This patch removing the find_all_inheritors()
> call from ExecSetupPartitionTupleRouting() is not going to break
> ATTACH/DETACH PARTITION CONCURRENTLY if it wasn't already broken in
> the first place.  With this patch, we're still obtaining locks after
> execution has begun, it just may delay the locking until a bit later.
> It was previously already happening after executor startup had begun,
> so the window for the problems that you describe was already
> non-zero.

The issue of whether it's OK to lock child partitions in variable
order has come up on many different threads, and as far as I know this
is the only thread where Tom has expressed any view on it.  Since Tom
is a smart guy,[1] I was hoping to get him to expand on those views a
little bit.  Even if that leads this thread to detour slightly from
the matter immediately at hand, I think it will be worth it if it gets
us to some kind of consensus on what has been a thorny question for
some time now.

On the other hand, Tom may not reply, in which case the rest of us
will just have to keep faking it as best we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] citation needed, except not really



Re: Reaping Temp tables to avoid XID wraparound

2019-02-19 Thread Michael Paquier
On Tue, Feb 19, 2019 at 09:56:28AM +0100, Magnus Hagander wrote:
> 2. Or probably even better, just put it in PgBackendStatus? Overhead here
> is a lot cheaper than PGPROC.
> 
> ISTM 2 is probably the most reasonable option here?

Yes, I forgot this one.  That would be more consistent, even if the
information can become out of date quickly; we don't care about that here.
--
Michael


signature.asc
Description: PGP signature


Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Amit Langote
On 2019/02/20 10:55, Amit Langote wrote:
> Maybe I should've mentioned that on this thread at some point.

I meant the other thread where we're discussing my patches.

Thanks,
Amit




Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Amit Langote
Hi,

On 2019/02/20 6:48, Tom Lane wrote:
> While contemplating the wreckage of 
> https://commitfest.postgresql.org/22/1778/
> I had the beginnings of an idea of another way to fix that problem.
>
> The issue largely arises from the fact that for UPDATE, we expect
> the plan tree to emit a tuple that's ready to be stored back into
> the target rel ... well, almost, because it also has a CTID or some
> other row-identity column, so we have to do some work on it anyway.
> But the point is this means we potentially need a different
> targetlist for each child table in an inherited UPDATE.
> 
> What if we dropped that idea, and instead defined the plan tree as
> returning only the columns that are updated by SET, plus the row
> identity?  It would then be the ModifyTable node's job to fetch the
> original tuple using the row identity (which it must do anyway) and
> form the new tuple by combining the updated columns from the plan
> output with the non-updated columns from the original tuple.
> 
> DELETE would be even simpler, since it only needs the row identity
> and nothing else.

I had bookmarked a link to an archived email of yours from about 5 years
ago, in which you described a similar attack plan for UPDATE planning:

https://www.postgresql.org/message-id/1598.1399826841%40sss.pgh.pa.us

It's been kind of in the back of my mind for a while; I even considered
implementing it based on your sketch back then, but didn't have solutions
for some issues surrounding optimization of updates of foreign partitions
(see below).  Maybe I should've mentioned that on this thread at some point.

> Having done that, we could toss inheritance_planner into the oblivion
> it so richly deserves, and just treat all types of inheritance or
> partitioning queries as expand-at-the-bottom, as SELECT has always
> done it.
> 
> Arguably, this would be more efficient even for non-inheritance join
> situations, as less data (typically) would need to propagate through the
> join tree.  I'm not sure exactly how it'd shake out for trivial updates;
> we might be paying for two tuple deconstructions not one, though perhaps
> there's a way to finesse that.  (One easy way would be to stick to the
> old approach when there is no inheritance going on.)
> 
> In the case of a standard inheritance or partition tree, this seems to
> go through really easily, since all the children could share the same
> returned CTID column (I guess you'd also need a TABLEOID column so you
> could figure out which table to direct the update back into).  It gets
> a bit harder if the tree contains some foreign tables, because they might
> have different concepts of row identity, but I'd think in most cases you
> could still combine those into a small number of output columns.

Regarding child target relations that are foreign tables, the
expand-target-inheritance-at-the-bottom approach perhaps leaves no way to
allow pushing the update (possibly with joins) to the remote side?

-- no inheritance
explain (costs off, verbose) update ffoo f set a = f.a + 1 from fbar b
where f.a = b.a;
  QUERY PLAN

──
 Update on public.ffoo f
   ->  Foreign Update
 Remote SQL: UPDATE public.foo r1 SET a = (r1.a + 1) FROM
public.bar r2 WHERE ((r1.a = r2.a))
(3 rows)

-- inheritance
explain (costs off, verbose) update p set aa = aa + 1 from ffoo f where
p.aa = f.a;
QUERY PLAN

───
 Update on public.p
   Update on public.p1
   Update on public.p2
   Foreign Update on public.p3
   ->  Nested Loop
 Output: (p1.aa + 1), p1.ctid, f.*
 ->  Seq Scan on public.p1
   Output: p1.aa, p1.ctid
 ->  Foreign Scan on public.ffoo f
   Output: f.*, f.a
   Remote SQL: SELECT a FROM public.foo WHERE (($1::integer = a))
   ->  Nested Loop
 Output: (p2.aa + 1), p2.ctid, f.*
 ->  Seq Scan on public.p2
   Output: p2.aa, p2.ctid
 ->  Foreign Scan on public.ffoo f
   Output: f.*, f.a
   Remote SQL: SELECT a FROM public.foo WHERE (($1::integer = a))
   ->  Foreign Update
 Remote SQL: UPDATE public.base3 r5 SET aa = (r5.aa + 1) FROM
public.foo r2 WHERE ((r5.aa = r2.a))
(20 rows)

Does that seem salvageable?

Thanks,
Amit




RE: SQL statement PREPARE does not work in ECPG

2019-02-19 Thread Takahashi, Ryohei
Hi Meskes-san,


Thank you for your reply.


> Please try this instead:
> 
> EXEC SQL PREPARE test_prep (int) AS SELECT id from test_table where id
> = $1;
> EXEC SQL EXECUTE test_prep using 2;
> 
> This should work.


I tried as follows.



  EXEC SQL PREPARE test_prep (int) AS SELECT id from test_table where id = $1;
  EXEC SQL EXECUTE test_prep using 2;




  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "prepare \"test_prep\" ( int 
) as \" select id from test_table where id = $1 \"", ECPGt_EOIT, ECPGt_EORT);
#line 16 "test_app.pgc"

if (sqlca.sqlcode < 0) error_exit ( );}
#line 16 "test_app.pgc"

  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "test_prep",
ECPGt_const,"2",(long)1,(long)1,strlen("2"),
ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT, ECPGt_EORT);
#line 17 "test_app.pgc"

if (sqlca.sqlcode < 0) error_exit ( );}
#line 17 "test_app.pgc"



Unfortunately, this does not work.
ECPGst_execute seems good, but the prepare statement is the same as in my first post.
It fails with "PostgreSQL error : -202[too few arguments on line 16]".

This error is caused by the following source code.



/* Check if there are unmatched things left. */
if (next_insert(stmt->command, position, stmt->questionmarks, std_strings) >= 0)
{
	ecpg_raise(stmt->lineno, ECPG_TOO_FEW_ARGUMENTS,
			   ECPG_SQLSTATE_USING_CLAUSE_DOES_NOT_MATCH_PARAMETERS, NULL);
	ecpg_free_params(stmt, false);
	return false;
}


if (text[p] == '$' && isdigit((unsigned char) text[p + 1]))
{
	/* this can be either a dollar quote or a variable */
	int			i;

	for (i = p + 1; isdigit((unsigned char) text[i]); i++)
		 /* empty loop body */ ;
	if (!isalpha((unsigned char) text[i]) &&
		isascii((unsigned char) text[i]) && text[i] != '_')
		/* not dollar delimited quote */
		return p;
}


I think next_insert() should ignore "$n" in the case of SQL statement PREPARE.


In addition, we should fix the following, right?

(1)
As Matsumura-san wrote, ECPG should not produce '"' for the SQL statement PREPARE.

(2)
ECPG should produce the argument for the EXECUTE statement, such as "EXEC SQL EXECUTE
test_prep (2);"



Regards,
Ryohei Takahashi



Re: CPU costs of random_zipfian in pgbench

2019-02-19 Thread Peter Geoghegan
On Tue, Feb 19, 2019 at 7:14 AM Fabien COELHO  wrote:
> What I like in "pgbench" is that it is both versatile and simple so that
> people can benchmark their own data with their own load and their own
> queries by writing a few lines of trivial SQL and psql-like slash command
> and adjusting a few options, and extract meaningful statistics out of it.

That's also what I like about it. However, I don't think that pgbench
is capable of helping me answer questions that are not relatively
simple. That is going to become less and less interesting over time.

> I have not encountered other tools with this versatility and simplicity.
> The TPC-C implementation you point out and others I have seen are
> structurally targetted at TPC-C and nothing else. I do not care about
> TPC-C per se, I care about people being able to run relevant benchmarks
> with minimal effort.

Lots and lots of people care about TPC-C. Far more than care about
TPC-B, which has been officially obsolete for a long time. I don't
doubt that there are some bad reasons for the interest that you see
from vendors, but the TPC-C stuff has real merit (just read Jim Gray,
who you referenced in relation to the Zipfian generator). Lots of
smart people worked for a couple of years on the original
specification of TPC-C. There are a lot of papers on TPC-C. It *is*
complicated in various ways, which is a good thing, as it approximates
a real-world workload, and exercises a bunch of code paths that TPC-B
does not. TPC-A and TPC-B were early attempts, and managed to be
better than nothing at a time when performance validation was not
nearly as advanced as it is today.

> > I have been using BenchmarkSQL as a fair-use TPC-C implementation for
> > my indexing project, with great results. pgbench just isn't very
> > useful when validating the changes to B-Tree page splits that I
> > propose, because the insertion pattern cannot be modeled
> > probabilistically.
>
> I do not understand the use case, and why pgbench could not be used for
> this purpose.

TPC-C is characterized by *localized* monotonically increasing
insertions in most of its indexes. By far the biggest index is the
order lines table primary key, which is on '(ol_w_id, ol_d_id,
ol_o_id, ol_number)'. You get pathological performance with this
currently, because you should really split at the point where new
items are inserted, but we do a standard 50/50 page split. The left
half of the page isn't inserted into again (except by rare non-HOT
updates), so you end up *reliably* wasting about half of all space in
the index.

IOW, there are cases where we should behave like we're doing a
rightmost page split (kind of), that don't happen to involve the
rightmost page. The problem was described but not diagnosed in this
blog post: https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
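
A rough SQL sketch of the index described above (the key columns are from the
text; the table name and column types are assumptions, not taken from the
TPC-C spec):

CREATE TABLE order_line (
    ol_w_id   int NOT NULL,   -- warehouse
    ol_d_id   int NOT NULL,   -- district
    ol_o_id   int NOT NULL,   -- order id, ever-increasing per (warehouse, district)
    ol_number int NOT NULL,   -- line number within the order
    PRIMARY KEY (ol_w_id, ol_d_id, ol_o_id, ol_number)
);
-- New orders keep appending rows with ever-larger ol_o_id within each
-- (ol_w_id, ol_d_id) group, so inserts land at the right edge of that group's
-- key range: localized, monotonically increasing insertions rather than
-- uniformly random ones, which is what defeats the 50/50 split heuristic.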

If you had random insertions (or insertions that were characterized or
defined in terms of a probability distribution and range), then you
would not see this problem. Instead, you'd get something like 70%
space utilization -- not 50% utilization. I think that it would be
difficult if not impossible to reproduce the pathological performance
with pgbench, even though it's a totally realistic scenario. There
needs to be explicit overall ordering/phases across co-operating
backends, or backends that are in some sense localized (e.g.
associated with a particular warehouse inputting a particular order).
TPC-C offers several variations of this same pathological case.

This is just an example. The point is that there is a lot to be said
for investing significant effort in coming up with a benchmark that is
a distillation of a real workload, with realistic though still kind of
adversarial bottlenecks. I wouldn't have become aware of the page
split problem without TPC-C, which suggests to me that the TPC people
know what they're doing. Also, there is an advantage to having
something that is a known quantity, that enables comparisons across
systems.

I also think that TPC-E is interesting, since it stresses OLTP systems
in a way that is quite different to TPC-C. It's much more read-heavy,
and has many more secondary indexes.

> Yep. Pgbench only does "simple stats". I script around the per-second
> progress output for graphical display and additional stats (eg 5 number
> summary…).

It's far easier to spot regressions over time and other such surprises
if you have latency graphs that break down latency by transaction.
When you're benchmarking queries with joins, then you need to be
vigilant of planner issues over time. The complexity has its pluses as
well as its minuses.

I'm hardly in a position to tell you what to work on. I think that
there may be another perspective on this that you could take something
away from, though.

-- 
Peter Geoghegan



Re: Using old master as new replica after clean switchover

2019-02-19 Thread Michael Paquier
On Tue, Feb 19, 2019 at 04:27:02PM -0800, RSR999GMAILCOM wrote:
> So I wanted to clarify: does this procedure really require the WAL archive
> location to be on shared storage?

Shared storage for WAL archives is not a requirement.  It is perfectly
possible to use streaming replication to get the correct WAL changes.
Using an archive is recommended for some deployments, depending on your
requirements and data retention policy; those archives can live on a
different host, with the archive_command of the primary saving the
segments there and the restore_command of the standby in recovery
fetching them back.  How much this matters depends on the frequency at
which new WAL segments are generated.
--
Michael




Re: Prepared transaction releasing locks before deregistering its GID

2019-02-19 Thread Michael Paquier
On Tue, Feb 19, 2019 at 08:17:14PM +0100, Oleksii Kliukin wrote:
> I gave it a spin on the same VM host as shown to constantly reproduce the
> issue and observed neither 'identifier already in use' nor any locking
> issues over a few dozens of runs, so it looks good to me.

Thanks for the confirmation.  I'll try to wrap up this one then.

> That was HEAD, but since FinishPreparedTransaction seems to be identical
> there and on the back branches it should work for PG 10 and 11 as well.

The issue has existed since two-phase commit was added, down to 8.1 if
my read of the code is right, so it took 14 years to be found.
Congrats.  I have also manually tested that the code is broken down to
9.4.

It is good that you stress-test 2PC the way you do; that's the
second bug you have spotted, after the duplicate XIDs in the standby
snapshots which are now lazily discarded at recovery.  Now, it seems
to me that the potential ABI breakage argument (which can be solved by
introducing an extra routine, meaning more conflicts to handle
when back-patching 2PC stuff) and the time it took to find the issue
point toward patching only HEAD.  In the event of a back-patch, on top
of the rest we would need to rework RemoveGXact so that it does not
take an LWLock on two-phase state data in 10 and upwards, so that would
be invasive, and we have way less automated testing in those versions.
--
Michael




RE: Protect syscache from bloating with negative cache entries

2019-02-19 Thread Tsunakawa, Takayuki
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> number of tables   | 100  |1000|10000
> ---
> TPS (master)   |10966  |10654 |9099
> TPS (patch)| 11137 (+1%) |10710 (+0%) |772 (-91%)
> 
> It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
> the results are almost the same as master, but after exceeding the limit
> (at 10000) the decline happens.

How many concurrent clients?

Can you show the perf's call graph sampling profiles of both the unpatched and 
patched version, to confirm that the bottleneck is around catcache eviction and 
refill?


Regards
Takayuki Tsunakawa




Using old master as new replica after clean switchover

2019-02-19 Thread RSR999GMAILCOM
Hello Postgres Gurus,

After searching (on www.postgresql.org/Google) I found that the following
steps can be used to perform a switchover in Postgres (version 9.3):
*Step 1.* Do a clean shutdown of the Primary (-m fast or smart).
*Step 2.* Check the sync status and recovery status of the Standby before
promoting it (see the sample queries below). Once the Standby is in complete
sync, we are safe to promote it as the new Primary.
*Step 3.* Open the Standby as the new Primary with pg_ctl promote or by
creating a trigger file.
*Step 4.* Restart the old Primary as a standby and allow it to follow the new
timeline by setting "recovery_target_timeline='latest'" in the
$PGDATA/recovery.conf file.
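
For reference, the checks in Step 2 can be done with queries like the
following (names as of 9.3; purely illustrative):

-- On the primary, before shutting it down:
SELECT application_name, state, sync_state,
       sent_location, replay_location
FROM pg_stat_replication;

-- On the standby:
SELECT pg_is_in_recovery();
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();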

But I also read in one of the Google posts that this procedure requires the
WAL archive location to be on shared storage to which both the Master and
the Slave have access.

So I wanted to clarify: does this procedure really require the WAL archive
location to be on shared storage?

Thanks
Raj


RE: Protect syscache from bloating with negative cache entries

2019-02-19 Thread Tsunakawa, Takayuki
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> 0.7% may easily be just a noise, possibly due to differences in layout
> of the binary. How many runs? What was the variability of the results
> between runs? What hardware was this tested on?

3 runs, with a variability of about +-2%.  Luckily, all three runs
(incidentally?) showed a slight performance decrease with the patched version.
The figures I wrote are the highest ones.

The hardware is a very poor man's VM:
CPU: 4 core Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
RAM: 4GB



> FWIW I doubt tests with such a small schema are proving anything -
> the cache/lists are likely tiny. That's why I tested with a much larger
> number of relations.

Yeah, it requires many relations to test the repeated catcache eviction and new 
entry creation, which would show the need to increase catalog_cache_max_size 
for users.  I tested with a small number of relations to see the impact of
additional processing while the catcache is not full -- memory accounting and 
catcache LRU chain maintenance.


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

2019-02-19 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>But at the same time, I did some benchmarking with only the hard limit option
>enabled and the time-related option disabled, because the figures for this
>case are not provided in this thread.
>So let me share it.

I'm sorry, but I'm taking back the results for the patch and correcting them.
I configured postgresql (master) with only 'CFLAGS=-O2',
but I misconfigured postgres (patch applied) with
--enable-cassert --enable-debug --enable-tap-tests 'CFLAGS=-O0'.
These debug options (especially --enable-cassert) caused enormous overhead.
(I thought I had checked the configure options... I was maybe tired.)
So I changed these to only 'CFLAGS=-O2' and re-measured.

>I did two experiments. One is to show that negative cache bloat is suppressed.
>This thread originated from the issue that the negative cache of pg_statistics
>is bloating as creating and dropping a temp table is repeatedly executed.
>https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp
>Using the script attached to the first email in this thread, I repeated
>create and drop temp table 10000 times.
>(experiment is repeated 5 times. catalog_cache_max_size = 500kB.
> compared master branch and patch with hard memory limit)
>
>Here are TPS and CacheMemoryContext 'used' memory (total - freespace)
>calculated by MemoryContextPrintStats() at 100, 1000, 10000 times of
>create-and-drop transaction. The result shows cache bloating is suppressed
>after exceeding the limit (at 10000) but tps declines regardless of the limit.
>
>number of tx (create and drop)    | 100   |1000    |10000
>---
>used CacheMemoryContext (master)  |610296 |2029256 |15909024
>used CacheMemoryContext (patch)   |755176 |880552  |880592
>---
>TPS (master) |414   |407 |399
>TPS (patch)   |242   |225 |220

Correct one:
number of tx (create and drop)   | 100  |1000 |10000
---
TPS (master)  |414   |407  |399
TPS (patch)   |447   |415  |409

The results for master and the patch are almost the same.
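
For reference, the first experiment's workload (described in the quoted text
above) has roughly this shape; the real script is the one attached to the
first email of the thread, so this is only a sketch:

DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        EXECUTE 'CREATE TEMP TABLE cache_bloat_tmp (a int)';
        -- planning a query on the just-created table probes pg_statistic,
        -- which has no rows for it yet; those failed lookups are the
        -- negative catcache entries that accumulate
        EXECUTE 'SELECT count(*) FROM cache_bloat_tmp';
        EXECUTE 'DROP TABLE cache_bloat_tmp';
    END LOOP;
END $$;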


>Another experiment is using Tomas's script posted a while ago. The scenario is
>to do "select 1" from multiple tables randomly (uniform distribution).
>(experiment is repeated 5 times. catalog_cache_max_size = 10MB.
> compared master branch and patch with only hard memory limit enabled)
>
>Before doing the benchmark, I checked that pruning happened only at 10000
>tables, using a debug option. The result shows degradation regardless of
>before or after pruning.
>I personally still need hard size limitation but I'm surprised that the 
>difference is so
>significant.
>
>number of tables   | 100  |1000|10000
>---
>TPS (master)   |10966  |10654 |9099
>TPS (patch)|4491   |2099 |378

Correct one:
number of tables   | 100  |1000|10000
---
TPS (master)   |10966  |10654 |9099
TPS (patch)| 11137 (+1%) |10710 (+0%) |772 (-91%)

It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
the results are almost the same as master, but after exceeding the limit
(at 10000) the decline happens.


Regards,
Takeshi Ideriha



Re: Row Level Security − leakproof-ness and performance implications

2019-02-19 Thread Laurenz Albe
Pierre Ducroquet wrote:
> In order to increase our security, we have started deploying row-level 
> security in order to add another safety net if any issue was to happen in our 
> applications.
> After a careful monitoring of our databases, we found out that a lot of 
> queries started to go south, going extremely slow.
> The root of these slowdowns is that a lot of the PostgreSQL functions are not 
> marked as leakproof, especially the ones used for operators.
> In current git master, the following query returns 258 functions that are 
> used 
> by operators returning booleans and not marked leakproof:
> 
> SELECT proname FROM pg_proc 
> WHERE exists(select 1 from pg_operator where oprcode::name = proname) 
> AND NOT(proleakproof) 
> AND prorettype = 16;
> 
> 
> Among these, if you have for instance a table like:
> create table demo_enum (
> username text,
> flag my_enum
> );
> 
> With an index on (username, flag), the index will only be used on username 
> because flag is an enum and the enum_eq function is not marked leakproof.
> 
> For simple functions like enum_eq/enum_ne, marking them leakproof is an 
> obvious fix (patch attached to this email, including also textin/textout). And

The enum_eq part looks totally safe, but the text functions allocate memory,
so you could create memory pressure, wait for error messages that tell you
the size of the allocation that failed and this way learn about the data.

Is this a paranoid idea?

> if we had a 'RLS-enabled' context on functions, a way to make a lot of built-
> in functions leakproof would simply be to prevent them from logging errors 
> containing values.
> 
> For others, like arraycontains, it's much trickier :[...]

I feel a little out of my depth here, so I won't comment.

> A third issue we noticed is the usage of functional indexes. If you create an 
> index on, for instance, (username, leaky_function(my_field)), and then search 
> on leaky_functon(my_field) = 42, the functional index can be used only if 
> leaky_function is marked leakproof, even if it is never going to be executed 
> on invalid rows thanks to the index. After a quick look at the source code, 
> it 
> also seems complicated to implement since the decision to reject potential 
> dangerous calls to leaky_function is done really early in the process, before 
> the optimizer starts.

If there is a bitmap index scan, the condition will be rechecked, so the
function will be executed.

> I am willing to spend some time on these issues, but I have no specific 
> knowledge of the planner and optimizer, and I fear proper fixes would require 
> much digging there. If you think it would be appropriate for functions like 
> arraycontains or range_contains to require the 'inner' comparison function to 
> be leakproof, or just keep looking at the other functions in pg_proc and 
> leakproof the ones that can be, I would be happy to write the corresponding 
> patches.

Thanks, and I think that every function that can safely be marked leakproof
is progress!
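
(For local experimentation, the effect on plans can be reproduced without any
catalog patch by flipping the flag by hand; LEAKPROOF can only be set by a
superuser:)

ALTER FUNCTION enum_eq(anyenum, anyenum) LEAKPROOF;

-- check the flag:
SELECT proname, proleakproof FROM pg_proc WHERE proname = 'enum_eq';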

Yours,
Laurenz Albe




Re: WAL insert delay settings

2019-02-19 Thread Tomas Vondra



On 2/19/19 8:40 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
>> On 2/19/19 8:22 PM, Andres Freund wrote:
>>> And my main point is that even if you implement a proper bytes/sec limit
>>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
>>> meaningfully more confusing than right now.
>>>
>>
>> So, why not to modify autovacuum to also use this approach? I wonder if
>> the situation there is more complicated because of multiple workers
>> sharing the same budget ...
> 
> I think the main reason is that implementing a scheme like this for WAL
> rate limiting isn't a small task, but it'd be aided by the fact that
> it'd probably not be on by default, and that there'd not be any regressions
> because the behaviour didn't exist before. In contrast, people are
> extremely sensitive to autovacuum behaviour changes, even if it's to
> improve autovacuum. I think it makes more sense to build the logic in a
> lower-profile case first, and then migrate autovacuum over to it. Even
> leaving the maturity issue aside, reducing the scope of the project into
> more bite-sized chunks seems to increase the likelihood of getting
> anything substantial.
> 

Maybe.

I guess the main thing I'm advocating for here is to aim for a unified
throttling approach, not multiple disparate approaches interacting in
ways that are hard to understand/predict.

The time-based approach you described looks fine, an it's kinda what I
was imagining (and not unlike the checkpoint throttling). I don't think
it'd be that hard to tweak autovacuum to use it too, but I admit I have
not thought about it particularly hard and there's stuff like per-table
settings which might make it more complex.

So maybe doing it for WAL first makes sense. But I still think we need
to have at least a rough idea how it interacts with the existing
throttling and how it will work in the end.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



More smarts about CoerceViaIO, and less stupidity about ArrayCoerceExpr

2019-02-19 Thread Tom Lane
I poked into a recent complaint[1] about PG not being terribly smart
about whether an IS NOT NULL index predicate is implied by a WHERE
clause, and determined that there are a couple of places where we
are being less bright than we could be about CoerceViaIO semantics.
CoerceViaIO is strict independently of whether the I/O functions it
calls are (and they might not be --- in particular, domain_in isn't).
However, not everyplace knows that:

* clauses.c's contain_nonstrict_functions_walker() uses default logic
that will examine the referenced I/O functions to see if they're strict.
That's expensive, requiring several syscache lookups, and it might not
even give the right answer --- though fortunately it'd err in the
conservative direction.

* predtest.c's clause_is_strict_for() doesn't know anything about
CoerceViaIO, so it fails to make the proof requested in [1].

I started to fix this, and was in the midst of copying-and-pasting
contain_nonstrict_functions_walker's handling of ArrayCoerceExpr,
when I realized that that code is actually wrong:

return expression_tree_walker((Node *) ((ArrayCoerceExpr *) node)->arg,
  contain_nonstrict_functions_walker,
  context);

It should be recursing to itself, not to expression_tree_walker.
As coded, the strictness check doesn't get applied to the immediate
child node of the ArrayCoerceExpr, so that if that node is non-strict
we may arrive at the wrong conclusion.

contain_nonstrict_functions() isn't used in very many places, fortunately,
and ArrayCoerceExpr isn't that easy to produce either, which may explain
the lack of field reports.  I was able to cons up this example though,
which demonstrates an incorrect conclusion about whether it's safe to
inline a function declared STRICT:

regression=# create table t1 (f1 int);
CREATE TABLE
regression=# insert into t1 values(1),(null);
INSERT 0 2
regression=# create or replace function sfunc(int) returns int[] language sql
as 'select array[0, $1]::bigint[]::int[]' strict;
CREATE FUNCTION
regression=# select sfunc(f1) from t1;  
  sfunc   
--
 {0,1}
 {0,NULL}
(2 rows)

Of course, since sfunc is strict, that last output should be NULL not
an array containing a NULL.

The attached patch fixes both of these things.  At least the second
hunk needs to be back-patched.  I'm less sure about whether the
CoerceViaIO changes merit back-patching; they're not fixing wrong
answers, but they are responding to a field complaint.  Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/CAHkN8V9Rfh6uAjQLURJfnHsQfC_MYiFUSWEVcwVSiPdokmkniw%40mail.gmail.com

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 3737166..501b0e9 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -1172,6 +1172,16 @@ contain_nonstrict_functions_walker(Node *node, void *context)
 		return true;
 	if (IsA(node, FieldStore))
 		return true;
+	if (IsA(node, CoerceViaIO))
+	{
+		/*
+		 * CoerceViaIO is strict regardless of whether the I/O functions are,
+		 * so just go look at its argument; asking check_functions_in_node is
+		 * useless expense and could deliver the wrong answer.
+		 */
+		return contain_nonstrict_functions_walker((Node *) ((CoerceViaIO *) node)->arg,
+  context);
+	}
 	if (IsA(node, ArrayCoerceExpr))
 	{
 		/*
@@ -1179,9 +1189,8 @@ contain_nonstrict_functions_walker(Node *node, void *context)
 		 * the per-element expression is; so we should ignore elemexpr and
 		 * recurse only into the arg.
 		 */
-		return expression_tree_walker((Node *) ((ArrayCoerceExpr *) node)->arg,
-	  contain_nonstrict_functions_walker,
-	  context);
+		return contain_nonstrict_functions_walker((Node *) ((ArrayCoerceExpr *) node)->arg,
+  context);
 	}
 	if (IsA(node, CaseExpr))
 		return true;
diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c
index 01f64ee..179b9aa 100644
--- a/src/backend/optimizer/util/predtest.c
+++ b/src/backend/optimizer/util/predtest.c
@@ -1352,6 +1352,15 @@ clause_is_strict_for(Node *clause, Node *subexpr)
 		return false;
 	}
 
+	/*
+	 * CoerceViaIO is strict (whether or not the I/O functions it calls are).
+	 * This is worth checking mainly because it saves us having to explain to
+	 * users why some type coercions are known strict and others aren't.
+	 */
+	if (IsA(clause, CoerceViaIO))
+		return clause_is_strict_for((Node *) ((CoerceViaIO *) clause)->arg,
+	subexpr);
+
 	return false;
 }
 


Re: Compressed TOAST Slicing

2019-02-19 Thread Paul Ramsey
On Sat, Feb 16, 2019 at 7:25 AM Simon Riggs  wrote:

> Could we get a similarly optimized implementation of the -> operator for JSONB 
> as well?
> Are there any other potential uses? Best to fix em all up at once and then 
> move on to other things. Thanks.

Oddly enough, I couldn't find many/any things that were sensitive to
left-end decompression. The only exception is "LIKE this%", which
clearly would be helped, but that unfortunately wouldn't be a quick
drop-in; it would rather require a major reorganization of the regex handling.

I had a look at "->" and I couldn't see how a slice could be used to
make it faster. We don't know a priori how big a slice would give us
what we want. This again makes Stephen's case for an iterator, but of
course all the iterator benefits only come when the actual function at
the top (in this case the json parser) is also updated to be
iterative.

Committing this little change doesn't preclude an iterator, or even
make doing one more complicated... :)

P.



Re: Some thoughts on NFS

2019-02-19 Thread Thomas Munro
On Wed, Feb 20, 2019 at 7:58 AM Robert Haas  wrote:
> On Tue, Feb 19, 2019 at 1:56 PM Andres Freund  wrote:
> > My point is that for iSCSI to be performant we'd need *all* the
> > infrastructure we also need for direct IO *and* a *lot* more. And that
> > it seems insane to invest very substantial resources into developing our
> > own iSCSI client when we don't even have DIO support. And DIO support
> > would allow us to address the error reporting issues, while also
> > drastically improving performance in a lot of situations. And we'd not
> > have to essentially develop our own filesystem etc.
>
> OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
> and then the iSCSI patch can go in on Thursday.  OK?  :-)

Not something I paid a lot of attention to as an application
developer, but in a past life I have seen a lot of mission critical
DB2 and Oracle systems running on ext4 or XFS over (kernel) iSCSI
plugged into big monster filers, and I think perhaps also cases
of NFS, but those systems use DIO by default (and the latter has its
own NFS client IIUC).  So I suspect that if you can just get DIO merged today
we can probably skip the userland iSCSI and call it done.  :-P


--
Thomas Munro
https://enterprisedb.com



Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread David Rowley
On Wed, 20 Feb 2019 at 10:49, Tom Lane  wrote:
> What if we dropped that idea, and instead defined the plan tree as
> returning only the columns that are updated by SET, plus the row
> identity?  It would then be the ModifyTable node's job to fetch the
> original tuple using the row identity (which it must do anyway) and
> form the new tuple by combining the updated columns from the plan
> output with the non-updated columns from the original tuple.
>
> DELETE would be even simpler, since it only needs the row identity
> and nothing else.

While I didn't look at the patch in great detail, I think this is how
Pavan must have made MERGE work for partitioned targets. I recall
seeing the tableoid being added to the target list and a lookup of the
ResultRelInfo by tableoid.

Maybe Pavan can provide more useful details than I can.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 16:48:55 -0500, Tom Lane wrote:
> I have no idea how this might play with the pluggable-storage work.

I don't think it'd have a meaningful impact, except for needing changes
to an overlapping set of lines. But given the different timeframes, I'd
not expect a problem with that.

Greetings,

Andres Freund



Re: propagating replica identity to partitions

2019-02-19 Thread Alvaro Herrera
On 2019-Feb-19, Robert Haas wrote:

> On Tue, Feb 19, 2019 at 3:40 PM Alvaro Herrera  
> wrote:
> > Maybe we should be using the inheritance marker in the command to note
> > whether to recurse to partitions?  That is, if you say
> >   ALTER TABLE ONLY parent SET REPLICA IDENTITY
> > then we don't recurse and just change the parent table and future
> > partitions, whereas if you leave out the ONLY then it does affect
> > existing partitions.
> >
> > However, I think TABLESPACE and any other property that involves
> > rewriting tables wholesale can reasonably be expected to behave
> > differently (w.r.t. recursing) from commands that just modify the
> > catalogs.  I think we already have such a distinction somewhere.
> 
> I don't really follow why that should be different, or why the user
> should be expected to know or care whether a particular property
> involves rewriting.

OK, let me concede that point -- it's not rewriting that makes things
act differently, but rather that TABLESPACE (as well as some other things)
behaves that way.  ALTER TABLE ... SET DATA TYPE is the obvious
counterexample.

The Notes section of ALTER TABLE says:

: The actions for identity columns (ADD GENERATED, SET etc., DROP IDENTITY), as
: well as the actions TRIGGER, CLUSTER, OWNER, and TABLESPACE never recurse to
: descendant tables; that is, they always act as though ONLY were specified.
: Adding a constraint recurses only for CHECK constraints that are not marked NO
: INHERIT.

Since REPLICA IDENTITY does not appear in this list, the documented
behavior is to recurse, per the description of the "name" parameter:

: The name (optionally schema-qualified) of an existing table to
: alter. If ONLY is specified before the table name, only that table
: is altered. If ONLY is not specified, the table and all its
: descendant tables (if any) are altered. Optionally, * can be
: specified after the table name to explicitly indicate that
: descendant tables are included.

I didn't come up with this on my own, as you imply.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Some thoughts on NFS

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-20 11:25:22 +1300, Thomas Munro wrote:
> This seems to make sense, and has the advantage that it uses
> interfaces that exist right now.  But it seems a bit like we'll have
> to wait for them to finish building out the errseq_t support for NFS
> to avoid various races around the mapping's AS_EIO flag (A: fsync() ->
> EIO, B: fsync() -> SUCCESS, log checkpoint; A: panic), and then maybe
> we'd have to get at least one of { fd-passing, direct IO, threads }
> working on our side ...

I think we could "just" make use of DIO for relation extensions when
detecting NFS. Given that we just about never actually read the result
of the file extension write, just converting that write to DIO shouldn't
have that bad an overall impact - of course it'll cause slowdowns, but
only while extending files. And that ought to handle ENOSPC correctly,
while leaving the EIO handling separate?

Greetings,

Andres Freund



Re: Some thoughts on NFS

2019-02-19 Thread Thomas Munro
On Wed, Feb 20, 2019 at 5:52 AM Andres Freund  wrote:
> > 1.  Figure out how to get the ALLOCATE command all the way through the
> > stack from PostgreSQL to the remote NFS server, and know for sure that
> > it really happened.  On the Debian buster Linux 4.18 system I checked,
> > fallocate() reports EOPNOTSUPP for fallocate(), and posix_fallocate()
> > appears to succeed but it doesn't really do anything at all (though I
> > understand that some versions sometimes write zeros to simulate
> > allocation, which in this case would be equally useless as it doesn't
> > reserve anything on an NFS server).  We need the server and NFS client
> > and libc to be of the right version and cooperate and tell us that
> > they have really truly reserved space, but there isn't currently a way
> > as far as I can tell.  How can we achieve that, without writing our
> > own NFS client?
> >
> > 2.  Deal with the resulting performance suckage.  Extending 8kb at a
> > time with synchronous network round trips won't fly.
>
> I think I'd just go for fsync();pwrite();fsync(); as the extension
> mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
> to make sure there's no previous errors that we could mistake for
> ENOSPC, the pwrite to extend, the second fsync to make sure there's
> actually space. Then we can detect ENOSPC properly.  That possibly does
> leave some errors where we could mistake ENOSPC as something more benign
> than it is, but the cases seem pretty narrow, due to the previous
> fsync() (maybe the other side could be thin provisioned and get an
> ENOSPC there - but in that case we didn't actually lose any data. The
> only dangerous scenario I can come up with is that the remote side is on
> thinly provisioned CoW system, and a concurrent write to an earlier
> block runs out of space - but seriously, good riddance to you).

This seems to make sense, and has the advantage that it uses
interfaces that exist right now.  But it seems a bit like we'll have
to wait for them to finish building out the errseq_t support for NFS
to avoid various races around the mapping's AS_EIO flag (A: fsync() ->
EIO, B: fsync() -> SUCCESS, log checkpoint; A: panic), and then maybe
we'd have to get at least one of { fd-passing, direct IO, threads }
working on our side ...

-- 
Thomas Munro
https://enterprisedb.com



Re: Some thoughts on NFS

2019-02-19 Thread Thomas Munro
On Tue, Feb 19, 2019 at 8:03 PM Thomas Munro  wrote:
> A theoretical question I thought of is whether there are any
> interleavings of operations that allow a checkpoint to complete
> bogusly, while a concurrent close() in a regular backend fails with
> EIO for data that was included in the checkpoint, and panics.  I
> *suspect* the answer is that every interleaving is safe for 4.16+
> kernels that report IO errors to every descriptor.  In older kernels I
> wonder if there could be a schedule where an arbitrary backend eats
> the error while closing, then the checkpointer calls fsync()
successfully and then logs a checkpoint, and then the arbitrary
> backend panics (too late).  I suspect EIO on close() doesn't happen in
> practice on regular local filesystems, which is why I mention it in
> the context of NFS, but I could be wrong about that.

Ugh.  It looks like Linux NFS doesn't even use the new errseq_t
machinery in 4.16+.  So even if we had the fd-passing patch, I think
there may be a dangerous schedule like this:

A: close() -> EIO, clears AS_EIO flag
B: fsync() -> SUCCESS, log a checkpoint
A: panic!  (but it's too late, we already logged a checkpoint but
didn't flush all the dirty data that belonged to it)

-- 
Thomas Munro
https://enterprisedb.com



Another way to fix inherited UPDATE/DELETE

2019-02-19 Thread Tom Lane
While contemplating the wreckage of 
https://commitfest.postgresql.org/22/1778/
I had the beginnings of an idea of another way to fix that problem.

The issue largely arises from the fact that for UPDATE, we expect
the plan tree to emit a tuple that's ready to be stored back into
the target rel ... well, almost, because it also has a CTID or some
other row-identity column, so we have to do some work on it anyway.
But the point is this means we potentially need a different
targetlist for each child table in an inherited UPDATE.

What if we dropped that idea, and instead defined the plan tree as
returning only the columns that are updated by SET, plus the row
identity?  It would then be the ModifyTable node's job to fetch the
original tuple using the row identity (which it must do anyway) and
form the new tuple by combining the updated columns from the plan
output with the non-updated columns from the original tuple.

DELETE would be even simpler, since it only needs the row identity
and nothing else.

Having done that, we could toss inheritance_planner into the oblivion
it so richly deserves, and just treat all types of inheritance or
partitioning queries as expand-at-the-bottom, as SELECT has always
done it.

Arguably, this would be more efficient even for non-inheritance join
situations, as less data (typically) would need to propagate through the
join tree.  I'm not sure exactly how it'd shake out for trivial updates;
we might be paying for two tuple deconstructions not one, though perhaps
there's a way to finesse that.  (One easy way would be to stick to the
old approach when there is no inheritance going on.)

In the case of a standard inheritance or partition tree, this seems to
go through really easily, since all the children could share the same
returned CTID column (I guess you'd also need a TABLEOID column so you
could figure out which table to direct the update back into).  It gets
a bit harder if the tree contains some foreign tables, because they might
have different concepts of row identity, but I'd think in most cases you
could still combine those into a small number of output columns.
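
As a concrete illustration of the shape of query this is about (table names
made up):

CREATE TABLE parent (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE child1 PARTITION OF parent FOR VALUES FROM (0) TO (100);
CREATE TABLE child2 PARTITION OF parent FOR VALUES FROM (100) TO (200);

-- Under the proposal, the plan for this statement would emit only the SET
-- column (b) plus a row identity (e.g. ctid + tableoid); ModifyTable would
-- re-fetch each old tuple and fill in the non-updated columns itself.
UPDATE parent SET b = 'x' WHERE a = 42;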

I have no idea how this might play with the pluggable-storage work.

Obviously this'd be a major rewrite with no chance of making it into v12,
but it doesn't sound too big to get done during v13.

Thoughts?

regards, tom lane



Re: Delay locking partitions during INSERT and UPDATE

2019-02-19 Thread David Rowley
On Wed, 20 Feb 2019 at 06:36, Robert Haas  wrote:
>
> On Mon, Feb 18, 2019 at 6:15 PM Tom Lane  wrote:
> > I'm inclined to think that if we already have lock on the parent
> > partitioned table (thereby, IIUC, guaranteeing that its partitioning
> > info can't change) that the order in which we acquire the same lock
> > level on its partition(s) isn't very important.
>
> Well, as it turns out, there's also a pending patch (series) to remove
> that guarantee which I would like to get committed.  The thread for
> that is "ATTACH/DETACH PARTITION CONCURRENTLY".  I believe I've more
> or less got the issues with that patch sorted out, but I'm concerned
> about how it interacts with all of the other things that are currently
> in flight.  Delaying or skipping lock acquisition in some cases and
> weakening the locks we take in others is not a process that can
> continue indefinitely without at some point hitting a wall.

I'd say that here we should only discuss what this patch is doing, not
anything else that's in flight that you're concerned will conflict
with the ATTACH/DETACH PARTITION CONCURRENTLY patch.

During INSERT and UPDATE, not all partitions will always be locked
before executor startup. This patch removing the find_all_inheritors()
call from ExecSetupPartitionTupleRouting() is not going to break
ATTACH/DETACH PARTITION CONCURRENTLY if it wasn't already broken in
the first place.  With this patch, we're still obtaining locks after
execution has begun, it just may delay the locking until a bit later.
It was previously already happening after executor startup had begun,
so the window for the problems that you describe was already
non-zero.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: speeding up planning with partitions

2019-02-19 Thread Tom Lane
I wrote:
> OK, I'll make another pass over 0001 today.

So I started the day with high hopes for this, but the more I looked at
it the less happy I got, and finally I ran into something that looks to
be a complete show-stopper.  Namely, that the patch does not account
for the possibility of an inherited target rel being the outside for a
parameterized path to some other rel.  Like this example in the
regression database:

explain update a set aa = aa + 1
from tenk1 t where a.aa = t.unique2;

With HEAD, you get a perfectly nice plan that consists of an append
of per-child plans like this:

   ->  Nested Loop  (cost=0.29..8.31 rows=1 width=16)
 ->  Seq Scan on a  (cost=0.00..0.00 rows=1 width=10)
 ->  Index Scan using tenk1_unique2 on tenk1 t  (cost=0.29..8.30 rows=1 width=10)
   Index Cond: (unique2 = a.aa)

With the 0001 patch, this gets an Assert during set_base_rel_pathlists,
because indxpath.c tries to make a parameterized path for tenk1
with "a" as the outer rel.  Since tenk1's joinlist hasn't been
touched, it's still referencing the inheritance parent, and the
code notices that we haven't made a rowcount estimate for that.
Even if we had, we'd generate a Path referencing Vars of the parent
rel, which would not work.

Conceivably, such a Path could be fixed up later (say by applying
adjust_appendrel_attrs to it during createplan.c), but that is not
going to fix the fundamental problem: the cost estimate for such a
Path should vary depending on how large we think the outer rel is,
and we don't have a reasonable way to set that if we're trying to
make a one-size-fits-all Path for something that's being joined to
an inheritance tree with a widely varying set of relation sizes.

So I do not see any way to make this approach work without a
significant(?) sacrifice in the quality of plans.

I've got other issues with the patch too, but it's probably not
worth getting into them unless we can answer this objection.

regards, tom lane



Re: WAL insert delay settings

2019-02-19 Thread David Rowley
On Wed, 20 Feb 2019 at 07:28, Robert Haas  wrote:
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune.  For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> limit is tighter binds.  If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.

I had similar thoughts when I saw that Peter's proposal didn't seem
all that compatible with how the vacuum cost delays work today. I
agree the cost limit would have to turn into something time based
rather than points based.

To me, it seems just too crude to have a per-backend limit.  I think
global "soft" limits would be better. Let's say, for example, the DBA
would like to CREATE INDEX CONCURRENTLY on a 6TB table.  They think
this is going to take about 36 hours, so they start the operation at
the start of off-peak, which is expected to last 12 hours. This means
the create index is going to run for 2 off-peaks and 1 on-peak.   Must
they really configure the create index to run at a speed that is
suitable for running at peak-load?  That's pretty wasteful as surely
it could run much more quickly during the off-peak.

I know there's debate as to whether this can rate limit WAL, but if we can
find a way to do that, then it seems to me some settings like:

max_soft_global_wal_rate (MB/sec)
min_hard_local_wal_rate (MB/sec)

That way the rate limited process would slow down to
min_hard_local_wal_rate when the WAL rate of all processes is
exceeding max_soft_global_wal_rate.   The min_hard_local_wal_rate is
just there to ensure the process never stops completely. It can simply
tick along at that rate until the global WAL rate slows down again.

It's likely going to be easier to do something like that in WAL than
with buffer read/write/dirties since we already can easily see how
much WAL has been written by looking at the current LSN.

(Of course, these GUC names are not very good. I just picked them out
the air quickly to try and get their meaning across)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: propagating replica identity to partitions

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 3:40 PM Alvaro Herrera  wrote:
> Maybe we should be using the inheritance marker in the command to note
> whether to recurse to partitions?  That is, if you say
>   ALTER TABLE ONLY parent SET REPLICA IDENTITY
> then we don't recurse and just change the parent table and future
> partitions, whereas if you leave out the ONLY then it does affect
> existing partitions.
>
> However, I think TABLESPACE and any other property that involves
> rewriting tables wholesale can reasonably be expected to behave
> differently (w.r.t. recursing) from commands that just modify the
> catalogs.  I think we already have such a distinction somewhere.

I don't really follow why that should be different, or why the user
should be expected to know or care whether a particular property
involves rewriting.

> I have no argument against somebody doing that, but I don't think I have
> the resources to do in in time for 12; on the other hand, leaving a
> known misfeature unfixed because nobody has looked for hypothetical
> similar bugs would be bad.

Oh, come on.  If you don't have time to study the issue and come up
with a plan that can at least in theory be applied to all similar
cases in a principled way, whether or not you have time to apply it to
all of those cases, then you have no business committing anything at
all.  You're basically saying that you don't have time to do this
right, so you're just going to do something at random and hope it
doesn't create too many problems later.  That's terrible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re[2]: PGAdmin 4 don't refresh server info after restarting

2019-02-19 Thread Andrey Klychkov
> You should report this on a pgAdmin mailing list rather than a
> PostgreSQL mailing list.

Hi, of course, I made a mistake. Thank you for pointing me to the right place.


>Tuesday, 19 February 2019, 19:41 +03:00 from Robert Haas :
>
>On Mon, Feb 18, 2019 at 1:20 AM Andrey Klychkov < aaklych...@mail.ru > wrote:
>> Hello,
>> We've noticed that pgadmin 3.x / 4.x doesn't refresh server info like server 
>> version after restarting. It makes people confused.
>> For example,
>> 1. PgAdmin was running
>> 2. We upgraded postgres from 11.1 to 11.2
>> 3. PgAdmin was restarted
>> 4. We continued to see 11.1
>> 5. We must push "Disconnect from server" and connect again for updating 
>> server version.
>>
>> It would be convenient that PgAdmin checks server info from time to time, at 
>> least, when PgAdmin was restarted.
>
>You should report this on a pgAdmin mailing list rather than a
>PostgreSQL mailing list.
>
>-- 
>Robert Haas
>EnterpriseDB:  http://www.enterprisedb.com
>The Enterprise PostgreSQL Company


-- 
Regards,
Andrew K.


Re: propagating replica identity to partitions

2019-02-19 Thread Alvaro Herrera
On 2019-Feb-19, Robert Haas wrote:

> It's not unreasonable to use the parent's REPLICA IDENTITY setting as
> the default for new partitions, much as we now do for the TABLESPACE,
> because the parent's replica identity is otherwise without meaning.
> But I'm less convinced that it's reasonable to have changing the
> parent's replica identity recurse to existing children, because:
> 
> 1. That's different from the TABLESPACE case.  We shouldn't just treat
> each new instance of this problem as a green field where we can pick
> any old behavior at random; it should be consistent as far as
> reasonable,

Maybe we should be using the inheritance marker in the command to note
whether to recurse to partitions?  That is, if you say
  ALTER TABLE ONLY parent SET REPLICA IDENTITY
then we don't recurse and just change the parent table and future
partitions, whereas if you leave out the ONLY then it does affect
existing partitions.
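
In SQL terms, something like this (the REPLICA IDENTITY syntax is today's;
whether ONLY controls recursion to existing partitions is exactly the
behavior being proposed here, not what currently happens):

ALTER TABLE ONLY parent REPLICA IDENTITY FULL;  -- parent (and future partitions) only
ALTER TABLE parent REPLICA IDENTITY FULL;       -- would also recurse to existing partitions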

However, I think TABLESPACE and any other property that involves
rewriting tables wholesale can reasonably be expected to behave
differently (w.r.t. recursing) from commands that just modify the
catalogs.  I think we already have such a distinction somewhere.

EXPLAIN ALTER TABLE would sure be handy :-)

> 2. It confuses a change to the default for new partitions with a
> change to the value for existing partitions.  Say that you've got 5
> existing partitions that use one setting, but now you want new
> partitions to use a different setting.  You can't get that with your
> proposed semantics, because trying to change the default will change
> all the existing partitions to the new value also, even though that's
> not what you wanted.

If we make sure to heed ONLY (and I admit to not thinking about it in my
original patch) then I think this angle is covered.

> We should really, really try to figure out all of the things that are
> like this -- a property that is meaningless for the partitioned table
> itself but may serve as a useful default for its children -- and try
> to make them all work the same, ideally in one release, rather than
> changing them at different times, back-patching sometimes, and having
> no consistency in the details.

I have no argument against somebody doing that, but I don't think I have
the resources to do it in time for 12; on the other hand, leaving a
known misfeature unfixed because nobody has looked for hypothetical
similar bugs would be bad.

I'm not proposing to back-patch this, but I would love it if I can
borrow somebody's time machine to retroactively fix it for 11.0.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: proposal: pg_restore --convert-to-text

2019-02-19 Thread Euler Taveira
On Mon, Feb 18, 2019 at 19:21, Tom Lane  wrote:
>
> Euler Taveira  writes:
> > Since no one has stepped up, I took a stab at it. It will prohibit
> > standard output unless '-f -' is specified. The -l option also has the
> > same restriction.
>
> Hm, don't really see the need to break -l usage here.
>
After thinking about it, I reverted it.

> Pls add to next CF, if you didn't already.
>
Done.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training
From a0f84d0dd7148e8a167972e76ddbf2e05f20ca9d Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Sun, 17 Feb 2019 14:16:27 +
Subject: [PATCH] pg_restore supports stdout in --file

pg_restore defaults to standard output if neither -f nor -d is
specified. This behavior confuses users that expect an error if the
restore target (database or file) isn't specified. Other clients already
support '-f -' but pg_restore does not.

This change breaks backward compatibility: it errors out if neither -f
nor -d is specified, and '-f -' no longer creates a file called '-' but
instead writes to standard output.

Discussion: https://www.postgresql.org/message-id/87sgwrmhdv@news-spur.riddles.org.uk
---
 doc/src/sgml/ref/pg_restore.sgml |  4 ++--
 src/bin/pg_dump/pg_backup_archiver.c |  4 +++-
 src/bin/pg_dump/pg_restore.c |  7 +++
 src/bin/pg_dump/t/001_basic.pl   | 14 +++---
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 725acb1..c8c4c20 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -176,8 +176,8 @@
   

 Specify output file for generated script, or for the listing
-when used with -l. Default is the standard
-output.
+when used with -l. Use -f
+- for stdout.

   
  
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 8b55f59..e6b8b6a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1517,7 +1517,9 @@ SetOutput(ArchiveHandle *AH, const char *filename, int compression)
 {
 	int			fn;
 
-	if (filename)
+	if (filename && strcmp(filename, "-") == 0)
+		fn = fileno(stdout);
+	else if (filename)
 		fn = -1;
 	else if (AH->FH)
 		fn = fileno(AH->FH);
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 428e040..21cd6da 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -303,6 +303,13 @@ main(int argc, char **argv)
 		exit_nicely(1);
 	}
 
+	/* Complain if neither -f nor -d was specified (this restriction is not valid for TOC) */
+	if (!opts->dbname && !opts->filename && !opts->tocSummary)
+	{
+		fprintf(stderr, _("%s: option -d/--dbname or -f/--file should be specified\n"), progname);
+		exit_nicely(1);
+	}
+
 	/* Should get at most one of -d and -f, else user is confused */
 	if (opts->dbname)
 	{
diff --git a/src/bin/pg_dump/t/001_basic.pl b/src/bin/pg_dump/t/001_basic.pl
index a875d54..0cfda48 100644
--- a/src/bin/pg_dump/t/001_basic.pl
+++ b/src/bin/pg_dump/t/001_basic.pl
@@ -50,7 +50,7 @@ command_fails_like(
 );
 
 command_fails_like(
-	[ 'pg_restore', '-s', '-a' ],
+	[ 'pg_restore', '-s', '-a', '-f -' ],
 	qr/\Qpg_restore: options -s\/--schema-only and -a\/--data-only cannot be used together\E/,
 	'pg_restore: options -s/--schema-only and -a/--data-only cannot be used together'
 );
@@ -66,7 +66,7 @@ command_fails_like(
 	'pg_dump: options -c/--clean and -a/--data-only cannot be used together');
 
 command_fails_like(
-	[ 'pg_restore', '-c', '-a' ],
+	[ 'pg_restore', '-c', '-a', '-f -' ],
 	qr/\Qpg_restore: options -c\/--clean and -a\/--data-only cannot be used together\E/,
 	'pg_restore: options -c/--clean and -a/--data-only cannot be used together'
 );
@@ -92,12 +92,12 @@ command_fails_like(
 	'pg_dump: invalid output format');
 
 command_fails_like(
-	[ 'pg_restore', '-j', '-1' ],
+	[ 'pg_restore', '-j', '-1', '-f -' ],
 	qr/\Qpg_restore: invalid number of parallel jobs\E/,
 	'pg_restore: invalid number of parallel jobs');
 
 command_fails_like(
-	[ 'pg_restore', '--single-transaction', '-j3' ],
+	[ 'pg_restore', '--single-transaction', '-j3', '-f -' ],
 	qr/\Qpg_restore: cannot specify both --single-transaction and multiple jobs\E/,
 	'pg_restore: cannot specify both --single-transaction and multiple jobs');
 
@@ -107,12 +107,12 @@ command_fails_like(
 	'pg_dump: compression level must be in range 0..9');
 
 command_fails_like(
-	[ 'pg_restore', '--if-exists' ],
+	[ 'pg_restore', '--if-exists', '-f -' ],
 	qr/\Qpg_restore: option --if-exists requires option -c\/--clean\E/,
 	'pg_restore: option --if-exists requires option -c/--clean');
 
 command_fails_like(
-	[ 'pg_restore', '-F', 'garbage' ],
+	[ 'pg_restore', '-f -', '-F', 'garbage' ],
 	qr/\Qpg_restore: unrecognized archive format 

Re: WAL insert delay settings

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
> On 2/19/19 8:22 PM, Andres Freund wrote:
> > And my main point is that even if you implement a proper bytes/sec limit
> > ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> > meaningfully more confusing than right now.
> > 
> 
> So, why not modify autovacuum to also use this approach? I wonder if
> the situation there is more complicated because of multiple workers
> sharing the same budget ...

I think the main reason is that implementing a scheme like this for WAL
rate limiting isn't a small task, but it'd be aided by the fact that
it'd probably not be on by default, and that there'd not be any regressions
because the behaviour didn't exist before. In contrast, people are
extremely sensitive to autovacuum behaviour changes, even if it's to
improve autovacuum. I think it makes more sense to build the logic in a
lower-profile case first, and then migrate autovacuum over to it. Even
leaving the maturity issue aside, reducing the scope of the project into
more bite-sized chunks seems to increase the likelihood of getting
anything substantial.

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-19 Thread Tomas Vondra



On 2/19/19 8:22 PM, Andres Freund wrote:
> On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote:
>> Let's do a short example. Assume the default vacuum costing parameters
>>
>> vacuum_cost_limit = 200
>> vacuum_cost_delay = 20ms
>> cost_page_dirty = 20
>>
>> and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.
>>
>> Now, let's also throttle based on WAL - once in a while, after producing
>> some amount of WAL we sleep for a while. Again, for simplicity let's
>> assume the sleeps perfectly interleave and are also 20ms. So we have
>> something like:
> 
>> sleep(20ms); -- vacuum
>> sleep(20ms); -- WAL
>> sleep(20ms); -- vacuum
>> sleep(20ms); -- WAL
>> sleep(20ms); -- vacuum
>> sleep(20ms); -- WAL
>> sleep(20ms); -- vacuum
>> sleep(20ms); -- WAL
>>
>> Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
>> only 1/2 the WAL throughput, because it's affected exactly the same way
>> by the sleeps from vacuum throttling.
>>
>> We've not reached either of the limits. How exactly is this "lower limit
>> takes effect"?
> 
> Because I upthread said that that's not how I think a sane
> implementation of WAL throttling would work. I think the whole cost
> budgeting approach is BAD, and it'd be a serious mistake to copy it for a
> WAL rate limit (it disregards the time taken to execute IO and CPU costs
> etc, and in this case the cost of other bandwidth limitations).  What
> I'm saying is that we ought to instead specify a WAL rate in bytes/sec
> and *only* sleep once we've exceeded it for a time period (with some
> optimizations, so we don't gettimeofday after every XLogInsert(), but
> instead compute how many bytes later we need to re-determine the time to
> see if we're still in the same 'granule').
> 

OK, I agree with that. That's mostly what I described in response to
Robert a while ago, I think. (If you've described that earlier in the
thread, I missed it.)

> Now, a non-toy implementation would probably want to have a
> sliding window to avoid being overly bursty, and reduce the number of
> gettimeofday calls as mentioned above, but for explanation's sake basically
> imagine that the "main loop" of a bulk xlog emitting command would
> invoke a helper with a computation in pseudocode like:
> 
> current_time = gettimeofday();
> if (same_second(current_time, last_time))
> {
> wal_written_in_second += new_wal_written;
> if (wal_written_in_second >= wal_write_limit_per_second)
> {
>double too_much = (wal_written_in_second - wal_write_limit_per_second);
>sleep_fractional_seconds(too_much / wal_written_in_second);
> 
>last_time = current_time;
> }
> }
> else
> {
> last_time = current_time;
> }
> 
> which'd mean that in contrast to your example we'd not continually sleep
> for WAL, we'd only do so if we actually exceeded (or are projected to
> exceed in a smarter implementation), the specified WAL write rate. As
> the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd
> correspondingly sleep less.
> 

Yes, that makes sense.

> 
> And my main point is that even if you implement a proper bytes/sec limit
> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> meaningfully more confusing than right now.
> 

So, why not modify autovacuum to also use this approach? I wonder if
the situation there is more complicated because of multiple workers
sharing the same budget ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL insert delay settings

2019-02-19 Thread Andres Freund
On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote:
> Let's do a short example. Assume the default vacuum costing parameters
>
> vacuum_cost_limit = 200
> vacuum_cost_delay = 20ms
> cost_page_dirty = 20
>
> and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.
>
> Now, let's also throttle based on WAL - once in a while, after producing
> some amount of WAL we sleep for a while. Again, for simplicity let's
> assume the sleeps perfectly interleave and are also 20ms. So we have
> something like:

> sleep(20ms); -- vacuum
> sleep(20ms); -- WAL
> sleep(20ms); -- vacuum
> sleep(20ms); -- WAL
> sleep(20ms); -- vacuum
> sleep(20ms); -- WAL
> sleep(20ms); -- vacuum
> sleep(20ms); -- WAL
>
> Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
> only 1/2 the WAL throughput, because it's affected exactly the same way
> by the sleeps from vacuum throttling.
>
> We've not reached either of the limits. How exactly is this "lower limit
> takes effect"?

Because I upthread said that that's not how I think a sane
implementation of WAL throttling would work. I think the whole cost
budgeting approach is BAD, and it'd be serious mistake to copy it for a
WAL rate limit (it disregards the time taken to execute IO and CPU costs
etc, and in this case the cost of other bandwidth limitations).  What
I'm saying is that we ought to instead specify a WAL rate in bytes/sec
and *only* sleep once we've exceeded it for a time period (with some
optimizations, so we don't gettimeofday after every XLogInsert(), but
instead compute after how many bytes we need to re-determine the time to
see if we're still in the same 'granule').

Now, a non-toy implementation would probably want to have a
sliding window to avoid being overly bursty, and reduce the number of
gettimeofday as mentioned above, but for explanation's sake basically
imagine that the "main loop" of a bulk xlog-emitting command would
invoke a helper with a computation in pseudocode like:

current_time = gettimeofday();
if (same_second(current_time, last_time))
{
    wal_written_in_second += new_wal_written;
    if (wal_written_in_second >= wal_write_limit_per_second)
    {
        double too_much = (wal_written_in_second - wal_write_limit_per_second);
        sleep_fractional_seconds(too_much / wal_written_in_second);

        last_time = current_time;
    }
}
else
{
    last_time = current_time;
}

which'd mean that in contrast to your example we'd not continually sleep
for WAL, we'd only do so if we actually exceeded (or are projected to
exceed in a smarter implementation), the specified WAL write rate. As
the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd
correspondingly sleep less.
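
For what it's worth, a compilable version of roughly that accounting could
look like the following. This is a standalone sketch only: the function
name, the 16 MB/s limit and the use of time()/nanosleep() are placeholders,
not actual backend code.

#include <stdint.h>
#include <time.h>

uint64_t wal_write_limit_per_second = 16 * 1024 * 1024;  /* made-up limit */
uint64_t wal_written_in_second = 0;
time_t   last_second = 0;

void
throttle_wal(uint64_t new_wal_written)
{
    time_t  now = time(NULL);

    if (now != last_second)
    {
        /* new one-second granule: reset the per-second counter */
        last_second = now;
        wal_written_in_second = 0;
    }

    wal_written_in_second += new_wal_written;

    if (wal_written_in_second > wal_write_limit_per_second)
    {
        /* pay the overshoot back at the configured rate */
        double  pay_back = (double) (wal_written_in_second - wal_write_limit_per_second)
                           / (double) wal_write_limit_per_second;
        struct timespec ts;

        ts.tv_sec = (time_t) pay_back;
        ts.tv_nsec = (long) ((pay_back - (double) ts.tv_sec) * 1e9);
        nanosleep(&ts, NULL);
    }
}

Note that the new-second branch resets the per-second counter, something
the pseudocode above glosses over.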


And my main point is that even if you implement a proper bytes/sec limit
ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
meaningfully more confusing than right now.

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-19 Thread Tomas Vondra



On 2/19/19 7:28 PM, Robert Haas wrote:
> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund  wrote:
>> I think it'd not be insane to add two things:
>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>>   rewrites, ...)
>> - Account for WAL writes in the current vacuum costing logic, by
>>   accounting for it using a new cost parameter
>>
>> Then VACUUM would be throttled by the *minimum* of the two, which seems
>> to make plenty sense to me, given the usecases.
> 
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune.  For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> limit is tighter binds.  If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.
> 
> I really don't see much point in doubling down on the current vacuum
> cost delay logic.  The overall idea is good, but the specific way that
> you have to set the parameters is pretty inscrutable, and I think we
> should just fix it so that it can be, uh, scruted.
> 

I think changing the vacuum throttling so that it uses actual I/O
amounts (in kB/s) instead of the cost limit would be a step in the right
direction. It's clearer, and it also works with arbitrary page sizes.

Then, instead of sleeping for a fixed amount of time after reaching the
cost limit, we should track progress and compute the amount of time we
actually need to sleep. AFAICS that's what spread checkpoints do.
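
For illustration, the per-cycle sleep could be derived along these lines
(a sketch with made-up names, assuming a target write rate in bytes per
second):

/* how long we still need to sleep to stay at the target rate */
double
required_sleep(double bytes_done, double target_bytes_per_sec,
               double elapsed_sec)
{
    /* time the I/O should have taken at the target rate */
    double  budget_sec = bytes_done / target_bytes_per_sec;

    return (budget_sec > elapsed_sec) ? (budget_sec - elapsed_sec) : 0.0;
}

E.g. having dirtied 1 MB in 0.05 s against a 4 MB/s target, this yields a
0.2 s sleep.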

I'm sure it's bound to be trickier in practice, of course.

FWIW I'm not entirely sure we want to fully separate the limits. I'd
argue using the same vacuum budget for both reads and writes is actually
the right thing to do - the I/O likely hits the same device. So it makes
sense to "balance" those two in some way. But that may or may not be the
case for WAL, which is often moved to a different device.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: propagating replica identity to partitions

2019-02-19 Thread Robert Haas
On Mon, Feb 4, 2019 at 11:30 AM Alvaro Herrera  wrote:
> If you do ALTER TABLE .. REPLICA IDENTITY to a partitioned table, the
> command operates on the parent table itself and does not propagate to
> partitions.  Why is this?  Maybe not recursing was the right call when
> we only had regular inheritance (back in 9.4), but since partitioned
> tables got introduced, I think it is completely the other way around:
> not recursing is a usability fail.

This sounds an awful lot like the TABLESPACE case.  I wonder how many more
similar things there are.

It's not unreasonable to use the parent's REPLICA IDENTITY setting as
the default for new partitions, much as we now do for the TABLESPACE,
because the parent's replica identity is otherwise without meaning.
But I'm less convinced that it's reasonable to have changing the
parent's replica identity recurse to existing children, because:

1. That's different from the TABLESPACE case.  We shouldn't just treat
each new instance of this problem as a green field where we can pick
any old behavior at random; it should be consistent as far as
reasonable, and

2. It confuses a change to the default for new partitions with a
change to the value for existing partitions.  Say that you've got 5
existing partitions that use one setting, but now you want new
partitions to use a different setting.  You can't get that with your
proposed semantics, because trying to change the default will change
all the existing partitions to the new value also, even though that's
not what you wanted.

We should really, really try to figure out all of the things that are
like this -- a property that is meaningless for the partitioned table
itself but may serve as a useful default for its children -- and try
to make them all work the same, ideally in one release, rather than
changing them at different times, back-patching sometimes, and having
no consistency in the details.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Prepared transaction releasing locks before deregistering its GID

2019-02-19 Thread Oleksii Kliukin
Hi,

Oleksii Kliukin  wrote:
> 
> The approach looks good to me. Surprisingly, I saw no stalled backends
because of the double acquisition of the lock in TwoPhaseGetGXact once I put a
simple TwoPhaseStateLock right before the "gxact->valid = false" line; I
> will test your patch and post the outcome.

I gave it a spin on the same VM host that was shown to constantly reproduce the
issue and observed neither 'identifier already in use' nor any locking
issues over a few dozen runs, so it looks good to me.

That was HEAD, but since FinishPreparedTransaction seems to be identical
there and on the back branches it should work for PG 10 and 11 as well.

Regards,
Oleksii Kliukin




Re: Some thoughts on NFS

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 2:05 PM Magnus Hagander  wrote:
> C'mon Robert.
>
> Surely you know that such patches should be landed on *Fridays*, not 
> Thursdays.

Oh, right.  And preferably via airplane wifi from someplace over the
Atlantic ocean, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Magnus Hagander
On Tue, Feb 19, 2019 at 7:58 PM Robert Haas  wrote:

> On Tue, Feb 19, 2019 at 1:56 PM Andres Freund  wrote:
> > My point is that for iSCSI to be performant we'd need *all* the
> > infrastructure we also need for direct IO *and* a *lot* more. And that
> > it seems insane to invest very substantial resources into developing our
> > own iSCSI client when we don't even have DIO support. And DIO support
> > would allow us to address the error reporting issues, while also
> > drastically improving performance in a lot of situations. And we'd not
> > have to essentially develop our own filesystem etc.
>
> OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
> and then the iSCSI patch can go in on Thursday.  OK?  :-)
>

C'mon Robert.

Surely you know that such patches should be landed on *Fridays*, not
Thursdays.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: WAL insert delay settings

2019-02-19 Thread Tomas Vondra
On 2/19/19 7:50 PM, Andres Freund wrote:
> On 2019-02-19 19:43:14 +0100, Tomas Vondra wrote:
>>
>>
>> On 2/19/19 7:35 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
 On Fri, Feb 15, 2019 at 1:42 PM Andres Freund  wrote:
> I think it'd not be insane to add two things:
> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>   rewrites, ...)
> - Account for WAL writes in the current vacuum costing logic, by
>   accounting for it using a new cost parameter
>
> Then VACUUM would be throttled by the *minimum* of the two, which seems
> to make plenty sense to me, given the usecases.

 Or maybe we should just blow up the current vacuum cost delay stuff
 and replace it with something that is easier to tune.  For example, we
 could just have one parameter that sets the maximum read rate in kB/s
 and another that sets the maximum dirty-page rate in kB/s.  Whichever
 limit is tighter binds.  If we also have the thing that is the topic
 of this thread, that's a third possible upper limit.
>>>
 I really don't see much point in doubling down on the current vacuum
 cost delay logic.  The overall idea is good, but the specific way that
 you have to set the parameters is pretty inscrutable, and I think we
 should just fix it so that it can be, uh, scruted.
>>>
>>> I agree that that's something worthwhile to do, but given that the
>>> proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
>>> be useful to tie a general WAL rate limiting to rewriting cost limiting
>>> of vacuum.  It seems better to write the WAL rate limiting logic with an
>>> eye towards structuring it in a way that'd potentially allow reusing
>>> some of the code for a better VACUUM cost limiting.
>>>
>>> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
>>> confusing that when both VACUUM and WAL cost limiting are active, the
>>> lower limit takes effect.
>>>
>>
>> Except that's not my argument. I'm not arguing against throttling once
>> we hit the minimum of limits.
>>
>> The problem I have with implementing a separate throttling logic is that
>> it also changes the other limits (which are already kinda fuzzy). If you
>> add sleeps somewhere, those will affect the throttling built into
>> autovacuum (lowering them in some unknown way).
> 
> Those two paragraphs, to me, flat out contradict each other. If you
> throttle according to the lower of two limits, *of course* the
> throttling for the higher limit is affected in the sense that the limit
> won't be reached.  So yea, if you have a WAL rate limit, and you reach
> it, you won't be able to predict the IO rate for vacuum. Duh?  I must be
> missing something here.
> 

Let's do a short example. Assume the default vacuum costing parameters

vacuum_cost_limit = 200
vacuum_cost_delay = 20ms
cost_page_dirty = 20

and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.

Now, let's also throttle based on WAL - once in a while, after producing
some amount of WAL we sleep for a while. Again, for simplicity let's
assume the sleeps perfectly interleave and are also 20ms. So we have
something like:

sleep(20ms); -- vacuum
sleep(20ms); -- WAL
sleep(20ms); -- vacuum
sleep(20ms); -- WAL
sleep(20ms); -- vacuum
sleep(20ms); -- WAL
sleep(20ms); -- vacuum
sleep(20ms); -- WAL

Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
only 1/2 the WAL throughput, because it's affected exactly the same way
by the sleeps from vacuum throttling.

We've not reached either of the limits. How exactly is this "lower limit
takes effect"?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Some thoughts on NFS

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 1:56 PM Andres Freund  wrote:
> My point is that for iSCSI to be performant we'd need *all* the
> infrastructure we also need for direct IO *and* a *lot* more. And that
> it seems insane to invest very substantial resources into developing our
> own iSCSI client when we don't even have DIO support. And DIO support
> would allow us to address the error reporting issues, while also
> drastically improving performance in a lot of situations. And we'd not
> have to essentially develop our own filesystem etc.

OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
and then the iSCSI patch can go in on Thursday.  OK?  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 13:45:28 -0500, Robert Haas wrote:
> On Tue, Feb 19, 2019 at 1:29 PM Andres Freund  wrote:
> > And I think it's not that likely that we'd not screw up a
> > number of times implementing iSCSI ourselves - not to speak of the fact
> > that that seems like an odd place to focus development on, given that
> > it'd basically require all the infrastructure also needed for local DIO,
> > which'd likely gain us much more.
> 
> I don't really disagree with you here, but I also think it's important
> to be honest about what size hammer is likely to be sufficient to fix
> the problem.  Project policy for many years has been essentially
> "let's assume the kernel guys know what they are doing," but, I don't
> know, color me a little skeptical at this point.

Yea, and I think around e.g. using the kernel page cache / not using
DIO, several people, including kernel developers and say me, told us
that's stupid.


> We've certainly made lots of mistakes all of our own, and it's
> certainly true that reimplementing large parts of what the kernel does
> in user space is not very appealing ... but on the other hand it looks
> like filesystem error reporting isn't even really reliable for local
> operation (unless we do an incredibly complicated fd-passing thing
> that has deadlock problems we don't know how to solve and likely
> performance problems too, or convert the whole backend to use threads)
> or for NFS operation (though maybe your suggestion will fix that) so
> the idea that iSCSI is just going to be all right seems a bit
> questionable to me.  Go ahead, call me a pessimist...

My point is that for iSCSI to be performant we'd need *all* the
infrastructure we also need for direct IO *and* a *lot* more. And that
it seems insane to invest very substantial resources into developing our
own iSCSI client when we don't even have DIO support. And DIO support
would allow us to address the error reporting issues, while also
drastically improving performance in a lot of situations. And we'd not
have to essentially develop our own filesystem etc.

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-19 Thread Andres Freund
On 2019-02-19 19:43:14 +0100, Tomas Vondra wrote:
> 
> 
> On 2/19/19 7:35 PM, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
> >> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund  wrote:
> >>> I think it'd not be insane to add two things:
> >>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
> >>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
> >>>   rewrites, ...)
> >>> - Account for WAL writes in the current vacuum costing logic, by
> >>>   accounting for it using a new cost parameter
> >>>
> >>> Then VACUUM would be throttled by the *minimum* of the two, which seems
> >>> to make plenty sense to me, given the usecases.
> >>
> >> Or maybe we should just blow up the current vacuum cost delay stuff
> >> and replace it with something that is easier to tune.  For example, we
> >> could just have one parameter that sets the maximum read rate in kB/s
> >> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> >> limit is tighter binds.  If we also have the thing that is the topic
> >> of this thread, that's a third possible upper limit.
> > 
> >> I really don't see much point in doubling down on the current vacuum
> >> cost delay logic.  The overall idea is good, but the specific way that
> >> you have to set the parameters is pretty inscrutable, and I think we
> >> should just fix it so that it can be, uh, scruted.
> > 
> > I agree that that's something worthwhile to do, but given that the
> > proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
> > be useful to tie a general WAL rate limiting to rewriting cost limiting
> > of vacuum.  It seems better to write the WAL rate limiting logic with an
> > eye towards structuring it in a way that'd potentially allow reusing
> > some of the code for a better VACUUM cost limiting.
> > 
> > I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
> > confusing that when both VACUUM and WAL cost limiting are active, the
> > lower limit takes effect.
> > 
> 
> Except that's not my argument. I'm not arguing against throttling once
> we hit the minimum of limits.
> 
> The problem I have with implementing a separate throttling logic is that
> it also changes the other limits (which are already kinda fuzzy). If you
> add sleeps somewhere, those will affect the throttling built into
> autovacuum (lowering them in some unknown way).

Those two paragraphs, to me, flat out contradict each other. If you
throttle according to the lower of two limits, *of course* the
throttling for the higher limit is affected in the sense that the limit
won't be reached.  So yea, if you have a WAL rate limit, and you reach
it, you won't be able to predict the IO rate for vacuum. Duh?  I must be
missing something here.

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 1:35 PM Andres Freund  wrote:
> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
> confusing that when both VACUUM and WAL cost limiting are active, the
> lower limit takes effect.

I think you guys may all be in vigorous -- not to say mortal -- agreement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 1:29 PM Andres Freund  wrote:
> > Is that a new thing?  I ran across PostgreSQL-over-iSCSI a number of
> > years ago and the evidence strongly suggested that it did not reliably
> > report disk errors back to PostgreSQL, leading to corruption.
>
> How many years ago are we talking? I think it's been mostly robust in
> the last 6-10 years, maybe?

I think it was ~9 years ago.

> But note that the postgres + linux fsync
> issues would have plagued that use case just as well as it did local
> storage, at a likely higher incidence of failures (i.e. us forgetting to
> retry fsyncs in checkpoints, and linux throwing away dirty data after
> fsync failure would both have caused problems that aren't dependent on
> iSCSI).

IIRC, and obviously that's difficult to do after so long, there were
clearly disk errors in the kernel logs, but no hint of a problem in
the PostgreSQL logs.  So it wasn't just a case of us responding to
errors with insufficient vigor -- either they weren't being reported at
all, or only to system calls we weren't checking, e.g. close or
something.

> And I think it's not that likely that we'd not screw up a
> number of times implementing iSCSI ourselves - not to speak of the fact
> that that seems like an odd place to focus development on, given that
> it'd basically require all the infrastructure also needed for local DIO,
> which'd likely gain us much more.

I don't really disagree with you here, but I also think it's important
to be honest about what size hammer is likely to be sufficient to fix
the problem.  Project policy for many years has been essentially
"let's assume the kernel guys know what they are doing," but, I don't
know, color me a little skeptical at this point.  We've certainly made
lots of mistakes all of our own, and it's certainly true that
reimplementing large parts of what the kernel does in user space is
not very appealing ... but on the other hand it looks like filesystem
error reporting isn't even really reliable for local operation (unless
we do an incredibly complicated fd-passing thing that has deadlock
problems we don't know how to solve and likely performance problems
too, or convert the whole backend to use threads) or for NFS operation
(though maybe your suggestion will fix that) so the idea that iSCSI is
just going to be all right seems a bit questionable to me.  Go ahead,
call me a pessimist...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL insert delay settings

2019-02-19 Thread Tomas Vondra



On 2/19/19 7:35 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
>> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund  wrote:
>>> I think it'd not be insane to add two things:
>>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>>>   rewrites, ...)
>>> - Account for WAL writes in the current vacuum costing logic, by
>>>   accounting for it using a new cost parameter
>>>
>>> Then VACUUM would be throttled by the *minimum* of the two, which seems
>>> to make plenty sense to me, given the usecases.
>>
>> Or maybe we should just blow up the current vacuum cost delay stuff
>> and replace it with something that is easier to tune.  For example, we
>> could just have one parameter that sets the maximum read rate in kB/s
>> and another that sets the maximum dirty-page rate in kB/s.  Whichever
>> limit is tighter binds.  If we also have the thing that is the topic
>> of this thread, that's a third possible upper limit.
> 
>> I really don't see much point in doubling down on the current vacuum
>> cost delay logic.  The overall idea is good, but the specific way that
>> you have to set the parameters is pretty inscrutable, and I think we
>> should just fix it so that it can be, uh, scruted.
> 
> I agree that that's something worthwhile to do, but given that the
> proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
> be useful to tie a general WAL rate limiting to rewriting cost limiting
> of vacuum.  It seems better to write the WAL rate limiting logic with an
> eye towards structuring it in a way that'd potentially allow reusing
> some of the code for a better VACUUM cost limiting.
> 
> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
> confusing that when both VACUUM and WAL cost limiting are active, the
> lower limit takes effect.
> 

Except that's not my argument. I'm not arguing against throttling once
we hit the minimum of limits.

The problem I have with implementing a separate throttling logic is that
it also changes the other limits (which are already kinda fuzzy). If you
add sleeps somewhere, those will affect the throttling built into
autovacuum (lowering them in some unknown way).

From this POV it would be better to include this into the vacuum cost
limit, because then it's at least subject to the same budget.
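
For what it's worth, a minimal sketch of that kind of accounting (the
vacuum_cost_wal_kb parameter is made up, and this is not actual backend
code):

int vacuum_cost_limit   = 200;  /* existing GUC, default value */
int vacuum_cost_wal_kb  = 1;    /* hypothetical cost per kB of WAL */
int vacuum_cost_balance = 0;

/* returns 1 when the caller should sleep for vacuum_cost_delay */
int
charge_wal_cost(int wal_bytes)
{
    /* sub-kB remainders ignored for brevity */
    vacuum_cost_balance += (wal_bytes / 1024) * vacuum_cost_wal_kb;

    if (vacuum_cost_balance >= vacuum_cost_limit)
    {
        vacuum_cost_balance = 0;
        return 1;
    }
    return 0;
}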

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Row Level Security − leakproof-ness and performance implications

2019-02-19 Thread Pierre Ducroquet
Hello

In order to increase our security, we have started deploying row-level 
security in order to add another safety net if any issue was to happen in our 
applications.
After a careful monitoring of our databases, we found out that a lot of 
queries started to go south, going extremely slow.
The root of these slowdowns is that a lot of the PostgreSQL functions are not 
marked as leakproof, especially the ones used for operators.
In current git master, the following query returns 258 functions that are used 
by operators returning booleans and not marked leakproof:

SELECT proname FROM pg_proc 
WHERE exists(select 1 from pg_operator where oprcode::name = proname) 
AND NOT(proleakproof) 
AND prorettype = 16;


Among these, if you have for instance a table like:
create table demo_enum (
username text,
flag my_enum
);

With an index on (username, flag), the index will only be used on username 
because flag is an enum and the enum_eq function is not marked leakproof.

For simple functions like enum_eq/enum_ne, marking them leakproof is an 
obvious fix (patch attached to this email, including also textin/textout). And 
if we had a 'RLS-enabled' context on functions, a way to make a lot of built-
in functions leakproof would simply be to prevent them from logging errors 
containing values.

For others, like arraycontains, it's much trickier : arraycontains can be 
leakproof only if the comparison function of the inner type is leakproof too. 
This raises some interesting issues, like arraycontains being marked as strict 
parallel safe while there is nothing making it mandatory for the inner type.
In order to optimize queries on a table like:
create table demo_array (username text, my_array int[]);
one would need to mark arraycontains leakproof, thus implying that any 
comparison operator must be leakproof too? Another possibility, not that 
complicated, would be to create type-specific entries in pg_proc for each 
type that has a leakproof comparison operator. Or the complicated path would 
be to have a 'dynamic' leakproof for functions like arraycontains that depend 
on the inner type, but since this is determined at planning, it seems very 
complicated to implement.

A third issue we noticed is the usage of functional indexes. If you create an 
index on, for instance, (username, leaky_function(my_field)), and then search 
on leaky_function(my_field) = 42, the functional index can be used only if 
leaky_function is marked leakproof, even if it is never going to be executed 
on invalid rows thanks to the index. After a quick look at the source code, it 
also seems complicated to implement since the decision to reject potentially 
dangerous calls to leaky_function is done really early in the process, before 
the optimizer starts.


I am willing to spend some time on these issues, but I have no specific 
knowledge of the planner and optimizer, and I fear proper fixes would require 
much digging there. If you think it would be appropriate for functions like 
arraycontains or range_contains to require the 'inner' comparison function to 
be leakproof, or just keep looking at the other functions in pg_proc and 
leakproof the ones that can be, I would be happy to write the corresponding 
patches.


Best regards

From a4dd6eb9a16e7e2d2e44b03c5a04c485841aab25 Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Tue, 19 Feb 2019 17:21:12 +0100
Subject: [PATCH] Mark textin/out and enum_eq/ne as leakproof

---
 src/include/catalog/pg_proc.dat | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a4e173b484..d4ac402835 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -95,10 +95,10 @@
   prorettype => 'regprocedure', proargtypes => 'text',
   prosrc => 'to_regprocedure' },
 { oid => '46', descr => 'I/O',
-  proname => 'textin', prorettype => 'text', proargtypes => 'cstring',
+  proname => 'textin', proleakproof => 't', prorettype => 'text', proargtypes => 'cstring',
   prosrc => 'textin' },
 { oid => '47', descr => 'I/O',
-  proname => 'textout', prorettype => 'cstring', proargtypes => 'text',
+  proname => 'textout', proleakproof => 't', prorettype => 'cstring', proargtypes => 'text',
   prosrc => 'textout' },
 { oid => '48', descr => 'I/O',
   proname => 'tidin', prorettype => 'tid', proargtypes => 'cstring',
@@ -8339,10 +8339,10 @@
   proname => 'enum_out', provolatile => 's', prorettype => 'cstring',
   proargtypes => 'anyenum', prosrc => 'enum_out' },
 { oid => '3508',
-  proname => 'enum_eq', prorettype => 'bool', proargtypes => 'anyenum anyenum',
+  proname => 'enum_eq', proleakproof => 't', prorettype => 'bool', proargtypes => 'anyenum anyenum',
   prosrc => 'enum_eq' },
 { oid => '3509',
-  proname => 'enum_ne', prorettype => 'bool', proargtypes => 'anyenum anyenum',
+  proname => 'enum_ne', proleakproof => 't', prorettype => 'bool', proargtypes 

Re: WAL insert delay settings

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund  wrote:
> > I think it'd not be insane to add two things:
> > - WAL write rate limiting, independent of the vacuum stuff. It'd also be
> >   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
> >   rewrites, ...)
> > - Account for WAL writes in the current vacuum costing logic, by
> >   accounting for it using a new cost parameter
> >
> > Then VACUUM would be throttled by the *minimum* of the two, which seems
> > to make plenty sense to me, given the usecases.
> 
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune.  For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> limit is tighter binds.  If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.

> I really don't see much point in doubling down on the current vacuum
> cost delay logic.  The overall idea is good, but the specific way that
> you have to set the parameters is pretty inscrutable, and I think we
> should just fix it so that it can be, uh, scruted.

I agree that that's something worthwhile to do, but given that the
proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
be useful to tie a general WAL rate limiting to rewriting cost limiting
of vacuum.  It seems better to write the WAL rate limiting logic with an
eye towards structuring it in a way that'd potentially allow reusing
some of the code for a better VACUUM cost limiting.

I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
confusing that when both VACUUM and WAL cost limiting are active, the
lower limit takes effect.

Greetings,

Andres Freund



Re: Some thoughts on NFS

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 13:21:21 -0500, Robert Haas wrote:
> On Tue, Feb 19, 2019 at 1:17 PM Andres Freund  wrote:
> > On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> > > There might be a use-case for the split that you mention, absolutely, but
> > > it's not going to solve the people-who-want-NFS situation. You'd solve 
> > > more
> > > of that by having the middle layer speak "raw device" underneath and be
> > > able to sit on top of things like iSCSI (yes, really).
> >
> > There's decent iSCSI implementations in several kernels, without the NFS
> > problems. I'm not sure what we'd gain by reimplementing those?
> 
> Is that a new thing?  I ran across PostgreSQL-over-iSCSI a number of
> years ago and the evidence strongly suggested that it did not reliably
> report disk errors back to PostgreSQL, leading to corruption.

How many years ago are we talking? I think it's been mostly robust in
the last 6-10 years, maybe? But note that the postgres + linux fsync
issues would have plagued that use case just as well as it did local
storage, at a likely higher incidence of failures (i.e. us forgetting to
retry fsyncs in checkpoints, and linux throwing away dirty data after
fsync failure would both have caused problems that aren't dependent on
iSCSI).  And I think it's not that likely that we'd not screw up a
number of times implementing iSCSI ourselves - not to speak of the fact
that that seems like an odd place to focus development on, given that
it'd basically require all the infrastructure also needed for local DIO,
which'd likely gain us much more.

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-19 Thread Robert Haas
On Fri, Feb 15, 2019 at 1:42 PM Andres Freund  wrote:
> I think it'd not be insane to add two things:
> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>   rewrites, ...)
> - Account for WAL writes in the current vacuum costing logic, by
>   accounting for it using a new cost parameter
>
> Then VACUUM would be throttled by the *minimum* of the two, which seems
> to make plenty sense to me, given the usecases.

Or maybe we should just blow up the current vacuum cost delay stuff
and replace it with something that is easier to tune.  For example, we
could just have one parameter that sets the maximum read rate in kB/s
and another that sets the maximum dirty-page rate in kB/s.  Whichever
limit is tighter binds.  If we also have the thing that is the topic
of this thread, that's a third possible upper limit.
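
To make the "tighter limit binds" rule concrete, here is a sketch (made-up
names, rates in bytes per second): whichever limit is tighter implies the
larger time budget for the I/O actually done, and that budget determines
the sleep.

double
combined_sleep(double read_bytes, double max_read_bytes_per_sec,
               double dirty_bytes, double max_dirty_bytes_per_sec,
               double elapsed_sec)
{
    double  read_budget  = read_bytes / max_read_bytes_per_sec;
    double  dirty_budget = dirty_bytes / max_dirty_bytes_per_sec;
    double  budget_sec   = (read_budget > dirty_budget) ? read_budget : dirty_budget;

    return (budget_sec > elapsed_sec) ? (budget_sec - elapsed_sec) : 0.0;
}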

I really don't see much point in doubling down on the current vacuum
cost delay logic.  The overall idea is good, but the specific way that
you have to set the parameters is pretty inscrutable, and I think we
should just fix it so that it can be, uh, scruted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 1:17 PM Andres Freund  wrote:
> On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> > There might be a use-case for the split that you mention, absolutely, but
> > it's not going to solve the people-who-want-NFS situation. You'd solve more
> > of that by having the middle layer speak "raw device" underneath and be
> > able to sit on top of things like iSCSI (yes, really).
>
> There's decent iSCSI implementations in several kernels, without the NFS
> problems. I'm not sure what we'd gain by reimplementing those?

Is that a new thing?  I ran across PostgreSQL-over-iSCSI a number of
years ago and the evidence strongly suggested that it did not reliably
report disk errors back to PostgreSQL, leading to corruption.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> There might be a use-case for the split that you mention, absolutely, but
> it's not going to solve the people-who-want-NFS situation. You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

There's decent iSCSI implementations in several kernels, without the NFS
problems. I'm not sure what we'd gain by reimplementing those?

Greetings,

Andres Freund



Re: Some thoughts on NFS

2019-02-19 Thread Magnus Hagander
On Tue, Feb 19, 2019 at 5:38 PM Christoph Moench-Tegeder 
wrote:

> ## Magnus Hagander (mag...@hagander.net):
>
> >  You'd solve more
> > of that by having the middle layer speak "raw device" underneath and be
> > able to sit on top of things like iSCSI (yes, really).
>
> Back in ye olden days we called these middle layers "kernel" and
> "filesystem" and had that maintained by specialists.
>

Yeah. Unfortunately, in a number of cases that turned out to mean
specialists who considered fsync unimportant, or who dropped data from the
buffer cache without reporting errors.

But what I'm mainly saying is that if we want to run postgres on top of a
block device protocol, we should go all the way and do it, not somewhere
half way that will be unable to help most people. I'm not saying that we
*should*, there is a very big if in that.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Some thoughts on NFS

2019-02-19 Thread Magnus Hagander
On Tue, Feb 19, 2019 at 5:33 PM Tomas Vondra 
wrote:

>
> On 2/19/19 5:20 PM, Robert Haas wrote:
> > On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander 
> wrote:
>
> >> There might be a use-case for the split that you mention,
> >> absolutely, but it's not going to solve the people-who-want-NFS
> >> situation. You'd solve more of that by having the middle layer
> >> speak "raw device" underneath and be able to sit on top of things
> >> like iSCSI (yes, really).
> >
> > Not sure I follow this part.
> >
>
> I think Magnus says that people running PostgreSQL on NFS generally
> don't do that because they somehow chose NFS, but because that's what
> their company uses for network storage. Even if we support the custom
> block protocol, they probably won't be able to use it.
>

Yes, with the addition that they also often export iSCSI endpoints today,
so if we wanted to sit on top of something that could also work. But not
sit on top of a custom block protocol we invent ourselves.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: CPU costs of random_zipfian in pgbench

2019-02-19 Thread David Fetter
On Sun, Feb 17, 2019 at 11:02:37PM +0100, Tomas Vondra wrote:
> On 2/17/19 6:33 PM, David Fetter wrote:
> > On Sun, Feb 17, 2019 at 11:09:27AM -0500, Tom Lane wrote:
> >> Fabien COELHO  writes:
>  I'm trying to use random_zipfian() for benchmarking of skewed data sets, 
>  and I ran head-first into an issue with rather excessive CPU costs. 
> >>
> >>> If you want skewed but not especially zipfian, use exponential which is 
> >>> quite cheap. Also zipfian with a > 1.0 parameter does not have to compute 
> >>> the harmonic number, so it depends in the parameter.
> >>
> >> Maybe we should drop support for parameter values < 1.0, then.  The idea
> >> that pgbench is doing something so expensive as to require caching seems
> >> flat-out insane from here.  That cannot be seen as anything but a foot-gun
> >> for unwary users.  Under what circumstances would an informed user use
> >> that random distribution rather than another far-cheaper-to-compute one?
> >>
> >>> ... This is why I submitted a pseudo-random permutation 
> >>> function, which alas does not get much momentum from committers.
> >>
> >> TBH, I think pgbench is now much too complex; it does not need more
> >> features, especially not ones that need large caveats in the docs.
> >> (What exactly is the point of having zipfian at all?)
> > 
> > Taking a statistical perspective, Zipfian distributions violate some
> > assumptions we make by assuming uniform distributions. This matters
> > because Zipf-distributed data sets are quite common in real life.
> > 
> 
> I don't think there's any disagreement about the value of non-uniform
> distributions. The question is whether it has to be a zipfian one, when
> the best algorithm we know about is this expensive in some cases? Or
> would an exponential distribution be enough?

I suppose people who care about the difference between Zipf and
exponential would appreciate having the former around to test.

Whether pgbench should support this is a different question, and it's
sounding a little like the answer to that one is "no."

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: restrict pg_stat_ssl to superuser?

2019-02-19 Thread Robert Haas
On Thu, Feb 7, 2019 at 3:30 AM Peter Eisentraut
 wrote:
> As discussed in [0], should we restrict access to pg_stat_ssl to
> superusers (and an appropriate pg_ role)?
>
> If so, is there anything in that view that should be made available to
> non-superusers?  If not, then we could perhaps do this via a simple
> permission change instead of going the route of blanking out individual
> columns.

Shouldn't unprivileged users be able to see their own entries, or
entries for roles which they could assume via SET ROLE?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Delay locking partitions during INSERT and UPDATE

2019-02-19 Thread Robert Haas
On Mon, Feb 18, 2019 at 6:15 PM Tom Lane  wrote:
> I'm inclined to think that if we already have lock on the parent
> partitioned table (thereby, IIUC, guaranteeing that its partitioning
> info can't change) that the order in which we acquire the same lock
> level on its partition(s) isn't very important.

Well, as it turns out, there's also a pending patch (series) to remove
that guarantee which I would like to get committed.  The thread for
that is "ATTACH/DETACH PARTITION CONCURRENTLY".  I believe I've more
or less got the issues with that patch sorted out, but I'm concerned
about how it interacts with all of the other things that are currently
in flight.  Delaying or skipping lock acquisition in some cases and
weakening the locks we take in others is not a process that can
continue indefinitely without at some point hitting a wall.

But that being said, I don't think I quite see how the two things you
mention here are connected to each other.  If operation A locks
parents before children, and operation B also locks parents before
children, they generally won't deadlock against each other, even if
each locks the children in a different order.  An exception would be
if the locks involved are of different levels such that the locks on
the parents don't conflict but the locks on the children do, e.g.
RowExclusiveLock on partitioned tables and AccessExclusiveLock on
partitions.  But that's a pretty weird setup and I don't know of any
current or proposed code that does that.  So my question is - what do
you mean by the parenthetical note about the partitioning info not
changing?  Regardless of whether it does or does not, I think the same
property holds.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Delay locking partitions during INSERT and UPDATE

2019-02-19 Thread Robert Haas
On Fri, Feb 15, 2019 at 9:22 PM Andres Freund  wrote:
> On 2019-01-31 13:46:33 -0500, Robert Haas wrote:
> > I have reviewed this patch and I am in favor of it.  I think it likely
> > needs minor rebasing because of the heap_open -> table_open renaming.
> > I also agree that it's worth taking some deadlock risk for the rather
> > massive performance gain, although I suspect it's likely that a few
> > users are going to complain about deadlocks and I wonder whether we'll
> > have to put some energy into that problem at some point.  However, I
> > think what we probably want to do there is reduce the probability of
> > deadlocks through some trickery or maybe invent some new locking
> > mechanisms that work around the problem.  The alternative of trying to
> > block this patch seems unpalatable.
>
> Are you saying that such workarounds would have to be merged at the same
> time as this patch? Or that we'd address such complaints that way at a
> later time?

Later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: speeding up planning with partitions

2019-02-19 Thread Tom Lane
Amit Langote  writes:
> On 2019/02/19 4:42, Tom Lane wrote:
>> I don't much care for re-calling build_base_rel_tlists to add extra
>> Vars to the appropriate relations; 99% of the work it does will be
>> wasted, and with enough child rels you could run into an O(N^2)
>> problem.  Maybe you could call add_vars_to_targetlist directly,
>> since you know exactly what Vars you're adding?

> Assuming you're talking about the build_base_rel_tlists() call in
> create_inherited_target_child_root(), it's necessary because *all* tlist
> Vars are new, because the targetlist has just been translated at that
> point.  But maybe I missed your point?

Hmm, I'll take a closer look --- I thought it was just there to add the
new ctid-or-equivalent columns.  Didn't our earlier translation of the
whole subroot structure reach the reltargetlists?

> By the way, later patch in the series will cause partition pruning to
> occur before these child PlannerInfos are generated, so
> create_inherited_target_child_root will be called only as many times as
> there are un-pruned child target relations.

Well, that helps, but only for "point update" queries.  You still need
to be wary of not causing O(N^2) behavior when there are lots of unpruned
partitions.

>> What is the point of moving the calculation of all_baserels?  The earlier
>> you construct that, the more likelihood that code will have to be written
>> to modify it (like, say, what you had to put into
>> create_inherited_target_child_root), and I do not see anything in this
>> patch series that needs it to be available earlier.

> all_baserels needs to be built in original PlannerInfo before child
> PlannerInfos are constructed, so that they can simply copy it and have the
> parent target baserel RT index in it replaced by child target baserel RT
> index.  set_inherited_target_rel_sizes/pathlists use child PlannerInfos,
> so all_baserels must be set in them just like it is in the original
> PlannerInfo.

No, I think you're not getting my point: if you build all_baserels in
the same place where it already is being built, you don't need to do
any of that because it's already correct.  It looks to me like this
code motion is left over from some earlier iteration of the patch where
it probably was necessary, but now it's just making your life harder.

>> Couldn't inh_target_child_rels list be removed in favor of looking
>> at the relid fields of the inh_target_child_path_rels entries?
>> Having to keep those two lists in sync seems messy.

> Don't like having two lists either, but inh_target_child_path_rels entries
> can be RELOPT_JOINREL, so relid can be 0.

So?  For a join, there's no particularly relevant integer to put into
inh_target_child_rels either.  (I might like these two lists better
if they had better-chosen names.  Merely making them long enough to
induce carpal tunnel syndrome isn't helping distinguish them.)

> Attached updated patches.  0001 that I'd previously posted is no longer
> included, as I said in the other email.

OK, I'll make another pass over 0001 today.

regards, tom lane



Re: unconstify equivalent for volatile

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 11:48:16 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > The real reason why variables commonly need to be volatile when used in
> > signal handlers is not the signal handler side, but the normal code flow
> > side.
> 
> Yeah, exactly.  You have not explained why it'd be safe to ignore that.

Because SetLatch() is careful to have a pg_memory_barrier() before
touching shared state and conversely so are ResetLatch() (and
WaitEventSetWait(), which already has no volatiles). And if we've got
this wrong they aren't safe for shared latches, because volatiles don't
enforce meaningful ordering on weakly ordered architectures.
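
To illustrate the general point (plain C11, deliberately not the actual
latch code, with made-up names): the ordering across threads comes from
the barrier or memory order, and volatile would add nothing to it.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag;               /* note: no volatile anywhere */
int         protected_data;

void
producer(void)
{
    protected_data = 42;
    /* release: the write above cannot be reordered past this store */
    atomic_store_explicit(&flag, true, memory_order_release);
}

int
consumer(void)
{
    /* acquire: reads below cannot be reordered before this load */
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                       /* a real waiter would block instead of spinning */

    return protected_data;      /* guaranteed to observe 42 */
}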

But even if we were to decide we'd want to keep a volatile in SetLatch()
- which I think really would only serve to hide bugs - that'd not mean
it's a good idea to keep it on all the other functions in latch.c.

Greetings,

Andres Freund



Re: Some thoughts on NFS

2019-02-19 Thread Andres Freund
Hi,

On 2019-02-19 20:03:05 +1300, Thomas Munro wrote:
> The first is practical.  Running out of diskspace (or quota) is not
> all that rare (much more common that EIO from a dying disk, I'd
> guess), and definitely recoverable by an administrator: just create
> more space.  It would be really nice to avoid panicking for an
> *expected* condition.

Well, that's true, but OTOH, we don't even handle that properly on local
filesystems for WAL. And while people complain, it's not *that* common.


> 1.  Figure out how to get the ALLOCATE command all the way through the
> stack from PostgreSQL to the remote NFS server, and know for sure that
> it really happened.  On the Debian buster Linux 4.18 system I checked,
> fallocate() reports EOPNOTSUPP, and posix_fallocate()
> appears to succeed but it doesn't really do anything at all (though I
> understand that some versions sometimes write zeros to simulate
> allocation, which in this case would be equally useless as it doesn't
> reserve anything on an NFS server).  We need the server and NFS client
> and libc to be of the right version and cooperate and tell us that
> they have really truly reserved space, but there isn't currently a way
> as far as I can tell.  How can we achieve that, without writing our
> own NFS client?
> 
> 2.  Deal with the resulting performance suckage.  Extending 8kb at a
> time with synchronous network round trips won't fly.

I think I'd just go for fsync();pwrite();fsync(); as the extension
mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
to make sure there's no previous errors that we could mistake for
ENOSPC, the pwrite to extend, the second fsync to make sure there's
actually space. Then we can detect ENOSPC properly.  That possibly does
leave some errors where we could mistake ENOSPC as something more benign
than it is, but the cases seem pretty narrow, due to the previous
fsync() (maybe the other side could be thin provisioned and get an
ENOSPC there - but in that case we didn't actually lose any data. The
only dangerous scenario I can come up with is that the remote side is on
a thinly provisioned CoW system, and a concurrent write to an earlier
block runs out of space - but seriously, good riddance to you).
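
Concretely, something along these lines (a standalone sketch, not smgr
code; block size and error handling simplified):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* defined here just for the standalone sketch */

/* returns 0 on success, ENOSPC when out of space, another errno otherwise */
int
extend_segment(int fd, off_t offset, int nblocks)
{
    static const char zeros[BLCKSZ];    /* zero-filled new blocks */

    /* 1. flush first, so an older error can't be mistaken for ENOSPC */
    if (fsync(fd) != 0)
        return errno;

    /* 2. write the new blocks at the end of the segment */
    for (int i = 0; i < nblocks; i++)
    {
        ssize_t written = pwrite(fd, zeros, BLCKSZ,
                                 offset + (off_t) i * BLCKSZ);

        if (written != BLCKSZ)
            return (written < 0) ? errno : EIO;
    }

    /* 3. flush again: only now do we know the space really exists */
    if (fsync(fd) != 0)
        return errno;

    return 0;
}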

Given the current code we'll already try to extend in bigger chunks when
there's contention, we just need to combine the writes for those, that
ought to not be that hard now that we don't initialize bulk-extended
pages anymore. That won't solve the issue of extending during single
threaded writes, but I feel like that's secondary to actually being
correct. And using bulk-extension in more cases doesn't sound too hard
to me.


Greetings,

Andres Freund



Re: unconstify equivalent for volatile

2019-02-19 Thread Tom Lane
Andres Freund  writes:
> The real reason why variables commonly need to be volatile when used in
> signal handlers is not the signal handler side, but the normal code flow
> side.

Yeah, exactly.  You have not explained why it'd be safe to ignore that.

regards, tom lane



Re: PGAdmin 4 don't refresh server info after restarting

2019-02-19 Thread Robert Haas
On Mon, Feb 18, 2019 at 1:20 AM Andrey Klychkov  wrote:
> Hello,
> We've noticed that pgadmin 3.x / 4.x doesn't refresh server info like server 
> version after restarting. It makes people confused.
> For example,
> 1. PgAdmin was running
> 2. We upgraded postgres from 11.1 to 11.2
> 3. PgAdmin was restarted
> 4. We continued to see 11.1
> 5. We must push "Disconnect from server" and connect again for updating 
> server version.
>
> It would be convenient that PgAdmin checks server info from time to time, at 
> least, when PgAdmin was restarted.

You should report this on a pgAdmin mailing list rather than a
PostgreSQL mailing list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Christoph Moench-Tegeder
## Magnus Hagander (mag...@hagander.net):

>  You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

Back in ye olden days we called these middle layers "kernel" and
"filesystem" and had that maintained by specialists.

Regards,
Christoph

-- 
Spare Space.



Re: Some thoughts on NFS

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander  wrote:
> The only case I've run into people wanting to use postgres on NFS, the NFS 
> server is a big filer from netapp or hitachi or whomever. And you're not 
> going to be able to run something like that on top of it.

Yeah.  :-(

It seems, however, we have no way of knowing to what extent that big
filer actually implements the latest NFS specs and does so correctly.
And if it doesn't, and data goes down the tubes, people are going to
blame PostgreSQL, not the big filer, either because they really
believe we ought to be able to handle it, or because they know that
filing a trouble ticket with NetApp isn't likely to provoke any sort
of swift response.  If PostgreSQL itself is speaking NFS, we might at
least have a little more information about what behavior the filer
claims to implement, but even then it could easily be "lying."  And if
we're just seeing it as a filesystem mount, then we're just ... flying
blind.

> There might be a use-case for the split that you mention, absolutely, but 
> it's not going to solve the people-who-want-NFS situation. You'd solve more 
> of that by having the middle layer speak "raw device" underneath and be able 
> to sit on top of things like iSCSI (yes, really).

Not sure I follow this part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro  wrote:
> > How can we achieve that, without writing our
> > own NFS client?
> 
> 
> 
> Instead of writing our own NFS client, how about writing our own
> network storage protocol?  Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else.  The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are.  And the user would see the remote server as a
> tablespace mapped to a special smgr.

In reading this, I honestly thought somewhere along the way you'd say
"and then you have WAL, so just run a replica and forget this whole
network filesystem business."

The practical issue of WAL replay being single-process is an issue
though.  It seems like your mini-protocol was going in a direction that
would have allowed multiple processes to be working between the PG
system and the storage system concurrently, avoiding the single-threaded
issue with WAL but also making it such that the replica wouldn't be able
to be used for read-only queries (without some much larger changes
happening anyway).  I'm not sure the use-case is big enough but it does
seem to me that we're getting to a point where people are generating
enough WAL with systems that they care an awful lot about that they
might be willing to forgo having the ability to perform read-only
queries on the replica as long as they know that they can flip traffic
over to the replica without losing data.

So, what this all really boils down to is that I think this idea of a
different protocol that would allow PG to essentially replicate to a
remote system, or possibly run entirely off of the remote system without
any local storage, could be quite interesting in some situations.

On the other hand, I pretty much agree 100% with Magnus that the NFS
use-case is almost entirely because someone bought a big piece of
hardware that talks NFS and no, you don't get to run whatever code you
want on it.

Thanks!

Stephen




Re: Allowing extensions to find out the OIDs of their member objects

2019-02-19 Thread Robert Haas
On Mon, Jan 21, 2019 at 9:46 PM Tom Lane  wrote:
> Perhaps this also gives some impetus to the lets-use-identifiers-
> not-numbers approach that Andrew was pushing.  I didn't care for
> that too much so far as an extension's own internal references
> are concerned, but for cross-extension references it seems a
> lot better to be looking for "postgis / function_foo_int_int"
> than for "postgis / 3".

Yeah, I agree.  I think names are a good idea.  I also agree with the
other comments that trying to run an OID registry will not work out
well.  Either we'll accept every request for an OID range and go nuts
tracking them all as they rapidly balloon -- or more likely we'll
reject requests from insufficiently-famous extensions which will, of
course, hinder their attempts to become famous.  It seems much better
to come up with a solution where every extension can DTRT without any
central coordination.  Perhaps if we replaced OIDs with UUIDs that
would Just Work, but an OID-mapping system seems like a good, perhaps
better, answer as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some thoughts on NFS

2019-02-19 Thread Joe Conway
On 2/19/19 10:59 AM, Magnus Hagander wrote:
> On Tue, Feb 19, 2019 at 4:46 PM Robert Haas  wrote:
> 
> On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro  wrote:
> > How can we achieve that, without writing our
> > own NFS client?
> 
> 
> 
> 
> You'll need it :)
> 
> 
> Instead of writing our own NFS client, how about writing our own
> network storage protocol?  Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else.  The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are.  And the user would see the remote server as a
> tablespace mapped to a special smgr.
> 
> As compared with your proposal, this has both advantages and
> disadvantages.  The main advantage is that we aren't dependent on
> being able to make NFS behave in any particular way; indeed, this type
> of solution could be used not only to work around problems with NFS,
> but also problems with any other network filesystem.  We get to reuse
> all of the work we've done to try to make local operation reliable;
> the remote server can run the same code that would be run locally
> whenever the master tells it to do so.  And you can even imagine
> trying to push more work to the remote side in some future version of
> the protocol.  The main disadvantage is that it doesn't help unless
> you can actually run software on the remote box.  If the only access
> you have to the remote side is that it exposes an NFS interface, then
> this sort of thing is useless.  And that's probably a pretty common
> scenario.
> 
> 
> In my experience, that covers approximately 100% of the use cases.
> 
> In the only cases where I've run into people wanting to use postgres on
> NFS, the NFS server is a big filer from netapp or hitachi or whomever.
> And you're not going to be able to run something like that on top of it.


Exactly my experience too.


> There might be a use-case for the split that you mention, absolutely,
> but it's not going to solve the people-who-want-NFS situation. You'd
> solve more of that by having the middle layer speak "raw device"
> underneath and be able to sit on top of things like iSCSI (yes, really).

Interesting idea but sounds ambitious ;-)

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: unconstify equivalent for volatile

2019-02-19 Thread Andres Freund
Hi,

On February 19, 2019 7:00:58 AM PST, Peter Eisentraut  wrote:
>On 2019-02-18 21:25, Andres Freund wrote:
>> ISTM this one should rather be solved by removing all volatiles from
>> latch.[ch]. As that's a cross-process concern we can't rely on it
>> anyway (and have placed barriers a few years back to allay concerns /
>> bugs due to reordering).
>
>Aren't the volatiles there so that Latch variables can be set from
>signal handlers?

Why would they be required, given that we have an explicit compiler & memory 
barrier? That forces the compiler to spill the writes to memory already, even 
in a signal handler. And conversely the reading side is similarly forced - if 
not, we have a bug on multi-core systems - to read the variable from memory 
after a barrier.

The real reason why variables commonly need to be volatile when used in signal 
handlers is not the signal handler side, but the normal code flow side. Without 
explicit care the variable's value can be "cached" in a register, and a change 
not noticed. But as volatiles aren't sufficient for cross-process safety, 
latches need more anyway.
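
To illustrate the pattern (this is only a sketch, not the actual latch code, 
and the names are made up):

#include "postgres.h"
#include "port/atomics.h"

/* Deliberately not volatile: the barriers below make that unnecessary. */
static int  wakeup_pending = 0;

/* Registered elsewhere, e.g. with pqsignal(). */
static void
my_signal_handler(int signo)
{
    (void) signo;
    wakeup_pending = 1;
    pg_memory_barrier();    /* make the store visible before returning */
}

static bool
check_wakeup(void)
{
    /* force a reload from memory instead of trusting a cached register */
    pg_memory_barrier();
    return wakeup_pending != 0;
}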

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: FOP warnings about id attributes in title tags

2019-02-19 Thread Peter Eisentraut
On 2019-02-18 16:37, Peter Eisentraut wrote:
>> It appears that these are due to title elements having id tags. At
>>  it says:
>>
>> When adding an |id| or |xml:id| attribute, put it on the element
>> itself, not the |title|.
>>
>> So maybe we need to fix those up?
> 
> You can't just remove the ids, since some of them are referenced from
> elsewhere.

Here was a discussion on getting rid of them:
https://www.postgresql.org/message-id/flat/4a60dfc3-061b-01c4-2b86-279d3a612fd2%402ndquadrant.com

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Some thoughts on NFS

2019-02-19 Thread Magnus Hagander
On Tue, Feb 19, 2019 at 4:46 PM Robert Haas  wrote:

> On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro 
> wrote:
> > How can we achieve that, without writing our
> > own NFS client?
>
> 
>

You'll need it :)


Instead of writing our own NFS client, how about writing our own
> network storage protocol?  Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else.  The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are.  And the user would see the remote server as a
> tablespace mapped to a special smgr.
>
> As compared with your proposal, this has both advantages and
> disadvantages.  The main advantage is that we aren't dependent on
> being able to make NFS behave in any particular way; indeed, this type
> of solution could be used not only to work around problems with NFS,
> but also problems with any other network filesystem.  We get to reuse
> all of the work we've done to try to make local operation reliable;
> the remote server can run the same code that would be run locally
> whenever the master tells it to do so.  And you can even imagine
> trying to push more work to the remote side in some future version of
> the protocol.  The main disadvantage is that it doesn't help unless
> you can actually run software on the remote box.  If the only access
> you have to the remote side is that it exposes an NFS interface, then
> this sort of thing is useless.  And that's probably a pretty common
> scenario.
>

In my experience, that covers approximately 100% of the use cases.

In the only cases where I've run into people wanting to use postgres on
NFS, the NFS server is a big filer from netapp or hitachi or whomever. And
you're not going to be able to run something like that on top of it.

There might be a use-case for the split that you mention, absolutely, but
it's not going to solve the people-who-want-NFS situation. You'd solve more
of that by having the middle layer speak "raw device" underneath and be
able to sit on top of things like iSCSI (yes, really).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: restrict pg_stat_ssl to superuser?

2019-02-19 Thread Peter Eisentraut
On 2019-02-18 04:58, Michael Paquier wrote:
> On Fri, Feb 15, 2019 at 02:04:59PM +0100, Peter Eisentraut wrote:
>> We could remove default privileges from pg_stat_get_activity().  Would
>> that be a problem?
> 
> I don't think so, but I am still wondering about the impact that this
> could have for monitoring tools calling it directly, as we document
> it.

Actually, this approach isn't going to work anyway, because function
permissions in a view are checked as the current user, not the view owner.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Some thoughts on NFS

2019-02-19 Thread Robert Haas
On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro  wrote:
> How can we achieve that, without writing our
> own NFS client?



Instead of writing our own NFS client, how about writing our own
network storage protocol?  Imagine a stripped-down postmaster process
running on the NFS server that essentially acts as a block server.
Through some sort of network chatter, it can exchange blocks with the
real postmaster running someplace else.  The mini-protocol would
contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
the smgr layer are.  And the user would see the remote server as a
tablespace mapped to a special smgr.
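
Purely to illustrate the shape of the idea, a request message for such a
mini-protocol might look something like this (nothing here exists today;
the commands and fields are only a sketch):

#include <stdint.h>

typedef enum RemoteSmgrCommand
{
    RSMGR_READ_BLOCK,
    RSMGR_WRITE_BLOCK,
    RSMGR_EXTEND_RELATION,
    RSMGR_FSYNC_SEGMENT
} RemoteSmgrCommand;

typedef struct RemoteSmgrRequest
{
    RemoteSmgrCommand cmd;   /* which smgr-level operation is requested */
    uint32_t    spcOid;      /* tablespace */
    uint32_t    dbOid;       /* database */
    uint32_t    relOid;      /* relation (relfilenode) */
    int32_t     forkNum;     /* main fork, FSM, VM, ... */
    uint32_t    blockNum;    /* target block, or new length for EXTEND */
    char        page[8192];  /* BLCKSZ-sized payload for WRITE_BLOCK */
} RemoteSmgrRequest;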

As compared with your proposal, this has both advantages and
disadvantages.  The main advantage is that we aren't dependent on
being able to make NFS behave in any particular way; indeed, this type
of solution could be used not only to work around problems with NFS,
but also problems with any other network filesystem.  We get to reuse
all of the work we've done to try to make local operation reliable;
the remote server can run the same code that would be run locally
whenever the master tells it to do so.  And you can even imagine
trying to push more work to the remote side in some future version of
the protocol.  The main disadvantage is that it doesn't help unless
you can actually run software on the remote box.  If the only access
you have to the remote side is that it exposes an NFS interface, then
this sort of thing is useless.  And that's probably a pretty common
scenario.

So that brings us back to your proposal.  I don't know whether there's
anyway of solving the problem you postulate: "We need the server and
NFS client and libc to be of the right version and cooperate and tell
us that they have really truly reserved space."  If there's not a set
of APIs that can be used to make that happen, then I don't know how we
can ever solve this problem without writing our own client.  Well, I
guess we could submit patches to every libc in the world to add those
APIs.  But that seems like a painful way forward.

I'm kinda glad you're thinking about this problem because I think the
unreliably of PostgreSQL on NFS is a real problem for users and kind
of a black eye for the project.  However, I am not sure that I see an
easy solution in what you wrote, or in general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allowing extensions to find out the OIDs of their member objects

2019-02-19 Thread Tom Lane
I wrote:
> [ discussion about ways to let extension C code find out object OIDs ]

I wanted to close out this thread, for the time being, by saying that
I'm not expecting to get anything done about it for v12.  It seems
pretty late in the dev cycle to be proposing any major new user-visible
functionality, and even without that consideration, I have too many
other things on my plate.

There are ways that an extension can implement
SupportRequestIndexCondition without knowing hard-wired OIDs for its
objects:

1. The first problem is to determine whether the index column you've
been called for has the opfamily you want to work with.  Our core-code
examples mostly check req->opfamily against a hard-wired OID constant,
which doesn't work for an extension-defined opfamily.  I think best
practice here will probably be to look up the opfamily's catalog entry via
the OPFAMILYOID syscache and see if its name is what you expect (checking
only the name, not the schema, since the latter might be variable).
While a false match is theoretically possible it seems unlikely, and
there's not a security risk since we only allow superusers to create
opfamilies.  Another idea is to check req->index->relam to see if the
index type is what you expect, and just assume that your datatype has
only one opfamily per index type.  This is mainly attractive if the
index type is a built-in one with a known OID --- if the index AM is
also extension-defined, then you're still stuck needing to look up and
check its name.
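
As a sketch of how an extension might do that name check, assuming a
hypothetical opfamily called "my_extension_ops" (not taken from any real
extension):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_opfamily.h"
#include "utils/syscache.h"

static bool
opfamily_is_ours(Oid opfamilyOid)
{
    HeapTuple   tup;
    bool        result = false;

    tup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyOid));
    if (HeapTupleIsValid(tup))
    {
        Form_pg_opfamily opf = (Form_pg_opfamily) GETSTRUCT(tup);

        /* check the name only, not the schema, as suggested above */
        result = (strcmp(NameStr(opf->opfname), "my_extension_ops") == 0);
        ReleaseSysCache(tup);
    }
    return result;
}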

2. Then, you need to be able to identify operator OIDs so that you can
build indexqual OpExprs.  The operators you care about for this have
to be members of the opfamily, so you can look up their OIDs pretty
easily using get_opfamily_member ... assuming you know the OIDs of their
input datatypes (see below).  like_support.c has examples of this.
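
A corresponding sketch of the operator lookup, with a made-up strategy
number:

#include "postgres.h"
#include "utils/lsyscache.h"

#define MY_EQ_STRATEGY  3       /* hypothetical strategy number */

static Oid
lookup_my_operator(Oid opfamilyOid, Oid datatypeOid)
{
    /* returns InvalidOid if the opfamily has no such member */
    return get_opfamily_member(opfamilyOid,
                               datatypeOid, datatypeOid,
                               MY_EQ_STRATEGY);
}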

3.  You might need to determine datatype OIDs, for the above purpose and
so that you can build comparison constants to pass to the operators.
This is pretty easy if the comparison value is of the same type as one
of the inputs to your function or operator being optimized: just apply
exprType() to that input expression.  However there are cases where this
isn't true; for instance, PostGIS sometimes wants to optimize a function
with more than 2 inputs by converting "func(indexcol, something, something)"
to "indexcol indexable-operator row(something, something)::compositetype".
The OID of the custom composite type isn't readily available.  In such
cases you might have no really better answer than to look up the type
by name.  This can probably be done safely by assuming that it's in the
same schema as your target function, which you can look up since you
have the function's OID.  Using the type OID successfully in a later
get_opfamily_member lookup would provide additional confidence that
you have the right type.
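
For the easy case, a sketch that pulls the type OID from the second
argument of the function being optimized (the argument position here is
only an example):

#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"
#include "nodes/supportnodes.h"

static Oid
comparison_type_from_request(SupportRequestIndexCondition *req)
{
    FuncExpr   *fexpr = (FuncExpr *) req->node;
    Node       *arg = (Node *) lsecond(fexpr->args);

    return exprType(arg);
}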

In all of these cases, you could probably get away with caching the
results of successful lookups in static variables, so that you don't
need to do them more than once per session.

This is clearly an area that's ripe for improvement by providing better
infrastructure, but I think we can tolerate this state of affairs for
the time being.

regards, tom lane



Re: Progress reporting for pg_verify_checksums

2019-02-19 Thread Fabien COELHO



Hallo Michael,


New patch attached.


Patch applies cleanly, compiles, and "make check" is ok. Doc build is also ok.

There are no tests, which is probably fine for such an interactive 
feature.


Docs look okay to me. Clear and to the point.

About:

  total_percent = total_size ? (int64) ((current_size / MEGABYTES) * 100 /
                                        (total_size / MEGABYTES)) : 0;

The divisions by MEGABYTES can be simplified away, which will also enhance 
precision. ISTM that the percentage could be a double:

  total_percent = total_size ? 100.0 * current_size / total_size : 0.0;

and then just let fprintf do the rounding.

I would not bother rounding values over 100% down to 100, because then you 
would get

  1560/1492 MB (100%, X MB/s)

which is kind of silly. If it goes over, it just means that the initial 
estimate was wrong, which might happen if the db is running (other patch 
in the queue). I'm fine with over 100 if it is consistent with the MB 
ratio next to it.


fprintf could be called once instead of twice by merging the eol and the 
previous message.


Maybe I'd consider inverting the sizeonly boolean so that more is done 
when true.


I'd still use time with better than one-second precision so that the speed 
displayed at the beginning is more realistic. E.g. on a short test with 
files probably in the system cache I got:


  188/188 MB (100%, 188 MB/s)

but the reality is that it took 0.126 seconds, and the speed was really 
1492 MB/s.
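
Putting those points together, a rough sketch (variable and function names 
are only illustrative, this is not the actual patch):

#include <stdio.h>
#include <stdint.h>

static void
report_progress(int64_t current_size, int64_t total_size, double elapsed_secs)
{
    double      total_percent;
    double      mb_per_sec;

    /* no intermediate MB rounding: compute the ratio directly as a double */
    total_percent = total_size > 0 ? 100.0 * current_size / total_size : 0.0;
    mb_per_sec = elapsed_secs > 0.0 ?
        (double) current_size / (1024.0 * 1024.0) / elapsed_secs : 0.0;

    /* one fprintf for the whole line; the format string does the rounding */
    fprintf(stderr, "%lld/%lld MB (%.0f%%, %.0f MB/s)\r",
            (long long) (current_size / (1024 * 1024)),
            (long long) (total_size / (1024 * 1024)),
            total_percent, mb_per_sec);
}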


--
Fabien.


