RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> On 2019-Sep-03, Tsunakawa, Takayuki wrote:
> > I don't think it's rejected. It would be a pity (mottainai) to refuse
> > this, because it provides significant speedup despite its simple
> > modification.
>
> I don't necessarily disagree with your argumentation, but Travis is
> complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his last mail, but I'm a bit at a loss.

First, to accurately count the maximum number of locks acquired in a transaction, we need to track the maximum number of entries in the hash table and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (which is not necessary for LockMethodLocalHash), we would have to add a spinlock to hashhdr and lock/unlock it when entering and removing entries in the hash table. That would spoil the effort to decrease contention via hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in a global variable in lock.c instead of in the hash table?

Second, I couldn't understand the comment about the fill factor well. I can see that it's not correct to compare the number of hash buckets with the number of locks, but what can we do instead? I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the classic database textbook and seems intuitive.

So I attached the rebased version.

Regards
Takayuki Tsunakawa

Attachment: faster-locallock-scan_v3.patch
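For concreteness, here is a minimal sketch of the lock.c global-variable idea mentioned above (every name in it is hypothetical, not taken from any posted patch):

    /* lock.c: track the peak number of local locks in a transaction */
    static long numLocalLocksHeld = 0;  /* current count (hypothetical) */
    static long maxLocalLocksHeld = 0;  /* peak within this transaction */

    static inline void
    TrackLocalLockCount(long delta)
    {
        numLocalLocksHeld += delta;
        if (numLocalLocksHeld > maxLocalLocksHeld)
            maxLocalLocksHeld = numLocalLocksHeld;
    }

    /*
     * At transaction end, maxLocalLocksHeld can drive the decision to
     * rebuild LockMethodLocalHash; reset both counters afterwards.
     */

This would avoid touching dynahash (and hence the shared partitioned hash tables) entirely, at the cost of a little bookkeeping at each lock acquisition and release.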
RE: [bug fix??] Fishy code in tts_virtual_copyslot()
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I temporarily changed the Assert to be "==" rather than "<=", and
> it still passed check-world, so evidently we are not testing any
> cases where the descriptors are of different lengths. This explains
> the lack of symptoms. It's still a bug though, so pushed.

Thank you for committing.

> > BTW, I thought that in PostgreSQL coding convention, local variables
> > should be defined at the top of blocks, but this function writes
> > "for (int natt ...".
>
> Yeah, we've agreed to join the 21st century to the extent of allowing
> local for-loop variables.

That's good news. It'll help a bit in writing code more comfortably.

Regards
Takayuki Tsunakawa
[bug fix??] Fishy code in tts_virtual_copyslot()
Hello,

In the following code in execTuples.c, shouldn't srcdesc point to the source slot's tuple descriptor? The attached fix passes make check. What kind of failure could this cause?

BTW, I thought that in PostgreSQL coding convention, local variables should be defined at the top of blocks, but this function writes "for (int natt ...". I didn't modify it, because many other source files are written that way.

--
static void
tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
    TupleDesc   srcdesc = dstslot->tts_tupleDescriptor;

    Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);

    tts_virtual_clear(dstslot);

    slot_getallattrs(srcslot);

    for (int natt = 0; natt < srcdesc->natts; natt++)
    {
        dstslot->tts_values[natt] = srcslot->tts_values[natt];
        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
    }
--

Regards
Takayuki Tsunakawa

Attachment: virtslot_tiny_fix.patch
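For reference, the fix described above is a one-liner; presumably virtslot_tiny_fix.patch amounts to this:

    -    TupleDesc   srcdesc = dstslot->tts_tupleDescriptor;
    +    TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;

With that change, the Assert compares the source descriptor's natts against the destination's, which is what the copy loop actually assumes.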
RE: [bug fix] Produce a crash dump before main() on Windows
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Imagine an application which relies on Postgres, still does *not* start
> it as a service but uses "pg_ctl start" automatically. This could be
> triggered as part of another service startup which calls say system(),
> or as another script. Wouldn't that cause potentially a freeze?

Do you mean by "another service startup":

another service (Windows service) -> an application (not a Windows service) -> pg_ctl start

Then pg_ctl runs under the Windows service environment, and the popup won't appear.

> I am also not sure that I quite understand why a
> popup would never show up if a different service startup triggers pg_ctl
> start by itself that fails.

I experimented with the sample program that I showed in this thread. A program invoked by a Windows service also runs in the service.

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: David Steele [mailto:da...@pgmasters.net]
> > Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren,
> > just in case they are slow to respond to or ignore SIGINT/SIGTERM? That
> > matches the idea of pg_ctl's immediate shutdown.
>
> -1, at least not immediately. Archivers can be complex processes and
> they should be given the chance to do a graceful shutdown.

I agree that the user's archiver program should get the chance to stop gracefully in a smart or fast shutdown. But I think that in an immediate shutdown, everything should stop immediately. That's what I expect from the word "immediate."

If the grandchildren left running don't disturb the cleanup of PostgreSQL's resources (shared memory, file/directory access, etc.) or the restart of PostgreSQL, we may well be able to just advise the grandchildren to stop immediately with SIGINT/SIGTERM. However, in the failover of shared-disk HA clustering, for example, when the clustering software stops PostgreSQL with "pg_ctl stop -m immediate" and then tries to unmount the file systems for $PGDATA and the archived WAL, the unmount may take time or fail due to the lingering access from PostgreSQL's grandchildren.

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> SIGTERM, which needs to be adjusted. For another, its
> SIGQUIT handler does exit(1) not _exit(2), which seems rather
> dubious ... should we make it more like the rest? I think
> the reasoning there might've been that if some DBA decides to
> SIGQUIT the archiver, we don't need to force a database-wide
> reset; but why exactly should we tolerate that?

The postmaster doesn't distinguish among the archiver's nonzero return codes, and just restarts the archiver unless the postmaster is shutting down. So we can use _exit(2) like the other children.

Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren, just in case they are slow to respond to or ignore SIGINT/SIGTERM? That matches the idea of pg_ctl's immediate shutdown. (Windows cannot stop grandchildren because kill() in src/port/kill.c doesn't support process groups... but that's a separate topic.)

Regards
Takayuki Tsunakawa
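On POSIX systems, reaching the grandchildren is just a matter of signalling the process group; a minimal sketch (assuming the archiver was made a group leader with setpgid(), and eliding error handling):

    #include <signal.h>
    #include <sys/types.h>

    /*
     * Forcibly stop the archiver's whole process group, grandchildren
     * included.  ESRCH simply means everything is already gone.
     */
    static void
    kill_archiver_group(pid_t archiver_pid)
    {
        (void) kill(-archiver_pid, SIGKILL);    /* negative pid = group */
    }

This negative-pid form is exactly what kill() in src/port/kill.c cannot emulate on Windows, as noted above.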
RE: Libpq support to connect to standby server as priority
From: Alvaro Herrera from 2ndQuadrant [mailto:alvhe...@alvh.no-ip.org]
> Testing protocol version 2 is difficult! Almost every single test fails
> because of error messages being reported differently; and streaming
> replication (incl. pg_basebackup) doesn't work at all because it's not
> possible to establish replication connections. Manual inspection shows
> it behaves correctly.

Yeah, the code path for protocol v2 is sometimes annoying. I wish v2 support would be dropped soon. I know there was a discussion on it, but I didn't track the conclusion.

> Remaining patchset attached (per my count it's v13 of your patchset.)

I'm afraid those weren't attached.

> think we should merge one half of it with each of the other two patches
> where the changes are introduced (0003 and 0004). I'm not convinced
> that we need 0004-0006 to be separate commits.

It was hard to review those separate patches, so I think it's better to merge them. OTOH, I can understand Haribabu-san's idea that the community may not accept the latter patches, e.g., accept only 0001-0005.

Regards
Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Alvaro Herrera from 2ndQuadrant writes:
> > Well, IMV this is a backpatchable, localized bug fix.
>
> I dunno. This thread is approaching two years old, and a quick
> review shows few signs that we actually have any consensus on
> making behavioral changes here. If there is any consensus,
> it's that the SetErrorMode calls should depend on checking
> pgwin32_is_service(); but we haven't even seen a patch that
> does that, let alone tested it. I think we're way too close
> to beta4 wrap to be considering pushing such a patch the moment
> it appears.

We don't have to call pgwin32_is_service() to determine whether we call SetErrorMode() in postmaster.c, because:

* The dialog box doesn't appear under a Windows service, whether PostgreSQL itself starts as a service or another (server) application runs as a service and does "pg_ctl start."

* It's OK for the dialog box to appear when the user runs "pg_ctl start" at the command prompt. That usage is interactive and should not be used in production (postgres processes vanish when the user mistakenly presses Ctrl+C or closes the command prompt). Even now, the dialog box pops up if postmaster crashes before main().

> BTW, I also violently dislike taking Windows-specific stuff out of
> startup_hacks where it belongs and putting it into main() where
> it doesn't. I think the entire Windows bit in front of get_progname
> should migrate into startup_hacks. Surely the odds of failure
> inside get_progname are not worth worrying about --- they seem
> significantly less than the odds of failure inside
> pgwin32_is_service(), for instance.

Agreed. Fixed. I changed the CF status to "needs review."

Regards
Takayuki Tsunakawa

Attachment: crash_dump_before_main_v3.patch
RE: [bug fix] Produce a crash dump before main() on Windows
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The last patch submitted is here:
> https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8ECF73@G01JPEXMBYT05
> And based on the code paths it touches I would recommend to not play with
> REL_12_STABLE at this stage.

I'm reading the thread to see what I should do... Anyway, I don't think we should rush to include this fix in PG 12, either. But at the same time, I don't think it would be dangerous to put it in PG 12.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> Hmm ... is this patch rejected, or is somebody still trying to get it to
> committable state? David, you're listed as committer.

I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification. Again, I think the v2 patch is OK. Tom's comment was as follows:

[Tom's comment against v2]
FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks.

On the whole I don't think there's an adequate case for committing this patch.

I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. I'd rather see us do something to let hash_seq_search win across the board.

My responses:

* Extreme test case: Not extreme. Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after a generic plan is automatically created. That's the reason for this proposal.

* 0.5% slowdown with pgbench: I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK): As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11.

* Use case is narrow: No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort:
  - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE!
  - LockReleaseSession(): advisory locks
  - LockReleaseCurrentOwner(): ??
  - LockReassignCurrentOwner(): ??

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com]
> Since we are allowing OPs to use arbitrary command as
> archive_command, providing a replacement with non-standard signal
> handling for a specific command doesn't seem a general solution
> to me. Couldn't we have pg_system(a tentative name), which
> intercepts SIGQUIT then sends SIGINT to children? Might be need
> to resend SIGQUIT after some interval, though..

The same idea that you refer to as pg_system occurred to me, too. But I wondered whether the archiver process can get the pid of its child (the shell? archive_command?) while keeping the capabilities of system() (i.e., the shell). Even if we fork() and then system(), doesn't the OS send SIGQUIT to all descendants of the archiver when postmaster sends SIGQUIT to the child process group?

> # Is there any means to view the whole of a thread from archive?
> # I'm a kind of reluctant to wander among messages like a rat in
> # a maze:p

You can see the whole thread by clicking the "Whole Thread" link.

Regards
Takayuki Tsunakawa
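For what it's worth, a rough sketch of such a pg_system() might look like this (entirely hypothetical; signal-safety and error handling are glossed over): run the command in its own process group so that a SIGQUIT caught by the wrapper can be forwarded to the children as SIGINT:

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigquit = 0;

    static void
    quit_handler(int signo)
    {
        got_sigquit = 1;
    }

    static int
    pg_system(const char *command)
    {
        pid_t   child;
        int     status = -1;

        signal(SIGQUIT, quit_handler);

        child = fork();
        if (child < 0)
            return -1;
        if (child == 0)
        {
            setpgid(0, 0);      /* run the command in its own group */
            execl("/bin/sh", "sh", "-c", command, (char *) NULL);
            _exit(127);         /* exec failed */
        }

        for (;;)
        {
            pid_t   res = waitpid(child, &status, WNOHANG);

            if (res == child || res < 0)
                break;          /* command finished, or error */
            if (got_sigquit)
                (void) kill(-child, SIGINT);    /* advise children to exit */
            usleep(10000);      /* 10 ms poll; not tuned */
        }
        return status;
    }

Whether the grandchildren would still receive the postmaster's original SIGQUIT depends on how the process groups end up arranged, which is exactly the open question above.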
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> After investigation, the mechanism that's causing that is that the
> src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
> down its replica server with a mode-immediate stop, which causes
> that postmaster to shut down all its children with SIGQUIT, and
> in particular that signal propagates to a "cp" command that the
> archiver process is executing. The "cp" is unsurprisingly running
> with default SIGQUIT handling, which per the signal man page
> includes dumping core.

We experienced this (a core dump left in the data directory by an archive command) years ago. Related to this, the example of using cp in the PostgreSQL manual is misleading, because cp doesn't reliably persist the WAL archive file.

> This makes me wonder whether we shouldn't be using some other signal
> to shut down archiver subprocesses. It's not real cool if we're
> spewing cores all over the place. Admittedly, production servers
> are likely running with "ulimit -c 0" on most modern platforms,
> so this might not be a huge problem in the field; but accumulation
> of core files could be a problem anywhere that's configured to allow
> server core dumps.

We enable core dumps in production to help investigation, just in case.

> Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
> non-Postgres child processes. But redesigning the system's signal
> handling to make that possible seems like a bit of a mess.
>
> Thoughts?

We're using a shell script and a command that's called from the shell script. That is:

archive_command = 'call some_shell_script.sh ...'

[some_shell_script.sh]
ulimit -c 0              # no core dumps from the copy command
trap 'exit 1' QUIT       # just exit on receipt of SIGQUIT
some_command ...         # copies and syncs the file

some_command also catches SIGQUIT and just exits. It copies and syncs the file.

I proposed something along this line as below, but I couldn't respond to Peter's review comments due to other tasks. Does anyone think it's worth resuming this?

https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau

Regards
Takayuki Tsunakawa
RE: Why overhead of SPI is so large?
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> PL/pgSQL: 29044.361 ms
> C/SPI: 22785.597 ms
>
> The fact that difference between PL/pgSQL and function implemented in C
> using SPI is not so large was expected by me.

This PL/pgSQL overhead is not so significant compared with the threefold difference, but it makes me wish for a feature like Oracle's ALTER PROCEDURE ... COMPILE, which compiles the PL/SQL logic to native code. I've seen it give speedups of a few dozen percent.

Regards
Takayuki Tsunakawa
RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write
From: Matsumura, Ryo [mailto:matsumura@jp.fujitsu.com]
> Detail:
> If target_session_attrs is set to read-write, PQconnectPoll() calls
> PQsendQuery("SHOW transaction_read_only") althogh previous return value
> was PGRES_POLLING_READING not WRITING.

The current code probably assumes that the PQsendQuery() that sends "SHOW transaction_read_only" shouldn't block, because the message is small enough to fit in the socket send buffer. Likewise, the code in the CONNECTION_AWAITING_RESPONSE case sends authentication data using pg_fe_sendauth() without checking for write-ready status. OTOH, the code in the CONNECTION_MADE case waits for write-ready status before sending the startup packet. That's because the startup packet could get large enough to cause pqPacketSend() to block. So, I don't think the fix is necessary.

> In result, PQsendQuery() may be blocked in pqsecure_raw_write().

FWIW, if PQsendQuery() blocked during connection establishment, I think it would block in poll() called from pqWait(), because libpq's socket is set non-blocking.

Regards
Takayuki Tsunakawa
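The CONNECTION_MADE behavior described above is the usual non-blocking send pattern; a hedged sketch of the idea:

    #include <sys/select.h>

    /*
     * Wait until a non-blocking socket can accept more data before
     * sending a message that might not fit in the send buffer.
     * PQconnectPoll achieves the same effect by returning
     * PGRES_POLLING_WRITING and letting the caller poll the socket.
     */
    static int
    wait_until_write_ready(int sock)
    {
        fd_set  wfds;

        FD_ZERO(&wfds);
        FD_SET(sock, &wfds);
        return select(sock + 1, NULL, &wfds, NULL, NULL);
    }

Skipping this wait for a message known to fit in the send buffer (like the small "SHOW transaction_read_only") is precisely the assumption the current code makes.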
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Another counter-argument to this is that there's already an
> unexplainable slowdown after you run a query which obtains a large
> number of locks in a session or use prepared statements and a
> partitioned table with the default plan_cache_mode setting. Are we not
> just righting a wrong here? Albeit, possibly 1000 queries later.
>
> I am, of course, open to other ideas which solve the problem that v5
> has, but failing that, I don't see v6 as all that bad. At least all
> the logic is contained in one function. I know Tom wanted to steer
> clear of heuristics to reinitialise the table, but most of the stuff
> that was in the patch back when he complained was trying to track the
> average number of locks over the previous N transactions, and his
> concerns were voiced before I showed the 7% performance regression
> with unconditionally rebuilding the table.

I think I understand what you mean. Sorry, I don't have a better idea. This unexplainability is probably what we should accept to avoid the shocking 7% slowdown.

OTOH, how about my original patch that is based on the local lock list? I expect that it wouldn't cause such a significant slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> For the use case we've been measuring with partitioned tables and the
> generic plan generation causing a sudden spike in the number of
> obtained locks, then having plan_cache_mode = force_custom_plan will
> cause the lock table not to become bloated. I'm not sure there's
> anything interesting to measure there.

I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value):

* plan_cache_mode = force_custom_plan: LocalLockHash won't bloat. The query execution time is steady.

* plan_cache_mode = auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution, when LocalLockHash is shrunk.

Depending on the number of transactions and what each transaction does, I thought the difference would or would not be noticeable.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I personally don't think that's true. The only way you'll notice the
> LockReleaseAll() overhead is to execute very fast queries with a
> bloated lock table. It's pretty hard to notice that a single 0.1ms
> query is slow. You'll need to execute thousands of them before you'll
> be able to measure it, and once you've done that, the lock shrink code
> will have run and the query will be performing optimally again.

Maybe so. Will the difference be noticeable between plan_cache_mode = auto (the default) and plan_cache_mode = force_custom_plan?

> I voice my concerns with v5 and I wasn't really willing to push it
> with a known performance regression of 7% in a fairly valid case. v6
> does not suffer from that.

You're right. We may have to accept the unpredictability this hidden behavior presents to users as a compromise for higher throughput.

> > Where else does the extra overhead come from?
>
> hash_get_num_entries(LockMethodLocalHash) == 0 &&
> + hash_get_max_bucket(LockMethodLocalHash) >
> + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
>
> that's executed every time, not every 1000 times.

I see. Thanks.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I went back to the drawing board on this and I've added some code that
> counts the number of times we've seen the table to be oversized and
> just shrinks the table back down on the 1000th time. 6.93% / 1000 is
> not all that much.

I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder: "Why is the same query fast at first in the session (5 or 6 executions), then slower for a while, and then fast again? Is there something to tune? Is something wrong with my app (e.g., how I use prepared statements)?" So I prefer v5.

> Of course, not all the extra overhead might be from rebuilding the
> table, so here's a test with the updated patch.

Where else does the extra overhead come from?

Regards
Takayuki Tsunakawa
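The counting heuristic being discussed can be pictured with a short sketch; the function names follow the patch fragments quoted elsewhere in this thread, while the counter and constant are hypothetical:

    #define MAX_OVERSIZE_OBSERVATIONS 1000  /* hypothetical */

    static int  oversize_count = 0;

    /* Called from LockReleaseAll() once the hash table is empty. */
    static void
    MaybeShrinkLocalLockHash(void)
    {
        if (hash_get_max_bucket(LockMethodLocalHash) >
                LOCKMETHODLOCALHASH_SHRINK_THRESHOLD &&
            ++oversize_count >= MAX_OVERSIZE_OBSERVATIONS)
        {
            CreateLocalLockHash();  /* rebuild at the initial size */
            oversize_count = 0;
        }
    }

The objection above is precisely that the user-visible effect of this heuristic (fast, then slower for about a thousand transactions, then fast again) is unexplainable to users.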
RE: Zedstore - compressed in-core columnar storage
From: Ashwin Agrawal [mailto:aagra...@pivotal.io]
> The objective is to gather feedback on design and approach to the same.
> The implementation has core basic pieces working but not close to complete.

Thank you for proposing a very interesting topic. Are you thinking of including this in PostgreSQL 13 if possible?

> * All Indexes supported
...
> work. Btree indexes can be created. Btree and bitmap index scans work.

Does Zedstore allow creating indexes of the existing types on the table (btree, GIN, BRIN, etc.) and performing index scans (point queries, range queries, etc.)?

> * Hybrid row-column store, where some columns are stored together, and
> others separately. Provide flexibility of granularity on how to
> divide the columns. Columns accessed together can be stored
> together.
...
> This way of laying out the data also easily allows for hybrid row-column
> store, where some columns are stored together, and others have a dedicated
> B-tree. Need to have user facing syntax to allow specifying how to group
> the columns.
...
> Zedstore Table can be
> created using command:
>
> CREATE TABLE (column listing) USING zedstore;

Are you aiming to enable Zedstore to be used for HTAP, i.e., the same table can be accessed simultaneously for both OLTP and analytics with minimal performance impact on OLTP? (I got that impression from the word "hybrid.") If yes, is the assumption that only a limited number of columns are stored in columnar format (for efficient scanning), while many other columns are stored in row format for efficient tuple access? Are those row-formatted columns stored in the same file as the column-formatted columns, or in a separate file?

Regarding the column grouping, should I imagine something like HBase and Cassandra? How could the current CREATE TABLE syntax support column grouping? (I guess CREATE TABLE needs a syntax for columnar store, and Zedstore needs to be incorporated in core, not as an extension...)

> A column store uses the same structure but we have *multiple* B-trees, one
> for each column, all indexed by TID. The B-trees for all columns are stored
> in the same physical file.

Did you conclude that it's not a good idea to have a different file for each group of columns? Is that because we can't expect physical adjacency of data blocks on disk even if we separate a column into a separate file? I thought a separate file for each group of columns would be easier and less error-prone to implement and debug. Adding and dropping a column group would also be very easy and fast.

Regards
Takayuki Tsunakawa
RE: Commitfest 2019-07, the first of five* for PostgreSQL 13
From: Stephen Frost [mailto:sfr...@snowman.net]
> sh, don't look now, but there might be a "Resend email" button in the
> archives now that you can click to have an email sent to you...
>
> Note that you have to be logged in, and the email will go to the email
> address that you're logging into the community auth system with.
>
> (thank you Magnus)

Thank you so much, Magnus. This is very convenient. I'm forced to use Outlook at work, which doesn't allow replying to a downloaded email. Your help eliminates the need to save all emails.

Regards
Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Tsunakawa/Haribabu - By reading this thread briefly, it seems we need
> some more inputs from other developers on whether to fix this or not,
> so ideally the status of this patch should be 'Needs Review'. Why it
> is in 'Waiting on Author' state? If something is required from
> Author, then do we expect to see the updated patch in the next few
> days?

Thank you for paying attention to this. I think the patch is good, but someone else may have a different solution. So I marked it as needs review.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> v5 is attached.

Thank you, it looks good. I find it ready for committer (I noticed the status is already set to that).

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I've revised the patch to add a new constant named
> LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash

Thank you, and good performance. The patch passed make check. I'm OK with the current patch, but I have a few comments. Please take them as you see fit (I wouldn't mind if you don't).

(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64

How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value that triggers shrinkage? Code in PostgreSQL seems to use the term "threshold".

(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif

I don't think these are necessary, because these values are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros:

#define ALLOCSET_DEFAULT_MINSIZE   0
#define ALLOCSET_DEFAULT_INITSIZE  (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE   (8 * 1024 * 1024)

(3)
+	if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+		hash_get_max_bucket(LockMethodLocalHash) >
+		LOCKMETHODLOCALHASH_SHRINK_SIZE)
+		CreateLocalLockHash();

I get the impression that "Create" just creates something where there's nothing. How about Init or Recreate?

Regards
Takayuki Tsunakawa
RE: [PATCH] Speedup truncates of relation forks
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> for (i = 0; i < NBuffers; i++)
> {
> (snip)
>     buf_state = LockBufHdr(bufHdr);
>
>     /* check with the lower bound and skip the loop */
>     if (bufHdr->tag.blockNum < minBlock)
>     {
>         UnlockBufHdr(bufHdr, buf_state);
>         continue;
>     }
>
>     for (k = 0; k < nforks; k++)
>     {
>         if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
>             bufHdr->tag.forkNum == forkNum[k] &&
>             bufHdr->tag.blockNum >= firstDelBlock[k])
>
> But since we acquire the buffer header lock after all and the number
> of the internal loops is small (at most 3 for now) the benefit will
> not be big.

Yeah, so I think we can just compare the block number without locking the buffer header here.

Regards
Takayuki Tsunakawa
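A minimal sketch of that idea: do the racy, unlocked read first as a cheap pre-filter, and re-verify everything under the buffer header lock (this mirrors the unlocked relfilenode pre-check that DropRelFileNodeBuffers already performs, which is safe because no new buffers for the relation can appear during the drop):

    /* Unlocked, possibly stale read: cheap pre-filter only. */
    if (bufHdr->tag.blockNum < minBlock)
        continue;

    buf_state = LockBufHdr(bufHdr);

    /* Recheck all tag fields under the buffer header lock. */
    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
        bufHdr->tag.forkNum == forkNum[k] &&
        bufHdr->tag.blockNum >= firstDelBlock[k])
        InvalidateBuffer(bufHdr);   /* releases the spinlock */
    else
        UnlockBufHdr(bufHdr, buf_state);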
RE: [PATCH] Speedup truncates of relation forks
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> We do RelationTruncate() also when we truncate heaps that are created
> in the current transactions or has a new relfilenodes in the current
> transaction. So I think there is a room for optimization Thomas
> suggested, although I'm not sure it's a popular use case.

Right, and I can't think of a use case that motivates the optimization, either.

> I've not look at this patch deeply but in DropRelFileNodeBuffer I
> think we can get the min value of all firstDelBlock and use it as the
> lower bound of block number that we're interested in. That way we can
> skip checking the array during scanning the buffer pool.

That sounds reasonable, although I haven't examined the code, either.

> Don't we use each elements of nblocks for each fork? That is, each
> fork uses an element at its fork number in the nblocks array and sets
> InvalidBlockNumber for invalid slots, instead of passing the valid
> number of elements. That way the following code that exist at many
> places,

I think the current patch tries to reduce the loop count in DropRelFileNodeBuffers() by passing the number of target forks.

Regards
Takayuki Tsunakawa
RE: [PATCH] Speedup truncates of relation forks
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> Years ago I've implemented an optimization for many DROP TABLE commands
> in a single transaction - instead of scanning buffers for each relation,
> the code now accumulates a small number of relations into an array, and
> then does a bsearch for each buffer.
>
> Would something like that be applicable/useful here? That is, if we do
> multiple TRUNCATE commands in a single transaction, can we optimize it
> like this?

Unfortunately not. VACUUM and autovacuum handle each table in a different transaction.

BTW, what we really want to do is to keep the failover time within 10 seconds. The customer periodically TRUNCATEs tens of thousands of tables. If failover unluckily happens immediately after those TRUNCATEs, the recovery on the standby could take much longer. Your past improvement seems likely to prevent that problem, provided the customer TRUNCATEs the tables in the same transaction. On the other hand, it's quite possible that the customer can only TRUNCATE a single table per transaction, and thus runs as many transactions as there are TRUNCATEd tables. So we also want to speed up each TRUNCATE by touching only the buffers for the table, not scanning the whole of shared buffers. Andres proposed one method that uses a radix tree, but we don't have an idea of how to do it yet.

Speeding up each TRUNCATE and its recovery is a different topic. The patch proposed here is one possible improvement to shorten the failover time.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> Thanks for the info, so I marked the patch as committed.

Thanks a lot for your hard work! This one felt relatively tough despite the simplicity of the patch. I'm starting to feel the difficulty of, and fatigue from, developing in the community...

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> "vacuum_truncate" gets my vote too.

+1

From: 'Andres Freund' [mailto:and...@anarazel.de]
> Personally I think the name just needs some committer to make a
> call. This largely is going to be used after encountering too many
> cancellations in production, and researching the cause. Minor spelling
> differences don't seem to particularly matter here.

Absolutely. Thank you.

From: 'Andres Freund' [mailto:and...@anarazel.de]
> I think it needs to be an autovac option. The production issue is that
> autovacuums constantly cancel queries on the standbys despite
> hot_standby_feedback if you have a workload that does frequent
> truncations. If there's no way to configure it in a way that autovacuum
> takes it into account, people will just disable autovacuum and rely
> entirely on manual scripts. That's what already happens - leading to a
> much bigger foot-gun than disabling truncation. FWIW, people already in
> production use the workaround of configuring snapshot_too_old as that,
> for undocumented reasons, disables truncations. That's not better
> either.

Right. We don't want autovacuum to be considered a criminal.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> And, as far as I can see from a quick review of the thread,
> we don't really have consensus on the names and behaviors.

The consensus on the name seems to be to use "truncate" rather than "shrink" (a few people kindly said they like "shrink", and I'm OK with either name), and there's no dispute about the behavior. Do you see any other open point?

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> It would be good to get your view on the
> shrink_bloated_locallocktable_v3.patch I worked on last night. I was
> unable to measure any overhead to solving the problem that way.

Thanks, it looks super simple and good. As I understand it, the ideas behind your patch are:

* Transactions that touch many partitions and/or tables are a special event, not the norm, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to its original size.

* Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind the other processing (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations).

TBH, the linked list approach feels more intuitive to me, because the resulting code looks like what it wants to do (efficiently iterate over the acquired locks) and is based on the classic textbook. But your approach seems to put people at ease, so I'm OK with your patch. I'm registering you as another author and myself as a reviewer, and marking this ready for committer.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: 'Andres Freund' [mailto:and...@anarazel.de]
> On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> > I think the linked list of LOCALLOCK approach is natural, simple, and
> > good.
>
> Did you see that people measured slowdowns?

Yeah, a 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within the noise, as was mentioned. If we want to remove even the noise, we may have to think of removing LocalLockHash completely, but that doesn't seem feasible...

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win. I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
>
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty
> workloads touch more than a single partitioned table.

I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought so many partitions indicated a bad application design, but their reasoning sounded valid (or at least I couldn't insist it was bad). PostgreSQL is now expected to handle such huge workloads.

From: Andres Freund [mailto:and...@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries. We also
> probably fairly easily could win some space back - e.g. make nLocks 32
> bits.

+1

From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
>
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good. In Jim Gray's classic book "Transaction Processing: Concepts and Techniques", we can find the following sentence in section 8.4.5, "Lock Manager Internal Logic"; the sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks:

"All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback."

Handling this issue with the LOCALLOCK linked list is also more natural than resizing the hash table. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism to find a particular item quickly, so it was simply wrong to use the hash table to iterate over all grabbed locks. Also, the hash table got big because some operation in the session needed it, and some subsequent operations in the same session may need it again. So shrinking the hash table wouldn't really put us at ease.

Regards
Takayuki Tsunakawa
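As a sketch, the linked-list idea needs only a list link inside each LOCALLOCK and a backend-local list head, using PostgreSQL's ilist.h (field and variable names here are hypothetical):

    #include "lib/ilist.h"

    typedef struct LOCALLOCK
    {
        /* ... existing fields ... */
        dlist_node  lockLink;   /* hypothetical: chains all held locks */
    } LOCALLOCK;

    static dlist_head heldLocalLocks = DLIST_STATIC_INIT(heldLocalLocks);

    /*
     * On first-time acquisition of a lock:
     *     dlist_push_tail(&heldLocalLocks, &locallock->lockLink);
     *
     * In LockReleaseAll(), iterate only over what is actually held:
     *     dlist_foreach_modify(iter, &heldLocalLocks) ...
     */

This makes release time proportional to the number of locks actually held, at the price Tom measured: the two extra pointers grow sizeof(LOCALLOCK) from 72 to 88 bytes on a 64-bit machine.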
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
Hi Andres, Fujii-san, and any committer,

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-08 09:52:27 +0900, Fujii Masao wrote:
> > I'm thinking to commit this patch at first. We can change the term
> > and add the support of "TRUNCATE" option for VACUUM command later.
>
> I hope you realize feature freeze is in a few hours...

Ouch! Could you take care of committing the patch, please? I wouldn't be able to express the sadness and tiredness if this were pushed to PG 13 just because of the parameter name...

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Michael Paquier [mailto:mich...@paquier.xyz]
> I have just committed the GUC and libpq portion for TCP_USER_TIMEOUT after
> a last lookup, and I have cleaned up a couple of places.

Thank you for the further cleanup and committing.

> For the socket_timeout stuff, its way of solving the problem it thinks it
> solves does not seem right to me, and this thread has not reached a consensus
> anyway, so I have discarded the issue.
>
> I am marking the CF entry as committed. In the future, it would be better
> to not propose multiple concepts on the same thread, and if the
> socket_timeout business is resubmitted, I would suggest a completely new
> CF entry, and a new thread.

Understood. Looking back at the review process, tcp_user_timeout and socket_timeout should indeed have been handled in separate threads.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The first letter should be upper-case.

Thank you for taking care of this patch, and sorry for causing you the trouble of fixing that...

> to me that socket_timeout_v14.patch should be rejected as it could cause
> a connection to go down with no actual reason and that the server should
> be in charge of handling timeouts. Is my impression right?

No, the connection goes down for the good reason that the client could not get a response within a tolerable amount of time.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Amit-san, Imai-san,

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> I was able to detect it as follows.
>
> plan_cache_mode = auto
>     HEAD: 1915 tps
>  Patched: 2394 tps
>
> plan_cache_mode = custom (non-problematic: generic plan is never created)
>     HEAD: 2402 tps
>  Patched: 2393 tps

Thanks a lot for the very quick confirmation. I'm relieved to still see good results.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > "VACUUM" needs or "vacuum" is more appropriate here? Looking at the same file and some other files, "vacuum" looks appropriate because it represents the vacuum action, not the specific VACUUM command. > The format of the documentation of new option is a bit weird. Could it > be possible to adjust it around 80 characters per line like other > description? Ah, fixed. It's easy to overlook the style with the screen reader software... I've been wondering if there are good settings for editing .sgml in Emacs that, for example, puts appropriate number of spaces at the beginning of each line when is pressed, automatically break the long line and put spaces, etc. From: Julien Rouhaud [mailto:rjuju...@gmail.com] > also, the documentation should point out that freeing is not > guaranteed. Something like > > + The default is true. If true, VACUUM will try to free empty > pages at the end of the table. That's nice. Done. > > I'm not sure the consensus we got here but we don't make the vacuum > > command option for this? > > I don't think here's a clear consensus, but my personal vote is to add > it, with SHRINK_TABLE = [ force_on | force_off | default ] (unless a > better proposal has been made already) IMO, which I mentioned earlier, I don't think the VACUUM option is necessary because: (1) this is a table property which is determined based on the expected workload, not the one that people want to change per VACUUM operation (2) if someone wants to change the behavior for a particular VACUUM operation, he can do it using ALTER TABLE SET. Anyway, we can add the VACUUM option separately if we want it by all means. I don't it to be the blocker for this feature to be included in PG 12, because the vacuum truncaton has been bothering us like others said in this and other threads... Regards Takayuki Tsunakawa disable-vacuum-truncation_v6.patch Description: disable-vacuum-truncation_v6.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

That's strange... Peter, Imai-san, can you compare your test procedures? Peter, could you instead check for the performance improvement with David's method of March 26?

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions. I can't detect any
> performance improvements. In fact, within the noise, it tends to be
> just a bit on the slower side.
>
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
>
> Perhaps "speeding up planning with partitions" needs to be accepted first?

David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> reloption for TOAST is also required?

# I've come back to the office earlier than planned...

Hmm, there's no reason not to provide toast.vacuum_shrink_enabled. Done in the attached patch.

Regards
Takayuki Tsunakawa

Attachment: disable-vacuum-truncation_v5.patch
RE: Libpq support to connect to standby server as priority
Hi Hari-san,

I've reviewed all the files. The patch would be OK when the following have been fixed, except for the complexity of fe-connect.c (which probably cannot be improved). Unfortunately, I'll be absent next week. The earliest date I can do the test will be April 8 or 9. I hope someone could take care of this patch...

(1) 0001
With this read-only option type, application can connect to
to a read-only server in the list of hosts, in case
...
before issuing the SHOW transaction_readonly to find out whether

"to" appears twice in a row.
transaction_readonly -> transaction_read_only

(2) 0001
+succesful connection or failure.

succesful -> successful

(3) 0008
to conenct to a standby server with a faster check instead of

conenct -> connect

(4) 0008
Logically, recovery exit should be notified after the following statement:

    XLogCtl->SharedRecoveryInProgress = false;

(5) 0008
+	/* Update in_recovery status. */
+	if (LocalRecoveryInProgress)
+		SetConfigOption("in_recovery",
+						"on",
+						PGC_INTERNAL, PGC_S_OVERRIDE);

This SetConfigOption() is called for every RecoveryInProgress() call on the standby. It may cause undesirable overhead. How about just calling SetConfigOption() once in InitPostgres() to set the initial value for in_recovery? InitPostgres() and its subsidiary functions call SetConfigOption() likewise.

(6) 0008
async.c is for LISTEN/UNLISTEN/NOTIFY. How about adding the new functions in postgres.c, like RecoveryConflictInterrupt()?

(7) 0008
+	if (pid != 0)
+	{
+		(void) SendProcSignal(pid, reason, procvxid.backendId);
+	}

The braces are not necessary, because the block only contains a single statement.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
Nagaura-san,

The socket_timeout patch needs the following fixes. Now that others have already tested these patches successfully, they appear committable to me.

(1)
+	else
+		goto iiv_error;
...
+
+iiv_error:
+	conn->status = CONNECTION_BAD;
+	printfPQExpBuffer(&conn->errorMessage,
+					  libpq_gettext("invalid integer value for socket_timeout\n"));
+	return false;

This goto and its corresponding iiv_error label are redundant. You can just set the error message and return at the call site of parse_int_param(), i.e.:

	if (!parse_int_param(...))
	{
		/* set the error message */
		return false;
	}
	if (conn->socket_timeout > 0 && conn->socket_timeout < 2)
		conn->socket_timeout = 2;

The reason why the oom_error label is present is that it is used at multiple places to avoid repeating the same error-processing code.

(2)
+	conn->sock = -1;

Use PGINVALID_SOCKET instead of -1.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
Nagaura-san,

The client-side tcp_user_timeout patch looks good. The server-side tcp_user_timeout patch needs the following fixes:

(1)
+	GUC_UNIT_MS | GUC_NOT_IN_SAMPLE
+	12000, 0, INT_MAX,

GUC_NOT_IN_SAMPLE should be removed, because the parameter appears in postgresql.conf.sample. The default value should be 0, not 12000.

(2)
Now that you've changed show_tcp_usertimeout() to use pq_gettcpusertimeout(), you need to modify pq_settcpusertimeout() as well; just imitate pq_setkeepalivesidle(). Otherwise, SHOW would display a wrong value.

Regards
Takayuki Tsunakawa
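A hedged sketch of what imitating pq_setkeepalivesidle() could look like (simplified; the real function needs more guards, e.g. for Unix-domain sockets):

    int
    pq_settcpusertimeout(int timeout, Port *port)
    {
        if (port == NULL || port->sock == PGINVALID_SOCKET)
            return STATUS_OK;

    #ifdef TCP_USER_TIMEOUT
        if (port->tcp_user_timeout == timeout)
            return STATUS_OK;       /* nothing to do */

        if (setsockopt(port->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                       (char *) &timeout, sizeof(timeout)) < 0)
            return STATUS_ERROR;

        /* Remember the value so that SHOW reports it correctly. */
        port->tcp_user_timeout = timeout;
    #endif

        return STATUS_OK;
    }

The key line for the review comment is the final assignment: caching the value in Port is what keeps SHOW tcp_user_timeout from displaying a stale number.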
RE: Timeout parameters
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > +if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> > +               (char *) &timeout, sizeof(timeout)) < 0 && errno != ENOPROTOOPT)
> > +{
> > +    char sebuf[256];
> > +
> > +    appendPQExpBuffer(&conn->errorMessage,
> > +                      libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
> >
> > I suppose that the reason ENOPROTOOPT is excluded from error
> > condition is that the system call may fail with that errno on
> > older kernels, but I don't think that that justifies ignoring the
> > failure.
>
> I think that's for the case where the module is built on an OS that supports
> TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the module is used
> on an older OS that doesn't support TCP_USER_TIMEOUT. I remember I was
> sometimes able to do such a thing on Linux and Solaris. If we don't have
> to handle such usage, I agree about removing the special handling of
> ENOPROTOOPT.

Oops, I agree that we should return an error even in the ENOPROTOOPT case, because setsockopt() is called only when the user specifies tcp_user_timeout.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> +if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> +               (char *) &timeout, sizeof(timeout)) < 0 && errno != ENOPROTOOPT)
> +{
> +    char sebuf[256];
> +
> +    appendPQExpBuffer(&conn->errorMessage,
> +                      libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
>
> I suppose that the reason ENOPROTOOPT is excluded from error
> condition is that the system call may fail with that errno on
> older kernels, but I don't think that that justifies ignoring the
> failure.

I think that's for the case where the module is built on an OS that supports TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the module is used on an older OS that doesn't support TCP_USER_TIMEOUT. I remember I was sometimes able to do such a thing on Linux and Solaris. If we don't have to handle such usage, I agree about removing the special handling of ENOPROTOOPT.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com]
> You're both right and I'm wrong.
>
> However, I think it would be better to stick with the term 'truncate'
> which is widely-used already, rather than introducing a new term.

Yeah, I have the same feeling. OTOH, as I mentioned earlier in this thread, "shrink" is used instead of "truncate" in the PostgreSQL documentation, so I chose shrink. To repeat myself, I'm comfortable with either word. I'd like the committer to choose whichever he thinks better.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
I've looked through 0004-0007. I've only found the following:

(5) 0005
With this read-only option type, application can connect to
connecting to a read-only server in the list of hosts, in case
if there is any read-only servers available, the connection
attempt fails.

"connecting to" can be removed.
"in case if there is any read-only servers" -> "if there is no read-only server"

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> while going through the old patch where the GUC_REPORT is implemented, Tom
> has commented the logic of sending the signal to all backends to process
> the hot standby exit with SIGHUP, if we add the logic of updating the GUC
> variable value in SIGHUP, we may need to change all the SIGHUP handler code
> paths. It is also possible that there is no need to update the variable
> for other processes except backends.
>
> If we go with adding the new SIGUSR1 type to check and update the GUC
> variable, it can work for most of the backends and background workers.
>
> opinions

SIGUSR1 looks reasonable. We can consider it as notifying that the server status has changed.

I've fully reviewed 0001-0003, and my comments follow. I'll review 0004-0007.

(1) 0001
before issuing the transaction_readonly to find out whether
the server is read-write or not is restuctured under a new

transaction_readonly -> "SHOW transaction_read_only"
restuctured -> restructured

(2) 0001
+succesful connection or failure.
+successful connection; if it returns on, means server

succesful -> successful
means -> it means

(3) 0003
+But for servers version 12 or greater uses the transaction_read_only
+GUC that is reported by the server upon successful connection.

GUC doesn't seem to be a term to be used in the user manual. Instead:

uses the value of the transaction_read_only configuration parameter that is...

as in:

https://www.postgresql.org/docs/devel/libpq-connect.html

client_encoding
This sets the client_encoding configuration parameter for this connection.

application_name
Specifies a value for the application_name configuration parameter.

(4) 0003
	bool	std_strings;	/* standard_conforming_strings */
+	bool	transaction_read_only;	/* session_read_only */

Looking at the comment for std_strings, it's better to change the comment for transaction_read_only so that it names the backing configuration parameter.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> On Wed, Mar 27, 2019 at 2:30 AM Robert Haas wrote:
> > On Tue, Mar 26, 2019 at 11:23 AM Masahiko Sawada wrote:
> > > > I don't see a patch with the naming updated, here or there, and I'm
> > > > going to be really unhappy if we end up with inconsistent naming
> > > > between two patches that do such fundamentally similar things. -1
> > > > from me to committing either one until that inconsistency is resolved.
> > >
> > > Agreed. I've just submitted the latest version patch that adds
> > > INDEX_CLEANUP option and vacuum_index_cleanup reloption. I already
> > > mentioned on that thread but I agreed with adding phrase positively
> > > than negatively. So if we got consensus on such naming the new options
> > > added by this patch could be something like SHRINK option (with
> > > true/false) and vacuum_shrink reloption.
> >
> > No, that's just perpetuating the problem. Then you have an option
> > SHRINK here that you set to TRUE to skip something, and an option
> > INDEX_CLEANUP over there that you set to FALSE to skip something.
>
> Well, I imagined that both INDEX_CLEANUP option and SHRINK option (or
> perhaps TRUNCATE option) should be true by default. If we want to skip
> some operation of vacuum we can set each options to false like "VACUUM
> (INDEX_CLEANUP false, SHRINK true, VERBOSE true)". I think that
> resolves the problem but am I missing something?

I have almost the same view as Sawada-san. The reloption vacuum_shrink_enabled is a positive name and follows the naming style of other reloptions. I hope this matches the style you have in mind.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Here a benchmark doing that using pgbench's script weight feature.

Wow, I didn't know that pgbench had evolved to have such a convenient feature. Thanks for telling me how to utilize it in testing. PostgreSQL is cool!

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > My understanding of what David wrote is that the slowness of bloated hash > table is hard to notice, because planning itself is pretty slow. With the > "speeding up planning with partitions" patch, planning becomes quite fast, > so the bloated hash table overhead and so your patch's benefit is easier > to notice. This patch is clearly helpful, but it's just hard to notice > it > when the other big bottleneck is standing in the way. Ah, I see. I failed to recognize the simple fact that without your patch, EXECUTE on a table with many partitions is slow due to the custom planning time proportional to the number of partitions. Thanks for waking up my sleeping head! Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
>
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
>
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets. Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you David for explaining. Although I may not understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.

Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.

Regards
Takayuki Tsunakawa
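A minimal sketch of that sequence (table and statement names invented for illustration; assume t has many, say 1,000, hash partitions):

  CREATE TABLE t (pkey int PRIMARY KEY, val int) PARTITION BY HASH (pkey);
  -- CREATE TABLE t_0 PARTITION OF t FOR VALUES WITH (MODULUS 1000, REMAINDER 0); ... and so on

  SELECT count(*) FROM t;                             -- locks every partition; LockMethodLocalHash expands
  PREPARE q(int) AS SELECT * FROM t WHERE pkey = $1;
  BEGIN; EXECUTE q(1); COMMIT;                        -- LockReleaseAll() scans the bloated hash table
  BEGIN; EXECUTE q(2); COMMIT;                        -- ... and again at every transaction end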
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: David Steele [mailto:da...@pgmasters.net]
> This patch appears to have been stalled for a while.
>
> Takayuki -- the ball appears to be in your court. Perhaps it would be
> helpful to summarize what you think are next steps?

disable_index_cleanup is handled by Sawada-san in another thread. I understand I've reflected all review comments in the latest patch, and replied to the opinions/proposals, so the patch status is kept "needs review." (I hope no new fire will break out...)

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Robert Haas [mailto:robertmh...@gmail.com]
> I really dislike having both target_session_attrs and
> target_server_type. It doesn't solve any actual problem. master,
> slave, prefer-slave, or whatever you like could be put in
> target_session_attrs just as easily, and then we wouldn't end up with
> two keywords doing closely related things. 'master' is no more or
> less a server attribute than 'read-write'.

Hmm, that may be OK. At first, I felt it strange to treat the server type (primary or standby) as a session attribute. But we can see the server type as one attribute of the session being established. I'm inclined to agree with:

target_session_attrs = {any | read-write | read-only | prefer-read | primary | standby | prefer-standby}

Regards
Takayuki Tsunakawa
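With the merged option, a failover-ready connection could then be written in the style of the existing psql examples (prefer-standby is one of the values proposed here, not yet an accepted one):

  => \c "host=host1,host2 target_session_attrs=prefer-standby"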
RE: [survey] New "Stable" QueryId based on normalized query text
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > > needs.1: stable across different databases,
> >
> > Does this mean different database clusters, not different databases in
> > a single database cluster?
>
> Does this mean you want a different QueryID for the same-looking
> query for another database in the same cluster?

(I'm afraid this question may be targeted at legrand legrand, not me...) I think the same query text can have either the same or a different QueryID in different databases in the database cluster. Even if the QueryID value is the same, we can use the DatabaseID to choose the desired information.

Regards
Takayuki Tsunakawa
RE: [survey] New "Stable" QueryId based on normalized query text
From: legrand legrand [mailto:legrand_legr...@hotmail.com]
> There are many projects that use an alternate QueryId
> distinct from the famous pg_stat_statements jumbling algorithm.

I'd like to welcome a standard QueryID that DBAs and extension developers can depend on. Are you surveying these needs so that you can develop a QueryID that meets as many of them as possible?

> needs.1: stable across different databases,

Does this mean different database clusters, not different databases in a single database cluster?

needs.5: minimal overhead to calculate
needs.6: doesn't change across database server restarts
needs.7: same value on both the primary and standby?

> norm.9: comments aware

Is this to distinguish queries that have different comments for optimizer hints? If yes, I agree.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] > Fixed. Rebased on HEAD. Regards Takayuki Tsunakawa 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch 0002-speed-up-LOCALLOCK-scan.patch Description: 0002-speed-up-LOCALLOCK-scan.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > Your changes in LOCALLOCK still refer to PGPROC, from your first version > of the patch. > > I think the reordering of struct members could be done as a separate > preliminary patch. > > Some more documentation in the comment before dlist_head LocalLocks to > explain this whole mechanism would be nice. Fixed. > You posted a link to some performance numbers, but I didn't see the test > setup explained there. I'd like to get some more information on this > impact of this. Is there an effect with 100 tables, or do you need 10? Imai-san, can you tell us the test setup? Regards Takayuki Tsunakawa 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch 0002-speed-up-LOCALLOCK-scan.patch Description: 0002-speed-up-LOCALLOCK-scan.patch
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think so. I think it's just a weirdly-designed parameter
> without a really compelling use case. Enforcing limits on the value
> of the parameter doesn't fix that. Most of the reviewers who have
> opined so far have been somewhere between cautious and negative about
> the value of that parameter, so I think we should just not add it. At
> least for now.

I don't think socket_timeout is so bad. I think Nagaura-san and I presented the use case, giving an answer to every question and concern. OTOH, it may be better to commit the tcp_user_timeout patch when Nagaura-san has refined the documentation, and then continue with socket_timeout.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Target_session_attrs    Target_server_type
>
> read-write              prefer-slave, slave
> prefer-read             master, slave
> read-only               master, prefer-slave
>
> I know that some of the cases above are possible, like a master server
> that by default accepts read-only sessions. Instead of putting in a check
> to validate which combinations are right, how about allowing all the
> combinations, and in case such a combination is not satisfiable (meaning
> there shouldn't be any server that supports the requirement), the
> connection fails.
>
> Comments?

I think that's OK. To follow the existing naming, it seems better to use "primary" and "standby" instead of master and slave -- primary_conninfo, synchronous_standby_names, hot_standby, max_standby_streaming_delay and such.

> And also as we need to support the new option to connect to servers < 12
> as well, this option sends the command "select pg_is_in_recovery()" to the
> server to find out whether the server is in recovery mode or not?

The query looks good. OTOH, I think we can return an error when target_server_type is specified against older servers, because the parameter is new, if supporting older servers would make the libpq code uglier.

> And also regarding the implementation point of view, the new
> target_server_type option validation is separately handled, meaning the
> check for the required server is handled in a separate switch case; when
> both options are given, first find out the required server for
> target_session_attrs and validate the same again with target_server_type?

Logically, it seems the order should be the reverse; check the server type first, then the session attributes, considering other session attributes in the future.

Regards
Takayuki Tsunakawa
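For reference, the fallback check against older servers uses an existing function:

  SELECT pg_is_in_recovery();  -- true on a standby (still in recovery), false on a primary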
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > Based on your comment it seems to me that 'socket_timeout' should be > connected with statement_timeout. I mean that end-user should wait > statement_timeout + 'socket_timeout' for returning control. It looks much > more safer for me. I'm afraid we cannot enforce that relationship programmatically, so I think the documentation should warn that socket_timeout be longer than statement_timeout. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > In case of failure PQcancel() terminates in 'socket_timeout'. So, control > to the end-user in such a failure situation will be returned in 2 * > 'socket_timeout' interval. It is much better than hanging forever in some > specific cases. Moreover, such solution will not lead to the overloading > of PostgreSQL server by abnormally ignored 'heavy' queries results by > end-users. Oops, unfortunately, PQcancel() does not follow any timeout parameters... It uses a blocking socket. Also, I still don't think it's a good idea to request cancellation. socket_timeout should be sufficiently longer than the usually expected query execution duration. And long-running queries should be handled by statement_timeout which indicates the maximum tolerable query execution duration. For example, if the usually expected query execution time is 100 ms, statement_timeout can be set to 3 seconds and socket_timeout to 5 seconds. Regards Takayuki Tsunakawa
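With those example values, the settings would look like this (socket_timeout is the libpq connection parameter proposed in this thread, not an existing one; statement_timeout is an existing server parameter):

  => \c "host=db1 dbname=mydb socket_timeout=5"   -- proposed: last-resort client-side limit on each socket wait
  => SET statement_timeout = '3s';                -- existing: the server cancels queries running longer than this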
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> Do you mind me asking whether you have considered that solving your problem
> can lead to problems in other user applications?
> Let's imagine a possible problem:
> 1. The end-user sets 'socket_timeout' only for the current session.
> 2. 'socket_timeout' is much shorter than 'keep_alive', 'tcp_user_timeout'
>    and 'statement_timeout'.
> 3. The end-user invokes a 'heavy' query which is not possible for the
>    PostgreSQL server to process within 'socket_timeout'.
> 4. If the query fails due to the timeout, the session is terminated and
>    the query repeated.
>
> As the query is not cancelled, the PostgreSQL server will process it till
> completion or 'statement_timeout' expiration. In such a case the PostgreSQL
> server can be overloaded by 'unreliable' users, and other end-users
> will not be able to work with the PostgreSQL server as they expected.

Yes, so I think it would be necessary to describe how to set socket_timeout in relation to other timeout parameters -- socket_timeout > statement_timeout -- emphasizing that socket_timeout is not for canceling long-running queries but for returning control to the client.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> Now you might say - what if the server is stopped not because of
> SIGSTOP but because of some other reason, like it's waiting for a
> lock? Well, in that case, the database server is still functioning,
> and you will not want the connection to be terminated if no notifies
> are sent during the lock wait but not terminated if they are sent. At
> least, I can't see why anyone would want that.

Yes, so I think it would be kind to describe how to set socket_timeout in relation to other timeout parameters -- socket_timeout > statement_timeout > lock_timeout, for example.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com] > One other thing -- I looked a bit into the pgsql-jdbc implementation > of a similarly-named option, and it does seem to match what you are > proposing here. I wonder what user experiences with that option have > been like. One case I faintly recall is that some SQL execution got stuck in NFS access in the server, and statement_timeout didn't work (probably due to inability to cancel NFS read/write operations). The user chose to use PgJDBC's socketTimeout. I'm not sure whether this use case is fundamental, because I wonder if NFS mount options could have been the solution. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > > For example, OS issues such as abnormally (buggy) slow process scheduling > or paging/swapping that prevent control from being passed to postgres. Or, > abnormally long waits on lwlocks in postgres. statement_timeout doesn't > take effect while waiting on a lwlock. I have experienced both. And, any > other issues in the server that we haven't experienced and not predictable. > > For me all mentioned by Takayuki Tsunakawa problems looks like a lack of > design of end-user application or configuration of DB server. It is not > a problem of PostgreSQL. Certainly, my cases could be avoided by the OS and database configuration. But we cannot say all such problems can be avoided by configuration in advance. The point is to provide a method to get ready for any situation. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr] > I think that the typical use-case of \c is to connect to another database > on the same host, at least that what I do pretty often. The natural > expectation is that the same "other" connection parameters are used, > otherwise it does not make much sense, and there is already a whole logic > of reusing the previous settings in psql, at least wrt describing the > target (host, port, user…). I agree about the natural expectation. > > Anyway, I think this would be an improvement for psql's documentation > or > > new feature for psql. What do you think? > > I think that we should fix the behavior rather than document the current > weirdness. I do not think that we should introduce more weirdness. Do you think all connection parameter settings should be reused by default, or when -reuse-previous=on is specified? Do you think this is a bug to be backported to previous versions, or a new feature? I think I'll make a patch separately from this thread, if time permits. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> If so, in turn the socket_timeout doesn't work as expected? I
> understand that what is proposed here is to disconnect after that
> time of waiting for *the first tuple* of a query, regardless of
> whether it is a long query or a network failure. One of the problems
> raised here is that the socket_timeout patch doesn't work that way?

No, the proposal is to set the timeout for every socket read/write, like PgJDBC. It is not restricted to an SQL command execution; it applies to any communication with the server that could block the client.

> I can't imagine the cases where applications cannot be rebuilt for
> some reason. (E.g. the source code has been lost or the source code
> no longer builds on a modern environment..)
>
> If an application is designed to have such a requirement, mustn't
> it have implemented that by itself by some means? For example, an
> application on JDBC can be designed to kill a querying thread
> that takes longer than a threshold.

Any paid or free applications whose source code is closed. Also, ordinary users don't want to modify the psql and pgAdmin source code by themselves, do they?

> Doesn't TCP_KEEPALIVE work that way?
>
> statement_timeout works on a live connection, TCP_USER_TIMEOUT
> works for an already failed network, but if the network fails
> after a backend has already sent the ack for a query request, it
> doesn't work. TCP_KEEPALIVE would work for the case where no
> packet reaches either side for a certain time. Are there any
> other situations to save?

For example, OS issues such as abnormally (buggy) slow process scheduling or paging/swapping that prevent control from being passed to postgres. Or, abnormally long waits on lwlocks in postgres. statement_timeout doesn't take effect while waiting on an lwlock. I have experienced both. And, any other issues in the server that we haven't experienced and that are not predictable.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> But that's not what it will do. As long as the server continues to
> dribble out protocol messages from time to time, the timeout will
> never fire no matter how much time passes. I saw a system once where
> every 8kB read took many seconds to complete; such a system could
> dribble out sequential scan results over an arbitrarily long period of
> time without ever tripping the timeout.

I understood that the example is about a SELECT that returns multiple rows. If so, statement_timeout should handle it, shouldn't it?

> If you really want to return
> control to the user in any situation, what you can do is use the libpq
> APIs in asynchronous mode which, barring certain limitations of the
> current implementation, will actually allow you to maintain control
> over the connection at all times.

Maybe. But the users aren't often in a situation to modify the application to use the libpq asynchronous APIs.

> I think the use case for a timeout that has both false positives (i.e.
> it will fire even when there's no problem, as when the connection is
> legitimately idle) and false negatives (i.e. it will fail to trigger
> when there is a problem, as when there are periodic notices or
> notifies from the server connection) is extremely limited if not
> nonexistent, and I think the potential for users to be confused is
> really high.

My understanding is that the false positive case doesn't occur, because libpq doesn't wait on the socket while the client is idle and not communicating an SQL request/response. As for the false negative case, resetting the timer upon receipt of notices or notifies is good, because they show that the database server is functioning. socket_timeout is not a mechanism to precisely limit the duration of a query request/response. It is kind of a stop-gap, last resort to assure the return of control within a reasonable amount of time, rather than minutes or hours.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> The first thing I notice about the socket_timeout patch is that the
> documentation is definitely wrong:

Agreed. I suppose the description should be clearer about:

* the purpose and what situation this timeout will help: not for canceling a long-running query or recovering from a network failure.
* the relationship with other timeout parameters: e.g., socket_timeout > connect_timeout, socket_timeout > statement_timeout, socket_timeout > TCP keepalive/user timeout.

> Actually, I think that socket_timeout_v8.patch is a bad idea and
> should be rejected, not because it doesn't send a cancel to the server
> but just because it's not the right way of solving either of the
> problems that it purports to solve. If you want a timeout for
> queries, you need that to be administered by the server, which knows
> when it starts and finishes executing a query; you cannot substitute a
> client-side judgement based on the last time we received a byte of
> data. Maybe a client-side judgement would work if the timer were
> armed when we send a Query or Execute message and disarmed when we
> receive ReadyForQuery. And, if you want a network problem detector,
> then you should use keepalives and perhaps the TCP_USER_TIMEOUT stuff,
> so that you don't kill perfectly good connections that just happen to
> be idle.

I think the purpose of socket_timeout is to avoid an infinite or unduly long wait and return a response to users, where the other existing timeout parameters wouldn't help. For example, the OS's process scheduling or paging/swapping problems might cause long waits before postgres gets control (e.g. due to Transparent HugePage (THP)). Within postgres, the unfair lwlock can cause unexpectedly long waits (I experienced dozens of seconds per wait on ProcArrayLock, which was unbelievable.) Someone may say that those should be fixed or improved instead of introducing this parameter, but I think it's good to have this as a stop-gap measure. In other words, we can suggest setting this parameter when the user asks "How can I return control to the end user in any situation?"

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> >> If the user reconnects, eg "\c db", the setting is lost. The
> >> re-connection handling should probably take care of this parameter,
> >> and maybe others.
>
> > I think your opinion is reasonable, but it seems not in this thread.
>
> I think that your patch is responsible for the added option at least.
>
> I agree that the underlying issue, that other parameters should probably
> also be reused, which would be a bug fix, does not belong to this thread.

This doesn't seem to be a bug. \connect just establishes a new connection, not reusing the previous settings for most connection parameters. As the examples in the following manual page show, the user needs to specify the necessary connection parameters.

https://www.postgresql.org/docs/devel/app-psql.html

=> \c service=foo
=> \c "host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable"

But I'm afraid the description of \connect may not be clear enough about connection parameters, and that may cause users to expect the reuse of all connection parameter settings. Anyway, I think this would be an improvement for psql's documentation or a new feature for psql. What do you think?

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi/出利葉 健
> [Size=800, iter=1,000,000]
> Master |15.763
> Patched|16.262 (+3%)
>
> [Size=32768, iter=1,000,000]
> Master |61.3076
> Patched|62.9566 (+2%)

What's the unit, second or millisecond? Why do the numbers of digits to the right of the decimal point differ? Is the measurement correct? I'm wondering because the difference is larger in the latter case. Isn't the accounting processing almost the same in both cases?

* former: 16.262 - 15.763 = 0.499
* latter: 62.9566 - 61.3076 = 1.649

> At least compared to the previous HashAgg version, the overhead is smaller.
> It has some overhead but the increase is a little bit, by 2 or 3%?

I think the overhead is sufficiently small. It may get even smaller with a trivial tweak. You added the new member usedspace at the end of MemoryContextData. The original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes. So, the new member will be on a separate cache line. Try putting usedspace before the name member.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > Robert used the phrase "attractive nuisance", which maybe sounds like a > good thing to have to a non native speaker, but it actually isn't -- he > was saying we should avoid a GUC at all, and I can see the reason for > that. I think we should have a VACUUM option and a reloption, but no > GUC. Uh, thanks. I've just recognized I didn't know the meaning of "nuisance." I've looked up the meaning in the dictionary. Nuisance is like a trouble maker... Why do you think that it's better for VACUUM command to have the option? I think it's a table property whose value is determined based on the application workload, not per VACUUM execution. Rather, I think GUC is more useful to determine the behavior of the entire database and/or application. If we want to change a given execution of VACUUM, then we can ALTER TABLE SET, VACUUM, and ALTER TABLE SET back. Regards Takayuki Tsunakawa
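A sketch of the workaround just described, using the reloption name from this thread (not an existing option):

  ALTER TABLE mytable SET (vacuum_shrink_enabled = off);
  VACUUM mytable;                                   -- truncation is skipped for this run
  ALTER TABLE mytable SET (vacuum_shrink_enabled = on);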
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> So could you consider adding an option for the VACUUM command as well
> as vacuumdb? The interaction with the current patch is that you need to
> define the behavior at the beginning of vacuum for a given heap, instead
> of reading the parameter at the time the truncation happens, and give

I'm not confident whether this is the same as the above, but what I imagined is this:

* Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default. This follows the naming style "verb_object" like log_connections and enable_xxx. We may want to use enable_vacuum_shrink or something like that, but enable_xxx seems to be used solely for planner control. Plus, vacuum-related parameters seem to start with vacuum_.

* Give priority to the reloption, because it's targeted at a particular table. If the reloption is not set, the GUC takes effect.

* As a consequence, the user can change the behavior of the VACUUM command by SETting the GUC in the same session in advance, when the reloption is not set. If the reloption is set, the user can ALTER TABLE SET, VACUUM, and ALTER TABLE again to restore the table's setting. But I don't think this use case (changing whether to shrink per VACUUM command execution) is necessary; it is merely possible.

Regards
Takayuki Tsunakawa
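A sketch of the proposed precedence (both parameter names are proposals from this thread, not existing ones):

  SET vacuum_shrink_table = off;                    -- proposed USERSET GUC: session-wide default
  VACUUM t1;                                        -- no reloption set, so the GUC applies: no truncation
  ALTER TABLE t2 SET (vacuum_shrink_enabled = on);
  VACUUM t2;                                        -- the reloption overrides the GUC: truncation happens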
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench, which he made a while ago in the similar discussion.
> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
>
> This tool is a little bit outdated so I fixed it, but basically I followed him.
> Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for a 32kB area 1,000,000 times.
> - and measure this time
>
> The result shows that master is 30 times faster than the patched one.
> So as Andres mentioned in the upper thread it seems it has overhead.
>
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
>
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms

I'm afraid the measurement is not correct. First, the older discussion below shows that the accounting overhead is much, much smaller, even with a more complex accounting.

9.5: Better memory accounting, towards memory-bounded HashAgg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory > 8 KB calls malloc()/free(). I guess the accounting overhead will be more likely to be hidden under the overhead of malloc() and free(). What we'd like to know is the overhead when malloc() and free() are not called.

And are you sure you didn't enable assert checking?

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] > Attached are the updated patches. Thanks, all look fixed. > The target_server_type option yet to be implemented. Please let me review once more and proceed to testing when the above is added, to make sure the final code looks good. I'd like to see how complex the if conditions in multiple places would be after adding target_server_type, and consider whether we can simplify them together with you. Even now, the if conditions seem complicated to me... that's probably due to the existence of read_write_host_index. Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me. It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

Ouch, I sent my previous mail before reading this. I can understand it may be cumbersome to identify and specify each table, so I tend to agree with having the parameter in postgresql.conf, as USERSET to allow ALTER DATABASE/USER SET to tune specific databases and applications. But should the vacuuming of system catalogs also follow this setting?

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> This makes the test page-size sensitive. While we don't ensure that tests
> can be run with different page sizes, we should make a maximum effort to
> keep the tests compatible if that's easy enough. In this case you could
> just use > 0 as the base comparison. I can fix that by myself, so no need
> to send a new version.

Good idea. Done.

> Should we also document that the parameter is effective for autovacuum?
> The name can lead to confusion regarding that.

I'm not sure of the need, because autovacuum is just an automatic execution of vacuum, and existing vacuum_xxx parameters also apply to autovacuum. But being specific is good anyway, so I added a reference to autovacuum in the description.

> Also, shouldn't the relopt check happen in should_attempt_truncation()?
> It seems to me that if we use this routine somewhere else then it should
> be controlled by the option.

That makes sense. Done.

> At the same time, we also have REL_TRUNCATE_FRACTION and
> REL_TRUNCATE_MINIMUM which could be made equally user-tunable.
> That's more difficult to control, still why don't we also consider this
> part?

I thought of it, too. But I didn't have a good idea on how to explain those parameters. I'd like to separate it.

> Another thing that seems worth thinking about is a system-level GUC, and
> an option in the VACUUM command to control if truncation should happen or
> not. We have a lot of infrastructure to control such options between vacuum
> and autovacuum, so it could be a waste to not consider potential synergies.

Being able to specify this parameter in postgresql.conf and with SET (especially ALTER DATABASE/USER to target specific databases/applications) might be useful, but I'm not sure... I'm less confident about whether the VACUUM command should accept this, because it is a property of a table under a specific workload, not a changeable property of each VACUUM run. Anyway, I expect it won't be difficult to add that configurability without contradicting the design, so I'm inclined to separate it.

From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Yeah, that would work. Or, it's kind of hacky, but rolling back the
> insertion instead of INSERT and DELETE might also work.

That's good, because it allows us to keep running the reloptions test in parallel with other tests. Done.

Regards
Takayuki Tsunakawa

disable-vacuum-truncation_v4.patch
Description: disable-vacuum-truncation_v4.patch
RE: [RFC] [PATCH] Flexible "partition pruning" hook
From: Mike Palmiotto [mailto:mike.palmio...@crunchydata.com]
> Attached is a patch which attempts to solve a few problems:
>
> 1) Filtering out partitions flexibly based on the results of an external
> function call (supplied by an extension).
> 2) Filtering out partitions from pg_inherits based on the same function call.
> 3) Filtering out partitions from a partitioned table BEFORE the partition
> is actually opened on-disk.

What concrete problems would you expect this patch to solve? What kind of extensions do you imagine? I'd like to hear about some examples. For example, "PostgreSQL 12 will not be able to filter out enough partitions when planning/executing a SELECT ... WHERE ... statement, but an extension like this can extract just one partition early."

Would this help with the following issues in PostgreSQL 12?

* UPDATE/DELETE planning takes time in proportion to the number of partitions, even when the partition actually accessed during query execution is only one.
* Making a generic plan takes a prohibitively long time (IIRC, about 12 seconds when the number of partitions is 1,000 or 8,000).

Regards
Takayuki Tsunakawa
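For example, the UPDATE issue shows up like this (a sketch; parted is an imagined table with many partitions):

  PREPARE upd(int) AS UPDATE parted SET val = val + 1 WHERE pkey = $1;
  EXECUTE upd(1);  -- planning expands every partition even though only one row in one partition changes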
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > This test expects that the inserted tuple is always reclaimed by > subsequent vacuum, but it's not always true if there are concurrent > transactions. So size of the reloptions_test table will not be 0 if > the tuple is not vacuumed. In my environment this test sometimes > failed with 'make check -j 4'. Hmm, "make check -j4" certainly fails on my poor VM, too. Modifying src/test/regress/parallel_schedule to put the reloptions test on a separate line seems to have fixed this issue. Do you think this is the correct remedy? Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz] On Mon, Feb 25, 2019 at 03:59:21PM +0900, Masahiko Sawada wrote: > > Also, I think that this test may fail in case where concurrent > > transactions are running. So maybe should not run it in parallel to > > other tests. > > That's why autovacuum is disabled in this specific test, no? A comment > may be a good idea. Exactly. The table is disabled for autovacuum to avoid being influenced by autovacuum. Regards Takayuki Tsunakawa
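For illustration, such a test table would combine the proposed reloption with the existing autovacuum_enabled reloption (vacuum_shrink_enabled is this thread's proposal, not an existing option):

  CREATE TABLE reloptions_test (i int)
    WITH (vacuum_shrink_enabled = off, autovacuum_enabled = off);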
RE: Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> - If you find the process "bloat"s too much and you (intuitively)
> suspect the cause is the system cache, set it to a certain shorter
> value, say 1 minute, and set catalog_cache_memory_target
> to the allowable amount of memory for each process. The memory
> usage will be stable at an (un)certain amount above the target.

Could you guide me on how to tune these parameters in an example scenario? Let me take the original problematic case referenced at the beginning of this thread. That is:

* A PL/pgSQL function that creates a temp table, accesses it, (accesses other non-temp tables,) and drops the temp table.
* An application repeatedly begins a transaction, calls the stored function, and commits the transaction.

With the v16 patch applied, and leaving the catalog_cache_xxx parameters set to their defaults, CacheMemoryContext continued to increase as follows:

CacheMemoryContext: 1065016 total in 9 blocks; 104168 free (17 chunks); 960848 used
CacheMemoryContext: 8519736 total in 12 blocks; 3765504 free (19 chunks); 4754232 used
CacheMemoryContext: 25690168 total in 14 blocks; 8372096 free (21 chunks); 17318072 used
CacheMemoryContext: 42991672 total in 16 blocks; 11741024 free (21761 chunks); 31250648 used

How can I make sure that this context won't exceed, say, 10 MB to avoid OOM? I'm afraid that once the catcache hash table becomes large in a short period, the eviction would happen less frequently, leading to memory bloat.

Regards
Takayuki Tsunakawa
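A sketch of the workload in question (names invented; each call creates and drops a temp table, leaving dead cache entries behind):

  CREATE FUNCTION use_temp_table() RETURNS bigint AS $$
  DECLARE
    n bigint;
  BEGIN
    CREATE TEMP TABLE tmp_work (id int);
    INSERT INTO tmp_work VALUES (1);
    SELECT count(*) INTO n FROM tmp_work;  -- touches the temp table's catalog entries
    DROP TABLE tmp_work;
    RETURN n;
  END;
  $$ LANGUAGE plpgsql;

  -- the application repeats this:
  BEGIN;
  SELECT use_temp_table();
  COMMIT;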
RE: Protect syscache from bloating with negative cache entries
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't understand the idea that we would add something to PostgreSQL
> without proving that it has value. Sure, other systems have somewhat
> similar systems, and they have knobs to tune them. But, first, we
> don't know that those other systems made all the right decisions, and
> second, even if they are, that doesn't mean that we'll derive similar
> benefits in a system with a completely different code base and many
> other internal differences.

I understand that general idea. So, I don't have an idea why the proposed approach, eviction based only on elapsed time and only at hash table expansion, is better for PostgreSQL's code base and other internal differences...

> You need to demonstrate that each and every GUC you propose to add has
> a real, measurable benefit in some plausible scenario. You can't just
> argue that other people have something kinda like this so we should
> have it too. Or, well, you can argue that, but if you do, then -1
> from me.

The benefits of the size limit are:

* Controllable and predictable memory usage. The DBA can be sure that OOM won't happen.
* Smoothed (non-abnormal) transaction response time. This is due to the elimination of bulk eviction of cache entries.

I'm not sure how to tune catalog_cache_prune_min_age and catalog_cache_memory_target. Let me pick up a test scenario in a later mail in response to Horiguchi-san.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz] > I don't think that we want to use a too generic name and it seems more natural > to reflect the context where it is used in the parameter name. > If we were to shrink with a similar option for other contexts, we would > most likely use a different option. Depending on the load pattern, users > should also be able to disable or enable a subset of contexts as well. > > So I agree with Julien that [auto]vacuum_shrink_enabled is more adapted > for this stuff. OK, I renamed it to vacuum_shrink_enabled. From: Julien Rouhaud [mailto:rjuju...@gmail.com] > One last thing, I think we should at least add one regression test for > this setting. The one you provided previously seems perfectly suited. Thanks, added. Regards Takayuki Tsunakawa disable-vacuum-truncation_v3.patch Description: disable-vacuum-truncation_v3.patch
RE: Libpq support to connect to standby server as priority
Hi Hari-san,

I've reviewed all the files. I think I'll proceed to testing when I've reviewed the revised patch and the patch for target_server_type.

(1) patch 0001

  CONNECTION_CHECK_WRITABLE,   /* Check if we could make a writable
                                * connection. */
+ CONNECTION_CHECK_TARGET,     /* Check if we have a proper target
+                               * connection */
  CONNECTION_CONSUME           /* Wait for any pending message and consume
                                * them. */

According to the following comment, a new enum value should be added at the end.

  /*
   * Although it is okay to add to these lists, values which become unused
   * should never be removed, nor should constants be redefined - that would
   * break compatibility with existing code.
   */

(2) patch 0002

It seems to align better with the existing code to set the default value of pg_conn.requested_session_type explicitly in makeEmptyPGconn(), even if the default value is 0. Although I feel it's redundant, other member variables do so.

(3) patch 0003

  IntervalStyle was not reported by releases before 8.4;
 -application_name was not reported by releases before 9.0.)
 +application_name was not reported by releases before 9.0
 +transaction_read_only was not reported by releases before 12.0.)

";" is missing at the end of the third line.

(4) patch 0004

 -	/* Type of connection to make. Possible values: any, read-write. */
 +	/* Type of connection to make. Possible values: any, read-write, perfer-read. */
  	char *target_session_attrs;

perfer -> prefer

(5) patch 0004

  @@ -3608,6 +3691,9 @@ makeEmptyPGconn(void)
  		conn = NULL;
  	}

  +	/* Initial value */
  +	conn->read_write_host_index = -1;

The new member should be initialized earlier in this function. Otherwise, as you can see in the above fragment, conn can be NULL in an out-of-memory case.

(6) patch 0004

Don't we add read-only as well as prefer-read, which correspond to Slave or Secondary of PgJDBC's targetServerType? I thought the original proposal was to add both.

(7) patch 0004

  @@ -2347,6 +2367,7 @@ keep_going: /* We will come back to here until there is
  		conn->try_next_addr = true;
  		goto keep_going;
  	}
  +
  	appendPQExpBuffer(&conn->errorMessage,
  			  libpq_gettext("could not create socket: %s\n"),

Is this an unintended newline addition?

(8) patch 0004

  +	const char *type = (conn->requested_session_type == SESSION_TYPE_PREFER_READ) ?
  +		"read-only" : "writable";

I'm afraid these strings are not translatable into other languages.

(9) patch 0004

  +	/* Record read-write host index */
  +	if (conn->read_write_host_index == -1)
  +		conn->read_write_host_index = conn->whichhost;

At this point, the session can be either read-write or read-only, can't it? Add the check "!conn->transaction_read_only" to this if condition?

(10) patch 0004

  +	if ((conn->target_session_attrs != NULL) &&
  +		(conn->requested_session_type == SESSION_TYPE_PREFER_READ) &&
  +		(conn->read_write_host_index != -2))

The first condition is not necessary because the second one exists. The parentheses surrounding each of these conditions are redundant. It would be better to remove them for readability. This also applies to the following part:

  +	if (((strncmp(val, "on", 2) == 0) &&
  +		 (conn->requested_session_type == SESSION_TYPE_READ_WRITE)) ||
  +		((strncmp(val, "off", 3) == 0) &&
  +		 (conn->requested_session_type == SESSION_TYPE_PREFER_READ) &&
  +		 (conn->read_write_host_index != -2)))

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] Here I attached first set of patches that implemented the prefer-read option > after reporting the transaction_read_only GUC to client. Along the lines > of adding prefer-read option patch, Great, thank you! I'll review and test it. > 3. Existing read-write code is modified to use the new reported GUC instead > of executing the show command. Is the code path left to use SHOW for older servers? Regards Takayuki Tsunakawa
RE: Problem during Windows service start
Hi Higuchi-san,

(1)
What made you think this problem rarely occurs in PG 10 or later? Looking at the following code, this seems to happen in PG 10+ too.

  if (do_wait)
  {
      write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
      if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY)
      {
          write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
          pgwin32_SetServiceStatus(SERVICE_STOPPED);
          return;
      }
      write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
  }

(2)
What state should we consider SERVICE_RUNNING as? Isn't it the state where the server has completed startup processing and accepts connections? If not, how is it different from SERVICE_STARTING? (I know that when -w (wait) is not specified, the status becomes SERVICE_RUNNING whether or not the server completes startup processing...)

(3)
+ write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the background\n"));

This message is new, isn't it? I think the existing message "Timed out..." is enough.

(4)
+ write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));

The first sentence (phenomenon) and the second sentence (detail or action) should be separated with a newline. Below are some examples in pg_ctl.c. Note that write_stderr() writes to the eventlog when running under a Windows service.

  write_stderr(_("%s: could not start server\n"
                 "Examine the log output.\n"),

  write_stderr(_("The program \"%s\" was found by \"%s\"\n"
                 "but was not the same version as %s.\n"
                 "Check your installation.\n"),

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Julien Rouhaud [mailto:rjuju...@gmail.com]
> FWIW, I prefer shrink over truncate, though I'd rather go with
> vacuum_shrink_enabled as suggested previously.

Thanks. I'd like to leave it to a committer to choose the name. FWIW, I chose shrink_enabled rather than vacuum_shrink_enabled because this property may be used in other shrink situations in the future. What I imagined was that with zheap, DELETE or some maintenance operation, not vacuum, may try to shrink the table. I meant this property to indicate "whether this table shrinks or not" regardless of the specific operation that can shrink the table.

> I'm not sure that I get this comment. Since both require a
> ShareUpdateExclusiveLock, you can't change the parameter while a
> VACUUM is active on that table. Did you want to use another lock mode?

No, changing the parameter acquires a ShareUpdateExclusive lock. I just imitated the description for n_distinct in the same comment block. The meaning is that the setting cannot be changed during VACUUM, so an in-flight VACUUM is not affected.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> socket_timeout (integer)

libpq documentation does not write the data type on the parameter name line.

> Terminate any connection that has been inactive for more than the specified
> number of seconds to prevent the client from infinite waiting for individual
> socket read operations due to a dead connection. This can be used both as
> a force global query timeout and network problems detector. A value of zero
> (the default) turns this off.
>
> or
>
> Controls the number of seconds of connection inactivity to prevent the client
> from infinite waiting for individual socket read operations due to a dead
> connection. This can be used both as a force global query timeout and network
> problems detector. A value of zero (the default) turns this off.

The second one is better, but:

* Just "connection inactivity" (i.e. the application hasn't submitted any request to the server for a long time) does not terminate the connection.
* "due to dead connection" is not the cause of the timeout. If the timeout occurs, consider the connection dead (see keepalives_count).
* It is not restricted to "read" operations, is it?

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > I am not very familiar with the PostgreSQL source code. Nevertheless, the > main idea of this parameter is clear for me - closing a connection when > the PostgreSQL server does not response due to any reason. However, I have > not found in the discussion a reference that this parameter can be applied > to the TCP as well as to the UNIX-domain sockets. Moreover, this parameter > works out of communication layer. When we consider TCP communication, the > failover is covered by keep_alive and tpc_user_timeout parameters. Right. This timeout is for individual socket operations, and the last resort to forcibly terminate the connection when other timeouts don't work (TCP keepalive, tcp user timeout, statement_timeout). Also, there's no reason to restrict this timeout to TCP. > According to it, we should not use 'tcp' prefix in this parameter name, > 'socket' sub string is not suitable too. > > This parameter works on the client side. So the word 'client' is a good > candidate for using in this parameter name. > This parameter affects only when we send a 'query' to the pg server. No, this timeout works not only for queries but for any socket operations to communicate with the server. For example, libpq sends query cancellation message to the server, receives server-side parameter value change notifications, etc. So, like PgJDBC and Oracle, I think the name suggesting socket or communication is good. That is, socket_timeout or communication_timeout is OK. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Jamison, Kirk/ジャミソン カーク
> Although I did review and followed the suggested way in a previous email
> way back (which uses the root user) and it worked as intended, I'd also like
> to hear feedback from Fabien on whether it's alright without the test
> script, or if there's another way we can test it (maybe none?).
> Then I think it's in good shape.

I'm afraid there's no way to incorporate the test into the regression test. If there were, Fabien would have responded earlier.

> However, what I meant by my statement above was to have a separate
> decision in committing the two parameters, because based on the thread
> tcp_user_timeout has made more progress when it comes to reviews.
> So once it's ready, it can be committed separately without waiting for
> the socket_timeout param to be ready. (I don't think that requires a
> separate thread)

I see, thanks.

> As for the place, my knowledge is limited so I won't be able to provide
> substantial help on it. Fabien pointed out some comments about it that
> need clarification though. Quoting what Fabien wrote:
>
> "The implementation looks awkward, because part of the logic of pqWaitTimed
> is reimplemented in pqWait. Also, I do not understand the computation
> of finish_time, which seems to assume that pqWait is going to be called
> immediately after sending a query, which may or may not be the case, and
> if it is not, the time() call there is not the start of the statement."

The patch looks good in that respect. This timeout is for individual socket operations like PgJDBC and Oracle, not the query timeout. That's what the parameter name suggests.

> How about the one below? I got it from the combined explanations above.
> I did not include the differences though. This can be improved as the
> discussion continues along the way.
>
> tcp_socket_timeout (integer)
>
> Terminate and restart any session that has been idle for more than
> the specified number of milliseconds to prevent the client from infinite
> waiting for the server due to a dead connection. This can be used both as
> a brute force global query timeout and for detecting network problems.
> A value of zero (the default) turns this off.

Looks better than the original one. However:

* The unit is not milliseconds, but seconds.
* It doesn't restart the session.
* Choose words that better align with the existing libpq parameters. e.g. "connection" should be used instead of "session" as in the keepalive_xxx parameters, because there's no concept of a database session at the socket level.
* Indicate that the timeout is for individual socket operations.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Nagaura, Ryohei [mailto:nagaura.ryo...@jp.fujitsu.com] > > Maybe. Could you suggest good description? > Clients wait until the socket become readable when they try to get results > of their query. > If the socket state get readable, clients read results. > (See src/interfaces/libpq/fe-exec.c, fe-misc.c) > The current pg uses poll() to monitor socket states. > "socket_timeout" is used as timeout argument of poll(). > (See [1] about poll() and its arguments) > > Because "tcp_user_timeout" handles the timeout before and after the above > procedure, > there are different monitoring target between "socket_timeout" and > "tcp_user_timeout". > When the server is saturated, "statement_timeout" doesn't work while > "socket_timeout" does work. > In addition to this, the purpose of "statement_timeout" is to reduce > server's burden, I think. OK, I understand your intent. What I asked Kirk-san is to suggest a description for socket_timeout parameter that the user would see in PostgreSQL documentation. > My proposal is to enable clients to release their resource which is used > communication w/ the saturated server. I thought the primary purpose is to prevent applications from hanging too long, so that the user does not have to wait infinitely. Releasing resources is a secondary purpose. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Nagaura, Ryohei [mailto:nagaura.ryo...@jp.fujitsu.com] > BTW, tcp_user_timeout parameter of servers and clients have same name in > my current implementation. > I think it would be better different name rather than same name. > I'll name them as the following a) or b): > a) server_tcp_user_timeout and client_tcp_user_timeout > b) tcp_user_timeout and user_timeout > b) is the same as the naming convention of keepalive, but it is not > user-friendly. > Do you come up with better name? > Or opinion? a) is not always accurate, because libpq is also used in the server. For example, postgres_fdw and WAL receiver in streaming replication. I'm OK with either the current naming or b). Frankly, I felt a bit strange when I first saw the keepalive parameters, wondering why the same names were not chosen. Regards Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Robert Haas [mailto:robertmh...@gmail.com]
> That might be enough to justify having the parameter. But I'm not
> quite sure how high the value would need to be set to actually get the
> benefit in a case like that, or what happens if you set it to a value
> that's not quite high enough. I think it might be good to play around
> some more with cases like this, just to get a feeling for how much
> time you can save in exchange for how much memory.

Why don't we consider this just like the database cache and other DBMSs' dictionary caches? That is:

* If you want to avoid infinite memory bloat, set the upper limit on the size.
* To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch, although that seems to need modification anyway).

Why do people try to get away from a familiar idea... Am I missing something?

Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would look or not? That is:

* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with corrections based on feedback
* 0003: merge the original 0003 and 0005, with corrections based on feedback

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi/出利葉 健 > I checked it with perf record -avg and perf report. > The following shows top 20 symbols during benchmark including kernel space. > The main difference between master (unpatched) and patched one seems that > patched one consumes cpu catcache-evict-and-refill functions including > SearchCatCacheMiss(), CatalogCacheCreateEntry(), > CatCacheCleanupOldEntries(). > So it seems to me that these functions needs further inspection > to suppress the performace decline as much as possible Thank you. It's good to see the expected functions, rather than strange behavior. The performance drop is natural just like the database cache's hit ratio is low. The remedy for performance by the user is also the same as the database cache -- increase the catalog cache. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> 1) tcp_user_timeout parameter
> I think this can be "committed" separately when it's finalized.

Do you mean you've reviewed and tested the patch by simulating a communication failure in the way Nagaura-san suggested?

> 2) tcp_socket_timeout parameter
> - whether (a) it should abort the connection from pqWait() or other means, or
>   (b) cancel the statement similar to how psql does it, as suggested by Fabien

We have no choice but to terminate the connection, because we can't tell whether we can recover from the problem and continue to use the connection (e.g. a long-running query) or not (a permanent server or network failure). Regarding the place, pqWait() is the best (and possibly only) place. The purpose of this feature is to avoid waiting for a response from the server forever (or too long) in any case, as a last resort.

Oracle has similar parameters called SQLNET.RECV_TIMEOUT and SQLNET.SEND_TIMEOUT. From those names, I guess they use the SO_RCVTIMEO and SO_SNDTIMEO socket options. However, we can't use them because libpq uses non-blocking sockets and poll(), while SO_RCVTIMEO/SO_SNDTIMEO do not have an effect on poll():

[excerpt from "man 7 socket"]
--
SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting an error. The argument is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK just as if the socket was specified to be non-blocking. If the timeout is set to zero (the default) then the operation will never timeout. Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), etc.
--

> - proper parameter name
> Based on your description below, I agree with Fabien that it's somehow
> the application/client side query timeout

I think the name is good because it indicates the socket-level timeout. That's just like PgJDBC and Oracle, and I didn't feel strange when I read their manuals.

> Perhaps you could also clarify a bit more through documentation on how
> socket_timeout handles the timeout differently from statement_timeout
> and tcp_user_timeout.

Maybe. Could you suggest a good description?

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Hm. Putting a list header for a purely-local data structure into shared
> memory seems quite ugly. Isn't there a better place to keep that?

Agreed. I've put it in a global variable.

> Do we really want a dlist here at all? I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved. This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

To delete the LOCALLOCK in RemoveLocalLock(), we need a dlist; an slist requires the list iterator to be passed down from the callers (see the sketch below).

From: Andres Freund [mailto:and...@anarazel.de]
> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.

Thanks, I've done it.

From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Can we use our knowledge of the structure of locks, i.e. partition locks
> are all children of the partitioned table, to do a better job?

I couldn't come up with an idea.

Regards
Takayuki Tsunakawa

faster-locallock-scan_v2.patch
Description: faster-locallock-scan_v2.patch
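A minimal sketch of the dlist-versus-slist point, using PostgreSQL's lib/ilist.h (the struct, field, and list-header names are illustrative, not the patch's actual code):
--
#include "postgres.h"
#include "lib/ilist.h"

/* Hypothetical stand-in for LOCALLOCK with the added list link. */
typedef struct MyLocalLock
{
	/* ... the existing LOCALLOCK fields would go here ... */
	dlist_node	node;		/* link in the backend-local lock list */
} MyLocalLock;

static dlist_head session_locallocks;	/* illustrative list header */

static void
remove_my_local_lock(MyLocalLock *locallock)
{
	/*
	 * With a dlist, an entry unlinks itself in O(1) given only its own
	 * node -- no list head and no iterator needed, so this works from any
	 * call site.  With an slist, we would need either an O(n)
	 * slist_delete(&session_locallocks, ...) scan, or to change every
	 * caller to pass down an slist_mutable_iter for slist_delete_current().
	 */
	dlist_delete(&locallock->node);
}
--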
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> > Imai-san confirmed performance improvement with this patch:
> >
> > https://commitfest.postgresql.org/22/1993/
>
> Can you quantify the effects? That is, how much slower/faster does it get?

Ugh, sorry, I wrote the wrong URL. The correct page is:

https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512787EC%40g01jpexmbkw24

The quoted figures are:

[v20 + faster-locallock-scan.patch]
auto: 9,069 TPS
custom: 9,015 TPS

[v20]
auto: 8,037 TPS
custom: 8,933 TPS

In the original problematic case, plan_cache_mode = auto (the default), we see about a 13% improvement (9,069 vs. 8,037 TPS).

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> number of tables | 100         | 1000        | 10000
> -----------------+-------------+-------------+------------
> TPS (master)     | 10966       | 10654       | 9099
> TPS (patch)      | 11137 (+1%) | 10710 (+0%) | 772 (-91%)
>
> It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
> the results are almost the same as master, but after exceeding the limit (at 10000)
> the decline happens.

How many concurrent clients? Can you show perf's call-graph sampling profiles of both the unpatched and patched versions, to confirm that the bottleneck is around catcache eviction and refill?

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> 0.7% may easily be just a noise, possibly due to differences in layout
> of the binary. How many runs? What was the variability of the results
> between runs? What hardware was this tested on?

3 runs, with a variability of about +-2%. Luckily, all three runs (coincidentally?) showed a slight performance decrease with the patched version. The figures I wrote are the highest ones. The hardware is a very poor man's VM:

CPU: 4-core Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
RAM: 4GB

> FWIW I doubt tests with such a small schema are proving anything -
> the cache/lists are likely tiny. That's why I tested with a much larger
> number of relations.

Yeah, it takes many relations to exercise the repeated catcache eviction and new entry creation, which would show users the need to increase catalog_cache_max_size. I tested with a small number of relations to see the impact of the additional processing while the catcache is not full -- the memory accounting and the catcache LRU chain maintenance.

Regards
Takayuki Tsunakawa