On 2020-10-23 13:46, Kyotaro Horiguchi wrote:
Wait...
Attachments:
0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch
For a moment I thought that the number is patch number but the
predecessors are 0002-Enabled..collect.patch and 0001-(same
name). It's not mandatory but we usually do as the follows and it's
the way of git.
v1-0001-Enabled...collect.patch
v2-0001-Enabled...collect.patch
The vn is added by -v option for git-format-patch.
Sorry for the confusion. I'll follow that way next time.
At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia
<torikos...@oss.nttdata.com> wrote in
> > I added a shared hash table consisted of minimal members
> > mainly for managing whether the file is dumped or not.
> > Some members like 'loc' seem useful in the future, but I
> > haven't added them since it's not essential at this point.
> Yes, that would be good.
> + /*
> + * Since we allow only one session can request to dump
> memory context at
> + * the same time, check whether the dump files already exist.
> + */
> + while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile,
> &stat_tmp) == 0)
> + {
> + pg_usleep(1000000L);
> + }
> If pg_get_backend_memory_contexts() is executed by two or more
> sessions at the same time, it cannot be run exclusively in this way.
> Currently it seems to cause a crash when do it so.
> This is easy to reproduce and can be done as follows.
> [session-1]
> BEGIN;
> LOCK TABKE t1;
> [Session-2]
> BEGIN;
> LOCK TABLE t1; <- waiting
> [Session-3]
> select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
> [Session-4]
> select * FROM pg_get_backend_memory_contexts(<pid of
> session-2>);
> If you issue commit or abort at session-1, you will get SEGV.
> Instead of checking for the existence of the file, it might be better
> to use a hash (mcxtdumpHash) entry with LWLock.
Thanks!
Added a LWLock and changed the way from checking the file existence
to finding the hash entry.
> + if (proc == NULL)
> + {
> + ereport(WARNING,
> + (errmsg("PID %d is not a PostgreSQL server
> process", dst_pid)));
> + return (Datum) 1;
> + }
> Shouldn't it clear the hash entry before return?
Yeah. added codes for removing the entry.
+ entry = AddEntryToMcxtdumpHash(dst_pid);
+
+ /* Check whether the target process is PostgreSQL backend process.
*/
+ /* TODO: Check also whether backend or not. */
+ proc = BackendPidGetProc(dst_pid);
+
+ if (proc == NULL)
+ {
+ ereport(WARNING,
+ (errmsg("PID %d is not a PostgreSQL server
process", dst_pid)));
+
+ LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
+
+ if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE,
NULL) == NULL)
+ elog(WARNING, "hash table corrupted");
+
+ LWLockRelease(McxtDumpHashLock);
+
+ return (Datum) 1;
+ }
Why do you enter a useles entry then remove it immedately?
Do you mean I should check the process existence first
since it enables us to skip entering hash entries?
+ PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum)
Int32GetDatum(dst_pid));
+ {
+ SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY,
InvalidBackendId);
"PROCSIG_DUMP_MEMORY" is somewhat misleading. Hwo about
"PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?
I'll go with "PROCSIG_DUMP_MEMCXT".
I thought that the hash table would prevent multiple reqestors from
making a request at once, but the patch doesn't seem to do that.
+ /* Wait until target process finished dumping file. */
+ while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
This needs LWLock. And this could read the entry after reused by
another backend if the dumper process is gone. That isn't likely to
happen, but theoretically the other backend may set it to
MCXTDUMPSTATUS_NOTYET inbetween two successive check on the member.
Thanks for your notification.
I'll use an LWLock.
+ /*
+ * Make dump file ends with 'D'.
+ * This is checked by the caller when reading the file.
+ */
+ fputc('E', fpout);
Which is right?
Sorry, the comment was wrong..
+ fputc('E', fpout);
+
+ CHECK_FOR_INTERRUPTS();
This means that the process accepts another request and rewrite the
file even while the first requester is reading it. And, the file can
be removed by the first requestor before the second requestor can read
it.
I added CHECK_FOR_INTERRUPTS() here to make the dump cancellation
possible, however, considering your indication, it needs to think
about a way to handle only the dump cancellation.
+ mcxtdumpHash = ShmemInitHash("mcxtdump hash",
+
SHMEM_MEMCONTEXT_SIZE,
+
SHMEM_MEMCONTEXT_SIZE,
.
The space needed for this hash doesn't seem to be secured. The hash is
sized to 64 entries, so pg_get_backend_memory_contexts() may fail with
ERROR "out of shared memory".
The hash is used only to check if the dump file is completed and if
ended with error. If we need only those, an shared byte array with the
length of max_backend is sufficient.
Although there was a comment that controlling dump location may be a
good idea, but it also seems possible to include the location
information
in the struct Hoge you suggested below.
On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito <[hidden email]>
wrote:
| - send a signal to dump to a PID, it first record following
information into the shared hash.
| pid (specified pid)
| loc (dump location, currently might be ASAP)
+ PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum)
Int32GetDatum(dst_pid));
+ {
+ SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY,
InvalidBackendId);
+
+ /* Wait until target process finished dumping file. */
+ while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
+ {
+ CHECK_FOR_INTERRUPTS();
+ pg_usleep(10000L);
+ }
+ }
+ PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum)
Int32GetDatum(dst_pid));
+
+ if (entry->dump_status == MCXTDUMPSTATUS_ERROR)
+ {
..
+ if (stat(tmpfile, &stat_tmp) == 0)
+ unlink(tmpfile);
+ if (stat(dumpfile, &stat_tmp) == 0)
+ unlink(dumpfile);
...
+ return (Datum) 0;
+ }
+
+ /* Read values from the dumped file and put them on tuplestore.
*/
+ PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc,
dst_pid);
This means that if the function gets sigint before the dumper creates
the file, the dumper can leave a dump file?
In that case, the requestor removes the corresponding hash entry in the
callback McxtReqKill, then the dumper who can not find the hash entry
does not dump a file.
However, I've now noticed that when the requestor gets sigint just
after the dumper check, the dump file remains.
In the current design, only the requestor can remove the dump file,
but it seems necessary to allow the dumper to remove it.
> + /* Wait until target process finished dumping file. */
> + while (!entry->is_dumped)
> + {
> + CHECK_FOR_INTERRUPTS();
> + pg_usleep(10000L);
> + }
> If the target is killed and exit before dumping the memory
> information, you're in an infinite loop here.
> So how about making sure that any process that has to stop before
> doing a memory dump changes the status of the hash (entry->is_dumped)
> before stopping and the caller detects it?
> I'm not sure it's best or not, but you might want to use something
> like the on_shmem_exit callback.
Thanks for your idea!
Added a callback to change the status of the hash table entry.
Although I think it's necessary to remove this callback when it
finished
processing memory dumping, on_shmem_exit() does not seem to have such
a function.
I used before_shmem_exit() since it has a corresponding function to
remove registered callback.
If it's inappropriate, I'm going to add a function removing the
registered callback of on_shmem_exit().
This seems to leave a file for the pid.
As mentioned above, there can be a chance to remain files.
> In the current design, if the caller stops processing before reading
> the dumped file, you will have an orphaned file.
> It looks like the following.
> [session-1]
> BEGIN;
> LOCK TABKE t1;
> [Session-2]
> BEGIN;
> LOCK TABLE t1; <- waiting
> [Session-3]
> select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
> If you cancel or terminate the session-3, then issue commit or abort
> at session-1, you will get orphaned files in pg_memusage.
> So if you allow only one session can request to dump file, it could
> call pg_memusage_reset() before send the signal in this function.
Although I'm going to allow only one session per one target process,
I'd like to allow running multiple pg_get_backend_memory_contexts()
which target process is different.
Instead of calling pg_memusage_reset(), I added a callback for
cleaning up orphaned files and the elements of the hash table
using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and
PG_END_ENSURE_ERROR_CLEANUP().
I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
here since it can handle not only termination but also cancellation.
Any thoughts?
+/*
+ * pg_memusage_reset
+ * Remove the memory context dump files.
+ */
+void
+pg_memusage_reset(int pid)
The function name seem to represents somewhat different from what it
does.
Yeah, It looks like it actually reset the memory.
I'll rename it to remove_memcxt_file or something.
I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.
Agreed and thanks for writing it down below.
- Dump file lifecycle or state-transition of the dumper
Currently the lifecycle of a dump file, or the state-transition of
the dumper process doesn't seem to be well defined.
The file is create only by the dumper.
If the requestor reads the completed file, the reader removes it.
If the dumper receives a cancel request, the dumper removes it if
any.
Of course dumper removes it if it fails to complete the file.
- Better way of signalling between the requestor and the dumper
I think there's no doubt about request signal.
About the complete signal, currently the requestor polls on a flag
in a hash entry. I'm wondering if we can get rid of polling. The
problem on doing that is the lack of a means for a dumper to know
the requestor. We need to store requestor pid or backend id in the
shared hash entry.
Agreed to get rid of polling.
BTW, it seems common to use a latch instead of pg_usleep() to wait until
signals arrive as far as I read latch.h.
I'm now thinking about using a latch here and it would make polling
removed.
By the way, about shared hash entry, it uses about 70kB for only 64
entries so it seems inefficient than a shared array that has
MaxBackends entries. If we used a following array on shared memory,
struct hoge
{
BackendId requestor[MAX_BACKENDS];
int status[MAX_BACKENDS];
LWLock lock;
};
If the requestor's id is 5 and dumper's id is 10,
is this struct used like below?
- hoge.requestor[10] = 5
- Both status[5] and status[10] change like "request", "idle" or "done"
This array has the size of 24 * MaxBackends + 16. 24kB for 1000
backends. It could be on dsm since this feature is not used
commonly.
Sorry but I'm not sure this calculation.
Do you mean 16.24kB for 1000 backends?
Regarding dsm, do you imagine using hoge on dsm?
- The way to cancel a request already made. (or management of the dump
state transition.)
Cancellation seems to contain some race conditions. But basically
that could be done by sending a request signal after setting the
hoge.requestor above to some special value, not needing the third
I imagined hoge.requestor was set to requestor's backendid.
Isn't it hoge.status?
type of signal. The special value should be different from the
initial state, which signals that the process is accepting a new
request.
As the whole, that would looks like the folloing?
------------------------------------------------------------
Successful request.
Requestor dumper state
[idle] initial
[request] -------------------> requestor pid/backendid
-signal->
[dumping]
<-signal-[done]
[read]
[done] --------------------> initial
------------------------------------------------------------
On failure, the dumper signals with setting state to initial.
[request] -------------------> requestor pid/backendid
-signal->
[dumping]
[failed] initial
<-signal-
(some other requestor might come meantime.)
<sees that the requestor is not me>
Is this "some other requestor come case" relevant to the dumper failure?
Regards,
--
Atsushi Torikoshi
[failed]
------------------------------------------------------------
If the requestor wants to cancel the request, it sets the state to
'cancel' then signal.
Requestor dumper state
[idle] initial
[request] -------------------> cancel
<if canceled. clean up>
[dumping]
<if canceled. clean up>
<-signal-[done]
-signal-><try to clean up>
Other aspects to cnosider?
regards.