On Fri, Sep 18, 2015 at 7:04 PM, Robert Haas <robertmh...@gmail.com> wrote:

>
> Frankly, I think you guys are making this out to be way more
> complicated than it really is.  Basically, I think the process being
> queried should publish a DSM via a slot it owns.  The recipient is
> responsible for reading it and then notifying the sender.  If a second
> process requests data before the first process reads its data, the
> second process can either (1) become an additional reader of the
> already-published data or (2) wait for the first process to finish,
> and then send its own inquiry.
>
> There are some problems to solve here, but they hardly seem impossible.


Thank you, Robert, for your invaluable input on this patch!

I now believe that the use of ProcessInterrupts() in the recently proposed
design, as well as the manipulation of the proc latch required by the shared
memory queue, are major blockers.

To simplify things, I've taken a step back and looked again at the
auto_explain contrib module.  I now propose the simplest thing possible, in my
opinion: a new GUC option for auto_explain.  It makes every backend in which
the module is loaded (via the *_preload_libraries mechanism or the CREATE
EXTENSION/LOAD commands) actively publish the plans of its queries in dynamic
shared memory, as long as auto_explain.publish_plans is turned on.
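
For example, to turn it on for the current session only (a hypothetical psql
session; the GUC is SUSET, so this requires a superuser):

    =# LOAD 'auto_explain';
    =# SET auto_explain.publish_plans = on;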

The greatest part for me is that this approach doesn't require any signal
handling and is isolated in an external module, so it can be readily used with
current server versions; no need to wait for >= 9.6.

Some implementation details:

For every backend that might be running (MaxBackends) we reserve a dsm_handle
slot in the add-ins shared memory.  When the new option is turned on, the
ExecutorRun hook will produce a plan in whatever format is specified by
auto_explain.log_format, allocate a DSM segment, copy the plan into the
segment, and finally store the DSM handle in its own slot.  No locking is
required around this, because every backend writes to its slot exclusively; no
other backend can be writing to it concurrently.  In the ExecutorFinish hook
we invalidate the backend's slot by setting the handle to 0 and deallocate the
DSM segment.

Reading of the plan is performed by the newly added function
pg_explain_backend(PID).  Since it can determine the target process's
backendId, it simply reads the DSM handle from that backend's slot and tries
to attach to it (there's not much point in checking that the handle is non-0,
because the other backend could release the segment the moment after we've
checked, so we rely on dsm_attach() returning non-NULL).  If we attach
successfully, we parse the contents and detach.  At that point the last
backend to detach is actually the one releasing the segment, due to reference
counting.
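
To illustrate, reading another backend's plan could then look like this (a
mock-up session, with the PID and the output typed by hand):

    =# SELECT * FROM pg_explain_backend(12345);
                                  pg_explain_backend
    --------------------------------------------------------------------------
     Query Text: SELECT count(*) FROM pgbench_accounts;
     Aggregate  (cost=2890.00..2890.01 rows=1 width=0)
       ->  Seq Scan on pgbench_accounts  (cost=0.00..2640.00 rows=100000 width=0)
    (3 rows)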

Handling of nested statements' plans is an open question.  It can be really
useful when the top-level plan simply displays a "SELECT
my_stored_procedure()" and all the interesting stuff is happening behind the
scenes, but I haven't started to think about how this could be implemented
yet.

Pavel was really interested in retrieving complete query texts/plans, which
could be over a megabyte in his case (and pg_stat_activity.query is capped at
10240 bytes, I believe).  This is now possible with the patch, but others
might still want to put a threshold on the allocation, especially given that
this is shared memory.  I can envision another GUC, but in our experience
extremely long queries (and plans) are most of the time due to the use of
VALUES() or IN() clauses with huge lists of literals.

I think we could fold the VALUES/IN lists into "?" if the query/plan text
exceeds the specified threshold, or even unconditionally (yes,
pg_stat_statements, I'm looking at you).  This should help in cases where the
most interesting part is in the plan nodes near the end, but a huge list of
literals comes before it.
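
To illustrate the kind of folding I mean (not implemented in the attached
patch):

    -- as published today:
    SELECT * FROM t WHERE id IN (1, 2, 3, /* ...thousands more... */ 99999);

    -- folded:
    SELECT * FROM t WHERE id IN (?);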

Future plans:

I believe this approach can be extended to enable instrumentation once again.
The backend executing the query could update the published plan every once in
a while (for example, every N ms, or every 1% of rows processed in a node),
and any other process interested in this data could simply read it, without
the need for signals or complex and fragile communication.  This obviously
requires a core patch.

Some problems:

There is a potential problem with the limited total number of DSM segments:
the limit is sized to allow only 2 per backend (on average) plus 64 extra per
server, so if you run with the new option enabled at all times, you're down to
only 1 additional DSM segment per backend (again, on average).  For example,
with 100 backends the server allows 264 segments in total, of which 100 would
be permanently taken up by published plans.  I'm not sure how critical this
can be, but no one is forced to run with this option enabled for all backends.

If you don't want to run with it enabled at all times, enabling the GUC per
backend can be problematic.  It's still possible to update the conf file and
send SIGHUP to a single backend, but that's harder to accomplish from psql,
for example.  I think here we might still have some luck with signals: use
another array of per-backend slots with flags, set the target backend's flag,
and send it SIGUSR1.  The backend wakes up on the signal and examines its
slot, then toggles the GUC if needed.  Sounds pretty safe, eh?
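
A minimal sketch of that idea (all names invented; the flag would have to be
examined at a safe point, e.g. from one of the executor hooks, and not in the
signal handler itself):

    /* one flag per backend, in the add-ins shared memory */
    static volatile sig_atomic_t *publish_requests;

    /* called from a safe point in the target backend */
    static void
    check_publish_request(void)
    {
        if (publish_requests[MyBackendId])
        {
            publish_requests[MyBackendId] = 0;
            SetConfigOption("auto_explain.publish_plans", "on",
                            PGC_SUSET, PGC_S_SESSION);
        }
    }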

No documentation changes yet, waiting for your comments. :-)

Happy hacking!
--
Alex
From 9f192d17e4674b2a042ca435ad5219f59722cd7a Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Date: Fri, 25 Sep 2015 17:29:34 +0200
Subject: [PATCH] Add auto_explain.publish_plans

This adds an optional feature to publish the plan of every query being
executed into a per-backend dynamic shared memory segment.  The plans can then
be examined from other backends by means of the newly added
pg_explain_backend() function.  The process ID passed to the function must be
the PID of one of this server's backends, and the caller must be either a
superuser or the same role that the other backend is running under.

In order to support the addition of the SQL-callable function, EXTENSION
packaging is provided for auto_explain.
---
 contrib/auto_explain/Makefile              |   2 +
 contrib/auto_explain/auto_explain--1.0.sql |  10 +
 contrib/auto_explain/auto_explain.c        | 326 ++++++++++++++++++++++++++---
 contrib/auto_explain/auto_explain.control  |   5 +
 4 files changed, 315 insertions(+), 28 deletions(-)
 create mode 100644 contrib/auto_explain/auto_explain--1.0.sql
 create mode 100644 contrib/auto_explain/auto_explain.control

diff --git a/contrib/auto_explain/Makefile b/contrib/auto_explain/Makefile
index fcf36c5..b6057a3 100644
--- a/contrib/auto_explain/Makefile
+++ b/contrib/auto_explain/Makefile
@@ -3,6 +3,8 @@
 MODULE_big = auto_explain
 OBJS = auto_explain.o $(WIN32RES)
 PGFILEDESC = "auto_explain - logging facility for execution plans"
+EXTENSION = auto_explain
+DATA = auto_explain--1.0.sql
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/auto_explain/auto_explain--1.0.sql b/contrib/auto_explain/auto_explain--1.0.sql
new file mode 100644
index 0000000..6059949
--- /dev/null
+++ b/contrib/auto_explain/auto_explain--1.0.sql
@@ -0,0 +1,10 @@
+/* contrib/auto_explain/auto_explain--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION auto_explain" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_explain_backend(INT)
+RETURNS SETOF TEXT
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 2a184ed..0fa123f 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -14,9 +14,18 @@
 
 #include <limits.h>
 
+#include "access/htup_details.h"
 #include "commands/explain.h"
 #include "executor/instrument.h"
+#include "miscadmin.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -29,6 +38,7 @@ static bool auto_explain_log_triggers = false;
 static bool auto_explain_log_timing = true;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_publish_plans = false;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -38,10 +48,23 @@ static const struct config_enum_entry format_options[] = {
 	{NULL, 0, false}
 };
 
+
+/* Shared memory structs */
+static dsm_handle *published_plan_handles = NULL;
+
+#define MaxPublishedPlans		(MaxBackends + 1)	/* MyBackendId is 1-based */
+#define explain_shmem_size()	(sizeof(dsm_handle) * MaxPublishedPlans)
+
+static dsm_segment	   *MyPublishedPlanSegment = NULL;
+static dsm_handle	   *MyPublishedPlanHandle = NULL;
+
+
 /* Current nesting depth of ExecutorRun calls */
 static int	nesting_level = 0;
 
 /* Saved hook values in case of unload */
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
 static ExecutorStart_hook_type prev_ExecutorStart = NULL;
 static ExecutorRun_hook_type prev_ExecutorRun = NULL;
 static ExecutorFinish_hook_type prev_ExecutorFinish = NULL;
@@ -51,15 +74,28 @@ static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;
 	(auto_explain_log_min_duration >= 0 && \
 	 (nesting_level == 0 || auto_explain_log_nested_statements))
 
+#define publish_explain_enabled() \
+	(auto_explain_publish_plans && \
+	 nesting_level == 0)	/* XXX what do we do with the nested statements? */
+
 void		_PG_init(void);
 void		_PG_fini(void);
 
+PG_FUNCTION_INFO_V1(pg_explain_backend);
+
+static void explain_shmem_startup(void);
+static void attach_published_plan_handles(void);
+
 static void explain_ExecutorStart(QueryDesc *queryDesc, int eflags);
 static void explain_ExecutorRun(QueryDesc *queryDesc,
 					ScanDirection direction,
 					long count);
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
+static StringInfo explain_query(QueryDesc *queryDesc);
+
+static void explain_publish_plan(QueryDesc *queryDesc);
+static void explain_unpublish_my_plan(void);
 
 
 /*
@@ -159,9 +195,26 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomBoolVariable("auto_explain.publish_plans",
+							 "Publish the plan of every executed statement in shared memory.",
+							 NULL,
+							 &auto_explain_publish_plans,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
+	if (auto_explain_publish_plans)
+		RequestAddinShmemSpace(explain_shmem_size());
+
 	/* Install hooks. */
+	prev_shmem_startup_hook = shmem_startup_hook;
+	shmem_startup_hook = explain_shmem_startup;
+
 	prev_ExecutorStart = ExecutorStart_hook;
 	ExecutorStart_hook = explain_ExecutorStart;
 	prev_ExecutorRun = ExecutorRun_hook;
@@ -183,8 +236,42 @@ _PG_fini(void)
 	ExecutorRun_hook = prev_ExecutorRun;
 	ExecutorFinish_hook = prev_ExecutorFinish;
 	ExecutorEnd_hook = prev_ExecutorEnd;
+
+	shmem_startup_hook = prev_shmem_startup_hook;
 }
 
+
+static void
+explain_shmem_startup(void)
+{
+	if (prev_shmem_startup_hook)
+		prev_shmem_startup_hook();
+
+	attach_published_plan_handles();
+}
+
+static void
+attach_published_plan_handles(void)
+{
+	bool		found;
+
+	if (auto_explain_publish_plans && published_plan_handles == NULL)
+	{
+		LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+		published_plan_handles =
+			(dsm_handle *) ShmemInitStruct("auto_explain.published_plan_handles",
+										   explain_shmem_size(), &found);
+		if (!found)
+			MemSet(published_plan_handles, 0, explain_shmem_size());
+		LWLockRelease(AddinShmemInitLock);
+	}
+
+	/* Locate our slot lazily: MyBackendId is not assigned in the postmaster. */
+	if (published_plan_handles != NULL && MyBackendId != InvalidBackendId)
+		MyPublishedPlanHandle = &published_plan_handles[MyBackendId];
+}
+
+
 /*
  * ExecutorStart hook: start up logging if needed
  */
@@ -234,6 +321,11 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 static void
 explain_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
 {
+	if (publish_explain_enabled())
+	{
+		explain_publish_plan(queryDesc);
+	}
+
 	nesting_level++;
 	PG_TRY();
 	{
@@ -257,6 +349,11 @@ explain_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
 static void
 explain_ExecutorFinish(QueryDesc *queryDesc)
 {
+	if (MyPublishedPlanSegment)
+	{
+		explain_unpublish_my_plan();
+	}
+
 	nesting_level++;
 	PG_TRY();
 	{
@@ -294,32 +391,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		msec = queryDesc->totaltime->total * 1000.0;
 		if (msec >= auto_explain_log_min_duration)
 		{
-			ExplainState *es = NewExplainState();
-
-			es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
-			es->verbose = auto_explain_log_verbose;
-			es->buffers = (es->analyze && auto_explain_log_buffers);
-			es->timing = (es->analyze && auto_explain_log_timing);
-			es->summary = es->analyze;
-			es->format = auto_explain_log_format;
-
-			ExplainBeginOutput(es);
-			ExplainQueryText(es, queryDesc);
-			ExplainPrintPlan(es, queryDesc);
-			if (es->analyze && auto_explain_log_triggers)
-				ExplainPrintTriggers(es, queryDesc);
-			ExplainEndOutput(es);
-
-			/* Remove last line break */
-			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
-				es->str->data[--es->str->len] = '\0';
-
-			/* Fix JSON to output an object */
-			if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
-			{
-				es->str->data[0] = '{';
-				es->str->data[es->str->len - 1] = '}';
-			}
+			StringInfo		str = explain_query(queryDesc);
 
 			/*
 			 * Note: we rely on the existing logging of context or
@@ -329,10 +401,11 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			 */
 			ereport(LOG,
 					(errmsg("duration: %.3f ms  plan:\n%s",
-							msec, es->str->data),
+							msec, str->data),
 					 errhidestmt(true)));
 
-			pfree(es->str->data);
+			pfree(str->data);
+			pfree(str);
 		}
 	}
 
@@ -341,3 +414,200 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 	else
 		standard_ExecutorEnd(queryDesc);
 }
+
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		result;
+	ExplainState   *es = NewExplainState();
+
+	es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
+	es->verbose = auto_explain_log_verbose;
+	es->buffers = (es->analyze && auto_explain_log_buffers);
+	es->timing = (es->analyze && auto_explain_log_timing);
+	es->summary = es->analyze;
+	es->format = auto_explain_log_format;
+
+	ExplainBeginOutput(es);
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	if (es->analyze && auto_explain_log_triggers)
+		ExplainPrintTriggers(es, queryDesc);
+	ExplainEndOutput(es);
+
+	/* Remove last line break */
+	if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
+		es->str->data[--es->str->len] = '\0';
+
+	/* Fix JSON to output an object */
+	if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
+	{
+		es->str->data[0] = '{';
+		es->str->data[es->str->len - 1] = '}';
+	}
+
+	result = es->str;
+	pfree(es);
+
+	return result;
+}
+
+static void
+explain_publish_plan(QueryDesc *queryDesc)
+{
+	StringInfo		str = explain_query(queryDesc);
+	dsm_segment	   *dsm;
+
+	attach_published_plan_handles();
+
+	dsm = dsm_create(str->len + 1, DSM_CREATE_NULL_IF_MAXSEGMENTS);
+	if (dsm)
+	{
+		char	   *buffer = (char *) dsm_segment_address(dsm);
+		dsm_handle	handle;
+
+		memcpy(buffer, str->data, str->len + 1);
+
+		MyPublishedPlanSegment = dsm;
+
+		handle = dsm_segment_handle(dsm);
+		*MyPublishedPlanHandle = handle;
+
+		elog(DEBUG5, "published explain plan in a DSM segment: %u", handle);
+	}
+	else
+		elog(LOG, "explain_publish_plan: out of dynamic shared memory, needed: %d",
+			 str->len + 1);
+
+	pfree(str->data);
+	pfree(str);
+}
+
+static void
+explain_unpublish_my_plan(void)
+{
+	Assert(MyPublishedPlanSegment);
+	Assert(MyPublishedPlanHandle);
+
+	elog(DEBUG5, "unpublishing explain plan in a DSM segment: %u", *MyPublishedPlanHandle);
+
+	*MyPublishedPlanHandle = 0;
+
+	dsm_detach(MyPublishedPlanSegment);
+	MyPublishedPlanSegment = NULL;
+}
+
+
+/*
+ * Retrieve the plan of a statement currently running in a server backend
+ * specified by the process ID.
+ *
+ * FUNCTION pg_explain_backend(pid)
+ * RETURNS SETOF text
+ */
+Datum
+pg_explain_backend(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	PGPROC			   *target_proc;
+	dsm_handle			handle = 0;
+	dsm_segment		   *seg = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* verify access to target_pid */
+	target_proc = BackendPidGetProc(target_pid);
+
+	if (target_proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || target_proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to explain queries running in other server processes"))));
+
+	attach_published_plan_handles();
+	if (published_plan_handles == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot attach to published plans shared memory"),
+				 errhint("Is auto_explain.publish_plans turned on?")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	if (target_pid == MyProcPid)
+	{
+		/* cannot attach to our own segment, but we know where to look */
+		seg = MyPublishedPlanSegment;
+	}
+	else
+	{
+		handle = published_plan_handles[target_proc->backendId];
+		seg = dsm_attach(handle);
+	}
+	if (seg != NULL)
+	{
+		PG_TRY();
+		{
+			const char	   *p = (const char *) dsm_segment_address(seg);
+
+			/* break into tuples on newline */
+			while (*p)
+			{
+				const char *q = strchr(p, '\n');
+				Size		len = q ? q - p : strlen(p);
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+
+				value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+				tuple = heap_form_tuple(tupdesc, &value, &isnull);
+				tuplestore_puttuple(tupstore, tuple);
+
+				if (!q)
+					break;
+
+				p += len + 1;
+			}
+
+			/* detach DSM, but only if it's not our own one */
+			if (handle)
+				dsm_detach(seg);
+		}
+		PG_CATCH();
+		{
+			if (handle)
+				dsm_detach(seg);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+	}
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/contrib/auto_explain/auto_explain.control b/contrib/auto_explain/auto_explain.control
new file mode 100644
index 0000000..bd92c71
--- /dev/null
+++ b/contrib/auto_explain/auto_explain.control
@@ -0,0 +1,5 @@
+# auto_explain extension
+comment = 'logging facility for execution plans'
+default_version = '1.0'
+module_pathname = '$libdir/auto_explain'
+relocatable = true
-- 
2.1.4
