On 29.01.2020 17:47, Robert Haas wrote:
On Wed, Jan 29, 2020 at 3:13 AM Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
But I heard only two arguments:

1. Concurrent building of indexes by all backends may consume much memory 
(n_backends * maintenance_work_mem) and consume a lot of disk/CPU resources.
2. GTT in one session can contains large amount of data and we need index for 
it, but small amount of data in another session and we do not need index for it.
You seem to be ignoring the fact that two committers told you this
probably wasn't safe.

Perhaps your view is that those people made no argument, and therefore
you don't have to respond to it. But the onus is not on somebody else
to tell you why a completely novel idea is not safe. The onus is on
you to analyze it in detail and prove that it is safe. What you need
to show is that there is no code anywhere in the system which will be
confused by an index springing into existence at whatever time you're
creating it.

One problem is that there are various backend-local data structures in
the relcache, the planner, and the executor that remember information
about indexes, and that may not respond well to having more indexes
show up unexpectedly. On the one hand, they might crash; on the other
hand, they might ignore the new index when they shouldn't. Another
problem is that the code which creates indexes might fail or misbehave
when run in an environment different from the one in which it
currently runs. I haven't really studied your code, so I don't know
exactly what it does, but for example it would be really bad to try to
build an index while holding a buffer lock, both because it might
cause (low-probability) undetected deadlocks and also because it might
block another process that wants that buffer lock in a
non-interruptible wait state for a long time.

Now, maybe you can make an argument that you only create indexes at
points in the query that are "safe." But I am skeptical, because of
this example:

rhaas=# create table foo (a int primary key, b text, c text, d text);
CREATE TABLE
rhaas=# create function blump() returns trigger as $$begin create
index on foo (b); return new; end$$ language plpgsql;
CREATE FUNCTION
rhaas=# create trigger thud before insert on foo execute function blump();
CREATE TRIGGER
rhaas=# insert into foo (a) select generate_series(1,10);
ERROR:  cannot CREATE INDEX "foo" because it is being used by active
queries in this session
CONTEXT:  SQL statement "create index on foo (b)"
PL/pgSQL function blump() line 1 at SQL statement

That prohibition is there for some reason. Someone did not just decide
to arbitrarily prohibit it. A CREATE INDEX command run in that context
won't run afoul of many of the things that might be problems in other
places -- e.g. there won't be a buffer lock held. Yet, despite the
fact that a trigger context is safe for executing a wide variety of
user-defined code, this particular operation is not allowed here. That
is the sort of thing that should worry you.

At any rate, even if this somehow were or could be made safe,
on-the-fly index creation is a feature that cannot and should not be
combined with a patch to implement global temporary tables. Surely, it
will require a lot of study and work to get the details right. And so
will GTT. As I said in the other email I wrote, this feature is hard
enough without adding this kind of thing to it. There's a reason why I
never got around to implementing this ten years ago when I did
unlogged tables; I was intending that to be a precursor to the GTT
work. I found that it was too hard and I gave up. I'm glad to see
people trying again, but the idea that we can afford to add in extra
features, or frankly that either of the dueling patches on this thread
are close to committable, is just plain wrong.


Sorry, I really didn't consider statements containing word "probably" as arguments. But I agree with you: it is task of developer of new feature to prove that proposed approach is safe rather than of reviewers to demonstrate that it is unsafe.
Can I provide such proof now? I afraid that not.
But please consider two arguments:

1. Index for GTT in any case has to be initialized on demand. In case of regular tables index is initialized at the moment of its creation. In case of GTT it doesn't work. So we should somehow detect that accessed index is not initialized and perform lazy initialization of the index. The only difference with the approach proposed by Pavel  (allow index for empty GTT but prohibit it for GTT filled with data) is whether we also need to populate index with data or not. I can imagine that implicit initialization of index in read-only query (select) can be unsafe and cause some problems. I have not encountered such problems yet after performing many tests with GTTs, but certainly I have not covered all possible scenarios (not sure that it is possible at all). But I do not understand how populating  index with data can add some extra unsafety.

So I can not prove that building index for GTT on demand is safe, but it is not more unsafe than initialization of index on demand which is required in any case.

2. Actually I do not propose some completely new approach. I try to provide behavior with is compatible with regular tables. If you create index for regular table, then it can be used in all sessions, right? And all "various backend-local data structures in the relcache, the planner, and the executor that remember information about indexes" have to be properly updated.  It is done using invalidation mechanism. The same mechanism is used in case of DDL operations with GTT, because we change system catalog.

So my point here is that creation index of GTT is almost the same as creation of index for regular tables and the same mechanism will be used to provide correctness of this operation.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Reply via email to