On Tue, Apr 23, 2019 at 07:09:00PM -0400, Gunther wrote:
On 4/23/2019 16:43, Justin Pryzby wrote:
It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.
I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to
16MB. I also need to raise vm.overcommit_ratio from 50 to 75 percent (with
vm.overcommit_memory = 2 as it is).
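(For reference, a sketch of how the PostgreSQL side of those settings can be
applied; the overcommit knobs are kernel settings and are changed with
sysctl, not from SQL:)

alter system set shared_buffers = '2GB';  -- takes effect only after a restart
alter system set work_mem = '16MB';       -- picked up on a reload
select pg_reload_conf();
-- vm.overcommit_ratio / vm.overcommit_memory are set at the OS level
-- (sysctl or /etc/sysctl.conf), outside of PostgreSQL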
We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.
That may be so, but mis-estimates happen. And I am sure I can still
massively improve this query logically. In fact it sticks out like a sore
thumb; logically it makes no sense to churn over 100 million rows here.
But the point is that hopefully PostgreSQL stays stable in such outlier
situations, comes back, and presents you with 2 hours of work time, 40 GB
of temp space, or whatever, and then we users can figure out how to make it
work better. The frustrating thing is to run out of memory and not know
what we can possibly do about it.
Sure. And I think the memory balancing algorithm implemented in the v2
patch is a step in that direction. I think we can do better in terms of
memory consumption (essentially keeping it closer to work_mem) but it's
unlikely to be any faster.
In a way this is similar to underestimates in hash aggregate, except
that in that case we don't have any spill-to-disk fallback at all.
From my previous attempt with this tmp_r and tmp_q table, I also know that
the Sort/Unique step is taking a lot of extra time. I can cut that out too
by addressing the causes of the "repeated result" rows. But again, that is
all secondary optimizations.
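Just as a sketch of the kind of rewrite I have in mind (not my actual query,
and the join key here is only illustrative): when one of the tables is only
needed as a filter, an EXISTS avoids producing the repeated rows that the
later Sort/Unique has to fold back together:

select q.*
from tmp_q q
where exists (
    select 1
    from tmp_r r
    where r.documentinternalid = q.documentinternalid
);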
Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))
If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.
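A quick way to check whether they really are redundant (a sketch using the
column names from your plan; if the two counts match, documentid and
actinternalid never vary within a documentinternalid and add nothing to the
join condition):

select count(distinct documentinternalid) as ids,
       count(distinct (documentinternalid, documentid, actinternalid)) as triples
from documentinformationsubject;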
I understand. You are saying that by reducing the number of columns in the join
condition, somehow you might be able to reduce the size of the hash
temporary table?
No. When estimating the join result size with multiple join clauses, the
optimizer essentially has to compute
1: P((x1 = y1) && (x2 = y2) && (x3 = y3))
so it assumes statistical independence of those conditions and splits
that into
2: P(x1 = y1) * P(x2 = y2) * P(x3 = y3)
But when those conditions are dependent - for example when (x1=y1) implies
that ((x2=y2) && (x3=y3)) - this results in a significant underestimate.
E.g. let's assume that each of those conditions matches 1/100 rows, but
that essentially x1=x2=x3 and y1=y2=y3. Then (1) is 1/100 but (2) ends
up being 1/1000000, so 10000x off.
Chances are this is what's happening with the inner side of the hash
join, which is estimated to return 14425 but ends up returning 37826042.
There's one trick you might try, though - using indexes on composite types:
create table t1 (a int, b int);
create table t2 (a int, b int);
insert into t1 select mod(i,1000), mod(i,1000)
from generate_series(1,100000) s(i);
insert into t2 select mod(i,1000), mod(i,1000)
from generate_series(1,100000) s(i);
analyze t1;
analyze t2;
explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
QUERY PLAN
--------------------------------------------------------------------
Merge Join (cost=19495.72..21095.56 rows=9999 width=16)
(actual time=183.043..10360.276 rows=10000000 loops=1)
Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
...
create type composite_id as (a int, b int);
create index on t1 (((a,b)::composite_id));
create index on t2 (((a,b)::composite_id));
analyze t1;
analyze t2;
explain analyze select * from t1 join t2
on ((t1.a,t1.b)::composite_id = (t2.a,t2.b)::composite_id);
QUERY PLAN
--------------------------------------------------------------------------
Merge Join (cost=0.83..161674.40 rows=9999460 width=16)
(actual time=0.020..12726.767 rows=10000000 loops=1)
Merge Cond: (ROW(t1.a, t1.b)::composite_id = ROW(t2.a, t2.b)::composite_id)
Obviously, that's not exactly free - you have to pay the price for index
creation, maintenance and storage.
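If you want to double-check that ANALYZE really gave the planner statistics
on the composite expression, they show up in pg_stats under the
auto-generated index names (something like t1_expr_idx / t2_expr_idx here):

-- the expression index appears as a "table" in pg_stats, with the
-- expression as its single column; n_distinct is what drives the
-- better join estimate
select tablename, attname, n_distinct
from pg_stats
where tablename in ('t1_expr_idx', 't2_expr_idx');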
...
Unique/Sort actual time 6,150,303.060 ms = 6,150 s <~ 2 h.
Hash Right Join actual time 325,240.679 ms.
So really all time is wasted in that sort, no need for you guys to worry
about anything else with these 2 hours. Tomas just stated the same thing.
Right. Chances are that with a better estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.
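That could be checked even without fixing the estimate - a sketch, and it
only affects the current session:

set enable_hashjoin = off;   -- discourages (does not forbid) hash joins
-- re-run the EXPLAIN (ANALYZE, BUFFERS) of the problem query here
reset enable_hashjoin;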
The prospect of a merge join is interesting to consider here: with the
Sort/Unique step taking so long, wouldn't the Merge Join also take a lot of
time? I see my disks churning most of the time like this:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.50    0.00    2.50   89.50    0.00    0.50

Device:     rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme1n1       0.00    0.00  253.00  131.00   30.15   32.20   332.50     2.01   8.40    8.41    8.37   2.59  99.60
nvme1n1p24    0.00    0.00  253.00  131.00   30.15   32.20   332.50     2.01   8.40    8.41    8.37   2.59  99.60
I.e. about 400 IOPS at 60 MB/s, half of it read, half of it write. During
the previous steps (the hash join, presumably), throughput was a lot higher,
like 2000 IOPS with 120 MB/s read or write.
But even if the Merge Join took about the same or a little more time than
the Hash Join, I wonder if one could not use it to collapse the Sort/Unique
step into the merge? It seems that after the Sort/Merge has completed, one
should be able to read Unique records without any further sorting. In that
case the Merge would be a great advantage.
Probably not, because there are far more columns in the Unique step. We
might have done something with "incremental sort" but we don't have that
capability yet.
What I like about the situation now is that with that 4x bigger work_mem,
the overall memory situation remains the same. I.e., we are scraping just
below 1GB for this process and we see oscillation, growth and shrinkage
occurring. So I consider this case closed for me. That doesn't mean I
wouldn't be available if you guys want to try anything else about it.
OK, now here is the result with the 16 MB work_mem:
Unique  (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274 loops=1)
  ->  Sort  (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386 loops=1)
...
Planning Time: 2.953 ms
Execution Time: 7004340.091 ms
(70 rows)
There isn't really any big news here. But what matters is that it works.
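(If it's useful next time, a sketch of how I can watch the spill activity
from another session; pg_stat_database keeps cumulative temp-file counters,
so it's the delta during the query that matters:)

select datname, temp_files, temp_bytes
from pg_stat_database
where datname = current_database();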
Yeah. Once the hash join outgrows the work_mem, the fallback logic
starts ignoring that in the effort to keep the memory usage minimal.
I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services