On 4/23/2019 16:43, Justin Pryzby wrote:
It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.
I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to
16MB. Need to set vm.overcommit_ratio from 50 to 75 (percent, with
vm.overcommit_memory = 2 as it is.)
We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.
That may be so, but mis-estimates happen. And I can still massively
improve this query logically I am sure. In fact it sticks out like a
sore thumb, logically it makes no sense to churn over 100 million rows
here, but the point is that hopefully PostgreSQL runs stable in such
outlier situations, comes back and presents you with 2 hours of work
time, 40 GB temp space, or whatever, and then we users can figure out
how to make it work better. The frustrating thing it to get out of
memory and we not knowing what we can possibly do about it.
From my previous attempt with this tmp_r and tmp_q table, I also know
that the Sort/Uniqe step is taking a lot of extra time. I can cut that
out too by addressing the causes of the "repeated result" rows. But
again, that is all secondary optimizations.
Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))
If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.
I understand. You are saying by reducing the amount of columns in the
join condition, somehow you might be able to reduce the size of the hash
temporary table?
Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns (and
ANALYZE).
I think documentId and documentInternalId is 1:1 they are both primary /
alternate keys. So I could go with only one of them, but since I end up
needing both elsewhere inside the query I like to throw them all into
the natural join key, so that I don't have to deal with the duplicate
result columns.
Now running:
integrator=# set enable_nestloop to off; SET integrator=# explain
analyze select * from reports.v_BusinessOperation; WARNING:
ExecHashIncreaseNumBatches: nbatch=8 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=16 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=16777216 WARNING:
ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824 WARNING:
ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648 WARNING:
ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296 WARNING:
ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592 WARNING:
ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184 WARNING:
ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368 WARNING:
ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736
I am waiting now, probably for that Sort/Unique to finish I think that
the vast majority of the time spent is in this sort
Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual
time=6150303.060..6895451.210 rows=435274 loops=1) -> Sort
(cost=5551524.36..5551610.91 rows=34619 width=1197) (actual
time=6150303.058..6801372.192 rows=113478386 loops=1) Sort Key:
documentinformationsubject.documentinternalid,
documentinformationsubject.is_current,
documentinformationsubject.documentid,
documentinformationsubject.documenttypecode,
documentinformationsubject.subjectroleinternalid,
documentinformationsubject.subjectentityinternalid,
documentinformationsubject.subjectentityid,
documentinformationsubject.subjectentityidroot,
documentinformationsubject.subjectentityname,
documentinformationsubject.subjectentitytel,
documentinformationsubject.subjectentityemail,
documentinformationsubject.otherentityinternalid,
documentinformationsubject.confidentialitycode,
documentinformationsubject.actinternalid,
documentinformationsubject.code_code,
documentinformationsubject.code_displayname, q.code_code,
q.code_displayname, an.extension, an.root,
documentinformationsubject_2.subjectentitycode,
documentinformationsubject_2.subjectentitycodesystem,
documentinformationsubject_2.effectivetime_low,
documentinformationsubject_2.effectivetime_high,
documentinformationsubject_2.statuscode,
documentinformationsubject_2.code_code, agencyid.extension,
agencyname.trivialname, documentinformationsubject_1.subjectentitycode,
documentinformationsubject_1.subjectentityinternalid Sort Method:
external merge Disk: 40726720kB -> Hash Right Join
(cost=4255031.53..5530808.71 rows=34619 width=1197) (actual
time=325240.679..1044194.775 rows=113478386 loops=1)
isn't it?
Unique/Sort actual time 6,150,303.060 ms = 6,150 s <~ 2 h.
Hash Right Join actual time 325,240.679 ms.
So really all time is wasted in that sort, no need for you guys to worry
about anything else with these 2 hours. Tomas just stated the same thing.
Right. Chances are that with a bettwe estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.
The prospect of a merge join is interesting here to consider: with the
Sort/Unique step taking so long, it seems the Merge Join might also take
a lot of time? I see my disks are churning for the most time in this way:
avg-cpu: %user %nice %system %iowait %steal %idle 7.50 0.00 2.50 89.50
0.00 0.50 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz
await r_await w_await svctm %util nvme1n1 0.00 0.00 253.00 131.00 30.15
32.20 332.50 2.01 8.40 8.41 8.37 2.59 99.60 nvme1n1p24 0.00 0.00 253.00
131.00 30.15 32.20 332.50 2.01 8.40 8.41 8.37 2.59 99.60
I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the
previous steps, the hash join presumably, throughput was a lot higher,
like 2000 IOPS with 120 MB/s read or write.
But even if the Merge Join would have taken about the same or a little
more time than the Hash Join, I wonder, if one could not use that to
collapse the Sort/Unique step into that? Like it seems that after the
Sort/Merge has completed, one should be able to read Uniqe records
without any further sorting? In that case the Merge would be a great
advantage.
What I like about the situation now is that with that 4x bigger
work_mem, the overall memory situation remains the same. I.e., we are
scraping just below 1GB for this process and we see oscillation, growth
and shrinkage occurring. So I consider this case closed for me. That
doesn't mean I wouldn't be available if you guys want to try anything
else about it.
OK, now here is the result with the 16 MB work_mem:
Unique (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual
time=6283539.282..7003311.451 rows=435274 loops=1) -> Sort
(cost=5462874.86..5462961.41 rows=34619 width=1197) (actual
time=6283539.280..6908879.456 rows=113478386 loops=1) Sort Key:
documentinformationsubject.documentinternalid,
documentinformationsubject.is_current,
documentinformationsubject.documentid,
documentinformationsubject.documenttypecode,
documentinformationsubject.subjectroleinternalid, documentinformati
onsubject.subjectentityinternalid,
documentinformationsubject.subjectentityid,
documentinformationsubject.subjectentityidroot,
documentinformationsubject.subjectentityname,
documentinformationsubject.subjectentitytel,
documentinformationsubject.subjectenti tyemail,
documentinformationsubject.otherentityinternalid,
documentinformationsubject.confidentialitycode,
documentinformationsubject.actinternalid,
documentinformationsubject.code_code,
documentinformationsubject.code_displayname, q.code_code, q.code_disp
layname, an.extension, an.root,
documentinformationsubject_2.subjectentitycode,
documentinformationsubject_2.subjectentitycodesystem,
documentinformationsubject_2.effectivetime_low,
documentinformationsubject_2.effectivetime_high,
documentinformationsubjec t_2.statuscode,
documentinformationsubject_2.code_code, agencyid.extension,
agencyname.trivialname, documentinformationsubject_1.subjectentitycode,
documentinformationsubject_1.subjectentityinternalid Sort Method:
external merge Disk: 40726872kB -> Hash Right Join
(cost=4168174.03..5442159.21 rows=34619 width=1197) (actual
time=337057.290..1695675.896 rows=113478386 loops=1) Hash Cond:
(((q.documentinternalid)::text =
(documentinformationsubject.documentinternalid)::text) AND
((r.targetinternalid)::text =
(documentinformationsubject.actinternalid)::text)) -> Hash Right Join
(cost=1339751.37..2608552.36 rows=13 width=341) (actual
time=84109.143..84109.238 rows=236 loops=1) Hash Cond:
(((documentinformationsubject_2.documentinternalid)::text =
(q.documentinternalid)::text) AND
((documentinformationsubject_2.actinternalid)::text =
(q.actinternalid)::text)) -> Gather (cost=29501.54..1298302.52 rows=1
width=219) (actual time=43932.534..43936.888 rows=0 loops=1) Workers
Planned: 2 Workers Launched: 2 -> Parallel Hash Left Join
(cost=28501.54..1297302.42 rows=1 width=219) (actual
time=43925.304..43925.304 rows=0 loops=3) ... -> Hash
(cost=1310249.63..1310249.63 rows=13 width=233) (actual
time=40176.581..40176.581 rows=236 loops=1) Buckets: 1024 Batches: 1
Memory Usage: 70kB -> Hash Right Join (cost=829388.20..1310249.63
rows=13 width=233) (actual time=35925.031..40176.447 rows=236 loops=1)
Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq
Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual
time=1.609..7687.986 rows=14676871 loops=1) -> Hash
(cost=829388.19..829388.19 rows=1 width=136) (actual
time=30106.123..30106.123 rows=236 loops=1) Buckets: 1024 Batches: 1
Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1
width=136) (actual time=24786.510..30105.983 rows=236 loops=1) ... ->
Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual
time=252946.367..252946.367 rows=113478127 loops=1) Buckets: 32768
(originally 32768) Batches: 65536 (originally 4) Memory Usage: 1204250kB
-> Gather Merge (cost=2807073.90..2823846.37 rows=34619 width=930)
(actual time=83891.069..153380.040 rows=113478127 loops=1) Workers
Planned: 2 Workers Launched: 2 -> Merge Left Join
(cost=2806073.87..2818850.46 rows=14425 width=930) (actual
time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond:
(((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid):: text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text)) -> Sort
(cost=1295969.26..1296005.32 rows=14425 width=882) (actual
time=44814.114..45535.398 rows=231207 loops=3) Sort Key:
documentinformationsubject.documentinternalid,
documentinformationsubject.docum... Workers Planned: 2 Workers Launched:
2 -> Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930)
(actual time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond:
(((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid):: text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text)) -> Sort
(cost=1295969.26..1296005.32 rows=14425 width=882) (actual
time=44814.114..45535.398rows=231207 loops=3) ... Planning Time: 2.953
ms Execution Time: 7004340.091 ms (70 rows)
There isn't really any big news here. But what matters is that it works.
thanks & regards,
-Gunther Schadow