On 4/23/2019 16:43, Justin Pryzby wrote:
It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.

I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to 16MB. Need to set vm.overcommit_ratio from 50 to 75 (percent, with vm.overcommit_memory = 2 as it is.)

We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.

That may be so, but mis-estimates happen. And I can still massively improve this query logically I am sure.  In fact it sticks out like a sore thumb, logically it makes no sense to churn over 100 million rows here, but the point is that hopefully PostgreSQL runs stable in such outlier situations, comes back and presents you with 2 hours of work time, 40 GB temp space, or whatever, and then we users can figure out how to make it work better. The frustrating thing it to get out of memory and we not knowing what we can possibly do about it.

From my previous attempt with this tmp_r and tmp_q table, I also know that the Sort/Uniqe step is taking  a lot of extra time. I can cut that out too by addressing the causes of the "repeated result" rows. But again, that is all secondary optimizations.

Merge Cond: (((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid)::text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text))
If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.
I understand. You are saying by reducing the amount of columns in the join condition, somehow you might be able to reduce the size of the hash temporary table?
Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns (and
ANALYZE).

I think documentId and documentInternalId is 1:1 they are both primary / alternate keys. So I could go with only one of them, but since I end up needing both elsewhere inside the query I like to throw them all into the natural join key, so that I don't have to deal with the duplicate result columns.

Now running:

integrator=# set enable_nestloop to off; SET integrator=# explain analyze select * from reports.v_BusinessOperation; WARNING: ExecHashIncreaseNumBatches: nbatch=8 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=16 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=16777216 WARNING: ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824 WARNING: ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648 WARNING: ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296 WARNING: ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592 WARNING: ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184 WARNING: ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368 WARNING: ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736

I am waiting now, probably for that Sort/Unique to finish I think that the vast majority of the time spent is in this sort

Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1) -> Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1) Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid Sort Method: external merge Disk: 40726720kB -> Hash Right Join (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)

isn't it?

Unique/Sort actual time   6,150,303.060 ms = 6,150 s <~ 2 h.
Hash Right Join actual time 325,240.679 ms.

So really all time is wasted in that sort, no need for you guys to worry about anything else with these 2 hours.  Tomas just stated the same thing.

Right. Chances are that with a bettwe estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.
The prospect of a merge join is interesting here to consider: with the Sort/Unique step taking so long, it seems the Merge Join might also take a lot of time? I see my disks are churning for the most time in this way:

avg-cpu: %user %nice %system %iowait %steal %idle 7.50 0.00 2.50 89.50 0.00 0.50 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util nvme1n1 0.00 0.00 253.00 131.00 30.15 32.20 332.50 2.01 8.40 8.41 8.37 2.59 99.60 nvme1n1p24 0.00 0.00 253.00 131.00 30.15 32.20 332.50 2.01 8.40 8.41 8.37 2.59 99.60

I.e. 400 IOPS at 60 MB/s half of it read, half of it write. During the previous steps, the hash join presumably, throughput was a lot higher, like 2000 IOPS with 120 MB/s read or write.

But even if the Merge Join would have taken about the same or a little more time than the Hash Join, I wonder, if one could not use that to collapse the Sort/Unique step into that? Like it seems that after the Sort/Merge has completed, one should be able to read Uniqe records without any further sorting? In that case the Merge would be a great advantage.

What I like about the situation now is that with that 4x bigger work_mem, the overall memory situation remains the same. I.e., we are scraping just below 1GB for this process and we see oscillation, growth and shrinkage occurring. So I consider this case closed for me. That doesn't mean I wouldn't be available if you guys want to try anything else about it.

OK, now here is the result with the 16 MB work_mem:

Unique (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274 loops=1) -> Sort (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386 loops=1) Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformati onsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectenti tyemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_disp layname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubjec t_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid Sort Method: external merge Disk: 40726872kB -> Hash Right Join (cost=4168174.03..5442159.21 rows=34619 width=1197) (actual time=337057.290..1695675.896 rows=113478386 loops=1) Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text)) -> Hash Right Join (cost=1339751.37..2608552.36 rows=13 width=341) (actual time=84109.143..84109.238 rows=236 loops=1) Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text)) -> Gather (cost=29501.54..1298302.52 rows=1 width=219) (actual time=43932.534..43936.888 rows=0 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Left Join (cost=28501.54..1297302.42 rows=1 width=219) (actual time=43925.304..43925.304 rows=0 loops=3) ... -> Hash (cost=1310249.63..1310249.63 rows=13 width=233) (actual time=40176.581..40176.581 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 70kB -> Hash Right Join (cost=829388.20..1310249.63 rows=13 width=233) (actual time=35925.031..40176.447 rows=236 loops=1) Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual time=1.609..7687.986 rows=14676871 loops=1) -> Hash (cost=829388.19..829388.19 rows=1 width=136) (actual time=30106.123..30106.123 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1 width=136) (actual time=24786.510..30105.983 rows=236 loops=1) ... -> Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual time=252946.367..252946.367 rows=113478127 loops=1) Buckets: 32768 (originally 32768) Batches: 65536 (originally 4) Memory Usage: 1204250kB -> Gather Merge (cost=2807073.90..2823846.37 rows=34619 width=930) (actual time=83891.069..153380.040 rows=113478127 loops=1) Workers Planned: 2 Workers Launched: 2 -> Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid):: text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text)) -> Sort (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398 rows=231207 loops=3) Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.docum... Workers Planned: 2 Workers Launched: 2 -> Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid):: text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text)) -> Sort (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398rows=231207 loops=3) ... Planning Time: 2.953 ms Execution Time: 7004340.091 ms (70 rows)

There isn't really any big news here. But what matters is that it works.

thanks & regards,
-Gunther Schadow




Reply via email to