(I've closed the CF entry: https://commitfest.postgresql.org/22/1778/)
On 2019/04/01 2:04, Tom Lane wrote:
> Amit Langote <amitlangot...@gmail.com> writes:
>> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i...@live.jp> wrote:
>>> Certainly, using bitmapset contributes to the performance when scanning
>>> one partition (few partitions) from large partitions.
>
>> Thanks Imai-san for testing.
>
> I tried to replicate these numbers with the code as-committed, and
> could not.

Thanks for that.

> What I get, using the same table-creation code as you
> posted and a pgbench script file like
>
> \set param random(1, :N)
> select * from rt where a = :param;
>
> is scaling like this:
>
> N       tps, range      tps, hash
>
> 2       10520.519932    10415.230400
> 8       10443.361457    10480.987665
> 32      10341.196768    10462.551167
> 128     10370.953849    10383.885128
> 512     10207.578413    10214.049394
> 1024    10042.794340    10121.683993
> 4096    8937.561825     9214.993778
> 8192    8247.614040     8486.728918
>
> If I use "-M prepared" the numbers go up a bit for lower N, but
> drop at high N:
>
> N       tps, range      tps, hash
>
> 2       11449.920527    11462.253871
> 8       11530.513146    11470.812476
> 32      11372.412999    11450.213753
> 128     11289.351596    11322.698856
> 512     11095.428451    11200.683771
> 1024    10757.646108    10805.052480
> 4096    8689.165875     8930.690887
> 8192    7301.609147     7502.806455
>
> Digging into that, it seems like the degradation with -M prepared is
> mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
> What I think must be happening is that with -M prepared, at some point the
> plancache decides to try a generic plan, which causes opening/locking all
> the partitions, resulting in permanent bloat in the locallock hash table.
> We immediately go back to using custom plans, but hash_seq_search has
> more buckets to look through for the remainder of the process' lifetime.
Ah, we did find this to be a problem upthread [1], and Tsunakawa-san then
posted a patch for it, which is being discussed at:
https://commitfest.postgresql.org/22/1993/

> I do see some cycles getting spent in apply_scanjoin_target_to_paths
> that look to be due to scanning over the long part_rels array,
> which your proposal would ameliorate.  But (a) that's pretty small
> compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
> is a remarkable display of brute force inefficiency to begin with.
> I think we should see if we can't nuke that function altogether in
> favor of generating the paths with the right target the first time.

That's an option if we can make it work.  Shouldn't we look at *all* of
the places that now have code that looks like this:

    for (i = 0; i < rel->nparts; i++)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        if (partrel == NULL)
            continue;

        ...
    }

Besides apply_scanjoin_target_to_paths(), there are:

    create_partitionwise_grouping_paths()
    make_partitionedrel_pruneinfo()

> BTW, the real elephant in the room is the O(N^2) cost of creating
> these tables in the first place.  The runtime for the table-creation
> scripts looks like
>
> N       range           hash
>
> 2       0m0.011s        0m0.011s
> 8       0m0.015s        0m0.014s
> 32      0m0.032s        0m0.030s
> 128     0m0.132s        0m0.099s
> 512     0m0.969s        0m0.524s
> 1024    0m3.306s        0m1.442s
> 4096    0m46.058s       0m15.522s
> 8192    3m11.995s       0m58.720s
>
> This seems to be down to the expense of doing RelationBuildPartitionDesc
> to rebuild the parent's relcache entry for each child CREATE TABLE.
> Not sure we can avoid that, but maybe we should consider adopting a
> cheaper-to-read representation of partition descriptors.  The fact that
> range-style entries seem to be 3X more expensive to load than hash-style
> entries is strange.

I've noticed this many times too, but never prioritized doing something
about it.  I'll try sometime.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f-dn1hDZqObwdMrYdV7-cELJwWCPRWet6EQX_WaV8JLgw%40mail.gmail.com