Re: [PoC] Reducing planning time when tables have many partitions
child_rel, List *child_tlist, List *setop_pathkeys)
{
	ListCell   *lc;
	ListCell   *lc2 = list_head(setop_pathkeys);

	foreach(lc, child_tlist)
	{
		TargetEntry *tle = lfirst_node(TargetEntry, lc);
		EquivalenceMember *parent_em;
		PathKey    *pk;

		if (tle->resjunk)
			continue;
		if (lc2 == NULL)
			elog(ERROR, "too few pathkeys for set operation");

		pk = lfirst_node(PathKey, lc2);
		parent_em = linitial(pk->pk_eclass->ec_members);

		/*
		 * We can safely pass the parent member as the first member in the
		 * ec_members list as this is added first in generate_union_paths,
		 * likewise, the JoinDomain can be that of the initial member of the
		 * Pathkey's EquivalenceClass.
		 */
		add_eq_member(pk->pk_eclass, tle->expr, child_rel->relids,
					  parent_em->em_jdomain, parent_em,
					  exprType((Node *) tle->expr));

		lc2 = lnext(setop_pathkeys, lc2);
	}

	/*
	 * transformSetOperationStmt() ensures that the targetlist never contains
	 * any resjunk columns, so all eclasses that exist in 'root' must have
	 * received a new member in the loop above. Add them to the child_rel's
	 * eclass_indexes.
	 */
	child_rel->eclass_indexes = bms_add_range(child_rel->eclass_indexes, 0,
											  list_length(root->eq_classes) - 1);
}
=

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=66c0185a3d14bbbf51d0fc9d267093ffec735231

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello Ashutosh,

Thank you for your email and for reviewing the patch. I sincerely apologize for the delay in responding.

On Wed, Mar 6, 2024 at 11:16 PM Ashutosh Bapat wrote:
> here's summary of observations
> 1. The patch improves planning time significantly (3X to 20X) and the
> improvement increases with the number of tables being joined.
> 2. In the assert enabled build the patch slows down (in comparison to HEAD)
> planning with higher number of tables in the join. You may want to
> investigate this. But this is still better than my earlier measurements.
> 3. The patch increased memory consumption by planner. But the numbers have
> improved since my last measurement. Still it will be good to investigate what
> causes this extra memory consumption.
> 4. Generally with the assert enabled build planner consumes more memory with
> or without patch. From my previous experience this might be due to Bitmapset
> objects created within Assert() calls.

Thank you for testing the patch and sharing the results.

For comment #1, these results show the effectiveness of the patch.

For comment #2, I agree that we should not slow down assert-enabled builds. The patch adds many assertions to avoid introducing bugs, but they may be excessive. I will reconsider these assertions and remove unnecessary ones.

For comments #3 and #4, while the patch improves time complexity, it has some negative impact on space complexity. The patch uses a Bitmapset-based index to speed up searching for EquivalenceMembers and RestrictInfos. Reducing this memory consumption is somewhat hard, but it is a very important problem for committing this patch, so I will investigate it further.

> Does v24-0002 have any relation/overlap with my patches to reduce memory
> consumed by RestrictInfos? Those patches have code to avoid creating
> duplicate RestrictInfos (including commuted RestrictInfos) from ECs. [2]

Thank you for sharing these patches. My patch may be related to your patches.
My patch speeds up slow linear searches over EquivalenceMembers and RestrictInfos. It uses several approaches, one of which is the Bitmapset-based index. Bitmapsets are space-inefficient, so if there are many EquivalenceMembers and RestrictInfos, this index becomes large. This is true for highly partitioned cases, where there are many similar (or duplicate) elements. Eliminating such duplicate elements may help my patch reduce memory consumption. I will investigate this further.

Unfortunately, I've been busy with work, so I may not be able to respond soon. I apologize for this. However, I will look into the patches, including yours, and share any further findings.

Again, I apologize for my late response and appreciate your kind review.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello,

On Wed, Nov 22, 2023 at 2:32 PM Yuya Watari wrote:
> Unfortunately, I've been busy due to work, so I won't be able to
> respond for several weeks. I'm really sorry for not being able to see
> the patches. As soon as I'm not busy, I will look at them, consider
> the above approach, and reply to this thread. If there is no
> objection, I will move this CF entry forward to next CF.

Since the end of this month is approaching, I have moved this CF entry to the next CF (the January CF). I will reply to this thread in a few weeks.

Again, I appreciate your kind reviews and patches.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello Alena, Andrei, and all,

Thank you for reviewing this patch. I apologize for not updating this thread for a while.

On Sat, Nov 18, 2023 at 6:04 AM Alena Rybakina wrote:
> Hi, all!
>
> While I was reviewing the patches, I noticed that they needed some rebasing,
> and in one of the patches (Introduce-indexes-for-RestrictInfo.patch) there
> was a conflict with the recently added self-join-removal feature [1]. So, I
> rebased patches and resolved the conflicts. While I was doing this, I found a
> problem that I also fixed:

Thank you very much for rebasing these patches and fixing the issue. The bug seems to have been caused by the indexes living in RangeTblEntry, whose serialization was not handled correctly. Thank you for fixing it.

On Mon, Nov 20, 2023 at 1:45 PM Andrei Lepikhov wrote:
> During the work on committing the SJE feature [1], Alexander Korotkov
> pointed out the silver lining in this work [2]: he proposed that we
> shouldn't remove RelOptInfo from simple_rel_array at all but replace it
> with an 'Alias', which will refer the kept relation. It can simplify
> further optimizations on removing redundant parts of the query.

Thank you for sharing this information. I think the idea suggested by Alexander Korotkov is also helpful for our patch. As mentioned above, the current implementation stores the indexes in RangeTblEntry. However, RangeTblEntry is not the best place for them. Such an 'alias' may help solve this and simplify fixing the above bug. I will try this approach soon.

Unfortunately, I've been busy with work, so I won't be able to respond for several weeks. I'm sorry for not being able to look at the patches. As soon as I have time, I will review them, consider the above approach, and reply to this thread. If there is no objection, I will move this CF entry forward to the next CF.

Again, thank you very much for looking at this thread, and I'm sorry for my late work.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello Ashutosh and Andrey,

On Wed, Sep 20, 2023 at 8:03 PM Ashutosh Bapat wrote:
> While working on RestrictInfo translations patch I was thinking on
> these lines. [1] uses hash table for storing translated RestrictInfo.
> An EC can have a hash table to store ec_member translations. The same
> patchset also has some changes in the code which generates
> RestrictInfo clauses from ECs. I think that code will be simplified by
> your approach.

Thank you for sharing this. I agree that my patch must not add complexity to existing or future code. As you say, the approach mentioned in my last email should help simplify the code, so I will try it.

On Fri, Sep 22, 2023 at 12:49 PM Lepikhov Andrei wrote:
> It is okay if we talk about the self-join-removal feature specifically
> because joins are removed before an inheritance expansion.
> But ec_source_indexes and ec_derive_indexes point to specific places in
> eq_sources and eq_derives lists. If I removed an EquivalenceClass or a
> restriction during an optimisation, I would arrange all indexes, too.
> Right now, I use a workaround here and remove the index link without removing
> the element from the list. But I'm not sure how good this approach can be in
> perspective.
> So, having eq_sources and eq_derives localised in EC could make such
> optimisations a bit more simple.

Thank you for pointing this out. The ec_source_indexes and ec_derive_indexes were simply carried over from the previous patch, and I have not changed their design. I think an approach similar to the one used for EquivalenceMembers may be applicable to RestrictInfos. I will experiment with this and share a new patch.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello Ashutosh and Andrey,

Thank you for your email, and I really apologize for my late response.

On Thu, Sep 7, 2023 at 3:43 PM Ashutosh Bapat wrote:
> It seems that you are still investigating and fixing issues. But the
> CF entry is marked as "needs review". I think a better status is
> "WoA". Do you agree with that?

Yes, I am now investigating and fixing issues. I agree with you and have changed the entry's status to "Waiting on Author". Thank you for your advice.

On Tue, Sep 19, 2023 at 5:21 PM Andrey Lepikhov wrote:
> Working on self-join removal in the thread [1] nearby, I stuck into the
> problem, which made an additional argument to work in this new direction
> than a couple of previous ones.
> With indexing positions in the list of equivalence members, we make some
> optimizations like join elimination more complicated - it may need to
> remove some clauses and equivalence class members.
> For changing lists of derives or ec_members, we should go through all
> the index lists and fix them, which is a non-trivial operation.

Thank you for looking into this and pointing it out. I understand that this problem will occur in places like the part of your patch [1] quoted below, because we would need to modify RelOptInfo->eclass_child_members in addition to ec_members. Is my understanding correct? (Of course, I know about ec_[no]rel_members, but I doubt we need them.)

=
+static void
+update_eclass(EquivalenceClass *ec, int from, int to)
+{
+	List	   *new_members = NIL;
+	ListCell   *lc;
+
+	foreach(lc, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst_node(EquivalenceMember, lc);
+		bool		is_redundant = false;
+
...
+
+		if (!is_redundant)
+			new_members = lappend(new_members, em);
+	}
+
+	list_free(ec->ec_members);
+	ec->ec_members = new_members;
=

I think we may be able to remove the eclass_child_members field by creating child members on demand. v20 creates child members in add_[child_]join_rel_equivalences() and adds them to RelOptInfo->eclass_child_members.
Instead of doing that, if we translate on demand when child members are requested, RelOptInfo->eclass_child_members is no longer necessary. After that, only ec_members remains, consisting solely of parent members, so removing clauses will still be simple. Do you think this idea would solve your problem? If so, I will experiment with it and share a new patch version. The main concern with this idea is that the same child member could be created many times, wasting time and memory. Techniques such as caching might solve this.

[1] https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello David,

I really appreciate your quick reply.

On Wed, Aug 9, 2023 at 7:28 PM David Rowley wrote:
> If 0004 is adding an em_index to mark the index into
> PlannerInfo->eq_members, can't you use that in
> setup_eclass_member[_strict]_iterator to loop to verify that the two
> methods yield the same result?
>
> i.e:
>
> + Bitmapset *matching_ems = NULL;
> + memcpy(&idx_iter, iter, sizeof(EquivalenceMemberIterator));
> + memcpy(&noidx_iter, iter, sizeof(EquivalenceMemberIterator));
> +
> + idx_iter.use_index = true;
> + noidx_iter.use_index = false;
> +
> + while ((em = eclass_member_iterator_strict_next(&idx_iter)) != NULL)
> + matching_ems = bms_add_member(matching_ems, em->em_index);
> +
> + Assert(bms_equal(matching_ems, iter->matching_ems));
>
> That should void the complaint that the Assert checking is too slow.
> You can also delete the list_ptr_cmp function too (also noticed a
> complaint about that).

Thanks for sharing your idea for this verification. It looks like a good way to solve the degradation problem in assert-enabled builds. I will try it.

> For the 0003 patch. Can you explain why you think these fields should
> be in RangeTblEntry rather than RelOptInfo? I can only guess you might
> have done this for memory usage so that we don't have to carry those
> fields for join rels? I think RelOptInfo is the correct place to
> store fields that are only used in the planner. If you put them in
> RangeTblEntry they'll end up in pg_rewrite and be stored for all
> views. Seems very space inefficient and scary as it limits the scope
> for fixing bugs in back branches due to RangeTblEntries being
> serialized into the catalogues in various places.

This change was not made for performance reasons but to avoid null reference exceptions. The details are explained in my email [1]. In brief, the earlier patch did not work because simple_rel_array[i] could be NULL for some 'i', and we referenced such a RelOptInfo.
For example, the following code snippet in add_eq_member() does not work. I inserted "Assert(rel != NULL)" into this code, and the assertion failed. So, I moved the indexes to RangeTblEntry to address this issue, but I don't know if this solution is good. We may have to solve this in a different way.

=
@@ -572,9 +662,31 @@ add_eq_member(EquivalenceClass *ec, Expr *expr, Relids relids,
+	i = -1;
+	while ((i = bms_next_member(expr_relids, i)) >= 0)
+	{
+		RelOptInfo *rel = root->simple_rel_array[i];
+
+		rel->eclass_member_indexes = bms_add_member(rel->eclass_member_indexes, em_index);
+	}
=

On Wed, Aug 9, 2023 at 8:54 PM David Rowley wrote:
> So, I have three concerns with this patch.
> I think the best way to move this forward is to explore not putting
> partitioned table partitions in EMs and instead see if we can
> translate to top-level parent before lookups. This might just be too
> complex to translate the Exprs all the time and it may add overhead
> unless we can quickly determine somehow that we don't need to attempt
> to translate the Expr when the given Expr is already from the
> top-level parent. If that can't be made to work, then maybe that shows
> the current patch has merit.

I really appreciate your detailed advice. I am sorry that I will not be able to respond for a week or two due to my vacation, but I will explore and work on these issues.

Thanks again for your kind reply.

[1] https://www.postgresql.org/message-id/CAJ2pMkYR_X-%3Dpq%2B39-W5kc0OG7q9u5YUwDBCHnkPur17DXnxuQ%40mail.gmail.com

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello Andrey, Ashutosh, and David,

Thank you for your replies and for reviewing the patch.

On Mon, Aug 7, 2023 at 5:51 PM Andrey Lepikhov wrote:
> One more thing: I think, you should add comments to
> add_child_rel_equivalences() and add_child_join_rel_equivalences()
> on replacing of:
>
> if (bms_is_subset(cur_em->em_relids, top_parent_relids) &&
> !bms_is_empty(cur_em->em_relids))
> and
> if (bms_overlap(cur_em->em_relids, top_parent_relids))
>
> with different logic. What was changed? It will be better to help future
> developers realize this part of the code more easily by adding some
> comments.

The following change in add_child_join_rel_equivalences():

-		/* Does this member reference child's topmost parent rel? */
-		if (bms_overlap(cur_em->em_relids, top_parent_relids))

is correct because EquivalenceMemberIterator guarantees that these two Relids always overlap for the iterated results. The iteration is done by the code below. As it shows, the iterator skips members whose Relids do not overlap, so we do not need to check bms_overlap() again on the iterated results.

=
/*
 * eclass_member_iterator_next
 *		Fetch the next EquivalenceMember from an EquivalenceMemberIterator
 *		which was set up by setup_eclass_member_iterator(). Returns NULL when
 *		there are no more matching EquivalenceMembers.
 */
EquivalenceMember *
eclass_member_iterator_next(EquivalenceMemberIterator *iter)
{
	...
	ListCell   *lc;

	for_each_from(lc, iter->eclass->ec_members, iter->current_index + 1)
	{
		EquivalenceMember *em = lfirst_node(EquivalenceMember, lc);

		...

		/*
		 * Don't return members which have no common rels with with_relids
		 */
		if (!bms_overlap(em->em_relids, iter->with_relids))
			continue;

		return em;
	}
	return NULL;
	...
}
=

I agree with your opinion that my patch lacks some explanations, so I will consider adding more comments. However, I received the following message from David in March.
On Thu, Mar 9, 2023 at 6:23 AM David Rowley wrote:
> For the main patch, I've been starting to wonder if it should work
> completely differently. Instead of adding members for partitioned and
> inheritance children, we could just translate the Vars from child to
> top-level parent and find the member that way. I wondered if this
> method might be even faster as it would forego
> add_child_rel_equivalences(). I think we'd still need em_is_child for
> UNION ALL children. So far, I've not looked into this in detail. I
> was hoping to find an idea that would allow some means to have the
> planner realise that a LIST partition which allows a single Datum
> could skip pushing base quals which are constantly true. i.e:
>
> create table lp (a int) partition by list(a);
> create table lp1 partition of lp for values in(1);
> explain select * from lp where a = 1;
>
> Seq Scan on lp1 lp (cost=0.00..41.88 rows=13 width=4)
>    Filter: (a = 1)

I am concerned that fixing the current patch would conflict with David's idea. I am now trying to experiment with the above idea, but I should avoid the conflict if he is already working on this. David, what do you think about this? Is it OK to post a new patch to address the review comments? I am looking forward to your reply.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello,

Thank you for your reply.

On Thu, Aug 3, 2023 at 10:29 PM Ashutosh Bapat wrote:
> If you think that the verification is important to catch bugs, you may want
> to encapsulate it with an #ifdef .. #endif such that the block within is not
> compiled by default. See OPTIMIZER_DEBUG for example.

In my opinion, verifying the iteration results is only necessary to avoid introducing bugs while developing this patch; it is excessive for regular PostgreSQL development. I agree that we should avoid a significant degradation in assert-enabled builds, so I will consider removing it.

> Do you think that the memory measurement patch I have shared in those threads
> is useful in itself? If so, I will start another proposal to address it.

For me, developing the planner in this thread, the memory measurement patch is useful. However, most users do not care about memory usage, so there is room for consideration. For example, making the metrics optional in EXPLAIN ANALYZE output might be better.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello,

On Wed, Aug 2, 2023 at 6:43 PM Andrey Lepikhov wrote:
> You introduced list_ptr_cmp as an extern function of a List, but use it
> the only under USE_ASSERT_CHECKING ifdef.
> Maybe you hide it under USE_ASSERT_CHECKING or remove all the stuff?

Thank you for your quick reply and for pointing that out.

If we remove the verification code when committing this patch, we should also remove the list_ptr_cmp() function, because nobody will use it. If we keep the verification, whether to hide it behind USE_ASSERT_CHECKING is a difficult question. list_ptr_cmp() is generic and could be helpful even without assertions, so not hiding it is one option. However, I understand that it is not pretty to compile a function that is never referenced when assertions are disabled. As you say, hiding it behind USE_ASSERT_CHECKING is also a possible solution.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello,

I really appreciate your sharing very useful scripts and benchmarking results.

On Fri, Jul 28, 2023 at 6:51 PM Ashutosh Bapat wrote:
> Given that most of the developers run assert enabled builds it would
> be good to bring down the degradation there while keeping the
> excellent speedup in non-assert builds.

From my observation, this degradation in assert-enabled builds is caused by verifying the iteration results for EquivalenceMembers. My patch uses Bitmapset-based indexes to speed up the iteration. When assertions are enabled, we verify that the result of the iteration is the same with and without the indexes. This verification executes a similar loop three times, which causes the degradation.

I measured planning time using your script without this verification. The results are as follows:

Master: 144.55 ms
Patched (v19): 529.85 ms
Patched (v19) without verification: 78.84 ms

(*) All runs are with assertions enabled.

As seen above, verifying the iteration results was the cause of the performance degradation. I agree that we should avoid such degradation because it negatively affects the development of PostgreSQL. Removing the verification when committing this patch is one possible option.

> Queries on partitioned tables eat a lot of memory anyways, increasing
> that further should be avoided.
>
> I have not studied the patches. But I think the memory increase has to
> do with our Bitmapset structure. It's space inefficient when there are
> thousands of partitions involved. See my comment at [2]

Thank you for pointing this out. I had not considered the memory usage impact of this patch. As you say, the Bitmapset structure caused this increase. I will look into this further.

--
Best regards,
Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Hello,

On Fri, Jul 28, 2023 at 1:27 PM Andrey Lepikhov wrote:
> Sorry for this. It was definitely a false alarm. In this patch,
> assertion checking adds much overhead. After switching it off, I found
> out that this feature solves my problem with a quick pass through the
> members of an equivalence class. Planning time results for the queries
> from the previous letter:
> 1 - 0.4s, 2 - 1.3s, 3 - 1.3s (with the patches applied)
> 1 - 5s, 2 - 8.7s, 3 - 22s (current master)
>
> I have attached flamegraph that shows query 2 planning process after
> applying this set of patches. As you can see, overhead at the
> equivalence class routines has gone.

I really appreciate your testing the patches and sharing the results. The results are interesting because they show that our optimization effectively reduces planning time for your workload, which contains different queries from those I used in my benchmarks. Thank you again for reviewing this.

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

On Tue, Jul 4, 2023 at 9:36 AM David Rowley wrote:
> I've now pushed the patch.

Thanks for the commit!

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

On Mon, Jul 3, 2023 at 9:10 AM David Rowley wrote:
> Here's the patch which includes those Asserts. I also made some small
> tweaks to a comment.

Thank you for your reply. I am +1 to your change. I think these assertions will help someone who changes the Bitmapset implementation in the future.

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

Thank you for your reply and for creating a new patch.

On Wed, Jun 28, 2023 at 7:58 PM David Rowley wrote:
> Linux with AMD 3990x, again using the patch from [1] with make installcheck
>
> master: 1.41720145 seconds
> v4: 1.392969606 seconds (1.74% faster than master)
> v4 with 0th word check: 1.404199748 seconds (0.93% faster than master)

I have tested these versions with installcheck. Since the planning times obtained by installcheck vary each time, it is important to run it repeatedly and examine the distribution. I ran installcheck 100 times for each version. The following tables and the attached figure show the results.

From these results, we can conclude that the v4 patch has no regression in the installcheck test. It seems to be slightly (0.31-0.38%) faster than the master. The difference between v4 and v4 with 0th word check is not so clear, but v4 may be faster.

Table 1: Total planning time during installcheck (seconds)
--------------------------------------------------------------
                       | Mean     | Median   | Stddev
--------------------------------------------------------------
Master                 | 2.520865 | 2.521189 | 0.017651
v4                     | 2.511447 | 2.513369 | 0.018299
v4 with 0th word check | 2.513393 | 2.515652 | 0.018391
--------------------------------------------------------------

Table 2: Speedup (higher is better)
--------------------------------------------------------------
                       | Speedup (Mean) | Speedup (Median)
--------------------------------------------------------------
v4                     | 0.38%          | 0.31%
v4 with 0th word check | 0.30%          | 0.22%
--------------------------------------------------------------

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

On Tue, Jun 20, 2023 at 1:17 PM David Rowley wrote:
> I've adjusted the attached patch to do that.

Thank you for updating the patch. The v4 patch looks good to me.

I ran another experiment. In the experiment, I issued queries of the Join Order Benchmark [1] and measured their planning times. The following table shows the result. The v4 patch obtained outstanding performance improvements in planning time. This result supports the effectiveness of the patch in real workloads.

Table 1: Planning time speedup of the Join Order Benchmark
(n: the number of partitions of each table; higher is better)

   n | Speedup (v4)
   2 | 102.4%
   4 | 101.0%
   8 | 101.6%
  16 | 103.1%
  32 | 107.5%
  64 | 115.7%
 128 | 142.9%
 256 | 187.7%

[1] https://github.com/winkyao/join-order-benchmark

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

On Thu, Jun 22, 2023 at 1:43 PM David Rowley wrote:
> > 3. Avoid enlargement when nwords is equal wordnum.
> > Can save cycles when in corner cases?
>
> No, you're just introducing a bug here. Arrays in C are zero-based,
> so "wordnum >= a->nwords" is exactly the correct way to check if
> wordnum falls outside the bounds of the existing allocated memory. By
> changing that to "wordnum > a->nwords" we'll fail to enlarge the words
> array when it needs to be enlarged by 1 element.

I agree with David. Unfortunately, some of the regression tests failed with the v5 patch. These failures are due to the bug introduced by the #3 change.

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

On Tue, Jun 13, 2023 at 8:07 PM David Rowley wrote:
> I've incorporated fixes for the bugs in the attached patch. I didn't
> quite use the same approach as you did. I did the fix for 0003
> slightly differently and added two separate paths. We've no need to
> track the last non-zero word when 'a' has more words than 'b' since we
> can't truncate any zero-words off for that case. Not having to do
> that makes the do/while loop pretty tight.

I really appreciate your quick response and your incorporating those fixes into your patch. The fix for 0003 looks good to me. I believe your change improves performance further.

> For the fix in the 0004 patch, I think we can do what you did more
> simply. I don't think there's any need to perform the loop to find
> the last non-zero word. We're only deleting a member from a single
> word here, so we only need to check if that word is the last word and
> remove it if it's become zero. If it's not the last word then we
> can't remove it as there must be some other non-zero word after it.

If my thinking is correct, the do-while loop I added is still necessary. Consider the following code. The assertion in this code passes on master but fails with the new patch.

=
Bitmapset *x = bms_make_singleton(1000);

x = bms_del_member(x, 1000);
Assert(x == NULL);
=

In the code above, we get a new Bitmapset from bms_make_singleton(1000). This Bitmapset has many words: only the last word is non-zero, and all the rest are zero. If we call bms_del_member(x, 1000) on this Bitmapset, all words of the result will be zero, including the last word, so we must return NULL. However, the new patch truncates only the last word, leading to an incorrect result. Therefore, we need the loop to find the actual last non-zero word after the deletion. Of course, I agree that if we are not modifying the last word, we don't have to truncate anything, so we can omit the loop.
> I also made a small adjustment to bms_get_singleton_member() and
> bms_singleton_member() to have them Assert fail if result is < 0 after
> looping over the set. This should no longer happen so I thought it
> would make more compact code if that check was just removed. We'd
> likely do better if we got reports of Assert failures here than, in
> the case of bms_get_singleton_member, some code accidentally doing the
> wrong thing based on a corrupt Bitmapset.

I agree with your change. I think failing an Assertion is better than a runtime error or unexpected behavior.

--
Best regards,
Yuya Watari
Re: Making empty Bitmapsets always be NULL
Hello,

On Thu, Mar 16, 2023 at 10:30 AM Yuya Watari wrote:
> My idea is to compute the bitwise OR of all bitmapwords of the result
> Bitmapset. The bitwise OR can be represented as a single operation in
> the machine code and does not require any conditional branches. If the
> bitwise ORed value is zero, we can conclude the result Bitmapset is
> empty. The costs related to this operation can be almost negligible;
> it is significantly cheaper than calling bms_is_empty_internal() and
> less expensive than using a conditional branch such as 'if.'

After posting the patch, I noticed that it had some bugs: my idea above is not applicable to bms_del_member(), and I missed some additional operations required in bms_del_members(). I have attached the fixed version to this email. I apologize for the mistakes. Should we add new regression tests to prevent this kind of bug?

The following tables show the result of a re-run of the experiment. The significant improvement reported earlier was a mistake, but a speedup of about 2% was still obtained when the number of partitions, n, was large. This result indicates that the optimization regarding bms_is_empty_internal() is effective for some workloads.
Table 1: Planning time (ms)
(n: the number of partitions of each table)
----------------------------------------------------------------
   n | Master   | Patched (trailing-zero) | Patched (bitwise-OR)
----------------------------------------------------------------
   1 |   36.903 |   36.621                |   36.731
   2 |   35.842 |   35.031                |   35.704
   4 |   37.756 |   37.457                |   37.409
   8 |   42.069 |   42.578                |   42.322
  16 |   53.670 |   67.792                |   53.618
  32 |   88.412 |  100.605                |   89.147
  64 |  229.734 |  271.259                |  225.971
 128 |  889.367 | 1272.270                |  870.472
 256 | 4209.312 | 8223.623                | 4129.594
----------------------------------------------------------------

Table 2: Planning time speedup (higher is better)
----------------------------------------------------------------
   n | Patched (trailing-zero) | Patched (bitwise-OR)
----------------------------------------------------------------
   1 | 100.8%                  | 100.5%
   2 | 102.3%                  | 100.4%
   4 | 100.8%                  | 100.9%
   8 |  98.8%                  |  99.4%
  16 |  79.2%                  | 100.1%
  32 |  87.9%                  |  99.2%
  64 |  84.7%                  | 101.7%
 128 |  69.9%                  | 102.2%
 256 |  51.2%                  | 101.9%
----------------------------------------------------------------

--
Best regards,
Yuya Watari

v2-0001-Remove-bms_is_empty_internal-function-calls.patch
Description: Binary data
Re: Making empty Bitmapsets always be NULL
If the bitwise ORed value is zero, we can conclude the result Bitmapset is empty. The costs related to this operation can be almost negligible; it is significantly cheaper than calling bms_is_empty_internal() and less expensive than using a conditional branch such as 'if.'

In the tables above, I called my patch the "bitwise-OR" patch. The patch is much faster than master when n is large; its speedup reached 167.1%. I think just adopting this optimization is worth considering.

[1] https://www.postgresql.org/message-id/caj2pmky10j_pa2jph5m-vooo6bvjntoo_-t_znu_poap0q1...@mail.gmail.com
[2] https://www.postgresql.org/message-id/caaphdvq9eq0w_afugrb6ba28ieuqn4zm5uwqxy7+lmzjjc+...@mail.gmail.com

--
Best regards,
Yuya Watari

v1-0001-Remove-bms_is_empty_internal-function-calls.patch
Description: Binary data
create-tables.sql
Description: Binary data
query.sql
Description: Binary data
Re: [PoC] Reducing planning time when tables have many partitions
Dear Andrey and Thom, Thank you for reviewing and testing the patch. I really apologize for my late response. On Tue, Nov 8, 2022 at 8:31 PM Andrey Lepikhov wrote: > Looking into find_em_for_rel() changes I see that you replaced > if (bms_is_subset(em->em_relids, rel->relids) > with assertion statement. > According of get_ecmember_indexes(), the em_relids field of returned > equivalence members can contain relids, not mentioned in the relation. > I don't understand, why it works now? For example, we can sort by t1.x, > but have an expression t1.x=t1.y*t2.z. Or I've missed something? If it > is not a mistake, maybe to add a comment why assertion here isn't failed? As you pointed out, changing the bms_is_subset() condition to an assertion is logically incorrect here. Thank you for telling me about it. I fixed it and attached the modified patch to this email. On Thu, Nov 17, 2022 at 9:05 PM Thom Brown wrote: > No issues with applying. Created 1024 partitions, each of which is > partitioned into 64 partitions. > > I'm getting a generic planning time of 1415ms. Is that considered > reasonable in this situation? Bear in mind that the planning time > prior to this patch was 282311ms, so pretty much a 200x speedup. Thank you for testing the patch with an actual query. This speedup is very impressive. When I used an original query with 1024 partitions, its planning time was about 200ms. Given that each partition is also partitioned in your workload, I think the result of 1415ms is reasonable. -- Best regards, Yuya Watari v9-0001-Apply-eclass_member_speedup_v3.patch.patch Description: Binary data v9-0002-Revert-one-of-the-optimizations.patch Description: Binary data v9-0003-Use-conventional-algorithm-for-smaller-size-cases.patch Description: Binary data v9-0004-Fix-incorrect-assertion.patch Description: Binary data
Re: [PoC] Reducing planning time when tables have many partitions
Hello, I noticed that the previous patch does not apply to the current HEAD. I attached the rebased version to this email. -- Best regards, Yuya Watari v8-0001-Apply-eclass_member_speedup_v3.patch.patch Description: Binary data v8-0002-Revert-one-of-the-optimizations.patch Description: Binary data v8-0003-Use-conventional-algorithm-for-smaller-size-cases.patch Description: Binary data
Re: [PoC] Reducing planning time when tables have many partitions
Hello,

On Wed, Sep 21, 2022 at 6:43 PM Yuya Watari wrote:
> 1.1. Revert one of our optimizations (v5)
>
> As I mentioned in the comment in
> v[5|6|7]-0002-Revert-one-of-the-optimizations.patch, I reverted one of
> our optimizations. This code tries to find EquivalenceMembers that do
> not satisfy the bms_overlap condition. We encounter such members early
> in the loop, so the linear search is enough, and our optimization is
> too excessive here. As a result of experiments, I found this
> optimization was a bottleneck, so I reverted it.

In the previous mail, I proposed a revert of one excessive optimization. In addition, I found a new bottleneck and attached a new version of the patch solving it to this email.

The new bottleneck exists in the select_outer_pathkeys_for_merge() function. At the end of this function, we count EquivalenceMembers that satisfy a specific condition. To count them, we have used Bitmapset operations. Through experiments, I concluded that this optimization is effective for larger cases but leads to some degradation for smaller numbers of partitions. The new patch switches between two algorithms depending on the problem size.

1. Experimental result

1.1. Join Order Benchmark

As in the previous email, I used the Join Order Benchmark to evaluate the patches' performance. The correspondence between each version and its patches is as follows.

v3: v8-0001-*.patch
v5 (v3 with revert): v8-0001-*.patch + v8-0002-*.patch
v8 (v5 with revert): v8-0001-*.patch + v8-0002-*.patch + v8-0003-*.patch

I show the speedup of each method compared with the master branch in Table 1. When the number of partitions is 1, performance degradation is kept to 1.1% in v8, while it is 4.2% and 1.8% in v3 and v5, respectively. This result indicates that the newly introduced revert is effective.
Table 1: Speedup of Join Order Benchmark (higher is better)
(n = the number of partitions)
----------------------------------------------------------
   n |     v3 | v5 (v3 with revert) | v8 (v5 with revert)
----------------------------------------------------------
   2 |  95.8% |               98.2% |               98.9%
   4 |  97.2% |               99.7% |               99.3%
   8 | 101.4% |              102.5% |              103.4%
  16 | 108.7% |              111.4% |              110.2%
  32 | 127.1% |              127.6% |              128.8%
  64 | 169.5% |              172.1% |              172.4%
 128 | 330.1% |              335.2% |              332.3%
 256 | 815.1% |              826.4% |              821.8%
----------------------------------------------------------

1.2. pgbench

The following table describes the result of pgbench. The v5 and v8 patches performed clearly better than the v3 patch. The difference between v5 and v8 is not so significant, but v8's performance is close to the master branch.

Table 2: The result of pgbench (tps)
-----------------------------------------------------------------
   n | Master |   v3 | v5 (v3 with revert) | v8 (v5 with revert)
-----------------------------------------------------------------
   1 |   7550 | 7422 |                7474 |                7521
   2 |   7594 | 7381 |                7536 |                7529
   4 |   7518 | 7362 |                7461 |                7524
   8 |   7459 | 7340 |                7424 |                7460
-----------------------------------------------------------------
 Avg |   7531 | 7377 |                7474 |                7509
-----------------------------------------------------------------

2. Conclusion and future works

The revert in the v8-0003-*.patch is effective in preventing performance degradation for smaller numbers of partitions. However, I don't think what I have done in the patch is the best or ideal solution. As I mentioned in the comments in the patch, switching between two algorithms may be ugly because it introduces code duplication. We need a wiser solution to this problem.

--
Best regards,
Yuya Watari

v8-0001-Apply-eclass_member_speedup_v3.patch.patch
Description: Binary data
v8-0002-Revert-one-of-the-optimizations.patch
Description: Binary data
v8-0003-Use-conventional-algorithm-for-smaller-size-cases.patch
Description: Binary data
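As a rough illustration of the idea behind v8-0003 — fall back to the conventional linear algorithm below some size and use Bitmapset-style whole-word counting above it — here is a standalone sketch. The threshold value and the single 64-bit mask representation are mine, not the patch's:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cutoff; the real patch picks the switch point empirically. */
#define SMALL_PROBLEM_THRESHOLD 16

/* bit i set => member i satisfies the condition (simplified to one word) */
static uint64_t member_mask;

static int
count_matching_members(int nmembers)	/* assumes nmembers < 64 */
{
	uint64_t	relevant = member_mask & ((UINT64_C(1) << nmembers) - 1);

	if (nmembers <= SMALL_PROBLEM_THRESHOLD)
	{
		/* conventional algorithm: a plain loop is cheapest when small */
		int			count = 0;

		for (int i = 0; i < nmembers; i++)
			if (relevant & (UINT64_C(1) << i))
				count++;
		return count;
	}
	/* Bitmapset-style algorithm: whole-word popcount (GCC/Clang builtin) */
	return __builtin_popcountll(relevant);
}
```

The duplication between the two branches is exactly the ugliness mentioned in the conclusion above: both compute the same answer, differing only in which input sizes they are cheap for.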
Re: [PoC] Reducing planning time when tables have many partitions
Dear David, On Fri, Aug 26, 2022 at 12:18 PM David Rowley wrote: > How about instead of doing this caching like this, why don't we code > up some iterators that we can loop over to fetch the required EMs. Thank you very much for your quick reply and for sharing your idea with code. I also think introducing EquivalenceMemberIterator is one good alternative solution. I will try to implement and test it. Thank you again for helping me. -- Best regards, Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Dear David, Thank you for sharing your new idea. I agree that introducing a Bitmapset field may solve this problem. I will try this approach in addition to previous ones. Thank you again for helping me. -- Best regards, Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Dear Andrey Lepikhov, Thank you for replying and being a reviewer for this patch. I really appreciate it. > Are you still working on this patch? Yes, I’m working on improving this patch. It is not easy to address the problems that this patch has, but I’m hoping to send a new version of it in a few weeks. -- Best regards, Yuya Watari
Re: [PoC] Reducing planning time when tables have many partitions
Dear Tom,

Thank you for replying to my email.

On Mon, Jul 4, 2022 at 6:28 AM Tom Lane wrote:
> I'm not real thrilled with trying to throw hashtables at the problem,
> though. As David noted, they'd be counterproductive for simple
> queries.

As you said, my approach that utilizes hash tables has some overhead, leading to degradation in query planning time. I measured the degradation in a brief experiment using the simple query shown below.

===
SELECT students.name, gpas.gpa AS gpa, sum(scores.score) AS total_score
FROM students, scores, gpas
WHERE students.id = scores.student_id
  AND students.id = gpas.student_id
GROUP BY students.id, gpas.student_id;
===

Here, the students, scores, and gpas tables have no partitions, i.e., they are regular tables. Therefore, my techniques do not work for this query and instead may lead to some regression. I issued this query 1 million times and measured the planning times. The attached figure describes the distribution of the planning times. The figure indicates that my patch has no severe negative impact on planning performance. However, there seems to be a slight degradation. I show the mean and median of the planning times below. With my patch, planning became 0.002-0.004 milliseconds slower. We have to deal with this problem, but reducing time complexity while keeping the degradation at zero is significantly challenging.

Planning time (ms) |  Mean | Median
-----------------------------------
Master             | 0.682 |  0.674
Patched            | 0.686 |  0.676
-----------------------------------
Degradation        | 0.004 |  0.002

Of course, the attached result is just an example. Significant regression might occur in other types of queries.

> For the bms_equal class of lookups, I wonder if we could get anywhere
> by adding an additional List field to every RelOptInfo that chains
> all EquivalenceMembers that match that RelOptInfo's relids.
> The trick here would be to figure out when to build those lists.
> The simple answer would be to do it lazily on-demand, but that > would mean a separate scan of all the EquivalenceMembers for each > RelOptInfo; I wonder if there's a way to do better? > > Perhaps the bms_is_subset class could be handled in a similar > way, ie do a one-time pass to make a List of all EquivalenceMembers > that use a RelOptInfo. Thank you for giving your idea. I will try to polish up my algorithm based on your suggestion. -- Best regards, Yuya Watari
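Tom's per-RelOptInfo chaining idea could look roughly like the following standalone sketch, built lazily on first demand. Everything here is hypothetical: relids are collapsed to a 64-bit mask, and the chain lives directly in the rel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Member
{
	uint64_t	em_relids;		/* simplified Relids: one bit per relation */
	struct Member *next_for_rel;	/* chain of members matching one rel */
} Member;

typedef struct
{
	uint64_t	relids;
	Member	   *eq_members;		/* lazily built chain, NULL until built */
	int			built;
} RelOptInfoSketch;

/*
 * One pass over all members, done on demand the first time this relation
 * asks for its matching members; later calls just return the chain.
 */
static Member *
get_members_for_rel(RelOptInfoSketch *rel, Member *all, int nall)
{
	if (!rel->built)
	{
		for (int i = 0; i < nall; i++)
			if (all[i].em_relids == rel->relids)	/* bms_equal analogue */
			{
				all[i].next_for_rel = rel->eq_members;
				rel->eq_members = &all[i];
			}
		rel->built = 1;
	}
	return rel->eq_members;
}
```

The "when to build" question Tom raises is visible here: the lazy build still costs one full scan per relation, which is what a smarter scheme would need to avoid.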
Re: [PoC] Reducing planning time when tables have many partitions
Dear David,

Thank you for your comments on my patch. I really apologize for my late response.

On Thu, Mar 24, 2022 at 11:03 AM David Rowley wrote:
> I think a better way to solve this would be just to have a single hash
> table over all EquivalenceClasses that allows fast lookups of
> EquivalenceMember->em_expr. I think there's no reason that a given
> Expr should appear in more than one non-merged EquivalenceClass. The
> EquivalenceClass a given Expr belongs to would need to be updated
> during the merge process.

Thank you for your idea. However, I think building a hash table whose key is EquivalenceMember->em_expr does not work for this case. What I am trying to optimize in this patch is the following code.

=
EquivalenceClass *ec = /* given */;
EquivalenceMember *em;
ListCell   *lc;

foreach(lc, ec->ec_members)
{
    em = (EquivalenceMember *) lfirst(lc);

    /* predicate is bms_equal or bms_is_subset, etc */
    if (!predicate(em))
        continue;

    /* The predicate is satisfied */
    do something...;
}
=

From my observation, the predicates above are false in most cases, so the subsequent processing is not executed. My optimization is based on this observation and utilizes hash tables to eliminate calls of the predicates. If the predicate were "em->em_expr == something", a hash table whose key is em_expr would be effective. However, the actual predicates are not of this type but the following:

// Find EquivalenceMembers whose relids is equal to the given relids
(1) bms_equal(em->em_relids, relids)

// Find EquivalenceMembers whose relids is a subset of the given relids
(2) bms_is_subset(em->em_relids, relids)

Since these predicates perform a match search on em_relids rather than em_expr, we need to build a hash table with em_relids as the key. With such a table, we can drastically reduce the planning time for pattern (1). Besides, by enumerating all subsets of relids, pattern (2) can also be optimized. The detailed algorithm is described in the first email. I show an example of pattern (1) below.
The next code is in src/backend/optimizer/path/equivclass.c. As can be seen from this code, the foreach loop tries to find an EquivalenceMember whose cur_em->em_relids is equal to rel->relids. If found, subsequent processing will be performed.

== Before patched ==
List *
generate_implied_equalities_for_column(PlannerInfo *root,
                                       RelOptInfo *rel,
                                       ec_matches_callback_type callback,
                                       void *callback_arg,
                                       Relids prohibited_rels)
{
    ...
    EquivalenceClass *cur_ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
    EquivalenceMember *cur_em;
    ListCell   *lc2;

    cur_em = NULL;
    foreach(lc2, cur_ec->ec_members)
    {
        cur_em = (EquivalenceMember *) lfirst(lc2);
        if (bms_equal(cur_em->em_relids, rel->relids) &&
            callback(root, rel, cur_ec, cur_em, callback_arg))
            break;
        cur_em = NULL;
    }

    if (!cur_em)
        continue;
    ...
}
===

My patch modifies this code as follows. The em_foreach_relids_equals is a newly defined macro that finds EquivalenceMembers satisfying the bms_equal condition. The macro looks up a hash table using rel->relids as a key. This type of optimization cannot be achieved without using hash tables whose key is em->em_relids.

== After patched ==
List *
generate_implied_equalities_for_column(PlannerInfo *root,
                                       RelOptInfo *rel,
                                       ec_matches_callback_type callback,
                                       void *callback_arg,
                                       Relids prohibited_rels)
{
    ...
    EquivalenceClass *cur_ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
    EquivalenceMember *cur_em;
    EquivalenceMember *other_em;

    cur_em = NULL;
    em_foreach_relids_equals(cur_em, cur_ec, rel->relids)
    {
        Assert(bms_equal(cur_em->em_relids, rel->relids));
        if (callback(root, rel, cur_ec, cur_em, callback_arg))
            break;
        cur_em = NULL;
    }

    if (!cur_em)
        continue;
    ...
}
===

> We might not want to build the hash table for all queries.

I agree with you. Building a lot of hash tables will consume much memory. My idea for this problem is to let the hash table's key be a pair of EquivalenceClass and Relids. However, this approach may lead to increased lookup time for the hash table.
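To make the relids-keyed lookup concrete, here is a self-contained sketch of the idea. The open-addressing table, the hash function, and all names are mine and far simpler than the patch's real data structures (relids are again a 64-bit mask, and only the first matching member is indexed):

```c
#include <assert.h>
#include <stdint.h>

#define TABSIZE 64				/* power of two; hypothetical fixed size */

typedef struct
{
	uint64_t	key;			/* simplified relids */
	int			member;			/* index of the matching member */
	int			used;
} Slot;

static Slot tab[TABSIZE];

static unsigned
hash_relids(uint64_t relids)
{
	/* Fibonacci-style multiplicative hash, top bits as the bucket */
	return (unsigned) ((relids * UINT64_C(0x9E3779B97F4A7C15)) >> 58);
}

static void
index_member(uint64_t relids, int member)
{
	unsigned	h = hash_relids(relids);

	while (tab[h].used && tab[h].key != relids)
		h = (h + 1) & (TABSIZE - 1);	/* linear probing */
	if (!tab[h].used)
	{
		tab[h].key = relids;
		tab[h].member = member;
		tab[h].used = 1;
	}
}

/* Replaces the linear "foreach + bms_equal" scan: O(1) expected. */
static int
find_member_by_relids(uint64_t relids)
{
	unsigned	h = hash_relids(relids);

	while (tab[h].used)
	{
		if (tab[h].key == relids)
			return tab[h].member;
		h = (h + 1) & (TABSIZE - 1);
	}
	return -1;					/* no member with exactly these relids */
}
```

This only covers the bms_equal pattern (1); the bms_is_subset pattern (2) needs the subset-enumeration scheme described in the first email.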
==

I noticed that the previous patch does not work with the current HEAD. I attached the modified one to this email. Additionally, I added my patch to the current commit fest [1].

[1] https://commitfest.postgresql.org/38/3701/

--
Best reg
Re: Performance Evaluation of Result Cache by using TPC-DS
Hello David, Thank you for your reply. > That's certainly one side of it. On the other side, it's pretty > important to also note that in 4 of 23 queries the result cache plan > executed faster but the planner costed it as more expensive. > > I'm not saying the costing is perfect, but what I am saying is, as you > noted above, in 5 of 23 queries the result cache was cheaper and > slower, and, as I just noted, in 4 of 23 queries, result cache was > more expensive and faster. We know that costing is never going to be > a perfect representation of what the execution time will be However, > in these examples, we've just happened to get quite a good balance. If > we add a penalty to result cache then it'll just subtract from one > problem group and add to the other. > > Overall, in my tests execution was 1.15% faster with result cache > enabled than it was without. Thank you for your analysis. I agree with your opinion. > I think it is more likely to be concerns like that one which would > cause us to default enable_resultcache to off. I am not sure whether this kind of degradation is common, but setting default behavior to off is one of the realistic solutions. Best regards, Yuya Watari
Re: Performance Evaluation of Result Cache by using TPC-DS
Hello David,

Thank you for running experiments on your machine, and I really appreciate your deep analysis. Your results are very interesting. In 5 queries, the result cache is cheaper but slower. Especially in query 88, although the cost with the result cache is cheaper, there is a 34.23% degradation in query execution time. This is a big regression.

> What can be done?
> ===
>
> I'm not quite sure. The biggest problem is add_path's fuzziness. I
> could go and add some penalty cost to Result Cache paths so that
> they're picked less often. If I make that penalty more than 1% of the
> cost, then that should get around add_path rejecting the other join
> method that is not fuzzily good enough. Adding some sort of penalty
> might also help the 5 of 23 queries that were cheaper and slower than
> the alternative.

Based on your idea, I have implemented a penalty for the cost of the result cache. I attached the patch to this e-mail. Please note that this patch is experimental, so it lacks comments, documentation, tests, etc.

This patch adds a new GUC, resultcache_cost_factor. The cost of the result cache is multiplied by this factor. If the factor is greater than 1, we impose a penalty on the result cache. The cost calculation has been modified as follows.

=
@@ -2541,6 +2542,13 @@ cost_resultcache_rescan(PlannerInfo *root, ResultCachePath *rcpath,
 	 */
 	startup_cost += cpu_tuple_cost;
 
+	/*
+	 * We multiply the costs by resultcache_cost_factor to control the
+	 * aggressiveness of result cache.
+	 */
+	startup_cost *= resultcache_cost_factor;
+	total_cost *= resultcache_cost_factor;
=

=
@@ -1618,9 +1618,14 @@ create_resultcache_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	 * Add a small additional charge for caching the first entry. All the
 	 * harder calculations for rescans are performed in
 	 * cost_resultcache_rescan().
+	 *
+	 * We multiply the costs by resultcache_cost_factor to control the
+	 * aggressiveness of result cache.
 	 */
-	pathnode->path.startup_cost = subpath->startup_cost + cpu_tuple_cost;
-	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
+	pathnode->path.startup_cost =
+		(subpath->startup_cost + cpu_tuple_cost) * resultcache_cost_factor;
+	pathnode->path.total_cost =
+		(subpath->total_cost + cpu_tuple_cost) * resultcache_cost_factor;
 	pathnode->path.rows = subpath->rows;
 
 	return pathnode;
=

As this factor increases, the result cache becomes less and less likely to be adopted. I conducted an experiment to clarify the threshold of the factor. I ran the EXPLAIN (not EXPLAIN ANALYZE) command with different factors. The threshold is defined as the smallest factor at which the result cache no longer appears in the query plan; with any factor above the threshold, the planner does not select the result cache. This experiment was conducted on my machine, so the results may differ from those on your machine. I attached the thresholds as Excel and PDF files.

The thresholds vary from 1.1 to 9.6. The threshold of 9.6 indicates that a penalty of 860% must be imposed to avoid the result cache. The Excel and PDF files also contain a chart showing the relationship between speedup ratio and threshold. Unfortunately, there is no clear correlation. If we set the factor to 5, we can avoid the 11% degradation of query 62 because that query's threshold is 4.7. However, we then cannot gain the 62% speedup of query 61. Therefore, this factor does not work well and should be reconsidered.

In this patch, I impose a penalty on the result cache node. An alternative way is to increase the cost of a nested loop that contains a result cache. If so, there is no need to impose a penalty of 860%; a penalty of about 1% is enough. This approach of introducing resultcache_cost_factor is not an essential solution. However, it is reasonable to offer a way of controlling the aggressiveness of the result cache. Again, this patch is experimental, so it needs feedback and modifications.
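Condensed to its essence, the penalty applied by the two hunks above is just a multiplication of both cost fields. The helper function below is my paraphrase, not code from the patch:

```c
#include <assert.h>

/* Hypothetical GUC from the experimental patch; 1.0 means no penalty. */
static double resultcache_cost_factor = 1.0;

/* Scale both cost fields of a path, mirroring the two diff hunks. */
static void
apply_resultcache_penalty(double *startup_cost, double *total_cost)
{
	*startup_cost *= resultcache_cost_factor;
	*total_cost *= resultcache_cost_factor;
}
```

Because the penalty is multiplicative, a factor of 5 on a path that is only 4.7x away from losing to the alternative flips the plan, which is exactly how the per-query thresholds in the attached files were measured.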
Best regards, Yuya Watari experimental-resultcache-factor.patch Description: Binary data experimental-result.xlsx Description: MS-Excel 2007 spreadsheet experimental-result.pdf Description: Adobe PDF document
Re: Performance Evaluation of Result Cache by using TPC-DS
Hello David, Thank you for your reply. > Thanks for running that again. I see from the EXPLAIN ANALYZE output > that the planner did cost the Result Cache plan slightly more > expensive than the Hash Join plan. It's likely that add_path() did > not consider the Hash Join plan to be worth keeping because it was not > more than 1% better than the Result Cache plan. STD_FUZZ_FACTOR is set > so new paths need to be at least 1% better than existing paths for > them to be kept. That's pretty unfortunate and that alone does not > mean the costs are incorrect. It would be good to know if that's the > case for the other queries too. Thanks for your analysis. I understood why HashJoin was not selected in this query plan. > To test that, I've set up TPC-DS locally, however, it would be good if > you could send me the list of indexes that you've created. I see the > tool from the transaction processing council for TPC-DS only comes > with the list of tables. > > Can you share the output of: I listed all indexes on my machine by executing your query. I attached the result to this e-mail. I hope it will help you. 
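For reference, the add_path() fuzziness David describes works roughly like this: a new path only displaces an existing one if it wins by more than the fuzz factor (STD_FUZZ_FACTOR is 1.01 in PostgreSQL's pathnode.c; the helper below is my simplification of the comparison):

```c
#include <assert.h>

#define STD_FUZZ_FACTOR 1.01	/* paths must differ by >1% to matter */

/* A new path is "fuzzily better" only if it is more than 1% cheaper. */
static int
fuzzily_better(double new_cost, double old_cost)
{
	return new_cost * STD_FUZZ_FACTOR < old_cost;
}
```

So a Hash Join that is only a fraction of a percent cheaper than an already-kept Result Cache plan is not "fuzzily better" and can be discarded, which is the behavior observed here.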
Best regards,
Yuya Watari

pg_get_indexdef
CREATE UNIQUE INDEX customer_address_pkey ON public.customer_address USING btree (ca_address_sk)
CREATE UNIQUE INDEX customer_demographics_pkey ON public.customer_demographics USING btree (cd_demo_sk)
CREATE UNIQUE INDEX date_dim_pkey ON public.date_dim USING btree (d_date_sk)
CREATE UNIQUE INDEX ship_mode_pkey ON public.ship_mode USING btree (sm_ship_mode_sk)
CREATE UNIQUE INDEX time_dim_pkey ON public.time_dim USING btree (t_time_sk)
CREATE UNIQUE INDEX reason_pkey ON public.reason USING btree (r_reason_sk)
CREATE UNIQUE INDEX income_band_pkey ON public.income_band USING btree (ib_income_band_sk)
CREATE UNIQUE INDEX item_pkey ON public.item USING btree (i_item_sk)
CREATE UNIQUE INDEX store_pkey ON public.store USING btree (s_store_sk)
CREATE INDEX store_s_closed_date_sk_idx ON public.store USING btree (s_closed_date_sk)
CREATE INDEX call_center_cc_closed_date_sk_idx ON public.call_center USING btree (cc_closed_date_sk)
CREATE INDEX call_center_cc_open_date_sk_idx ON public.call_center USING btree (cc_open_date_sk)
CREATE UNIQUE INDEX call_center_pkey ON public.call_center USING btree (cc_call_center_sk)
CREATE INDEX customer_c_current_cdemo_sk_idx ON public.customer USING btree (c_current_cdemo_sk)
CREATE UNIQUE INDEX customer_pkey ON public.customer USING btree (c_customer_sk)
CREATE INDEX customer_c_first_shipto_date_sk_idx ON public.customer USING btree (c_first_shipto_date_sk)
CREATE INDEX customer_c_first_sales_date_sk_idx ON public.customer USING btree (c_first_sales_date_sk)
CREATE INDEX customer_c_current_hdemo_sk_idx ON public.customer USING btree (c_current_hdemo_sk)
CREATE INDEX customer_c_current_addr_sk_idx ON public.customer USING btree (c_current_addr_sk)
CREATE INDEX store_returns_sr_store_sk_idx ON public.store_returns USING btree (sr_store_sk)
CREATE INDEX store_returns_sr_return_time_sk_idx ON public.store_returns USING btree (sr_return_time_sk)
CREATE INDEX store_returns_sr_returned_date_sk_idx ON public.store_returns USING btree (sr_returned_date_sk)
CREATE INDEX store_returns_sr_reason_sk_idx ON public.store_returns USING btree (sr_reason_sk)
CREATE INDEX store_returns_sr_item_sk_idx ON public.store_returns USING btree (sr_item_sk)
CREATE INDEX store_returns_sr_hdemo_sk_idx ON public.store_returns USING btree (sr_hdemo_sk)
CREATE INDEX store_returns_sr_customer_sk_idx ON public.store_returns USING btree (sr_customer_sk)
CREATE INDEX store_returns_sr_cdemo_sk_idx ON public.store_returns USING btree (sr_cdemo_sk)
CREATE INDEX store_returns_sr_addr_sk_idx ON public.store_returns USING btree (sr_addr_sk)
CREATE UNIQUE INDEX store_returns_pkey ON public.store_returns USING btree (sr_item_sk, sr_ticket_number)
CREATE UNIQUE INDEX household_demographics_pkey ON public.household_demographics USING btree (hd_demo_sk)
CREATE INDEX household_demographics_hd_income_band_sk_idx ON public.household_demographics USING btree (hd_income_band_sk)
CREATE UNIQUE INDEX promotion_pkey ON public.promotion USING btree (p_promo_sk)
CREATE INDEX promotion_p_end_date_sk_idx ON public.promotion USING btree (p_end_date_sk)
CREATE INDEX promotion_p_start_date_sk_idx ON public.promotion USING btree (p_start_date_sk)
CREATE INDEX promotion_p_item_sk_idx ON public.promotion USING btree (p_item_sk)
CREATE UNIQUE INDEX catalog_page_pkey ON public.catalog_page USING btree (cp_catalog_page_sk)
CREATE INDEX catalog_page_cp_end_date_sk_idx ON public.catalog_page USING btree (cp_end_date_sk)
CREATE INDEX catalog_page_cp_start_da
Re: Performance Evaluation of Result Cache by using TPC-DS
Hello David,

Thank you for your reply.

> Can you share if these times were to run EXPLAIN ANALYZE or if they
> were just the queries being executed normally?

These times were to run EXPLAIN ANALYZE. I executed each query twice, and the **average** execution time is shown in the table of my last e-mail. Therefore, the result in the table is not simply equal to that of the attached file. I'm sorry for the insufficient explanation.

> It would be really great if you could show the EXPLAIN (ANALYZE,
> TIMING OFF) for query 62. There's a chance that the slowdown comes
> from the additional EXPLAIN ANALYZE timing overhead with the Result
> Cache version.

I ran query 62 with "EXPLAIN (ANALYZE, TIMING OFF)" and normally. I attached these execution results to this e-mail. This time, I executed each query only once (not twice). The results are as follows.

Method                        | Execution time with result cache (s) | Execution time without result cache (s) | Speedup ratio
EXPLAIN (ANALYZE, TIMING ON)  | 67.161 | 59.615 | -12.66%
EXPLAIN (ANALYZE, TIMING OFF) | 66.142 | 60.660 |  -9.04%
Normal                        | 66.611 | 60.955 |  -9.28%

Although there is variation in the execution time, the speedup ratio is around -10%, so the result cache has a roughly 10% regression in query 62. The overhead of EXPLAIN ANALYZE and TIMING ON does not seem to be high.

Best regards,
Yuya Watari

On Tue, Apr 13, 2021 at 7:13 PM David Rowley wrote:
>
> On Tue, 13 Apr 2021 at 21:29, Yuya Watari wrote:
> > I used the TPC-DS scale factor 100 in the evaluation. I executed all
> > of the 99 queries in the TPC-DS, and the result cache worked in the 21
> > queries of them. However, some queries took too much time, so I
> > skipped their execution. I set work_mem to 256MB, and
> > max_parallel_workers_per_gather to 0.
>
> Many thanks for testing this.
>
> > As you can see from these results, many queries have a negative
> > speedup ratio, which means that there are negative impacts on the
> > query performance.
> > In query 62, the execution time increased by
> > 11.36%. I guess these regressions are due to the misestimation of the
> > cost in the planner. I attached the execution plan of query 62.
>
> Can you share if these times were to run EXPLAIN ANALYZE or if they
> were just the queries being executed normally?
>
> The times in the two files you attached do look very similar to the
> times in your table, so I suspect either TIMING ON is not that high an
> overhead on your machine, or the results are that of EXPLAIN ANALYZE.
>
> It would be really great if you could show the EXPLAIN (ANALYZE,
> TIMING OFF) for query 62. There's a chance that the slowdown comes
> from the additional EXPLAIN ANALYZE timing overhead with the Result
> Cache version.
>
> > The result cache is currently enabled by default. However, if this
> > kind of performance regression is common, we have to change its
> > default behavior.
>
> Yes, the feedback we get during the beta period will help drive that
> decision or if the costing needs to be adjusted.
>
> David

SET
                QUERY PLAN
Limit (cost=2816266.54..2816267.79 rows=100 width=110) (actual rows=100 loops=1)
  -> Sort (cost=2816266.54..2816267.38 rows=336 width=110) (actual rows=100 loops=1)
     Sort Key: (substr((warehouse.w_warehouse_name)::text, 1, 20)), ship_mode.sm_type, web_site.web_name
     Sort Method: top-N heapsort Memory: 49kB
     -> HashAggregate (cost=2816248.24..2816252.44 rows=336 width=110) (actual rows=360 loops=1)
        Group Key: substr((warehouse.w_warehouse_name)::text, 1, 20), ship_mode.sm_type, web_site.web_name
        Batches: 1 Memory Usage: 157kB
        -> Hash Join (cost=2510.74..2794150.54 rows=368295 width=78) (actual rows=14336926 loops=1)
           Hash Cond: (web_sales.ws_ship_mode_sk = ship_mode.sm_ship_mode_sk)
           -> Hash Join (cost=2509.29..2792056.55 rows=368356 width=36) (actual rows=14337390 loops=1)
              Hash Cond: (web_sales.ws_web_site_sk = web_site.web_site_sk)
              -> Hash Join (cost=2506.75..2790916.14 rows=368430 width=33) (actual rows=14338265 loops=1)
                 Hash Cond: (web_sales.ws_warehouse_sk = warehouse.w_warehouse_sk)
                 -> Hash Join (cost=2505.41..2789674.19 rows=368516 width=20) (actual rows=14340028 loops=1)
                    Hash Cond:
Performance Evaluation of Result Cache by using TPC-DS
Hello,

Recently, the result cache feature was committed to PostgreSQL. I tested its performance by executing TPC-DS. As a result, I found that there were some regressions in query performance.

I used the TPC-DS scale factor 100 in the evaluation. I executed all of the 99 queries in the TPC-DS, and the result cache worked in 21 of them. However, some queries took too much time, so I skipped their execution. I set work_mem to 256MB, and max_parallel_workers_per_gather to 0.

The evaluation results are as follows. A negative speedup ratio indicates that the result cache increased the execution time.

Query No | Execution time with result cache (s) | Execution time without result cache (s) | Speedup ratio
       7 |  142.1 |  142.2 |   0.03%
       8 |  144.0 |  142.8 |  -0.82%
      13 |  164.6 |  162.0 |  -1.65%
      27 |  138.9 |  138.7 |  -0.16%
      34 |  135.7 |  137.1 |   1.02%
      43 |  209.5 |  207.2 |  -1.10%
      48 |  181.5 |  170.7 |  -6.32%
      55 |  130.6 |  123.8 |  -5.48%
      61 |  0.014 |  0.037 |  62.06%
      62 |   66.7 |   59.9 | -11.36%
      68 |  131.3 |  127.2 |  -3.17%
      72 |  567.0 |  563.4 |  -0.65%
      73 |  130.1 |  129.8 |  -0.29%
      88 | 1044.5 | 1048.7 |   0.40%
      91 |    1.2 |    1.1 |  -7.93%
      96 |  132.2 |  131.7 |  -0.37%

As you can see from these results, many queries have a negative speedup ratio, which means that the result cache has a negative impact on their performance. In query 62, the execution time increased by 11.36%. I guess these regressions are due to misestimation of costs in the planner. I attached the execution plan of query 62.

The result cache is currently enabled by default. However, if this kind of performance regression is common, we have to change its default behavior.
Best regards,
Yuya Watari

                QUERY PLAN
--------------------------------------------------------------
Limit (cost=2816266.54..2816267.79 rows=100 width=110) (actual time=59613.454..59613.473 rows=100 loops=1)
  -> Sort (cost=2816266.54..2816267.38 rows=336 width=110) (actual time=59613.453..59613.464 rows=100 loops=1)
     Sort Key: (substr((warehouse.w_warehouse_name)::text, 1, 20)), ship_mode.sm_type, web_site.web_name
     Sort Method: top-N heapsort Memory: 49kB
     -> HashAggregate (cost=2816248.24..2816252.44 rows=336 width=110) (actual time=59612.622..59612.730 rows=360 loops=1)
        Group Key: substr((warehouse.w_warehouse_name)::text, 1, 20), ship_mode.sm_type, web_site.web_name
        Batches: 1 Memory Usage: 157kB
        -> Hash Join (cost=2510.74..2794150.54 rows=368295 width=78) (actual time=7.597..45818.495 rows=14336926 loops=1)
           Hash Cond: (web_sales.ws_ship_mode_sk = ship_mode.sm_ship_mode_sk)
           -> Hash Join (cost=2509.29..2792056.55 rows=368356 width=36) (actual time=7.571..38820.092 rows=14337390 loops=1)
              Hash Cond: (web_sales.ws_web_site_sk = web_site.web_site_sk)
              -> Hash Join (cost=2506.75..2790916.14 rows=368430 width=33) (actual time=7.554..35314.217 rows=14338265 loops=1)
                 Hash Cond: (web_sales.ws_warehouse_sk = warehouse.w_warehouse_sk)
                 -> Hash Join (cost=2505.41..2789674.19 rows=368516 width=20) (actual time=7.544..31214.782 rows=14340028 loops=1)
                    Hash Cond: (web_sales.ws_ship_date_sk = date_dim.d_date_sk)
                    -> Seq Scan on web_sales (cost=0.00..2598153.08 rows=72001808 width=20) (actual time=0.026..17405.391 rows=72001237 loops=1)
                    -> Hash (cost=2500.73..2500.73 rows=374 width=4) (actual time=7.505..7.506 rows=365 loops=1)
                       Buckets: 1024 Batches: 1 Memory Usage: 21kB
                       -> Seq Scan on date_dim (cost=0.00..2500.73 rows=374 width=4) (actual time=3.599..7.462 rows=365 loops=1)
                          Filter: ((d_month_seq >= 1212) AND (d_month_seq <= 1223))
                          Rows Removed by Filter: 72684
                 -> Hash (cost=1.15..1.15 rows=15 width=21) (actual time=0.007..0.007 rows=15 loops=1)
                    Buckets: 1024 Batches: 1 Memory Usage: 9kB
                    -> Seq Scan on warehouse (cost=0.00..1.15 rows=15 width=21) (actual time=0.003..0.004 rows=15 loops=1)
              -> Hash (cost=
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello Tom, Thank you for your comments. On Fri, Nov 8, 2019 at 1:30 AM Tom Lane wrote: > failure: not only did you fail to add any commentary about the new macros, > but you removed most of the commentary that had been in-line in the > existing usages. I apologize for the insufficient comments. I had to add more information about these macros. > I fixed those things and pushed it. Thank you very much for the commit! Best regards, Yuya Watari NTT Software Innovation Center watari.y...@gmail.com
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello Horiguchi-san,

On Thu, Nov 7, 2019 at 3:10 PM Kyotaro Horiguchi wrote:
> Mmm? See the bit in the patch cited below (v5).
>
> +	/* Range check */
> +	if (unlikely(!FLOAT8_FITS_IN_INT32(num)) || isnan(num))
>
> If the compiler doesn't do anything fancy, num is fed to an arithmetic
> comparison before checking whether it is NaN. That seems to have a chance
> of raising an exception.

Thank you for pointing it out. That was my mistake. I fixed it and attached
the patch.

Best regards,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

v6-keep-compiler-silence.patch
Description: Binary data
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello Tom, Thomas, and Andrew,

> Tom> That commit presumes that floats follow the IEEE bitwise
> Tom> representation, I think;
>
> Correct. (It notably does _not_ make any assumptions about how floating
> point arithmetic or comparisons work - all the computation is done in
> integers.)
>
> Tom> but it's a long way from there to assuming that float comparisons
> Tom> do something that is explicitly *not* promised by C99.
>
> I agree.

Thank you for your comments. I agree that we should not assume anything
that is not guaranteed by the language specification. The modified patch
(attached to the previous e-mail) checks for NaN explicitly where needed.

Best regards,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello Tom,

Thank you for replying. On Wed, Nov 6, 2019 at 12:04 AM Tom Lane wrote:
> Yuya Watari writes:
> > The added macro FLOAT8_FITS_IN_INT32() does not check NaN explicitly,
> > but it sufficiently handles the case.
>
> Really? I don't think anything is guaranteed about how a NaN will
> compare when using C's non-NaN-aware comparison operators.
>
> My thought about this was to annotate the macros with a reminder
> to also check for NaN if there's any possibility that the value
> is NaN.

I agree with your opinion. Thank you for pointing it out.

If the platform follows the IEEE 754 standard, all ordered comparisons
(everything except "not equals") between NaN and other floating-point
values evaluate to false [1]. In that case, the proposed
FLOAT8_FITS_IN_INT32() macro handles NaN correctly.

[1] https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN

However, this behavior depends on the platform architecture. As you said,
C does not always follow IEEE 754, so I think adding an explicit NaN check
is necessary. I modified the patch and attached it.

Best regards,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

v5-keep-compiler-silence.patch
Description: Binary data
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello Tom and Horiguchi-san,

On Tue, Nov 5, 2019 at 1:59 PM Kyotaro Horiguchi wrote:
> At Mon, 04 Nov 2019 12:53:48 -0500, Tom Lane wrote in
> > I do concur with creating a macro that encapsulates a correct version
> > of this test, maybe like
> >
> > #define DOUBLE_FITS_IN_INT64(num) \
> > 	((num) >= (double) PG_INT64_MIN && \
> > 	 (num) < -((double) PG_INT64_MIN))

Thank you for your comments. The proposed macro DOUBLE_FITS_IN_INT64 is a
good and simple way to check for overflow. Accordingly, I revised the
patch, which now includes regression tests. In the patch, I additionally
modified other occurrences as follows.

=
+#define FLOAT8_FITS_IN_INT32(num) \
+	((num) >= (float8) PG_INT32_MIN && (num) < -((float8) PG_INT32_MIN))
=
-	if (unlikely(num < (float8) PG_INT32_MIN ||
-				 num >= -((float8) PG_INT32_MIN) ||
-				 isnan(num)))
+	/* Range check */
+	if (unlikely(!FLOAT8_FITS_IN_INT32(num)))
=

The added macro FLOAT8_FITS_IN_INT32() does not check NaN explicitly, but
it sufficiently handles the case.

Best regards,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

v4-keep-compiler-silence.patch
Description: Binary data
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Horiguchi-san,

On Tue, Oct 1, 2019 at 3:41 PM Kyotaro Horiguchi wrote:
> I found a trick seems workable generically (*1). (2.0 *
> (PG_INT64_MAX/2 + 1)) will generate the value next to the
> PG_INT64_MAX based on some assumptions
> (*1). IS_DOUBLE_SAFE_IN_INT64() below would be able to check if
> the value can be converted into int64 safely or not.

Thanks for sharing a nice way of checking overflow. I tested your
IS_DOUBLE_SAFE_IN_INT64() macro in my environment by the simple code
(attached to this email) and confirmed that it appropriately handled the
overflow. However, further consideration is needed for different
architectures.

I attached the modified patch. In the patch, I placed the macro in
"src/include/c.h", but this may not be a good choice because c.h is widely
included from a lot of files. Do you have any good ideas about its
placement?

Thanks,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include <float.h>

#define PG_INT64_MAX INT64_MAX
#define PG_INT64_MIN INT64_MIN

/*
 * Check if a double value can be casted into int64.
 *
 * This macro is assuming that FLT_RADIX == 2 so that the 2.0 trick works,
 * PG_INT64_MAX is so below DBL_MAX that the doubled value can be represented
 * in double and DBL_MANT_DIG is equal or smaller than DBL_MAX_EXP so that
 * ceil() returns expected result.
 */
#define MIN_DOUBLE_OVER_INT64_MAX (2.0 * (double) (PG_INT64_MAX / 2 + 1))
#define MAX_DOUBLE_UNDER_INT64_MIN (2.0 * (double) (PG_INT64_MIN / 2 - 1))

#if -PG_INT64_MAX != PG_INT64_MIN
#define IS_DOUBLE_SAFE_IN_INT64(x) \
	((x) < MIN_DOUBLE_OVER_INT64_MAX && ceil(x) >= (double) PG_INT64_MIN)
#else
#define IS_DOUBLE_SAFE_IN_INT64(x) \
	((x) < MIN_DOUBLE_OVER_INT64_MAX && (x) > MAX_DOUBLE_UNDER_INT64_MIN)
#endif

int main()
{
	double values[] = {
		-pow(2, 63) - 1025,
		-pow(2, 63),
		pow(2, 63) - 1024,
		pow(2, 63),
		INFINITY, -INFINITY, NAN
	};

	for (int i = 0; i < sizeof(values) / sizeof(values[0]); i++)
	{
		double value = values[i];

		printf("value == %lf, Overflow = %s\n", value,
			   !IS_DOUBLE_SAFE_IN_INT64(value) ? "true" : "false");
	}
}

v3-keep-compiler-silence.patch
Description: Binary data
Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello,

I add further information. This issue also has a problem with *overflow
checking*. The original code is as follows.

src/backend/utils/adt/timestamp.c:3222
-
	if (result_double > PG_INT64_MAX || result_double < PG_INT64_MIN)
		ereport(ERROR,
				(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
				 errmsg("interval out of range")));
	result->time = (int64) result_double;
-

Here, the code checks whether "result_double" fits in the 64-bit integer
range before casting it. However, as I mentioned in the previous email,
PG_INT64_MAX is converted to double and the value becomes
9223372036854775808 due to the lack of precision. Therefore, the above
check is identical to "result_double > 9223372036854775808.0". This check
does not cover the case where result_double is exactly equal to
9223372036854775808. In that case, "(int64) result_double" will be
-9223372036854775808, which is wrong.

The next program confirms what I explained.

===
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	double value = (double) INT64_MAX;

	printf("INT64_MAX = %ld\n", INT64_MAX);
	printf("value = %lf\n", value);
	printf("(value > (double) INT64_MAX) == %d\n", value > (double) INT64_MAX);
	printf("(long int) value == %ld\n", (long int) value);
}
===
Output:
INT64_MAX = 9223372036854775807
value = 9223372036854775808.00
(value > (double) INT64_MAX) == 0
(long int) value == -9223372036854775808
===

I think the check should be "result_double >= (double) PG_INT64_MAX", that
is, we have to use >= rather than >. I attached the modified patch.

Thanks,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

On Fri, Sep 27, 2019 at 12:00 Yuya Watari wrote:
>
> Hello,
>
> I found the problem that clang compiler introduces warnings when building
> PostgreSQL. Attached patch fixes it.
>
> ===
> Compiler version
> ===
> clang version 10.0.0-svn372772-1~exp1+0~20190924181208.2504~1.gbpb209ff
> (trunk)
>
> Older versions of clang may not generate this warning.
>
> ===
> Warning
> ===
>
> timestamp.c:3236:22: warning: implicit conversion from 'long' to 'double'
> changes value from 9223372036854775807 to 9223372036854775808
> [-Wimplicit-int-float-conversion]
>         if (result_double > PG_INT64_MAX || result_double < PG_INT64_MIN)
>                           ~ ^~~~~~~~~~~~
> ../../../../src/include/c.h:444:22: note: expanded from macro 'PG_INT64_MAX'
> #define PG_INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
>                     	^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../../src/include/c.h:381:25: note: expanded from macro 'INT64CONST'
> #define INT64CONST(x)  (x##L)
>                        ^~~~~
> <scratch space>:234:1: note: expanded from here
> 0x7FFFFFFFFFFFFFFFL
> ^~~~~~~~~~~~~~~~~~~
> 1 warning generated.
>
> pgbench.c:1657:30: warning: implicit conversion from 'long' to 'double'
> changes value from 9223372036854775807 to 9223372036854775808
> [-Wimplicit-int-float-conversion]
>         if (dval < PG_INT64_MIN || PG_INT64_MAX < dval)
>                                    ^~~~~~~~~~~~ ~
> ../../../src/include/c.h:444:22: note: expanded from macro 'PG_INT64_MAX'
> #define PG_INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
>                     	^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../src/include/c.h:381:25: note: expanded from macro 'INT64CONST'
> #define INT64CONST(x)  (x##L)
>                        ^~~~~
> <scratch space>:252:1: note: expanded from here
> 0x7FFFFFFFFFFFFFFFL
> ^~~~~~~~~~~~~~~~~~~
> 1 warning generated.
>
> ===
>
> This warning is due to implicit conversion from PG_INT64_MAX to double,
> which drops the precision as described in the warning. This drop is not a
> problem in this case, but we have to get rid of useless warnings. The
> attached patch casts PG_INT64_MAX explicitly.
>
> Thanks,
> Yuya Watari
> NTT Software Innovation Center
> watari.y...@gmail.com

v2-keep-compiler-silence.patch
Description: Binary data
Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )
Hello,

I found the problem that clang compiler introduces warnings when building
PostgreSQL. Attached patch fixes it.

===
Compiler version
===
clang version 10.0.0-svn372772-1~exp1+0~20190924181208.2504~1.gbpb209ff
(trunk)

Older versions of clang may not generate this warning.

===
Warning
===

timestamp.c:3236:22: warning: implicit conversion from 'long' to 'double'
changes value from 9223372036854775807 to 9223372036854775808
[-Wimplicit-int-float-conversion]
        if (result_double > PG_INT64_MAX || result_double < PG_INT64_MIN)
                          ~ ^~~~~~~~~~~~
../../../../src/include/c.h:444:22: note: expanded from macro 'PG_INT64_MAX'
#define PG_INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
                    	^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../../../src/include/c.h:381:25: note: expanded from macro 'INT64CONST'
#define INT64CONST(x)  (x##L)
                       ^~~~~
<scratch space>:234:1: note: expanded from here
0x7FFFFFFFFFFFFFFFL
^~~~~~~~~~~~~~~~~~~
1 warning generated.

pgbench.c:1657:30: warning: implicit conversion from 'long' to 'double'
changes value from 9223372036854775807 to 9223372036854775808
[-Wimplicit-int-float-conversion]
        if (dval < PG_INT64_MIN || PG_INT64_MAX < dval)
                                   ^~~~~~~~~~~~ ~
../../../src/include/c.h:444:22: note: expanded from macro 'PG_INT64_MAX'
#define PG_INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
                    	^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../../src/include/c.h:381:25: note: expanded from macro 'INT64CONST'
#define INT64CONST(x)  (x##L)
                       ^~~~~
<scratch space>:252:1: note: expanded from here
0x7FFFFFFFFFFFFFFFL
^~~~~~~~~~~~~~~~~~~
1 warning generated.

===

This warning is due to implicit conversion from PG_INT64_MAX to double,
which drops the precision as described in the warning. This drop is not a
problem in this case, but we have to get rid of useless warnings. The
attached patch casts PG_INT64_MAX explicitly.

Thanks,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

keep-compiler-silence.patch
Description: Binary data