Re: [HACKERS] Runtime Partition Pruning
On Tue, Oct 13, 2020 at 3:48 PM Amit Langote wrote:
> On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat wrote:
> > On Wed, Oct 7, 2020 at 7:00 PM Andy Fan wrote:
> > > I can understand Robert's idea now: he intended to resolve both the
> > > "initial-partition-prune" case and the "runtime partition prune" case,
> > > while I was always thinking about the former case (Amit reminded me
> > > about that at the beginning, but I am only waking up to it now..)
> > >
> > > With Robert's idea, I think we can do some hacking at
> > > create_append_path: we can figure out the pruneinfo (currently done at
> > > create_append_plan) at that stage, and then do a fix_append_path with
> > > that pruneinfo. We need to fix not only the cost of the AppendPath but
> > > also its rows. For example, given
> > > SELECT .. FROM t1, t2, p WHERE t1.a = p.partkey AND t1.b = t2.b, if the
> > > plan is:
> > >
> > > Nest Loop
> > >   Nest Loop
> > >     t1
> > >     Append (p)
> > >   t2
> > >
> > > then the row count of Append (p) will be very important.
> > >
> > > For this idea, how to estimate how many partitions/rows can be pruned
> > > is a key part. Robert said: "I was thinking, rather, that if we know
> > > for example that we're doing pruning on partition_column = $1, then we
> > > know that only one partition will match. That's probably a common case.
> > > If we've got partition_column > $1, we could assume that, say, 75% of
> > > the partitions would match. partition_column BETWEEN $1 and $2 is
> > > probably a bit more selective, so maybe we assume 50% of the partitions
> > > would match." I don't think we can say that 75% or 50% is a good
> > > number, but the key point may be that "partition_column = $1" is a
> > > common case, and for the other cases we probably don't make things
> > > worse.
> > >
> > > I think we need to do something here; any thoughts? Personally I like
> > > this idea now.
> >
> > Yes, I think we have to do something on those lines.
> > But I am wondering why our stats machinery is failing to do that
> > already. For equality I think it's straightforward. But even for other
> > operators we should use the statistics from the partitioned table to
> > estimate the selectivity and then adjust the number of scanned
> > partitions = (number of partitions * fraction of rows scanned). That
> > might be better than 50% or 75%.
>
> This is an interesting idea, that is, the idea of consulting the parent
> relation's stats to guess at the number of partitions that might be
> scanned.
>
> However, we don't currently consult an inheritance parent relation's
> stats at all during "base" relation planning, which is why you don't see
> the parent relation's statistics reflected in the row count and costs
> assigned to its (Append et al) paths. The hard-coded rule is to sum up
> the children's rows and their paths' costs; see cost_append().
>
> My thinking on this was that we'd just extend that hard-coded rule to
> take into account that run-time pruning, if applicable in a given plan,
> will cause some of those child paths to be discarded. Maybe Andy is
> headed in that direction?

Yes, that's what I am trying to do. The minimum code in my mind is:

create_append_path(...)
{
    double run_time_prune_est;

    if (enable_partition_prune && ...)
    {
        List *partition_prune_step_ops =
            cal_prune_step_ops(rel, partitioned_rels);
        run_time_prune_est =
            estimate_runtime_prune_ratio(partition_prune_step_ops);
    }

    /* adjust the rows/costs of the AppendPath based on run_time_prune_est */
}

The difference between '=' and '<' would be handled inside
estimate_runtime_prune_ratio. Since I haven't made the patch runnable yet,
some data structures may change, but the overall idea is as above.

--
Best Regards
Andy Fan
Re: [HACKERS] Runtime Partition Pruning
On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat wrote: > On Wed, Oct 7, 2020 at 7:00 PM Andy Fan wrote: > > I can understand Robert's idea now, he intended to resolve both the > > "initial-partition-prune" case and "runtime partition prune" case while I > > always think > > about the former case (Amit reminded me about that at the beginning, but I > > just > > wake up right now..) > > > > With the Robert's idea, I think we can do some hack at create_append_path, > > we can figure out the pruneinfo (it was done at create_append_plan now) at > > that > > stage, and then did a fix_append_path with such pruneinfo. We need to fix > > not > > only the cost of AppendPath, but also rows of AppendPath, For example: > > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the > > plan is: > > Nest Loop > >Nest Loop > > t1 > > Append (p) > >t2 > > > > Then the rows of Append (p) will be very important. > > > > For this idea, how to estimate how many partitions/rows can be pruned is a > > key > > part. Robert said "I was thinking, rather, that if we know for example that > > we've doing pruning on partition_column = $1, then we know that only > > one partition will match. That's probably a common case. If we've > > got partition_column > $1, we could assume that, say, 75% of the > > partitions would match. partition_column BETWEEN $1 and $2 is > > probably a bit more selective, so maybe we assume 50% of the > > partitions would match.". I think we can't say the 75% or 50% is a good > > number, but the keypoint may be "partition_column = $1" is a common > > case. And for the others case, we probably don't make it worse. > > > > I think we need to do something here, any thoughts? Personally I'm more > > like this idea now. > > Yes, I think we have to do something on those lines. But I am > wondering why our stats machinery is failing to do that already. For > equality I think it's straight forward. 
> But even for other operators we should use the statistics from the
> partitioned table to estimate the selectivity and then adjust the number
> of scanned partitions = (number of partitions * fraction of rows
> scanned). That might be better than 50% or 75%.

This is an interesting idea, that is, the idea of consulting the parent
relation's stats to guess at the number of partitions that might be
scanned.

However, we don't currently consult an inheritance parent relation's
stats at all during "base" relation planning, which is why you don't see
the parent relation's statistics reflected in the row count and costs
assigned to its (Append et al) paths. The hard-coded rule is to sum up
the children's rows and their paths' costs; see cost_append().

My thinking on this was that we'd just extend that hard-coded rule to
take into account that run-time pruning, if applicable in a given plan,
will cause some of those child paths to be discarded. Maybe Andy is
headed in that direction?

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: [HACKERS] Runtime Partition Pruning
On Mon, Oct 12, 2020 at 5:48 PM Ashutosh Bapat wrote:
> On Mon, Oct 12, 2020 at 7:59 AM Andy Fan wrote:
> >
> > Sorry for the late reply! Suppose we have a partition hierarchy defined
> > like this:
> >
> > p
> > - p1
> > - p2
> >
> > When you talk about "the statistics from the partitioned table", do you
> > mean the statistics from p or from p1/p2? I just confirmed there are no
> > statistics for p (at least pg_class.reltuples = -1), so I think you are
> > talking about p1/p2.
>
> I am talking about p when I say statistics from the partitioned table.
> I see that the pg_statistic row for p is well populated.
> pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
> set_rel_size() sets the number of rows in the partitioned table based
> on the rows in the individual unpruned partitions.

Glad to know that; thanks for this info!

> > Here we are talking about partkey = $1 or partkey = RunTimeValue, so
> > even though the value can hit only one partition, since we don't know
> > the exact value we treat all the partitions equally. So nothing looks
> > wrong with the partition-level estimation. However, when we cost the
> > Append path, we need to know that just one of them can be hit, and then
> > we need to do something there. Both AppendPath->rows and total_cost
> > should be adjusted (that doesn't happen now).
>
> I think in this case we can safely assume that only one partition will
> remain, so normalize the costs considering that only one partition will
> survive.

Exactly. What I am trying to do is fix this at create_append_path; do you
have different suggestions? About the partkey > $1 case: I think even if
we use the statistics at the partition level, it would be hard-coded as
well, since we don't know what value $1 is.

I have gone through the main part of the run-time partition pruning code,
and hope I can post a runnable patch soon. The main idea is to fix the
rows/costs at the create_append_path stage, so any suggestion in a
different direction would be very useful.

--
Best Regards
Andy Fan
Re: [HACKERS] Runtime Partition Pruning
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan wrote:
>
> Sorry for the late reply! Suppose we have a partition hierarchy defined
> like this:
>
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or from p1/p2? I just confirmed there are no
> statistics for p (at least pg_class.reltuples = -1), so I think you are
> talking about p1/p2.

I am talking about p when I say statistics from the partitioned table. I
see that the pg_statistic row for p is well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based on
the rows in the individual unpruned partitions.

> Here we are talking about partkey = $1 or partkey = RunTimeValue, so
> even though the value can hit only one partition, since we don't know
> the exact value we treat all the partitions equally. So nothing looks
> wrong with the partition-level estimation. However, when we cost the
> Append path, we need to know that just one of them can be hit, and then
> we need to do something there. Both AppendPath->rows and total_cost
> should be adjusted (that doesn't happen now).

I think in this case we can safely assume that only one partition will
remain, so normalize the costs considering that only one partition will
survive.

--
Best Wishes,
Ashutosh Bapat
Re: [HACKERS] Runtime Partition Pruning
Hi Ashutosh: On Thu, Oct 8, 2020 at 7:25 PM Ashutosh Bapat wrote: > On Wed, Oct 7, 2020 at 7:00 PM Andy Fan wrote: > > > > > > > > On Wed, Oct 7, 2020 at 5:05 PM Andy Fan > wrote: > >> > >> > >> > >> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan > wrote: > > > > Now, in my experience, the current system for custom plans vs. generic > plans doesn't approach the problem in this way at all, and in my > experience that results in some pretty terrible behavior. It will do > things like form a custom plan every time because the estimated cost > of the custom plan is lower than the estimated cost of the generic > plan even though the two plans are structurally identical; only the > estimates differ. It will waste gobs of CPU cycles by replanning a > primary key lookup 5 times just on the off chance that a lookup on the > primary key index isn't the best option. But this patch isn't going > to fix any of that. The best we can probably do is try to adjust the > costing for Append paths in some way that reflects the costs and > benefits of pruning. I'm tentatively in favor of trying to do > something modest in that area, but I don't have a detailed proposal. > > >>> > >>> I just realized this issue recently and reported it at [1], then Amit > pointed > >>> me to this issue being discussed here, so I would like to continue > this topic > >>> here. > >>> > >>> I think we can split the issue into 2 issues. One is the partition > prune in initial > >>> partition prune, which maybe happen in custom plan case only and caused > >>> the above issue. The other one happens in the "Run-Time" partition > prune, > >>> I admit that is an important issue to resolve as well, but looks > harder. So I > >>> think we can fix the first one at first. > >>> > >>> ... When we count for the cost of a > >>> generic plan, we can reduce the cost based on such information. 
> >> > >> > >> This way doesn't work since after the initial partition prune, not > only the > >> cost of the Append node should be reduced, the cost of other plans > should > >> be reduced as well [1] > >> > >> However I think if we can use partition prune information from a custom > plan > >> at the cost_append_path stage, it looks the issue can be fixed. If > so, the idea > >> is similar to David's idea in [2], however Robert didn't agree with > this[2]. > >> Can anyone elaborate this objection? for a partkey > $1 or BETWEEN > cases, > >> some real results from the past are probably better than some hard-coded > >> assumptions IMO. > > > > > > I can understand Robert's idea now, he intended to resolve both the > > "initial-partition-prune" case and "runtime partition prune" case while > I always think > > about the former case (Amit reminded me about that at the beginning, but > I just > > wake up right now..) > > > > With the Robert's idea, I think we can do some hack at > create_append_path, > > we can figure out the pruneinfo (it was done at create_append_plan now) > at that > > stage, and then did a fix_append_path with such pruneinfo. We need to > fix not > > only the cost of AppendPath, but also rows of AppendPath, For example: > > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the > > plan is: > > Nest Loop > >Nest Loop > > t1 > > Append (p) > >t2 > > > > Then the rows of Append (p) will be very important. > > > > For this idea, how to estimate how many partitions/rows can be pruned > is a key > > part. Robert said "I was thinking, rather, that if we know for example > that > > we've doing pruning on partition_column = $1, then we know that only > > one partition will match. That's probably a common case. If we've > > got partition_column > $1, we could assume that, say, 75% of the > > partitions would match. 
> > partition_column BETWEEN $1 and $2 is probably a bit more selective, so
> > maybe we assume 50% of the partitions would match." I don't think we
> > can say that 75% or 50% is a good number, but the key point may be that
> > "partition_column = $1" is a common case, and for the other cases we
> > probably don't make things worse.
> >
> > I think we need to do something here; any thoughts? Personally I like
> > this idea now.
>
> Yes, I think we have to do something on those lines. But I am wondering
> why our stats machinery is failing to do that already. For equality I
> think it's straightforward. But even for other operators we should use
> the statistics from the partitioned table to estimate the selectivity
> and then adjust the number of scanned partitions = (number of partitions
> * fraction of rows scanned). That might be better than 50% or 75%.

Sorry for the late reply! Suppose we have a partition hierarchy defined
like this:

p
- p1
- p2

When you talk about "the statistics from the partitioned table", do you
mean the statistics from p or from p1/p2? I just confirmed there are no
statistics for p (at least pg_class.reltuples = -1), so I think you are
talking about p1/p2.
Re: [HACKERS] Runtime Partition Pruning
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan wrote: > > > > On Wed, Oct 7, 2020 at 5:05 PM Andy Fan wrote: >> >> >> >> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan wrote: Now, in my experience, the current system for custom plans vs. generic plans doesn't approach the problem in this way at all, and in my experience that results in some pretty terrible behavior. It will do things like form a custom plan every time because the estimated cost of the custom plan is lower than the estimated cost of the generic plan even though the two plans are structurally identical; only the estimates differ. It will waste gobs of CPU cycles by replanning a primary key lookup 5 times just on the off chance that a lookup on the primary key index isn't the best option. But this patch isn't going to fix any of that. The best we can probably do is try to adjust the costing for Append paths in some way that reflects the costs and benefits of pruning. I'm tentatively in favor of trying to do something modest in that area, but I don't have a detailed proposal. >>> >>> I just realized this issue recently and reported it at [1], then Amit >>> pointed >>> me to this issue being discussed here, so I would like to continue this >>> topic >>> here. >>> >>> I think we can split the issue into 2 issues. One is the partition prune >>> in initial >>> partition prune, which maybe happen in custom plan case only and caused >>> the above issue. The other one happens in the "Run-Time" partition prune, >>> I admit that is an important issue to resolve as well, but looks harder. >>> So I >>> think we can fix the first one at first. >>> >>> ... When we count for the cost of a >>> generic plan, we can reduce the cost based on such information. 
>> >> >> This way doesn't work since after the initial partition prune, not only the >> cost of the Append node should be reduced, the cost of other plans should >> be reduced as well [1] >> >> However I think if we can use partition prune information from a custom plan >> at the cost_append_path stage, it looks the issue can be fixed. If so, the >> idea >> is similar to David's idea in [2], however Robert didn't agree with this[2]. >> Can anyone elaborate this objection? for a partkey > $1 or BETWEEN cases, >> some real results from the past are probably better than some hard-coded >> assumptions IMO. > > > I can understand Robert's idea now, he intended to resolve both the > "initial-partition-prune" case and "runtime partition prune" case while I > always think > about the former case (Amit reminded me about that at the beginning, but I > just > wake up right now..) > > With the Robert's idea, I think we can do some hack at create_append_path, > we can figure out the pruneinfo (it was done at create_append_plan now) at > that > stage, and then did a fix_append_path with such pruneinfo. We need to fix not > only the cost of AppendPath, but also rows of AppendPath, For example: > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the > plan is: > Nest Loop >Nest Loop > t1 > Append (p) >t2 > > Then the rows of Append (p) will be very important. > > For this idea, how to estimate how many partitions/rows can be pruned is a > key > part. Robert said "I was thinking, rather, that if we know for example that > we've doing pruning on partition_column = $1, then we know that only > one partition will match. That's probably a common case. If we've > got partition_column > $1, we could assume that, say, 75% of the > partitions would match. partition_column BETWEEN $1 and $2 is > probably a bit more selective, so maybe we assume 50% of the > partitions would match.". 
I think we can't say the 75% or 50% is a good > number, but the keypoint may be "partition_column = $1" is a common > case. And for the others case, we probably don't make it worse. > > I think we need to do something here, any thoughts? Personally I'm more > like this idea now. Yes, I think we have to do something on those lines. But I am wondering why our stats machinery is failing to do that already. For equality I think it's straight forward. But even for other operators we should use the statistics from the partitioned table to estimate the selectivity and then adjust number of scanned partitions = (number of partitions * fraction of rows scanned). That might be better than 50% or 75%. -- Best Wishes, Ashutosh Bapat
Re: [HACKERS] Runtime Partition Pruning
On Wed, Oct 7, 2020 at 5:05 PM Andy Fan wrote: > > > On Sun, Oct 4, 2020 at 3:10 PM Andy Fan wrote: > >> >>> >>> Now, in my experience, the current system for custom plans vs. generic >>> plans doesn't approach the problem in this way at all, and in my >>> experience that results in some pretty terrible behavior. It will do >>> things like form a custom plan every time because the estimated cost >>> of the custom plan is lower than the estimated cost of the generic >>> plan even though the two plans are structurally identical; only the >>> estimates differ. It will waste gobs of CPU cycles by replanning a >>> primary key lookup 5 times just on the off chance that a lookup on the >>> primary key index isn't the best option. But this patch isn't going >>> to fix any of that. The best we can probably do is try to adjust the >>> costing for Append paths in some way that reflects the costs and >>> benefits of pruning. I'm tentatively in favor of trying to do >>> something modest in that area, but I don't have a detailed proposal. >>> >>> >> I just realized this issue recently and reported it at [1], then Amit >> pointed >> me to this issue being discussed here, so I would like to continue this >> topic >> here. >> >> I think we can split the issue into 2 issues. One is the partition prune >> in initial >> partition prune, which maybe happen in custom plan case only and caused >> the above issue. The other one happens in the "Run-Time" partition >> prune, >> I admit that is an important issue to resolve as well, but looks harder. >> So I >> think we can fix the first one at first. >> >> ... When we count for the cost of a >> generic plan, we can reduce the cost based on such information. 
>> > > This way doesn't work since after the initial partition prune, not only > the > cost of the Append node should be reduced, the cost of other plans should > be reduced as well [1] > > However I think if we can use partition prune information from a custom > plan > at the cost_append_path stage, it looks the issue can be fixed. If so, > the idea > is similar to David's idea in [2], however Robert didn't agree with > this[2]. > Can anyone elaborate this objection? for a partkey > $1 or BETWEEN cases, > some real results from the past are probably better than some hard-coded > assumptions IMO. > I can understand Robert's idea now, he intended to resolve both the "initial-partition-prune" case and "runtime partition prune" case while I always think about the former case (Amit reminded me about that at the beginning, but I just wake up right now..) With the Robert's idea, I think we can do some hack at create_append_path, we can figure out the pruneinfo (it was done at create_append_plan now) at that stage, and then did a fix_append_path with such pruneinfo. We need to fix not only the cost of AppendPath, but also rows of AppendPath, For example: SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the plan is: Nest Loop Nest Loop t1 Append (p) t2 Then the rows of Append (p) will be very important. For this idea, how to estimate how many partitions/rows can be pruned is a key part. Robert said "I was thinking, rather, that if we know for example that we've doing pruning on partition_column = $1, then we know that only one partition will match. That's probably a common case. If we've got partition_column > $1, we could assume that, say, 75% of the partitions would match. partition_column BETWEEN $1 and $2 is probably a bit more selective, so maybe we assume 50% of the partitions would match.". I think we can't say the 75% or 50% is a good number, but the keypoint may be "partition_column = $1" is a common case. 
And for the others case, we probably don't make it worse. I think we need to do something here, any thoughts? Personally I'm more like this idea now. -- Best Regards Andy Fan
Re: [HACKERS] Runtime Partition Pruning
On Sun, Oct 4, 2020 at 3:10 PM Andy Fan wrote: > >> >> Now, in my experience, the current system for custom plans vs. generic >> plans doesn't approach the problem in this way at all, and in my >> experience that results in some pretty terrible behavior. It will do >> things like form a custom plan every time because the estimated cost >> of the custom plan is lower than the estimated cost of the generic >> plan even though the two plans are structurally identical; only the >> estimates differ. It will waste gobs of CPU cycles by replanning a >> primary key lookup 5 times just on the off chance that a lookup on the >> primary key index isn't the best option. But this patch isn't going >> to fix any of that. The best we can probably do is try to adjust the >> costing for Append paths in some way that reflects the costs and >> benefits of pruning. I'm tentatively in favor of trying to do >> something modest in that area, but I don't have a detailed proposal. >> >> > I just realized this issue recently and reported it at [1], then Amit > pointed > me to this issue being discussed here, so I would like to continue this > topic > here. > > I think we can split the issue into 2 issues. One is the partition prune > in initial > partition prune, which maybe happen in custom plan case only and caused > the above issue. The other one happens in the "Run-Time" partition prune, > I admit that is an important issue to resolve as well, but looks harder. > So I > think we can fix the first one at first. > > ... When we count for the cost of a > generic plan, we can reduce the cost based on such information. > This way doesn't work since after the initial partition prune, not only the cost of the Append node should be reduced, the cost of other plans should be reduced as well [1] However I think if we can use partition prune information from a custom plan at the cost_append_path stage, it looks the issue can be fixed. 
If so, the idea is similar to David's idea in [2], however Robert didn't agree with this[2]. Can anyone elaborate this objection? for a partkey > $1 or BETWEEN cases, some real results from the past are probably better than some hard-coded assumptions IMO. [1] https://www.postgresql.org/message-id/CAKU4AWrWSCFO5fh01GTnN%2B1T8K8MyVAi4Gw-TvYC-Vhx3JohUw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAKJS1f8q_d7_Viweeivt1eS4Q8a0WAGFbrgeX38468mVgKseTA%40mail.gmail.com [3] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com -- Best Regards Andy Fan
Re: [HACKERS] Runtime Partition Pruning
> > > > Now, in my experience, the current system for custom plans vs. generic > plans doesn't approach the problem in this way at all, and in my > experience that results in some pretty terrible behavior. It will do > things like form a custom plan every time because the estimated cost > of the custom plan is lower than the estimated cost of the generic > plan even though the two plans are structurally identical; only the > estimates differ. It will waste gobs of CPU cycles by replanning a > primary key lookup 5 times just on the off chance that a lookup on the > primary key index isn't the best option. But this patch isn't going > to fix any of that. The best we can probably do is try to adjust the > costing for Append paths in some way that reflects the costs and > benefits of pruning. I'm tentatively in favor of trying to do > something modest in that area, but I don't have a detailed proposal. > > I just realized this issue recently and reported it at [1], then Amit pointed me to this issue being discussed here, so I would like to continue this topic here. I think we can split the issue into 2 issues. One is the partition prune in initial partition prune, which maybe happen in custom plan case only and caused the above issue. The other one happens in the "Run-Time" partition prune, I admit that is an important issue to resolve as well, but looks harder. So I think we can fix the first one at first. The proposal to fix the first one is we can remember how many partitions survived after plan time pruned for a RelOptInfo for a custom plan. and record such information in the CachedPlanSource. When we count for the cost of a generic plan, we can reduce the cost based on such information. Any thought about this? I'd be sorry if I missed some already existing discussion on this topic. [1] https://www.postgresql.org/message-id/CA%2BHiwqGsP2L0BW1ad58HRSj1NouNSVHLfL5pm7%3DPBTvL0b%2B-BQ%40mail.gmail.com -- Best Regards Andy Fan
Re: [HACKERS] Runtime Partition Pruning
On Wed, May 29, 2019 at 6:02 PM Tom Lane wrote: > Well, it *is* a problem. The whole point of this discussion I think is > to try to get better information "by default" for routine bug reports. > So if those come from production servers without debug symbols, which > I believe will be the usual case, then it seems likely to me that > libunwind will produce no better results than glibc. (But perhaps > I'm wrong about that --- I have not experimented with libunwind.) Sure, I agree. > Now it's true that "install debug symbols" is less of an ask than > "install debug symbols, *and* gdb, and make sure server core dumps are > enabled, and then go through this arcane manual procedure next time > you get a core dump". But we shouldn't fool ourselves that it isn't > an ask that's going to be hard for people with corporate policies > against installing extra stuff on production servers. There may be cases where that is true, but as you say, it's better than what we have now. Plus, what exactly is the alternative? We could: - encourage packagers to install debug symbols by default (but they might not; it might even be against policy), or - invent our own system for generating backtraces and ignore what the OS toolchain knows how to do (sounds painfully complex and expensive), or - just live with the fact that it's imperfect. Is there a fourth option? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
Robert Haas writes: > On Fri, May 24, 2019 at 12:09 PM Tom Lane wrote: >> Is it actually better? The basic problem with backtrace() is that it >> only knows about global functions, and so reports call sites in static >> functions as if they were in whatever global function physically precedes >> the static one. I think doing materially better requires depending on >> debug symbols, which (at least in the Red Hat world) aren't going to >> be there in a typical production situation. > I don't have an opinion on glibc vs. libunwind, but I don't understand > this argument. If you are unlucky enough to have a production server > that is crashing due to some hitherto-unknown bug, and if it's not > possible to get a good backtrace without installing debugging symbols, > then you are going to have to pick between (1) installing those > debugging symbols and (2) getting a poor backtrace. I don't really > see that as a problem so much as just the way life is. Well, it *is* a problem. The whole point of this discussion I think is to try to get better information "by default" for routine bug reports. So if those come from production servers without debug symbols, which I believe will be the usual case, then it seems likely to me that libunwind will produce no better results than glibc. (But perhaps I'm wrong about that --- I have not experimented with libunwind.) Now it's true that "install debug symbols" is less of an ask than "install debug symbols, *and* gdb, and make sure server core dumps are enabled, and then go through this arcane manual procedure next time you get a core dump". But we shouldn't fool ourselves that it isn't an ask that's going to be hard for people with corporate policies against installing extra stuff on production servers. regards, tom lane
Re: [HACKERS] Runtime Partition Pruning
On Fri, May 24, 2019 at 12:09 PM Tom Lane wrote: > Is it actually better? The basic problem with backtrace() is that it > only knows about global functions, and so reports call sites in static > functions as if they were in whatever global function physically precedes > the static one. I think doing materially better requires depending on > debug symbols, which (at least in the Red Hat world) aren't going to > be there in a typical production situation. I don't have an opinion on glibc vs. libunwind, but I don't understand this argument. If you are unlucky enough to have a production server that is crashing due to some hitherto-unknown bug, and if it's not possible to get a good backtrace without installing debugging symbols, then you are going to have to pick between (1) installing those debugging symbols and (2) getting a poor backtrace. I don't really see that as a problem so much as just the way life is. You can't expect to get good debugging output without debugging symbols, just as you can't expect to get a clean audit without bank statements or a clear idea of how to find your way to an unknown destination without a map. If you don't have the thing that contains the information that is needed, you can't get the information; c'est la vie. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
Hi,

On 2019-05-25 00:42:39 +0200, Tomas Vondra wrote:
> On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:
> > On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> > > Andres Freund writes:
> > > The basic problem with backtrace() is that it
> > > only knows about global functions, and so reports call sites in static
> > > functions as if they were in whatever global function physically precedes
> > > the static one.
> >
> > Does that depend on whether the program was compiled with
> > -fno-omit-frame-pointer? At least some distros now compile with frame
> > pointers enabled, to get reasonably fast perf profiles (at a basically
> > immeasurable slowdown, on modern-ish CPUs).
>
> I doubt that, because if that was the case we'd not be able to get
> accurate profiles from perf, no? And AFAICS that's not the case,
> irrespective of whether -fno-omit-frame-pointer is used.

I can't parse this. With perf you can get accurate call-graph profiles if
you either use -fno-omit-frame-pointer, to force frame pointers to be
present (so the call graph can cheaply be assembled during profiling), or
dwarf (the entire stack is saved, and dwarf unwinding happens at perf
report time - very large), or lbr (the CPU saves traces of branches taken,
enabling call graphs to be computed, but the recorded branch stacks are
not very deep).

> > > I think doing materially better requires depending on
> > > debug symbols, which (at least in the Red Hat world) aren't going to
> > > be there in a typical production situation.
> >
> > It's still a lot easier to install debug symbols than to attach a
> > debugger and get a backtrace that way. Especially when the problem is
> > hard to reproduce.
>
> Right. Debugger requires interaction with a running process, while
> having it integrated would make that unnecessary.
>
> But I think Peter also suggested this might require the ability to
> selectively enable the backtraces, and I think he's right.
> I doubt we want to log them for every log message, right?

Well, I think that if we had PANIC, SIGSEGV/SIGBUS and most FATALs covered,
we'd be off to a very good start. I'm not sure it's wise to give users
control over the computation of stack traces.

Greetings,

Andres Freund
Re: [HACKERS] Runtime Partition Pruning
On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> > Andres Freund writes:
> > > On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> > > > Hmm, after some digging in the archives, the closest thing I can find
> > > > is this thread:
> > > > https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> > > > where we discussed using libunwind instead, but people didn't like
> > > > the extra dependency.
> >
> > > Hm, I didn't actually see that much concern about that. I still think we
> > > should just go for libunwind.
> >
> > Is it actually better?
>
> I've not looked in a while, but I think at some point it was.
>
> > The basic problem with backtrace() is that it
> > only knows about global functions, and so reports call sites in static
> > functions as if they were in whatever global function physically precedes
> > the static one.
>
> Does that depend on whether the program was compiled with
> -fno-omit-frame-pointer? At least some distros now compile with frame
> pointers enabled, to get reasonably fast perf profiles (at a basically
> immeasurable slowdown, on modern-ish CPUs).

I doubt that, because if that was the case we'd not be able to get
accurate profiles from perf, no? And AFAICS that's not the case,
irrespective of whether -fno-omit-frame-pointer is used.

> > I think doing materially better requires depending on
> > debug symbols, which (at least in the Red Hat world) aren't going to
> > be there in a typical production situation.
>
> It's still a lot easier to install debug symbols than to attach a
> debugger and get a backtrace that way. Especially when the problem is
> hard to reproduce.

Right. Debugger requires interaction with a running process, while
having it integrated would make that unnecessary.

But I think Peter also suggested this might require the ability to
selectively enable the backtraces, and I think he's right. I doubt we
want to log them for every log message, right?
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi, On 2019-05-24 12:08:57 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2019-05-24 11:34:58 -0400, Tom Lane wrote: > >> Hmm, after some digging in the archives, the closest thing I can find > >> is this thread: > >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com > >> where we discussed using libunwind instead, but people didn't like > >> the extra dependency. > > > Hm, I didn't actually see that much concern about that. I still think we > > should just go for libunwind. > > Is it actually better? I've not looked in a while, but I think at some point it was. > The basic problem with backtrace() is that it > only knows about global functions, and so reports call sites in static > functions as if they were in whatever global function physically precedes > the static one. Does that depend on whether the program was compiled with -fno-omit-frame-pointer? At least some distros now compile with frame pointers enabled, to get reasonably fast perf profiles (at a basically immeasurable slowdown, on modern-ish CPUs). > I think doing materially better requires depending on > debug symbols, which (at least in the Red Hat world) aren't going to > be there in a typical production situation. It's still a lot easier to install debug symbols than to attach a debugger and get a backtrace that way. Especially when the problem is hard to reproduce. Greetings, Andres Freund
Re: [HACKERS] Runtime Partition Pruning
Andres Freund writes: > On 2019-05-24 11:34:58 -0400, Tom Lane wrote: >> Hmm, after some digging in the archives, the closest thing I can find >> is this thread: >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com >> where we discussed using libunwind instead, but people didn't like >> the extra dependency. > Hm, I didn't actually see that much concern about that. I still think we > should just go for libunwind. Is it actually better? The basic problem with backtrace() is that it only knows about global functions, and so reports call sites in static functions as if they were in whatever global function physically precedes the static one. I think doing materially better requires depending on debug symbols, which (at least in the Red Hat world) aren't going to be there in a typical production situation. regards, tom lane
Re: [HACKERS] Runtime Partition Pruning
Hi,

On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> I wrote:
> > Peter Eisentraut writes:
> >> What do people think about adding something like this errbacktrace()
> >> from Álvaro's patch to core PostgreSQL?
>
> > I think we did discuss it right after that, or somewhere nearby, and
> > concluded that the output is so imprecise that it's not really going
> > to be worth whatever portability issues we'd have to deal with.
>
> Hmm, after some digging in the archives, the closest thing I can find
> is this thread:
>
> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>
> where we discussed using libunwind instead, but people didn't like
> the extra dependency.

Hm, I didn't actually see that much concern about that. I still think we
should just go for libunwind. At least on debian it's likely to already be
installed:

andres@alap4:~$ apt rdepends libunwind8
libunwind8
Reverse Depends:
  Depends: libunwind-dev (= 1.2.1-9)
  Depends: linux-perf-4.16
  Depends: linux-perf-4.15
  Depends: linux-perf-4.14
  Depends: rspamd
  Depends: linux-perf-5.0
  Depends: libjulia1
  Depends: julia
  Depends: geary
  Depends: libunwind8-dbgsym (= 1.2.1-9)
  Depends: xwayland
  Depends: xvfb
  Depends: xserver-xorg-core
  Depends: xserver-xephyr
  Depends: xnest
  Depends: xdmx
  Depends: trafficserver
  Depends: tigervnc-standalone-server
  Depends: tarantool
  Depends: strace
  Depends: spring
  Depends: rspamd
  Depends: linux-perf-4.19
  Depends: libunwind-setjmp0
  Depends: libeina1a
  Depends: libjulia1
  Depends: julia
  Depends: intel-gpu-tools
  Depends: libheaptrack
  Depends: libgoogle-perftools4
  Depends: libgoogle-glog0v5
  Depends: gdnsd
  Depends: libevas1-engines-x
  Depends: libevas1

In particular strace, xserver-xorg-core, perf are reasonably likely to be
installed already. It's also not a large library. I'd bet if we made it an
optional build-time dependency it'd get included by just about every distro.

Greetings,

Andres Freund
Re: [HACKERS] Runtime Partition Pruning
I wrote: > Peter Eisentraut writes: >> What do people think about adding something like this errbacktrace() >> from Álvaro's patch to core PostgreSQL? > I think we did discuss it right after that, or somewhere nearby, and > concluded that the output is so imprecise that it's not really going > to be worth whatever portability issues we'd have to deal with. Hmm, after some digging in the archives, the closest thing I can find is this thread: https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com where we discussed using libunwind instead, but people didn't like the extra dependency. However, I stand by the assertion that glibc's backtrace() is too imprecise to be useful; I've experimented with it and despaired of being able to tell where control had actually been. regards, tom lane
Re: [HACKERS] Runtime Partition Pruning
Peter Eisentraut writes: > On 2018-04-10 23:32, Alvaro Herrera wrote: >> To figure out, I used the attached patch (not intended for application) >> to add a backtrace to each log message, plus a couple of accusatory >> elog() calls in relation_open and ExecSetupPartitionPruneState. > What do people think about adding something like this errbacktrace() > from Álvaro's patch to core PostgreSQL? If we could devise a way to > selectively enable it, it might be an easier way for users to provide > backtraces from errors. I think we did discuss it right after that, or somewhere nearby, and concluded that the output is so imprecise that it's not really going to be worth whatever portability issues we'd have to deal with. I'd be all for a better version, but glibc's backtrace() just sucks, at least given our coding style with lots of static functions. regards, tom lane
Re: [HACKERS] Runtime Partition Pruning
On 2018-04-10 23:32, Alvaro Herrera wrote: > To figure out, I used the attached patch (not intended for application) > to add a backtrace to each log message, plus a couple of accusatory > elog() calls in relation_open and ExecSetupPartitionPruneState. What do people think about adding something like this errbacktrace() from Álvaro's patch to core PostgreSQL? If we could devise a way to selectively enable it, it might be an easier way for users to provide backtraces from errors. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 25 April 2018 at 05:10, Alvaro Herrera wrote: > Pushed. Thanks! Thanks! -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 25 April 2018 at 06:21, Alvaro Herrera wrote:
> Andres Freund wrote:
>
>> What I wonder, after skimming this change, is where the relevant
>> expression context is reset? That's not really related to this change
>> but the wider thread, I just noticed it while looking at this.
>
> Do you mean ResetExprContext? We use the plan's context, so it should
> occur together with the plan reset itself, i.e. nodeAppend.c should do
> it somewhere.
>
> ... Hmm, it appears we don't do it anywhere there actually.

It's not immediately obvious to me why this is required. All the
expressions that are initialized here must live the entire length of the
executor run, and since they're allocated in the ExecutorState context
they'll be destroyed in FreeExecutorState().

If we do need to call ResetExprContext for these, then we'd just need to
invent a teardown for ExecSetupPartitionPruneState which would free off
the memory allocated and call ResetExprContext for all non-NULL ExprStates
in each context->exprstates. This function would need to be called from
the node's End function for any node that's set up a PartitionPruneState.
Do we really need to do this given that the memory context these are
allocated in will be released a moment later anyway?

Just to ensure I'm not dreaming, I ran the following and couldn't see the
backend's memory consumption move.

create table lp (a int, value int) partition by list(a);
create table lp_1 partition of lp for values in(1);
create table lp_2 partition of lp for values in(2);

create function lp_value(p_a int) returns int as $$
select value from lp where a = p_a;
$$ language sql;

insert into lp values(1,10),(2,20);

select sum(lp_value(x)) from generate_Series(1,2) x, generate_series(1,1000);

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Andres Freund wrote:

> What I wonder, after skimming this change, is where the relevant
> expression context is reset? That's not really related to this change
> but the wider thread, I just noticed it while looking at this.

Do you mean ResetExprContext? We use the plan's context, so it should
occur together with the plan reset itself, i.e. nodeAppend.c should do
it somewhere.

... Hmm, it appears we don't do it anywhere there actually.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 2018-04-19 12:04:35 +1200, David Rowley wrote: > On 19 April 2018 at 03:13, Robert Haas wrote: > > 10% is more than a "slight" improvement, I'd say! It's certainly got > > to be worth avoiding the repeated calls to ExecInitExpr, whatever we > > do about the memory contexts. Yea, that seems important. Good that that got in. What I wonder, after skimming this change, is where the relevant expression context is reset? That's not really related to this change but the wider thread, I just noticed it while looking at this. Greetings, Andres Freund
Re: [HACKERS] Runtime Partition Pruning
Pushed. Thanks!

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Amit Langote wrote:

> That's neat! Definitely agree that we should call ExecInitExpr just once
> here. The patch looks good too, except the long line.

How about this as a small tweak? Determine the array index using a macro,
which serves as documentation.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 135740cd17e8c05f36a3b91b3af29515df6a2d3b Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com"
Date: Thu, 19 Apr 2018 11:51:16 +1200
Subject: [PATCH v3] Initialize ExprStates once in run-time partition pruning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Instead of doing ExecInitExpr every time a Param needs to be evaluated in
run-time partition pruning, do it once during run-time pruning set-up and
cache the exprstate in PartitionPruneContext, saving a lot of work.

Author: David Rowley
Reviewed-by: Amit Langote, Álvaro Herrera
Discussion: https://postgr.es/m/CAKJS1f8-x+q-90QAPDu_okhQBV4DPEtPz8CJ=m0940gyt4d...@mail.gmail.com
---
 src/backend/executor/execPartition.c | 35 +++
 src/backend/partitioning/partprune.c | 27 +--
 src/include/partitioning/partprune.h |  9 +
 3 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index f7bbb804aa..f7418f64b1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1442,7 +1442,9 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		PartitionDesc partdesc;
 		Relation	rel;
 		PartitionKey partkey;
+		ListCell   *lc2;
 		int			partnatts;
+		int			n_steps;

 		pprune->present_parts = bms_copy(pinfo->present_parts);
 		pprune->subnode_map = palloc(sizeof(int) * pinfo->nparts);
@@ -1465,6 +1467,7 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)

 		partkey = RelationGetPartitionKey(rel);
 		partdesc = RelationGetPartitionDesc(rel);
+		n_steps = list_length(pinfo->pruning_steps);

 		context->strategy = partkey->strategy;
 		context->partnatts = partnatts = partkey->partnatts;
@@ -1476,6 +1479,38 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		context->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey);
 		context->planstate = planstate;
 		context->safeparams = NULL; /* empty for now */
+		context->exprstates = palloc0(sizeof(ExprState *) * n_steps * partnatts);
+
+		/* Initialize expression states for each expression */
+		foreach(lc2, pinfo->pruning_steps)
+		{
+			PartitionPruneStepOp *step = (PartitionPruneStepOp *) lfirst(lc2);
+			ListCell   *lc3;
+			int			keyno;
+
+			/* not needed for other step kinds */
+			if (!IsA(step, PartitionPruneStepOp))
+				continue;
+
+			Assert(list_length(step->exprs) <= partnatts);
+
+			keyno = 0;
+			foreach(lc3, step->exprs)
+			{
+				Expr	   *expr = (Expr *) lfirst(lc3);
+				int			stateidx;
+
+				/* not needed for Consts */
+				if (!IsA(expr, Const))
+				{
+					stateidx = PruneCxtStateIdx(partnatts,
+												step->step.step_id, keyno);
+					context->exprstates[stateidx] =
+						ExecInitExpr(expr, context->planstate);
+				}
+				keyno++;
+			}
+		}

 		pprune->pruning_steps = pinfo->pruning_steps;
 		pprune->extparams = bms_copy(pinfo->extparams);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index f8844ef2eb..62159477c1 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -169,7 +169,7 @@ static PruneStepResult *perform_pruning_combine_step(PartitionPruneContext *cont
 static bool match_boolean_partition_clause(Oid partopfamily, Expr *clause,
 							   Expr *partkey, Expr **outconst);
 static bool partkey_datum_from_expr(PartitionPruneContext *context,
-
Re: [HACKERS] Runtime Partition Pruning
Anybody wanna argue against pushing this patch now? I'm inclined to push
it on the grounds of being closure for already committed work, but there
are possible arguments about this being new development after feature
freeze.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi David. On 2018/04/19 9:04, David Rowley wrote: > On 19 April 2018 at 03:13, Robert Haas wrote: >> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley >> wrote: >>> The patch does happen to improve performance slightly, but that is >>> most likely due to the caching of the ExprStates rather than the >>> change of memory management. It's not really possible to do that with >>> the reset unless we stored the executor's memory context in >>> PartitionPruneContext and did a context switch back inside >>> partkey_datum_from_expr before calling ExecInitExpr. >> >> 10% is more than a "slight" improvement, I'd say! It's certainly got >> to be worth avoiding the repeated calls to ExecInitExpr, whatever we >> do about the memory contexts. > > I've attached a patch which does just this. On benchmarking again with > this single change performance has improved 15% over master. > > Also, out of curiosity, I also checked what this performed like before > the run-time pruning patch was committed (5c0675215). Taking the > average of the times below, it seems without this patch the > performance of this case has improved about 356% and about 410% with > this patch. So, I agree, it might be worth considering. 
>
> create table p (a int, value int) partition by hash (a);
> select 'create table p'||x|| ' partition of p for values with (modulus
> 10, remainder '||x||');' from generate_series(0,9) x;
> \gexec
> create table t1 (a int);
>
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;
>
> create index on p(a);
>
> set enable_hashjoin = 0;
> set enable_mergejoin = 0;
> explain analyze select count(*) from t1 inner join p on t1.a=p.a;
>
> -- Unpatched
> Execution Time: 20413.975 ms
> Execution Time: 20232.050 ms
> Execution Time: 20229.116 ms
>
> -- Patched
> Execution Time: 17758.111 ms
> Execution Time: 17645.151 ms
> Execution Time: 17492.260 ms
>
> -- 5c0675215e153ba1297fd494b34af2fdebd645d1
> Execution Time: 72875.161 ms
> Execution Time: 71817.757 ms
> Execution Time: 72411.730 ms

That's neat! Definitely agree that we should call ExecInitExpr just once
here. The patch looks good too, except the long line. Maybe:

@@ -1514,13 +1514,15 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 			foreach(lc3, step->exprs)
 			{
 				Expr	   *expr = (Expr *) lfirst(lc3);
+				int			step_id = step->step.step_id;

 				/*
 				 * partkey_datum_from_expr does not need an expression state
 				 * to evaluate a Const.
 				 */
 				if (!IsA(expr, Const))
-					context->exprstates[step->step.step_id * partnatts + keyno] = ExecInitExpr(expr, context->planstate);
+					context->exprstates[step_id * partnatts + keyno] =
+						ExecInitExpr(expr, context->planstate);

Thanks,
Amit
Re: [HACKERS] Runtime Partition Pruning
On 19 April 2018 at 12:04, David Rowley wrote: > insert into p select x,x from generate_Series(1,1000) x; > insert into t1 select x from generate_series(1,1000) x; Correction. These were meant to read: insert into p select x,x from generate_Series(1,1000) x; insert into t1 select x from generate_series(1,1000) x; -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 19 April 2018 at 03:13, Robert Haas wrote: > On Mon, Apr 16, 2018 at 10:46 PM, David Rowley > wrote: >> I did go and start working on a patch to test how possible this would >> be and came up with the attached. I've left a stray >> MemoryContextStatsDetail call in there which does indicate that >> something is not being freed. I'm just not sure what it is yet. >> >> The patch does happen to improve performance slightly, but that is >> most likely due to the caching of the ExprStates rather than the >> change of memory management. It's not really possible to do that with >> the reset unless we stored the executor's memory context in >> PartitionPruneContext and did a context switch back inside >> partkey_datum_from_expr before calling ExecInitExpr. > > 10% is more than a "slight" improvement, I'd say! It's certainly got > to be worth avoiding the repeated calls to ExecInitExpr, whatever we > do about the memory contexts. I've attached a patch which does just this. On benchmarking again with this single change performance has improved 15% over master. Also, out of curiosity, I also checked what this performed like before the run-time pruning patch was committed (5c0675215). Taking the average of the times below, it seems without this patch the performance of this case has improved about 356% and about 410% with this patch. So, I agree, it might be worth considering. 
create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;

-- Unpatched
Execution Time: 20413.975 ms
Execution Time: 20232.050 ms
Execution Time: 20229.116 ms

-- Patched
Execution Time: 17758.111 ms
Execution Time: 17645.151 ms
Execution Time: 17492.260 ms

-- 5c0675215e153ba1297fd494b34af2fdebd645d1
Execution Time: 72875.161 ms
Execution Time: 71817.757 ms
Execution Time: 72411.730 ms

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

0001-Initialize-expr-states-once-in-run-time-partition-pr.patch
Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
On Mon, Apr 16, 2018 at 10:46 PM, David Rowley wrote: > I did go and start working on a patch to test how possible this would > be and came up with the attached. I've left a stray > MemoryContextStatsDetail call in there which does indicate that > something is not being freed. I'm just not sure what it is yet. > > The patch does happen to improve performance slightly, but that is > most likely due to the caching of the ExprStates rather than the > change of memory management. It's not really possible to do that with > the reset unless we stored the executor's memory context in > PartitionPruneContext and did a context switch back inside > partkey_datum_from_expr before calling ExecInitExpr. 10% is more than a "slight" improvement, I'd say! It's certainly got to be worth avoiding the repeated calls to ExecInitExpr, whatever we do about the memory contexts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
On 17 April 2018 at 14:33, Alvaro Herrera wrote: > David Rowley wrote: > >> For a while, during my review of the faster partition pruning patch I >> was asking Amit to add pfree() calls in various places for this exact >> reason, but in the end, I gave up and decided it was easier to just >> create a new memory context to call the planner function from. I've >> now forgotten the exact reason why I finally decided it was too much >> trouble. The pruning code now works using your step logic so perhaps >> that reason no longer applies, although, on a quick scan of the >> pruning code now, it seems to require that get_matching_partitions >> performs a deep pfree of each PruneStepResult. However, there is still >> partkey_datum_from_expr which performs ExecInitExpr, although perhaps >> that can just be done once, and the result stashed in the >> PartitionPruneContext. > > I think trying to replace a well-placed MemoryContextReset (or Delete) > with a bunch of individual pfrees is pointless. I agree. I think I'd sleep better at night with the context reset in there rather than hoping we've managed to pfree everything. I did go and start working on a patch to test how possible this would be and came up with the attached. I've left a stray MemoryContextStatsDetail call in there which does indicate that something is not being freed. I'm just not sure what it is yet. The patch does happen to improve performance slightly, but that is most likely due to the caching of the ExprStates rather than the change of memory management. It's not really possible to do that with the reset unless we stored the executor's memory context in PartitionPruneContext and did a context switch back inside partkey_datum_from_expr before calling ExecInitExpr. 
My test case was as follows:

create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;

-- Unpatched
Execution Time: 19725.981 ms
Execution Time: 19533.655 ms
Execution Time: 19542.854 ms

-- Patched
Execution Time: 17389.537 ms
Execution Time: 17603.802 ms
Execution Time: 17618.670 ms

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

recycle_mem_part_prune.patch
Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
David Rowley wrote:

> For a while, during my review of the faster partition pruning patch I
> was asking Amit to add pfree() calls in various places for this exact
> reason, but in the end, I gave up and decided it was easier to just
> create a new memory context to call the planner function from. I've
> now forgotten the exact reason why I finally decided it was too much
> trouble. The pruning code now works using your step logic so perhaps
> that reason no longer applies, although, on a quick scan of the
> pruning code now, it seems to require that get_matching_partitions
> performs a deep pfree of each PruneStepResult. However, there is still
> partkey_datum_from_expr which performs ExecInitExpr, although perhaps
> that can just be done once, and the result stashed in the
> PartitionPruneContext.

I think trying to replace a well-placed MemoryContextReset (or Delete)
with a bunch of individual pfrees is pointless.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 14 April 2018 at 05:04, Robert Haas wrote: > But I wonder why it's the executor's job to clean up after the > planner, instead of adjusting the relevant planner functions to avoid > leaking memory? It might be possible, but it might also be risky and difficult. For a while, during my review of the faster partition pruning patch I was asking Amit to add pfree() calls in various places for this exact reason, but in the end, I gave up and decided it was easier to just create a new memory context to call the planner function from. I've now forgotten the exact reason why I finally decided it was too much trouble. The pruning code now works using your step logic so perhaps that reason no longer applies, although, on a quick scan of the pruning code now, it seems to require that get_matching_partitions performs a deep pfree of each PruneStepResult. However, there is still partkey_datum_from_expr which performs ExecInitExpr, although perhaps that can just be done once, and the result stashed in the PartitionPruneContext. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On Thu, Apr 12, 2018 at 6:01 PM, David Rowley wrote: > On 13 April 2018 at 04:57, Robert Haas wrote: >> BTW, looking at ExecSetupPartitionPruneState: >> >> /* >> * Create a sub memory context which we'll use when making calls to >> the >> * query planner's function to determine which partitions will >> match. The >> * planner is not too careful about freeing memory, so we'll ensure >> we >> * call the function in this context to avoid any memory leaking in >> the >> * executor's memory context. >> */ >> >> This is a sloppy cut-and-paste job, not only because somebody changed >> one copy of the word "planner" to "executor" and left the others >> untouched, but also because the rationale isn't really correct for the >> executor anyway, which has memory contexts all over the place and >> frees them all the time. I don't know whether the context is not >> needed at all or whether the context is needed but the rationale is >> different, but I don't buy that explanation. > > The comment is written exactly as intended. Unsure which of the > "planner"s you think should be "executor". > > The context is needed. I can easily produce an OOM without it. Oh, crap. You know, I totally misread what that comment was trying to say. Sorry. But I wonder why it's the executor's job to clean up after the planner, instead of adjusting the relevant planner functions to avoid leaking memory? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
On 2018/04/13 14:48, Amit Langote wrote:
> On 2018/04/13 14:38, Amit Langote wrote:
>> About the specific relation_open(.., NoLock) under question, I think there
>> might be a way to address this by opening the tables with the appropriate
>> lock mode in partitioned_rels list in ExecLockNonleafAppendTables
>
> That may have sounded a bit confusing:
>
> I meant to say: "by opening the tables in partitioned_rels list with the
> appropriate lock mode in ExecLockNonleafAppendTables"
>
>> Attached a PoC patch.
>
> There were a couple of unnecessary hunks, which removed in the attached.

Sorry, still a couple more were unnecessary.

Thanks,
Amit

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 11139f743d..e3b3911945 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+							 Relation *partitioned_rels)
 {
 	PartitionPruningData *prunedata;
 	PartitionPruneState *prunestate;
@@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		pprune->subpart_map = pinfo->subpart_map;

 		/*
-		 * Grab some info from the table's relcache; lock was already obtained
-		 * by ExecLockNonLeafAppendTables.
+		 * ExecLockNonleafAppendTables already opened the relation for us.
 		 */
-		rel = relation_open(pinfo->reloid, NoLock);
+		Assert(partitioned_rels[i] != NULL);
+		rel = partitioned_rels[i];
+		Assert(RelationGetRelid(rel) == pinfo->reloid);

 		partkey = RelationGetPartitionKey(rel);
 		partdesc = RelationGetPartitionDesc(rel);
@@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		prunestate->execparams = bms_add_members(prunestate->execparams,
 												 pinfo->execparams);

-		relation_close(rel, NoLock);
-
 		i++;
 	}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..58a0961eb1 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the tables indicated by the RT indexes contained in
- * the partitioned_rels list.  These are the non-leaf tables in the partition
- * tree controlled by a given Append or MergeAppend node.
+ * Opens and/or locks, if necessary, the tables indicated by the RT indexes
+ * contained in the partitioned_rels list.  These are the non-leaf tables in
+ * the partition tree controlled by a given Append or MergeAppend node.
  */
 void
-ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
+ExecLockNonLeafAppendTables(PlanState *planstate,
+							EState *estate,
+							List *partitioned_rels)
 {
 	PlannedStmt *stmt = estate->es_plannedstmt;
 	ListCell   *lc;
+	int			i;

+	/*
+	 * If we're going to be performing pruning, allocate space for Relation
+	 * pointers to be used later when setting up partition pruning state in
+	 * ExecSetupPartitionPruneState.
+	 */
+	if (IsA(planstate, AppendState))
+	{
+		AppendState *appendstate = (AppendState *) planstate;
+		Append	   *node = (Append *) planstate->plan;
+
+		if (node->part_prune_infos != NIL)
+		{
+			Assert(list_length(node->part_prune_infos) ==
+				   list_length(partitioned_rels));
+			appendstate->partitioned_rels = (Relation *)
+				palloc(sizeof(Relation) *
+					   list_length(node->part_prune_infos));
+			appendstate->num_partitioned_rels =
+				list_length(node->part_prune_infos);
+		}
+	}
+
+	i = 0;
 	foreach(lc, partitioned_rels)
 	{
 		ListCell   *l;
 		Index		rti = lfirst_int(lc);
 		bool		is_result_rel = false;
 		Oid			relid = getrelid(rti, estate->es_range_tab
Re: [HACKERS] Runtime Partition Pruning
On 2018/04/13 14:38, Amit Langote wrote: > About the specific relation_open(.., NoLock) under question, I think there > might be a way to address this by opening the tables with the appropriate > lock mode in partitioned_rels list in ExecLockNonleafAppendTables That may have sounded a bit confusing: I meant to say: "by opening the tables in partitioned_rels list with the appropriate lock mode in ExecLockNonleafAppendTables" > Attached a PoC patch. There were a couple of unnecessary hunks, which removed in the attached. Thanks, Amit diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 11139f743d..e3b3911945 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map) * PartitionPruneInfo. */ PartitionPruneState * -ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) +ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo, +Relation *partitioned_rels) { PartitionPruningData *prunedata; PartitionPruneState *prunestate; @@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) pprune->subpart_map = pinfo->subpart_map; /* -* Grab some info from the table's relcache; lock was already obtained -* by ExecLockNonLeafAppendTables. +* ExecLockNonLeafAppendTables already opened the relation for us. 
*/ - rel = relation_open(pinfo->reloid, NoLock); + Assert(partitioned_rels[i] != NULL); + rel = partitioned_rels[i]; + Assert(RelationGetRelid(rel) == pinfo->reloid); partkey = RelationGetPartitionKey(rel); partdesc = RelationGetPartitionDesc(rel); @@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) prunestate->execparams = bms_add_members(prunestate->execparams, pinfo->execparams); - relation_close(rel, NoLock); - i++; } diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index b963cae730..58a0961eb1 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit) /* * ExecLockNonLeafAppendTables * - * Locks, if necessary, the tables indicated by the RT indexes contained in - * the partitioned_rels list. These are the non-leaf tables in the partition - * tree controlled by a given Append or MergeAppend node. + * Opens and/or locks, if necessary, the tables indicated by the RT indexes + * contained in the partitioned_rels list. These are the non-leaf tables in + * the partition tree controlled by a given Append or MergeAppend node. */ void -ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate) +ExecLockNonLeafAppendTables(PlanState *planstate, + EState *estate, + List *partitioned_rels) { PlannedStmt *stmt = estate->es_plannedstmt; ListCell *lc; + int i; + /* +* If we're going to be performing pruning, allocate space for Relation +* pointers to be used later when setting up partition pruning state in +* ExecSetupPartitionPruneState. 
+*/ + if (IsA(planstate, AppendState)) + { + AppendState *appendstate = (AppendState *) planstate; + Append *node = (Append *) planstate->plan; + + if (node->part_prune_infos != NIL) + { + Assert(list_length(node->part_prune_infos) == + list_length(partitioned_rels)); + appendstate->partitioned_rels = (Relation *) + palloc(sizeof(Relation) * + list_length(node->part_prune_infos)); + appendstate->num_partitioned_rels = + list_length(node->part_prune_infos); + } + } + + i = 0; foreach(lc, partitioned_rels) { ListCell *l; Index rti = lfirst_int(lc); boolis_result_rel = false; Oid relid = getrelid(rti, estate->es_range_table); + int lockmode; /* If this is a result relation, alr
Re: [HACKERS] Runtime Partition Pruning
On 2018/04/13 1:57, Robert Haas wrote: >> It might be possible to do something better in each module by keeping >> an array indexed by RTI which have each entry NULL initially then on >> first relation_open set the element in the array to that pointer. > > I'm not sure that makes a lot of sense in the planner, but in the > executor it might be a good idea. See also > https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com > for related discussion. I think that a coding pattern where we rely > on relation_open(..., NoLock) is inherently dangerous -- it's too easy > to be wrong about whether the lock is sure to have been taken. It > would be much better to open the relation once and hold onto the > pointer, not just for performance reasons, but for robustness. About the specific relation_open(.., NoLock) under question, I think there might be a way to address this by opening the tables with the appropriate lock mode in partitioned_rels list in ExecLockNonleafAppendTables and keeping the Relation pointers around until ExecEndNode. Then instead of ExecSetupPartitionPruneState doing relation_open/close(.., NoLock), it just reuses the one that's passed by the caller. Attached a PoC patch. David, thoughts? Thanks, Amit diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 11139f743d..e3b3911945 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map) * PartitionPruneInfo. 
*/ PartitionPruneState * -ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) +ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo, +Relation *partitioned_rels) { PartitionPruningData *prunedata; PartitionPruneState *prunestate; @@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) pprune->subpart_map = pinfo->subpart_map; /* -* Grab some info from the table's relcache; lock was already obtained -* by ExecLockNonLeafAppendTables. +* ExecLockNonLeafAppendTables already opened the relation for us. */ - rel = relation_open(pinfo->reloid, NoLock); + Assert(partitioned_rels[i] != NULL); + rel = partitioned_rels[i]; + Assert(RelationGetRelid(rel) == pinfo->reloid); partkey = RelationGetPartitionKey(rel); partdesc = RelationGetPartitionDesc(rel); @@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) prunestate->execparams = bms_add_members(prunestate->execparams, pinfo->execparams); - relation_close(rel, NoLock); - i++; } diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index b963cae730..58a0961eb1 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit) /* * ExecLockNonLeafAppendTables * - * Locks, if necessary, the tables indicated by the RT indexes contained in - * the partitioned_rels list. These are the non-leaf tables in the partition - * tree controlled by a given Append or MergeAppend node. + * Opens and/or locks, if necessary, the tables indicated by the RT indexes + * contained in the partitioned_rels list. These are the non-leaf tables in + * the partition tree controlled by a given Append or MergeAppend node. 
*/ void -ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate) +ExecLockNonLeafAppendTables(PlanState *planstate, + EState *estate, + List *partitioned_rels) { PlannedStmt *stmt = estate->es_plannedstmt; ListCell *lc; + int i; + /* +* If we're going to be performing pruning, allocate space for Relation +* pointers to be used later when setting up partition pruning state in +* ExecSetupPartitionPruneState. +*/ + if (IsA(planstate, AppendState)) + { + AppendState *appendstate = (AppendState *) planstate; + Append *node = (Append *) planstate->plan; + + if (node->part_prune_infos != NIL) + { + Assert(list_length(node->part_prune_infos) == + list_length(partitioned_rels)); + appendstate->partitioned_rels = (Relation *) +
Re: [HACKERS] Runtime Partition Pruning
On 13 April 2018 at 04:57, Robert Haas wrote: > BTW, looking at ExecSetupPartitionPruneState: > > /* > * Create a sub memory context which we'll use when making calls to > the > * query planner's function to determine which partitions will > match. The > * planner is not too careful about freeing memory, so we'll ensure we > * call the function in this context to avoid any memory leaking in > the > * executor's memory context. > */ > > This is a sloppy cut-and-paste job, not only because somebody changed > one copy of the word "planner" to "executor" and left the others > untouched, but also because the rationale isn't really correct for the > executor anyway, which has memory contexts all over the place and > frees them all the time. I don't know whether the context is not > needed at all or whether the context is needed but the rationale is > different, but I don't buy that explanation. The comment is written exactly as intended. Unsure which of the "planner"s you think should be "executor". The context is needed. I can easily produce an OOM without it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
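[Editor's note: for readers following along, the pattern David is defending can be sketched outside PostgreSQL. The toy arena below is a stand-in for a MemoryContext (all names are invented; the real API is MemoryContextCreate/MemoryContextSwitchTo/MemoryContextReset): a "leaky" routine allocates scratch memory it never frees, mirroring how the planner's pruning functions behave, and the caller reclaims everything with one cheap bulk reset per call instead of leaking across thousands of invocations.]

```c
#include <stdlib.h>

/* Toy stand-in for a PostgreSQL MemoryContext: a linked list of
 * allocations that can be freed wholesale with one call. */
typedef struct ArenaChunk { struct ArenaChunk *next; } ArenaChunk;
typedef struct Arena { ArenaChunk *head; } Arena;

void *arena_alloc(Arena *a, size_t size)
{
    ArenaChunk *c = malloc(sizeof(ArenaChunk) + size);
    c->next = a->head;
    a->head = c;
    return (char *) c + sizeof(ArenaChunk);
}

void arena_reset(Arena *a)
{
    while (a->head)
    {
        ArenaChunk *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}

/* A "leaky" routine in the style of the planner's pruning code: it
 * allocates scratch space and never frees it itself. */
int leaky_match_partitions(Arena *cxt, int nparts)
{
    int *scratch = arena_alloc(cxt, nparts * sizeof(int));
    int matches = 0;
    for (int i = 0; i < nparts; i++)
        if ((scratch[i] = i % 3) == 0)   /* pretend every 3rd partition matches */
            matches++;
    return matches;
}

/* The executor-side caller: run the leaky function inside the scratch
 * arena and reset it after every call, so repeated re-pruning during
 * execution cannot accumulate into an OOM. */
int run_pruning_in_scratch_context(int calls, int nparts)
{
    Arena scratch = {NULL};
    int total = 0;
    for (int i = 0; i < calls; i++)
    {
        total += leaky_match_partitions(&scratch, nparts);
        arena_reset(&scratch);           /* one bulk free per call */
    }
    return total;
}
```

Without the per-call reset, each of the (potentially millions of) re-prune calls would leave its scratch allocations behind, which is the OOM David reports being able to reproduce.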
Re: [HACKERS] Runtime Partition Pruning
On Thu, Apr 12, 2018 at 12:40 AM, David Rowley wrote: > I guess the problem there would be there's nothing to say that parse > analysis will shortly be followed by a call to the planner, and a call > to the planner does not mean the plan is about to be executed. So I > don't think it would be possible to keep pointers to relcache entries > between these modules, and it would be hard to determine whose > responsibility it would be to call relation_close(). Yeah, that's definitely a non-starter. > It might be possible to do something better in each module by keeping > an array indexed by RTI which have each entry NULL initially then on > first relation_open set the element in the array to that pointer. I'm not sure that makes a lot of sense in the planner, but in the executor it might be a good idea. See also https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com for related discussion. I think that a coding pattern where we rely on relation_open(..., NoLock) is inherently dangerous -- it's too easy to be wrong about whether the lock is sure to have been taken. It would be much better to open the relation once and hold onto the pointer, not just for performance reasons, but for robustness. BTW, looking at ExecSetupPartitionPruneState: /* * Create a sub memory context which we'll use when making calls to the * query planner's function to determine which partitions will match. The * planner is not too careful about freeing memory, so we'll ensure we * call the function in this context to avoid any memory leaking in the * executor's memory context. */ This is a sloppy cut-and-paste job, not only because somebody changed one copy of the word "planner" to "executor" and left the others untouched, but also because the rationale isn't really correct for the executor anyway, which has memory contexts all over the place and frees them all the time. 
I don't know whether the context is not needed at all or whether the context is needed but the rationale is different, but I don't buy that explanation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
On 11 April 2018 at 09:32, Alvaro Herrera wrote: > Robert Haas wrote: >> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera >> wrote: > >> >> I don't get this. The executor surely had to (and did) open all of >> >> the relations somewhere even before this patch. > >> > I was worried that this coding could be seen as breaking modularity, or >> > trying to do excessive work. However, after looking closer at it, it >> > doesn't really look like it's the case. So, nevermind. >> >> Well what I'm saying is that it shouldn't be necessary. If the >> relations are being opened already and the pointers to the relcache >> entries are being saved someplace, you shouldn't need to re-open them >> elsewhere to get pointers to the relcache entries. > > I looked a bit more into this. It turns out that we have indeed opened > the relation before -- first in parserOpenTable (for addRangeTableEntry), > then in expandRTE, then in QueryRewrite, then in subquery_planner, then > in get_relation_info. > > So, frankly, since each module thinks it's okay to open it every once in > a while, I'm not sure we should be terribly stressed about doing it once > more for partition pruning. Particularly since communicating the > pointer seems to be quite troublesome. I guess the problem there would be there's nothing to say that parse analysis will shortly be followed by a call to the planner, and a call to the planner does not mean the plan is about to be executed. So I don't think it would be possible to keep pointers to relcache entries between these modules, and it would be hard to determine whose responsibility it would be to call relation_close(). It might be possible to do something better in each module by keeping an array indexed by RTI which have each entry NULL initially then on first relation_open set the element in the array to that pointer. 
This might mean we'd save a few relation_open calls, but I don't know if there would be a way to somehow remove the Relation from the array on relation_close. Having something like this might mean we could detect lock upgrade hazards more easily, but the whole thing is a cache on top of a cache which does seem a bit weird. relation_open() should be pretty cheap if the relation is already open. It's just a hash table lookup. What is described above just changes that to an array lookup. It also does nothing for index_open. However, something like the above would simplify ExecLockNonLeafAppendTables() a bit and get rid of the O(N^2) which checks the partition is not a result relation. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
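[Editor's note: the RTI-indexed array David sketches in words might look like the following. This is a standalone illustration, not relcache code: Relation, rel_open_physical(), and the counter are invented stand-ins, and real relcache entries also need reference counting and lock tracking that this omits.]

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-ins for the relcache API, for illustration only. */
typedef struct Relation { unsigned oid; } Relation;

int physical_opens;          /* counts "real" opens, to show caching works */

Relation *rel_open_physical(unsigned oid)
{
    Relation *rel = malloc(sizeof(Relation));
    rel->oid = oid;
    physical_opens++;
    return rel;
}

/* One slot per range-table index (RTI), NULL until first opened. */
typedef struct RelCacheByRTI
{
    Relation **slots;        /* 1-based, like the range table itself */
    int        nslots;
} RelCacheByRTI;

RelCacheByRTI *cache_create(int nrels)
{
    RelCacheByRTI *c = malloc(sizeof(*c));
    c->nslots = nrels + 1;
    c->slots = calloc(c->nslots, sizeof(Relation *));
    return c;
}

/* Open-once semantics: the first caller pays for the open; later callers
 * (e.g. something like ExecSetupPartitionPruneState) just reuse the
 * pointer and never need a bare relation_open(..., NoLock). */
Relation *cache_open(RelCacheByRTI *c, int rti, unsigned oid)
{
    assert(rti > 0 && rti < c->nslots);
    if (c->slots[rti] == NULL)
        c->slots[rti] = rel_open_physical(oid);
    return c->slots[rti];
}
```

As David notes, this only turns a hash lookup into an array lookup, and closing a relation behind the array's back would leave a dangling slot, which is the "cache on top of a cache" weirdness he mentions.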
Re: [HACKERS] Runtime Partition Pruning
On 2018/04/11 6:32, Alvaro Herrera wrote: > Robert Haas wrote: >> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera >> wrote: > I don't get this. The executor surely had to (and did) open all of the relations somewhere even before this patch. > >>> I was worried that this coding could be seen as breaking modularity, or >>> trying to do excessive work. However, after looking closer at it, it >>> doesn't really look like it's the case. So, nevermind. >> >> Well what I'm saying is that it shouldn't be necessary. If the >> relations are being opened already and the pointers to the relcache >> entries are being saved someplace, you shouldn't need to re-open them >> elsewhere to get pointers to the relcache entries. > > I looked a bit more into this. It turns out that we have indeed opened > the relation before -- first in parserOpenTable (for addRangeTableEntry), > then in expandRTE, then in QueryRewrite, then in subquery_planner, then > in get_relation_info. > > So, frankly, since each module thinks it's okay to open it every once in > a while, I'm not sure we should be terribly stressed about doing it once > more for partition pruning. Particularly since communicating the > pointer seems to be quite troublesome. Maybe, Robert was saying somewhere in the executor itself, before ExecInitAppend, or more precisely, ExecSetupPartitionPruneState is called. But that doesn't seem to be the case. For the result relation case (insert/update/delete on a partitioned table), we don't need to do extra relation_opens, as InitPlan opens the needed relations when building the ResultRelInfo(s), from where they're later accessed, for example, in ExecInitModifyTable. However, in the Append/MergeAppend cases, we don't, at any earlier point in the executor, open the partitioned tables. InitPlan doesn't touch them. In ExecInitAppend, ExecLockNonLeafAppendTables that we call before calling ExecSetupPartitionPruneState does not open, just locks them using LockRelationOid. 
So, relation_open on partitioned tables in ExecSetupPartitionPruneState seems to be the first time they're opened *in the executor*. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
Robert Haas wrote: > On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera > wrote: > >> I don't get this. The executor surely had to (and did) open all of > >> the relations somewhere even before this patch. > > I was worried that this coding could be seen as breaking modularity, or > > trying to do excessive work. However, after looking closer at it, it > > doesn't really look like it's the case. So, nevermind. > > Well what I'm saying is that it shouldn't be necessary. If the > relations are being opened already and the pointers to the relcache > entries are being saved someplace, you shouldn't need to re-open them > elsewhere to get pointers to the relcache entries. I looked a bit more into this. It turns out that we have indeed opened the relation before -- first in parserOpenTable (for addRangeTableEntry), then in expandRTE, then in QueryRewrite, then in subquery_planner, then in get_relation_info. So, frankly, since each module thinks it's okay to open it every once in a while, I'm not sure we should be terribly stressed about doing it once more for partition pruning. Particularly since communicating the pointer seems to be quite troublesome. To figure out, I used the attached patch (not intended for application) to add a backtrace to each log message, plus a couple of accusatory elog() calls in relation_open and ExecSetupPartitionPruneState. 
-- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 5d682c8cb31f79c79c2ba4cafeb876f10b072cfc Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 10 Apr 2018 18:30:29 -0300 Subject: [PATCH] errbacktrace --- src/backend/utils/error/elog.c | 59 + src/include/postgres_ext.h | 1 + src/include/utils/elog.h| 3 ++ src/interfaces/libpq/fe-protocol3.c | 3 ++ 4 files changed, 66 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 16531f7a0f..e531d0009e 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -62,6 +62,7 @@ #ifdef HAVE_SYSLOG #include #endif +#include #include "access/transam.h" #include "access/xact.h" @@ -493,6 +494,8 @@ errfinish(int dummy,...) pfree(edata->hint); if (edata->context) pfree(edata->context); + if (edata->backtrace) + pfree(edata->backtrace); if (edata->schema_name) pfree(edata->schema_name); if (edata->table_name) @@ -811,6 +814,41 @@ errmsg(const char *fmt,...) 
return 0; /* return value does not matter */ } +#define BACKTRACE_FRAMES 100 +int +errbacktrace(void) +{ + ErrorData *edata = &errordata[errordata_stack_depth]; + MemoryContext oldcontext; + void *buf[BACKTRACE_FRAMES]; + int nframes; + char **strfrms; + StringInfoData errtrace; + int i; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + + nframes = backtrace(buf, BACKTRACE_FRAMES); + strfrms = backtrace_symbols(buf, nframes); + if (strfrms == NULL) + return 0; + + initStringInfo(&errtrace); + + /* the first frame is always errbacktrace itself, so skip it */ + for (i = 1; i < nframes; i++) + appendStringInfo(&errtrace, "%s\n", strfrms[i]); + free(strfrms); + + edata->backtrace = errtrace.data; + + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + + return 0; +} /* * errmsg_internal --- add a primary error message text to the current error @@ -1522,6 +1560,8 @@ CopyErrorData(void) newedata->hint = pstrdup(newedata->hint); if (newedata->context) newedata->context = pstrdup(newedata->context); + if (newedata->backtrace) + newedata->backtrace = pstrdup(newedata->backtrace); if (newedata->schema_name) newedata->schema_name = pstrdup(newedata->schema_name); if (newedata->table_name) @@ -1560,6 +1600,8 @@ FreeErrorData(ErrorData *edata) pfree(edata->hint); if (edata->context) pfree(edata->context); + if (edata->backtrace) + pfree(edata->backtrace); if (edata->schema_name) pfree(edata->schema_name); if (edata->table_name) @@ -1635,6 +1677,8 @@ ThrowErrorData(ErrorData *edata) newedata->hint = pstrdup(edata->hint); if (edata->context) newedata->context = pstrdup(edata->context); + if (edata->backtrace) + newedata->backtrace = pstrdup(newedata->backtrace); /* assume message_id is not available */ if (edata->schema_name) newedata->schema_name = pstrdup(edata->s
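[Editor's note: the core of Alvaro's throwaway errbacktrace() hack is the glibc pair backtrace()/backtrace_symbols() from <execinfo.h>. A minimal standalone version of that capture-and-symbolize step, stripped of all elog machinery, is below; it is glibc-specific (symbol names are best with -rdynamic), and the frame count and skip-first-frame behavior follow the patch above.]

```c
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define BACKTRACE_FRAMES 100

/* Capture up to BACKTRACE_FRAMES return addresses for the current stack
 * and print their symbolized forms; returns the number of frames seen. */
int print_backtrace(void)
{
    void  *buf[BACKTRACE_FRAMES];
    int    nframes = backtrace(buf, BACKTRACE_FRAMES);
    char **strfrms = backtrace_symbols(buf, nframes);

    if (strfrms == NULL)        /* symbolization can fail on OOM */
        return nframes;

    /* The first frame is print_backtrace() itself, so skip it,
     * exactly as the patch skips the errbacktrace frame. */
    for (int i = 1; i < nframes; i++)
        printf("%s\n", strfrms[i]);

    free(strfrms);              /* one free() covers the whole array */
    return nframes;
}
```

Note the memory discipline mirrored from the patch: backtrace_symbols() returns a single malloc'd block, so only the array pointer is freed, never the individual strings.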
Re: [HACKERS] Runtime Partition Pruning
Robert Haas wrote: > On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera > wrote: > > Robert Haas wrote: > >> I don't get this. The executor surely had to (and did) open all of > >> the relations somewhere even before this patch. > > > > I was worried that this coding could be seen as breaking modularity, or > > trying to do excessive work. However, after looking closer at it, it > > doesn't really look like it's the case. So, nevermind. > > Well what I'm saying is that it shouldn't be necessary. If the > relations are being opened already and the pointers to the relcache > entries are being saved someplace, you shouldn't need to re-open them > elsewhere to get pointers to the relcache entries. Oh, okay. I can look into that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera >> wrote: >> > I had reservations about a relation_open() in the new executor code. It >> > seemed a bit odd; we don't have any other relation_open in the executor >> > anywhere. However, setting up the pruneinfo needs some stuff from >> > relcache that I don't see a reasonable mechanism to pass through >> > planner. I asked Andres about it on IM and while he didn't endorse the >> > patch in any way, his quick opinion was that "it wasn't entirely >> > insane". I verified that we already hold lock on the relation. >> >> I don't get this. The executor surely had to (and did) open all of >> the relations somewhere even before this patch. > > Yeah. > > I was worried that this coding could be seen as breaking modularity, or > trying to do excessive work. However, after looking closer at it, it > doesn't really look like it's the case. So, nevermind. Well what I'm saying is that it shouldn't be necessary. If the relations are being opened already and the pointers to the relcache entries are being saved someplace, you shouldn't need to re-open them elsewhere to get pointers to the relcache entries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
Robert Haas wrote: > On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera > wrote: > > I had reservations about a relation_open() in the new executor code. It > > seemed a bit odd; we don't have any other relation_open in the executor > > anywhere. However, setting up the pruneinfo needs some stuff from > > relcache that I don't see a reasonable mechanism to pass through > > planner. I asked Andres about it on IM and while he didn't endorse the > > patch in any way, his quick opinion was that "it wasn't entirely > > insane". I verified that we already hold lock on the relation. > > I don't get this. The executor surely had to (and did) open all of > the relations somewhere even before this patch. Yeah. I was worried that this coding could be seen as breaking modularity, or trying to do excessive work. However, after looking closer at it, it doesn't really look like it's the case. So, nevermind. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera wrote: > I had reservations about a relation_open() in the new executor code. It > seemed a bit odd; we don't have any other relation_open in the executor > anywhere. However, setting up the pruneinfo needs some stuff from > relcache that I don't see a reasonable mechanism to pass through > planner. I asked Andres about it on IM and while he didn't endorse the > patch in any way, his quick opinion was that "it wasn't entirely > insane". I verified that we already hold lock on the relation. I don't get this. The executor surely had to (and did) open all of the relations somewhere even before this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
On 8 April 2018 at 09:13, Alvaro Herrera wrote: > I pushed this patch -- 0001, 0002 and 0003 only. I did not include > anything from 0004 and 0005; I didn't even get to the point of reading > them, so that I could focus on the first part. Oh great! Thank you for working on this and pushing it, especially so during your weekend. > While we didn't get fast pruning support for MergeAppend or the > DELETE/UPDATE parts, I think those are valuable and recommend to > resubmit those for PG12. Thanks. I'll certainly be doing that. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
I pushed this patch -- 0001, 0002 and 0003 only. I did not include anything from 0004 and 0005; I didn't even get to the point of reading them, so that I could focus on the first part. I did not find anything to complain about. I made a few adjustments and asked David to supply a paragraph for perform.sgml (the "Using EXPLAIN" section) which is included here. I also adopted Jesper's (actually David's) suggestion of changing "Partitions Pruned" to "Partitions Removed" in the EXPLAIN output. I had reservations about a relation_open() in the new executor code. It seemed a bit odd; we don't have any other relation_open in the executor anywhere. However, setting up the pruneinfo needs some stuff from relcache that I don't see a reasonable mechanism to pass through planner. I asked Andres about it on IM and while he didn't endorse the patch in any way, his quick opinion was that "it wasn't entirely insane". I verified that we already hold lock on the relation. While we didn't get fast pruning support for MergeAppend or the DELETE/UPDATE parts, I think those are valuable and recommend to resubmit those for PG12. Thank you! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi, On 04/07/2018 04:45 AM, David Rowley wrote: > Ok, so I've gone and done this. > > PartitionPruning has become PartitionPruneState > PartitionRelPruning has become PartitionPruningData > > I've changed pointers to PartitionPruneStates to be named prunestate, > sometimes having the node prefix; as_, ma_, in these cases prune and state > are separated with a _ which seems to be the general rule for executor > state struct members. Generally, pointers to PartitionPruningData are now > named pprune. Hopefully, that's ok, as this was the name previously used > for PartitionPruning pointers. > > I applied the patch to get rid of as_noop_scan in favour of using a special > as_whichplan value. There was already one special value > (INVALID_SUBPLAN_INDEX), so seemed better to build on that rather than > inventing something new. This also means we don't have to make the > AppendState struct any wider, which seems like a good thing to try to do. > > I made all the fixups which I mentioned in my review earlier and also > re-removed the resultRelation parameter from make_partition_pruneinfo. It > sneaked back into v22. > > v23 is attached. Passes check-world. Changing explain.c to "Subplans Removed" as suggested by you in [1] is a good idea. [1] https://www.postgresql.org/message-id/CAKJS1f99JnkbOshdV_4zoJZ96DPtKeHMHv43JRL_ZdHRkkVKCA%40mail.gmail.com Best regards, Jesper
Re: [HACKERS] Runtime Partition Pruning
David Rowley wrote: > On 8 April 2018 at 01:18, Alvaro Herrera wrote: > > Amit had it as "indexes" also in his original. I wanted to avoid using > > the "indexes" word alone, whose meaning is so overloaded. > > hmm, good point. > > > How about this? > > "... which are then executed to produce a set of partitions (as indexes > > of resultRelInfo->part_rels array) that satisfy the constraints in the > > step". > > Works for me, but with RelOptInfo rather than resultRelInfo. Oops, sorry about that. Pushed now, adding one line to get_matching_partitions also. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 8 April 2018 at 01:18, Alvaro Herrera wrote: > Amit had it as "indexes" also in his original. I wanted to avoid using > the "indexes" word alone, whose meaning is so overloaded. hmm, good point. > How about this? > "... which are then executed to produce a set of partitions (as indexes > of resultRelInfo->part_rels array) that satisfy the constraints in the > step". Works for me, but with RelOptInfo rather than resultRelInfo. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
David Rowley wrote: > It's not exactly wrong but: > > + * are turned into a set of "pruning steps", which are then executed to > + * produce a set of RTIs of partitions whose bounds satisfy the constraints in > + * the step. Partitions not in the set are said to have been pruned. > > It's only prune_append_rel_partitions which is only used for the > planner's pruning needs that converts the partition indexes to RTIs. > Would it be better to mention that the output is partition indexes? > Maybe: > > "which are then executed to produce a set of partition indexes whose > bounds satisfy the constraints in the step. These partition indexes > may then be translated into RTIs", or maybe even not mention the RTIs. Amit had it as "indexes" also in his original. I wanted to avoid using the "indexes" word alone, whose meaning is so overloaded. How about this? "... which are then executed to produce a set of partitions (as indexes of resultRelInfo->part_rels array) that satisfy the constraints in the step". Maybe "the boundinfo array" instead of part_rels, which as I understand also uses the same indexing as the other array, and partprune mostly works based on boundinfo anyway? Not mentioning RTIs seems fine. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 8 April 2018 at 00:23, Alvaro Herrera wrote: > I edited it as attached, to 1. avoid mentioning functionality that > doesn't yet exist, and 2. avoid excessive internal detail (we want a > high-level overview here), which from experience gets outdated pretty > quickly. It's not exactly wrong but: + * are turned into a set of "pruning steps", which are then executed to + * produce a set of RTIs of partitions whose bounds satisfy the constraints in + * the step. Partitions not in the set are said to have been pruned. It's only prune_append_rel_partitions which is only used for the planner's pruning needs that converts the partition indexes to RTIs. Would it be better to mention that the output is partition indexes? Maybe: "which are then executed to produce a set of partition indexes whose bounds satisfy the constraints in the step. These partition indexes may then be translated into RTIs", or maybe even not mention the RTIs. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Amit Langote wrote: > See if the attached makes it any better. > > Now I know we don't have the runtime pruning in yet, but since the > proposed patch would extend its functionality I have included its > description in the comment. Thanks! I edited it as attached, to 1. avoid mentioning functionality that doesn't yet exist, and 2. avoid excessive internal detail (we want a high-level overview here), which from experience gets outdated pretty quickly. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services commit 28768290d8d9c58e76e2594a14d5b1f8465ba262 Author: Alvaro Herrera AuthorDate: Sat Apr 7 08:44:12 2018 -0300 CommitDate: Sat Apr 7 09:20:58 2018 -0300 Add some documentation to partprune.c Author: Amit Langote Discussion: https://postgr.es/m/CA+HiwqGzq4D6z=8r0ap+xhbtfcq-4ct+t2ekqje9fpm84_j...@mail.gmail.com diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index 959ee1643d..07b8057e3f 100644 --- a/src/backend/partitioning/partprune.c +++ b/src/backend/partitioning/partprune.c @@ -1,10 +1,27 @@ /*- * * partprune.c - * Parses clauses attempting to match them up to partition keys of a - * given relation and generates a set of "pruning steps", which can be - * later "executed" either from the planner or the executor to determine - * the minimum set of partitions which match the given clauses. + * Support for partition pruning during query planning + * + * This module implements partition pruning using the information contained in + * table's partition descriptor and query clauses. + * + * During planning, clauses that can be matched to the table's partition key + * are turned into a set of "pruning steps", which are then executed to + * produce a set of RTIs of partitions whose bounds satisfy the constraints in + * the step. Partitions not in the set are said to have been pruned.
+ * + * There are two kinds of pruning steps: a "base" pruning step, which contains + * information extracted from one or more clauses that are matched to the + * (possibly multi-column) partition key, such as the expressions whose values + * to match against partition bounds and operator strategy to associate to + * each expression. The other kind is a "combine" pruning step, which combines + * the outputs of some other steps using the appropriate combination method. + * All steps that are constructed are executed in succession such that for any + * "combine" step, all of the steps whose output it depends on are executed + * first and their output preserved. + * + * See gen_partprune_steps_internal() for more details on step generation. * * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California
Re: [HACKERS] Runtime Partition Pruning
On Sat, Apr 7, 2018 at 1:58 PM, Amit Langote wrote: > On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund wrote: >> I've not followed this thread/feature at all, but I don't find the >> comments atop partprune.c even remotely sufficient. Unless there's an >> README hidden or such hidden somewhere? > > Sorry there isn't a README and I agree partprune.c's header comment > could be improved quite a bit. > > Just to be clear, that's the fault of the patch that was already > committed earlier today (9fdb675fc "Faster partition pruning"), not > this patch, which just extends partition.c's functionality to > implement additional planner and executor support for runtime pruning. > > I'm drafting a patch that expands the partprune.c comment and will post > shortly. See if the attached makes it any better. Now I know we don't have the runtime pruning in yet, but since the proposed patch would extend its functionality I have included its description in the comment. Thanks, Amit expand-partprune-header-comment.patch Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
On 2018-04-07 16:58:01 +1200, David Rowley wrote: > On 7 April 2018 at 16:31, Andres Freund wrote: > > I've not followed this thread/feature at all, but I don't find the > > comments atop partprune.c even remotely sufficient. Unless there's an > > README hidden or such hidden somewhere? > > There's not a README file. The comments for partprune.c, do you want > this to explain more about how partition pruning works or how this > patch uses the existing code? Primarily the first. This isn't trivial straightforward code. > Probably if we need to explain more there about how pruning works then > it should be a fixup patch to 9fdb675fc, no? Yea, it's about that. Sorry for accidentally jumping on the wrong thread. Greetings, Andres Freund
Re: [HACKERS] Runtime Partition Pruning
On Sat, Apr 7, 2018 at 1:58 PM, David Rowley wrote: > Probably if we need to explain more there about how pruning works then > it should be a fixup patch to 9fdb675fc, no? Yes, I just replied and working on a patch. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund wrote: > On 2018-04-07 13:26:51 +0900, Amit Langote wrote: >> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley >> wrote: >> > Everything else looks fine from my point of view. >> >> Me too, although I still think having struct names PartitionPruning >> and PartitionRelPruning is going to be a bit confusing. We should >> think about naming the latter to something else. I suggested >> PartitionPruningDispatch(Data), but David doesn't seem to like it. >> Maybe, PartitionPruneState, because it parallels the >> PartitionPruneInfo that comes from the planner for every partitioned >> table in the tree. > > I've not followed this thread/feature at all, but I don't find the > comments atop partprune.c even remotely sufficient. Unless there's an > README hidden or such hidden somewhere? Sorry there isn't a README and I agree partprune.c's header comment could be improved quite a bit. Just to be clear, that's the fault of the patch that was already committed earlier today (9fdb675fc "Faster partition pruning"), not this patch, which just extends partition.c's functionality to implement additional planner and executor support for runtime pruning. I'm drafting a patch that expands the partprune.c comment and will post shortly. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
On 7 April 2018 at 16:31, Andres Freund wrote: > I've not followed this thread/feature at all, but I don't find the > comments atop partprune.c even remotely sufficient. Unless there's an > README hidden or such hidden somewhere? There's not a README file. The comments for partprune.c, do you want this to explain more about how partition pruning works or how this patch uses the existing code? Probably if we need to explain more there about how pruning works then it should be a fixup patch to 9fdb675fc, no? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 7 April 2018 at 16:26, Amit Langote wrote: > Maybe, PartitionPruneState, because it parallels the > PartitionPruneInfo that comes from the planner for every partitioned > table in the tree. I like that. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 2018-04-07 13:26:51 +0900, Amit Langote wrote: > On Sat, Apr 7, 2018 at 11:26 AM, David Rowley > wrote: > > Everything else looks fine from my point of view. > > Me too, although I still think having struct names PartitionPruning > and PartitionRelPruning is going to be a bit confusing. We should > think about naming the latter to something else. I suggested > PartitionPruningDispatch(Data), but David doesn't seem to like it. > Maybe, PartitionPruneState, because it parallels the > PartitionPruneInfo that comes from the planner for every partitioned > table in the tree. I've not followed this thread/feature at all, but I don't find the comments atop partprune.c even remotely sufficient. Unless there's an README hidden or such hidden somewhere? Greetings, Andres Freund
Re: [HACKERS] Runtime Partition Pruning
On Sat, Apr 7, 2018 at 11:26 AM, David Rowley wrote: > Everything else looks fine from my point of view. Me too, although I still think having struct names PartitionPruning and PartitionRelPruning is going to be a bit confusing. We should think about naming the latter to something else. I suggested PartitionPruningDispatch(Data), but David doesn't seem to like it. Maybe, PartitionPruneState, because it parallels the PartitionPruneInfo that comes from the planner for every partitioned table in the tree. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
On 7 April 2018 at 12:03, David Rowley wrote: > Continuing to read 0003 and 0004 now. 0003: 1. "setup" -> "set" /* If run-time partition pruning is enabled, then setup that up now */ 2. We should be able to get rid of as_noopscan and just have another special negative value for as_whichplan. I've attached a patch to do this. 3. I've forgotten to drop table boolvalues in the tests. Patched attached to fix. 0004: 1. "ms_valid_subplans" -> "valid_subplans" in: * ms_valid_subplans for runtime pruning, valid mergeplans indexes to * scan. All the other fields are not being prefixed with ms_ in these comments. Everything else looks fine from my point of view. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services nodeAppend_get_rid_of_as_noopscan.patch Description: Binary data drop_table_boolvalues.patch Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
On 7 April 2018 at 10:45, David Rowley wrote: > I'm looking over the rebased patches now. I've made a complete read of 0001 and 0002 so far. Your rebase looks fine. After the complete read, I only have the following comments: 0001: 1. missing "the" before "partition key": * Extract Params matching partition key and record if we got any. 2. Is this the property name we're going to stick with: ExplainPropertyInteger("Subplans Pruned", NULL, nplans - nsubnodes, es); Other ideas are: "Subplans Removed" 3. In the following comment I've used the word "hierarchy", but maybe we need to add the word "flattened" before it. * PartitionPruning - Encapsulates a hierarchy of PartitionRelPruning 4. Comment mentions "after init plan", but really we can only know the value of an exec param during actual execution. So: * Parameters that are safe to be used for partition pruning. execparams * are not safe to use until after init plan. maybe better as: * Parameters that are safe to be used for partition pruning. execparams * are not safe to use until the executor is running. 0002: Looks fine. But if I was committing this, to give me confidence, I'd want to know how the left_most_one table was generated. I used:

#include <stdio.h>

int main(void)
{
	int i = 1;

	printf("0, ");
	while (i < 256)
	{
		printf("%d, ", 31 - __builtin_clz(i));
		if ((i & 0xf) == 0xf)
			putchar('\n');
		i++;
	}
	return 0;
}

Continuing to read 0003 and 0004 now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 7 April 2018 at 09:29, Alvaro Herrera wrote: > Alvaro Herrera wrote: >> I rebased this series on top of the committed version of the other patch. >> Here's v22, with no other changes than rebasing. I did not include >> 0005, though. > > Apologies, I forgot to "git add" one fixup for 0001. 0003 I think. I'm looking over the rebased patches now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Alvaro Herrera wrote: > I rebased this series on top of the committed version of the other patch. > Here's v22, with no other changes than rebasing. I did not include > 0005, though. Apologies, I forgot to "git add" one fixup for 0001. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 093ceaa867..aac05917ee 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -29,7 +29,6 @@ #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" -#include "optimizer/partprune.h" #include "optimizer/paths.h" #include "optimizer/placeholder.h" #include "optimizer/plancat.h" @@ -42,6 +41,7 @@ #include "optimizer/var.h" #include "parser/parse_clause.h" #include "parser/parsetree.h" +#include "partitioning/partprune.h" #include "utils/lsyscache.h"
Re: [HACKERS] Runtime Partition Pruning
Hi David, On 2018/04/06 12:27, David Rowley wrote: > (sending my reply in parts for concurrency) > > On 6 April 2018 at 14:39, Amit Langote wrote: >> I think we can save some space here by not having the pointers stored >> here. Instead of passing the pointer itself in the recursive calls to >> find_subplans_for_extparams_recurse, et al, just pass the entire array and >> an offset to use for the given recursive invocation. > > hmm, where would those offsets be stored? I don't want to have to do > any linear searching to determine the offset, which is why I just > stored the pointer to the flat array. It seems very efficient to me to > do this. Remember that this pruning can occur per tuple in cases like > parameterized nested loops. > > Are you worried about memory consumption? It's one pointer per > partition. I imagine there's lots more allocated for DML on a > partitioned table as it needs to store maps to map attribute numbers. > > Or are you thinking the saving of storing an array of 32-bit int > values is better than the array of probably 64-bit pointers? So > requires half the space? Yeah, just copy it from the PartitionPruneInfo like you're doing for subnodeindex: memcpy(partrelprune->subpartindex, pinfo->subpartindex, sizeof(int) * pinfo->nparts); Instead I see ExecSetupPartitionPruning is now doing this: /* * Setup the PartitionedRelPruning's subpartprune so that we can * quickly find sub-PartitionedRelPruning details for any * sub-partitioned tables that this partitioned table contains. * We need to be able to find these quickly during our recursive * search to find all matching subnodes. 
*/ for (j = 0; j < pinfo->nparts; j++) { int subpartidx = pinfo->subpartindex[j]; Assert(subpartidx < list_length(partitionpruneinfo)); if (subpartidx >= 0) partrelprune->subpartprune[j] = &partrelprunes[subpartidx]; else partrelprune->subpartprune[j] = NULL; } With that in place, pass the index/offset instead of the pointer to the next recursive invocation of find_subplans_*, along with the array containing all PartitionedRelPruning's. So, where you have in each of find_subplans_*: if (partrelprune->subnodeindex[i] >= 0) *validsubplans = bms_add_member(*validsubplans, partrelprune->subnodeindex[i]); else if (partrelprune->subpartprune[i] != NULL) find_subplans_for_allparams_recurse(partrelprune->subpartprune[i], validsubplans); I'm proposing that you do: if (partrelprune->subnodeindex[i] >= 0) *validsubplans = bms_add_member(*validsubplans, partrelprune->subnodeindex[i]); else if (partrelprune->subpartindex[i] >= 0) find_subplans_for_allparams_recurse(all_partrelprunes, partrelprune->subpartindex[i], validsubplans); And at the top of each of find_subplans_*: ParitionedRelPruning *partrelprune = all_partrelprunes[offset]; where the signature is: static void find_subplans_for_allparams_recurse( PartitionRelPruning **all_partrelprune, int offset, Bitmapset **validsubplans) The all_partrelprune above refers to the flat array in PartitionPruning. On the first call from ExecFindMatchingSubPlans, et al, you'd pass 0 for offset to start pruning with the root parent's PartitionedRelPruning. All the values contained in subnodeindex and subpartindex are indexes into the global array (whole-tree that is) anyway and that fact would be more apparent if we use this code structure, imho. >> If you look at ExecFindPartition used for tuple routing, you may see that >> there no recursion at all. Similarly find_subplans_for_extparams_recurse, >> et al might be able to avoid recursion if similar tricks are used. > > That seems pretty different. 
That's looking for a single node in a > tree, so just is following a single path from the root, it never has > to go back up a level and look down any other paths. > > What we need for the find_subplans_for_extparams_recurse is to find > all nodes in the entire tree which match the given clause. Doing this > without recursion would require some sort of stack so we can go back > up a level and search again down another branch. There are ways > around this without using recursion, sure, but I don't think any of > them will be quite as convenient and simple. The best I can think of > is to palloc some stack manually and use some depth_level to track > which element to use. An actual stack seems more simple. I can't > quite think of a good way to know in advance how many elements we'd > need to palloc. Hmm, yeah. I just remembered that I had to give up suggesting this a while back on this thread. So, okay, you don't need to do a
Re: [HACKERS] Runtime Partition Pruning
(sending my reply in parts for concurrency) On 6 April 2018 at 14:39, Amit Langote wrote: > I think we can save some space here by not having the pointers stored > here. Instead of passing the pointer itself in the recursive calls to > find_subplans_for_extparams_recurse, et al, just pass the entire array and > an offset to use for the given recursive invocation. hmm, where would those offsets be stored? I don't want to have to do any linear searching to determine the offset, which is why I just stored the pointer to the flat array. It seems very efficient to me to do this. Remember that this pruning can occur per tuple in cases like parameterized nested loops. Are you worried about memory consumption? It's one pointer per partition. I imagine there's lots more allocated for DML on a partitioned table as it needs to store maps to map attribute numbers. Or are you thinking the saving of storing an array of 32-bit int values is better than the array of probably 64-bit pointers? So requires half the space? > If you look at ExecFindPartition used for tuple routing, you may see that > there no recursion at all. Similarly find_subplans_for_extparams_recurse, > et al might be able to avoid recursion if similar tricks are used. That seems pretty different. That's looking for a single node in a tree, so just is following a single path from the root, it never has to go back up a level and look down any other paths. What we need for the find_subplans_for_extparams_recurse is to find all nodes in the entire tree which match the given clause. Doing this without recursion would require some sort of stack so we can go back up a level and search again down another branch. There are ways around this without using recursion, sure, but I don't think any of them will be quite as convenient and simple. The best I can think of is to palloc some stack manually and use some depth_level to track which element to use. An actual stack seems more simple. 
I can't quite think of a good way to know in advance how many elements we'd need to palloc. > Finally about having two different functions for different sets of Params: > can't we have just named find_subplans_for_params_recurse() and use the > appropriate one based on the value of some parameter? I can't help but > notice the duplication of code. I had decided not to make this one function previously as I didn't really want to add unnecessary branching in the code. After implementing it, it does not look as bad as I thought. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi David. On 2018/04/05 22:41, David Rowley wrote: >> * make_partition_pruneinfo has a parameter resultRelations that's not used >> anywhere > > It gets used in 0005. > > I guess I could only add it in 0005, but make_partition_pruneinfo is > only used in 0003, so you could say the same about that entire > function. > > Do you think I should delay adding that parameter until the 0005 patch? Yes, I think. >> * In make_partition_pruneinfo() >> >> allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size); >> allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size); >> >> I think these arrays need to have root->simple_rel_array_size + 1 >> elements, because they're subscripted using RT indexes which start at 1. > > RT indexes are always 1-based. See setup_simple_rel_arrays. It already > sets the array size to list_length(rtable) + 1. Oh, I missed that simple_rel_array_size itself is set to consider 1-based RT indexes. relnode.c:73 root->simple_rel_array_size = list_length(root->parse->rtable) + 1; >> * Also in make_partition_pruneinfo() >> >> /* Initialize to -1 to indicate the rel was not found */ >> for (i = 0; i < root->simple_rel_array_size; i++) >> { >> allsubnodeindex[i] = -1; >> allsubpartindex[i] = -1; >> } >> >> Maybe, allocate the arrays above mentioned using palloc0 and don't do this >> initialization. Instead make the indexes that are stored in these start >> with 1 and consider 0 as invalid entries. > > 0 is a valid subplan index. It is possible to make this happen, but > I'd need to subtract 1 everywhere I used the map. That does not seem > very nice. Seems more likely to result in bugs where we might forget > to do the - 1. You can subtract 1 right here in make_partition_pruneinfo before setting the values in PartitionPruneInfo's subplan_indexes and parent_indexes. I'm only proposing to make make_partition_pruneinfo() a bit faster by not looping over both the local map arrays setting them to -1. 
IOW, I'm not saying that we emit PartitionPruneInfo nodes that contain 1-based indexes. > Did you want this because you'd rather have the palloc0() than the for > loop setting the array elements to -1? Or is there another reason? Yeah, that's it. >> * Instead of nesting PartitionedRelPruning inside another, just store them >> in a global flat array in the PartitionPruning, like PartitionTupleRouting >> does for PartitionDispatch of individual partitioned tables in the tree. >> >> typedef struct PartitionedRelPruning >> { >> int nparts; >> int*subnodeindex; >> struct PartitionedRelPruning **subpartprune; > > There is a flat array in PartitionPruning. subpartprune contains > pointers into that array. I want to have this pointer array so I can > directly reference the flat array in PartitionPruning. > > Maybe I've misunderstood what you mean here. I think we can save some space here by not having the pointers stored here. Instead of passing the pointer itself in the recursive calls to find_subplans_for_extparams_recurse, et al, just pass the entire array and an offset to use for the given recursive invocation. If you look at ExecFindPartition used for tuple routing, you may see that there no recursion at all. Similarly find_subplans_for_extparams_recurse, et al might be able to avoid recursion if similar tricks are used. Finally about having two different functions for different sets of Params: can't we have just named find_subplans_for_params_recurse() and use the appropriate one based on the value of some parameter? I can't help but notice the duplication of code. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
Hi David, First of all: Solid patch set with good documentation. On 04/05/2018 09:41 AM, David Rowley wrote: Seems mostly fair. I'm not a fan of using the term "unpruned" though. I'll have a think. The "all" is meant in terms of what exists as subnodes. 'included_parts' / 'excluded_parts' probably isn't better... subplan_indexes and parent_indexes seem like better names, I agree. More clear. * Also in make_partition_pruneinfo() /* Initialize to -1 to indicate the rel was not found */ for (i = 0; i < root->simple_rel_array_size; i++) { allsubnodeindex[i] = -1; allsubpartindex[i] = -1; } Maybe, allocate the arrays above mentioned using palloc0 and don't do this initialization. Instead make the indexes that are stored in these start with 1 and consider 0 as invalid entries. 0 is a valid subplan index. It is possible to make this happen, but I'd need to subtract 1 everywhere I used the map. That does not seem very nice. Seems more likely to result in bugs where we might forget to do the - 1. Did you want this because you'd rather have the palloc0() than the for loop setting the array elements to -1? Or is there another reason? I think that doing palloc0 would be confusing; -1 is more clear, especially since it is used with 'allpartindexes' which is a Bitmapset. Doing the variable renames as Amit suggests would be good. I ran some tests (v50_v20) (make check-world passes), w/ and w/o choose_custom_plan() being false, and seeing good performance results without running into issues. Maybe 0005 should be expanded in partition_prune.sql with the supported cases to make those more clear. Thanks ! Best regards, Jesper
Re: [HACKERS] Runtime Partition Pruning
Hi Amit, Thanks for having a look at this. On 6 April 2018 at 00:54, Amit Langote wrote: > I looked at it and here are my thoughts on it after having for the first > time looked very closely at it. > > * Regarding PartitionPruneInfo: > > I think the names of the fields could be improved -- like pruning_steps > instead prunesteps, unpruned_parts instead of allpartindexs. The latter > is even a bit misleading because it doesn't in fact contain *all* > partition indexes, only those that are unpruned, because each either has a > subpath or it appears in (unpruned) partitioned_rels list. Also, I didn't > like the name subnodeindex and subpartindex very much. subplan_indexes > and parent_indexes would sound more informative to me. Seems mostly fair. I'm not a fan of using the term "unpruned" though. I'll have a think. The "all" is meant in terms of what exists as subnodes. subplan_indexes and parent_indexes seem like better names, I agree. > * make_partition_pruneinfo has a parameter resultRelations that's not used > anywhere It gets used in 0005. I guess I could only add it in 0005, but make_partition_pruneinfo is only used in 0003, so you could say the same about that entire function. Do you think I should delay adding that parameter until the 0005 patch? > * In make_partition_pruneinfo() > > allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size); > allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size); > > I think these arrays need to have root->simple_rel_array_size + 1 > elements, because they're subscripted using RT indexes which start at 1. RT indexes are always 1-based. See setup_simple_rel_arrays. It already sets the array size to list_length(rtable) + 1. 
> * Also in make_partition_pruneinfo() > > /* Initialize to -1 to indicate the rel was not found */ > for (i = 0; i < root->simple_rel_array_size; i++) > { > allsubnodeindex[i] = -1; > allsubpartindex[i] = -1; > } > > Maybe, allocate the arrays above mentioned using palloc0 and don't do this > initialization. Instead make the indexes that are stored in these start > with 1 and consider 0 as invalid entries. 0 is a valid subplan index. It is possible to make this happen, but I'd need to subtract 1 everywhere I used the map. That does not seem very nice. Seems more likely to result in bugs where we might forget to do the - 1. Did you want this because you'd rather have the palloc0() than the for loop setting the array elements to -1? Or is there another reason? > * Regarding the code added in execPartition.c and execPartition.h: > > I wondered why PartitionedRelPruning is named the way it is. I saw many > parallels with PartitionDispatchData both in terms of the main thing it > consists of, that is, the map that translates partition indexes as in > partition descriptor to that of subplans or of some other executor > structure. Also, I imagine you tried to mimic PartitionTupleRouting with > PartitionPruning but not throughout. For example, tuple routing struct > pointer variables are throughout called proute, whereas PartitionPruning > ones are called partprune instead of, say, pprune. Consistency would > help, imho. Yes, I saw similarities and did end up moving all the code into execPartition a while back. I'll look into this renaming. > * Instead of nesting PartitionedRelPruning inside another, just store them > in a global flat array in the PartitionPruning, like PartitionTupleRouting > does for PartitionDispatch of individual partitioned tables in the tree. > > typedef struct PartitionedRelPruning > { > int nparts; > int*subnodeindex; > struct PartitionedRelPruning **subpartprune; There is a flat array in PartitionPruning. 
subpartprune contains pointers into that array. I want to have this pointer array so I can directly reference the flat array in PartitionPruning. Maybe I've misunderstood what you mean here. > * I don't see why there needs to be nparts in the above, because it > already has a PartitionPruneContext member which has that information. Good point. I'll remove that. > In fact, I made most of changes myself while going through the code. > Please see attached the delta patch. It also tweaks quite a few other > things including various comments. I think parts of it apply to 0001, > 0003, and 0004 patches. See if this looks good to you. Thanks. I'll look. It's late over this side now, so will look tomorrow. Thanks again for reviewing this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 2018/04/05 12:14, Amit Langote wrote: > I will post comments on your v19 later today. I looked at it and here are my thoughts on it after having for the first time looked very closely at it. * Regarding PartitionPruneInfo: I think the names of the fields could be improved -- like pruning_steps instead prunesteps, unpruned_parts instead of allpartindexs. The latter is even a bit misleading because it doesn't in fact contain *all* partition indexes, only those that are unpruned, because each either has a subpath or it appears in (unpruned) partitioned_rels list. Also, I didn't like the name subnodeindex and subpartindex very much. subplan_indexes and parent_indexes would sound more informative to me. * make_partition_pruneinfo has a parameter resultRelations that's not used anywhere * In make_partition_pruneinfo() allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size); allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size); I think these arrays need to have root->simple_rel_array_size + 1 elements, because they're subscripted using RT indexes which start at 1. * Also in make_partition_pruneinfo() /* Initialize to -1 to indicate the rel was not found */ for (i = 0; i < root->simple_rel_array_size; i++) { allsubnodeindex[i] = -1; allsubpartindex[i] = -1; } Maybe, allocate the arrays above mentioned using palloc0 and don't do this initialization. Instead make the indexes that are stored in these start with 1 and consider 0 as invalid entries. * Regarding the code added in execPartition.c and execPartition.h: I wondered why PartitionedRelPruning is named the way it is. I saw many parallels with PartitionDispatchData both in terms of the main thing it consists of, that is, the map that translates partition indexes as in partition descriptor to that of subplans or of some other executor structure. Also, I imagine you tried to mimic PartitionTupleRouting with PartitionPruning but not throughout. 
For example, tuple routing struct pointer variables are called proute throughout, whereas PartitionPruning ones are called partprune instead of, say, pprune. Consistency would help, imho.

* Instead of nesting PartitionedRelPruning structs inside one another, just store them in a global flat array in the PartitionPruning, like PartitionTupleRouting does for the PartitionDispatch of individual partitioned tables in the tree.

typedef struct PartitionedRelPruning
{
    int         nparts;
    int        *subnodeindex;
    struct PartitionedRelPruning **subpartprune;

* I don't see why there needs to be nparts in the above, because it already has a PartitionPruneContext member which has that information.

In fact, I made most of the changes myself while going through the code. Please see the attached delta patch. It also tweaks quite a few other things, including various comments. I think parts of it apply to the 0001, 0003, and 0004 patches. See if this looks good to you.

Thanks,
Amit

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 17da8cdbd3..1041871e51 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -39,11 +39,11 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
                      bool *isnull,
                      int maxfieldlen);
 static List *adjust_partition_tlist(List *tlist, TupleConversionMap *map);
-static void find_subplans_for_extparams_recurse(
-                    PartitionedRelPruning *partrelprune,
+static void find_subplans_for_extparams_recurse(PartitionPruningDispatch *all_ppd,
+                    int dispatch_offset,
                     Bitmapset **validsubplans);
-static void find_subplans_for_allparams_recurse(
-                    PartitionedRelPruning *partrelprune,
+static void find_subplans_for_allparams_recurse(PartitionPruningDispatch *all_ppd,
+                    int dispatch_offset,
                     Bitmapset **validsubplans);
@@ -1343,27 +1343,27 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
 PartitionPruning *
 ExecSetupPartitionPruning(PlanState *planstate, List *partitionpruneinfo)
 {
-    PartitionedRelPruning *partrelprunes;
-    PartitionPruning *partprune;
+    PartitionPruning *pprune;
+    PartitionPruningDispatch *all_ppd;
     ListCell   *lc;
     int         i;

     Assert(partitionpruneinfo != NIL);

-    partprune = (PartitionPruning *) palloc(sizeof(PartitionPruning));
-    partrelprunes = (PartitionedRelPruning *)
-
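The array-sizing and palloc0 suggestions from the review above can be spelled out as a small standalone C sketch (hypothetical helper name; not the actual planner code): range-table indexes start at 1, so an array subscripted by RT index needs simple_rel_array_size + 1 elements, and zero-filled allocation lets 0 double as the "invalid" marker, removing the need for an initialize-to--1 loop.

```c
#include <stdlib.h>

/*
 * Illustrative only, not planner code: allocate an array that will be
 * subscripted by range-table (RT) index.  RT indexes start at 1, so the
 * array needs simple_rel_array_size + 1 elements.  Zero-filled allocation
 * (calloc here, palloc0 in PostgreSQL) makes 0 the natural "invalid"
 * value if stored indexes are kept 1-based.
 */
static int *
alloc_rti_indexed_array(int simple_rel_array_size)
{
    /* +1 so the highest RT index is a valid subscript */
    return calloc(simple_rel_array_size + 1, sizeof(int));
}
```

With this scheme, a lookup simply checks for a non-zero entry instead of comparing against -1.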
Re: [HACKERS] Runtime Partition Pruning
On 5 April 2018 at 15:14, Amit Langote wrote:
> On 2018/03/31 22:52, David Rowley wrote:
>> Initialising Append/MergeAppend/ModifyTable nodes with fewer subnodes
>> than were originally planned did require a small change in explain.c in
>> some code that was assuming the subnode arrays were the same length as
>> the subplan list. I also ended up adding an Int property to EXPLAIN to
>> show the number of subnodes that have been removed during execution.
>> Unfortunately, there is a small corner case with this: in the case where
>> all subnodes are removed, it leaves EXPLAIN with no subnodes to search
>> for outer-side Vars in. I didn't really see any sane way to handle this
>> in EXPLAIN, so I think the best fix is what I've done, and that's just
>> to always initialise at least 1 subnode, even when none match the
>> external Params. Cost-wise, I don't think this is such a big deal, as
>> the cost savings here come from saving on initialising tens or hundreds
>> of subnodes, not 1.
>
> I have wondered about the possibility of setting a valid (non-dummy)
> targetlist in Append and MergeAppend if they're created for a partitioned
> table. Currently they're overwritten by a dummy one using
> set_dummy_tlist_references() in set_plan_refs(), citing the following reason:
>
> * set_dummy_tlist_references
> *    Replace the targetlist of an upper-level plan node with a simple
> *    list of OUTER_VAR references to its child.
> *
> * This is used for plan types like Sort and Append that don't evaluate
> * their targetlists. Although the executor doesn't care at all what's in
> * the tlist, EXPLAIN needs it to be realistic.
>
> In fact, when I noticed this EXPLAIN behavior, I wondered whether it's
> something we should have discussed when d3cc37f1d801a [1] went in.

I had a quick hack at this to see if it would work, and it does seem to work on my very simple test.
However, it would mean removing set_dummy_tlist_references from more than just Append/MergeAppend.

create table listp (a int, b int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);

prepare q1 (int, int) as select * from listp where a in($1,$2) order by b limit 1;

explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);

explain (verbose, costs off) execute q1(0,0);
          QUERY PLAN
------------------------------
 Limit
   Output: listp.a, listp.b
   ->  Sort
         Output: listp.a, listp.b
         Sort Key: listp.b
         ->  Append
               Subplans Pruned: 2
(7 rows)

The downside is that if we were to do this, it would mean changing the output in cases like:

explain (verbose, costs off) (select a z, b y from listp union all select * from listp) order by y;
                 QUERY PLAN
---------------------------------------------
 Sort
   Output: z, y
   Sort Key: y
   ->  Append
         ->  Seq Scan on public.listp1
               Output: listp1.a, listp1.b
         ->  Seq Scan on public.listp2
               Output: listp2.a, listp2.b
         ->  Seq Scan on public.listp1 listp1_1
               Output: listp1_1.a, listp1_1.b
         ->  Seq Scan on public.listp2 listp2_1
               Output: listp2_1.a, listp2_1.b

Notice the sort key now refers to the alias rather than a column from the first append child.

It sure is an interesting thought, and one I'd not considered, but I don't think trying for something like this is going to be the path of least resistance. It may also add quite a bit of complexity if we only try to do it when the OUTER_VAR would lead to an Append/MergeAppend that belongs to a partitioned table scan.

I think this idea has good merit, and some form of it might be the nicest way to allow run-time pruning of all subnodes in the future. Perhaps we can put it on the shelf and think about it again for PG12.
However, it might not be the most interesting optimization to work on, as run-time pruning away of all subnodes is probably far less common than pruning some, or all but one, and the cost of initializing the one unneeded subnode is not so big.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

set_dummy_tlist_hacks.patch Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
Hi David.

On 2018/03/31 22:52, David Rowley wrote:
> The attached patchset is based on Amit's v45 faster partition pruning [1].
>
> I've made a few changes since the v14 version. Since Amit's v45 patch
> now creates the partition pruning details in a data structure that can
> be copied from the planner over to the executor, it means this patch
> is now able to do much of the setup work for run-time pruning in the
> planner. Doing this now allows us to determine if run-time pruning is
> even possible at plan time, rather than the executor attempting it and
> sometimes wasting effort when it failed to find Params matching the
> partition key.
>
> Another change from the last version is that I've separated out the
> handling of exec Params and external Params. The new patch will now
> perform a pruning step at executor startup if some external Params
> match the partition key. This is very beneficial to a PREPAREd OLTP-type
> query against a partitioned table, as it means we can skip sub-plan
> initialisation for all non-matching partitions.

This is quite nice.

> Initialising Append/MergeAppend/ModifyTable nodes with fewer subnodes
> than were originally planned did require a small change in explain.c in
> some code that was assuming the subnode arrays were the same length as
> the subplan list. I also ended up adding an Int property to EXPLAIN to
> show the number of subnodes that have been removed during execution.
> Unfortunately, there is a small corner case with this: in the case where
> all subnodes are removed, it leaves EXPLAIN with no subnodes to search
> for outer-side Vars in. I didn't really see any sane way to handle this
> in EXPLAIN, so I think the best fix is what I've done, and that's just
> to always initialise at least 1 subnode, even when none match the
> external Params. Cost-wise, I don't think this is such a big deal, as
> the cost savings here come from saving on initialising tens or hundreds
> of subnodes, not 1.
I have wondered about the possibility of setting a valid (non-dummy) targetlist in Append and MergeAppend if they're created for a partitioned table. Currently they're overwritten by a dummy one using set_dummy_tlist_references() in set_plan_refs(), citing the following reason:

* set_dummy_tlist_references
*    Replace the targetlist of an upper-level plan node with a simple
*    list of OUTER_VAR references to its child.
*
* This is used for plan types like Sort and Append that don't evaluate
* their targetlists. Although the executor doesn't care at all what's in
* the tlist, EXPLAIN needs it to be realistic.

In fact, when I noticed this EXPLAIN behavior, I wondered whether it's something we should have discussed when d3cc37f1d801a [1] went in.

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
>
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.

Quite nice!

> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans. It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions. Ideally, the performance of a partitioned table would be
> faster than the non-partitioned case, but please remember this query
> is a single row PK lookup SELECT, so is a very fast query in both
> cases. Partitioning cases should improve more as the table grows and
> indexes struggle to fit in RAM, or when many rows are being taken from
> the partition and being aggregated.
>
> Unfortunately, the story is not quite as good for the non -M prepared
> version of the benchmark.
> This shows that planning time of partitioned table queries could still
> use some improvements. Amit's v45 patch certainly makes a dent in the
> planner's slow performance here, but it's still nothing like as fast as
> the non-partitioned case. More work is required there in the future.

Certainly. Things like the issue that we both replied to yesterday should be addressed for starters [2].

> This patchset also contains a patch to improve the performance of
> inheritance planning of UPDATE/DELETE queries. This patch also
> implements run-time pruning for UPDATE/DELETE too. This also has a
> significant performance improvement for planning of UPDATE/DELETE
> operations on partitioned tables with a large number of partitions;
> performance is as follows:
>
> Values in transactions per second.
>
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
>
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
>
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
>
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)
>
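The percentages quoted above can be re-derived from the raw TPS figures; a tiny standalone C helper (illustrative only, not part of any patch) makes the arithmetic explicit:

```c
/*
 * Relative change, in percent, between an unpatched and a patched TPS
 * figure -- the calculation behind the percentages quoted above.
 */
static double
pct_change(double unpatched, double patched)
{
    return (patched - unpatched) / unpatched * 100.0;
}
```

For example, pct_change(7323.3, 6573.2) gives roughly -10.24, matching the Partitions = 1 row.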
Re: [HACKERS] Runtime Partition Pruning
Hi David,

On 04/03/2018 10:10 PM, David Rowley wrote:

The attached case doesn't trigger a generic plan, so basically all time is spent in GetCachedPlan.

Yeah, there's still no resolution to the fact that a generic plan + runtime pruning might be cheaper than a custom plan. The problem is that the generic plan appears expensive to the custom vs generic plan comparison due to it containing more Append subnodes and the run-time pruning not being taken into account by that comparison. There's been some discussion about this on this thread somewhere.

Forgot about that, sorry.

I think the best solution is probably the one suggested by Robert [1], and that's to alter the Append plan's cost when run-time pruning is enabled to try to account for the run-time pruning. This would be a bit of a blind guess akin to what we do for clause selectivity estimates for Params, but it's probably better than nothing, and likely better than doing nothing.

Yeah, something based on the number of WHERE clauses, and on whether the partitioned table has a DEFAULT or NULL partition, could help. Forcing choose_custom_plan() to return false does help a lot (> 400%) for the HASH case. But maybe this area is best left for PG12.

Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for that with Amit over on [2].

I was running with a version of faster_part_prune_v45_fixups.patch. Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

Thanks again, Jesper
Re: [HACKERS] Runtime Partition Pruning
On 2018/04/04 16:04, David Rowley wrote: > On 4 April 2018 at 18:27, Amit Langote wrote: >> I'm not sure if we've yet discussed anything that'd be related to this on >> the faster pruning thread. > > hmm, yeah, I didn't really explain the context, but the report was in [1]> > [1] > https://www.postgresql.org/message-id/cakjs1f_shpuqdhqwjq-_p1kppqn7bjt71ypbdp_8b3rhwfq...@mail.gmail.com Oh, I see. Hopefully it is no longer an issue. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
On 4 April 2018 at 18:27, Amit Langote wrote:
> On 2018/04/04 11:10, David Rowley wrote:
>> On 4 April 2018 at 05:44, Jesper Pedersen wrote:
>>> Also, I'm seeing a regression for check-world in
>>> src/test/regress/results/inherit.out
>>>
>>> ***
>>> *** 642,648
>>> -+---+---+-
>>>  mlparted_tab_part1 | 1 | a |
>>>  mlparted_tab_part2a | 2 | a |
>>> ! mlparted_tab_part2b | 2 | b | xxx
>>>  mlparted_tab_part3 | 3 | a | xxx
>>> (4 rows)
>>>
>>> --- 642,648
>>> -+---+---+-
>>>  mlparted_tab_part1 | 1 | a |
>>>  mlparted_tab_part2a | 2 | a |
>>> ! mlparted_tab_part2b | 2 | b |
>>>  mlparted_tab_part3 | 3 | a | xxx
>>> (4 rows)
>>>
>>> I'll spend some more time tomorrow.
>>
>> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
>> that with Amit over on [2].
>
> I'm not sure if we've yet discussed anything that'd be related to this on
> the faster pruning thread.

hmm, yeah, I didn't really explain the context, but the report was in [1]

Basically, the OR clause in the following SQL fragment was overwriting the scan_all_nonnull value:

where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

Basically the:

result->scan_all_nonnull = step_result->scan_all_nonnull;

The minimum fix would have been to change that line to:

result->scan_all_nonnull |= step_result->scan_all_nonnull;

Anyway, it's all irrelevant now, as that code has all changed.

[1] https://www.postgresql.org/message-id/cakjs1f_shpuqdhqwjq-_p1kppqn7bjt71ypbdp_8b3rhwfq...@mail.gmail.com

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
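The one-character fix above ('=' vs '|=') can be illustrated with a small standalone C sketch (illustrative only; the field name mirrors the one above, but this is not the actual pruning code). When combining the results of the arms of an OR clause, a flag that is true for any arm must stay true; plain assignment lets a later arm overwrite an earlier arm's result:

```c
#include <stdbool.h>

/* Buggy variant: each OR arm's result overwrites the previous one. */
static bool
combine_with_assign(const bool *arm_results, int n)
{
    bool scan_all_nonnull = false;
    for (int i = 0; i < n; i++)
        scan_all_nonnull = arm_results[i];   /* last arm wins -- wrong */
    return scan_all_nonnull;
}

/* Fixed variant: results accumulate, so any true arm keeps the flag set. */
static bool
combine_with_or(const bool *arm_results, int n)
{
    bool scan_all_nonnull = false;
    for (int i = 0; i < n; i++)
        scan_all_nonnull |= arm_results[i];  /* accumulates correctly */
    return scan_all_nonnull;
}
```

With arms {true, false} -- as when only the first OR arm requires scanning all non-null partitions -- the buggy variant returns false while the fixed one returns true.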
Re: [HACKERS] Runtime Partition Pruning
Hi David.

On 2018/04/04 11:10, David Rowley wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen wrote:
>> Also, I'm seeing a regression for check-world in
>> src/test/regress/results/inherit.out
>>
>> ***
>> *** 642,648
>> -+---+---+-
>>  mlparted_tab_part1 | 1 | a |
>>  mlparted_tab_part2a | 2 | a |
>> ! mlparted_tab_part2b | 2 | b | xxx
>>  mlparted_tab_part3 | 3 | a | xxx
>> (4 rows)
>>
>> --- 642,648
>> -+---+---+-
>>  mlparted_tab_part1 | 1 | a |
>>  mlparted_tab_part2a | 2 | a |
>> ! mlparted_tab_part2b | 2 | b |
>>  mlparted_tab_part3 | 3 | a | xxx
>> (4 rows)
>>
>> I'll spend some more time tomorrow.
>
> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
> that with Amit over on [2].

I'm not sure if we've yet discussed anything that'd be related to this on the faster pruning thread.

It seems that the difference arises from mlparted_tab_part2b not being selected for an update query that's executed just before this test. When I execute an equivalent select query to check if mlparted_tab_part2b is inadvertently pruned due to the new code, I don't see the latest faster pruning patch doing it:

explain (costs off) select * from mlparted_tab mlp, (select a from some_tab union all select a+1 from some_tab) ss (a) where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop
   Join Filter: (((mlp.a = some_tab.a) AND (mlp.b = 'b'::bpchar)) OR (mlp.a = 3))
   ->  Append
         ->  Seq Scan on some_tab
         ->  Seq Scan on some_tab some_tab_1
   ->  Materialize
         ->  Append
               ->  Seq Scan on mlparted_tab_part1 mlp
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
               ->  Seq Scan on mlparted_tab_part2b mlp_1
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
               ->  Seq Scan on mlparted_tab_part3 mlp_2
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
(13 rows)

For the original update query, constraint exclusion selects the same set of partitions:

explain (costs off) update mlparted_tab mlp set c = 'xxx' from (select a from some_tab union all select a+1 from some_tab) ss (a) where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

                                 QUERY PLAN
----------------------------------------------------------------------------
 Update on mlparted_tab mlp
   Update on mlparted_tab_part1 mlp_1
   Update on mlparted_tab_part2b mlp_2
   Update on mlparted_tab_part3 mlp_3
   ->  Nested Loop
         Join Filter: (((mlp_1.a = some_tab.a) AND (mlp_1.b = 'b'::bpchar)) OR (mlp_1.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part1 mlp_1
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
         Join Filter: (((mlp_2.a = some_tab.a) AND (mlp_2.b = 'b'::bpchar)) OR (mlp_2.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part2b mlp_2
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
         Join Filter: (((mlp_3.a = some_tab.a) AND (mlp_3.b = 'b'::bpchar)) OR (mlp_3.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part3 mlp_3
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
(28 rows)

What am I missing?

Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
Hi David, On Wed, Apr 4, 2018 at 7:57 AM, David Rowley wrote: > On 4 April 2018 at 05:50, Beena Emerson wrote: >> With Amit's v46 patch, the following query in partition_prune was >> crashing during make check. >> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, >> 3); > > Hi Beena, > > Thanks for looking. > > Does it work correctly if you apply [1] to Amit's v46 patch before > patching with v18 run-time partition pruning? > > [1] > https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com > Thanks for working on it. make check passes when the patch [1] is also applied. -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
On 4 April 2018 at 14:10, David Rowley wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen wrote:
>> The attached case doesn't trigger a generic plan, so basically all time is
>> spent in GetCachedPlan.
>
> Yeah, there's still no resolution to the fact that a generic plan +
> runtime pruning might be cheaper than a custom plan. The problem is that
> the generic plan appears expensive to the custom vs generic plan
> comparison due to it containing more Append subnodes and the run-time
> pruning not being taken into account by that comparison.

Just for the record, some of the benchmarks I did above also used the attached patch for the -M prepared case. I didn't intend the patch for PostgreSQL, but I am starting to think that it would be useful to have something that saves us from having to EXECUTE a PREPAREd statement 5 times before getting a generic plan. Doing that is starting to seem a bit fragile to me. It would be nice to have some solution, but I've so far not thought of anything better than the attached (incomplete) patch.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

0001-Add-force_generic_plan-GUC.patch Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
On 4 April 2018 at 05:50, Beena Emerson wrote: > With Amit's v46 patch, the following query in partition_prune was > crashing during make check. > explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3); Hi Beena, Thanks for looking. Does it work correctly if you apply [1] to Amit's v46 patch before patching with v18 run-time partition pruning? [1] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 4 April 2018 at 05:44, Jesper Pedersen wrote:
> I have tested this together with Amit's v46 patch.

Thanks for testing this.

> The attached case doesn't trigger a generic plan, so basically all time is
> spent in GetCachedPlan.

Yeah, there's still no resolution to the fact that a generic plan + runtime pruning might be cheaper than a custom plan. The problem is that the generic plan appears expensive to the custom vs generic plan comparison due to it containing more Append subnodes and the run-time pruning not being taken into account by that comparison. There's been some discussion about this on this thread somewhere.

I think the best solution is probably the one suggested by Robert [1], and that's to alter the Append plan's cost when run-time pruning is enabled to try to account for the run-time pruning. This would be a bit of a blind guess akin to what we do for clause selectivity estimates for Params, but it's probably better than nothing, and likely better than doing nothing.

> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out
>
> ***
> *** 642,648
> -+---+---+-
>  mlparted_tab_part1 | 1 | a |
>  mlparted_tab_part2a | 2 | a |
> ! mlparted_tab_part2b | 2 | b | xxx
>  mlparted_tab_part3 | 3 | a | xxx
> (4 rows)
>
> --- 642,648
> -+---+---+-
>  mlparted_tab_part1 | 1 | a |
>  mlparted_tab_part2a | 2 | a |
> ! mlparted_tab_part2b | 2 | b |
>  mlparted_tab_part3 | 3 | a | xxx
> (4 rows)
>
> I'll spend some more time tomorrow.

Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for that with Amit over on [2]. If you patch v46 with the patch I attached to that thread, it should work.

[1] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hello, On Tue, Apr 3, 2018 at 11:14 PM, Jesper Pedersen wrote: > Hi David, > > On 03/31/2018 09:52 AM, David Rowley wrote: >> >> I've attached a new version of the patch. I'm now at v18 after having >> some versions of the patch that I didn't release which were based on >> various versions of Amit's faster partition pruning patch. >> > > Thank you for the updated patch set ! > > I have tested this together with Amit's v46 patch. > > > Also, I'm seeing a regression for check-world in > src/test/regress/results/inherit.out > With Amit's v46 patch, the following query in partition_prune was crashing during make check. explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3); -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
Hi David,

On 03/31/2018 09:52 AM, David Rowley wrote:

I've attached a new version of the patch. I'm now at v18 after having some versions of the patch that I didn't release which were based on various versions of Amit's faster partition pruning patch.

Thank you for the updated patch set!

I have tested this together with Amit's v46 patch.

The attached case doesn't trigger a generic plan, so basically all time is spent in GetCachedPlan. The standard table case (std.sql) gives:

generic_cost = 8.4525
avg_custom_cost = 13.4525
total_custom_cost = 67.2625

whereas the 64 hash partition case (hash.sql) gives:

generic_cost = 540.32
avg_custom_cost = 175.9425
total_custom_cost = 879.7125

I tested with pgbench -M prepared -f select.sql.

Also, I'm seeing a regression for check-world in src/test/regress/results/inherit.out

***
*** 642,648
-+---+---+-
 mlparted_tab_part1 | 1 | a |
 mlparted_tab_part2a | 2 | a |
! mlparted_tab_part2b | 2 | b | xxx
 mlparted_tab_part3 | 3 | a | xxx
(4 rows)

--- 642,648
-+---+---+-
 mlparted_tab_part1 | 1 | a |
 mlparted_tab_part2a | 2 | a |
! mlparted_tab_part2b | 2 | b |
 mlparted_tab_part3 | 3 | a | xxx
(4 rows)

I'll spend some more time tomorrow.

Thanks for working on this!

Best regards, Jesper

hash.sql Description: application/sql
select.sql Description: application/sql
std.sql Description: application/sql
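The generic_cost vs. avg_custom_cost numbers above drive plancache's choice between a generic and a custom plan. A simplified standalone C sketch of that decision rule (modeled loosely on choose_custom_plan(); the exact server logic differs, so treat the details as illustrative) shows why the hash-partitioned case never switches to the generic plan -- its generic plan, with all 64 un-pruned Append subnodes costed in, looks far more expensive than the average custom plan:

```c
#include <stdbool.h>

/*
 * Illustrative sketch of the custom-vs-generic plan choice: a custom plan
 * keeps being chosen while fewer than 5 custom plans have been built;
 * afterwards the generic plan wins only if its cost beats the average
 * custom-plan cost.  Run-time pruning is not reflected in generic_cost,
 * so a prunable Append inflates it.
 */
static bool
choose_custom_plan(int num_custom_plans, double generic_cost,
                   double avg_custom_cost)
{
    if (num_custom_plans < 5)
        return true;            /* still collecting custom-plan costs */
    return generic_cost > avg_custom_cost;
}
```

Plugging in the figures quoted above: for std.sql (8.4525 vs 13.4525) the generic plan wins after 5 executions, while for hash.sql (540.32 vs 175.9425) the custom plan is chosen forever, so all time goes into replanning inside GetCachedPlan.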
Re: [HACKERS] Runtime Partition Pruning
David Rowley wrote:
> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
>
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.
>
> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans. It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions.

This is excellent news!

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi David,

On 03/01/2018 08:29 PM, David Rowley wrote:

0004 fails "make check-world" due to

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE boolp_f jpedersen
pg_restore: [archiver (db)] could not execute query: ERROR: syntax error at or near "false"
LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

The tests seem to have stumbled on a pg_dump bug which causes it to produce syntax that's not valid (currently). I should be able to stop my patch failing the test by dropping that table, which I should have been doing anyway. I've added that thread to the Open Items list.

Thanks for the review, and in advance for the future review. I'll delay releasing a new patch as there's some discussion over on the faster partition pruning thread which affects this too [1]

[1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=mtcxkxtwbnslrhd3...@mail.gmail.com

Sure, 0003-0005 depend on that thread. 0002 is refactoring, so that one is ready.
In 0004, should we add a HASH-based test case?

-- test.sql --
-- verify pruning in hash partitions
create table hashp (a int primary key, b int) partition by hash (a);
create table hashp_0 partition of hashp for values with (modulus 2, remainder 0);
create table hashp_1 partition of hashp for values with (modulus 2, remainder 1);
insert into hashp values (0,0), (1,1), (2,2), (3,3);
prepare q1 (int) as select * from hashp where a = $1;
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
explain (analyze, costs off, summary off, timing off) execute q1 (1);
explain (analyze, costs off, summary off, timing off) execute q1 (3);
deallocate q1;
drop table hashp;
-- test.sql --

Also, should 0004 consider the "Parallel Append" case, aka

-- parallel.sql --
CREATE TABLE t1 (
    a integer NOT NULL,
    b integer NOT NULL
) PARTITION BY HASH (b);
CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 3);
INSERT INTO t1 (SELECT i, i FROM generate_series(1, 100) AS i);
PREPARE q1 (int) AS SELECT * FROM t1 WHERE a = $1;
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) EXECUTE q1 (5432);
DEALLOCATE q1;
DROP TABLE t1;
-- parallel.sql --

Best regards, Jesper
Re: [HACKERS] Runtime Partition Pruning
On 2 March 2018 at 07:17, Jesper Pedersen wrote:
> A small typo in 0001:
>
> + * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one bit in a
>
> ..."_one_"...

Oops. I'll fix that.

> 0004 fails "make check-world" due to
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
> boolp_f jpedersen
> pg_restore: [archiver (db)] could not execute query: ERROR: syntax error at
> or near "false"
> LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

The tests seem to have stumbled on a pg_dump bug which causes it to produce syntax that's not valid (currently). I should be able to stop my patch failing the test by dropping that table, which I should have been doing anyway.

> Do you require https://commitfest.postgresql.org/17/1410/ as well ?

I'll look at that thread and see if there's any discussion about pg_dump being broken.

> I'll look more at 0002-0005 over the coming days.

Thanks for the review, and in advance for the future review. I'll delay releasing a new patch as there's some discussion over on the faster partition pruning thread which affects this too [1]

[1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=mtcxkxtwbnslrhd3...@mail.gmail.com

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi David, On 03/01/2018 08:04 AM, David Rowley wrote: I've also split the patch out a bit more into logical parts in the hope it makes things easier to review. A small typo in 0001: + * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one bit in a ..."_one_"... 0004 fails "make check-world" due to pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE boolp_f jpedersen pg_restore: [archiver (db)] could not execute query: ERROR: syntax error at or near "false" LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false); Do you require https://commitfest.postgresql.org/17/1410/ as well ? I'll look more at 0002-0005 over the coming days. Thanks for working on this ! Best regards, Jesper
Re: [HACKERS] Runtime Partition Pruning
On 27 February 2018 at 22:39, Amit Langote wrote: > I've incorporated portions of 0002 and 0003 into my patch on the other > thread (v34) posted at [1]. That is, mostly the changes around handling > OR clauses and interface changes resulting from it. Thanks. I was just in the middle of swapping the order of the patches so that the OR clause patch was directly based on yours. > Attached are revised version of your patches after the aforementioned > rearrangements. Note that after I took out the optimizer portion of the > 0003 patch to incorporate it into my patch (OR clause processing bits), > not much was left in it, so I squashed it into 0002. So there are only > 0001 and 0002. I've locally got a patch which is significantly different to the v12 patch which moves lots of code into nodePartition.c and fixes up the missing node read/write functions too. > As a review comment on 0002, I think trypartitionprune is better written > as try_partition_prune. That no longer exists in the new version... Will post soonish, just need to base it all on your v34 [1] now! :) [1] https://www.postgresql.org/message-id/158f04ce-9deb-0457-ddcc-78fb73db4ebc%40lab.ntt.co.jp -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 2018/02/23 20:40, David Rowley wrote: > On 23 February 2018 at 04:11, Jesper Pedersen > wrote: >> Are UPDATE and DELETE supposed to be supported? > > To be honest, I had not even considered those. Without looking in > detail I imagine it may be possible to allow this simply by setting > the AppendPath->trypartitionprune in the correct cases in the > inheritance_planner(). I would need to look into this in some detail > to find out for sure. I guess you meant in ModifyTablePath. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
(Sorry, I had mistakenly replied only to Simon on Friday) On Fri, Feb 23, 2018 at 9:04 PM, Simon Riggs wrote: > On 23 February 2018 at 11:40, David Rowley > wrote: >> On 23 February 2018 at 04:11, Jesper Pedersen >> wrote: >>> Are UPDATE and DELETE supposed to be supported? >> >> To be honest, I had not even considered those. > > I can say that I had considered those. Handling of UPDATE and DELETE > with partitions is significantly different, so it's not just an > oversight; it's a different branch of the project. > > We need SELECT to work first and then we have a chance of making > UPDATE/DELETE work. +1 Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
On 23 February 2018 at 11:40, David Rowley wrote: > On 23 February 2018 at 04:11, Jesper Pedersen > wrote: >> Are UPDATE and DELETE supposed to be supported? > > To be honest, I had not even considered those. I can say that I had considered those. Handling of UPDATE and DELETE with partitions is significantly different, so it's not just an oversight; it's a different branch of the project. We need SELECT to work first and then we have a chance of making UPDATE/DELETE work. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Runtime Partition Pruning
On 23 February 2018 at 04:11, Jesper Pedersen wrote: > Are UPDATE and DELETE supposed to be supported? To be honest, I had not even considered those. Without looking in detail I imagine it may be possible to allow this simply by setting the AppendPath->trypartitionprune in the correct cases in the inheritance_planner(). I would need to look into this in some detail to find out for sure. Another case which likely is simple to implement is the exact same processing for MergeAppends. I currently see no reason why the same pruning cannot be done for subnodes of that node type too. I've just not done so yet. I'd rather get more sanity check reviews on the current scope of the patch before I widen it out to other areas, but at the same time also don't want to leave very simple things to PG12 which can easily be done in PG11. So I'll try to look at this and get back to you, or perhaps release a new set of patches to support the additional features. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi David. On 2018/02/23 0:11, David Rowley wrote: > On 23 February 2018 at 01:15, David Rowley > wrote: >> One problem that I'm facing now is down to the way I'm gathering the >> ParamIds that match the partkeys. As you'll see from the patch I've >> added a 'paramids' field to PartitionPruneContext and I'm populating >> this when the clauses are being pre-processed in >> extract_partition_clauses(). The problem is that the or_clauses are >> not pre-processed at all, so the current patch will not properly >> perform run-time pruning when the Params are "hidden" in OR branches. >> >> One way I thought to fix this was to change the clause processing to >> create an array of PartitionClauseInfos, one for each OR branch. This >> would also improve the performance of the run-time pruning, meaning >> that all of the or clauses would be already matched to the partition >> keys once, rather than having to redo that again each time a Param >> changes its value. >> >> If I go and write a patch to do that, would you want it in your patch, >> or would you rather I kept it over here? Or perhaps you have a better >> idea on how to solve...? > > I've attached a patch which does this. For now, the patch is only > intended to assist in the discussion of the above idea. > > The patch is based on a WIP version of run-time pruning that I'm not > quite ready to post yet, but with a small amount of work you could > probably take it and base it on your faster partition pruning v31 > patch [1]. > > I ended up pulling the PartitionPruneInfo out of the > PartitionPruneContext. This was required due to how I've now made > extract_partition_clauses() recursively call itself. We don't want to > overwrite the context's clauseinfo with the one from the recursive > call. A side effect of this is that the subcontext is no longer > required when processing the OR clauses. You only did this so that the > context's clauseinfo was not overwritten. 
I also think it's better to > separate out the inputs and outputs. Anything in context was more > intended to be for input fields. > > Let me know your thoughts about this. If you don't want it for faster > partition pruning, then I'll probably go and tidy it up and include it > for run-time pruning. Thanks for the patch. I don't have time today to look at the patch closely, but at first blush, it seems like something I should incorporate into my own patch, because it's restructuring the partprune.c code. I will study the issue and your proposed solution in the form of this restructuring more closely over the weekend and reply (probably with a new version of my patch) on Monday. Thanks, Amit
Re: [HACKERS] Runtime Partition Pruning
On 23 February 2018 at 01:15, David Rowley wrote: > One problem that I'm facing now is down to the way I'm gathering the > ParamIds that match the partkeys. As you'll see from the patch I've > added a 'paramids' field to PartitionPruneContext and I'm populating > this when the clauses are being pre-processed in > extract_partition_clauses(). The problem is that the or_clauses are > not pre-processed at all, so the current patch will not properly > perform run-time pruning when the Params are "hidden" in OR branches. > > One way I thought to fix this was to change the clause processing to > create an array of PartitionClauseInfos, one for each OR branch. This > would also improve the performance of the run-time pruning, meaning > that all of the or clauses would be already matched to the partition > keys once, rather than having to redo that again each time a Param > changes its value. > > If I go and write a patch to do that, would you want it in your patch, > or would you rather I kept it over here? Or perhaps you have a better > idea on how to solve...? Hi Amit, I've attached a patch which does this. For now, the patch is only intended to assist in the discussion of the above idea. The patch is based on a WIP version of run-time pruning that I'm not quite ready to post yet, but with a small amount of work you could probably take it and base it on your faster partition pruning v31 patch [1]. I ended up pulling the PartitionPruneInfo out of the PartitionPruneContext. This was required due to how I've now made extract_partition_clauses() recursively call itself. We don't want to overwrite the context's clauseinfo with the one from the recursive call. A side effect of this is that the subcontext is no longer required when processing the OR clauses. You only did this so that the context's clauseinfo was not overwritten. I also think it's better to separate out the inputs and outputs. Anything in context was more intended to be for input fields. 
Let me know your thoughts about this. If you don't want it for faster partition pruning, then I'll probably go and tidy it up and include it for run-time pruning. [1] https://www.postgresql.org/message-id/00ae2273-bb6b-1287-9ebc-5459b37c9078%40lab.ntt.co.jp -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services generate_PartitionClauseInfos_for_or_clauses.patch Description: Binary data
Re: [HACKERS] Runtime Partition Pruning
Hi David, On 02/21/2018 04:06 AM, David Rowley wrote: > I've attached v11 of the patch. Are UPDATE and DELETE supposed to be supported? With

-- test.sql --
CREATE TABLE test (a integer NOT NULL, b integer) PARTITION BY HASH(a);
CREATE TABLE test_p00 PARTITION OF test FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE test_p01 PARTITION OF test FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE INDEX idx_test_a ON test (a);
CREATE INDEX idx_test_b ON test (b);
INSERT INTO test (SELECT i, i FROM generate_series(1, 100) AS i);
ANALYZE;
-- test.sql --

and

UPDATE test SET b = 1 WHERE a = ?
DELETE FROM test WHERE a = ?

both show that all partitions are scanned:

Update on test
  Update on test_p00
  Update on test_p01
  ->  Index Scan using test_p00_a_idx on test_p00
        Index Cond: (a = 1)
  ->  Index Scan using test_p01_a_idx on test_p01
        Index Cond: (a = 1)

Using prune_v32 and runtime_v11 with conflicts resolved. Best regards, Jesper
Re: [HACKERS] Runtime Partition Pruning
On 22 February 2018 at 22:31, Amit Langote wrote: > Some comments: Hi Amit, Thanks for looking at this. I'll work through your comments and produce a patch sometime in the near future. One problem that I'm facing now is down to the way I'm gathering the ParamIds that match the partkeys. As you'll see from the patch I've added a 'paramids' field to PartitionPruneContext and I'm populating this when the clauses are being pre-processed in extract_partition_clauses(). The problem is that the or_clauses are not pre-processed at all, so the current patch will not properly perform run-time pruning when the Params are "hidden" in OR branches. One way I thought to fix this was to change the clause processing to create an array of PartitionClauseInfos, one for each OR branch. This would also improve the performance of the run-time pruning, meaning that all of the or clauses would be already matched to the partition keys once, rather than having to redo that again each time a Param changes its value. If I go and write a patch to do that, would you want it in your patch, or would you rather I kept it over here? Or perhaps you have a better idea on how to solve...? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Runtime Partition Pruning
Hi David. On 2018/02/21 18:06, David Rowley wrote: > Other things I don't particularly like about the current patch are how > I had to construct the partition key expressions in > set_valid_runtime_subplans_recurse(). This pretty much uses code > borrowed from set_baserel_partition_key_exprs(). One way around that > would be to change the partprune.c code to deal with the > partkey->partattrs and consume an expression from the list on attnum = > 0. I didn't do that as I want to minimise how much I touch Amit's > patch before it gets committed as doing so would likely just cause > more headaches for me keeping this merged with his work. Another > option to resolve this would be to put the partition key expressions > into the PartitionPruneInfo. Another way could be to refactor the code you've borrowed from set_baserel_partition_key_exprs() into its own function that is exported by some optimizer header. I removed PartitionKey/Relation from signatures of various functions of my patch to avoid having to re-heap_open() the relation per [1]. > I've attached v11 of the patch. Some comments: * I noticed that the patch adds a function to bitmapset.c which you might want to extract into its own patch, like your patch to add bms_add_range() that got committed as 84940644d [2]. * Maybe it's better to add `default: break;` in the two switches you've added in partkey_datum_from_expr():

partprune.c: In function ‘partkey_datum_from_expr’:
partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in switch [-Wswitch] switch (nodeTag(expr))
partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled in switch [-Wswitch] switch (((Param *) expr)->paramkind)

* I see that you moved PartitionClauseInfo's definition from partprune.c to partition.h. Isn't it better to move it to partprune.h? 
Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=84940644d
Re: [HACKERS] Runtime Partition Pruning
On 21 February 2018 at 22:45, Rajkumar Raghuwanshi wrote: > select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 > t1,prt2 t2 where t1.a=t2.b) > and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c); Thanks for the test case. It seems like the x.c Param for some reason has a ParamId of 0. I need to go learn a bit more about Params to understand why this is. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services