Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2023-07-06 Thread Lukas Fittl
On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson  wrote:

> Lukas: do you have an updated patch for this commitfest to address David's
> comments?
>

I have a draft - I should be able to post an updated patch in the next
days. Thanks for checking!

Thanks,
Lukas

-- 
Lukas Fittl


Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2023-07-06 Thread Daniel Gustafsson
> On 7 Mar 2023, at 10:51, David Rowley  wrote:
> 
> On Sun, 5 Mar 2023 at 13:21, Lukas Fittl  wrote:
>> Alternatively (or in addition) we could consider showing the "ndistinct" 
>> value that is calculated in cost_memoize_rescan - since that's the most 
>> significant contributor to the cache hit ratio (and you can influence that 
>> directly by improving the ndistinct statistics).
> 
> I think the ndistinct estimate plus the est_entries together would be
> useful. I think showing just the hit ratio number might often just
> raise too many questions about how that's calculated. To calculate the
> hit ratio we need to estimate the number of entries that can be kept
> in the cache at once and also the number of input rows and the number
> of distinct values.  We can see the input rows by looking at the outer
> side of the join in EXPLAIN, but we've no idea about the ndistinct or
> how many items the planner thought could be kept in the cache at once.
> 
> The plan node already has est_entries, so it should just be a matter
> of storing the ndistinct estimate in the Path and putting it into the
> Plan node so the executor has access to it during EXPLAIN.

Lukas: do you have an updated patch for this commitfest to address David's
comments?

--
Daniel Gustafsson





Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2023-03-07 Thread David Rowley
On Sun, 5 Mar 2023 at 13:21, Lukas Fittl  wrote:
> Alternatively (or in addition) we could consider showing the "ndistinct" 
> value that is calculated in cost_memoize_rescan - since that's the most 
> significant contributor to the cache hit ratio (and you can influence that 
> directly by improving the ndistinct statistics).

I think the ndistinct estimate plus the est_entries together would be
useful. I think showing just the hit ratio number might often just
raise too many questions about how that's calculated. To calculate the
hit ratio we need to estimate the number of entries that can be kept
in the cache at once and also the number of input rows and the number
of distinct values.  We can see the input rows by looking at the outer
side of the join in EXPLAIN, but we've no idea about the ndistinct or
how many items the planner thought could be kept in the cache at once.

The plan node already has est_entries, so it should just be a matter
of storing the ndistinct estimate in the Path and putting it into the
Plan node so the executor has access to it during EXPLAIN.

David




Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2023-03-04 Thread Lukas Fittl
Hi,

I was debugging a planner problem on Postgres 14.4 the other day - and the
involved "bad" plan was including Memoize - though I don't necessarily
think that Memoize is to blame (and this isn't any of the problems recently
fixed in Memoize costing).

However, what I noticed whilst trying different ways to fix the plan, is
that the Memoize output was a bit hard to reason about - especially since
the plan involving Memoize was expensive to run, and so I was mostly
running EXPLAIN without ANALYZE to look at the costing.

Here is an example of the output I was looking at:

 ->  Nested Loop  (cost=1.00..971672.56 rows=119623 width=0)
   ->  Index Only Scan using table1_idx on table1
(cost=0.43..372676.50 rows=23553966 width=8)
   ->  Memoize  (cost=0.57..0.61 rows=1 width=8)
 Cache Key: table1.table2_id
 Cache Mode: logical
 ->  Index Scan using table2_idx on table2
(cost=0.56..0.60 rows=1 width=8)
   Index Cond: (id = table1.table2_id)

The other plan I was comparing with (that I wanted the planner to choose
instead), had a total cost of 1,451,807.35 -- and so I was trying to figure
out why the Nested Loop was costed as 971,672.56.

Simple math makes me expect the Nested Loop should roughly have a total
cost of14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot
of the smaller costs. Thus, in this example, it appears Memoize made the
plan cost significantly cheaper (roughly 6% of the regular cost).

Essentially this comes down to the "cost reduction" performed by Memoize
only being implicitly visible in the Nested Loop's total cost - and with
nothing useful on the Memoize node itself - since the rescan costs are not
shown.

I think explicitly adding the estimated cache hit ratio for Memoize nodes
might make this easier to reason about, like this:

->  Memoize  (cost=0.57..0.61 rows=1 width=8)
 Cache Key: table1.table2_id
 Cache Mode: logical
 Cache Hit Ratio Estimated: 0.94

Alternatively (or in addition) we could consider showing the "ndistinct"
value that is calculated in cost_memoize_rescan - since that's the most
significant contributor to the cache hit ratio (and you can influence that
directly by improving the ndistinct statistics).

See attached a patch that implements showing the cache hit ratio as a
discussion starter.

I'll park this in the July commitfest for now.

Thanks,
Lukas

-- 
Lukas Fittl


v1-0001-Add-Estimated-Cache-Hit-Ratio-for-Memoize-plan-no.patch
Description: Binary data