On Thu, Apr 11, 2024 at 12:19 PM Melanie Plageman <melanieplage...@gmail.com> wrote: > > On Wed, Apr 10, 2024 at 5:21 PM Andres Freund <and...@anarazel.de> wrote: > > > > Hi, > > > > On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote: > > > This brings up a question about the prefetching. We never had to have > > > this discussion for sequential scan streaming read because it didn't > > > (and still doesn't) do prefetching. But, if we push the streaming read > > > code down into the heap AM layer, it will be doing the prefetching. > > > So, do we remove the prefetching from acquire_sample_rows() and expect > > > other table AMs to implement it themselves or use the streaming read > > > API? > > > > The prefetching added to acquire_sample_rows was quite narrowly tailored to > > something heap-like - it pretty much required that block numbers to be 1:1 > > with the actual physical on-disk location for the specific AM. So I think > > it's pretty much required for this to be pushed down. > > > > Using a read stream is a few lines for something like this, so I'm not > > worried > > about it. I guess we could have a default implementation for block based > > AMs, > > similar what we have around table_block_parallelscan_*, but not sure it's > > worth doing that, the complexity is much lower than in the > > table_block_parallelscan_ case. > > This makes sense. > > I am working on pushing streaming ANALYZE into heap AM code, and I ran > into a few roadblocks. > > If we want ANALYZE to make the ReadStream object in heap_beginscan() > (like the read stream implementation of heap sequential and TID range > scans do), I don't see any way around changing the scan_begin table AM > callback to take a BufferAccessStrategy at the least (and perhaps also > the BlockSamplerData).
I will also say that, had this been 6 months ago, I would probably suggest we restructure ANALYZE's table AM interface to accommodate read stream setup and to address a few other things I find odd about the current code. For example, I think creating a scan descriptor for the analyze scan in acquire_sample_rows() is quite odd. It seems like it would be better done in the relation_analyze callback. The relation_analyze callback saves some state like the callbacks for acquire_sample_rows() and the Buffer Access Strategy. But at least in the heap implementation, it just saves them in static variables in analyze.c. It seems like it would be better to save them in a useful data structure that could be accessed later. We have access to pretty much everything we need at that point (in the relation_analyze callback). I also think heap's implementation of table_beginscan_analyze() doesn't need most of heap_beginscan()/initscan(), so doing this instead of something ANALYZE specific seems more confusing than helpful. - Melanie