HippoBaro commented on PR #9697: URL: https://github.com/apache/arrow-rs/pull/9697#issuecomment-4248638163
@alamb @nathanb9 Thank you both for your feedback — it's been very helpful. The watermark mechanism is fundamentally flawed. I worked under the assumption that all Parquet files are laid out like mine are, but the spec makes no such guarantee: "There is no physical structure that is guaranteed for a row group." Column chunks within a row group need not be contiguous or ordered, so a monotonic watermark won't be correct in general. @nathanb9's reverse-scan regression is one concrete consequence, but it's not the only one. The deletion trick is actually quite neat. > I was thinking more about this change and I thought maybe we could take a step back and figure out what you are trying to accomplish Fair point. I should have been clearer in the original description. Here are the high-level goals driving this work: * **Linear scaling with column count.** My schemas are very high cardinality. Loads of columns. `PushBuffers` must scale linearly (ideally sub-linearly) with the number of columns. The current implementation scales with the number of requested ranges from the reader, which can reach millions on adversarial inputs. * **Total decoupling from the IO layer.** The decoder should own all logic related to *what* to read; the IO layer should be exclusively concerned with *how* best to exploit the characteristics of the underlying storage medium along whatever axis the user cares about (throughput, latency, cost). The current design leaks information across that boundary: buffers have to align with logical range requests or they leak, and can't straddle request boundaries. * **Coalescing should not be load bearing.** `PushBuffers` should scale sub-linearly to the number of buffers it's given. Coalescing should be decided based on the characteristics of storage, not because the code can't deal with too many small buffers. * **First-class support for prefetching.** My data lives in blob storage. The economics of blob storage incentivize pulling as much as possible: you pay the toll per GET request. If we need two row groups separated by 100 MiB of pruned row groups, it is most likely cheapest to stream all of it and discard than to seek. There are also cases where the reader doesn't know which row groups to pull next, because that depends on the content of the current row group — i.e., there are data dependencies. So prefetching can't simply be "give a list of row groups of interest ahead of time." * **No leaks.** Whatever the IO layer does (coalescing, prefetch, etc.), regardless of access pattern or pruning ratio, we should never leak buffers. This doesn't mean every byte must be freed at the earliest opportunity, but at row-group granularity, everything should be released when we move on to the next. A single decoder should be able to stream-decode TBs worth of data, and the user should have RSS guarantees derived from the max row-group size of the data they are dealing with (and modulo how aggressive prefetching is allowed to be, which they can easily control). > if your usecase is to clear all IO after reading a row group's data, I wonder if we could you just call clear_all_buffers after reading the row group's data? The problem is that by the time I finish decoding a row group, I only know I'm done with *that* row group. The IO layer may have already prefetched data I'm about to need. Calling `release_all` would throw away that prefetched work. The watermark was a nice idea for this: if we know that after reading byte offset N we'll never read anything below N, we can prefetch freely above N and safely release everything below. But since we can't guarantee monotonic access, that doesn't hold. I'll push a revised version that addresses all the feedback shortly! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
