sdf-jkl commented on PR #9118: URL: https://github.com/apache/arrow-rs/pull/9118#issuecomment-3825678416
I want to write down my thoughts about this issue; hopefully they will be helpful.

# The original issue:

- Without falling back to RLE selectors, mask-based selection can try to fetch data from pages that were never loaded (skipped pages).

**The simplest case where this happens is:**

1. A previous selection is contiguous enough to skip a full page in a column.
2. Another selection is added to the first one and materializes as a mask selection. (This could be part of the same first selection.)
3. Data is read using the mask selection, and we receive an error: attempting to fetch data that was not loaded.

The case above is simple and does not require different page layouts between columns. It only requires a skipped page in a column and a mask selection.

**This can happen:**

- during the filter evaluation stage
- while reading projections (with selections coming from filters or created manually)
- during both stages

**This was fixed by:**

- introducing page awareness to the reader.

# The new issue (introduced in #9239):

- Page awareness needs to be column-aware across all projected columns. Col A's page layout may differ from col B's. This can lead to a mask chunk including part of a page that was skipped in one column, resulting in the same error: attempting to fetch data that was not loaded.

**Prerequisites for this to happen:**

- a selection that skips a page in at least one projected column
- different page layouts between the columns in the projection (plan or predicate)
- ?

**How we fixed it:**

- Clip each mask chunk at the closest upcoming page boundary after the current cursor (across all columns).
- Together with the initial skip inside the mask chunk, this prevents us from reading any page that was skipped by the selection.
- The initial skip also helps coalesce skipped pages without reading them one by one.
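To make the clipping step concrete, here is a minimal sketch in Rust. It is not the code from the PR: `initial_skip`, `next_page_boundary`, `clip_chunk`, and `plan_chunks` are hypothetical names, the mask is modeled as a plain `&[bool]`, and each column's page layout is assumed to be a sorted list of the row offsets at which its pages start.

```rust
/// Hypothetical helper: advance the cursor past leading unselected rows.
/// Returns `mask.len()` if no selected row remains.
fn initial_skip(mask: &[bool], cursor: usize) -> usize {
    mask[cursor..]
        .iter()
        .position(|&selected| selected)
        .map_or(mask.len(), |p| cursor + p)
}

/// Hypothetical helper: the closest page start strictly after `cursor`,
/// taken across all columns. `page_starts[c]` is assumed to hold the sorted
/// row offsets at which the pages of column `c` begin.
fn next_page_boundary(cursor: usize, page_starts: &[Vec<usize>]) -> Option<usize> {
    page_starts
        .iter()
        .filter_map(|starts| {
            // Index of the first page start greater than the cursor.
            let idx = starts.partition_point(|&s| s <= cursor);
            starts.get(idx).copied()
        })
        .min()
}

/// Clip a chunk of up to `len` rows starting at `cursor` so that it does not
/// cross a page boundary in any column. Returns the clipped length.
fn clip_chunk(cursor: usize, len: usize, total_rows: usize, page_starts: &[Vec<usize>]) -> usize {
    let end = (cursor + len).min(total_rows);
    match next_page_boundary(cursor, page_starts) {
        Some(boundary) => end.min(boundary) - cursor,
        None => end - cursor,
    }
}

/// Split a mask into `(start, len)` chunks: skip unselected rows first,
/// then clip each chunk at the next page boundary.
fn plan_chunks(mask: &[bool], batch_size: usize, page_starts: &[Vec<usize>]) -> Vec<(usize, usize)> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut chunks = Vec::new();
    let mut cursor = 0;
    while cursor < mask.len() {
        cursor = initial_skip(mask, cursor);
        if cursor >= mask.len() {
            break;
        }
        let len = clip_chunk(cursor, batch_size, mask.len(), page_starts);
        chunks.push((cursor, len));
        cursor += len;
    }
    chunks
}
```

In this sketch, with column A's pages starting at rows 0, 4, and 8, column B's pages starting at rows 0 and 6, a fully selected 12-row mask, and a batch size of 8, the chunks come out as `(0, 4)`, `(4, 2)`, `(6, 2)`, `(8, 4)`: none of them crosses a page start in either column. If rows 2 through 7 are unselected instead, the initial skip jumps straight from row 4 to row 8, so column A's page covering rows 4 to 7 is never touched.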
**Possible positions where the mask cursor can end up:**

- At the start of a skipped page
- At the start of a loaded page
- In the middle of a loaded page
- It cannot start in the middle of a skipped page, because that case would be handled by the initial skip

**Cases where this issue could happen:**

- During filter evaluation, if a page is skipped (although we usually evaluate only the single column required for an `ArrowPredicate`)
- During the final read phase (multiple columns with different page offsets)

**Invariant enforced by the fix:**

A mask chunk must not span a page boundary in any column.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
