lidavidm commented on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-620922987
One thing for consideration: we've been looking at using this while buffering multiple files in memory. (That way, any unexpected I/O latency for subsequent files is hidden while the first Parquet file is being processed.) One thing we've noticed is that the coalescing is often self-defeating. If you have three files A, B, and C, oftentimes C will finish loading before B, so you're unnecessarily blocked on B. It's even worse with the default internal thread pool: C occupies slots in that pool that could otherwise have been used to finish B more quickly.

It would be ideal if we could block C from starting I/O until B has at least started all of its I/O, and similarly for B with respect to A. The current APIs aren't powerful enough to implement this: even if you overrode RandomAccessFile::ReadAsync, you wouldn't know the full set of ranges to be read, and so couldn't use that information to schedule I/O.

I realize Datasets has similar concepts: a sophisticated client can get individual FileFragments and evaluate them with its own scheduling. That by itself isn't enough to give us what we want, but it's worth considering, and it would give another performance boost for large datasets.

To work towards implementing scenarios like this, would a different approach than the one here be preferable? Instead of having the Parquet reader control coalescing and I/O, we could have an API that returns the byte ranges that would be read for a given combination of columns and row groups. We could then refactor the reader APIs to accept a read coalescer that the caller has pre-populated.

Or should I take this back to the mailing list (with more context)? This starts to feel close to the question of optimizing I/O inside of Datasets, and I admit I'm not up to date with how that part of the project has progressed.
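To make the ordering constraint concrete, here's a minimal sketch (plain Python threading, not an Arrow API; all names are illustrative) of gating reads so that file i cannot issue its first read until file i-1 has issued all of its reads:

```python
import threading

class OrderedIOGate:
    """Illustrative sketch: file i may start issuing reads only after
    file i-1 has *issued* (not necessarily completed) all of its reads."""

    def __init__(self, n_files):
        self._issued = [threading.Event() for _ in range(n_files)]

    def wait_turn(self, index):
        # Block until the predecessor file has everything in flight.
        if index > 0:
            self._issued[index - 1].wait()

    def done_issuing(self, index):
        self._issued[index].set()

def read_file(gate, index, log, lock):
    """Stand-in for a per-file reader: waits its turn, 'issues' its reads,
    then unblocks the next file."""
    gate.wait_turn(index)
    with lock:
        log.append(index)  # stand-in for issuing all coalesced reads for this file
    gate.done_issuing(index)
```

Under this scheme, even if the OS completes C's reads out of order, C never occupies thread-pool slots before B has all of its I/O in flight.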
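And a rough sketch of what the byte-range-driven approach could look like (hypothetical names and a fabricated metadata mapping for illustration; a real reader would derive ranges from the file's column-chunk metadata):

```python
def read_ranges_for(columns, row_groups, column_chunk_extents):
    """Return sorted (offset, length) pairs a read of these columns/row groups
    would touch. column_chunk_extents maps (row_group, column) -> (offset, length);
    this mapping is fabricated here for illustration."""
    return sorted(column_chunk_extents[(rg, col)]
                  for rg in row_groups for col in columns)

def coalesce(ranges, hole_size_limit=8192):
    """Merge sorted ranges whose gap is at most hole_size_limit, so the caller
    can pre-populate a read coalescer and schedule the combined reads itself."""
    merged = []
    for offset, length in ranges:
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= hole_size_limit:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    return merged
```

With an API like this, a caller buffering files A, B, and C could compute each file's coalesced ranges up front and decide the cross-file I/O order itself, rather than having each reader schedule its own reads independently.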