lidavidm commented on pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#issuecomment-620922987


   One thing for consideration: we've been looking at using this to buffer
multiple files in memory. (This way, any unexpected I/O latency for
subsequent files is hidden while the first Parquet file is being processed.)
We've noticed that the coalescing is often self-defeating. If you have
three files A, B, and C, oftentimes C will finish loading before B, and so
you're unnecessarily blocked waiting on B. It's even worse because, with the
default internal thread pool, C is occupying slots in that pool that could
otherwise have been used to finish B more quickly. It would be ideal if we
could block C from starting I/O until B has at least started all of its I/O,
and similarly block B until A has.
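
   As a rough sketch of that ordering rule (not an existing Arrow API; the
class name `OrderedStartGate` and its methods are hypothetical), a caller-side
gate could track which files have issued all of their reads and only let a file
start once its predecessor has done so. The version below is single-threaded
for clarity; a real one would need synchronization around the shared state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Arrow: enforces that file i may begin I/O
// only after file i - 1 has issued (not necessarily completed) all its reads.
// Not thread-safe; a real scheduler would guard this with a mutex.
class OrderedStartGate {
 public:
  explicit OrderedStartGate(std::size_t num_files)
      : all_issued_(num_files, false) {}

  // True if file `i` is allowed to start issuing reads.
  bool MayStart(std::size_t i) const {
    assert(i < all_issued_.size());
    return i == 0 || all_issued_[i - 1];
  }

  // Called once every read for file `i` has been submitted to the pool.
  void MarkAllReadsIssued(std::size_t i) {
    assert(i < all_issued_.size());
    all_issued_[i] = true;
  }

 private:
  std::vector<bool> all_issued_;
};
```

   With files A, B, and C at indices 0, 1, and 2, C stays gated until B calls
`MarkAllReadsIssued(1)`, so C cannot take pool slots that B still needs.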
   
   The current APIs aren't powerful enough to implement this: even if you
overrode RandomAccessFile::ReadAsync, you wouldn't know in advance the full
set of ranges that will be read, so you couldn't use that information to
schedule I/O.
   
   I realize Datasets has similar concepts: a sophisticated client can get 
individual FileFragments and evaluate them with its own scheduling. This by 
itself isn't enough to give us what we want, but it's worth considering, and 
would give another performance boost for large datasets.
   
   To work towards implementing scenarios like this, would a different approach 
than the one here be preferable? Instead of having the Parquet reader control 
coalescing and I/O, we could have some API to get the byte ranges that would be 
read for a given combination of columns and row groups. We could then refactor 
reader APIs to accept a read coalescer that the caller should have 
pre-populated.
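
   To illustrate what the caller could do once it has the byte ranges, here is
a minimal, standalone sketch of range coalescing. The `ByteRange` struct, the
`CoalesceRanges` function, and the `hole_size_limit` parameter are assumptions
made for this example, not the reader API proposed in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type: the half-open byte range [offset, offset + length).
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Sort the ranges by offset and merge any two whose gap is at most
// `hole_size_limit` bytes (overlapping ranges are always merged).
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit) {
  assert(hole_size_limit >= 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ByteRange> out;
  for (const ByteRange& r : ranges) {
    if (r.length <= 0) continue;  // skip empty ranges
    if (!out.empty()) {
      ByteRange& last = out.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        // Close enough: extend the previous range to cover this one.
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    out.push_back(r);
  }
  return out;
}
```

   Because the caller sees the merged ranges for every file up front, it can
decide how to order or throttle the reads across files before any I/O starts.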
   
   Or should I take this back to the mailing list (with more context)? This
starts to feel close to the question of optimizing I/O inside of Datasets, and
I admit I'm not up to date with how the project has progressed.



