gudladona commented on PR #18241: URL: https://github.com/apache/hudi/pull/18241#issuecomment-3961255857
> Hi @gudladona Thanks for this contribution! In general, there are two questions that I wonder if you could elaborate on:
>
> 1. Whole-File In-Memory Processing: Implemented a "Read Whole File" strategy for files smaller than 2GB. Do we need to cache the entire file here, or is IO at the fg granularity sufficient? This is mainly a consideration of memory pressure.

I'll assume you mean row group, not file group (fg). I cached the entire file because reading the full file is necessary regardless; in the previous implementation, reads at column-chunk granularity put enormous pressure on the s3a client, causing timeouts. Caching the whole file reduces thousands of ranged GETs to a few S3 operations, depending on the file size. I considered implementing this at the row-group level, but that would require additional range GETs for the file metadata, which is IO that has to happen anyway. So my thinking is that a single whole-object GET amortizes well across the entire operation.

> 2. Double-Buffer: Do we definitely need this Double-Buffer? For binary copy, the CPU pressure itself is relatively low, and the overall bottleneck lies in the IO interaction with remote storage. It seems that using a double buffer for caching here is not of great practical significance.

The double buffer is an optimization that lets a background thread keep the next file "ready", since nothing is IO-bound during the copy operation itself. When the source files are large, with multiple row groups and thousands of column chunks (true in our case), this concurrent operation helped squeeze out additional performance.
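To make the double-buffer point concrete, here is a minimal sketch of the idea (not the PR's actual classes; `fetch` and `copy` are hypothetical stand-ins for the whole-file S3 GET and the binary copy): a single background thread prefetches file N+1 while the main thread copies file N, so the CPU-light copy loop never stalls on remote IO between files.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the double-buffer strategy described above.
public class DoubleBufferCopy {

    // Stand-in for the remote read; in the PR this would be a whole-file S3 GET.
    static byte[] fetch(String file) {
        return ("contents-of-" + file).getBytes();
    }

    // Stand-in for the binary copy; CPU-light, so it overlaps well with the next fetch.
    static int copy(byte[] data) {
        return data.length;
    }

    public static long copyAll(List<String> files) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        long totalBytes = 0;
        try {
            // Prime the first buffer.
            Future<byte[]> next = io.submit(() -> fetch(files.get(0)));
            for (int i = 0; i < files.size(); i++) {
                byte[] current = next.get();           // wait for the buffered file
                if (i + 1 < files.size()) {
                    final String f = files.get(i + 1);
                    next = io.submit(() -> fetch(f));  // prefetch while we copy
                }
                totalBytes += copy(current);           // copy overlaps with the fetch
            }
        } finally {
            io.shutdown();
        }
        return totalBytes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(copyAll(List.of("a.parquet", "b.parquet")));
    }
}
```

The key property is that `copy(current)` and `fetch(files.get(i + 1))` run concurrently, so the end-to-end time approaches max(IO, copy) per file instead of IO + copy.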
