vburenin commented on a change in pull request #2440:
URL: https://github.com/apache/hudi/pull/2440#discussion_r558774265



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
##########
@@ -274,19 +275,27 @@ private boolean isBlockCorrupt(int blocksize) throws IOException {
   }
 
   private long scanForNextAvailableBlockOffset() throws IOException {
+    // Make the buffer large enough to scan through the file as quickly as possible, especially if it is on S3/GCS.
+    // Using a smaller buffer incurs far more API calls, drastically increasing storage costs,
+    // and scanning large files may take days to complete.
+    byte[] dataBuf = new byte[1024 * 1024];

Review comment:
       A buffered reader has to check a few things to copy the right data, the readFully logic itself is not trivial, and the position is updated on every 6-byte read, so even without profiling I expect the overhead to be significant.
   1MB seems like a good number to me: not too much, not too little. In my past experience with filesystem I/O, blocks larger than 1MB gave diminishing returns. The ideal size would be one that matches the underlying block read size, but that depends on the reader, which can be anything.
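
To make the trade-off concrete, here is a minimal, self-contained sketch of the buffered-scan idea. It is not the Hudi implementation: the class and method names are hypothetical, and the marker value is a placeholder (Hudi's actual marker lives in `HoodieLogFormat.MAGIC`). The point is that each loop iteration issues one large read instead of one tiny read per candidate offset:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch (not the Hudi code): scan a stream for the next occurrence
// of a 6-byte magic marker using a 1 MB read buffer, so object stores such as
// S3/GCS serve the scan with a few large reads instead of many 6-byte reads.
public class MagicScanner {

  // Placeholder marker; Hudi's real value is HoodieLogFormat.MAGIC.
  private static final byte[] MAGIC = {'#', 'H', 'U', 'D', 'I', '#'};
  private static final int BUFFER_SIZE = 1024 * 1024; // 1 MB, as in the patch under review

  /** Returns the absolute offset of the next magic marker, or -1 if EOF is hit first. */
  public static long scanForNextMagic(InputStream in, long startOffset) throws IOException {
    // Extra MAGIC.length - 1 bytes of room so a marker straddling two chunks is still found.
    byte[] buf = new byte[BUFFER_SIZE + MAGIC.length - 1];
    int carry = 0;               // bytes carried over from the previous chunk
    long chunkStart = startOffset;

    while (true) {
      int read = readAtMost(in, buf, carry, BUFFER_SIZE);
      if (read <= 0) {
        return -1; // EOF, no marker found
      }
      int valid = carry + read;
      for (int i = 0; i + MAGIC.length <= valid; i++) {
        if (matchesAt(buf, i)) {
          return chunkStart - carry + i; // absolute offset of the marker
        }
      }
      // Keep the tail for the next iteration and advance the logical position.
      carry = Math.min(MAGIC.length - 1, valid);
      System.arraycopy(buf, valid - carry, buf, 0, carry);
      chunkStart += read;
    }
  }

  private static boolean matchesAt(byte[] buf, int pos) {
    for (int j = 0; j < MAGIC.length; j++) {
      if (buf[pos + j] != MAGIC[j]) {
        return false;
      }
    }
    return true;
  }

  // Read up to len bytes into buf at offset off; returns -1 only at immediate EOF.
  private static int readAtMost(InputStream in, byte[] buf, int off, int len) throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, off + total, len - total);
      if (n < 0) {
        return total == 0 ? -1 : total;
      }
      total += n;
    }
    return total;
  }
}
```

With a 1MB buffer the scan costs roughly one read request per megabyte of file, while seeking and reading a handful of bytes per candidate offset would issue orders of magnitude more calls, which is exactly the API-call and cost blow-up the code comment warns about for S3/GCS.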




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

