n3nash commented on a change in pull request #2440:
URL: https://github.com/apache/hudi/pull/2440#discussion_r558822909



##########
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
##########
@@ -274,19 +275,27 @@ private boolean isBlockCorrupt(int blocksize) throws 
IOException {
   }
 
   private long scanForNextAvailableBlockOffset() throws IOException {
+    // Make buffer large enough to scan through the file as quick as possible 
especially if it is on S3/GCS.
+    // Using lower buffer is incurring a lot of API calls thus drastically 
increasing the cost of the storage
+    // and also may take days to complete scanning trough the large files.
+    byte[] dataBuf = new byte[1024 * 1024];

Review comment:
       @vburenin According to the current code, we still are using 
BufferedReader for all cases except GCS, so that doesn't go away with this code 
in a generic way. Additionally, we need buffered reader code (the one I pointed 
above) anyways for GCS in the happy code path (without the need to find the 
magic header in corrupt blocks) since this method is only called when it 
encounters a corrupt block. 
   1) readFully does a bunch of if conditions so branching could cause some 
perf degradation here, don't see any other extra logic apart from copying bytes 
which is the same for irrespective of doing 6 bytes vs 1MB
   2) Position modification in BufferedInputStream is a variable assignment 
which should not cause any overhead. 
   
   Agree with you that the best number should be the one that matches the 
underlying block size. It would be great if you can do some microbenchmarking 
here. I'm OK to land this once you can add the if check for BufferedInputStream 
since that is needed anyways ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to