[ https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6116:
---------------------------------
    Fix Version/s:     (was: 1.0.0)

> Optimize log block reading by removing seeks to check corrupted blocks
> ----------------------------------------------------------------------
>
>                 Key: HUDI-6116
>                 URL: https://issues.apache.org/jira/browse/HUDI-6116
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>
> The code currently performs an eager isCorruptedCheck, for which we do a
> seek followed by a read. This invalidates the internal buffers of the open
> file stream to the log file and triggers a call to the DataNode to start a
> new blockReader.
> The cost of the seek + read becomes apparent in cross-datacenter reads, or
> wherever the latency to the file is high. In such cases, a single RPC costs
> us about 120 ms (west coast to east coast) plus the cost of the RPC itself,
> so this seek is bad for performance.
> Delaying the corruption check also brings benefits in low-latency
> environments, where we see read times drop from 5-8 s to between 3 s and
> under 500 ms for moderately sized files of 250 MB.
> NOTE: The more log blocks there are to read, the greater the performance
> improvement.
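
To illustrate the trade-off described in the issue, here is a minimal Python sketch. It is not Hudi's actual reader: the block layout (magic, length, payload, length trailer), the CountingStream wrapper, and both reader functions are hypothetical. Each seek() call stands in for the round trip that a seek followed by a read would cost on a remote file.

```python
import io
import struct

MAGIC = b"#BLK"  # hypothetical block marker, not Hudi's on-disk format


class CountingStream:
    """Wraps an in-memory stream and counts seek() calls."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.seeks = 0

    def read(self, n: int) -> bytes:
        return self._buf.read(n)

    def seek(self, pos: int) -> None:
        self.seeks += 1
        self._buf.seek(pos)

    def tell(self) -> int:
        return self._buf.tell()


def write_block(payload: bytes) -> bytes:
    # Assumed layout: magic | 4-byte length | payload | 4-byte length trailer
    size = struct.pack(">I", len(payload))
    return MAGIC + size + payload + size


def _read_header(stream):
    """Return the payload size, or None at end of stream."""
    header = stream.read(8)
    if len(header) < 8:
        return None
    if header[:4] != MAGIC:
        raise ValueError("bad magic")
    return struct.unpack(">I", header[4:])[0]


def read_blocks_eager(stream):
    """Check the trailer BEFORE reading the payload: two seeks per block."""
    blocks = []
    while (size := _read_header(stream)) is not None:
        payload_start = stream.tell()
        stream.seek(payload_start + size)      # jump ahead to the trailer
        if stream.read(4) != struct.pack(">I", size):
            raise ValueError("corrupted block")
        stream.seek(payload_start)             # jump back to the payload
        blocks.append(stream.read(size))
        stream.read(4)                         # step over the trailer
    return blocks


def read_blocks_lazy(stream):
    """Read sequentially and check the trailer AFTER the payload: no seeks."""
    blocks = []
    while (size := _read_header(stream)) is not None:
        payload = stream.read(size)
        trailer = stream.read(4)
        if len(payload) != size or trailer != struct.pack(">I", size):
            raise ValueError("corrupted block")
        blocks.append(payload)
    return blocks


if __name__ == "__main__":
    data = b"".join(write_block(p) for p in (b"a", b"bb", b"ccc"))
    eager, lazy = CountingStream(data), CountingStream(data)
    read_blocks_eager(eager)
    read_blocks_lazy(lazy)
    print("eager seeks:", eager.seeks, "lazy seeks:", lazy.seeks)
```

In this sketch the eager reader issues two seeks per block while the lazy reader issues none, so the gap widens as the number of blocks grows, which is consistent with the NOTE in the issue.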