tsreaper edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-961764970


   @JingGe 
   
   > Have you tried to control the number of records each batchRead() will 
fetch instead of fetch all records of the current block in one shot?
   
   No I haven't. But I can think of one problem with this approach: some records may be large, for example JSON strings containing tens of thousands of characters (this is not rare in the production jobs I've seen so far). If we only control the **number** of records there is still a risk of overwhelming the memory. The other way is to control batching by the actual size of the records, which requires a method to estimate the number of bytes in each record.
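   
   To illustrate that second option, here is a minimal sketch (not part of this PR) of a batch loop bounded by estimated bytes instead of record count; `estimateSizeInBytes()` is hypothetical and is exactly the per-record estimator that would have to be written:
   
   ```java
   // Minimal sketch, not part of this PR: bound the batch by estimated bytes
   // instead of record count. `reader` is the format's record reader and
   // estimateSizeInBytes() is the hypothetical per-record size estimator.
   final long maxBatchBytes = 1 << 20; // e.g. a 1MB budget per batch
   List<GenericRecord> batch = new ArrayList<>();
   long batchBytes = 0;
   
   GenericRecord record;
   while (batchBytes < maxBatchBytes && (record = reader.read()) != null) {
       batch.add(record);
       // A pure record-count cap cannot bound memory here: a few records with
       // huge JSON strings already blow past any reasonable budget.
       batchBytes += estimateSizeInBytes(record);
   }
   ```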
   
   > how you controlled the `StreamFormat.FETCH_IO_SIZE`?
   
   The number of bytes read from the file is tracked by `StreamFormatAdapter.TrackingFsDataInputStream` and capped by `source.file.stream.io-fetch-size`, whose default value is 1MB. However, there is no use in tuning this value, because the Avro reader (I mean the reader from the Avro library) reads the whole block from the file. If the file is 2MB it will consume 2MB of bytes and, according to the current logic of `StreamFormatAdapter`, deserialize all records from that block at once. I've tried changing that config option in the benchmark and the results confirm this.
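   
   To make that concrete, here is a heavily simplified sketch of the batching logic as I understand it; the names (`trackingStream`, `newBatch()`, `bytesReadInBatch()`) are illustrative, not the actual Flink code:
   
   ```java
   // Heavily simplified sketch of the adapter's batching, not the real code.
   List<T> readBatch() throws IOException {
       trackingStream.newBatch();          // reset the per-batch byte counter
       final long fetchSize = 1 << 20;     // source.file.stream.io-fetch-size, 1MB by default
       List<T> batch = new ArrayList<>();
       T record;
       // Stop once more than fetchSize raw bytes have been pulled from the file.
       while (trackingStream.bytesReadInBatch() < fetchSize
               && (record = reader.read()) != null) {
           batch.add(record);
       }
       return batch;
   }
   // The Avro reader pulls and buffers a whole block (e.g. 2MB) from the stream
   // on its first read(), so that block sits in memory no matter how small the
   // fetch size is configured. Subsequent read() calls serve records from the
   // in-memory buffer without touching the stream, so the byte check never
   // trips and all records of the block end up deserialized into one batch.
   ```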

