tsreaper edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-961764970


   @JingGe 
   
   > Have you tried to control the number of records each batchRead() will 
fetch instead of fetch all records of the current block in one shot?
   
   No, I haven't. But I can see two problems with this approach:
   1. Some records may be large, for example JSON strings containing tens of thousands of characters (this is not rare in the production jobs I've seen so far). If we only control the **number** of records, there is still a risk of overwhelming the memory. The alternative is to control the actual number of bytes fetched per batch, which requires a way to estimate the size of each record (see the sketch after this list).
   2. The reader must be kept open until the whole block is deserialized. If we only deserialize a portion of a block in each batch, we still need that block pool to prevent the reader from being closed too early.
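   
   To make point 1 concrete, here is a minimal sketch (not the PR code) of what a size-bounded batch read over an Avro `DataFileStream` could look like. The class name `SizeBoundedBatchReader` and the helper `estimateSizeInBytes` are hypothetical; Avro does not expose a per-record byte size, so a real implementation would need its own estimator.
   
   ```java
   import org.apache.avro.file.DataFileStream;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericRecord;
   
   import java.io.IOException;
   import java.io.InputStream;
   import java.util.ArrayList;
   import java.util.List;
   
   public class SizeBoundedBatchReader {
   
       /**
        * Reads at most maxRecords records, stopping early once the estimated
        * accumulated size exceeds maxBytes. The stream must stay open between
        * calls because the current block may be only partially consumed.
        */
       static List<GenericRecord> readBatch(
               DataFileStream<GenericRecord> stream, int maxRecords, long maxBytes)
               throws IOException {
           List<GenericRecord> batch = new ArrayList<>();
           long estimatedBytes = 0;
           while (batch.size() < maxRecords && estimatedBytes < maxBytes && stream.hasNext()) {
               GenericRecord record = stream.next();
               batch.add(record);
               estimatedBytes += estimateSizeInBytes(record);
           }
           return batch;
       }
   
       // Hypothetical, very rough estimator; a production version would have to
       // account for the actual serialized or in-memory size of each record.
       static long estimateSizeInBytes(GenericRecord record) {
           return record.toString().length();
       }
   
       public static void main(String[] args) throws IOException {
           InputStream in = System.in; // stand-in for a file input stream
           try (DataFileStream<GenericRecord> stream =
                   new DataFileStream<>(in, new GenericDatumReader<>())) {
               List<GenericRecord> batch = readBatch(stream, 1000, 1 << 20);
               System.out.println("Read " + batch.size() + " records in this batch");
           }
       }
   }
   ```
   
   Even with such a cap, the underlying reader (and the block it has buffered) has to survive across batches, which is exactly why the block pool in point 2 is needed.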
   
   > how you controlled the `StreamFormat.FETCH_IO_SIZE`?
   
   The number of bytes read from the file is tracked by `StreamFormatAdapter.TrackingFsDataInputStream` and limited by `source.file.stream.io-fetch-size`, whose default value is 1MB. However, there is no use in tuning this value, because the Avro reader (I mean the reader from the Avro library) will read the whole block from the file. If the file size is 2MB it will consume 2MB of bytes and, according to the current logic of `StreamFormatAdapter`, deserialize all records from that block at once. I've tried changing that config option in the benchmark, and the results bear this out.
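   
   For reference, here is how that option would be tuned (a minimal sketch, assuming the `StreamFormat.FETCH_IO_SIZE` option and Flink's `Configuration`/`MemorySize` classes). As described above, changing it does not change the fact that the Avro reader consumes a whole block per fetch:
   
   ```java
   import org.apache.flink.configuration.Configuration;
   import org.apache.flink.configuration.MemorySize;
   import org.apache.flink.connector.file.src.reader.StreamFormat;
   
   public class FetchSizeExample {
       public static void main(String[] args) {
           Configuration conf = new Configuration();
           // Backing key is "source.file.stream.io-fetch-size"; the default is 1mb.
           conf.set(StreamFormat.FETCH_IO_SIZE, MemorySize.ofMebiBytes(4));
           System.out.println(conf.get(StreamFormat.FETCH_IO_SIZE));
       }
   }
   ```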

