JingGe commented on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-962947461


   @tsreaper 
   > No I haven't. But I can come up with one problem with this: some records 
may be large, for example JSON strings containing tens of thousands of 
characters (this is not rare in the production jobs I've seen so far). If we 
only control the **number** of records there is still a risk of overwhelming 
the memory. The other way is to control the actual size of each record, which 
requires a method to estimate the number of bytes in each record.
   
   To make the discussion easier, we are talking about benchmark data whose 
records have almost the same size. For real cases, we can control the number of 
records dynamically by controlling the bytes read from the input stream, e.g. 
in each batchRead, read 5 records when records are large and 50 when they are 
small.
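   The idea above can be sketched as follows. This is not the actual Flink 
implementation, just a minimal illustration assuming a fixed per-batch byte 
budget (analogous to `StreamFormat.FETCH_IO_SIZE`) and a rough per-record size 
estimate; the class and method names are hypothetical:

   ```java
   // Hypothetical sketch: derive a per-batch record count from a byte budget,
   // so large records get small batches and small records get large batches.
   public class BatchSizer {
       private final long batchByteBudget; // e.g. 1 MB, cf. FETCH_IO_SIZE

       public BatchSizer(long batchByteBudget) {
           this.batchByteBudget = batchByteBudget;
       }

       // Estimate how many records fit into the byte budget, given an
       // (estimated) average record size; always read at least one record.
       public int recordsPerBatch(long avgRecordBytes) {
           if (avgRecordBytes <= 0) {
               return 1;
           }
           return (int) Math.max(1, batchByteBudget / avgRecordBytes);
       }
   }
   ```

   With a 1 MB budget, ~200 KB records yield batches of 5 while ~20 KB records 
yield batches of about 50, matching the numbers above.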
   
   > 
   > > how you controlled the `StreamFormat.FETCH_IO_SIZE`?
   > 
   > The number of bytes read from the file is controlled by 
`StreamFormatAdapter.TrackingFsDataInputStream`. It is controlled by 
`source.file.stream.io-fetch-size`, whose default value is 1MB. However, there 
is no use in tuning this value because the Avro reader (I mean the reader from 
the Avro library) will read the whole block from the file. If the file size is 
2MB it will consume 2MB of bytes and, according to the current logic of 
`StreamFormatAdapter`, deserialize all records from that block at once. I've 
tried to change that config option in the benchmark and it proves me right.
   
   If you take a close look at the implementation of 
`TrackingFsDataInputStream`, you will see how it uses 
`StreamFormat.FETCH_IO_SIZE` to control how many records will be 
read/deserialized from the Avro block in each batchRead().
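   The mechanism can be illustrated with a simplified stand-in (this is not 
`TrackingFsDataInputStream` itself; the class and method names here are 
hypothetical): a wrapper counts the bytes consumed from the underlying stream, 
and the caller ends the current batch once the configured fetch size has been 
reached, so the record count per batch adapts to record size automatically.

   ```java
   import java.io.FilterInputStream;
   import java.io.IOException;
   import java.io.InputStream;

   // Simplified sketch of a byte-tracking stream: count consumed bytes and
   // let the caller stop the current batch once the fetch size is reached.
   class TrackingInputStream extends FilterInputStream {
       private final long fetchSize;          // cf. StreamFormat.FETCH_IO_SIZE
       private long bytesSinceBatchStart;

       TrackingInputStream(InputStream in, long fetchSize) {
           super(in);
           this.fetchSize = fetchSize;
       }

       @Override
       public int read() throws IOException {
           int b = super.read();
           if (b >= 0) {
               bytesSinceBatchStart++;
           }
           return b;
       }

       @Override
       public int read(byte[] buf, int off, int len) throws IOException {
           int n = super.read(buf, off, len);
           if (n > 0) {
               bytesSinceBatchStart += n;
           }
           return n;
       }

       // The reader checks this after each record; when true, it ends the
       // current batch, bounding the memory used per batch.
       boolean batchLimitReached() {
           return bytesSinceBatchStart >= fetchSize;
       }

       void startNewBatch() {
           bytesSinceBatchStart = 0;
       }
   }
   ```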
   
   Anyway, the benchmark results tell us the truth. Thanks again for sharing 
them. We will dive deeper to figure out why using StreamFormat has these 
memory issues.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
