WinkerDu opened a new pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073


   In our use case, we use the V2 format to support streaming CDC row-level inserts / deletes. A scan task can end up with a huge number of delete files, both for small-file rewrite actions and for query scans such as a typical Spark batch scan.
   
   The existing scan task bin-packing logic is based only on data file size: a scan task contains data files whose total size meets a given target size. This works fine for the V1 format, since a task only deals with data files.
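
   For reference, here is a minimal, self-contained sketch of that size-only packing (the names `packBySize`, `weigher`, and `targetSize` are illustrative, not the actual Iceberg `BinPacking` API): items are appended to the current bin until the running size would exceed the target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class SizeBinPacking {
  // Packs items into bins so each bin's total weight stays within targetSize.
  // Illustrative sketch only, not the actual Iceberg bin-packing implementation.
  static <T> List<List<T>> packBySize(Iterable<T> items, ToLongFunction<T> weigher, long targetSize) {
    List<List<T>> bins = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long currentWeight = 0;
    for (T item : items) {
      long weight = weigher.applyAsLong(item);
      if (!current.isEmpty() && currentWeight + weight > targetSize) {
        bins.add(current);           // close the full bin and start a new one
        current = new ArrayList<>();
        currentWeight = 0;
      }
      current.add(item);
      currentWeight += weight;
    }
    if (!current.isEmpty()) {
      bins.add(current);
    }
    return bins;
  }
}
```

   Note that nothing above bounds how many files land in one bin; only their total size matters.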
   
   But for the V2 format, scan task cost also includes the cost of applying delete files. Suppose the bin-packing target size is 128 MB, but the data files are small due to streaming CDC updates, say 1 MB each, and each data file has to apply 128 valid eq-delete / pos-delete files. A single task then packs 128 data files and applies 128 * 128 = 16,384 delete files in total. Even with enough CPU cores to run many tasks in parallel, we cannot improve the scan performance of that one task.
   
   This PR introduces a new configuration that limits the number of items per bin during bin-packing iteration (default Integer.MAX_VALUE). Setting it caps the scale of each task, spreading the work across more tasks so global scan performance improves and computing resources are fully utilized.
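
   As a rough sketch of the behavior such a limit adds (the parameter name `maxItemsPerBin` is illustrative, not the actual configuration key or API introduced by this PR), a bin is now closed once either the size target or the item-count limit is reached:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class BoundedBinPacking {
  // Same size-based packing as the sketch above, but a bin is also closed once
  // it holds maxItemsPerBin entries. Integer.MAX_VALUE keeps the old behavior.
  static <T> List<List<T>> pack(Iterable<T> items, ToLongFunction<T> weigher,
                                long targetSize, int maxItemsPerBin) {
    List<List<T>> bins = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long currentWeight = 0;
    for (T item : items) {
      long weight = weigher.applyAsLong(item);
      boolean binFull = !current.isEmpty()
          && (currentWeight + weight > targetSize || current.size() >= maxItemsPerBin);
      if (binFull) {
        bins.add(current);
        current = new ArrayList<>();
        currentWeight = 0;
      }
      current.add(item);
      currentWeight += weight;
    }
    if (!current.isEmpty()) {
      bins.add(current);
    }
    return bins;
  }
}
```

   With the default of Integer.MAX_VALUE the item-count check never triggers, so the existing size-only packing is unchanged.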


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


