[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

GitBox Wed, 14 Dec 2022 10:08:25 -0800


ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351885991


   Right, I agree with that. Maybe along with the split offset (which is the 
start of the split), we also need the end of the split.
   
   But still, please allow me to describe this simplified case, where the 
split is the same as the file being considered, so that the split offset is 0,
   and assume split size = file.size.
   
   Now, as per the current formula:
   splitOffset = 0
   so double scannedFileFraction = ((double) **length()**) / 
(file().fileSizeInBytes()); 
   Here **length()** is the number of bytes scanned (the file is only 
partially read)
   and **file().recordCount()** is the number of records read in that partial 
scan,
   right?
   Now scannedFileFraction is < 1 (assuming the whole file is not scanned).
   Say scannedFileFraction = 0.5
   and numRowsInScan = recordCount = 50.
   
   So ideally the total estimated row count in that file should be 100.
   
   But with the current formula we get total row count = 0.5 * 50 = 25.
   
   With the corrected formula it will be 50 / 0.5 = 100,
   
   because in the corrected formula the fraction is (file().fileSizeInBytes()) 
/ length() = 2.
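
   To make the arithmetic concrete, here is a minimal sketch of the two 
calculations. It is not Iceberg code: the class name, method names, and the 
byte sizes are made up for illustration, and it relies on the premise above 
that recordCount is the number of rows in the scanned part only.

```java
// Hypothetical helper, not Iceberg code: names and inputs are illustrative.
final class SplitRowEstimate {

    // Formula as quoted above: multiply the record count by length / fileSize.
    static long currentEstimate(long length, long fileSizeInBytes, long recordCount) {
        double scannedFileFraction = ((double) length) / fileSizeInBytes;
        return (long) (scannedFileFraction * recordCount);
    }

    // Proposed variant: divide by the same fraction, i.e. multiply by fileSize / length.
    static long proposedEstimate(long length, long fileSizeInBytes, long recordCount) {
        double scannedFileFraction = ((double) length) / fileSizeInBytes;
        return (long) (recordCount / scannedFileFraction);
    }

    public static void main(String[] args) {
        long fileSizeInBytes = 1024; // assumed file size
        long length = 512;           // half the file, so the fraction is 0.5
        long recordCount = 50;       // rows in the scanned part (the premise above)

        System.out.println(currentEstimate(length, fileSizeInBytes, recordCount));  // prints 25
        System.out.println(proposedEstimate(length, fileSizeInBytes, recordCount)); // prints 100
    }
}
```

   Which of the two is right depends entirely on what recordCount covers: if it 
is the row count of the whole file rather than of the scanned part, 
multiplying by the fraction gives the rows in the split instead.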
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


