ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351885991
Right, I agree with that. Maybe along with the split offset (which is the start of the split), we also need the end of the split.

But still, please allow me to describe this simplified case, where the split is the same as the file being considered, so that the split offset is 0, and assume split size = file size.

As per the current formula, with splitOffset = 0:

double scannedFileFraction = ((double) **length()**) / (file().fileSizeInBytes());

Here **length()** is the number of bytes scanned (the file is only partially read) and **file().recordCount()** is the number of records read in that partial scan... right?

Now scannedFileFraction is < 1 (assume that the whole file is not scanned). Say scannedFileFraction = 0.5 and numRowsInScan = recordCount = 50. Then ideally the total estimated row count for that file should be 100.

But with the current formula we get total row count = 0.5 * 50 = 25.
With the corrected formula it would be 50 / 0.5 = 100, because in the corrected formula the fraction is (file().fileSizeInBytes()) / length() = 2, and 2 * 50 = 100.
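To make the arithmetic in this comment concrete, here is a minimal Java sketch that evaluates both expressions for the numbers used above. The class name `RowEstimateSketch`, the method names, and the byte sizes (500 of 1000 bytes scanned) are hypothetical and chosen only to give a fraction of 0.5. It is not the Iceberg implementation, and it takes the comment's reading of `recordCount` (rows seen in the partial scan) as an assumption, which is the very point the comment asks to be confirmed.

```java
public class RowEstimateSketch {

    // Formula as quoted in the comment: scanned fraction multiplied by recordCount.
    static long currentEstimate(long scannedBytes, long fileSizeInBytes, long recordCount) {
        double scannedFileFraction = ((double) scannedBytes) / fileSizeInBytes;
        return (long) (scannedFileFraction * recordCount);
    }

    // Variant proposed in the comment: recordCount divided by the scanned fraction,
    // which equals recordCount * (fileSizeInBytes / scannedBytes).
    static long proposedEstimate(long scannedBytes, long fileSizeInBytes, long recordCount) {
        double scannedFileFraction = ((double) scannedBytes) / fileSizeInBytes;
        return (long) (recordCount / scannedFileFraction);
    }

    public static void main(String[] args) {
        long fileSizeInBytes = 1000; // hypothetical file size
        long scannedBytes = 500;     // hypothetical length(), giving a fraction of 0.5
        long recordCount = 50;       // assumed (per the comment) to be rows in the partial scan

        System.out.println("current formula:  "
                + currentEstimate(scannedBytes, fileSizeInBytes, recordCount));
        System.out.println("proposed formula: "
                + proposedEstimate(scannedBytes, fileSizeInBytes, recordCount));
    }
}
```

With these inputs the first expression evaluates to 25 and the second to 100, matching the numbers in the comment. Which of the two is the right estimate depends entirely on whether `recordCount` describes the scanned portion or the whole file.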
