[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

GitBox Wed, 14 Dec 2022 07:49:17 -0800


ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351671855


   @RussellSpitzer Right, I missed the modification of " - splitOffset".
   
   However, I think the bug in the formula still remains.
   
   My reasoning is as follows:
   The function estimatedRowCounts has to estimate the total row count of a split/file (or a single file) by analyzing a fraction of the split (file),
   which means that
   total row count of a split/file >= row count of the scanned fraction (which is what we call the record count).
   
   Now, if total row count of a split/file = (scannedFileFraction * file().recordCount())
   and the scanned fraction is <= 1,
   this would result in total row count <= the fraction's record count, which contradicts the inequality above.
   
   The change I proposed is based on this ratio/proportion:
   
   when the scanned file/split size is length(), the row count is file().recordCount();
   so when the total size of the file/split is (file().fileSizeInBytes() - splitOffset), the total count X = ?
   
   X = (file().recordCount() * (file().fileSizeInBytes() - splitOffset)) / length()
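
To make the two formulas concrete, here is a minimal sketch that evaluates both on made-up numbers. The class and method names are hypothetical and are not Iceberg API. It also assumes that scannedFileFraction is computed as length() / (file().fileSizeInBytes() - splitOffset), which is only inferred from the discussion above, and it takes the reading used in this comment, where file().recordCount() is the row count of the scanned part.

```java
// Hypothetical helper, not part of Iceberg: compares the two estimates discussed above.
class SplitRowEstimate {

    // Formula as described in the comment (assumed definition of scannedFileFraction):
    // scannedFileFraction * recordCount, where
    // scannedFileFraction = length / (fileSizeInBytes - splitOffset).
    static long currentEstimate(long recordCount, long fileSizeInBytes,
                                long splitOffset, long length) {
        double scannedFileFraction = (double) length / (fileSizeInBytes - splitOffset);
        return (long) (scannedFileFraction * recordCount);
    }

    // Proposed proportion: X = recordCount * (fileSizeInBytes - splitOffset) / length.
    static long proposedEstimate(long recordCount, long fileSizeInBytes,
                                 long splitOffset, long length) {
        return (long) ((double) recordCount * (fileSizeInBytes - splitOffset) / length);
    }

    public static void main(String[] args) {
        // Illustrative values only: 1000 rows, a 5120-byte file,
        // split starting at byte 1024, 1024 bytes scanned.
        long recordCount = 1000L;
        long fileSizeInBytes = 5120L;
        long splitOffset = 1024L;
        long length = 1024L;

        System.out.println("current:  "
            + currentEstimate(recordCount, fileSizeInBytes, splitOffset, length));
        System.out.println("proposed: "
            + proposedEstimate(recordCount, fileSizeInBytes, splitOffset, length));
    }
}
```

With these values the scanned fraction is 0.25, so the first formula yields 250 (below recordCount) and the proposed one yields 4000 (above it). Which of the two is appropriate depends on whether file().recordCount() describes the whole file or only the scanned part, which is the point in question.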
   
   Do you think my understanding of the objective of the function is correct?
   
   


