ahshahid opened a new issue, #6424:
URL: https://github.com/apache/iceberg/issues/6424
### Apache Iceberg version
main (development)
### Query engine
Spark
### Please describe the bug 🐞
The row-count estimation formula used for non-partition columns, as seen in `ContentScanTask`, is currently:

```java
default long estimatedRowsCount() {
  double scannedFileFraction = ((double) length()) / file().fileSizeInBytes();
  return (long) (scannedFileFraction * file().recordCount());
}
```
IMO it should be:

```java
(file().fileSizeInBytes() * file().recordCount()) / length()
```
We are estimating the full row count from a scan of part of the file and the rows contained in that part.

The current formula is wrong, because `scannedFileFraction` is bound to be <= 1, so the full row count has to be >= `file().recordCount()`. But `full row count = scannedFileFraction * file().recordCount()` implies that `full row count <= file().recordCount()`, which is incorrect.
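A minimal standalone sketch (not Iceberg code; the file size, record count, and task length are hypothetical) comparing what the two formulas produce for a task that covers a quarter of a file:

```java
public class RowCountEstimate {
  // Current formula in ContentScanTask.estimatedRowsCount():
  // fraction of the file covered by this task, times the file's record count.
  static long currentEstimate(long taskLength, long fileSizeInBytes, long recordCount) {
    double scannedFileFraction = ((double) taskLength) / fileSizeInBytes;
    return (long) (scannedFileFraction * recordCount);
  }

  // Formula proposed in this issue:
  // scale the record count up by the inverse of the scanned fraction.
  static long proposedEstimate(long taskLength, long fileSizeInBytes, long recordCount) {
    return (fileSizeInBytes * recordCount) / taskLength;
  }

  public static void main(String[] args) {
    long fileSize = 1_000_000L; // hypothetical: 1 MB file
    long records = 10_000L;     // hypothetical: 10,000 records in the file
    long taskLen = 250_000L;    // hypothetical: task covers 250 KB of it

    System.out.println("current  = " + currentEstimate(taskLen, fileSize, records));  // 2500
    System.out.println("proposed = " + proposedEstimate(taskLen, fileSize, records)); // 40000
  }
}
```

With these numbers the current formula returns a quarter of the file's record count, while the proposed one returns four times it, which is the divergence the issue is describing.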
I have a bug test which shows that, because of this, inefficient broadcast hash joins are getting created.
Will create a PR & bug test tomorrow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]