ahshahid opened a new issue, #6424:
URL: https://github.com/apache/iceberg/issues/6424
### Apache Iceberg version
main (development)
### Query engine
Spark
### Please describe the bug 🐞
The row-count estimation formula used for non-partition columns, as seen in `ContentScanTask`, is currently:

```java
default long estimatedRowsCount() {
  double scannedFileFraction = ((double) length()) / file().fileSizeInBytes();
  return (long) (scannedFileFraction * file().recordCount());
}
```
IMO it should be:

```java
(file().fileSizeInBytes() * file().recordCount()) / length()
```
We are estimating the full row count from a scan of part of the file and the rows contained in that part.

The current formula is wrong, because `scannedFileFraction` is bound to be <= 1, so the full row count has to be >= `file().recordCount()`. But `full row count = scannedFileFraction * file().recordCount()` implies that `full row count <= file().recordCount()`, which is incorrect.
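A minimal standalone sketch (not Iceberg code; the file size, record count, and task length are hypothetical) comparing what the two formulas produce for a task that covers a quarter of a file:

```java
public class RowCountEstimate {
  // Current formula in ContentScanTask.estimatedRowsCount():
  // fraction of the file covered by this task, times the file's record count.
  static long currentEstimate(long taskLength, long fileSizeInBytes, long recordCount) {
    double scannedFileFraction = ((double) taskLength) / fileSizeInBytes;
    return (long) (scannedFileFraction * recordCount);
  }

  // Formula proposed in this issue:
  // scale the record count up by the inverse of the scanned fraction.
  static long proposedEstimate(long taskLength, long fileSizeInBytes, long recordCount) {
    return (fileSizeInBytes * recordCount) / taskLength;
  }

  public static void main(String[] args) {
    long fileSize = 1_000_000L; // hypothetical: 1 MB file
    long records = 10_000L;     // hypothetical: 10,000 records in the file
    long taskLen = 250_000L;    // hypothetical: task covers 250 KB of it

    System.out.println("current  = " + currentEstimate(taskLen, fileSize, records));  // 2500
    System.out.println("proposed = " + proposedEstimate(taskLen, fileSize, records)); // 40000
  }
}
```

With these numbers the current formula returns a quarter of the file's record count, while the proposed one returns four times it, which is the divergence the issue is describing.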
I have a bug test which shows that, because of this, inefficient broadcast hash joins are getting created.
Will create a PR & bug test tomorrow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]