Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

via GitHub Tue, 30 Jul 2024 00:34:26 -0700


SemyonSinchenko commented on PR #46368:
URL: https://github.com/apache/spark/pull/46368#issuecomment-2257672072

> @SemyonSinchenko thanks for the contribution! I have two high-level
> questions:
>
> * How do you use this sizeInBytes? Is it accurate in your workload?
> * Shall we expose more stats like numRows?

@cloud-fan Thank you for the comment!

The main use case is for library developers and authors of reusable Spark
code. If one wants to use, for example, a broadcast hint inside library or
reusable code without changing the thresholds globally, the only way to get an
estimated size of the data is from the plan. But in Connect there is no way to
call `queryExecution().optimizedPlan.stats` because there is no JavaBridge.
So, at the moment, Connect developers can get this information only by parsing
the string representation of the plan with regexps. That approach is unstable
and fragile across version updates, because there is no guarantee that the
string representation of the Spark plan won't change in future versions. The
estimate is very inaccurate because it is an upper bound. But an upper bound is
exactly what this use case needs, because developers get a guarantee that the
size of the data is not more than the output of that function.
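
To illustrate the regexp-based workaround described above, here is a minimal
sketch. It assumes the plan text was obtained as a string (for example from an
`EXPLAIN COST` query) and that the statistics are printed as
`sizeInBytes=<number> <unit>` with binary units such as `KiB` or `MiB`. The
helper names, the sample plan string, and the threshold are hypothetical and
are not part of this PR; the exact output format is precisely the thing that
may change between Spark versions.

```python
import re
from typing import Optional

# Binary unit suffixes assumed to appear in the plan text (an assumption
# about the string format, which is not a stable API).
_UNITS = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024 ** 2,
    "GiB": 1024 ** 3,
    "TiB": 1024 ** 4,
    "PiB": 1024 ** 5,
    "EiB": 1024 ** 6,
}

_SIZE_RE = re.compile(
    r"sizeInBytes=([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)\s*([KMGTPE]iB|B)"
)


def parse_size_in_bytes(plan_text: str) -> Optional[int]:
    """Return the first sizeInBytes found in the plan text, or None."""
    match = _SIZE_RE.search(plan_text)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def fits_broadcast(plan_text: str, threshold_bytes: int) -> bool:
    """Hypothetical helper: True only if an estimate exists and is small enough."""
    size = parse_size_in_bytes(plan_text)
    return size is not None and size <= threshold_bytes


# Illustrative plan text, written by hand for this sketch.
sample_plan = (
    "== Optimized Logical Plan ==\n"
    "Range (0, 1000, step=1, splits=Some(8)), "
    "Statistics(sizeInBytes=8.0 KiB)"
)

print(parse_size_in_bytes(sample_plan))
print(fits_broadcast(sample_plan, 10 * 1024 * 1024))
```

Library code could use a check like `fits_broadcast` to decide locally
whether to apply a broadcast hint, without touching the global threshold. If
the printed format changes, the regexp silently stops matching, which is the
fragility a dedicated API would remove.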

As an example, I can point to an implementation in one of the Databricks
libraries, named tempo:
[code-example](https://github.com/databrickslabs/tempo/blob/master/python/tempo/tsdf.py#L712).
But it is only an example: I have no relation to Databricks and I'm not a
contributor to tempo. We are using similar techniques internally at the
moment.

More stats can be exposed, but I do not see a use for the `numRows` estimate,
for example. The most useful one is the size-in-bytes estimate, because it can
be used to judge whether a `collect` is feasible, to decide on broadcast
hints, etc.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

