SemyonSinchenko commented on PR #46368:
URL: https://github.com/apache/spark/pull/46368#issuecomment-2257672072

   > @SemyonSinchenko thanks for the contribution! I have two high-level 
questions:
   > 
   > * How do you use this sizeInBytes? Is it accurate in your workload?
   > * Shall we expose more stats like numRows?
   
   @cloud-fan Thank you for the comment!
   
   The main use case is for library developers and authors of reusable Spark 
code. If one wants to use, for example, a broadcast hint inside library or 
reusable code without changing thresholds globally, the only way to get an 
estimated size of the data is from the plan. But in Connect there is no way to 
call `queryExecution().optimizedPlan.stats` because there is no JavaBridge. So, 
at the moment, Connect devs can get this information only by parsing the 
string representation of the plan with regexps. That approach is unstable and 
fragile across version updates, because there are no guarantees that the string 
representation of the Spark plan won't change in future versions. The estimate 
is very inaccurate because it is an upper bound. But an upper bound is exactly 
what is needed here, because devs have a guarantee that the size of the data 
is not more than the output of that function.
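
   A minimal sketch of the regexp-based workaround described above, in plain 
Python so it runs without a Spark session. The sample plan text, the 
`Statistics(sizeInBytes=...)` layout, the unit labels, and the helper names 
`parse_size_in_bytes` and `fits_broadcast` are illustrative assumptions, not a 
Spark API; the real plan text may look different in any given version, which 
is exactly the fragility being discussed.

```python
import re
from typing import Optional

# Binary unit multipliers. The unit labels are an assumption about how the
# plan text prints sizes; they may differ between Spark versions.
_UNITS = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024 ** 2,
    "GiB": 1024 ** 3,
    "TiB": 1024 ** 4,
    "PiB": 1024 ** 5,
    "EiB": 1024 ** 6,
}

# Matches e.g. "sizeInBytes=2.5 MiB" or "sizeInBytes=1.0E+3 B".
_SIZE_RE = re.compile(r"sizeInBytes=([0-9.]+(?:[eE][+-]?[0-9]+)?)\s*([A-Za-z]+)")


def parse_size_in_bytes(plan_text: str) -> Optional[int]:
    """Return the first sizeInBytes value found in the plan text, in bytes.

    Returns None when nothing matches or the unit is unknown, so a format
    change degrades to "no estimate" instead of a wrong number.
    """
    match = _SIZE_RE.search(plan_text)
    if match is None:
        return None
    value, unit = match.groups()
    if unit not in _UNITS:
        return None
    return int(float(value) * _UNITS[unit])


def fits_broadcast(plan_text: str, threshold_bytes: int = 10 * 1024 * 1024) -> bool:
    """Decide locally whether a broadcast hint looks safe.

    Because the estimate is an upper bound, "size <= threshold" is a
    conservative check. The 10 MiB default is a hypothetical library-level
    threshold, chosen here only for illustration.
    """
    size = parse_size_in_bytes(plan_text)
    return size is not None and size <= threshold_bytes


# Hypothetical plan text, hand-written for this sketch.
SAMPLE_PLAN = """== Optimized Logical Plan ==
Project [id#0L], Statistics(sizeInBytes=2.5 MiB)
+- Range (0, 327680, step=1, splits=Some(8)), Statistics(sizeInBytes=2.5 MiB)
"""

if __name__ == "__main__":
    print(parse_size_in_bytes(SAMPLE_PLAN))
    print(fits_broadcast(SAMPLE_PLAN))
```

   In library code the result of `fits_broadcast` would gate a per-DataFrame 
broadcast hint, leaving the global threshold untouched. A structured 
size-in-bytes value would make the regexp and the unit table unnecessary.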
   
   As an example, I can point to the implementation in one of the Databricks 
libraries, named tempo: 
[code-example](https://github.com/databrickslabs/tempo/blob/master/python/tempo/tsdf.py#L712).
 But it is only an example: I have no relation to Databricks and I am not a 
contributor to tempo. We use similar techniques internally at the moment.
   
   
   More stats can be exposed, but I do not see a use case for the `numRows` 
estimate, for example. The most useful one is the size-in-bytes estimate, 
because it can be used to judge whether a `collect` is feasible, whether to 
apply a broadcast hint, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

