Github user squito commented on the pull request:

https://github.com/apache/spark/pull/3913#issuecomment-93486450

I think this is a good change. Yes, you could cache an RDD and see its size, but think about what a pain that actually is if you wanted to do it programmatically: you'd need to register a SparkListener, wait for the appropriate events, and look at the sizes (sketched below). If it were easy to do this programmatically via an RDD, then I'd say this change isn't necessary. E.g., if you could do something like

```
val (_, meta) = sc.parallelize(oneObject).cache().count()
meta.getPartitionSizes
```

then we wouldn't need to expose this. But that's an even bigger API change, and one that I would be far more nervous about (the code above is definitely not a viable alternative; there are lots of reasons it doesn't make sense).
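For context, here is roughly what the "register a listener" dance described above looks like. This is a minimal sketch, not code from the PR, and it assumes Spark 1.x-era developer APIs (SparkContext.addSparkListener and TaskMetrics.updatedBlocks, the latter of which was removed in later Spark versions); the CachedBlockSizeListener and PartitionSizeExample names are made up for illustration.

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Collects the in-memory size of each cached RDD block as task-end
// events arrive on the listener bus.
class CachedBlockSizeListener extends SparkListener {
  val blockSizes = scala.collection.mutable.Map[String, Long]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    for {
      metrics           <- Option(taskEnd.taskMetrics) // metrics can be null for failed tasks
      blocks            <- metrics.updatedBlocks       // Option[Seq[(BlockId, BlockStatus)]] in 1.x
      (blockId, status) <- blocks
      if blockId.isRDD                                 // ignore shuffle / broadcast blocks
    } blockSizes.synchronized { blockSizes(blockId.name) = status.memSize }
  }
}

object PartitionSizeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-sizes").setMaster("local[2]"))
    val listener = new CachedBlockSizeListener
    sc.addSparkListener(listener) // a @DeveloperApi method in 1.x

    val rdd = sc.parallelize(1 to 100000).cache()
    rdd.count() // force materialization so blocks actually get cached

    // The listener bus is asynchronous, so events for the job above may
    // still be in flight here -- this is the "wait to get the appropriate
    // events" part of the pain. A real caller would need proper synchronization.
    Thread.sleep(1000)
    listener.blockSizes.foreach { case (id, size) =>
      println(s"$id: $size bytes in memory")
    }
    sc.stop()
  }
}
```

Compare that ceremony with the one-line meta.getPartitionSizes the comment imagines: the listener route forces the caller to manage mutable shared state, filter block types, and reason about event delivery timing, which is the pain the change is meant to avoid.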