GitHub user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3913#issuecomment-73963012

    Adding an `estimateSizeOf` method to `SparkContext` sounds reasonable to me. I agree that there's not a great way to expose something like this for Python. But I don't think the zaniness of Python-JVM interaction means that we shouldn't expose useful functionality to pure-JVM apps.

    > For RDD data, it might be slightly misleading here because of things like serialization in-memory.

    I think this is the kind of thing we can just document. Adding a separate `estimateSerializedSizeOf` method would be helpful as well.

    > I'm also not totally sure overall how accurate our memory estimation is and it may get less so if we add smarter caching for SchemaRDD's.

    I've found it to be very accurate in my experiments. We rely on its accuracy for shuffle memory management and POJO caching, so to the extent that it's inaccurate we've got bigger problems.
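    For context, the pair of methods under discussion could be sketched roughly as follows. This is a hypothetical illustration, not code from the PR: it assumes Spark's existing `org.apache.spark.util.SizeEstimator` for the in-memory estimate and the serializer configured in `SparkEnv` for the serialized one; the method names and their eventual placement on `SparkContext` are exactly what is being proposed here.

    ```scala
    import java.nio.ByteBuffer

    import scala.reflect.ClassTag

    import org.apache.spark.SparkEnv
    import org.apache.spark.util.SizeEstimator

    // Hypothetical helpers mirroring the proposed SparkContext methods.
    object SizeEstimation {

      /** Estimated in-memory (deserialized) size of `obj` in bytes.
        * SizeEstimator walks the object graph reflectively, so this is an
        * estimate of JVM heap usage, not of serialized or on-disk size. */
      def estimateSizeOf(obj: AnyRef): Long =
        SizeEstimator.estimate(obj)

      /** Estimated serialized size of `obj` in bytes, using the configured
        * serializer (Java or Kryo). A separate method like this is one way to
        * address the concern that in-memory and serialized sizes can differ
        * widely for cached RDD data. */
      def estimateSerializedSizeOf[T: ClassTag](obj: T): Long = {
        val ser = SparkEnv.get.serializer.newInstance()
        val bytes: ByteBuffer = ser.serialize(obj)
        bytes.limit().toLong
      }
    }
    ```

    Separating the two estimates, and documenting which one applies to serialized in-memory caching, would make the limitation explicit rather than misleading.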