GitHub user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3913#issuecomment-73963012
  
    Adding an `estimateSizeOf` method to `SparkContext` sounds reasonable to me.
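
    For reference, a minimal sketch of what that could look like, assuming it just delegates to the existing `org.apache.spark.util.SizeEstimator` (today that's `private[spark]`, so exposing it would be part of the change):

```scala
import org.apache.spark.util.SizeEstimator

// Hypothetical sketch: the proposed method would presumably just delegate
// to SizeEstimator, which walks the object graph reachable from `obj` and
// returns an estimate of its in-memory size in bytes.
def estimateSizeOf(obj: AnyRef): Long = SizeEstimator.estimate(obj)

// Example (e.g. in spark-shell): estimate the deserialized, in-memory
// footprint of a sample of records.
val sample: Array[String] = Array.fill(1000)("some record payload")
println(s"estimated size: ${estimateSizeOf(sample)} bytes")
```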
    
    I agree that there's not a great way to expose something like this for 
Python. But I don't think the zaniness of Python-JVM interaction means that we 
shouldn't expose useful functionality to pure-JVM apps.
    
    > For RDD data, it might be slightly misleading here because of things like 
serialization in-memory.
    
    I think this is the kind of thing we can just document.  Adding a separate 
`estimateSerializedSizeOf` method would be helpful as well.
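
    A rough sketch of what that could look like, assuming it serializes with the application's configured serializer via `SparkEnv` and measures the resulting buffer (the name and placement are just assumptions):

```scala
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import org.apache.spark.SparkEnv

// Hypothetical sketch: serialize the object with whichever serializer the
// app is actually configured with (Java or Kryo) and report the byte count.
def estimateSerializedSizeOf[T: ClassTag](obj: T): Long = {
  val buf: ByteBuffer = SparkEnv.get.serializer.newInstance().serialize(obj)
  buf.remaining().toLong  // bytes between position and limit = serialized size
}
```

    That keeps the two estimates meaningfully distinct: one reflects the deserialized object graph, the other what would actually be written when caching serialized (e.g. `MEMORY_ONLY_SER`) or shuffling.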
    
    > I'm also not totally sure overall how accurate our memory estimation is 
and it may get less so if we add smarter caching for SchemaRDD's.
    
    I've found it to be very accurate in my experiments.  We rely on its 
accuracy for shuffle memory management and POJO caching, so to the extent that 
it's inaccurate we've got bigger problems.  

