Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/3913#issuecomment-93786580
  
    The reason I proposed putting it in SparkContext is to avoid committing to
    the current namespace/package of that object, and instead expose only a
    narrower utility function off of SparkContext. Overall, our estimation code
    is likely to evolve in the future.
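
    For context, a minimal sketch of what such a narrower utility could look
    like, assuming a Spark build where `org.apache.spark.util.SizeEstimator`
    is publicly accessible; the wrapper object and the method name
    `estimateSize` are illustrative only, not something this PR settles on:

    ```scala
    import org.apache.spark.util.SizeEstimator

    // Hypothetical wrapper: expose estimation as one narrow utility function
    // rather than committing to SizeEstimator's current package as public API.
    object SizeEstimationSketch {
      def estimateSize(obj: AnyRef): Long = SizeEstimator.estimate(obj)

      def main(args: Array[String]): Unit = {
        val data = Array.fill(100000)(scala.util.Random.nextInt())
        println(s"Estimated in-memory size: ${estimateSize(data)} bytes")
      }
    }
    ```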
    
    In terms of exposing it or not, I'm okay with exposing it given the reasons
    here. But can we add some warning to set expectations for the user? This
    estimation can be quite inaccurate because of the sampling and heuristics
    used internally. This is especially true if you have, say, a hash map with
    skewed keys - it will only sample a small percentage of the keyspace and
    could miss the hot keys.
    
    So I'd just say that it's an estimate of the in-memory size and that
    sampling is used internally for complex objects. I think this is also the
    gist of @shivaram's suggestion.
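
    Along those lines, a rough sketch of the kind of scaladoc caveat this
    suggests, attached to the hypothetical `estimateSize` wrapper above, plus
    an illustrative call on a skewed map (names are hypothetical; the actual
    numbers reported will vary):

    ```scala
    import org.apache.spark.util.SizeEstimator

    object EstimateSizeDocSketch {
      /**
       * Estimates the in-memory size of the given object, in bytes.
       *
       * The result is only an estimate: sampling and heuristics are used
       * internally for complex objects, so it can be inaccurate. For example,
       * a hash map with skewed keys may have its hot keys missed by sampling.
       */
      def estimateSize(obj: AnyRef): Long = SizeEstimator.estimate(obj)

      def main(args: Array[String]): Unit = {
        // A map where a few hot keys carry most of the memory; the sampled
        // estimate may differ noticeably from the true footprint.
        val skewed = (1 to 1000).map { i =>
          s"key-$i" -> new Array[Byte](if (i <= 3) 1 << 20 else 16)
        }.toMap
        println(s"Estimated size of skewed map: ${estimateSize(skewed)} bytes")
      }
    }
    ```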

