GitHub user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55303441
  
    Hi, I implemented this per the discussion here 
https://github.com/apache/spark/pull/2347#issuecomment-55181535, assuming I 
understood the comment correctly. The context is that we are supposed to log a 
warning when running an iterative learning algorithm on an uncached RDD. What 
originally led me to identify SPARK-3488 is that if the RDDs deserialized from 
Python are always left uncached, this warning will always be logged.
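    
    For context, here is a minimal sketch of the kind of check being 
discussed, assuming illustrative names (warnIfUncached and the warning text 
are placeholders I made up, not the actual MLlib code):
    
        import org.apache.spark.rdd.RDD
        import org.apache.spark.storage.StorageLevel
    
        // Illustrative only: warn when the training data is uncached, since an
        // iterative algorithm will otherwise recompute it on every pass.
        def warnIfUncached(data: RDD[_]): Unit = {
          if (data.getStorageLevel == StorageLevel.NONE) {
            println("WARN: input data is not cached; an iterative algorithm " +
              "may recompute it on each pass.")
          }
        }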
    
    Obviously a meaningful performance difference would trump the 
implementation of this warning message, and I haven't measured performance - 
just discussed options in the above-referenced pull request. But by way of 
comparison, is there any significant difference in memory pressure between 
caching a LabeledPoint RDD deserialized from Python and caching a LabeledPoint 
RDD created natively in Scala (which is the typical use case with a Scala 
rather than Python client)?
    
    If I should do some performance testing, are there any examples of tests 
and infrastructure you'd suggest as a starting point?
    
    'none' means the RDD is not cached within the Python-to-Scala MLlib 
interface, where previously it was cached. The learning algorithms for which 
RDDs are no longer cached implement their own caching internally, as sketched 
below.
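    
    To make "implement their own caching internally" concrete, here is a 
rough sketch of that pattern, with hypothetical names (trainWithInternalCaching 
and the placeholder loop stand in for a real training method); the actual 
algorithms differ in detail:
    
        import org.apache.spark.rdd.RDD
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.mllib.regression.LabeledPoint
    
        // Sketch: persist the input only if the caller has not already cached
        // it, and release it once training finishes.
        def trainWithInternalCaching(data: RDD[LabeledPoint]): Unit = {
          val shouldCache = data.getStorageLevel == StorageLevel.NONE
          val input =
            if (shouldCache) data.persist(StorageLevel.MEMORY_AND_DISK) else data
          try {
            // Stand-in for an iterative training loop that scans `input`
            // repeatedly.
            var i = 0
            while (i < 10) { input.count(); i += 1 }
          } finally {
            if (shouldCache) input.unpersist()
          }
        }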

