[ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341284#comment-15341284 ]
Julien Diener edited comment on SPARK-16069 at 6/21/16 7:22 AM:
----------------------------------------------------------------

Why would data be sent to executors? I understood that cache means keeping intermediate results in memory for later use. No need to move data around (?)

was (Author: juh):
Why would data be sent to executors? I understood that cache means keeping intermediate results in memory for later use.

> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>        Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>
> I found out that when using .map( identity ).cache on an RDD, it becomes
> very slow if the items are big, while it is pretty much instantaneous
> otherwise. I would really appreciate knowing why (it is potentially
> critical for an application I am currently developing, if I don't find a
> workaround).
> I posted the question on SO but did not get an answer:
> http://stackoverflow.com/q/37859386/1206998
> Basically, starting from an in-memory cached RDD containing big items,
> `map(identity).cache` is very slow. E.g.:
>
> profile( rdd.count )                      // around 12 ms
> profile( rdd.map(identity).count )        // same
> profile( rdd.cache.count )                // same
> profile( rdd.map(identity).cache.count )  // 5700 ms !!!
>
> Whereas, if the RDD content is small, this is very fast. So the creation
> of the RDD is not the cause.
> I don't understand why this would take time. In my understanding, the
> in-memory cache should "simply" keep a reference to the data: no copy, no
> serialization.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
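The `profile` helper used in the snippets above is not defined anywhere in the report; a minimal sketch in Scala, assuming it simply times a by-name expression and returns its result (the name `profile` is the reporter's, the implementation here is hypothetical), might look like:

```scala
// Hypothetical timing helper matching the `profile(...)` calls in the report.
// Takes a by-name argument, so the expression is evaluated inside the timer.
object Profiling {
  def profile[T](block: => T): T = {
    val start = System.nanoTime()
    val result = block // force evaluation of the wrapped expression here
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"elapsed: $elapsedMs%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Plain-Scala usage; in the report it wraps RDD actions like rdd.count.
    val xs = (1 to 1000000).toArray
    profile(xs.sum)
  }
}
```

Because `profile` returns the value of the wrapped expression, it can be chained around any Spark action (`count`, `collect`, etc.) without changing the surrounding code.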