[ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341492#comment-15341492 ]
Julien Diener edited comment on SPARK-16069 at 6/21/16 10:06 AM:
-----------------------------------------------------------------

Maybe I wasn't clear: the input rdd is already distributed and cached. My problem is about the fourth call: rdd.map(...).cache.

What I think it should do is:
- on each partition, call map(identity) on the partition data
-- I'd expect something like newData = (data: Seq[A]).map(...)
- on each partition, keep a reference to this new Seq
- create a new rdd in the driver that does not contain the data, only pointers to the new (distributed) partitions

So the data does not need to be moved, not even from/to the driver, and thus should not be [de]serialized.

Am I missing something?


was (Author: juh):
Maybe I wasn't clear: the input rdd is already distributed and cached. My problem is about the fourth call: rdd.map(...).cache.

What I think it should do is:
- on each partition, call map(identity) on the partition data
-- I'd expect something like newData = (data: Seq[A]).map(...)
- on each partition, keep a reference to this new Seq
- create a new rdd in the driver that does not contain the data, only pointers to the new (distributed) partitions

So the data does not need to be moved, and thus should not be [de]serialized.

Am I missing something?

> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>
> I found out that when using .map( identity ).cache on a rdd, it becomes very
> slow if the items are big, while it is pretty much instantaneous otherwise.
> I would really appreciate knowing why
> (it is potentially critical for an application I am currently developing,
> if I don't find a workaround).
> I posted the question on SO but did not get an answer:
> http://stackoverflow.com/q/37859386/1206998
> Basically, from an in-memory cached rdd containing big items,
> `map(identity).cache` is very slow. E.g.:
>     profile( rdd.count )                     // around 12 ms
>     profile( rdd.map(identity).count )       // same
>     profile( rdd.cache.count )               // same
>     profile( rdd.map(identity).cache.count ) // 5700 ms !!!
> By contrast, if the rdd content is small, this is very fast, so the creation of
> the rdd is not the cause.
> I don't understand why this would take time. In my understanding, the in-memory
> cache should "simply" keep a reference to the data: no copy, no serialization.
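
The four timed calls in the description can be turned into a self-contained program along the following lines. This is only a sketch under stated assumptions: the report does not define `profile`, so the helper below is hypothetical; the report also does not say what a "big item" is, so a large `Array[Double]` is used as a stand-in; and the `local[2]` master, partition count, and object name are chosen purely for illustration. Whether it shows the same slowdown will likely depend on the item type and the Spark version.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapIdentityCacheTiming {
  // Hypothetical helper (not defined in the report): evaluates `body` once,
  // prints the elapsed wall-clock time, and returns the body's result.
  def profile[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label%-32s $elapsedMs%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough to observe the effect.
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-identity-cache")
    val sc = new SparkContext(conf)
    try {
      // Assumption: "big items" are modelled as 4 arrays of 1,000,000 doubles each.
      val rdd = sc.parallelize(Seq.fill(4)(Array.fill(1000000)(0.0)), 4).cache()
      rdd.count() // materialise the cache before timing anything

      // The same four calls as in the description, in the same order.
      profile("rdd.count")(rdd.count())
      profile("rdd.map(identity).count")(rdd.map(identity).count())
      profile("rdd.cache.count")(rdd.cache().count())
      profile("rdd.map(identity).cache.count")(rdd.map(identity).cache().count())
    } finally {
      sc.stop()
    }
  }
}
```

Shrinking the arrays (for example to 10 elements each) gives the "small content" case from the description, so the two runs can be compared directly.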