[ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341284#comment-15341284 ]
Julien Diener edited comment on SPARK-16069 at 6/21/16 7:22 AM:
----------------------------------------------------------------

Why would data be sent to executors? I understood that cache means keeping intermediate results in memory for later use. No need to move data around (?)

was (Author: juh):
Why would data be sent to executors? I understood that cache means keeping intermediate results in memory for later use.

> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>        Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>
> I found out that when using .map( identity ).cache on an RDD, it becomes
> very slow if the items are big, while it is pretty much instantaneous
> otherwise. I would really appreciate knowing why (it is potentially
> critical for an application I am currently developing, if I don't find a
> workaround).
> I posted the question on SO but did not get an answer:
> http://stackoverflow.com/q/37859386/1206998
> Basically, starting from an in-memory cached RDD containing big items,
> `map(identity).cache` is very slow. E.g.:
>
> profile( rdd.count )                      // around 12 ms
> profile( rdd.map(identity).count )        // same
> profile( rdd.cache.count )                // same
> profile( rdd.map(identity).cache.count )  // 5700 ms !!!
>
> Whereas, if the RDD content is small, this is very fast. So the creation
> of the RDD is not the cause.
> I don't understand why this would take time. In my understanding, the
> in-memory cache should "simply" keep a reference to the data: no copy, no
> serialization.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
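The `profile` helper used in the snippets above is not defined anywhere in the report; a minimal sketch in Scala, assuming it simply times a by-name expression and returns its result (the name `profile` is the reporter's, the implementation here is hypothetical), might look like:

```scala
// Hypothetical timing helper matching the `profile(...)` calls in the report.
// Takes a by-name argument, so the expression is evaluated inside the timer.
object Profiling {
  def profile[T](block: => T): T = {
    val start = System.nanoTime()
    val result = block // force evaluation of the wrapped expression here
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"elapsed: $elapsedMs%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Plain-Scala usage; in the report it wraps RDD actions like rdd.count.
    val xs = (1 to 1000000).toArray
    profile(xs.sum)
  }
}
```

Because `profile` returns the value of the wrapped expression, it can be chained around any Spark action (`count`, `collect`, etc.) without changing the surrounding code.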