[ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341492#comment-15341492 ]

Julien Diener edited comment on SPARK-16069 at 6/21/16 10:07 AM:
-----------------------------------------------------------------

Maybe I wasn't clear: the input rdd is already distributed and cached. My 
problem is about the fourth call: rdd.map(...).cache.

What I think it should do is:
  - on each partition, call map(identity) on the partition data -- I'd expect 
something like newData = (data: Seq[A]).map(...) 
  - on each partition, keep a reference to this new Seq
  - create a new rdd in the driver that does not contain the data, but pointers 
to the new (distributed) partitions

So the data need not be moved, not even from/to the driver (aside from some 
reference pointers), and thus should not be [de]serialized. 
Am I missing something?
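A minimal plain-Scala sketch of the expectation above (no Spark involved; the object and value names are illustrative): mapping a collection with identity builds a new Seq, but each element of the new Seq is the very same object reference as before, so nothing per-element is copied or serialized:

```scala
object IdentityMapSketch {
  def main(args: Array[String]): Unit = {
    // A stand-in for one partition's data: a few "big" items.
    val data: Seq[String] = Seq.fill(3)(new String("x" * 1000))

    // What map(identity) does to one partition's data: a new Seq is built,
    // but every element is the same object reference as in the original.
    val newData = data.map(identity)

    // The containers differ; the elements do not -- no per-element copy.
    assert(!(newData eq data))
    assert(newData.zip(data).forall { case (a, b) => a eq b })
    println("elements are shared by reference")
  }
}
```

This only illustrates the local-collection intuition; whether Spark's cache of a mapped RDD behaves the same way is exactly the open question in this issue.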





> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>
> I found out that when using .map( identity ).cache on a rdd, it becomes very 
> slow if the items are big, while it is pretty much instantaneous otherwise.
> I would really appreciate knowing why (it is potentially critical for an 
> application I am currently developing, if I don't find a workaround). 
> I posted the question on SO but did not get an answer:
> http://stackoverflow.com/q/37859386/1206998
> Basically, from an in-memory cached rdd containing big items, 
> `map(identity).cache` is very slow. Eg:
>     profile( rdd.count )                 // around 12 ms
>     profile( rdd.map(identity).count )   // same
>     profile( rdd.cache.count )           // same
>     profile( rdd.map(identity).cache.count ) // 5700 ms !!!
> While, if the rdd content is small, this is very fast, so the creation of 
> the rdd is not the cause. 
> I don't understand why this would take time. In my understanding, in-memory 
> cache should "simply" keep a reference to the data, no copy, no serialization.
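The profile helper in the snippet above is not defined in the issue; a minimal plain-Scala version (a hypothetical name, assuming it simply times a by-name expression) could look like this, sketched without Spark so it runs standalone:

```scala
object Profiling {
  // Evaluates a by-name expression once and returns (result, elapsed ms).
  def profile[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Against an RDD this would be used as, e.g.,
    //   profile(rdd.map(identity).cache.count)
    // Here we time a cheap local computation instead.
    val (sum, ms) = profile((1 to 1000).sum)
    assert(sum == 500500)
    println(s"computed in $ms ms")
  }
}
```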



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
