[ 
https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317334#comment-14317334
 ] 

Ilya Ganelin edited comment on SPARK-4423 at 2/12/15 2:39 AM:
--------------------------------------------------------------

Hi [~pwendell] and [~joshrosen], how do you guys feel about my adding a section 
to the Spark Programming Guide that discusses this issue: local execution on 
the driver (in {{local}} mode) versus the division of labor between the driver 
and the executors (in {{cluster}} mode)? Specifically, I'd like to discuss 
where the data that the executors operate on actually lives. This also becomes 
useful during performance tuning, for example when using mapPartitions to avoid 
shuffle operations, since it ties in with data aggregation on the executors. 

This section could be referenced from the shorter descriptions of foreach, 
map, mapPartitions, mapPartitionsWithIndex, and flatMap, or some other set of 
operators we care about.


Edit:
Upon further consideration I've realized that the above doesn't quite address 
the spirit of the issue. I think what is really at play here is simply a need 
to explain closures in local vs. cluster modes.  






> Improve foreach() documentation to avoid confusion between local- and 
> cluster-mode behavior
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4423
>                 URL: https://issues.apache.org/jira/browse/SPARK-4423
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Josh Rosen
>            Assignee: Ilya Ganelin
>
> {{foreach}} seems to be a common source of confusion for new users: in 
> {{local}} mode, {{foreach}} can be used to update local variables on the 
> driver, but programs that do this will not work properly when executed on 
> clusters, since the {{foreach}} will update per-executor variables (note that 
> this _will_ work correctly for accumulators, but not for other types of 
> mutable objects).
> Similarly, I've seen users become confused when {{.foreach(println)}} doesn't 
> print to the driver's standard output.
> At a minimum, we should improve the documentation to warn users against 
> unsafe uses of {{foreach}} that won't work properly when transitioning from 
> local mode to a real cluster.
> We might also consider changes to local mode so that its behavior more 
> closely matches the cluster modes; this will require some discussion, though, 
> since any change of behavior here would technically be a user-visible 
> backwards-incompatible change (I don't think that we made any explicit 
> guarantees about the current local-mode behavior, but someone might be 
> relying on the current implicit behavior).
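
The difference described in the issue can be imitated without Spark. The sketch below is only a simulation under stated assumptions: {{AddToCounter}}, {{run_local}} and {{run_cluster}} are hypothetical names, not Spark APIs, and {{copy.deepcopy}} stands in for the serialization Spark performs when it ships a closure to an executor. It shows why a driver-side counter appears to work when the function runs in the driver's own process, yet stays untouched when each executor mutates its own copy.

```python
import copy


class AddToCounter:
    """Hypothetical stand-in for a closure that captures a mutable driver variable."""

    def __init__(self):
        self.counter = 0

    def __call__(self, x):
        self.counter += x


def run_local(task, data):
    # Simulated local mode: the task runs in the driver's process,
    # so it mutates the very object the driver holds.
    for x in data:
        task(x)


def run_cluster(task, partitions):
    # Simulated cluster mode: every "executor" receives its own copy of
    # the task (deepcopy stands in for serialize + deserialize).
    executor_totals = []
    for partition in partitions:
        task_copy = copy.deepcopy(task)
        for x in partition:
            task_copy(x)
        executor_totals.append(task_copy.counter)
    return executor_totals


local_task = AddToCounter()
run_local(local_task, [1, 2, 3, 4])

cluster_task = AddToCounter()
executor_totals = run_cluster(cluster_task, [[1, 2], [3, 4]])

print(local_task.counter)    # 10: the driver's object was updated
print(cluster_task.counter)  # 0: only the executors' copies changed
print(executor_totals)       # [3, 7]: per-executor values, never merged
```

In this simulation the per-executor totals are never sent back to the driver, which mirrors the per-executor variables mentioned in the description. Accumulators work in Spark because they are merged back on the driver.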


