[ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317334#comment-14317334 ]
Ilya Ganelin edited comment on SPARK-4423 at 2/12/15 2:40 AM:
--------------------------------------------------------------

Edit: Upon further consideration, I think what is really at play here is simply a need to explain closures in local vs. cluster modes. I'd like to add a section on this to the Spark Programming Guide; it could then be referenced from the shorter descriptions for foreach, map, mapPartitions, mapPartitionsWithIndex, and flatMap, or whatever other set of operators we care about.

was (Author: ilganeli):

Hi [~pwendell] and [~joshrosen], how do you guys feel about my adding a section to the Spark Programming Guide that discusses this issue: local execution on the driver (in {{local}} mode) versus the division of labor between the driver and the executors (in {{cluster}} mode). Specifically, I'd like to discuss where the data that the executors operate on actually lives. This also becomes useful during performance tuning, for example when using mapPartitions to avoid shuffle operations, since it ties in with data aggregation for executors. This section could be referenced from the shorter descriptions for foreach, map, mapPartitions, mapPartitionsWithIndex, and flatMap, or whatever other set of operators we care about.

Edit: Upon further consideration, I've realized that the above doesn't quite address the spirit of the issue. I think what is really at play here is simply a need to explain closures in local vs. cluster modes.
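To make the closure point concrete, here is a minimal sketch in plain Python. It is a simulation only and does not use Spark: the names {{Counter}}, {{run_local}}, and {{run_cluster}} are hypothetical, and {{copy.deepcopy}} stands in for shipping a serialized closure to an executor. The assumption it illustrates is the one from the issue description: in local mode the function mutates the driver's object, while in cluster mode each executor mutates its own copy.

```python
import copy


class Counter:
    """A plain mutable object; stands in for a variable defined on the driver."""

    def __init__(self):
        self.value = 0


def add_to(counter, x):
    # The body a user might pass to foreach: mutate captured state.
    counter.value += x


def run_local(data, counter):
    # Simulated local mode: the function runs in the driver's process,
    # so it mutates the driver's object directly.
    for x in data:
        add_to(counter, x)


def run_cluster(data, counter, num_executors=2):
    # Simulated cluster mode: each executor gets its own copy of the
    # captured state (deepcopy stands in for serialization) and
    # mutates only that copy.
    partitions = [data[i::num_executors] for i in range(num_executors)]
    executor_totals = []
    for part in partitions:
        executor_copy = copy.deepcopy(counter)
        for x in part:
            add_to(executor_copy, x)
        executor_totals.append(executor_copy.value)
    return executor_totals


data = [1, 2, 3, 4]

local_counter = Counter()
run_local(data, local_counter)

cluster_counter = Counter()
executor_totals = run_cluster(data, cluster_counter)

print(local_counter.value)    # 10: the driver's object was updated
print(cluster_counter.value)  # 0: the driver's object was never touched
print(executor_totals)        # [4, 6]: each copy saw only its partition
```

In the simulated cluster run, the per-executor totals are correct but never flow back to the driver's object. As the issue description notes, accumulators are the case where updates made inside {{foreach}} do work correctly.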
> Improve foreach() documentation to avoid confusion between local- and
> cluster-mode behavior
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-4423
> URL: https://issues.apache.org/jira/browse/SPARK-4423
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Josh Rosen
> Assignee: Ilya Ganelin
>
> {{foreach}} seems to be a common source of confusion for new users: in
> {{local}} mode, {{foreach}} can be used to update local variables on the
> driver, but programs that do this will not work properly when executed on
> clusters, since the {{foreach}} will update per-executor variables (note that
> this _will_ work correctly for accumulators, but not for other types of
> mutable objects).
> Similarly, I've seen users become confused when {{.foreach(println)}} doesn't
> print to the driver's standard output.
> At a minimum, we should improve the documentation to warn users against
> unsafe uses of {{foreach}} that won't work properly when transitioning from
> local mode to a real cluster.
> We might also consider changes to local mode so that its behavior more
> closely matches the cluster modes; this will require some discussion, though,
> since any change of behavior here would technically be a user-visible
> backwards-incompatible change (I don't think that we made any explicit
> guarantees about the current local-mode behavior, but someone might be
> relying on the current implicit behavior).