I've run into a situation where foreachPartition appears to be running on only one of my executors.
I have a small cluster (2 executors with 8 cores each). When I run a job on a small file (16 partitions), I can see that all 16 partitions are initialized, but they all appear to be initialized on only one executor. All of the work then runs on that one executor, even though there are 16 partitions. This seems odd, but at least it works; I'm not sure why the other executor was not used.

However, when I run a larger file (again with 16 partitions), the 16 partitions are once again all initialized on the same single executor, but this time the subsequent work is spread across both executors. This causes problems, because the second executor was never initialized: all of the per-partition initialization happened on the first one.

Does anyone have any suggestions for where I should investigate? Has anyone else seen something like this before? Any thoughts/insights would be appreciated.

I'm using the standalone cluster manager; the cluster was started with the spark-ec2 scripts, and I'm submitting my job with spark-submit.

Thanks,
Darin
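P.S. To make the pattern concrete, here is a minimal sketch of the sort of job I'm describing. The input path and the per-partition work are placeholders rather than my actual code; the point is just that the initialization happens inside foreachPartition, on whichever executor each task lands:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionInitJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionInitJob")
    val sc = new SparkContext(conf)

    // Placeholder input; in my real job the file comes out as 16 partitions.
    val rdd = sc.textFile("s3n://my-bucket/input.txt", 16)

    rdd.foreachPartition { iter =>
      // Per-partition initialization: this stands in for my real setup code.
      // It runs inside the task, on whichever executor that task is scheduled.
      val digest = java.security.MessageDigest.getInstance("MD5")
      iter.foreach(line => digest.update(line.getBytes("UTF-8")))
      // Prints to the executor's stdout log, which is how I can see
      // which executor actually ran each partition.
      println("partition checksum: " + digest.digest().map("%02x".format(_)).mkString)
    }

    sc.stop()
  }
}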
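And this is roughly how I'm submitting it on the standalone cluster (the master URL and jar name below are placeholders):

spark-submit \
  --class PartitionInitJob \
  --master spark://ec2-master-host:7077 \
  --total-executor-cores 16 \
  my-job.jar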