[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949171#comment-14949171 ]
Glenn Strycker commented on SPARK-11004:
----------------------------------------

So maybe we can simplify this idea down to forcing "disk-only shuffles" only in particular places. Spark could add a "force disk-only" parameter to the existing RDD join function, so the call would look like this:

rdd1.join(rdd2, diskonly = true)

since it is the in-memory shuffles that seem to be causing my join problems.

> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
>                 Key: SPARK-11004
>                 URL: https://issues.apache.org/jira/browse/SPARK-11004
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, predictable way, with graceful failures and recovery. I have applications that are able to join 2 tables without error in Hive, but these same tables, when converted into RDDs, are unable to join in Spark. (I am using the same cluster and have played around with all of the memory configurations, persisting to disk, checkpointing, etc.; the RDDs are simply too big for Spark on my cluster.)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally runs into problems when the tables are just too big (e.g. the notorious 2GB shuffle limit issue, memory problems, etc.). There are so many parameters to adjust (number of partitions, number of cores, memory per core, etc.) that it is difficult to guarantee stability on a shared cluster (say, running YARN) alongside other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, and it would let developers mix and match Spark and MapReduce operations within the same program, rather than writing Hive scripts and stringing together separate Spark and MapReduce programs, which carries extremely large overhead for converting RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, sometimes using disk only may help with stability!
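For reference, the closest approximation possible with today's API is to pin both join inputs to disk before shuffling. Below is a minimal Scala sketch; DiskOnlyJoin and diskOnlyJoin are hypothetical names, not existing Spark API, and note that this only forces the inputs onto disk, while the shuffle itself still runs through Spark's normal memory-backed machinery (which is exactly what a true diskonly flag would have to change):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import scala.reflect.ClassTag

    // Hypothetical helper, not part of Spark: approximates the proposed
    // rdd1.join(rdd2, diskonly = true) by pinning both inputs to disk.
    object DiskOnlyJoin {
      def diskOnlyJoin[K: ClassTag, V: ClassTag, W: ClassTag](
          left: RDD[(K, V)],
          right: RDD[(K, W)],
          numPartitions: Int): RDD[(K, (V, W))] = {
        // Keep both sides out of execution memory entirely.
        val l = left.persist(StorageLevel.DISK_ONLY)
        val r = right.persist(StorageLevel.DISK_ONLY)
        // The join still goes through the regular shuffle path; making the
        // shuffle itself disk-only would require changes inside Spark's
        // shuffle machinery, which is what this ticket is asking for.
        l.join(r, numPartitions)
      }
    }

Usage would be, e.g., val joined = DiskOnlyJoin.diskOnlyJoin(rdd1, rdd2, numPartitions = 2000), with a high partition count chosen to keep each shuffle block well under the 2GB limit.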
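Similarly, the mapReduceJoin operation proposed in the description could be surfaced as an enrichment on pair RDDs. The sketch below shows only the intended call shape; MapReduceJoinSyntax and mapReduceJoin are hypothetical, and the body merely checkpoints both sides to disk and then falls back to Spark's own join, since launching an external MapReduce job and reading its output back as an RDD is precisely the part that would need new support:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import scala.reflect.ClassTag

    object MapReduceJoinSyntax {
      // Hypothetical enrichment; mapReduceJoin does not exist in Spark.
      implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
        def mapReduceJoin[W: ClassTag](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
          // Stage both inputs on disk, per the proposal's "checkpoint both
          // RDDs to disk" step; requires sc.setCheckpointDir(...) beforehand.
          self.persist(StorageLevel.DISK_ONLY)
          other.persist(StorageLevel.DISK_ONLY)
          self.checkpoint()
          other.checkpoint()
          // Stand-in for the proposed disk-only MapReduce join step.
          self.join(other)
        }
      }
    }

With import MapReduceJoinSyntax._ in scope, the call would read exactly as proposed: val joined = myRDD1.mapReduceJoin(myRDD2).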