[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949171#comment-14949171 ]
Glenn Strycker commented on SPARK-11004:
----------------------------------------

So maybe we can simplify this idea down to forcing "disk-only shuffles" only in particular places. Spark could add a "force disk-only" parameter to the existing RDD join function, so the call would look like this:

rdd1.join(rdd2, diskonly = true)

since it is the in-memory shuffles that seem to be causing my join problems.

> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
>                 Key: SPARK-11004
>                 URL: https://issues.apache.org/jira/browse/SPARK-11004
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, predictable way, with graceful failures and recovery. I have applications that are able to join 2 tables without error in Hive, but these same tables, when converted into RDDs, are unable to join in Spark. (I am using the same cluster and have played around with all of the memory configurations, persisting to disk, checkpointing, etc.; the RDDs are simply too big for Spark on my cluster.)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally runs into problems when the tables are just too big (e.g. the notorious 2GB shuffle limit issue, memory problems, etc.). There are so many parameters to adjust (number of partitions, number of cores, memory per core, etc.) that it is difficult to guarantee stability on a shared cluster (say, running YARN) alongside other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, and it would let developers mix and match Spark and MapReduce operations within the same program, rather than writing Hive scripts and stringing together separate Spark and MapReduce programs, which carries extremely large overhead for converting RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, sometimes using disk only may help with stability!
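For reference, the closest approximation possible with today's API is to pin both join inputs to disk before shuffling. Below is a minimal Scala sketch; DiskOnlyJoin and diskOnlyJoin are hypothetical names, not existing Spark API, and note that this only forces the inputs onto disk, while the shuffle itself still runs through Spark's normal memory-backed machinery (which is exactly what a true diskonly flag would have to change):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import scala.reflect.ClassTag

    // Hypothetical helper, not part of Spark: approximates the proposed
    // rdd1.join(rdd2, diskonly = true) by pinning both inputs to disk.
    object DiskOnlyJoin {
      def diskOnlyJoin[K: ClassTag, V: ClassTag, W: ClassTag](
          left: RDD[(K, V)],
          right: RDD[(K, W)],
          numPartitions: Int): RDD[(K, (V, W))] = {
        // Keep both sides out of execution memory entirely.
        val l = left.persist(StorageLevel.DISK_ONLY)
        val r = right.persist(StorageLevel.DISK_ONLY)
        // The join still goes through the regular shuffle path; making the
        // shuffle itself disk-only would require changes inside Spark's
        // shuffle machinery, which is what this ticket is asking for.
        l.join(r, numPartitions)
      }
    }

Usage would be, e.g., val joined = DiskOnlyJoin.diskOnlyJoin(rdd1, rdd2, numPartitions = 2000), with a high partition count chosen to keep each shuffle block well under the 2GB limit.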
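Similarly, the mapReduceJoin operation proposed in the description could be surfaced as an enrichment on pair RDDs. The sketch below shows only the intended call shape; MapReduceJoinSyntax and mapReduceJoin are hypothetical, and the body merely checkpoints both sides to disk and then falls back to Spark's own join, since launching an external MapReduce job and reading its output back as an RDD is precisely the part that would need new support:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import scala.reflect.ClassTag

    object MapReduceJoinSyntax {
      // Hypothetical enrichment; mapReduceJoin does not exist in Spark.
      implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
        def mapReduceJoin[W: ClassTag](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
          // Stage both inputs on disk, per the proposal's "checkpoint both
          // RDDs to disk" step; requires sc.setCheckpointDir(...) beforehand.
          self.persist(StorageLevel.DISK_ONLY)
          other.persist(StorageLevel.DISK_ONLY)
          self.checkpoint()
          other.checkpoint()
          // Stand-in for the proposed disk-only MapReduce join step.
          self.join(other)
        }
      }
    }

With import MapReduceJoinSyntax._ in scope, the call would read exactly as proposed: val joined = myRDD1.mapReduceJoin(myRDD2).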