[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949159#comment-14949159
 ] 

Sean Owen commented on SPARK-11004:
-----------------------------------

Yes, what would that gain though? I understand plumbing this through Hive is 
crazy, and the goal is a better join of some kind. What you are describing 
sound like how Spark works already though. It certainly uses memory and disk 
for shuffling. I don't think that's the difference, but sure could be 
something. But first I think you'd have to rule out simple tuning issues, like, 
how much memory do you allow for the shuffle? They are indeed different 
creatures here. The goal is of course fine; I think that unless this coalesces 
around a particular change, nothing will happen here anyway.

> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
>                 Key: SPARK-11004
>                 URL: https://issues.apache.org/jira/browse/SPARK-11004
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with gracious failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to