[
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106343#comment-14106343
]
Lianhui Wang commented on HIVE-7384:
------------------------------------
@Szehon Ho Yes, I read the OrderedRDDFunctions code and discovered that
sortByKey actually does a range partition. We need to replace the range
partition with a hash partition, so Spark may need to provide a new interface,
for example partitionSortByKey.
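
To illustrate what such an interface would need to do, here is a minimal
sketch in plain Python (not Spark code). The names stable_hash and
hash_partition_and_sort are hypothetical and exist only for this example. It
assumes string keys and uses CRC32 as a stand-in for the partitioner's hash
function: records are routed to a partition by key hash, then each partition is
sorted locally, so equal keys are co-located and adjacent without any global
ordering across partitions.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]  # (join key, tagged value)


def stable_hash(key: str) -> int:
    """Deterministic hash of a key (assumption: CRC32 stands in for the real partitioner hash)."""
    return zlib.crc32(key.encode("utf-8"))


def hash_partition_and_sort(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Route each record to a partition by key hash, then sort each partition by key.

    Unlike a range partition, no sampling of the data is needed and no
    ordering is guaranteed between partitions.
    """
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[stable_hash(key) % num_partitions].append((key, value))
    for part in partitions:
        # Sort within the partition only; the sort is stable, so values that
        # share a key keep their input order.
        part.sort(key=lambda kv: kv[0])
    return partitions


records = [
    ("b", "sales:1"),
    ("a", "items:2"),
    ("b", "items:3"),
    ("c", "sales:4"),
    ("a", "sales:5"),
]
parts = hash_partition_and_sort(records, 3)
```

Every record with the same key ends up in the same partition, next to the
other records for that key, which is all a reduce-side join requires.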
@Brock Noland The code in 1) means that when data is sampled and there is more
than one reducer, Hive does a total order sort. A join does not sample data, so
it does not need a total order sort.
2) I think we really need auto-parallelism. I discussed this with Reynold Xin
earlier: Spark needs to support re-partitioning map output data, the same way
Tez does.
> Research into reduce-side join [Spark Branch]
> ---------------------------------------------
>
> Key: HIVE-7384
> URL: https://issues.apache.org/jira/browse/HIVE-7384
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Szehon Ho
> Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt,
> sales_products.txt, sales_stores.txt
>
>
> Hive's join operator is very sophisticated, especially for reduce-side join.
> While we expect that other types of join, such as map-side join and SMB
> map-side join, will work out of the box with our design, there may be some
> complication in reduce-side join, which extensively utilizes key tag and
> shuffle behavior. Our design principle prefers making the Hive implementation
> work out of the box as well, which might require new functionality from Spark.
> The task is to research this area, identifying requirements for the Spark
> community and the work to be done on Hive to make reduce-side join work.
> A design doc might be needed for this. For more information, please refer to
> the overall design doc on wiki.