liyunzhang_intel created HIVE-16980:
---------------------------------------
Summary: The partition of join is not divided evenly in HoS
Key: HIVE-16980
URL: https://issues.apache.org/jira/browse/HIVE-16980
Project: Hive
Issue Type: Bug
Reporter: liyunzhang_intel
In HoS, the join is implemented as a union followed by a repartition sort. We
use HashPartitioner to partition the result of the union.
SortByShuffler.java
{code}
public JavaPairRDD<HiveKey, BytesWritable> shuffle(
    JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
  JavaPairRDD<HiveKey, BytesWritable> rdd;
  if (totalOrder) {
    if (numPartitions > 0) {
      if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
        input.persist(StorageLevel.DISK_ONLY());
        sparkPlan.addCachedRDDId(input.id());
      }
      rdd = input.sortByKey(true, numPartitions);
    } else {
      rdd = input.sortByKey(true);
    }
  } else {
    Partitioner partitioner = new HashPartitioner(numPartitions);
    rdd = input.repartitionAndSortWithinPartitions(partitioner);
  }
  return rdd;
}
{code}
In the Spark history server, I saw 28 tasks in the repartition sort stage:
21 of them finished in less than 1s, while the remaining 7 took a long time
to execute. Is there any way to make the data evenly assigned to every
partition?
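
To illustrate why a hash partitioner can produce this kind of imbalance, here is a minimal standalone sketch. It does not use Spark or HiveKey; the class HashSkewDemo, its methods, and the String keys are hypothetical and only mimic the rule HashPartitioner applies (the key's hashCode modulo numPartitions, made non-negative, with null keys going to partition 0). Every row with the same join key lands in the same partition, so a few frequent keys are enough to overload a few tasks. Whether that is what happens in the job above is an assumption, not something I have confirmed.

```java
import java.util.Arrays;
import java.util.List;

public class HashSkewDemo {
    // Mimics the HashPartitioner rule: non-negative (hashCode mod numPartitions).
    // Null keys are sent to partition 0.
    public static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Counts how many records each partition would receive.
    public static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            sizes[partitionFor(key, numPartitions)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical skewed input: key "a" appears six times, "b" and "c" once each.
        List<String> keys = Arrays.asList("a", "a", "a", "a", "a", "a", "b", "c");
        // prints [0, 6, 1, 1]
        System.out.println(Arrays.toString(partitionSizes(keys, 4)));
    }
}
```

With four partitions, one partition receives six of the eight records and one stays empty, which is the same shape as a stage where most tasks finish quickly and a few run long.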
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)