[jira] [Created] (HIVE-16980) The partition of join is not divided evenly in HOS

liyunzhang_intel (JIRA) Tue, 27 Jun 2017 20:40:04 -0700

liyunzhang_intel created HIVE-16980:
---------------------------------------


             Summary: The partition of join is not divided evenly in HOS
                 Key: HIVE-16980
                 URL: https://issues.apache.org/jira/browse/HIVE-16980
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel


In HoS (Hive on Spark), the join is implemented as a union followed by a
repartition-and-sort. We use HashPartitioner to partition the result of the
union.
SortByShuffler.java:
{code}
public JavaPairRDD<HiveKey, BytesWritable> shuffle(
    JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
  JavaPairRDD<HiveKey, BytesWritable> rdd;
  if (totalOrder) {
    if (numPartitions > 0) {
      if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
        input.persist(StorageLevel.DISK_ONLY());
        sparkPlan.addCachedRDDId(input.id());
      }
      rdd = input.sortByKey(true, numPartitions);
    } else {
      rdd = input.sortByKey(true);
    }
  } else {
    // No total order required: hash-partition by key, then sort within each partition.
    Partitioner partitioner = new HashPartitioner(numPartitions);
    rdd = input.repartitionAndSortWithinPartitions(partitioner);
  }
  return rdd;
}
{code}
In the Spark history server, I saw that the repartition-sort stage has 28
tasks: 21 of them finish in less than 1s, while the remaining 7 take a long
time to execute. Is there any way to distribute the data evenly across all
partitions?
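
One possible explanation (an assumption, not confirmed in this report) is key skew. A hash partitioner sends every record with the same key to the same partition, so a few very frequent join keys can overload a few tasks. The sketch below reproduces the partition-index rule that Spark's HashPartitioner applies (a non-negative hashCode modulo numPartitions) in plain Java, with no Spark dependency. The class name HashSkewDemo, the helper names, and the "hot"/"cold" key data are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class HashSkewDemo {

    // Mirrors the rule used by Spark's HashPartitioner:
    // null keys go to partition 0, otherwise a non-negative hashCode % numPartitions.
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Counts how many records land in each partition.
    static long[] recordsPerPartition(List<String> keys, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        // Hypothetical skewed input: 7 frequent keys with 100,000 rows each ...
        for (int k = 0; k < 7; k++) {
            for (int i = 0; i < 100_000; i++) {
                keys.add("hot-" + k);
            }
        }
        // ... plus 1,000 distinct keys with a single row each.
        for (int i = 0; i < 1_000; i++) {
            keys.add("cold-" + i);
        }

        long[] counts = recordsPerPartition(keys, 28);
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " records");
        }
    }
}
```

Running main prints the record count for each of the 28 partitions. All rows of one frequent key always share a partition, so raising numPartitions alone does not split such a key across tasks.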



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
