liyunzhang_intel created HIVE-16980:
---------------------------------------

             Summary: The partition of join is not divided evenly in HOS
                 Key: HIVE-16980
                 URL: https://issues.apache.org/jira/browse/HIVE-16980
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel


In HoS, the join is implemented as a union followed by a repartition sort. The result of the union is partitioned with HashPartitioner.
SortByShuffler.java
{code}
  public JavaPairRDD<HiveKey, BytesWritable> shuffle(
      JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
    JavaPairRDD<HiveKey, BytesWritable> rdd;
    if (totalOrder) {
      // Total order: sortByKey range-partitions the keys.
      if (numPartitions > 0) {
        if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
          input.persist(StorageLevel.DISK_ONLY());
          sparkPlan.addCachedRDDId(input.id());
        }
        rdd = input.sortByKey(true, numPartitions);
      } else {
        rdd = input.sortByKey(true);
      }
    } else {
      // Otherwise (the join case described here): hash-partition, then sort within each partition.
      Partitioner partitioner = new HashPartitioner(numPartitions);
      rdd = input.repartitionAndSortWithinPartitions(partitioner);
    }
    return rdd;
  }
{code}

In the Spark history server, I saw that the repartition sort stage has 28 tasks: 21 of them finish in less than 1s, while the remaining 7 take a long time to execute. Is there any way to make the data evenly assigned to every partition?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
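
For illustration only, here is a minimal, Spark-free sketch of how hash partitioning can leave most partitions nearly empty. It assumes the rule Spark's HashPartitioner applies, namely the key's hashCode modulo numPartitions, kept non-negative. The class name, the helper methods, and both sets of hash codes are hypothetical; they are not taken from Hive and are not a diagnosis of this issue. The sketch only shows that when every hash code shares a factor with the partition count (here, multiples of 4 with 28 partitions), only 7 of the 28 partitions receive data.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Hive or Spark.
class HashPartitionSkewDemo {

    // Hash-partitioning rule: hashCode modulo numPartitions, shifted to be non-negative.
    static int partitionFor(int hashCode, int numPartitions) {
        int mod = hashCode % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Counts how many of the given hash codes land in each partition.
    static int[] histogram(int[] hashCodes, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int h : hashCodes) {
            counts[partitionFor(h, numPartitions)]++;
        }
        return counts;
    }

    // Number of partitions that received at least one record.
    static long nonEmpty(int[] counts) {
        return Arrays.stream(counts).filter(c -> c > 0).count();
    }

    public static void main(String[] args) {
        int numPartitions = 28; // matches the task count in the report
        int[] dense = new int[2800];
        int[] strided = new int[2800];
        for (int i = 0; i < dense.length; i++) {
            dense[i] = i;       // assumed: consecutive hash codes
            strided[i] = i * 4; // assumed: hash codes that are all multiples of 4
        }
        System.out.println("dense:   "
            + nonEmpty(histogram(dense, numPartitions)) + " non-empty partitions");
        System.out.println("strided: "
            + nonEmpty(histogram(strided, numPartitions)) + " non-empty partitions");
    }
}
```

With these assumed inputs, the consecutive hash codes fill all 28 partitions equally, while the multiples of 4 fill only partitions 0, 4, 8, ..., 24. A small number of very frequent join keys would produce a similar picture, so checking the distribution of key hash codes is one way to narrow down which case applies.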