In MR jobs the output is sorted only within each reducer. That is better
emulated by sorting each partition of the RDD rather than doing a total sort
of the whole RDD.
In RDD.mapPartitions you can sort the data within one partition and try...
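For example, something like the sketch below (untested; Java against the
Spark 1.x API, where flat-map style functions return an Iterable). The names
sortWithinPartitions and aggregated are placeholders, with aggregated standing
for the output of your reduceByKey step, and HBase's Bytes.BYTES_COMPARATOR
supplying the lexicographic byte[] ordering that HFiles expect:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFlatMapFunction;

    import scala.Tuple2;

    // Sort each partition by key independently: no shuffle, no total sort.
    static JavaPairRDD<byte[], byte[]> sortWithinPartitions(
            JavaPairRDD<byte[], byte[]> aggregated) {
        PairFlatMapFunction<Iterator<Tuple2<byte[], byte[]>>, byte[], byte[]>
            sortPartition = iter -> {
                // Buffer one partition in memory, then sort it by key.
                List<Tuple2<byte[], byte[]>> buf = new ArrayList<>();
                while (iter.hasNext()) {
                    buf.add(iter.next());
                }
                buf.sort((a, b) ->
                    Bytes.BYTES_COMPARATOR.compare(a._1(), b._1()));
                return buf; // a List is an Iterable, as Spark 1.x expects
            };
        // preservesPartitioning = true keeps reduceByKey's partitioner.
        return aggregated.mapPartitionsToPair(sortPartition, true);
    }

Alternatively, since a shuffle is happening anyway, a pair RDD's
repartitionAndSortWithinPartitions (available since Spark 1.2) can sort keys
within each partition during the shuffle itself, avoiding the extra pass.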
On Sep 11, 2015 7:36 AM, "周千昊" <z.qian...@gmail.com> wrote:

> Hi, all
>      Can anyone give some tips about this issue?
>
> On Tue, Sep 8, 2015 at 4:46 PM, 周千昊 <qhz...@apache.org> wrote:
>
>> Hi, community
>>      I have an application that I am trying to migrate from MR to Spark.
>>      It does some calculation over Hive data and writes the output to
>> HFiles, which are then bulk loaded into an HBase table. Details follow:
>>
>>      RDD<Element> input = getSourceInputFromHive()
>>      RDD<Tuple2<byte[], byte[]>> mapSideResult =
>>          input.glom().mapPartitions(/*some calculation*/)
>>      // PS: the result in each partition has already been sorted
>>      // in lexicographical order during the calculation
>>      mapSideResult.reduceByKey(/*some aggregations*/)
>>          .sortByKey(/**/)
>>          .map(/*transform Tuple2<byte[], byte[]> to
>>               Tuple2<ImmutableBytesWritable, KeyValue>*/)
>>          .saveAsNewAPIHadoopFile(/*write to hfile*/)
>>
>>       Here is the problem: in MR, on the reducer side the mapper output
>> has already been sorted, so the reducer performs a merge sort, which makes
>> writing the HFile sequential and fast.
>>       However, in Spark the output of the reduceByKey phase has been
>> shuffled, so I have to sort the whole RDD before writing the HFile, which
>> makes the job about 2x slower on Spark than on MR.
>>       I am wondering whether there is anything in Spark I can leverage
>> that has the same effect as in MR. I happened to see the JIRA ticket
>> https://issues.apache.org/jira/browse/SPARK-2926. Is it related to what I
>> am looking for?
>>
> --
> Best Regards
> ZhouQianhao
>
