Hi Qianhao,
I think you could sort the data yourself if you want to achieve the same
result as MR, e.g. rdd.reduceByKey(...).mapPartitions(// sort within each
partition). Do not call sortByKey again, since it would introduce another
shuffle (which is why the Spark job is slower than MR).
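A plain-Python model of that suggestion (not the Spark API) may make the difference concrete: after a reduceByKey the data is already hash-partitioned by key, so sorting each partition locally reproduces MR's per-reducer ordering without the extra shuffle that a sortByKey would add. The function names below are illustrative, not Spark's.

```python
def reduce_by_key(records, num_partitions=2):
    """Model of rdd.reduceByKey: hash-partition by key, sum values per key."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in records:
        part = partitions[hash(key) % num_partitions]
        part[key] = part.get(key, 0) + value
    return [list(p.items()) for p in partitions]

def sort_within_partitions(partitions):
    """Model of rdd.mapPartitions(sort): each partition is sorted
    independently; no data moves between partitions (no shuffle)."""
    return [sorted(p) for p in partitions]

records = [("b", 1), ("a", 2), ("b", 3), ("c", 1)]
partitions = sort_within_partitions(reduce_by_key(records))
# Each partition is now sorted by key, matching what an MR reducer emits;
# keys are NOT globally ordered across partitions, which is also true of MR.
```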
Hi, all
Can anyone give some tips about this issue?
周千昊 wrote on Tue, Sep 8, 2015 at 4:46 PM:
> Hi, community
> I have an application which I am trying to migrate from MR to Spark.
> It will do some calculations from Hive and output to an HFile, which will
> be bulk loaded into an HBase table,
In MR jobs, the output is sorted only within each reducer. That can be better
emulated by sorting each partition of the RDD rather than totally sorting the
RDD. In RDD.mapPartitions you can sort the data within one partition, so try
that.
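To illustrate the mapPartitions idea in a plain-Python sketch (not the Spark API): the function receives an iterator over one partition and returns an iterator, so a sort inside it touches only local data.

```python
def sort_partition(iterator):
    # Materialize the partition and sort by key. In a real Spark job,
    # keep an eye on memory if a single partition is very large.
    yield from sorted(iterator)

partition = [("b", 4), ("a", 2), ("c", 1)]
print(list(sort_partition(iter(partition))))
# → [('a', 2), ('b', 4), ('c', 1)]
```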
On Sep 11, 2015 7:36 AM, "周千昊" wrote:
> Hi, all
>
Hi, Shao & Pendey
Thanks for the tips. I will try to work around this.
Saisai Shao wrote on Fri, Sep 11, 2015 at 1:23 PM:
> Hi Qianhao,
>
> I think you could sort the data yourself if you want to achieve the same
> result as MR, e.g. rdd.reduceByKey(...).mapPartitions(// sort within
Hi, community
I have an application which I am trying to migrate from MR to Spark.
It will do some calculations from Hive and output to an HFile, which will
be bulk loaded into an HBase table; details as follows:
Rdd input = getSourceInputFromHive()
Rdd> mapSideResult =