Re: about mr-style merge sort

2015-09-10 Thread Saisai Shao
Hi Qianhao, I think you could sort the data yourself if you want to achieve the same result as MR, e.g. rdd.reduceByKey(...).mapPartitions(// sort within each partition). Do not call sortByKey afterwards, since it will introduce another shuffle (that is why it is slower than MR). The
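
A minimal sketch of that suggestion in Scala (the key/value types, the reduce function, and the app name are illustrative assumptions, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("mr-style-sort"))

    // Illustrative (key, value) input; in the real job this would come from Hive.
    val input = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))

    val result = input
      .reduceByKey(_ + _)                 // one shuffle, analogous to the MR reduce step
      .mapPartitions { iter =>
        // sort within each partition only; no second shuffle as sortByKey would add
        iter.toArray.sortBy(_._1).iterator
      }

This reproduces MR's per-reducer ordering with a single shuffle, whereas appending sortByKey would shuffle the data a second time to get a global order.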

Re: about mr-style merge sort

2015-09-10 Thread 周千昊
Hi all, can anyone give some tips about this issue? 周千昊 wrote on Tue, Sep 8, 2015 at 4:46 PM: > Hi, community > I have an application which I am trying to migrate from MR to Spark. > It does some calculations on Hive data and outputs HFiles which will > be bulk loaded into an HBase table,

Re: about mr-style merge sort

2015-09-10 Thread Raghavendra Pandey
In MR jobs, the output is sorted only within each reducer. That can be better emulated by sorting each partition of the RDD rather than totally sorting the RDD. In RDD.mapPartitions you can sort the data within one partition and try... On Sep 11, 2015 7:36 AM, "周千昊" wrote: > Hi, all >
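
For contrast, a short sketch of the two options (assuming an RDD[(String, Int)] named input, as in the sketch above; both variable names are illustrative):

    // global ordering, but adds a second shuffle after reduceByKey
    val totalSorted = input.reduceByKey(_ + _).sortByKey()

    // per-partition ordering only, matching MR's per-reducer output, no extra shuffle
    val perPartitionSorted = input
      .reduceByKey(_ + _)
      .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator)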

Re: about mr-style merge sort

2015-09-10 Thread 周千昊
Hi Shao & Pandey, thanks for the tips. I will try to work around this. Saisai Shao wrote on Fri, Sep 11, 2015 at 1:23 PM: > Hi Qianhao, > > I think you could sort the data yourself if you want to achieve the same > result as MR, e.g. rdd.reduceByKey(...).mapPartitions(// sort within

about mr-style merge sort

2015-09-08 Thread 周千昊
Hi, community I have an application which I am trying to migrate from MR to Spark. It does some calculations on Hive data and outputs HFiles which will be bulk loaded into an HBase table, details as follows: Rdd input = getSourceInputFromHive() Rdd> mapSideResult =
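
A very rough sketch of the shape such a pipeline might take in Scala (the Hive query, the table and column names, and the HBase write step are all illustrative assumptions; the original snippet above is cut off):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-to-hbase-bulkload"))
    val hive = new HiveContext(sc)

    // Read the source rows from Hive (table and columns are made up for illustration).
    val input: RDD[(String, Long)] = hive
      .sql("SELECT row_key, metric FROM some_source_table")
      .rdd
      .map(row => (row.getString(0), row.getLong(1)))

    // Aggregate (one shuffle) and sort within each partition, mirroring the MR
    // map -> shuffle -> per-reducer sort pipeline discussed in the replies above.
    val sorted = input
      .reduceByKey(_ + _)
      .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator)

    // The sorted partitions would then be written out as HFiles (e.g. via
    // HFileOutputFormat2) and bulk loaded into the HBase table; that step is
    // omitted here because the original code is truncated.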