Why do you do a glom()? It seems unnecessarily expensive to materialize each
partition in memory as an array.
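
For illustration, here is a minimal sketch (not code from the thread; Element,
input, and computeKeyValue() are hypothetical stand-ins) of the same map side
without glom(), letting the calculation consume each partition lazily through
its iterator:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFlatMapFunction;
    import scala.Tuple2;

    // Sketch only: mapPartitionsToPair hands the calculation a lazy Iterator
    // over the partition instead of the fully materialized array glom() builds.
    JavaPairRDD<byte[], byte[]> mapSideResult = input.mapPartitionsToPair(
        new PairFlatMapFunction<Iterator<Element>, byte[], byte[]>() {
            @Override
            public Iterable<Tuple2<byte[], byte[]>> call(Iterator<Element> records) {
                List<Tuple2<byte[], byte[]>> out = new ArrayList<>();
                while (records.hasNext()) {
                    out.add(computeKeyValue(records.next())); // per-record work
                }
                return out; // the Spark 1.x Java API expects an Iterable here
            }
        });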


On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 <qhz...@apache.org> wrote:

> Hi, Spark community,
>       I have an application that I am trying to migrate from MR to Spark.
>       It does some calculation on data from Hive and outputs HFiles, which
> are then bulk loaded into an HBase table; details as follows:
>
>      JavaRDD<Element> input = getSourceInputFromHive();
>      JavaPairRDD<byte[], byte[]> mapSideResult = input.glom()
>          .mapPartitionsToPair(/* some calculation, equivalent to the MR mapper */);
>      // PS: the result in each partition has already been sorted in
>      // lexicographical order during the calculation
>      mapSideResult
>          .repartitionAndSortWithinPartitions(/* partition with the byte[][] of
>              HTable split keys, equivalent to the MR shuffle */)
>          .mapToPair(/* transform Tuple2<byte[], byte[]> to
>              Tuple2<ImmutableBytesWritable, KeyValue>, equivalent to the MR
>              reducer without output */)
>          .saveAsNewAPIHadoopFile(/* write to HFiles */);
>
>       This all works fine on a small dataset, and Spark outruns MR by about
> 10%. However, when I apply it to a dataset of 150 million records, MR is
> about 100% faster than Spark (*MR 25 min, Spark 50 min*).
>        After exploring the application UI, it shows that the
> repartitionAndSortWithinPartitions stage is very slow: in its shuffle
> phase, a 6 GB shuffle takes about 18 min, which is quite unreasonable.
>        *Can anyone help with this issue and give me some advice? It's not
> iterative processing; however, I believe Spark should be at least as fast
> as MR here.*
>
>       Here is the cluster info:
>           vm: 8 nodes * (128 GB mem + 64 cores)
>           hadoop cluster: hdp 2.2.6
>           spark running mode: yarn-client
>           spark version: 1.5.1
>
>
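
For reference, the partitioner the quoted pipeline alludes to ("partition
with the byte[][] of HTable split keys") might look roughly like the sketch
below. This is an illustration under assumptions, not the poster's code:
splitKeys is taken to be the sorted region boundary keys with the first
region's empty start key excluded, mirroring the binary search that
HFileOutputFormat2's TotalOrderPartitioner performs on the MR side.

    import java.util.Arrays;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.Partitioner;

    // Sketch under assumptions: one partition per HBase region, chosen by
    // binary search over the sorted region boundary keys.
    public class SplitKeyPartitioner extends Partitioner {
        private final byte[][] splitKeys;

        public SplitKeyPartitioner(byte[][] splitKeys) {
            this.splitKeys = splitKeys;
        }

        @Override
        public int numPartitions() {
            return splitKeys.length + 1;
        }

        @Override
        public int getPartition(Object key) {
            byte[] rowKey = (byte[]) key;
            int idx = Arrays.binarySearch(splitKeys, rowKey, Bytes.BYTES_COMPARATOR);
            // On a miss, binarySearch returns -(insertionPoint) - 1, and the
            // insertion point is exactly the region index for rowKey; an
            // exact hit on a boundary key opens the next region.
            return idx >= 0 ? idx + 1 : -(idx + 1);
        }
    }

Such a partitioner would go into the two-argument
repartitionAndSortWithinPartitions(partitioner, comparator) overload together
with a serializable Comparator<byte[]>, since byte[] has no natural ordering.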
