Hi, Spark community,
      I have an application that I am trying to migrate from MR to Spark.
      It does some calculation over Hive data and writes HFiles, which are
then bulk loaded into an HBase table. Details as follows:

     RDD<Element> input = getSourceInputFromHive()
     RDD<Tuple2<byte[], byte[]>> mapSideResult =
         input.glom().mapPartitions(/* some calculation, equivalent to the MR mapper */)
     // PS: the result in each partition is already sorted in
     // lexicographical order during the calculation
     mapSideResult
         .repartitionAndSortWithinPartitions(/* partitioner built from the
             byte[][] HTable split keys, equivalent to the MR shuffle */)
         .map(/* transform Tuple2<byte[], byte[]> to
             Tuple2<ImmutableBytesWritable, KeyValue>, equivalent to an MR
             reducer without output */)
         .saveAsNewAPIHadoopFile(/* write to HFiles */)
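
      In case the partitioning logic matters: it is the usual "route each
row key to its owning region" scheme. A simplified sketch below, assuming
the split keys are the table's region start keys (class and variable names
are placeholders, not my exact code):

     import java.util.Arrays;
     import org.apache.hadoop.hbase.util.Bytes;
     import org.apache.spark.Partitioner;

     // Sends each row key to the region that owns it, so each output
     // partition corresponds to exactly one HBase region (one HFile each).
     public class RegionSplitPartitioner extends Partitioner {
         private final byte[][] splitKeys; // sorted region start keys, first region excluded

         public RegionSplitPartitioner(byte[][] splitKeys) {
             this.splitKeys = splitKeys;
         }

         @Override
         public int numPartitions() {
             return splitKeys.length + 1;
         }

         @Override
         public int getPartition(Object key) {
             // A negative result encodes the insertion point, which is the
             // index of the region whose range contains this row key.
             int idx = Arrays.binarySearch(splitKeys, (byte[]) key, Bytes.BYTES_COMPARATOR);
             return idx < 0 ? -(idx + 1) : idx + 1;
         }
     }

It is used roughly like this (sortedRdd / outputPath / hbaseConf are
placeholders):

     JavaPairRDD<byte[], byte[]> sortedRdd = mapSideResult
         .repartitionAndSortWithinPartitions(
             new RegionSplitPartitioner(splitKeys), Bytes.BYTES_COMPARATOR);
     sortedRdd.mapToPair(/* byte[]/byte[] -> ImmutableBytesWritable/KeyValue */)
         .saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class,
             KeyValue.class, HFileOutputFormat2.class, hbaseConf);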

      This all works fine on a small dataset, and Spark outruns MR by about
10%. However, when I apply it to a dataset of 150 million records, MR is
about 100% faster than Spark (MR 25 min, Spark 50 min).
       After exploring the application UI, it shows that the
repartitionAndSortWithinPartitions stage is very slow: in its shuffle
phase, a shuffle of about 6 GB takes about 18 min, which seems quite
unreasonable.
       Can anyone help with this issue and give me some advice? It's not
iterative processing, but I believe Spark should be at least as fast as MR
here.

      Here is the cluster info:
          vm: 8 nodes * (128 GB mem + 64 cores)
          hadoop cluster: HDP 2.2.6
          spark running mode: yarn-client
          spark version: 1.5.1
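
      The job is submitted roughly like this (the executor sizing below is
illustrative rather than my exact numbers, and the class name is a
placeholder):

          spark-submit --master yarn-client \
              --num-executors 32 \
              --executor-cores 8 \
              --executor-memory 24g \
              --class com.example.HFileBulkLoad \
              my-app.jar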
