Why do you do a glom? It seems unnecessarily expensive to materialize each partition in memory.
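For what it's worth, here is a minimal Scala sketch of the difference, assuming your per-record calculation can be stood in for by a placeholder (toKeyValueBytes below is hypothetical, not your code). mapPartitions already hands the whole partition to your function as a lazy Iterator, so the glom() step, which first copies the partition into an Array, can usually be dropped:

import org.apache.spark.rdd.RDD

object GlomVsMapPartitions {
  // Placeholder for the real per-record calculation (hypothetical helper).
  def toKeyValueBytes(s: String): (Array[Byte], Array[Byte]) =
    (s.getBytes("UTF-8"), s.toUpperCase.getBytes("UTF-8"))

  // glom() copies every partition into an Array[String] held in memory first.
  def withGlom(input: RDD[String]): RDD[(Array[Byte], Array[Byte])] =
    input.glom().mapPartitions(_.flatMap(_.iterator.map(toKeyValueBytes)))

  // mapPartitions alone already sees the whole partition as a lazy Iterator,
  // so records stream through without the extra per-partition copy.
  def withoutGlom(input: RDD[String]): RDD[(Array[Byte], Array[Byte])] =
    input.mapPartitions(_.map(toKeyValueBytes))
}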
On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 <qhz...@apache.org> wrote:
> Hi, spark community
>
> I have an application which I am trying to migrate from MR to Spark.
> It does some calculations on Hive data and outputs HFiles which are
> bulk-loaded into an HBase table. Details as follows:
>
>   Rdd<Element> input = getSourceInputFromHive()
>   Rdd<Tuple2<byte[], byte[]>> mapSideResult =
>       input.glom().mapPartitions(/* some calculation, equivalent to the MR mapper */)
>   // PS: the result in each partition has already been sorted in
>   // lexicographical order during the calculation
>   mapSideResult
>       .repartitionAndSortWithinPartitions(/* partition by byte[][], which is the
>           HTable split key; equivalent to the MR shuffle */)
>       .map(/* transform Tuple2<byte[], byte[]> to Tuple2<ImmutableBytesWritable, KeyValue>;
>           equivalent to the MR reducer without output */)
>       .saveAsNewAPIHadoopFile(/* write to HFile */)
>
> This all works fine on a small dataset, and Spark outruns MR by about 10%.
> However, when I apply it to a dataset of 150 million records, MR is about
> 100% faster than Spark (MR 25 min, Spark 50 min).
> Looking at the application UI, the repartitionAndSortWithinPartitions stage
> is very slow, and in its shuffle phase a 6 GB shuffle costs about 18 min,
> which is quite unreasonable.
> Can anyone help with this issue and give me some advice? It is not iterative
> processing, but I believe Spark should be at least as fast.
>
> Here is the cluster info:
> vm: 8 nodes * (128 GB mem + 64 cores)
> hadoop cluster: HDP 2.2.6
> spark running mode: yarn-client
> spark version: 1.5.1
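For reference, here is a rough Scala sketch of the shuffle-and-write half of the pipeline as I understand it from your description. The RegionPartitioner class, the "cf"/"q" family and qualifier, the split keys, and the output path are all placeholders I made up, and the HFileOutputFormat2 job configuration (HFileOutputFormat2.configureIncrementalLoad and friends) that a real bulk-load job would set up is omitted:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

object HFilePipelineSketch {

  // Route each row key to the region whose split range contains it, assuming
  // splitKeys are the sorted HTable region split points. This mirrors what
  // the MR bulk-load job's total-order partitioner does during the shuffle.
  class RegionPartitioner(splitKeys: Array[Array[Byte]]) extends Partitioner {
    override def numPartitions: Int = splitKeys.length + 1
    override def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[Array[Byte]]
      val idx = splitKeys.indexWhere(split => Bytes.compareTo(k, split) < 0)
      if (idx < 0) splitKeys.length else idx
    }
  }

  // Keys must be sorted lexicographically within every partition before the
  // HFile writer sees them, hence the byte-wise Ordering.
  implicit val bytesOrdering: Ordering[Array[Byte]] =
    Ordering.fromLessThan((a, b) => Bytes.compareTo(a, b) < 0)

  def writeHFiles(mapSideResult: RDD[(Array[Byte], Array[Byte])],
                  splitKeys: Array[Array[Byte]],
                  outputPath: String,
                  conf: Configuration): Unit = {
    mapSideResult
      .repartitionAndSortWithinPartitions(new RegionPartitioner(splitKeys))
      .map { case (rowKey, value) =>
        // Family and qualifier are placeholders for whatever the real job writes.
        val kv = new KeyValue(rowKey, Bytes.toBytes("cf"), Bytes.toBytes("q"), value)
        (new ImmutableBytesWritable(rowKey), kv)
      }
      .saveAsNewAPIHadoopFile(outputPath,
        classOf[ImmutableBytesWritable], classOf[KeyValue],
        classOf[HFileOutputFormat2], conf)
  }
}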