Hi, I have a Spark job that does some processing on ORC data and writes it back out as ORC using the DataFrameWriter save() API introduced in Spark 1.4.0. The piece of code below uses a lot of shuffle memory. How do I optimize it? Is there anything wrong with it? It produces the expected output, but it is slow: long GC pauses, and it shuffles so much data that I run into memory issues. Please guide me, I am new to Spark. Thanks in advance.

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;

import scala.collection.JavaConversions;

// Collapse everything into a single partition (shuffle = false), rewrite
// each row through our iterate() helper, then union with modifiedRDD.
JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD()
    .coalesce(1, false)
    .map(new Function<Row, Row>() {
      @Override
      public Row call(Row row) throws Exception {
        if (row == null) {
          return null;
        }
        List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
        return RowFactory.create(rowAsList.toArray());
      }
    })
    .union(modifiedRDD);

DataFrame updatedDataFrame =
    hiveContext.createDataFrame(updatedDsqlRDD, renamedSourceFrame.schema());
updatedDataFrame.write().mode(SaveMode.Append).format("orc")
    .partitionBy("entity", "date").save("baseTable");

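In case it helps to show what I am considering: the sketch below is the same pipeline with the coalesce(1, false) removed, coalescing the DataFrame only right before the write. It reuses the same iterate() helper and variables from above, and the target of 10 partitions is just a made-up number, not something I have benchmarked. Would something like this avoid the huge shuffle?

// Same transformation, but without collapsing the RDD to one partition
// up front, so the map runs in parallel across the existing partitions.
JavaRDD<Row> mappedRDD = orderedFrame.toJavaRDD()
    .map(new Function<Row, Row>() {
      @Override
      public Row call(Row row) throws Exception {
        if (row == null) {
          return null;
        }
        List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
        return RowFactory.create(rowAsList.toArray());
      }
    })
    .union(modifiedRDD);

DataFrame result =
    hiveContext.createDataFrame(mappedRDD, renamedSourceFrame.schema());

// Reduce the number of output files only at the very end, and to more
// than one partition; the 10 here is only a placeholder.
result.coalesce(10)
    .write().mode(SaveMode.Append).format("orc")
    .partitionBy("entity", "date")
    .save("baseTable");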

