Hi, I have the following code, which I run in a thread that becomes a child
job of my main Spark job. Because of coalesce(1), it takes hours for large
data (around 1-2 GB), while it finishes quickly when the data is only MB/KB
in size; with even larger data sets it sometimes does not complete at all.
Please guide me on what I am doing wrong. Thanks in advance.

JavaRDD<Row> maskedRDD = sourceRdd.coalesce(1, true)
        .mapPartitionsWithIndex(new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
            @Override
            public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator)
                    throws Exception {
                List<Row> rowList = new ArrayList<>();
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    // Convert the Row to a Java list, apply the update, and rebuild the Row.
                    List<Object> rowAsList =
                            updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq()));
                    Row updatedRow = RowFactory.create(rowAsList.toArray());
                    rowList.add(updatedRow);
                }
                return rowList.iterator();
            }
        }, true);
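One thing worth noting about the snippet above: it buffers every updated Row of
the (single, coalesced) partition into an ArrayList before returning, so the
whole data set must fit in one executor's heap at once. A minimal sketch of the
alternative, returning a lazy mapping iterator instead of a materialized list,
is below. This is plain Java with no Spark dependency; `transform` is a
hypothetical stand-in for the poster's `updateRowsMethod`, and `String` stands
in for `Row`:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

public class LazyPartitionMap {

    // Wrap an input iterator so each element is transformed on demand,
    // instead of buffering every transformed element in a List first.
    static <A, B> Iterator<B> mapLazily(Iterator<A> in, Function<A, B> transform) {
        return new Iterator<B>() {
            @Override
            public boolean hasNext() {
                return in.hasNext();
            }

            @Override
            public B next() {
                // Transform one element at a time; nothing is materialized.
                return transform.apply(in.next());
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> rows = Arrays.asList("a", "b").iterator();
        Iterator<String> masked = mapLazily(rows, s -> s.toUpperCase());
        while (masked.hasNext()) {
            System.out.println(masked.next());
        }
    }
}
```

Inside the `call(...)` method above, the same idea would mean returning such a
wrapping iterator over `rowIterator` rather than filling `rowList`, so memory
use stays constant per partition regardless of data size.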



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimiz-and-make-this-code-faster-using-coalesce-1-and-mapPartitionIndex-tp25947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
