Hi, I have the following code, which I run inside a thread that becomes a child job of my main Spark job. Because of coalesce(1), it takes hours for larger data (around 1-2 GB); when the data is in the MB/KB range it finishes quickly, and for some larger data sets it does not complete at all. Please guide me on what I am doing wrong. Thanks in advance.
JavaRDD<Row> maskedRDD = sourceRdd.coalesce(1, true)
    .mapPartitionsWithIndex(new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
        @Override
        public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator) throws Exception {
            List<Row> rowList = new ArrayList<>();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                // Convert the Row to a Java list, apply the masking logic, and rebuild the Row.
                List<Object> rowAsList = updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq()));
                Row updatedRow = RowFactory.create(rowAsList.toArray());
                rowList.add(updatedRow);
            }
            return rowList.iterator();
        }
    }, true);

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimiz-and-make-this-code-faster-using-coalesce-1-and-mapPartitionIndex-tp25947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
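For context, the per-row update in the loop above is row-local: each output row depends only on its input row, not on any other row. A minimal plain-Java sketch below illustrates this (updateRow is a hypothetical stand-in for updateRowsMethod that masks string fields; the real method is not shown in the post), which is why, for correctness alone, the transformation would not need all rows on a single partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RowMaskSketch {
    // Hypothetical stand-in for updateRowsMethod: replaces every
    // String field with "***" and leaves other values unchanged.
    static List<Object> updateRow(List<Object> fields) {
        List<Object> out = new ArrayList<>();
        for (Object f : fields) {
            out.add(f instanceof String ? "***" : f);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.asList("alice", 30),
                Arrays.asList("bob", 25));
        // Each row is transformed independently of the others,
        // so the work is embarrassingly parallel across partitions.
        List<List<Object>> masked = rows.stream()
                .map(RowMaskSketch::updateRow)
                .collect(Collectors.toList());
        System.out.println(masked); // [[***, 30], [***, 25]]
    }
}
```

Whether the single-partition constraint (coalesce(1, true)) is actually required here depends on downstream needs the post does not state, e.g. writing one output file.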