Hi all,

Please help me with the scenario below.

While running the query below on a large dataset (rowCount = 100,000,000):

// Other instances of this job are submitted to Spark concurrently
// from a multithreaded app (see the sketch after the query).

final Dataset<Row> df = spark.read().parquet(tablePath);
// df's storage on HDFS is 5.64 GB across 45 blocks.
df.select(col)
  .na().drop()                   // drop rows with nulls in the selected column
  .dropDuplicates(col)           // de-duplicate on that column
  .coalesce(20)
  .sort(df.col(col))
  .coalesce(1)                   // collapse to one partition for a single CSV file
  .write().mode(SaveMode.Ignore)
  .csv(path);
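
For context, here is roughly how the concurrent submission looks. This is a simplified, self-contained sketch, not our exact code: the class name, pool size, table paths, and the stand-in count() action inside each task are placeholders.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.spark.sql.SparkSession;

public class ConcurrentSubmit {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .appName("ConcurrentSubmit") // placeholder name
                .getOrCreate();

        // Placeholder inputs; in the real app each thread gets its own
        // table path, column, and output path.
        List<String> tablePaths = Arrays.asList("hdfs:///tables/a", "hdfs:///tables/b");

        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is illustrative
        for (String tablePath : tablePaths) {
            pool.submit(() -> {
                // Each task runs the same select/dropDuplicates/sort/write
                // pipeline shown above against its own tablePath; count()
                // is just a stand-in action to keep this sketch short.
                spark.read().parquet(tablePath).count();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        spark.stop();
    }
}

All threads share the one SparkSession, so their jobs run concurrently inside the same application.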

I am getting the exception below:

Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2991


Here are the Spark env details:


  *   Cores in use: 20 Total, 0 Used
  *   Memory in use: 72.2 GB Total, 0.0 B Used

And the process configuration is as follows:

"spark.cores.max", “20"
"spark.executor.memory", “3400MB"
“spark.kryoserializer.buffer.max”,”1000MB”
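
For completeness, these settings are applied when building the session, roughly like this. This is a sketch: the app name is a placeholder, and the spark.serializer line is my assumption, since spark.kryoserializer.buffer.max only takes effect when Kryo is the serializer.

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("DedupeExport") // placeholder name
        .config("spark.cores.max", "20")
        .config("spark.executor.memory", "3400MB")
        // Assumption: Kryo is enabled so that the buffer.max setting applies.
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.max", "1000MB")
        .getOrCreate();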

Any leads would be highly appreciated.

Regards
Rohit Verma

