Hi, I am running a 10-node standalone Spark cluster on AWS and loading 100 GB of data into HDFS. I first do a groupBy operation, and then generate pairs from the grouped RDD: from records like (key, [a1, b1]) and (key, [a, b, c]) I generate the value pairs (a1, b1), (a, b), (a, c), ..., so the resulting PairRDD gets very large.
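A simplified sketch of the pair-generation step, so the pattern is concrete (the HDFS path, the line parsing, and the use of groupByKey plus itertools.combinations are placeholders for illustration, not my exact script):

    from itertools import combinations
    from pyspark import SparkContext

    sc = SparkContext(appName="PairGeneration")

    # (key, value) records parsed from the input on HDFS
    # (hypothetical path and parsing).
    records = (sc.textFile("hdfs:///data/input")
                 .map(lambda line: tuple(line.split(",", 1))))

    # groupByKey pulls every value for a key into one list,
    # e.g. (key, [a1, b1]) and (key, [a, b, c]).
    grouped = records.groupByKey()

    # Emit all value pairs within each group: [a, b, c] ->
    # (a, b), (a, c), (b, c). A group of n values yields
    # n*(n-1)/2 pairs, which is why the PairRDD blows up.
    pairs = grouped.flatMap(lambda kv: combinations(list(kv[1]), 2))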
Some stats from the UI at the point where the errors start and the script finally fails:

Details for Stage 1 (Attempt 0)
  Total Time Across All Tasks: 1.3 h
  Shuffle Read: 4.4 GB / 1402058 records
  Shuffle Spill (Memory): 73.1 GB
  Shuffle Spill (Disk): 3.6 GB

I get the following stack trace:

WARN scheduler.TaskSetManager: Lost task 0.3 in stage 1.0 (TID 943, 10.239.131.154): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:175)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:111)
        ... 10 more
15/10/22 16:30:17 ERROR scheduler.TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
15/10/22 16:30:17 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1