Hi again. Today I tried using bzip2 files instead of gzip, but the problem is the same; I really don't understand where the problem is :(
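
For context, the job is doing roughly this (a minimal sketch, not the real GDriverTest9.py; the HDFS path is a placeholder, and the wholeTextFiles() call is inferred from the WholeTextFileRecordReader frames in the traces below):

from pyspark import SparkContext

sc = SparkContext(appName="GDriverTest9")

# Placeholder path -- the real input directory isn't shown in this thread.
# wholeTextFiles() reads each compressed file in full as one (path, content)
# record, which is why WholeTextFileRecordReader / CombineFileRecordReader
# show up in the stack traces below.
files = sc.wholeTextFiles("hdfs:///user/rheras/input/*.bz2")
print(files.count())  # the "count at GDriverTest9.py:77" job in the submit log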
- logs through the master web:

16/02/23 23:48:01 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
16/02/23 23:48:01 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
16/02/23 23:48:01 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
16/02/23 23:48:02 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
16/02/23 23:48:02 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
16/02/23 23:48:03 INFO executor.Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2)
Traceback (most recent call last):
  File "/opt/spark/current/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/spark/current/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/spark/current/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/opt/spark/current/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
16/02/23 23:48:03 INFO executor.Executor: Executor killed task 2.0 in stage 0.0 (TID 2)
16/02/23 23:48:03 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown

--- through the command line (submit):

...
16/02/23 23:44:41 INFO FileInputFormat: Total input paths to process : 4227
16/02/23 23:44:46 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 3, size left: 859411098
16/02/23 23:44:46 INFO SparkContext: Starting job: count at /home/instel/rheras/GDriverTest9.py:77
16/02/23 23:44:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/23 23:44:46 WARN TaskSetManager: Stage 0 contains a task of very large size (139 KB). The maximum recommended task size is 100 KB.
16/02/23 23:44:48 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, samson04.hi.inet): java.lang.UnsupportedOperationException
    at org.apache.hadoop.io.compress.bzip2.BZip2DummyDecompressor.decompress(BZip2DummyDecompressor.java:32)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.spark-project.guava.io.ByteStreams.copy(ByteStreams.java:207)
    at org.spark-project.guava.io.ByteStreams.toByteArray(ByteStreams.java:252)
    at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:81)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:161)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/02/23 23:46:45 WARN TaskSetManager: Lost task 3.1 in stage 0.0 (TID 4, samson02.hi.inet): java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:786)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.spark-project.guava.io.ByteStreams.copy(ByteStreams.java:207)
    at org.spark-project.guava.io.ByteStreams.toByteArray(ByteStreams.java:252)
    at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:81)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:161)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/02/23 23:46:46 INFO AppClient$ClientEndpoint: Executor updated: app-20160223234438-0004/1 is now EXITED (Command exited with code 52)
16/02/23 23:46:46 ERROR TaskSchedulerImpl: Lost executor 1 on samson02.hi.inet: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/02/23 23:46:46 WARN TaskSetManager: Lost task 3.2 in stage 0.0 (TID 5, samson02.hi.inet): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/02/23 23:46:46 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, samson02.hi.inet): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/02/23 23:46:46 INFO AppClient$ClientEndpoint: Executor added: app-20160223234438-0004/4 on worker-20160223163243-10.95.110.97-7078 (10.95.110.97:7078) with 16 cores
16/02/23 23:46:46 INFO AppClient$ClientEndpoint: Executor updated: app-20160223234438-0004/4 is now RUNNING
16/02/23 23:48:02 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 7, samson04.hi.inet): java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:786)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.spark-project.guava.io.ByteStreams.copy(ByteStreams.java:207)
    at org.spark-project.guava.io.ByteStreams.toByteArray(ByteStreams.java:252)
    at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:81)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:161)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/02/23 23:48:03 ERROR TaskSchedulerImpl: Lost executor 2 on samson04.hi.inet: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/02/23 23:48:03 WARN TaskSetManager: Lost task 3.3 in stage 0.0 (TID 6, samson04.hi.inet): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/02/23 23:48:03 ERROR TaskSetManager: Task 3 in stage 0.0 failed 4 times; aborting job
16/02/23 23:48:03 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, samson04.hi.inet): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/02/23 23:48:03 INFO AppClient$ClientEndpoint: Executor updated: app-20160223234438-0004/2 is now EXITED (Command exited with code 52)
16/02/23 23:48:03 INFO AppClient$ClientEndpoint: Executor added: app-20160223234438-0004/5 on worker-20160223163243-10.95.105.251-7078 (10.95.105.251:7078) with 16 cores
16/02/23 23:48:03 INFO AppClient$ClientEndpoint: Executor updated: app-20160223234438-0004/5 is now RUNNING
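
As a next step I'll try reading the same files with textFile() instead of wholeTextFiles(), to check whether the ordinary splittable BZip2Codec path can decompress them; if that count succeeds, the failure would seem specific to the whole-file reader rather than to the .bz2 files themselves. A minimal sketch (same placeholder path as above):

from pyspark import SparkContext

sc = SparkContext(appName="bz2-codec-check")

# textFile() decompresses through the regular Hadoop codec stream instead of
# WholeTextFileRecordReader, so it exercises a different read path.
lines = sc.textFile("hdfs:///user/rheras/input/*.bz2")
print(lines.count())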