[jira] [Comment Edited] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536867#comment-14536867 ]

Allan Douglas R. de Oliveira edited comment on SPARK-3630 at 5/9/15 8:34 PM:

Got something like this, but using:
- Java serializer
- Snappy
- Also LZ4
- Spark 1.3.0 (most things on default settings)

We got the error many times on the same cluster (which had been doing fine for days), but after recreating the cluster the problem disappeared again.

Stack traces (the first two are from two different runs using Snappy, the third from an execution using LZ4):

15/05/09 13:05:55 ERROR Executor: Exception in task 234.2 in stage 27.1 (TID 876507)
java.io.IOException: PARSING_ERROR(2)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
    at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:358)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

15/05/08 20:46:28 WARN scheduler.TaskSetManager: Lost task 9559.0 in stage 55.0 (TID 424644, ip-172-24-36-214.ec2.internal): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
    at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at
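Switching Spark's I/O compression codec (as the reporter did when moving between Snappy and LZ4) is controlled by a single setting. A minimal sketch, assuming Spark 1.3.x, where {{spark.io.compression.codec}} accepts the short names snappy, lz4 and lzf; the application name is a made-up example:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Select the codec used for shuffle and other internal I/O before the
// SparkContext is created; "snappy" is the default in this Spark version.
val conf = new SparkConf()
  .setAppName("codec-switch-example") // hypothetical app name
  .set("spark.io.compression.codec", "lz4")

val sc = new SparkContext(conf)
{code}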
[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536880#comment-14536880 ]

Allan Douglas R. de Oliveira edited comment on SPARK-4105 at 5/9/15 8:36 PM:

Got something like this, but using:
- Java serializer
- Snappy
- Also LZ4
- Spark 1.3.0 (most things on default settings)

We got the error many times on the same cluster (which had been doing fine for days), but after recreating the cluster the problem disappeared again.

Stack traces (the first two are from two different runs using Snappy, the third from an execution using LZ4):

15/05/09 13:05:55 ERROR Executor: Exception in task 234.2 in stage 27.1 (TID 876507)
java.io.IOException: PARSING_ERROR(2)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
    at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:358)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

15/05/08 20:46:28 WARN scheduler.TaskSetManager: Lost task 9559.0 in stage 55.0 (TID 424644, ip-172-24-36-214.ec2.internal): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
    at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376833#comment-14376833 ]

Allan Douglas R. de Oliveira edited comment on SPARK-5928 at 3/23/15 10:54 PM:

I will answer with the info I have right now; later I'll try to get more information:
a) I don't have the exact number now, but it's something like 300G of shuffle read/write.
b) It was working with 1600 partitions before this started to happen; after the error we increased to 1 but got pretty much the same exception.
c) Yes, it generally happens some minutes later. We've put the job back in production after changing to lz4, but I feel that the problem will come back.
d) Yes, and I think that issue also happens with other errors (e.g. if the BlockManager times out because of too much GC).

Remote Shuffle Blocks cannot be more than 2 GB
----------------------------------------------

Key: SPARK-5928
URL: https://issues.apache.org/jira/browse/SPARK-5928
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Imran Rashid

If a shuffle block is over 2 GB, the shuffle fails with an uninformative exception. The tasks get retried a few times and then the job eventually fails.

Here is an example program which can cause the exception:
{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code}
Note that you can't trigger this exception in local mode; it only happens on remote fetches.

I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}

{noformat}
15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded
    at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
    at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
    at
{noformat}
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372988#comment-14372988 ]

Allan Douglas R. de Oliveira edited comment on SPARK-5928 at 3/21/15 8:23 PM:

We've hit this limit in production. It's very annoying. Our data may be somewhat skewed, but it's still annoying.
[jira] [Created] (SPARK-3840) Spark EC2 templates fail when variables are missing
Allan Douglas R. de Oliveira created SPARK-3840:

Summary: Spark EC2 templates fail when variables are missing
Key: SPARK-3840
URL: https://issues.apache.org/jira/browse/SPARK-3840
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Allan Douglas R. de Oliveira

For instance, https://github.com/mesos/spark-ec2/pull/58 introduced this problem when AWS_ACCESS_KEY_ID isn't set:

Configuring /root/shark/conf/shark-env.sh
Traceback (most recent call last):
  File "./deploy_templates.py", line 91, in <module>
    text = text.replace("{{" + key + "}}", template_vars[key])
TypeError: expected a character buffer object

This makes all the cluster configuration fail.
[jira] [Commented] (SPARK-3840) Spark EC2 templates fail when variables are missing
[ https://issues.apache.org/jira/browse/SPARK-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162783#comment-14162783 ]

Allan Douglas R. de Oliveira commented on SPARK-3840:

PR: https://github.com/mesos/spark-ec2/pull/74
[jira] [Created] (SPARK-3332) Tags shouldn't be the main strategy for machine membership on clusters
Allan Douglas R. de Oliveira created SPARK-3332:

Summary: Tags shouldn't be the main strategy for machine membership on clusters
Key: SPARK-3332
URL: https://issues.apache.org/jira/browse/SPARK-3332
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Allan Douglas R. de Oliveira

The implementation for SPARK-2333 changed the machine-membership mechanism from security groups to tags. This is a fundamentally flawed strategy, as there is no guarantee that the machines will ever get a tag (even with a retry mechanism). For instance, if the script is killed after launching the instances but before setting the tags, the machines will be invisible to a destroy command, leaving an unmanageable cluster behind.

The initial proposal is to go back to the previous behavior in all cases except when the new flag (--security-group-prefix) is used.

It's also worth mentioning that SPARK-3180 introduced the --additional-security-group flag, which is a reasonable solution to SPARK-2333 (but not a full replacement for all use cases of --security-group-prefix).
[jira] [Comment Edited] (SPARK-3332) Tagging is not atomic with launching instances on EC2
[ https://issues.apache.org/jira/browse/SPARK-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116950#comment-14116950 ]

Allan Douglas R. de Oliveira edited comment on SPARK-3332 at 9/1/14 12:49 AM:

[~pwendell], yes, this is a good rewording, and I agree with reverting it for now. You mentioned the two potential solutions, but I think a good compromise is the one implemented by the PR: it keeps the flag allowing reuse of the same security group, while still matching the machines by security group in the other cases. Perhaps more messages could be added for when the flag has been used and the tagging failed.
[jira] [Created] (SPARK-3259) User data should be given to the master
Allan Douglas R. de Oliveira created SPARK-3259:

Summary: User data should be given to the master
Key: SPARK-3259
URL: https://issues.apache.org/jira/browse/SPARK-3259
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Allan Douglas R. de Oliveira

This is something SPARK-2246 missed: the master should also receive the given user data.
[jira] [Created] (SPARK-3180) Better control of security groups
Allan Douglas R. de Oliveira created SPARK-3180:

Summary: Better control of security groups
Key: SPARK-3180
URL: https://issues.apache.org/jira/browse/SPARK-3180
Project: Spark
Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

Two features can be combined to provide better control of security-group policies:
- the ability to specify the address authorized to access the default security group (instead of allowing everyone: 0.0.0.0/0)
- the possibility of placing the created machines in a custom security group

One can use the combination of the two flags to restrict external access to the provided security group (e.g. by setting the authorized address to 127.0.0.1/32) while maintaining compatibility with the current behavior.
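A hedged usage sketch of how these two controls could be combined on the spark-ec2 command line. The flag names (--authorized-address, --additional-security-group) are assumed from this proposal and its PR rather than verified against a specific spark_ec2.py version, and the key pair, address, and group name are made-up examples:

{noformat}
# Restrict the default security groups to a single admin address, and attach a
# pre-existing group that grants whatever extra access is actually needed.
./spark-ec2 --key-pair=mykey --identity-file=mykey.pem \
  --authorized-address=203.0.113.10/32 \
  --additional-security-group=my-admin-sg \
  launch my-cluster
{noformat}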
[jira] [Commented] (SPARK-3180) Better control of security groups
[ https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106171#comment-14106171 ]

Allan Douglas R. de Oliveira commented on SPARK-3180:

PR: https://github.com/apache/spark/pull/2088
[jira] [Commented] (SPARK-3180) Better control of security groups
[ https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106177#comment-14106177 ]

Allan Douglas R. de Oliveira commented on SPARK-3180:
-
Perhaps it also solves SPARK-2528.

Better control of security groups
-
Key: SPARK-3180
URL: https://issues.apache.org/jira/browse/SPARK-3180
Project: Spark
Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1044) Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
[ https://issues.apache.org/jira/browse/SPARK-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073938#comment-14073938 ]

Allan Douglas R. de Oliveira commented on SPARK-1044:
-
I think it is still a good idea even if the automatic cleanup is implemented: one large job or many small jobs can fill many gigabytes before the cleanup kicks in.

Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
-
Key: SPARK-1044
URL: https://issues.apache.org/jira/browse/SPARK-1044
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Tathagata Das
Priority: Minor

The default log location is SPARK_HOME/work/, and this leads to disk space running out pretty quickly. The spark-ec2 scripts should configure the cluster to automatically set the logging directory to /mnt/spark-work/ or similar on the mounted disks. SPARK_HOME/work may also be symlinked to that directory to maintain the existing setup.

--
This message was sent by Atlassian JIRA (v6.2#6252)
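The relocation-plus-symlink idea is mechanical; a hedged Python sketch (paths are illustrative, and this is not the spark-ec2 implementation):

import os
import shutil

spark_home = "/root/spark"             # illustrative SPARK_HOME
work = os.path.join(spark_home, "work")
target = "/mnt/spark-work"             # the large mounted disk

# Move any existing contents to the big disk, then leave a symlink
# behind so tools that expect SPARK_HOME/work keep working.
if not os.path.isdir(target):
    os.makedirs(target)
if os.path.isdir(work) and not os.path.islink(work):
    for entry in os.listdir(work):
        shutil.move(os.path.join(work, entry), target)
    os.rmdir(work)
if not os.path.exists(work):
    os.symlink(target, work)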
[jira] [Created] (SPARK-2246) Add user-data option to EC2 scripts
Allan Douglas R. de Oliveira created SPARK-2246:
---
Summary: Add user-data option to EC2 scripts
Key: SPARK-2246
URL: https://issues.apache.org/jira/browse/SPARK-2246
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Allan Douglas R. de Oliveira

EC2 servers can use a user-data script for custom startup/initialization of machines. The EC2 scripts should provide an option to set this.

--
This message was sent by Atlassian JIRA (v6.2#6252)
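One way the option could be shaped, sketched with optparse (which spark_ec2.py uses); the flag name matches the proposal, but the details here are hypothetical:

from optparse import OptionParser

parser = OptionParser(usage="spark-ec2 [options] launch <cluster>")
parser.add_option("--user-data", default="",
                  help="Path of a user-data script to run on instance start-up")
(opts, args) = parser.parse_args()

user_data_content = ""
if opts.user_data:
    with open(opts.user_data) as f:
        user_data_content = f.read()

# user_data_content would then be handed to EC2 at launch time:
#   conn.run_instances(..., user_data=user_data_content)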
[jira] [Commented] (SPARK-2246) Add user-data option to EC2 scripts
[ https://issues.apache.org/jira/browse/SPARK-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041465#comment-14041465 ]

Allan Douglas R. de Oliveira commented on SPARK-2246:
-
PR: https://github.com/apache/spark/pull/1186

Add user-data option to EC2 scripts
---
Key: SPARK-2246
URL: https://issues.apache.org/jira/browse/SPARK-2246
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Allan Douglas R. de Oliveira

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs
[ https://issues.apache.org/jira/browse/SPARK-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000962#comment-14000962 ]

Allan Douglas R. de Oliveira commented on SPARK-1868:
-
PR: https://github.com/apache/spark/pull/813

Users should be allowed to cogroup at least 4 RDDs
--
Key: SPARK-1868
URL: https://issues.apache.org/jira/browse/SPARK-1868
Project: Spark
Issue Type: Improvement
Components: Java API, Spark Core
Reporter: Allan Douglas R. de Oliveira

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs
Allan Douglas R. de Oliveira created SPARK-1868:
---
Summary: Users should be allowed to cogroup at least 4 RDDs
Key: SPARK-1868
URL: https://issues.apache.org/jira/browse/SPARK-1868
Project: Spark
Issue Type: Improvement
Components: Java API, Spark Core
Reporter: Allan Douglas R. de Oliveira

The cogroup client API currently allows up to 3 RDDs to be cogrouped. It would be convenient to allow more than this, as cogroup is a very fundamental operation and in the real world we need to group many RDDs together.

--
This message was sent by Atlassian JIRA (v6.2#6252)
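For a sense of the boilerplate a native 4-way cogroup removes, here is a hedged PySpark sketch that emulates cogrouping four RDDs by tagging each value with its source, unioning, and grouping (illustrative data; the PR itself targets the Scala/Java API):

from pyspark import SparkContext

sc = SparkContext("local", "cogroup4-sketch")

a = sc.parallelize([("k", 1)])
b = sc.parallelize([("k", 2)])
c = sc.parallelize([("k", 3)])
d = sc.parallelize([("k", 4)])

# Tag each value with the index of its source RDD, union everything,
# group by key, then split the grouped values back into four buckets.
tagged = sc.union([r.map(lambda kv, i=i: (kv[0], (i, kv[1])))
                   for i, r in enumerate([a, b, c, d])])

def split_buckets(values):
    buckets = ([], [], [], [])
    for i, v in values:
        buckets[i].append(v)
    return buckets

cogrouped = tagged.groupByKey().mapValues(split_buckets)
print(cogrouped.collect())  # [('k', ([1], [2], [3], [4]))]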
[jira] [Commented] (SPARK-1716) EC2 script should exit with non-zero code on UsageError
[ https://issues.apache.org/jira/browse/SPARK-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13989226#comment-13989226 ]

Allan Douglas R. de Oliveira commented on SPARK-1716:
-
PR here: https://github.com/apache/spark/pull/638

EC2 script should exit with non-zero code on UsageError
---
Key: SPARK-1716
URL: https://issues.apache.org/jira/browse/SPARK-1716
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Allan Douglas R. de Oliveira

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1716) EC2 script should exit with non-zero code on UsageError
Allan Douglas R. de Oliveira created SPARK-1716:
---
Summary: EC2 script should exit with non-zero code on UsageError
Key: SPARK-1716
URL: https://issues.apache.org/jira/browse/SPARK-1716
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Allan Douglas R. de Oliveira

One reason this matters: some ssh errors are raised as UsageError, preventing automated usage of the script from detecting the failure.

--
This message was sent by Atlassian JIRA (v6.2#6252)
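The general shape of such a fix, as a hedged Python sketch (names are hypothetical; the actual spark_ec2.py structure may differ):

import sys

class UsageError(Exception):
    """Bad usage -- and, per this issue, also how some ssh errors surface."""

def real_main():
    # Illustrative stand-in for the script body.
    raise UsageError("ssh connection refused")

def main():
    try:
        real_main()
    except UsageError as e:
        sys.stderr.write("Error: %s\n" % e)
        sys.exit(1)  # non-zero, so automated callers can detect the failure

if __name__ == "__main__":
    main()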
[jira] [Created] (SPARK-1705) Allow multiple instances per node with SPARK-EC2
Allan Douglas R. de Oliveira created SPARK-1705:
---
Summary: Allow multiple instances per node with SPARK-EC2
Key: SPARK-1705
URL: https://issues.apache.org/jira/browse/SPARK-1705
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Allan Douglas R. de Oliveira

SPARK_WORKER_INSTANCES should be properly configurable when using the EC2 scripts.

--
This message was sent by Atlassian JIRA (v6.2#6252)
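Since SPARK_WORKER_INSTANCES is consumed from spark-env.sh on each node, one plausible plumbing is to render the value into that file (a hedged sketch; the path, flag plumbing, and templating style are illustrative, not the spark-ec2 deploy-template code):

worker_instances = 2  # e.g. taken from a hypothetical --worker-instances flag

# Render the setting into spark-env.sh so each node starts that many workers.
template = "export SPARK_WORKER_INSTANCES={{worker_instances}}\n"
rendered = template.replace("{{worker_instances}}", str(worker_instances))

with open("/tmp/spark-env.sh", "a") as f:  # illustrative path
    f.write(rendered)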
[jira] [Commented] (SPARK-1705) Allow multiple instances per node with SPARK-EC2
[ https://issues.apache.org/jira/browse/SPARK-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988815#comment-13988815 ]

Allan Douglas R. de Oliveira commented on SPARK-1705:
-
Here is a pull request: https://github.com/apache/spark/pull/612

Allow multiple instances per node with SPARK-EC2
Key: SPARK-1705
URL: https://issues.apache.org/jira/browse/SPARK-1705
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Allan Douglas R. de Oliveira

--
This message was sent by Atlassian JIRA (v6.2#6252)