[jira] [Comment Edited] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2015-05-09 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536867#comment-14536867
 ] 

Allan Douglas R. de Oliveira edited comment on SPARK-3630 at 5/9/15 8:34 PM:
-

Got something like this but using:
- Java serializer
- Snappy
- Also Lz4
- Spark 1.3.0 (most things on default settings)
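
As a point of reference, that setup corresponds to a couple of SparkConf entries. 
The sketch below is a hypothetical reconstruction of the reporter's configuration 
(standard Spark 1.3 property names), not code from the report itself:

{code}
// Hypothetical repro configuration: Java serialization (the Spark 1.3
// default) plus Snappy block compression; swapping the codec to "lz4"
// matches the run that produced the third trace below.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-codec-repro")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.io.compression.codec", "snappy") // or "lz4"
val sc = new SparkContext(conf)
{code}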

We got the error many times on the same cluster (which had been doing fine for 
days), but after recreating the cluster the problem disappeared again. Stack 
traces below (the first two are from two different runs using Snappy; the third 
is from an execution using Lz4):

15/05/09 13:05:55 ERROR Executor: Exception in task 234.2 in stage 27.1 (TID 876507)
java.io.IOException: PARSING_ERROR(2)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
    at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:358)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


15/05/08 20:46:28 WARN scheduler.TaskSetManager: Lost task 9559.0 in stage 55.0 (TID 424644, ip-172-24-36-214.ec2.internal): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
    at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at 

[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-09 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536880#comment-14536880
 ] 

Allan Douglas R. de Oliveira edited comment on SPARK-4105 at 5/9/15 8:36 PM:
-

Got something like this but using:
- Java serializer
- Snappy
- Also Lz4
- Spark 1.3.0 (most things on default settings)

We got the error many times on the same cluster (which had been doing fine for 
days), but after recreating the cluster the problem disappeared again. Stack 
traces below (the first two are from two different runs using Snappy; the third 
is from an execution using Lz4):

15/05/09 13:05:55 ERROR Executor: Exception in task 234.2 in stage 27.1 (TID 876507)
java.io.IOException: PARSING_ERROR(2)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
    at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:358)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


15/05/08 20:46:28 WARN scheduler.TaskSetManager: Lost task 9559.0 in stage 55.0 (TID 424644, ip-172-24-36-214.ec2.internal): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
    at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
    at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
    at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:387)
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at 


[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-03-23 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376833#comment-14376833
 ] 

Allan Douglas R. de Oliveira edited comment on SPARK-5928 at 3/23/15 10:54 PM:
---

I will answer with the info I have right now; later I'll try to get more 
information:
a) I don't have the exact number now, but it's something like 300G of shuffle 
read/write
b) It was working with 1600 partitions before this started to happen, then 
after the error we increased to 1 but got pretty much the same exception.
c) Yes, generally it happens some minutes later. We've put the job back in 
production, changing to lz4. But I feel that the problem will come back.
d) Yes, and I think that issue also happens with other errors (e.g. if the 
BlockManager times out because of too much GC)
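
A hedged illustration of the mitigation in (b): raising the partition count on 
the wide operation shrinks each shuffle block. The names and numbers below are 
illustrative, not the job's actual values:

{code}
// More shuffle partitions means smaller per-reducer blocks, assuming the
// keys are not heavily skewed; a single hot key still lands in one block.
val counts = pairs.reduceByKey(_ + _, 20000) // second argument = numPartitions
// Equivalently, an existing RDD can be spread over more partitions up front:
val finer = rdd.repartition(20000)
{code}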


was (Author: douglaz):
I will answer with the info I have right now; later I'll try to get more 
information:
a) I don't have the exact number now, but it's something like 300G of shuffle 
read/write
b) It was working with 1600 partitions before this started to happen, then 
after the error we increased to 1 but got pretty much the same exception.
c) Yes, generally it happens some minutes later. We've got the job back in 
production, changing to lz4. But I feel that the problem will come back.
d) Yes, and I think that issue also happens with other errors (e.g. if the 
BlockManager times out because of too much GC)

 Remote Shuffle Blocks cannot be more than 2 GB
 --

 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
 exception.  The tasks get retried a few times and then eventually the job 
 fails.
 Here is an example program which can cause the exception:
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   // need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}
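 (For scale: 1e6 records of 3e3 mostly incompressible bytes each is roughly 
 3 GB landing in a single partition, and the frame reported in the log below, 
 3,021,252,889 bytes, indeed exceeds Integer.MAX_VALUE = 2,147,483,647.)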
 Note that you can't trigger this exception in local mode; it only happens on 
 remote fetches. I triggered these exceptions running with 
 {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
 {noformat}
 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded
   at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
   at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded
   at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
   at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
   at 
 



[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-03-21 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372988#comment-14372988
 ] 

Allan Douglas R. de Oliveira edited comment on SPARK-5928 at 3/21/15 8:23 PM:
--

We've hit this limit in production. It's very annoying. Our data may be 
somewhat skewed. But it's still annoying.


was (Author: douglaz):
We've hit this limit in production. It's very annoying. Our data may be 
somewhat skewed. But still annoying.


[jira] [Created] (SPARK-3840) Spark EC2 templates fail when variables are missing

2014-10-07 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-3840:
---

 Summary: Spark EC2 templates fail when variables are missing
 Key: SPARK-3840
 URL: https://issues.apache.org/jira/browse/SPARK-3840
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


For instance https://github.com/mesos/spark-ec2/pull/58 introduced this problem 
when AWS_ACCESS_KEY_ID isn't set:

Configuring /root/shark/conf/shark-env.sh
Traceback (most recent call last):
  File "./deploy_templates.py", line 91, in <module>
    text = text.replace("{{" + key + "}}", template_vars[key])
TypeError: expected a character buffer object

This makes all the cluster configuration fail: when the environment variable is 
unset, template_vars[key] is None, and str.replace requires a string argument, 
hence the TypeError.






[jira] [Commented] (SPARK-3840) Spark EC2 templates fail when variables are missing

2014-10-07 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162783#comment-14162783
 ] 

Allan Douglas R. de Oliveira commented on SPARK-3840:
-

PR: https://github.com/mesos/spark-ec2/pull/74







[jira] [Created] (SPARK-3332) Tags shouldn't be the main strategy for machine membership on clusters

2014-08-31 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-3332:
---

 Summary: Tags shouldn't be the main strategy for machine 
membership on clusters
 Key: SPARK-3332
 URL: https://issues.apache.org/jira/browse/SPARK-3332
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


The implementation for SPARK-2333 changed the machine membership mechanism from 
security groups to tags.

This is a fundamentally flawed strategy, as there is no guarantee at all that 
the machines will have a tag (even with a retry mechanism).

For instance, if the script is killed after launching the instances but before 
setting the tags, the machines will be invisible to a destroy command, leaving 
an unmanageable cluster behind.

The initial proposal is to go back to the previous behavior for all cases except 
when the new flag (--security-group-prefix) is used.

Also, it's worthwhile to mention that SPARK-3180 introduced the 
--additional-security-group flag, which is a reasonable solution to SPARK-2333 
(but isn't a full replacement for all use cases of --security-group-prefix).








[jira] [Comment Edited] (SPARK-3332) Tagging is not atomic with launching instances on EC2

2014-08-31 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116950#comment-14116950
 ] 

Allan Douglas R. de Oliveira edited comment on SPARK-3332 at 9/1/14 12:49 AM:
--

[~pwendell], yes this is a good reword and I agree to revert it for now. You 
mentioned the two potential solutions, but I think a good compromise is the one 
implemented by the PR, which keeps the flag (allowing reuse of the same security 
group) while still matching machines by security group in the other cases.

Perhaps more messages could be added for when the flag has been used and the 
tagging failed.


was (Author: douglaz):
[~pwendell], yes this is a good reword and I agree to revert it for now. You 
mentioned the two potential solutions, but I think a good compromise is the one 
implemented by the PR, which keeps the flag (allowing reuse of the same security 
group) while still matching machines by security group in the other cases.







[jira] [Created] (SPARK-3259) User data should be given to the master

2014-08-27 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-3259:
---

 Summary: User data should be given to the master
 Key: SPARK-3259
 URL: https://issues.apache.org/jira/browse/SPARK-3259
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


This is something SPARK-2246 missed. The master should also receive the 
given user data.
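
As a hedged sketch (AMI ID, instance type, and group name hypothetical), the 
fix amounts to passing the same user_data argument in the master launch call 
that the slave launches already receive:

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    with open("startup.sh") as f:  # hypothetical user-data script
        user_data = f.read()

    # The master launch should forward user_data exactly as the slave
    # launches do.
    conn.run_instances("ami-12345678", instance_type="m1.large",
                       user_data=user_data,
                       security_groups=["my-cluster-master"])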



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3180) Better control of security groups

2014-08-21 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-3180:
---

 Summary: Better control of security groups
 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira


Two features can be combined to provide better control of security 
group policies:

- The ability to specify the address authorized to access the default 
security group (instead of allowing everyone: 0.0.0.0/0)
- The ability to place the created machines in a custom security group

One can use the combination of the two flags to restrict external access to 
the provided security group (e.g. by setting the authorized address to 
127.0.0.1/32) while maintaining compatibility with the current behavior.
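
A minimal boto sketch of the two features (region, ports, CIDR, and group 
names hypothetical; not the actual flag implementation):

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    # Feature 1: authorize only a given address range on the default
    # group, instead of 0.0.0.0/0.
    group = conn.create_security_group("my-cluster-master", "Spark master")
    group.authorize(ip_protocol="tcp", from_port=22, to_port=22,
                    cidr_ip="203.0.113.0/24")

    # Feature 2: also place the machines in a custom, pre-existing
    # security group.
    conn.run_instances("ami-12345678",
                       security_groups=["my-cluster-master", "custom-sg"])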



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3180) Better control of security groups

2014-08-21 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106171#comment-14106171
 ] 

Allan Douglas R. de Oliveira commented on SPARK-3180:
-

PR: https://github.com/apache/spark/pull/2088

 Better control of security groups
 -

 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

 Two features can be combined to provide better control of security 
 group policies:
 - The ability to specify the address authorized to access the default 
 security group (instead of allowing everyone: 0.0.0.0/0)
 - The ability to place the created machines in a custom security group
 One can use the combination of the two flags to restrict external access to 
 the provided security group (e.g. by setting the authorized address to 
 127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3180) Better control of security groups

2014-08-21 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106177#comment-14106177
 ] 

Allan Douglas R. de Oliveira commented on SPARK-3180:
-

Perhaps it also solves SPARK-2528.

 Better control of security groups
 -

 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

 Two features can be combined to provide better control of security 
 group policies:
 - The ability to specify the address authorized to access the default 
 security group (instead of allowing everyone: 0.0.0.0/0)
 - The ability to place the created machines in a custom security group
 One can use the combination of the two flags to restrict external access to 
 the provided security group (e.g. by setting the authorized address to 
 127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1044) Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon

2014-07-24 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073938#comment-14073938
 ] 

Allan Douglas R. de Oliveira commented on SPARK-1044:
-

I think it is still a good idea even if the automatic cleanup is implemented. 
One large job or many small jobs can fill many gigabytes before the cleanup can 
kick in.

 Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
 -

 Key: SPARK-1044
 URL: https://issues.apache.org/jira/browse/SPARK-1044
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Tathagata Das
Priority: Minor

 The default log location is SPARK_HOME/work/, and this leads to disk space 
 running out pretty quickly. The spark-ec2 scripts should configure the 
 cluster to automatically set the logging directory to /mnt/spark-work/, or 
 something like that, on the mounted disks. The SPARK_HOME/work directory 
 may also be symlinked to that directory to maintain the existing setup.
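
A hedged Python sketch of the proposed setup step (paths assumed, not taken 
from the actual spark-ec2 scripts):

    import os

    spark_home = os.environ.get("SPARK_HOME", "/root/spark")
    work_dir = "/mnt/spark-work"  # the large mounted disk
    default_work = os.path.join(spark_home, "work")

    if not os.path.isdir(work_dir):
        os.makedirs(work_dir)
    # Keep SPARK_HOME/work as a symlink so the existing layout still works.
    # (Assumes the default directory does not already exist as a real dir.)
    if not os.path.exists(default_work):
        os.symlink(work_dir, default_work)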



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2246) Add user-data option to EC2 scripts

2014-06-23 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-2246:
---

 Summary: Add user-data option to EC2 scripts
 Key: SPARK-2246
 URL: https://issues.apache.org/jira/browse/SPARK-2246
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


EC2 servers can use a user-data script for custom startup/initialization of 
machines. The EC2 scripts should provide an option to set this.
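
A hedged sketch of what such an option could look like (flag name and wiring 
hypothetical; assumes an optparse-based script, as spark_ec2.py was at the 
time):

    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option("--user-data", default="",
                      help="Path of a user-data script to run on startup")
    (opts, args) = parser.parse_args()

    user_data = ""
    if opts.user_data:
        # Read the local file and pass its contents to EC2 at launch time,
        # e.g. conn.run_instances(ami, user_data=user_data, ...)
        with open(opts.user_data) as f:
            user_data = f.read()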



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2246) Add user-data option to EC2 scripts

2014-06-23 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041465#comment-14041465
 ] 

Allan Douglas R. de Oliveira commented on SPARK-2246:
-

PR:
https://github.com/apache/spark/pull/1186

 Add user-data option to EC2 scripts
 ---

 Key: SPARK-2246
 URL: https://issues.apache.org/jira/browse/SPARK-2246
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Allan Douglas R. de Oliveira

 EC2 servers can use a user-data script for custom startup/initialization 
 of machines. The EC2 scripts should provide an option to set this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs

2014-05-17 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000962#comment-14000962
 ] 

Allan Douglas R. de Oliveira commented on SPARK-1868:
-

PR:
https://github.com/apache/spark/pull/813

 Users should be allowed to cogroup at least 4 RDDs
 --

 Key: SPARK-1868
 URL: https://issues.apache.org/jira/browse/SPARK-1868
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Reporter: Allan Douglas R. de Oliveira

 The cogroup client API currently allows up to 3 RDDs to be cogrouped.
 It would be convenient to allow more than this, as cogroup is a very 
 fundamental operation and in the real world we often need to group many 
 RDDs together.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs

2014-05-17 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-1868:
---

 Summary: Users should be allowed to cogroup at least 4 RDDs
 Key: SPARK-1868
 URL: https://issues.apache.org/jira/browse/SPARK-1868
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Reporter: Allan Douglas R. de Oliveira


The cogroup client API currently allows up to 3 RDDs to be cogrouped.

It would be convenient to allow more than this, as cogroup is a very 
fundamental operation and in the real world we often need to group many 
RDDs together.
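
A hedged PySpark sketch of the pain point: with cogroup capped at three 
RDDs, a fourth must be folded in with an extra pass, which nests the value 
tuples awkwardly (local master and toy data for illustration only):

    from pyspark import SparkContext

    sc = SparkContext("local", "cogroup-sketch")
    a = sc.parallelize([(1, "a")])
    b = sc.parallelize([(1, "b")])
    c = sc.parallelize([(1, "c")])
    d = sc.parallelize([(1, "d")])

    # Pairwise cogroups work today, but each extra pass adds a level of
    # nesting that the caller has to unpack by hand.
    ab = a.cogroup(b)      # (k, (a_vals, b_vals))
    abc = ab.cogroup(c)    # (k, (groups of (a_vals, b_vals), c_vals))
    abcd = abc.cogroup(d)  # one more level of nesting again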



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1716) EC2 script should exit with non-zero code on UsageError

2014-05-04 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13989226#comment-13989226
 ] 

Allan Douglas R. de Oliveira commented on SPARK-1716:
-

PR here:
https://github.com/apache/spark/pull/638

 EC2 script should exit with non-zero code on UsageError
 ---

 Key: SPARK-1716
 URL: https://issues.apache.org/jira/browse/SPARK-1716
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Allan Douglas R. de Oliveira

 One reason is that some SSH errors are raised as UsageError, preventing 
 automated usage of the script from detecting the failure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1716) EC2 script should exit with non-zero code on UsageError

2014-05-04 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-1716:
---

 Summary: EC2 script should exit with non-zero code on UsageError
 Key: SPARK-1716
 URL: https://issues.apache.org/jira/browse/SPARK-1716
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


One reason is that some SSH errors are raised as UsageError, preventing 
automated usage of the script from detecting the failure.
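
A hedged sketch of the kind of fix this implies (UsageError and the wrapper 
are illustrative, not the actual spark-ec2 code):

    import sys

    class UsageError(Exception):
        pass

    def real_main():
        # Launch/destroy logic; some SSH failures surface as UsageError.
        raise UsageError("ssh connection refused")

    def main():
        try:
            real_main()
        except UsageError as e:
            sys.stderr.write("Error: %s\n" % e)
            sys.exit(1)  # non-zero, so automation can detect the failure

    if __name__ == "__main__":
        main()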



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1705) Allow multiple instances per node with SPARK-EC2

2014-05-03 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-1705:
---

 Summary: Allow multiple instances per node with SPARK-EC2
 Key: SPARK-1705
 URL: https://issues.apache.org/jira/browse/SPARK-1705
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


SPARK_WORKER_INSTANCES should be properly configurable when using the EC2 
scripts.
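
A hedged sketch of the plumbing (flag name and spark-env.sh path assumed): 
expose the count as an option and forward it to the cluster, where 
standalone mode reads SPARK_WORKER_INSTANCES:

    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option("--worker-instances", type="int", default=1,
                      help="Number of worker instances per node")
    (opts, args) = parser.parse_args()

    # Append to the cluster's spark-env.sh (path assumed for the EC2 AMI).
    with open("/root/spark/conf/spark-env.sh", "a") as f:
        f.write("export SPARK_WORKER_INSTANCES=%d\n" % opts.worker_instances)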



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1705) Allow multiple instances per node with SPARK-EC2

2014-05-03 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988815#comment-13988815
 ] 

Allan Douglas R. de Oliveira commented on SPARK-1705:
-

Here is a pull request:
https://github.com/apache/spark/pull/612

 Allow multiple instances per node with SPARK-EC2
 

 Key: SPARK-1705
 URL: https://issues.apache.org/jira/browse/SPARK-1705
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Allan Douglas R. de Oliveira

 SPARK_WORKER_INSTANCES should be properly configurable when using the EC2 
 scripts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)