[jira] [Comment Edited] (SPARK-22622) OutOfMemory thrown by Closure Serializer without proper failure propagation

2017-11-28 Thread Raghavendra (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269974#comment-16269974 ]

Raghavendra edited comment on SPARK-22622 at 11/29/17 2:55 AM:
---

[~srowen] I am broadcasting all my large data structures, and I'm assuming that 
the actual RDD is not serialized via the closure serializer. Then what is 
causing this error? Even if a specific piece of data is causing this issue, a 
proactive check by the Spark serializer reporting the exact culprit would be 
very helpful.
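
To be concrete about the pattern I mean (a minimal sketch with placeholder names such as {{largeLookupTable}}, since I cannot share my actual code):

{code:java}
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastSketch {
  // Ship the large structure once per executor as a broadcast variable, so
  // the task closure captures only the small Broadcast handle rather than
  // the structure itself.
  static JavaRDD<Long> resolve(JavaSparkContext jsc,
                               JavaRDD<String> records,
                               Map<String, Long> largeLookupTable) {
    Broadcast<Map<String, Long>> lookup = jsc.broadcast(largeLookupTable);
    return records.map(key -> lookup.value().getOrDefault(key, -1L));
  }
}
{code}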

Also, what about my second issue? Even if the root cause is large data, the 
job must still end gracefully. The Log4j logs show this stack trace, but the 
Spark UI shows nothing and the job does not end on this failure; instead the 
"dag-scheduler-event-loop" thread is killed and the job continues 
indefinitely. I think the latter, at least, is a bug that needs to be handled. 
Kindly clarify whether there are existing bugs covering both issues, and I 
will close this one.
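
The only driver-side stopgap I can think of (my own sketch, not an existing Spark option) is to make the fatal error visible by force:

{code:java}
// Hypothetical workaround sketch: install before creating the SparkContext.
// If an Error escapes a scheduler thread such as "dag-scheduler-event-loop",
// halt the JVM so the failure surfaces instead of the job hanging forever.
Thread.setDefaultUncaughtExceptionHandler((thread, err) -> {
    System.err.println("Uncaught fatal error in " + thread.getName() + ": " + err);
    Runtime.getRuntime().halt(1); // halt, not exit: shutdown hooks could block
});
{code}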

Unfortunately, I cannot post my exact code here, since it is copyrighted.



> OutOfMemory thrown by Closure Serializer without proper failure propagation
> ---
>
> Key: SPARK-22622
> URL: https://issues.apache.org/jira/browse/SPARK-22622
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0
> Hadoop 2.9.0
> Reporter: Raghavendra
> Priority: Critical
>
> While moving from one stage to another, the closure serializer tries to 
> serialize the closures and throws OOMs.
> This happens when the RDD size crosses 70 GB.
> I set the driver memory to 225 GB, and yet the error persists.
> There are two issues here:
> * An OOM is thrown even though the driver memory provided is almost 3 times 
> the size of the last stage's RDD. (I even tried caching it to disk before 
> moving it into the current stage.)
> * After the error is thrown, the Spark job does not exit; it just continues 
> in the same state without propagating the error to the Spark UI.
> *Scenario 1*
> {color:red}Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at java.util.Arrays.copyOf(Arrays.java:3236)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {color}
> *Scenario 2*
> {color:red}Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> {color}
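> 
> To be clear about Scenario 1: a Java {{byte[]}} is indexed by an {{int}}, so a single serialized buffer cannot grow past roughly 2 GB no matter how large the driver heap is. A standalone illustration (my own snippet, not from the job):
> {code:java}
> public class ArrayLimitDemo {
>   public static void main(String[] args) {
>     // On HotSpot this throws java.lang.OutOfMemoryError:
>     // "Requested array size exceeds VM limit" even with -Xmx225g, because
>     // the maximum array length is slightly below Integer.MAX_VALUE.
>     byte[] buf = new byte[Integer.MAX_VALUE];
>     System.out.println(buf.length);
>   }
> }
> {code}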


[jira] [Comment Edited] (SPARK-22622) OutOfMemory thrown by Closure Serializer without proper failure propagation

2017-11-28 Thread Raghavendra (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269917#comment-16269917 ]

Raghavendra edited comment on SPARK-22622 at 11/29/17 1:50 AM:
---

The output of the previous stage is around 70 GB and my driver memory is 225 
GB. Why am I getting an OOM when there is sufficient memory?
Also, since this is the closure serializer, isn't it only serializing the 
physical plan and not the RDD? (A rough size-probe sketch follows the stack 
trace below.)
17/11/28 07:22:53 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[25] at parquet at ***.java:473), which has no missing parents
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 25 - 0 - 0 - StorageLevel(1 replicas) - parquet at ***.java:473
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 20 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 19 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 12 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:88
17/11/28 07:22:53 ERROR LogListener: RDD Info : FileScanRDD - 11 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:88
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 22 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 18 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 21 - 0 - 0 - StorageLevel(1 replicas) - persist at ***.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 24 - 0 - 0 - StorageLevel(1 replicas) - parquet at ***.java:473

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
  at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
  at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
  at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
  at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
  at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
  at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
  at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
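
The size-probe sketch mentioned above (placeholder names, using Spark's {{SizeEstimator}}; it reports heap footprint rather than exact serialized size, but a multi-GB estimate would flag the culprit):

{code:java}
import org.apache.spark.api.java.function.Function;
import org.apache.spark.util.SizeEstimator;

public class ClosureSizeProbe {
  // Rough estimate of the object graph a closure drags along. If this is in
  // the gigabytes, the closure serializer in submitMissingTasks will hit the
  // 2 GB byte[] ceiling regardless of driver heap size.
  static long probe(Function<String, Long> closure) {
    return SizeEstimator.estimate(closure);
  }
}
{code}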



[jira] [Comment Edited] (SPARK-22622) OutOfMemory thrown by Closure Serializer

2017-11-28 Thread Raghavendra (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269894#comment-16269894 ]

Raghavendra edited comment on SPARK-22622 at 11/29/17 1:39 AM:
---

The output of the previous stage is around 70 GB and my driver memory is 225 
GB. Why am I getting an OOM when there is sufficient memory?

Also, since this is the closure serializer, isn't it only serializing the 
physical plan and not the RDD?

17/11/28 07:22:53 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[25] at parquet at LogProcessor.java:473), which has no missing parents
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 25 - 0 - 0 - StorageLevel(1 replicas) - parquet at LogProcessor.java:473
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 20 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 19 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 12 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:88
17/11/28 07:22:53 ERROR LogListener: RDD Info : FileScanRDD - 11 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:88
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 22 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 18 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 21 - 0 - 0 - StorageLevel(1 replicas) - persist at LogProcessor.java:283
17/11/28 07:22:53 ERROR LogListener: RDD Info : MapPartitionsRDD - 24 - 0 - 0 - StorageLevel(1 replicas) - parquet at LogProcessor.java:473

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
  at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
  at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
  at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
  at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
  at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
  at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
  at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)



