[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2020-08-05 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171724#comment-17171724
 ] 

Udit Mehrotra commented on SPARK-29767:
---

The issue has been open for quite some time. Can someone please take a look at 
this?

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell 
> and PySpark results in the executor containers doing a *core dump* and exiting 
> with Exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/co

[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2020-03-19 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062898#comment-17062898
 ] 

Udit Mehrotra commented on SPARK-29767:
---

[~hyukjin.kwon] Can you take a look at it? There has been no activity on this 
for months now. I have provided the executor dump. Please let me know if there 
is any more information I can provide to help drive this.

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell 
> and PySpark results in the executor containers doing a *core dump* and exiting 
> with Exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/

[jira] [Comment Edited] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-06 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968895#comment-16968895
 ] 

Udit Mehrotra edited comment on SPARK-29767 at 11/7/19 3:41 AM:


[~hyukjin.kwon] I was finally able to get a core dump from the crashing executors. 
Attached *hs_err_pid13885.log*, the error report written alongside the core dump.

In it I notice the following trace:
{noformat}
RAX=
[error occurred during error reporting (printing register info), id 0xb]Stack: 
[0x7fbe8850f000,0x7fbe8861],  sp=0x7fbe8860dad0,  free 
space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa9ae92]
J 4331  sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 
0x7fbea94ffabe [0x7fbea94ffa00+0xbe]
j  org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5
j  org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66
j  org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5
j  scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13
j  scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22
j  
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78
j  
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5
j  
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8
j  
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
j  
org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
j  
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j  
org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33
j  
org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24
j  
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j  
org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33
j  
org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187
j  
org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210
j  
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37
j  
org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3
j  org.apache.spark.executor.Executor$TaskRunner.run()V+383
j  
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x680c5e]
V  [libjvm.so+0x67e024]
V  [libjvm.so+0x67e639]
V  [libjvm.so+0x6c3d41]
V  [libjvm.so+0xa77c22]
V  [libjvm.so+0x8c3b12]
C  [libpthread.so.0+0x7de5]  start_thread+0xc5{noformat}

Also attached is the core dump file *coredump.zip*.
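
For readers of the trace above: the topmost Java frames are Spark's null-check path, UnsafeRow.isNullAt -> BitSetMethods.isSet -> Platform.getLong -> sun.misc.Unsafe.getLong. The sketch below is an illustration of that bitset check, not Spark's actual source. Because the real read goes through sun.misc.Unsafe against a raw base object and offset, a row whose offset or size no longer matches the buffer it points to can fault natively (the hs_err fragment reports id 0xb, i.e. signal 11 / SIGSEGV, after which the JVM aborts, giving exit code 134 = 128 + SIGABRT) instead of throwing a Java exception.
{code:scala}
// Illustration only -- not Spark's actual source. Per the frames above,
// UnsafeRow.isNullAt(i) consults a null-tracking bitset stored at the start of
// the row's memory region; BitSetMethods.isSet reads the 64-bit word holding
// bit i. In Spark that word is fetched with sun.misc.Unsafe.getLong on a raw
// (baseObject, baseOffset) pair, with no bounds check at the native level.
object NullBitsetSketch {
  // Safe, on-heap stand-in for the unsafe read: the bitset occupies the first
  // ceil(numFields / 64) longs of the row buffer.
  def isNullAt(rowWords: Array[Long], ordinal: Int): Boolean = {
    val wordIndex = ordinal >> 6          // which 64-bit word holds this bit
    val mask      = 1L << (ordinal & 63)  // bit position inside that word
    (rowWords(wordIndex) & mask) != 0
  }

  def main(args: Array[String]): Unit = {
    val bitset = Array(0x2L)             // two fields, field 1 marked null
    println(isNullAt(bitset, 0))         // false
    println(isNullAt(bitset, 1))         // true
  }
}
{code}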


was (Author: uditme):
[~hyukjin.kwon] I was finally able to get a core dump from the crashing executors. 
Attached *hs_err_pid13885.log*, the error report written alongside the core dump.

In it I notice the following trace:
{noformat}
RAX=
[error occurred during error reporting (printing register info), id 0xb]Stack: 
[0x7fbe8850f000,0x7fbe8861],  sp=0x7fbe8860dad0,  free 
space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa9ae92]
J 4331  sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-06 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Attachment: coredump.zip

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell 
> and PySpark results in the executor containers doing a *core dump* and exiting 
> with Exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
> at org.apache.hadoop.ut

[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-06 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968895#comment-16968895
 ] 

Udit Mehrotra commented on SPARK-29767:
---

[~hyukjin.kwon] I was finally able to get a core dump from the crashing executors. 
Attached *hs_err_pid13885.log*, the error report written alongside the core dump.

In it I notice the following trace:
{noformat}
RAX=
[error occurred during error reporting (printing register info), id 0xb]Stack: 
[0x7fbe8850f000,0x7fbe8861],  sp=0x7fbe8860dad0,  free 
space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa9ae92]
J 4331  sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 
0x7fbea94ffabe [0x7fbea94ffa00+0xbe]
j  org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5
j  org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66
j  org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25
j  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5
j  scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13
j  scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22
j  
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78
j  
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5
j  
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8
j  
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
j  
org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
j  
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j  
org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33
j  
org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24
j  
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j  
org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33
j  
org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187
j  
org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210
j  
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37
j  
org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3
j  org.apache.spark.executor.Executor$TaskRunner.run()V+383
j  
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x680c5e]
V  [libjvm.so+0x67e024]
V  [libjvm.so+0x67e639]
V  [libjvm.so+0x6c3d41]
V  [libjvm.so+0xa77c22]
V  [libjvm.so+0x8c3b12]
C  [libpthread.so.0+0x7de5]  start_thread+0xc5{noformat}

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation o

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-06 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Attachment: hs_err_pid13885.log

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell 
> and PySpark results in the executor containers doing a *core dump* and exiting 
> with Exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
> at org.apache.hadoop.util.Shel

[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-06 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968757#comment-16968757
 ] 

Udit Mehrotra commented on SPARK-29767:
---

[~hyukjin.kwon] As I have mentioned in the description, and as you can see from 
the *stdout* logs, it fails to write the *core dump*. Any idea how I can get 
around it?
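
Not an answer from the thread, but one hedged avenue: the JVM's fatal-error report (hs_err_pid*.log) location can be controlled with -XX:ErrorFile, while the OS core file itself depends on the worker nodes' ulimit -c and kernel.core_pattern settings, which Spark configuration cannot change. A sketch with assumed paths; the option has to be in place before the executors launch (e.g. at spark-submit time) and will not retroactively apply to an already-running shell:
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only (assumed settings, not a confirmed workaround for this issue):
// write the HotSpot crash report to a directory that outlives the YARN
// container work dir. %p expands to the crashing JVM's pid. The actual core
// dump additionally requires `ulimit -c unlimited` (and a usable
// kernel.core_pattern) on the worker nodes themselves.
val spark = SparkSession.builder()
  .appName("spark-29767-debug")
  .config("spark.executor.extraJavaOptions",
    "-XX:ErrorFile=/mnt1/tmp/hs_err_pid%p.log") // hypothetical writable path
  .getOrCreate()
{code}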

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell 
> and PySpark results in the executor containers doing a *core dump* and exiting 
> with Exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_00

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-06 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Description: 
Running a union operation on two DataFrames through both the Scala Spark shell 
and PySpark results in the executor containers doing a *core dump* and exiting 
with Exit code 134.
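
A minimal reproduction sketch of the operation described above (editorial addition; the thread does not include the reporter's exact code, so the input path is a placeholder standing in for the attached parquet file). Exit code 134 = 128 + 6 (SIGABRT): the executor JVM aborted after a fatal native error rather than failing with a Java exception.
{code:scala}
// Sketch only -- hypothetical path and a straightforward self-union are
// assumptions; they are not confirmed to be the reporter's exact steps.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-29767-union-repro").getOrCreate()

val df1 = spark.read.parquet("/path/to/attached.snappy.parquet") // hypothetical path
val df2 = spark.read.parquet("/path/to/attached.snappy.parquet")

// Dataset.union matches columns by position, not by name, and requires both
// sides to have the same number of columns.
val unioned = df1.union(df2)

// Materializing the rows (show/collect/write) drives them through the generated
// UnsafeProjection, which is where the executors crash in this report.
unioned.show()
{code}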

The trace from the *Driver*:
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 
exited caused by one of the running tasks) Reason: Container from a bad node: 
container_1572981097605_0021_01_77 on host: ip-172-30-6-79.ec2.internal. 
Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_77
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
 trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted
 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.T

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Description: 
Running a union operation on two DataFrames through both the Scala Spark shell 
and PySpark results in the executor containers doing a *core dump* and exiting 
with Exit code 134.

The trace from the *Driver*:
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 
exited caused by one of the running tasks) Reason: Container from a bad node: 
container_1572981097605_0021_01_77 on host: ip-172-30-6-79.ec2.internal. 
Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_77
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
 trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted
 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.T

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Description: 
Running a union operation on two DataFrames through both the Scala Spark shell 
and PySpark results in the executor containers doing a *core dump* and exiting 
with Exit code 134.

The trace from the *Driver*:
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 
exited caused by one of the running tasks) Reason: Container from a bad node: 
container_1572981097605_0021_01_77 on host: ip-172-30-6-79.ec2.internal. 
Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_77
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
 trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted
 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.T

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Description: 
Running a union operation on two DataFrames through both the Scala Spark shell 
and PySpark results in the executor containers doing a *core dump* and exiting 
with Exit code 134.

The trace from the *Driver*:
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 
exited caused by one of the running tasks) Reason: Container from a bad node: 
container_1572981097605_0021_01_77 on host: ip-172-30-6-79.ec2.internal. 
Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_77
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
 trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted
 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.T

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
Description: 
Running a union operation on two DataFrames through both the Scala Spark shell 
and PySpark results in the executor containers doing a *core dump* and exiting 
with Exit code 134.

The trace from the *Driver*:
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 
exited caused by one of the running tasks) Reason: Container from a bad node: 
container_1572981097605_0021_01_77 on host: ip-172-30-6-79.ec2.internal. 
Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_77
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
 trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted
 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
  at org.apache.hadoop.util.Shell.run(Shell.java:869)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
  at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.T

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-29767:
--
    Attachment: part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark Shell 
> and PySpark results in executor containers doing a *core dump* and exiting 
> with exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
> at

[jira] [Created] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2019-11-05 Thread Udit Mehrotra (Jira)
Udit Mehrotra created SPARK-29767:
-

 Summary: Core dump happening on executors while doing simple union of Data Frames
 Key: SPARK-29767
 URL: https://issues.apache.org/jira/browse/SPARK-29767
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 2.4.4
 Environment: AWS EMR 5.27.0, Spark 2.4.4
Reporter: Udit Mehrotra


Running a union operation on two DataFrames through both the Scala Spark Shell 
and PySpark results in executor containers doing a *core dump* and exiting with 
exit code 134.

The trace from the *Driver*:

 
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 
exited caused by one of the running tasks) Reason: Container from a bad node: 
container_1572981097605_0021_01_77 on host: ip-172-30-6-79.ec2.internal. 
Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_77
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
 trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted
 
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
'-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
-Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 
11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
application_1572981097605_0021 --user-class-path 
file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
 > 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
 2> 
/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
  at org.apache.hadoop.util.Shell.run(Shell.java:869)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
  at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
  at org.apache.hadoop.yarn.server.nodem

[jira] [Updated] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle

2017-07-20 Thread Udit Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-21494:
--
Attachment: logs.zip

> Spark 2.2.0 AES encryption not working with External shuffle
> 
>
> Key: SPARK-21494
> URL: https://issues.apache.org/jira/browse/SPARK-21494
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.2.0
> Environment: AWS EMR
>Reporter: Udit Mehrotra
> Attachments: logs.zip
>
>
> Spark’s new AES based authentication mechanism does not seem to work when 
> configured with external shuffle service on YARN. 
> Here is the stack trace for the error we see in the driver logs:
> ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: 
> Unable to create executor due to Unable to register with external shuffle 
> server due to: java.lang.IllegalArgumentException: Authentication failed.
> at 
> org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>  
> Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
> spark.network.crypto.enabled true
> spark.network.crypto.saslFallback false
> spark.authenticate   true
>  
> Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both 
> Spark and YARN side is not giving much information either about why 
> authentication fails. The driver and node manager logs have been attached to 
> the JIRA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle

2017-07-20 Thread Udit Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-21494:
--
Description: 
Spark’s new AES based authentication mechanism does not seem to work when 
configured with external shuffle service on YARN. 

Here is the stack trace for the error we see in the driver logs:
ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable 
to create executor due to Unable to register with external shuffle server due 
to: java.lang.IllegalArgumentException: Authentication failed.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 
Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
spark.authenticate   true
 
Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark 
and YARN side is not giving much information either about why authentication 
fails. The driver and node manager logs have been attached to the JIRA.

  was:
Spark’s new AES based authentication mechanism does not seem to work when 
configured with external shuffle service on YARN. Here is the stack trace for 
the error we see in the driver logs:
ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable 
to create executor due to Unable to register with external shuffle server due 
to: java.lang.IllegalArgumentException: Authentication failed.
at 
org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 
Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
spark.authenticate   true
 
Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark 
and YARN side is not giving much information either about why authentication 
fails. The driver and nod

[jira] [Created] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle

2017-07-20 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-21494:
-

 Summary: Spark 2.2.0 AES encryption not working with External shuffle
 Key: SPARK-21494
 URL: https://issues.apache.org/jira/browse/SPARK-21494
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 2.2.0
 Environment: AWS EMR
Reporter: Udit Mehrotra


Spark’s new AES based authentication mechanism does not seem to work when 
configured with external shuffle service on YARN. Here is the stack trace for 
the error we see in the driver logs:
ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable 
to create executor due to Unable to register with external shuffle server due 
to: java.lang.IllegalArgumentException: Authentication failed.
at 
org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 
Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
spark.authenticate   true
 
Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark 
and YARN side is not giving much information either about why authentication 
fails. The driver and node manager logs have been attached to the JIRA.
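
For illustration only (this snippet is not part of the original report), the same three properties can be passed to a PySpark session as follows; on YARN they normally have to be in place at submit time (spark-defaults.conf or --conf), which is how they are set above:
{noformat}
from pyspark.sql import SparkSession

# Sketch only: the three properties listed above, shown together. The external
# shuffle service side still needs the matching configuration in yarn-site.xml,
# which is not shown here.
spark = (
    SparkSession.builder
    .appName("aes-rpc-encryption-test")  # placeholder application name
    .config("spark.authenticate", "true")
    .config("spark.network.crypto.enabled", "true")
    .config("spark.network.crypto.saslFallback", "false")
    .getOrCreate()
)

# Any simple job serves as a smoke test; in the report the executors fail
# earlier, while registering with the external shuffle service at startup.
spark.range(0, 1000000).repartition(10).count()
{noformat}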






[jira] [Created] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL

2017-04-27 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-20515:
-

 Summary: Issue with reading Hive ORC tables having char/varchar 
columns in Spark SQL
 Key: SPARK-20515
 URL: https://issues.apache.org/jira/browse/SPARK-20515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
 Environment: AWS EMR Cluster
Reporter: Udit Mehrotra


Reading from a Hive ORC table containing char/varchar columns fails in Spark
SQL. This happens because Spark SQL internally replaces char/varchar columns
with the String data type. As a result, when reading a table that was created
in Hive with varchar/char columns, Spark ends up using the wrong (String)
object inspector and throws a ClassCastException.
 
Here is the exception:
 
java.lang.ClassCastException: 
org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
org.apache.hadoop.io.Text
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at 
org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324)
at 
org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 
While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it 
still needs to be fixed in Spark 2.0.
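
For reference, a minimal repro sketch (table and column names are made up for
illustration; the table is created and populated in Hive, then read from Spark
with Hive support enabled):

// Step 1 (in Hive, outside Spark):
//   CREATE TABLE orc_varchar_test (name VARCHAR(20), grade CHAR(1)) STORED AS ORC;
//   INSERT INTO orc_varchar_test VALUES ('spark', 'a');
// Step 2 (in Spark): reading the table hits the ClassCastException above,
// because the varchar/char columns are internally mapped to String.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("orc-varchar-read")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("SELECT name, grade FROM orc_varchar_test").show()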






[jira] [Created] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

2017-03-27 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-20115:
-

 Summary: Fix DAGScheduler to recompute all the lost shuffle blocks 
when external shuffle service is unavailable
 Key: SPARK-20115
 URL: https://issues.apache.org/jira/browse/SPARK-20115
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core, YARN
Affects Versions: 2.1.0, 2.0.2
 Environment: Spark on Yarn with external shuffle service enabled, 
running on AWS EMR cluster.
Reporter: Udit Mehrotra


Spark’s DAGScheduler currently does not recompute all the lost shuffle blocks
on a host when a FetchFailed exception occurs while fetching shuffle blocks
from another executor with the external shuffle service enabled. Instead, it
only recomputes the lost shuffle blocks produced by the executor for which the
FetchFailed exception occurred. This works fine for the internal shuffle
scenario, where executors serve their own shuffle blocks, so only the blocks of
that executor should be considered lost. However, when the external shuffle
service is being used, a FetchFailed exception means that the shuffle service
running on that host has become unavailable, which is sufficient to assume that
all the shuffle blocks managed by the service on that host are lost. Therefore,
recomputing only the shuffle blocks associated with the particular executor for
which the FetchFailed exception occurred is not enough: there could be multiple
executors running on that host, and all the shuffle blocks managed by that
service need to be recomputed.
 
Since not all the shuffle blocks (for all the executors on the host) are
recomputed, subsequent attempts of the reduce stage fail as well, because the
newly scheduled tasks still try to fetch from the old location of the blocks
that were not recomputed and keep throwing further FetchFailed exceptions. This
ultimately causes the job to fail after the reduce stage has been retried 4
times.
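
To make the proposed behaviour concrete, here is an illustrative sketch. The
helper names below are hypothetical and not the actual DAGScheduler API; the
sketch only shows the decision the scheduler should make on a fetch failure.

// Hypothetical interface, for illustration only.
trait HypotheticalOutputTracker {
  def removeOutputsOnHost(host: String): Unit       // forget every map output served from a host
  def removeOutputsOnExecutor(execId: String): Unit // forget only one executor's map outputs
}

// On FetchFailed, decide how much shuffle output to invalidate, depending on
// whether the blocks were served by the external shuffle service.
def handleFetchFailure(host: String, execId: String,
                       externalShuffleServiceEnabled: Boolean,
                       tracker: HypotheticalOutputTracker): Unit = {
  if (externalShuffleServiceEnabled) {
    // The shuffle service on this host served blocks for every executor on the
    // host, so all of those outputs must be treated as lost and recomputed.
    tracker.removeOutputsOnHost(host)
  } else {
    // Internal shuffle: only the failing executor's own outputs are lost.
    tracker.removeOutputsOnExecutor(execId)
  }
}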






[jira] [Updated] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Udit Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-18756:
--
Description: 
We have a Spark streaming application that processes data from Kinesis.

In our application we are observing a memory leak at the executors, with Netty
buffers not being released properly when the Spark BlockManager tries to
replicate the input blocks received from the Kinesis stream. The leak occurs
when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input
blocks. However, if we change the storage level to MEMORY_AND_DISK, which
avoids creating a replica, we no longer observe the leak. We were able to
detect the leak and obtain the stack trace by running the executors with an
additional JVM option: -Dio.netty.leakDetectionLevel=advanced.

Here is the stack trace of the leak:

16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
called before it's garbage-collected. See 
http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 0
Created at:
io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103)
io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)

org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)

org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)

org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)

org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)

org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)

org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702)

org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)

org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)

org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)

org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)

org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)

org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)

org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)

org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

We also observe a continuous increase in off-heap memory usage at the
executors. Any help would be appreciated.
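
For context, here is a sketch of how the stream is wired up. Names and the
batch interval are placeholders, the spark-streaming-kinesis-asl package is
assumed to be on the classpath, and the executors are assumed to be started
with spark.executor.extraJavaOptions=-Dio.netty.leakDetectionLevel=advanced to
surface the leak.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Sketch only: stream, app and endpoint names are placeholders.
val conf = new SparkConf().setAppName("kinesis-leak-repro")
val ssc = new StreamingContext(conf, Seconds(10))

val stream = KinesisUtils.createStream(
  ssc, "leak-repro-app", "my-kinesis-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2) // replicated level leaks; MEMORY_AND_DISK does not

stream.count().print()
ssc.start()
ssc.awaitTermination()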

> Memory leak in Spark streaming
> --
>
> Key: SPARK-18756
> URL: https://issues.apache.org/jira/browse/SPARK-18756
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Udit Mehrotra
>
> We have a Spark streaming application, that processes data from Kinesis.
> In our application we are observing a memory leak at the Executors with Netty 
> buffers not being released properly, when the Spark BlockManager tries to 
> replicate the input blocks received from Kinesis stream. The leak occurs, 
> when we set Storage Level as MEMORY_AND_DISK_2 for the Kinesis input blocks. 
> However, if we change the Storage level to use MEMORY_AND_DISK, which avoids 
> creating a replica, we do not observe the leak any more. We were able to 
> detect the leak, and obtain the stack trace by running the executors with an 
> additional JVM option: -Dio.netty.leakDetectionLevel=advanced.
> Here is the stack trace of the leak:
> 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
> called before it's garbage-collected. See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 0
> Created at:
>   io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:103)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
>   
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
>   
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)
>   
> org.apache.spark.storage.BlockManager$

[jira] [Created] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-18756:
-

 Summary: Memory leak in Spark streaming
 Key: SPARK-18756
 URL: https://issues.apache.org/jira/browse/SPARK-18756
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, DStreams
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Udit Mehrotra









[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)

2016-11-19 Thread Udit Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680276#comment-15680276
 ] 

Udit Mehrotra commented on SPARK-17380:
---

The above leak was seen with Spark 2.0 running on EMR. I noticed that the code
path that causes the leak is the block replication code, so I switched from
StorageLevel.MEMORY_AND_DISK_2 to StorageLevel.MEMORY_AND_DISK for the received
Kinesis blocks. After switching, I no longer observe the above memory leak in
the logs, but the application still freezes after 3-3.5 days: Spark Streaming
stops processing the records, and the input queue of records received from
Kinesis keeps growing until the executor runs out of memory.

> Spark streaming with a multi shard Kinesis freezes after several days 
> (memory/resource leak?)
> -
>
> Key: SPARK-17380
> URL: https://issues.apache.org/jira/browse/SPARK-17380
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Xeto
> Attachments: exec_Leak_Hunter.zip, memory-after-freeze.png, memory.png
>
>
> Running Spark Streaming 2.0.0 on AWS EMR 5.0.0 consuming from Kinesis (125 
> shards).
> Used memory keeps growing all the time according to Ganglia.
> The application works properly for about 3.5 days till all free memory has 
> been used.
> Then, micro batches start queuing up but none is served.
> Spark freezes. You can see in Ganglia that some memory is being freed but it 
> doesn't help the job to recover.
> Is it a memory/resource leak?
> The job uses back pressure and Kryo.
> The code has a mapToPair(), groupByKey(),  flatMap(), 
> persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19); Then 
> storing to s3 using foreachRDD()
> Cluster size: 20 machines
> Spark cofiguration:
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO 
> -XX:+UseConcMarkSweepGC -XX:PermSize=256M -XX:MaxPermSize=256M 
> -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.master yarn-cluster
> spark.executor.instances 19
> spark.executor.cores 7
> spark.executor.memory 7500M
> spark.driver.memory 7500M
> spark.default.parallelism 133
> spark.yarn.executor.memoryOverhead 2950
> spark.yarn.driver.memoryOverhead 2950
> spark.eventLog.enabled false
> spark.eventLog.dir hdfs:///spark-logs/






[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)

2016-11-19 Thread Udit Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680270#comment-15680270
 ] 

Udit Mehrotra commented on SPARK-17380:
---

We came across this memory leak in the executor logs by using the JVM option
'-Dio.netty.leakDetectionLevel=advanced'. It looks like good evidence of a
memory leak, and it shows the location where the buffer was created.

16/11/09 06:03:28 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
called before it's garbage-collected. See 
http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 0
Created at:
io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103)
io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)

org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)

org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1161)

org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:976)

org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)

org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)

org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:700)

org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)

org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)

org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)

org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)

org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)

org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)

org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)

org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

Can we please have some action on this JIRA?

> Spark streaming with a multi shard Kinesis freezes after several days 
> (memory/resource leak?)
> -
>
> Key: SPARK-17380
> URL: https://issues.apache.org/jira/browse/SPARK-17380
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Xeto
> Attachments: exec_Leak_Hunter.zip, memory-after-freeze.png, memory.png
>
>
> Running Spark Streaming 2.0.0 on AWS EMR 5.0.0 consuming from Kinesis (125 
> shards).
> Used memory keeps growing all the time according to Ganglia.
> The application works properly for about 3.5 days till all free memory has 
> been used.
> Then, micro batches start queuing up but none is served.
> Spark freezes. You can see in Ganglia that some memory is being freed but it 
> doesn't help the job to recover.
> Is it a memory/resource leak?
> The job uses back pressure and Kryo.
> The code has a mapToPair(), groupByKey(),  flatMap(), 
> persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19); Then 
> storing to s3 using foreachRDD()
> Cluster size: 20 machines
> Spark cofiguration:
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO 
> -XX:+UseConcMarkSweepGC -XX:PermSize=256M -XX:MaxPermSize=256M 
> -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.master yarn-cluster
> spark.executor.instances 19
> spark.executor.cores 7
> spark.executor.memory 7500M
> spark.driver.memory 7500M
> spark.default.parallelism 133
> spark.yarn.executor.memoryOverhead 2950
> spark.yarn.driver.memoryOverhead 2950
> spark.eventLog.enabled false
> spark.eventLog.dir hdfs:///spark-logs/






[jira] [Created] (SPARK-17512) Specifying remote files for Python based Spark jobs in Yarn cluster mode not working

2016-09-12 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-17512:
-

 Summary: Specifying remote files for Python based Spark jobs in 
Yarn cluster mode not working
 Key: SPARK-17512
 URL: https://issues.apache.org/jira/browse/SPARK-17512
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Submit
Affects Versions: 2.0.0
Reporter: Udit Mehrotra


When I run a Python application in YARN cluster mode and specify a remote path
for the extra files to be added to the PYTHONPATH, using either the
‘--py-files’ option or the ‘spark.submit.pyFiles’ configuration, I get the
following error:

Exception in thread "main" java.lang.IllegalArgumentException: Launching Python 
applications through spark-submit is currently only supported for local files: 
s3:///app.py
at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
at 
org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at 
org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 

Here are sample commands which would throw this error in Spark 2.0 (sparkApp.py 
requires app.py):

spark-submit --deploy-mode cluster --py-files s3:///app.py 
s3:///sparkApp.py (works fine in 1.6)

spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3:///app.py 
s3:///sparkApp1.py (not working in 1.6)

This would work fine if app.py is downloaded locally and specified.

This worked correctly in earlier versions of Spark when using the ‘--py-files’
option (though not when using the ‘spark.submit.pyFiles’ configuration option).
Now it does not work through either of the two.

The following link points to the comment stating that ‘non-local’ paths should
work in YARN cluster mode, and shows that a separate validation exists
specifically to fail when YARN client mode is used with remote paths:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309

Then, irrespective of whether we are using client or cluster mode, the
following code gets triggered while preparing the submit environment and
internally validates that the paths are local, failing on remote paths:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634

The above validation was not triggered in earlier versions of Spark when using
the ‘--py-files’ option, because the arguments passed to ‘--py-files’ were not
stored in the ‘spark.submit.pyFiles’ configuration for YARN. However, the
following code, newly added in 2.0, now stores them, so the validation gets
triggered even when files are specified through the ‘--py-files’ option:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545

Also, the logic in the YARN client was changed to read the values directly from
the ‘spark.submit.pyFiles’ configuration instead of from ‘--py-files’ (as it
did earlier):

https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543

So it is now broken whether we use ‘--py-files’ or ‘spark.submit.pyFiles’, as
the validation gets triggered in both cases, irrespective of whether we use
client or cluster mode with YARN.



