[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-18458.
---------------------------------
    Resolution: Fixed
      Assignee: Kazuaki Ishizaki
 Fix Version/s: 2.1.0

> core dumped running Spark SQL on large data volume (100TB)
> ----------------------------------------------------------
>
>                 Key: SPARK-18458
>                 URL: https://issues.apache.org/jira/browse/SPARK-18458
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: JESSE CHEN
>            Assignee: Kazuaki Ishizaki
>            Priority: Critical
>              Labels: core, dump
>             Fix For: 2.1.0
>
>
> Running a query on a 100TB Parquet database using the Spark master dated 11/04 dumps core on Spark executors.
> The query is TPC-DS query 82 (it is not the only query that can produce this core dump, just the easiest one for re-creating the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_000074 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch.
> Container id: container_e68_1478924651089_0018_01_000074
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core dumped)
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp
> '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855'
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074
> -XX:OnOutOfMemoryError='kill %p'
> org.apache.spark.executor.CoarseGrainedExecutorBackend
> --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855
> --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2
> --app-id application_1478924651089_0018
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar
> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout
> 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted (core dumped)
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp
> '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855'
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074
> -XX:OnOutOfMemoryError='kill %p'
> org.apache.spark.executor.CoarseGrainedExecutorBackend
> --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855
> --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2
> --app-id application_1478924651089_0018
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar
> --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar
> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout
> 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>     at org.apache.hadoop.util.Shell.run(Shell.java:456)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 134
> {noformat}
> According to the source code, exit code 134 is 128+6, and signal 6 is SIGABRT:
> SIGABRT       6       Core    Abort signal from abort(3)
> That signal killed the executors.
> On the YARN side, the log is clearer:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007fffe29e6bac, pid=3694385, tid=140735430203136
> #
> # JRE version: OpenJDK Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
> # Java VM: OpenJDK 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # J 10342% C2 org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V (386 bytes) @ 0x00007fffe29e6bac [0x00007fffe29e43c0+0x27ec]
> #
> # Core dump written. Default location: /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/core or core.3694385
> #
> # An error report file with more information is saved as:
> # /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> #
> {noformat}
> And the hs_err_pid3694385.log shows the stack:
> {noformat}
> Stack: [0x00007fff85432000,0x00007fff85533000], sp=0x00007fff85530ce0, free space=1019k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 3896 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x00007fffe1d3cdec [0x00007fffe1d3cde0+0xc]
> j org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V+138
> j org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArray(Lorg/apache/spark/unsafe/array/LongArray;IIIIZZ)I+209
> j org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+56
> j org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+62
> j org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort()Lscala/collection/Iterator;+4
> j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+24
> j org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
> j org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext()Z+4
> j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;Lscala/collection/Iterator;Lscala/collection/Iterator;)Z+147
> j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V+552
> j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+17
> J 3849 C1 org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z (30 bytes) @ 0x00007fffe1d5679c [0x00007fffe1d56520+0x27c]
> j org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext()Z+4
> j scala.collection.Iterator$$anon$11.hasNext()Z+4
> j org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V+3
> j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+222
> j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
> j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+152
> j org.apache.spark.executor.Executor$TaskRunner.run()V+423
> j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
> j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j java.lang.Thread.run()V+11
> v ~StubRoutines::call_stub
> V [libjvm.so+0x63d6ba]
> V [libjvm.so+0x63ab74]
> V [libjvm.so+0x63b189]
> V [libjvm.so+0x67e6a1]
> V [libjvm.so+0x9b3f5a]
> V [libjvm.so+0x869722]
> C [libpthread.so.0+0x7dc5] start_thread+0xc5
> {noformat}
> This is not easily reproducible on smaller data volumes (e.g., 1TB or 10TB) but is easily reproducible on 100TB+, so look into data types that may not be big enough to handle hundreds of billions.
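
The report's closing remark about data types that may be too small at this scale can be illustrated with a minimal Java sketch. It assumes, purely for illustration, that a byte offset is computed from a 32-bit element index and an 8-byte element width. The class and method names (`OffsetOverflowSketch`, `offsetWithIntMath`, `offsetWithLongMath`) are hypothetical and are not taken from Spark's RadixSort. The sketch shows how such an expression silently wraps once the index exceeds 2^28, which would be consistent with a failure that only shows up on very large inputs; it is not presented here as the confirmed cause of this crash.

```java
// Hypothetical illustration only: this is not Spark code.
final class OffsetOverflowSketch {
    // Assumed element width in bytes (a Java long occupies 8 bytes).
    static final int BYTES_PER_LONG = 8;

    // The multiplication runs in 32-bit int arithmetic and is widened to long
    // only afterwards, so it wraps once index * 8 exceeds Integer.MAX_VALUE.
    static long offsetWithIntMath(long base, int index) {
        return base + index * BYTES_PER_LONG;
    }

    // Widening the index first forces the multiplication into 64-bit arithmetic.
    static long offsetWithLongMath(long base, int index) {
        return base + (long) index * BYTES_PER_LONG;
    }

    public static void main(String[] args) {
        int index = 300_000_000; // larger than 2^28 = 268,435,456
        System.out.println("int math:  " + offsetWithIntMath(0L, index));
        System.out.println("long math: " + offsetWithLongMath(0L, index));
    }
}
```

With the assumed index of 300,000,000, the int version yields a negative offset while the long version yields the intended one. A negative or wrapped offset handed to an unchecked memory write is the kind of input that could end in a SIGSEGV rather than a Java exception.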