[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
JESSE CHEN updated SPARK-18458:
-------------------------------
    Description: 
Running a query on a 100TB parquet database using the Spark master dated 11/04 dumps core on the Spark executors.
The query is TPCDS query 82 (though this is not the only query that can produce this core dump; it is just the easiest one with which to re-create the error).
Spark output showing the exception:
{noformat}
16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_000074 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_e68_1478924651089_0018_01_000074
Exit code: 134
Exception message: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 134
{noformat}
According to the source code, exit code 134 is 128 + 6, and signal 6 is SIGABRT, the core/abort signal raised by abort(3); that signal is what terminated the executors.
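For reference, the "128 + signal" arithmetic above is the standard shell convention for a process terminated by a signal. A minimal Java sketch of that decoding (the class and method names are hypothetical, purely for illustration; this is not Spark or Hadoop code):
{noformat}
// Hypothetical helper, for illustration only: decodes a YARN container exit
// status using the shell convention "exit status = 128 + signal number".
public final class ContainerExitStatus {
    static int signalOf(int exitStatus) {
        // Statuses above 128 mean the process was terminated by a signal.
        return exitStatus > 128 ? exitStatus - 128 : 0;
    }

    public static void main(String[] args) {
        int status = 134;                      // value reported by YARN above
        System.out.println(signalOf(status));  // prints 6, i.e. SIGABRT
    }
}
{noformat}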
On the YARN side, the log is clearer:
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fffe29e6bac, pid=3694385, tid=140735430203136
#
# JRE version: OpenJDK Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
# Java VM: OpenJDK 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 10342% C2 org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V (386 bytes) @ 0x00007fffe29e6bac [0x00007fffe29e43c0+0x27ec]
#
# Core dump written. Default location: /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/core or core.3694385
#
# An error report file with more information is saved as:
# /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
{noformat}
And the hs_err_pid3694385.log shows the stack:
{noformat}
Stack: [0x00007fff85432000,0x00007fff85533000], sp=0x00007fff85530ce0, free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 3896 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x00007fffe1d3cdec [0x00007fffe1d3cde0+0xc]
j org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V+138
j org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArray(Lorg/apache/spark/unsafe/array/LongArray;IIIIZZ)I+209
j org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+56
j org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+62
j org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort()Lscala/collection/Iterator;+4
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+24
j org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
j org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext()Z+4
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;Lscala/collection/Iterator;Lscala/collection/Iterator;)Z+147
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V+552
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+17
J 3849 C1 org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z (30 bytes) @ 0x00007fffe1d5679c [0x00007fffe1d56520+0x27c]
j org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext()Z+4
j scala.collection.Iterator$$anon$11.hasNext()Z+4
j org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V+3
j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+222
j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+152
j org.apache.spark.executor.Executor$TaskRunner.run()V+423
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
V [libjvm.so+0x63d6ba]
V [libjvm.so+0x63ab74]
V [libjvm.so+0x63b189]
V [libjvm.so+0x67e6a1]
V [libjvm.so+0x9b3f5a]
V [libjvm.so+0x869722]
C [libpthread.so.0+0x7dc5] start_thread+0xc5
{noformat}
This is not easily reproducible on smaller data volumes, e.g., 1TB or 10TB, but it is easily reproducible on 100TB+, so look into data types that may not be big enough to handle hundreds of billions of rows.
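To make the last sentence concrete: the crashing frames are unsafe writes issued from RadixSort.sortKeyPrefixArrayAtByte through Platform.putLong, and each entry in the in-memory pointer array there is a 16-byte (pointer, key-prefix) pair. Below is a minimal sketch, with hypothetical class and constant names rather than Spark source, of the kind of 32-bit arithmetic that stops being "big enough" at this scale:
{noformat}
// Illustration only -- not Spark code. Shows how a byte offset held in an int
// wraps once a sort buffer contains hundreds of millions of 16-byte records
// (one 8-byte pointer plus one 8-byte key prefix per record).
public final class OffsetOverflowSketch {
    private static final int BYTES_PER_RECORD = 16;   // hypothetical constant

    public static void main(String[] args) {
        int numRecords = 200_000_000;   // plausible per-task record count at 100TB scale

        int intOffset = numRecords * BYTES_PER_RECORD;           // wraps to -1094967296
        long longOffset = (long) numRecords * BYTES_PER_RECORD;  // correct: 3200000000

        System.out.println("int offset:  " + intOffset);
        System.out.println("long offset: " + longOffset);
        // A wrapped (negative) offset fed into an unsafe write such as
        // Platform.putLong(base, badAddress, value) would land outside the
        // allocation and could SIGSEGV much like the hs_err stack above.
    }
}
{noformat}
If any record count, index, or byte offset on this sort path is held in an int, or multiplied in int arithmetic before being widened to long, it would wrap at roughly 134 million 16-byte records per task, so auditing those declarations is a cheap first check.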
  was:
Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to the official result set. This is at 1GB SF (validation run).
Spark SQL has the right answer but in the wrong order (and there is an 'order by' in the query).
Actual results:
{noformat}
store,9797,0.80000000000000000000,2,2]
[store,12641,0.81609195402298850575,3,3]
[store,6661,0.92207792207792207792,7,7]
[store,13013,0.94202898550724637681,8,8]
[store,9029,1.00000000000000000000,10,10]
[web,15597,0.66197183098591549296,3,3]
[store,14925,0.96470588235294117647,9,9]
[store,4063,1.00000000000000000000,10,10]
[catalog,8929,0.76250000000000000000,7,7]
[store,11589,0.82653061224489795918,6,6]
[store,1171,0.82417582417582417582,5,5]
[store,9471,0.77500000000000000000,1,1]
[catalog,12577,0.65591397849462365591,3,3]
[web,97,0.90361445783132530120,9,8]
[web,85,0.85714285714285714286,8,7]
[catalog,361,0.74647887323943661972,5,5]
[web,2915,0.69863013698630136986,4,4]
[web,117,0.92500000000000000000,10,9]
[catalog,9295,0.77894736842105263158,9,9]
[web,3305,0.73750000000000000000,6,16]
[catalog,16215,0.79069767441860465116,10,10]
[web,7539,0.59000000000000000000,1,1]
[catalog,17543,0.57142857142857142857,1,1]
[catalog,3411,0.71641791044776119403,4,4]
[web,11933,0.71717171717171717172,5,5]
[catalog,14513,0.63541666666666666667,2,2]
[store,15839,0.81632653061224489796,4,4]
[web,3337,0.62650602409638554217,2,2]
[web,5299,0.92708333333333333333,11,10]
[catalog,8189,0.74698795180722891566,6,6]
[catalog,14869,0.77173913043478260870,8,8]
[web,483,0.80000000000000000000,7,6]
{noformat}
Expected results:
{noformat}
+---------+-------+--------------------+-------------+---------------+
| CHANNEL | ITEM  | RETURN_RATIO       | RETURN_RANK | CURRENCY_RANK |
+---------+-------+--------------------+-------------+---------------+
| catalog | 17543 |  .5714285714285714 |           1 |             1 |
| catalog | 14513 |  .6354166666666666 |           2 |             2 |
| catalog | 12577 |  .6559139784946236 |           3 |             3 |
| catalog |  3411 |  .7164179104477611 |           4 |             4 |
| catalog |   361 |  .7464788732394366 |           5 |             5 |
| catalog |  8189 |  .7469879518072289 |           6 |             6 |
| catalog |  8929 |  .7625000000000000 |           7 |             7 |
| catalog | 14869 |  .7717391304347826 |           8 |             8 |
| catalog |  9295 |  .7789473684210526 |           9 |             9 |
| catalog | 16215 |  .7906976744186046 |          10 |            10 |
| store   |  9471 |  .7750000000000000 |           1 |             1 |
| store   |  9797 |  .8000000000000000 |           2 |             2 |
| store   | 12641 |  .8160919540229885 |           3 |             3 |
| store   | 15839 |  .8163265306122448 |           4 |             4 |
| store   |  1171 |  .8241758241758241 |           5 |             5 |
| store   | 11589 |  .8265306122448979 |           6 |             6 |
| store   |  6661 |  .9220779220779220 |           7 |             7 |
| store   | 13013 |  .9420289855072463 |           8 |             8 |
| store   | 14925 |  .9647058823529411 |           9 |             9 |
| store   |  4063 | 1.0000000000000000 |          10 |            10 |
| store   |  9029 | 1.0000000000000000 |          10 |            10 |
| web     |  7539 |  .5900000000000000 |           1 |             1 |
| web     |  3337 |  .6265060240963855 |           2 |             2 |
| web     | 15597 |  .6619718309859154 |           3 |             3 |
| web     |  2915 |  .6986301369863013 |           4 |             4 |
| web     | 11933 |  .7171717171717171 |           5 |             5 |
| web     |  3305 |  .7375000000000000 |           6 |            16 |
| web     |   483 |  .8000000000000000 |           7 |             6 |
| web     |    85 |  .8571428571428571 |           8 |             7 |
| web     |    97 |  .9036144578313253 |           9 |             8 |
| web     |   117 |  .9250000000000000 |          10 |             9 |
| web     |  5299 |  .9270833333333333 |          11 |            10 |
+---------+-------+--------------------+-------------+---------------+
{noformat}
Query used:
{noformat}
-- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION
select 'web' as channel
       ,web.item
       ,web.return_ratio
       ,web.return_rank
       ,web.currency_rank
from (
      select item
             ,return_ratio
             ,currency_ratio
             ,rank() over (order by return_ratio) as return_rank
             ,rank() over (order by currency_ratio) as currency_rank
      from (
            select ws.ws_item_sk as item
                   ,(cast(sum(coalesce(wr.wr_return_quantity,0)) as decimal(15,4))/
                     cast(sum(coalesce(ws.ws_quantity,0)) as decimal(15,4) )) as return_ratio
                   ,(cast(sum(coalesce(wr.wr_return_amt,0)) as decimal(15,4))/
                     cast(sum(coalesce(ws.ws_net_paid,0)) as decimal(15,4) )) as currency_ratio
            from web_sales ws
                 left outer join web_returns wr
                   on (ws.ws_order_number = wr.wr_order_number and ws.ws_item_sk = wr.wr_item_sk)
                 ,date_dim
            where wr.wr_return_amt > 10000
              and ws.ws_net_profit > 1
              and ws.ws_net_paid > 0
              and ws.ws_quantity > 0
              and ws_sold_date_sk = d_date_sk
              and d_year = 2001
              and d_moy = 12
            group by ws.ws_item_sk
           ) in_web
     ) web
where ( web.return_rank <= 10 or web.currency_rank <= 10 )
union
select 'catalog' as channel
       ,catalog.item
       ,catalog.return_ratio
       ,catalog.return_rank
       ,catalog.currency_rank
from (
      select item
             ,return_ratio
             ,currency_ratio
             ,rank() over (order by return_ratio) as return_rank
             ,rank() over (order by currency_ratio) as currency_rank
      from (
            select cs.cs_item_sk as item
                   ,(cast(sum(coalesce(cr.cr_return_quantity,0)) as decimal(15,4))/
                     cast(sum(coalesce(cs.cs_quantity,0)) as decimal(15,4) )) as return_ratio
                   ,(cast(sum(coalesce(cr.cr_return_amount,0)) as decimal(15,4))/
                     cast(sum(coalesce(cs.cs_net_paid,0)) as decimal(15,4) )) as currency_ratio
            from catalog_sales cs
                 left outer join catalog_returns cr
                   on (cs.cs_order_number = cr.cr_order_number and cs.cs_item_sk = cr.cr_item_sk)
                 ,date_dim
            where cr.cr_return_amount > 10000
              and cs.cs_net_profit > 1
              and cs.cs_net_paid > 0
              and cs.cs_quantity > 0
              and cs_sold_date_sk = d_date_sk
              and d_year = 2001
              and d_moy = 12
            group by cs.cs_item_sk
           ) in_cat
     ) catalog
where ( catalog.return_rank <= 10 or catalog.currency_rank <=10 )
union
select 'store' as channel
       ,store.item
       ,store.return_ratio
       ,store.return_rank
       ,store.currency_rank
from (
      select item
             ,return_ratio
             ,currency_ratio
             ,rank() over (order by return_ratio) as return_rank
             ,rank() over (order by currency_ratio) as currency_rank
      from (
            select sts.ss_item_sk as item
                   ,(cast(sum(coalesce(sr.sr_return_quantity,0)) as decimal(15,4))/cast(sum(coalesce(sts.ss_quantity,0)) as decimal(15,4) )) as return_ratio
                   ,(cast(sum(coalesce(sr.sr_return_amt,0)) as decimal(15,4))/cast(sum(coalesce(sts.ss_net_paid,0)) as decimal(15,4) )) as currency_ratio
            from store_sales sts
                 left outer join store_returns sr
                   on (sts.ss_ticket_number = sr.sr_ticket_number and sts.ss_item_sk = sr.sr_item_sk)
                 ,date_dim
            where sr.sr_return_amt > 10000
              and sts.ss_net_profit > 1
              and sts.ss_net_paid > 0
              and sts.ss_quantity > 0
              and ss_sold_date_sk = d_date_sk
              and d_year = 2001
              and d_moy = 12
            group by sts.ss_item_sk
           ) in_store
     ) store
where ( store.return_rank <= 10 or store.currency_rank <= 10 )
order by 1,4,5
limit 100;
-- end query 49 in stream 0 using template query49.tpl
{noformat}

> core dumped running Spark SQL on large data volume (100TB)
> ----------------------------------------------------------
>
>                 Key: SPARK-18458
>                 URL: https://issues.apache.org/jira/browse/SPARK-18458
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: JESSE CHEN
>              Labels: core, dump
>             Fix For: 2.0.0
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org