rajatma1993 opened a new issue, #9845: URL: https://github.com/apache/incubator-gluten/issues/9845
### Backend

VL (Velox)

### Bug description

Hi team, I am benchmarking Gluten + Velox with the TPC-DS benchmark, using HDFS as storage. For some of the queries I hit a core dump like the one below:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000fffe24e11500, pid=115, tid=138
#
# JRE version: OpenJDK Runtime Environment (17.0.13+12) (build 17.0.13+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM (17.0.13+12-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  0x0000fffe24e11500
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /tmp/hsperfdata_spark/core.115)
#
# An error report file with more information is saved as:
# /tmp/hsperfdata_spark/hs_err_pid115.log
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
#
/opt/bitnami/spark/bin/spark-shell: line 47:   115 Aborted (core dumped) "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
```

I am using a scale factor of 1000 and a single-node Spark cluster for this experiment. Could I get some help on this?
### Gluten version

Gluten-1.3

### Spark version

Spark-3.5.x

### Spark configurations

```
--deploy-mode client
--executor-cores 124
--executor-memory 112g
--num-executors 1
--driver-memory 16g
--conf spark.driver.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar
--conf spark.executor.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar
--conf spark.default.parallelism=200
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
--conf spark.sql.adaptive.enabled=false
--conf spark.hadoop.input.connect.timeout=1000
--conf spark.hadoop.input.read.timeout=1000
--conf spark.hadoop.input.write.timeout=1000
--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=784g
--conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true
--conf spark.plugins=org.apache.gluten.GlutenPlugin
--conf spark.hadoop.fs.defaultFS=hdfs://192.168.2.71:8020
--conf spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true
--conf spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true
--conf spark.executor.memoryOverhead=60g
--conf spark.sql.orc.columnarReaderBatchSize=10240
--conf spark.sql.broadcastTimeout=4800
--conf spark.driver.maxResultSize=4g
--conf spark.sql.shuffle.partitions=200
```

### System information

_No response_

### Relevant logs

```bash
NUMA_CORES: 0-123
start sf1000 at 2025/05/16 09:16:17
num_executors=1 executor_cores=124 partitions=200
offheap_mem: 784g
Backend : velox
onheap_mem: 112g
memory_overhead: 60g
+ cat /tools/run_tpcds/tpc.scala
+ spark-shell --name 20250516_091543 --master spark://192.168.2.39:31077 --deploy-mode client --executor-cores 124 --executor-memory 112g --num-executors 1 --driver-memory 16g --conf spark.driver.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar --conf spark.executor.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.sql.adaptive.enabled=false --conf spark.hadoop.input.connect.timeout=1000 --conf spark.hadoop.input.read.timeout=1000 --conf spark.hadoop.input.write.timeout=1000 --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=784g --conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true --conf spark.plugins=org.apache.gluten.GlutenPlugin --conf spark.hadoop.fs.defaultFS=hdfs://192.168.2.71:8020 --conf spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true --conf spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true --conf spark.executor.memoryOverhead=60g --conf spark.sql.orc.columnarReaderBatchSize=10240 --conf spark.sql.broadcastTimeout=4800 --conf spark.driver.maxResultSize=4g --conf spark.sql.shuffle.partitions=200
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/16 09:16:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/16 09:16:25 WARN VeloxListenerApi: Memory overhead is set to 64424509440 which is smaller than the recommended size 252544077004. This may cause OOM.
W20250516 09:16:26.767813 138 MemoryArbitrator.cpp:84] Query memory capacity[45.00GB] is set for NOOP arbitrator which has no capacity enforcement
25/05/16 09:16:27 WARN SparkShimProvider: Spark runtime version 3.5.3 is not matched with Gluten's fully tested version 3.5.2
Spark context Web UI available at http://soft-spark-velox-kusanagi-master-0.soft-spark-velox-kusanagi-headless.default.svc.cluster.local:4040
Spark context available as 'sc' (master = spark://192.168.2.39:31077, app id = app-20250516091626-0000).
Spark session available as 'spark'.
Welcome to Spark version 3.5.3. Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.13)
Type in expressions to have them evaluated. Type :help for more information.

scala> import org.apache.spark.sql.execution.debug._
scala> import scala.io.Source
scala> import java.io.File
scala> import java.util.Arrays
scala> import sys.process._
scala> import scala.util.{Try, Success, Failure, Random}
scala> import java.time.LocalDateTime
scala> tpc: String = tpcds
scala> file_ext: String = orc
scala> queries_path: String = /tools/run_tpcds/queries.gluten/
scala> target_query: String = q5.sql
scala> random_seed: Long = 0
scala> file_scheme: String = hdfs
scala> orc_file_root: String = hdfs://192.168.2.71:8020
scala> data_file_path: String = /spark-tpcds-data/sf1000
scala> sync: Boolean = false
scala> time: [R](block: => R)R

[Progress bars elided: Stages 0-6 (between 1824 and 2184 tasks each, ~124 running concurrently) all ran to completion before the crash]

25/05/16 09:17:02 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
scala> shuffleList: [A](list: List[A], seed: Long)List[A]
scala> getListOfFiles: (dir: String)List[java.io.File]
scala> fileLists: List[java.io.File] = List(/tools/run_tpcds/queries.gluten/q5.sql)
scala> sorted: List[java.io.File] = List(/tools/run_tpcds/queries.gluten/q5.sql)
scala> queries: List[java.io.File] = List(/tools/run_tpcds/queries.gluten/q5.sql)
scala> sync disabled
scala> starttime: Long = 6125152060595533

/tools/run_tpcds/queries.gluten/q5.sql 2025-05-16T09:17:05.252946492
with ssr as (
  select s_store_id, sum(sales_price) as sales, sum(profit) as profit,
         sum(return_amt) as returns, sum(net_loss) as profit_loss
  from (
    select ss_store_sk as store_sk, ss_sold_date_sk as date_sk,
           ss_ext_sales_price as sales_price, ss_net_profit as profit,
           cast(0 as decimal(7,2)) as return_amt, cast(0 as decimal(7,2)) as net_loss
    from store_sales
    union all
    select sr_store_sk as store_sk, sr_returned_date_sk as date_sk,
           cast(0 as decimal(7,2)) as sales_price, cast(0 as decimal(7,2)) as profit,
           sr_return_amt as return_amt, sr_net_loss as net_loss
    from store_returns
  ) salesreturns, date_dim, store
  where date_sk = d_date_sk
    and d_date between cast('1998-08-04' as date) and (cast('1998-08-04' as date) + interval '14' day)
    and store_sk = s_store_sk
  group by s_store_id),
csr as (
  select cp_catalog_page_id, sum(sales_price) as sales, sum(profit) as profit,
         sum(return_amt) as returns, sum(net_loss) as profit_loss
  from (
    select cs_catalog_page_sk as page_sk, cs_sold_date_sk as date_sk,
           cs_ext_sales_price as sales_price, cs_net_profit as profit,
           cast(0 as decimal(7,2)) as return_amt, cast(0 as decimal(7,2)) as net_loss
    from catalog_sales
    union all
    select cr_catalog_page_sk as page_sk, cr_returned_date_sk as date_sk,
           cast(0 as decimal(7,2)) as sales_price, cast(0 as decimal(7,2)) as profit,
           cr_return_amount as return_amt, cr_net_loss as net_loss
    from catalog_returns
  ) salesreturns, date_dim, catalog_page
  where date_sk = d_date_sk
    and d_date between cast('1998-08-04' as date) and (cast('1998-08-04' as date) + interval '14' day)
    and page_sk = cp_catalog_page_sk
  group by cp_catalog_page_id),
wsr as (
  select web_site_id, sum(sales_price) as sales, sum(profit) as profit,
         sum(return_amt) as returns, sum(net_loss) as profit_loss
  from (
    select ws_web_site_sk as wsr_web_site_sk, ws_sold_date_sk as date_sk,
           ws_ext_sales_price as sales_price, ws_net_profit as profit,
           cast(0 as decimal(7,2)) as return_amt, cast(0 as decimal(7,2)) as net_loss
    from web_sales
    union all
    select ws_web_site_sk as wsr_web_site_sk, wr_returned_date_sk as date_sk,
           cast(0 as decimal(7,2)) as sales_price, cast(0 as decimal(7,2)) as profit,
           wr_return_amt as return_amt, wr_net_loss as net_loss
    from web_returns left outer join web_sales on
      (wr_item_sk = ws_item_sk and wr_order_number = ws_order_number)
  ) salesreturns, date_dim, web_site
  where date_sk = d_date_sk
    and d_date between cast('1998-08-04' as date) and (cast('1998-08-04' as date) + interval '14' day)
    and wsr_web_site_sk = web_site_sk
  group by web_site_id)
select channel, id, sum(sales) as sales, sum(returns) as returns, sum(profit) as profit
from (select 'store channel' as channel, 'store' || s_store_id as id,
             sales, returns, (profit - profit_loss) as profit from ssr
      union all
      select 'catalog channel' as channel, 'catalog_page' || cp_catalog_page_id as id,
             sales, returns, (profit - profit_loss) as profit from csr
      union all
      select 'web channel' as channel, 'web_site' || web_site_id as id,
             sales, returns, (profit - profit_loss) as profit from wsr) x
group by rollup (channel, id)
order by channel, id
LIMIT 100;
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000fffe24e11500, pid=115, tid=138
#
# JRE version: OpenJDK Runtime Environment (17.0.13+12) (build 17.0.13+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM (17.0.13+12-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  0x0000fffe24e11500
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /tmp/hsperfdata_spark/core.115)
#
# An error report file with more information is saved as:
# /tmp/hsperfdata_spark/hs_err_pid115.log
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
#
/opt/bitnami/spark/bin/spark-shell: line 47:   115 Aborted (core dumped) "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 'end sf1000 at 2025/05/16 09:17:13'
end sf1000 at 2025/05/16 09:17:13
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
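A side note on the `VeloxListenerApi` warning in the log: the configured `spark.executor.memoryOverhead` is far below what Gluten recommends for this executor layout. A minimal shell sketch converting the two byte counts quoted verbatim from the warning into GiB, to make the gap visible:

```shell
# Both sizes are reported by the warning in bytes:
#   64424509440  = the configured spark.executor.memoryOverhead (60g)
#   252544077004 = the size the warning recommends
echo "configured:  $((64424509440 / 1024 / 1024 / 1024)) GiB"
echo "recommended: $((252544077004 / 1024 / 1024 / 1024)) GiB (approx.)"
```

So the overhead is roughly a quarter of the recommended ~235 GiB, which the warning itself flags as a possible OOM source (though an OOM kill would normally show as SIGKILL rather than this SIGSEGV).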
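Since the report enables `spark.gluten.sql.columnar.backend.velox.orc.scan.enabled`, one triage step (a sketch, not a fix) is to re-run the failing query with that offload turned off so the ORC scan falls back to vanilla Spark; if the SIGSEGV disappears, the native ORC scan path is the likely suspect. The `...` below is a placeholder, not literal:

```shell
# Same invocation as in the report, with only the Velox ORC scan offload flipped.
spark-shell \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=false \
  ...   # all remaining --conf options unchanged from "Spark configurations" above
```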
