rajatma1993 opened a new issue, #9845: URL: https://github.com/apache/incubator-gluten/issues/9845
### Backend

VL (Velox)

### Bug description

Hi team, I am benchmarking Gluten + Velox with the TPC-DS benchmark, using HDFS as storage. For some of the queries I hit a core dump like the one below:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000fffe24e11500, pid=115, tid=138
#
# JRE version: OpenJDK Runtime Environment (17.0.13+12) (build 17.0.13+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM (17.0.13+12-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  0x0000fffe24e11500
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /tmp/hsperfdata_spark/core.115)
#
# An error report file with more information is saved as:
# /tmp/hsperfdata_spark/hs_err_pid115.log
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
#
/opt/bitnami/spark/bin/spark-shell: line 47:   115 Aborted (core dumped) "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
```

I am using a scale factor of 1000 and a single-node Spark cluster for this experiment. Could I get some help on this?
### Gluten version

Gluten-1.3

### Spark version

Spark-3.5.x

### Spark configurations

```
--deploy-mode client
--executor-cores 124
--executor-memory 112g
--num-executors 1
--driver-memory 16g
--conf spark.driver.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar
--conf spark.executor.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar
--conf spark.default.parallelism=200
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
--conf spark.sql.adaptive.enabled=false
--conf spark.hadoop.input.connect.timeout=1000
--conf spark.hadoop.input.read.timeout=1000
--conf spark.hadoop.input.write.timeout=1000
--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=784g
--conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true
--conf spark.plugins=org.apache.gluten.GlutenPlugin
--conf spark.hadoop.fs.defaultFS=hdfs://192.168.2.71:8020
--conf spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true
--conf spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true
--conf spark.executor.memoryOverhead=60g
--conf spark.sql.orc.columnarReaderBatchSize=10240
--conf spark.sql.broadcastTimeout=4800
--conf spark.driver.maxResultSize=4g
--conf spark.sql.shuffle.partitions=200
```

### System information

_No response_

### Relevant logs

```bash
NUMA_CORES: 0-123
start sf1000 at 2025/05/16 09:16:17
num_executors=1 executor_cores=124 partitions=200
offheap_mem: 784g
Backend : velox
onheap_mem: 112g
memory_overhead: 60g
+ cat /tools/run_tpcds/tpc.scala
+ spark-shell --name 20250516_091543 --master spark://192.168.2.39:31077 --deploy-mode client --executor-cores 124 --executor-memory 112g --num-executors 1 --driver-memory 16g --conf spark.driver.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar --conf spark.executor.extraClassPath=/opt/gluten/package/target/gluten-velox-bundle-spark3.5_2.12-debian_12_aarch_64-1.3.0.jar --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.sql.adaptive.enabled=false --conf spark.hadoop.input.connect.timeout=1000 --conf spark.hadoop.input.read.timeout=1000 --conf spark.hadoop.input.write.timeout=1000 --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=784g --conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true --conf spark.plugins=org.apache.gluten.GlutenPlugin --conf spark.hadoop.fs.defaultFS=hdfs://192.168.2.71:8020 --conf spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true --conf spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true --conf spark.executor.memoryOverhead=60g --conf spark.sql.orc.columnarReaderBatchSize=10240 --conf spark.sql.broadcastTimeout=4800 --conf spark.driver.maxResultSize=4g --conf spark.sql.shuffle.partitions=200
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/16 09:16:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/16 09:16:25 WARN VeloxListenerApi: Memory overhead is set to 64424509440 which is smaller than the recommended size 252544077004. This may cause OOM.
W20250516 09:16:26.767813 138 MemoryArbitrator.cpp:84] Query memory capacity[45.00GB] is set for NOOP arbitrator which has no capacity enforcement
25/05/16 09:16:27 WARN SparkShimProvider: Spark runtime version 3.5.3 is not matched with Gluten's fully tested version 3.5.2
Spark context Web UI available at http://soft-spark-velox-kusanagi-master-0.soft-spark-velox-kusanagi-headless.default.svc.cluster.local:4040
Spark context available as 'sc' (master = spark://192.168.2.39:31077, app id = app-20250516091626-0000).
Spark session available as 'spark'.
Welcome to Spark version 3.5.3. Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.13)
Type in expressions to have them evaluated. Type :help for more information.

scala> import org.apache.spark.sql.execution.debug._
scala> import scala.io.Source
scala> import java.io.File
scala> import java.util.Arrays
scala> import sys.process._
scala> import scala.util.{Try, Success, Failure, Random}
scala> import java.time.LocalDateTime
scala> tpc: String = tpcds
scala> file_ext: String = orc
scala> queries_path: String = /tools/run_tpcds/queries.gluten/
scala> target_query: String = q5.sql
scala> random_seed: Long = 0
scala> file_scheme: String = hdfs
scala> orc_file_root: String = hdfs://192.168.2.71:8020
scala> data_file_path: String = /spark-tpcds-data/sf1000
scala> sync: Boolean = false
scala> time: [R](block: => R)R

[Progress bars elided: Stages 0-6 (between 1824 and 2184 tasks each, ~124 running concurrently) all ran to completion before the crash]

25/05/16 09:17:02 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
scala> shuffleList: [A](list: List[A], seed: Long)List[A]
scala> getListOfFiles: (dir: String)List[java.io.File]
scala> fileLists: List[java.io.File] = List(/tools/run_tpcds/queries.gluten/q5.sql)
scala> sorted: List[java.io.File] = List(/tools/run_tpcds/queries.gluten/q5.sql)
scala> queries: List[java.io.File] = List(/tools/run_tpcds/queries.gluten/q5.sql)
scala> sync disabled
scala> starttime: Long = 6125152060595533

/tools/run_tpcds/queries.gluten/q5.sql 2025-05-16T09:17:05.252946492
with ssr as (
  select s_store_id, sum(sales_price) as sales, sum(profit) as profit,
         sum(return_amt) as returns, sum(net_loss) as profit_loss
  from (
    select ss_store_sk as store_sk, ss_sold_date_sk as date_sk,
           ss_ext_sales_price as sales_price, ss_net_profit as profit,
           cast(0 as decimal(7,2)) as return_amt, cast(0 as decimal(7,2)) as net_loss
    from store_sales
    union all
    select sr_store_sk as store_sk, sr_returned_date_sk as date_sk,
           cast(0 as decimal(7,2)) as sales_price, cast(0 as decimal(7,2)) as profit,
           sr_return_amt as return_amt, sr_net_loss as net_loss
    from store_returns
  ) salesreturns, date_dim, store
  where date_sk = d_date_sk
    and d_date between cast('1998-08-04' as date) and (cast('1998-08-04' as date) + interval '14' day)
    and store_sk = s_store_sk
  group by s_store_id),
csr as (
  select cp_catalog_page_id, sum(sales_price) as sales, sum(profit) as profit,
         sum(return_amt) as returns, sum(net_loss) as profit_loss
  from (
    select cs_catalog_page_sk as page_sk, cs_sold_date_sk as date_sk,
           cs_ext_sales_price as sales_price, cs_net_profit as profit,
           cast(0 as decimal(7,2)) as return_amt, cast(0 as decimal(7,2)) as net_loss
    from catalog_sales
    union all
    select cr_catalog_page_sk as page_sk, cr_returned_date_sk as date_sk,
           cast(0 as decimal(7,2)) as sales_price, cast(0 as decimal(7,2)) as profit,
           cr_return_amount as return_amt, cr_net_loss as net_loss
    from catalog_returns
  ) salesreturns, date_dim, catalog_page
  where date_sk = d_date_sk
    and d_date between cast('1998-08-04' as date) and (cast('1998-08-04' as date) + interval '14' day)
    and page_sk = cp_catalog_page_sk
  group by cp_catalog_page_id),
wsr as (
  select web_site_id, sum(sales_price) as sales, sum(profit) as profit,
         sum(return_amt) as returns, sum(net_loss) as profit_loss
  from (
    select ws_web_site_sk as wsr_web_site_sk, ws_sold_date_sk as date_sk,
           ws_ext_sales_price as sales_price, ws_net_profit as profit,
           cast(0 as decimal(7,2)) as return_amt, cast(0 as decimal(7,2)) as net_loss
    from web_sales
    union all
    select ws_web_site_sk as wsr_web_site_sk, wr_returned_date_sk as date_sk,
           cast(0 as decimal(7,2)) as sales_price, cast(0 as decimal(7,2)) as profit,
           wr_return_amt as return_amt, wr_net_loss as net_loss
    from web_returns left outer join web_sales on
      (wr_item_sk = ws_item_sk and wr_order_number = ws_order_number)
  ) salesreturns, date_dim, web_site
  where date_sk = d_date_sk
    and d_date between cast('1998-08-04' as date) and (cast('1998-08-04' as date) + interval '14' day)
    and wsr_web_site_sk = web_site_sk
  group by web_site_id)
select channel, id, sum(sales) as sales, sum(returns) as returns, sum(profit) as profit
from (select 'store channel' as channel, 'store' || s_store_id as id,
             sales, returns, (profit - profit_loss) as profit from ssr
      union all
      select 'catalog channel' as channel, 'catalog_page' || cp_catalog_page_id as id,
             sales, returns, (profit - profit_loss) as profit from csr
      union all
      select 'web channel' as channel, 'web_site' || web_site_id as id,
             sales, returns, (profit - profit_loss) as profit from wsr) x
group by rollup (channel, id)
order by channel, id
LIMIT 100;
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000fffe24e11500, pid=115, tid=138
#
# JRE version: OpenJDK Runtime Environment (17.0.13+12) (build 17.0.13+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM (17.0.13+12-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  0x0000fffe24e11500
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /tmp/hsperfdata_spark/core.115)
#
# An error report file with more information is saved as:
# /tmp/hsperfdata_spark/hs_err_pid115.log
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
#
/opt/bitnami/spark/bin/spark-shell: line 47:   115 Aborted (core dumped) "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 'end sf1000 at 2025/05/16 09:17:13'
end sf1000 at 2025/05/16 09:17:13
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
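A side note on the `VeloxListenerApi` warning in the log: the configured `spark.executor.memoryOverhead` is far below what Gluten recommends for this executor layout. A minimal shell sketch converting the two byte counts quoted verbatim from the warning into GiB, to make the gap visible:

```shell
# Both sizes are reported by the warning in bytes:
#   64424509440  = the configured spark.executor.memoryOverhead (60g)
#   252544077004 = the size the warning recommends
echo "configured:  $((64424509440 / 1024 / 1024 / 1024)) GiB"
echo "recommended: $((252544077004 / 1024 / 1024 / 1024)) GiB (approx.)"
```

So the overhead is roughly a quarter of the recommended ~235 GiB, which the warning itself flags as a possible OOM source (though an OOM kill would normally show as SIGKILL rather than this SIGSEGV).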
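Since the report enables `spark.gluten.sql.columnar.backend.velox.orc.scan.enabled`, one triage step (a sketch, not a fix) is to re-run the failing query with that offload turned off so the ORC scan falls back to vanilla Spark; if the SIGSEGV disappears, the native ORC scan path is the likely suspect. The `...` below is a placeholder, not literal:

```shell
# Same invocation as in the report, with only the Velox ORC scan offload flipped.
spark-shell \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=false \
  ...   # all remaining --conf options unchanged from "Spark configurations" above
```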
