Kontinuation opened a new issue, #884:
URL: https://github.com/apache/datafusion-comet/issues/884

   ### Describe the bug
   
   I've built DataFusion Comet using commit 
https://github.com/apache/datafusion-comet/commit/f7f0bb1ed68367b8d3e1c88010c1f943f480ea11
 for Spark 3.5.1. I found that the memory usage keeps increasing when 
repeatedly running the [TPC-H benchmark 
script](https://github.com/apache/datafusion-benchmarks/blob/main/runners/datafusion-comet/tpcbench.py)
 on a set of parquet files. The parquet files were generated using 
https://github.com/databricks/spark-sql-perf with scale factor = 10. The memory 
usage could reach 20GB, which seems excessive given the Spark and Comet 
configurations I'm using to run the benchmarks (see **Additional context**).
   
   
![image](https://github.com/user-attachments/assets/2adfb671-d674-4753-8bcc-cbd272e15da0)
   
   
   I've noticed, using `jcmd <pid> VM.native_memory detail.diff | grep Unsafe -A 2`, 
that the native memory allocated by `Unsafe_AllocateMemory0` keeps increasing. 
I'm not enabling off-heap memory, so the allocations should be initiated by the 
Arrow `RootAllocator`:
   
   Initially after setting the baseline:
   
   ```
   [0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
   [0x000000011a0523b4]
                                (malloc=870721KB type=Other +621478KB #6842866 
+4937676)
   --
   [0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
   [0x0000000119017be0]
                                (malloc=8463KB type=Other -469KB #221 -3)
   ```
   
   After 10 minutes:
   
   ```
   [0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
   [0x000000011a0523b4]
                                (malloc=4349265KB type=Other +4100021KB 
#34671096 +32765906)
   --
   [0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
   [0x0000000119017be0]
                                (malloc=8449KB type=Other -483KB #217 -7)
   ```
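
To put numbers on the two snapshots above, here is a minimal Python sketch that parses NMT `malloc=` summary lines and reports the growth at the dominant call site. The regular expression is an assumption based only on the line shape shown in this report, and the names `MallocSite` and `parse_malloc_sites` are hypothetical helpers, not part of any JDK tool.

```python
import re
from typing import List, NamedTuple


class MallocSite(NamedTuple):
    malloc_kb: int     # memory currently attributed to the call site, in KB
    delta_kb: int      # change in KB relative to the NMT baseline
    count: int         # allocation count reported after '#'
    delta_count: int   # change in allocation count relative to the baseline


# Assumed line shape: (malloc=870721KB type=Other +621478KB #6842866 +4937676)
# \s+ also matches the line break that email wrapping inserted before '#'.
_MALLOC_LINE = re.compile(
    r"\(malloc=(\d+)KB\s+type=\w+\s+([+-]\d+)KB\s+#(\d+)\s+([+-]\d+)\)"
)


def parse_malloc_sites(nmt_text: str) -> List[MallocSite]:
    """Return one MallocSite per malloc summary line found in the NMT output."""
    return [
        MallocSite(*(int(group) for group in match.groups()))
        for match in _MALLOC_LINE.finditer(nmt_text)
    ]


# The dominant call site from the two snapshots in this report.
baseline = parse_malloc_sites(
    "(malloc=870721KB type=Other +621478KB #6842866 +4937676)"
)[0]
later = parse_malloc_sites(
    "(malloc=4349265KB type=Other +4100021KB\n#34671096 +32765906)"
)[0]

growth_kb = later.malloc_kb - baseline.malloc_kb
growth_allocs = later.count - baseline.count
print(f"{growth_kb} KB ({growth_kb / 1024 / 1024:.2f} GiB), "
      f"{growth_allocs} additional allocations")
```

For the figures quoted above, this works out to roughly 3.3 GiB of additional memory and about 27.8 million additional allocations at that call site between the two snapshots.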
   
   The leaked memory was allocated by the 
[`CometArrowAllocator`](https://github.com/apache/datafusion-comet/blob/33706125b8c7a7f347865c7fb38fede6aceb97e9/common/src/main/scala/org/apache/comet/package.scala#L35).
 I've verified this by attaching a debugger to the Spark process and inspecting 
`CometArrowAllocator.getAllocatedMemory`:
   
   
![image](https://github.com/user-attachments/assets/3d0ddeb7-d6bd-4d97-8a57-44544fc1e19f)
   
   I've also deliberately disabled AQE coalesce partitions because of this 
issue: https://github.com/apache/datafusion-comet/issues/381. Although it has 
been fixed, I still disabled it to be safe. See the **Additional context** 
section for more details.
   
   ### Steps to reproduce
   
   Run the [TPC-H benchmark 
script](https://github.com/apache/datafusion-benchmarks/blob/main/runners/datafusion-comet/tpcbench.py)
 with `--iterations=100` and observe the RSS of the Apache Spark Java 
process.
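
One way to observe the RSS over time is to poll `ps` from a small script. The following is a minimal Python sketch, assuming a `ps` that supports `-o rss= -p <pid>` and reports RSS in kilobytes (as on macOS and Linux). The function names are hypothetical, and the Spark process ID has to be looked up separately, for example with `jps`.

```python
import subprocess
import time
from typing import List


def parse_rss_kb(ps_output: str) -> int:
    """Parse the single number printed by `ps -o rss= -p <pid>`."""
    return int(ps_output.strip())


def sample_rss_kb(pid: int) -> int:
    """Return the current resident set size of `pid` in KB."""
    result = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return parse_rss_kb(result.stdout)


def watch_rss(pid: int, interval_s: float, samples: int) -> List[int]:
    """Collect `samples` RSS readings, sleeping `interval_s` seconds between them."""
    readings = []
    for i in range(samples):
        readings.append(sample_rss_kb(pid))
        print(f"sample {i}: {readings[-1]} KB")
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


def total_growth_kb(readings: List[int]) -> int:
    """Difference between the last and the first reading, in KB."""
    return readings[-1] - readings[0] if readings else 0
```

Calling `total_growth_kb(watch_rss(pid, 60, 10))` while the benchmark runs gives a rough growth figure over about nine minutes. RSS fluctuates with normal JVM activity, so a steady upward trend across many iterations is more meaningful than any single pair of readings.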
   
   ### Expected behavior
   
   Memory usage should not increase over time.
   
   ### Additional context
   
   I'm simply running it locally with `master = local[4]`. Here are my test 
environment and Spark configurations:
   
   **Environment**:
   
   * Operating System: macOS 14.6.1, arch: Apple M1 Pro
   * Apache Spark: 3.5.1
   * DataFusion Comet: commit 
https://github.com/apache/datafusion-comet/commit/f7f0bb1ed68367b8d3e1c88010c1f943f480ea11
   * JVM: 17.0.10 (Eclipse Adoptium)
   
   **Spark configurations**:
   
   ```
   spark.master                     local[4]
   spark.driver.cores               4
   spark.executor.cores             4
   spark.driver.memory              4g
   spark.executor.memory            4g
   spark.comet.memory.overhead.factor 0.4
   
   spark.jars                     
/path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
   
   spark.driver.extraClassPath    
/path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
   spark.executor.extraClassPath  
/path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
   
   spark.serializer          org.apache.spark.serializer.KryoSerializer
   
   spark.sql.extensions         org.apache.comet.CometSparkSessionExtensions
   spark.comet.enabled          true
   spark.comet.exec.enabled     true
   spark.comet.exec.all.enabled true
   spark.comet.explainFallback.enabled false
   
   spark.comet.exec.shuffle.enabled true
   spark.comet.exec.shuffle.mode auto
   spark.shuffle.manager 
org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
   
   # Disable AQE entirely, including coalesce partitions
   spark.sql.adaptive.enabled   false
   spark.sql.adaptive.coalescePartitions.enabled  false
   
   # Enable debugging and native memory tracking
   spark.driver.extraJavaOptions  
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 
-XX:NativeMemoryTracking=detail
   ```
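
For a rough sense of why 20GB looks out of proportion to these settings, the arithmetic can be sketched in Python. This assumes that Comet derives its memory overhead as `spark.executor.memory` multiplied by `spark.comet.memory.overhead.factor`, with a 384 MiB floor; that formula, the floor, and the helper name `comet_overhead_mib` are assumptions for illustration and are not confirmed by this report. It also treats the reported 20GB as 20 GiB. In `local[4]` mode everything runs in the driver JVM, which is configured with the same 4g here.

```python
# Values taken from the Spark configuration in this report.
EXECUTOR_MEMORY_MIB = 4 * 1024   # spark.executor.memory = 4g
OVERHEAD_FACTOR = 0.4            # spark.comet.memory.overhead.factor

# Assumption: minimum overhead of 384 MiB (not stated in the report).
MIN_OVERHEAD_MIB = 384


def comet_overhead_mib(executor_mib: int, factor: float,
                       minimum_mib: int = MIN_OVERHEAD_MIB) -> int:
    """Assumed overhead formula: max(executor memory * factor, minimum)."""
    return max(int(executor_mib * factor), minimum_mib)


overhead_mib = comet_overhead_mib(EXECUTOR_MEMORY_MIB, OVERHEAD_FACTOR)
budget_mib = EXECUTOR_MEMORY_MIB + overhead_mib   # heap plus assumed Comet overhead
observed_mib = 20 * 1024                          # peak usage reported above

print(f"assumed Comet overhead: {overhead_mib} MiB")
print(f"heap + overhead budget: {budget_mib} MiB")
print(f"observed / budget:      {observed_mib / budget_mib:.1f}x")
```

Under these assumptions the heap plus overhead comes to well under 6 GiB, so the observed peak is more than three times that figure. The overhead setting is not necessarily a hard cap on Arrow Java allocations, so this is only an order-of-magnitude comparison.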
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

