nellaivijay opened a new issue, #55609:
URL: https://github.com/apache/spark/issues/55609
Description:
Problem Description:
When caching large DataFrames in Spark, the block fetching mechanism can
cause excessive memory pressure, leading to:
• Out-of-memory (OOM) errors in production workloads
• Performance degradation due to memory thrashing
• Unpredictable memory usage patterns during cache operations
• Unnecessary disk spilling even when data could fit in memory
Expected Behavior: Spark should monitor memory pressure during block
fetching operations and adaptively adjust fetching behavior to prevent
excessive memory pressure while maintaining
cache performance.
Actual Behavior: Spark's current block fetching mechanism loads cache blocks
without monitoring memory pressure. When caching large datasets:
• Spark attempts to fetch many blocks simultaneously
• Memory usage spikes rapidly during block fetching
• No adaptive mechanism to slow down or batch the fetching
• Results in memory pressure that can crash applications
Reproduction Steps:
// Create a large DataFrame
val data = (1 until 100000).map(i => (i, f"Data_{i}", "A" * 1000))
val df = spark.createDataFrame(data).toDF("id", "data", "description")
// Cache the DataFrame - causes memory pressure
df.cache()
df.count() // Materialize the cache
// Perform operations on cached data
df.filter($"id" > 50000).count()
Environment:
• Spark Version: 3.5.0+
• Component: BlockManager / Memory Management
• Deployment: Standalone, YARN, Kubernetes
• Impact: All environments where large DataFrames are cached
Observed Behavior:
• Memory usage spikes during cache materialization
• No adaptive mechanism to control memory pressure during block fetching
• System may run out of memory for large cached datasets
• Performance degradation due to memory pressure
Proposed Solution: Implement memory-aware progressive block fetching with:
1. Real-time memory pressure monitoring using JVM MemoryManagement APIs
2. Adaptive batch sizing based on current memory conditions
3. Progressive loading of cache blocks in manageable batches
4. Automatic cache eviction when memory surge detected
5. Configuration options for fine-grained control
Configuration Options:
spark.memory.awareBlockFetching.enabled true
spark.memory.awareBlockFetching.initialBatchSize 1000
spark.memory.awareBlockFetching.pressureThreshold 0.7
spark.memory.awareBlockFetching.autoEviction.enabled true
Benefits:
• Reduces risk of OOM errors during cache operations
• Provides predictable memory usage patterns
• Enables graceful degradation under memory pressure
• Maintains cache performance with adaptive behavior
• Fully backward compatible (opt-in feature)
Additional Context: This is particularly problematic for:
• Big data applications processing large datasets
• Machine learning workloads with cached training data
• ETL pipelines with intermediate caching
• Any Spark job that caches large DataFrames
The solution uses standard JVM MemoryManagement APIs (no external
dependencies) and follows Spark's coding standards. It provides a
production-ready implementation with comprehensive
testing and configuration options.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]