andygrove opened a new issue, #1397: URL: https://github.com/apache/datafusion-ballista/issues/1397
## Summary

Add a configuration option to enable Spark-compatible expression behavior by registering functions from the [`datafusion-spark`](https://crates.io/crates/datafusion-spark) crate. This would help users migrating from Spark get more consistent behavior without requiring a full Spark Connect implementation.

## Motivation

Ballista aims to be a compelling alternative to Apache Spark. While full Spark Connect protocol support is being addressed by other projects such as [LakeSail Sail](https://github.com/lakehq/sail), there is a simpler improvement that would help Spark users: ensuring that expression and function behavior matches Spark semantics.

The `datafusion-spark` crate (version 51.0.0, maintained alongside DataFusion) provides:

- Spark-compatible scalar functions
- Spark-compatible aggregate functions
- Spark-compatible window functions
- Spark-compatible table functions

These functions implement Spark's specific semantics, which can differ from DataFusion's defaults (e.g., null handling, type coercion, edge cases).

## Proposed Solution

### New Configuration Option

Add a new Ballista configuration key:

```rust
pub const BALLISTA_SPARK_COMPAT_MODE: &str = "ballista.spark_compat_mode";
```

With the config entry:

```rust
ConfigEntry::new(
    BALLISTA_SPARK_COMPAT_MODE.to_string(),
    "Enable Spark compatibility mode which registers Spark-compatible expressions from datafusion-spark".to_string(),
    DataType::Boolean,
    Some("false".to_string()),
)
```

### Implementation

When `ballista.spark_compat_mode` is enabled:

1. **Scheduler side**: register the datafusion-spark functions when creating the `SessionContext`
2. **Executor side**: ensure the same functions are available during plan execution

```rust
use datafusion_spark::register_all;

if config.spark_compat_mode() {
    register_all(&mut ctx)?;
}
```

### Feature Flag

Add an optional feature to ballista-core and ballista-scheduler:

```toml
[features]
spark-compat = ["datafusion-spark"]

[dependencies]
datafusion-spark = { version = "51", optional = true }
```

This keeps the dependency optional for users who don't need Spark compatibility.

## Usage

### CLI

```bash
ballista-scheduler --spark-compat-mode
ballista-executor --spark-compat-mode
```

### Environment Variable

```bash
BALLISTA_SPARK_COMPAT_MODE=true ballista-scheduler
```

### Programmatic

```rust
let config = BallistaConfig::builder()
    .set(BALLISTA_SPARK_COMPAT_MODE, "true")
    .build()?;
```

## Benefits

1. **Low effort, high value**: leverages the existing datafusion-spark crate
2. **Incremental migration path**: users can test Spark compatibility without full commitment
3. **Transparent**: a clear config flag makes the behavior explicit
4. **Optional**: feature-flagged to avoid bloating builds for users who don't need it

## Future Extensions

This could be extended to include:

- Spark SQL dialect parsing (when available in DataFusion)
- Additional Spark-specific behaviors (null ordering, case sensitivity)
- Integration with datafusion-comet-spark-expr for even more compatibility

## References

- [datafusion-spark crate](https://crates.io/crates/datafusion-spark)
- [DataFusion Spark Functions docs](https://datafusion.apache.org/library-user-guide/functions/spark.html)
- [datafusion-comet-spark-expr](https://crates.io/crates/datafusion-comet-spark-expr) (alternative/complementary)
- Ballista Spark Connect discussion: #964
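## Example Wiring (sketch)

For concreteness, here is a minimal sketch of how the scheduler and executor could share the registration logic so both sides always register the same function set. The `build_session_context` helper is a hypothetical name introduced only for illustration, and the `spark_compat_mode()` accessor (taken from the snippet above) would still need to be added to `BallistaConfig`; the `register_all` call is the one proposed in the Implementation section.

```rust
use ballista_core::config::BallistaConfig;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Hypothetical shared constructor (e.g. in ballista-core) used by both the
/// scheduler and the executor, so plans created on the scheduler never
/// reference functions the executors have not registered.
pub fn build_session_context(config: &BallistaConfig) -> Result<SessionContext> {
    let mut ctx = SessionContext::new();

    // Only compiled when the proposed `spark-compat` feature is enabled.
    #[cfg(feature = "spark-compat")]
    {
        // `spark_compat_mode()` is the assumed accessor for the proposed
        // `ballista.spark_compat_mode` config key.
        if config.spark_compat_mode() {
            // Registers the Spark-compatible scalar, aggregate, window, and
            // table functions from the datafusion-spark crate.
            datafusion_spark::register_all(&mut ctx)?;
        }
    }

    Ok(ctx)
}
```

Centralizing this in one helper avoids the failure mode where the flag is honored on the scheduler but forgotten on the executor (or vice versa), which would surface as "function not found" errors only at execution time.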
