andygrove opened a new issue, #1397: URL: https://github.com/apache/datafusion-ballista/issues/1397
## Summary

Add a configuration option to enable Spark-compatible expression behavior by registering functions from the [`datafusion-spark`](https://crates.io/crates/datafusion-spark) crate. This would help users migrating from Spark get more consistent behavior without requiring a full Spark Connect implementation.

## Motivation

Ballista aims to be a compelling alternative to Apache Spark. While full Spark Connect protocol support is being addressed by other projects such as [LakeSail Sail](https://github.com/lakehq/sail), there is a simpler improvement that would help Spark users: ensuring that expression and function behavior matches Spark semantics.

The `datafusion-spark` crate (version 51.0.0, maintained alongside DataFusion) provides:

- Spark-compatible scalar functions
- Spark-compatible aggregate functions
- Spark-compatible window functions
- Spark-compatible table functions

These functions implement Spark's specific semantics, which can differ from DataFusion's defaults (e.g., null handling, type coercion, edge cases).

## Proposed Solution

### New Configuration Option

Add a new Ballista configuration key:

```rust
pub const BALLISTA_SPARK_COMPAT_MODE: &str = "ballista.spark_compat_mode";
```

With the config entry:

```rust
ConfigEntry::new(
    BALLISTA_SPARK_COMPAT_MODE.to_string(),
    "Enable Spark compatibility mode which registers Spark-compatible expressions from datafusion-spark".to_string(),
    DataType::Boolean,
    Some("false".to_string()),
)
```

### Implementation

When `ballista.spark_compat_mode` is enabled:

1. **Scheduler side**: register the datafusion-spark functions when creating the `SessionContext`
2. **Executor side**: ensure the same functions are available during plan execution

```rust
use datafusion_spark::register_all;

if config.spark_compat_mode() {
    register_all(&mut ctx)?;
}
```

### Feature Flag

Add an optional feature to ballista-core and ballista-scheduler:

```toml
[features]
spark-compat = ["datafusion-spark"]

[dependencies]
datafusion-spark = { version = "51", optional = true }
```

This keeps the dependency optional for users who don't need Spark compatibility.

## Usage

### CLI

```bash
ballista-scheduler --spark-compat-mode
ballista-executor --spark-compat-mode
```

### Environment Variable

```bash
BALLISTA_SPARK_COMPAT_MODE=true ballista-scheduler
```

### Programmatic

```rust
let config = BallistaConfig::builder()
    .set(BALLISTA_SPARK_COMPAT_MODE, "true")
    .build()?;
```

## Benefits

1. **Low effort, high value**: leverages the existing datafusion-spark crate
2. **Incremental migration path**: users can test Spark compatibility without full commitment
3. **Transparent**: a clear config flag makes the behavior explicit
4. **Optional**: feature-flagged to avoid bloating builds for users who don't need it

## Future Extensions

This could be extended to include:

- Spark SQL dialect parsing (when available in DataFusion)
- Additional Spark-specific behaviors (null ordering, case sensitivity)
- Integration with datafusion-comet-spark-expr for even more compatibility

## References

- [datafusion-spark crate](https://crates.io/crates/datafusion-spark)
- [DataFusion Spark Functions docs](https://datafusion.apache.org/library-user-guide/functions/spark.html)
- [datafusion-comet-spark-expr](https://crates.io/crates/datafusion-comet-spark-expr) (alternative/complementary)
- Ballista Spark Connect discussion: #964
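## Example Wiring (sketch)

For concreteness, here is a minimal sketch of how the scheduler and executor could share the registration logic so both sides always register the same function set. The `build_session_context` helper is a hypothetical name introduced only for illustration, and the `spark_compat_mode()` accessor (taken from the snippet above) would still need to be added to `BallistaConfig`; the `register_all` call is the one proposed in the Implementation section.

```rust
use ballista_core::config::BallistaConfig;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Hypothetical shared constructor (e.g. in ballista-core) used by both the
/// scheduler and the executor, so plans created on the scheduler never
/// reference functions the executors have not registered.
pub fn build_session_context(config: &BallistaConfig) -> Result<SessionContext> {
    let mut ctx = SessionContext::new();

    // Only compiled when the proposed `spark-compat` feature is enabled.
    #[cfg(feature = "spark-compat")]
    {
        // `spark_compat_mode()` is the assumed accessor for the proposed
        // `ballista.spark_compat_mode` config key.
        if config.spark_compat_mode() {
            // Registers the Spark-compatible scalar, aggregate, window, and
            // table functions from the datafusion-spark crate.
            datafusion_spark::register_all(&mut ctx)?;
        }
    }

    Ok(ctx)
}
```

Centralizing this in one helper avoids the failure mode where the flag is honored on the scheduler but forgotten on the executor (or vice versa), which would surface as "function not found" errors only at execution time.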
