mpurins-coralogix opened a new issue, #578:
URL: https://github.com/apache/arrow-ballista/issues/578

   **Describe the bug**
   When physical plan includes datafusion builtin function `starts_with` then 
it is serialized as UDF.
   
   **To Reproduce**
   
   Following test (can be added in somewhere in 
https://github.com/apache/arrow-ballista/blob/master/ballista/core/src/serde/physical_plan/mod.rs)
 should pass but it fails with `DataFusionError(Plan("There is no UDF named 
\"startswith\" in the registry"))`
   ```rust
   #[test]
       fn roundtrip_builtin_startswith_function() -> Result<()> {
           let field_a = Field::new("a", DataType::Utf8, false);
           let field_b = Field::new("b", DataType::Utf8, false);
           let schema = Arc::new(Schema::new(vec![field_a, field_b]));
   
           let input = Arc::new(EmptyExec::new(false, schema.clone()));
   
           let execution_props = ExecutionProps::new();
   
           let fun_expr = functions::create_physical_fun(
               &BuiltinScalarFunction::StartsWith,
               &execution_props,
           )?;
   
           let expr = ScalarFunctionExpr::new(
               &format!("{}", BuiltinScalarFunction::StartsWith), // Note that 
this is how ballista creates this name in 
https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/functions.rs#L183
               fun_expr,
               vec![col("a", &schema)?, col("b", &schema)?],
               &DataType::Boolean,
           );
   
           let project =
               ProjectionExec::try_new(vec![(Arc::new(expr), "a".to_string())], 
input)?;
   
           roundtrip_test(Arc::new(project))
       }
   ```
   
   **Expected behavior**
   Above test should pass
   
   **Additional context**
   This is pretty much happening because UDFs and builtin functions are 
represented with the same `ScalarFunctionExpr` and ballista when serializing 
uses function name (which is `startswith`) and 
`BuiltinScalarFunction::from_str` (which matches only on `starts_with`) to 
differentiate between builtin and user defined functions.
   
   While simplest fix probably would be in datafusion 
`BuiltinScalarFunction::from_str` to handle `startswith` as function name as 
well I believe that it would be better to fix it in some other way where 
builtin and udf functions are thrown in the same bag. Current approach means 
that in some given project code could unexpectedly break on datafusion updates 
if datafusion introduces new builtin function and project already had user 
defined function under the same name.
   
   I wonder if it would be reasonable to separate builtin and user defined 
functions already in datafusion physical plan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to