parthchandra opened a new pull request, #4350:
URL: https://github.com/apache/datafusion-comet/pull/4350

     Which issue does this PR close?

     Part of #4150.
   
     Rationale for this change
   
     parse_url is a commonly used Spark SQL function that currently falls back
     to the JVM. Adding native support lets Comet accelerate queries that use
     it. The native implementation already exists in DataFusion (the
     datafusion-spark crate); this PR wires the Spark expression through the
     Comet serde layer to the native UDFs.
   
     Because the native implementation has known divergences from Spark on
     certain edge cases (percent-encoding in QUERY values and the FILE part for
     path-less URLs, both tracked in
     https://github.com/apache/datafusion/issues/21943), the expression is
     marked Incompatible and gated behind
     spark.comet.expression.ParseUrl.allowIncompatible=true.
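
The gating described above can be sketched as a small decision function. This is only an illustrative model: the `SupportLevel` and `Plan` types and the `plan_for` function are hypothetical names invented for this sketch and are not Comet's actual API. The only detail taken from the PR is that an Incompatible expression runs natively only when its allowIncompatible setting is true.

```rust
/// Hypothetical model of the support level a serde handler reports.
#[derive(Debug, PartialEq)]
enum SupportLevel {
    Compatible,
    /// Carries the reason that is surfaced when the expression falls back.
    Incompatible(&'static str),
}

/// Hypothetical outcome of planning a single expression.
#[derive(Debug, PartialEq)]
enum Plan {
    Native,
    FallBackToJvm(String),
}

/// Decide whether an expression runs natively.
/// `allow_incompatible` stands in for the value of
/// spark.comet.expression.ParseUrl.allowIncompatible.
fn plan_for(level: &SupportLevel, allow_incompatible: bool) -> Plan {
    match level {
        SupportLevel::Compatible => Plan::Native,
        // Incompatible expressions run natively only when the user opts in.
        SupportLevel::Incompatible(_) if allow_incompatible => Plan::Native,
        SupportLevel::Incompatible(reason) => Plan::FallBackToJvm((*reason).to_string()),
    }
}
```

With the default setting (false), `plan_for` returns the fallback variant together with the reason string, which mirrors what the default-config test in this PR checks.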
   
     Note: https://github.com/apache/datafusion-comet/pull/4152 implemented
     parse_url but was closed in favor of
     https://github.com/apache/datafusion-comet/pull/4231, which did not
     implement parse_url.
   
     What changes are included in this PR?
   
     - Serde handler (url.scala): New CometParseUrl handler that maps ParseUrl
       to either the parse_url or the try_parse_url native UDF, depending on
       failOnError (ANSI mode).
     - Spark 4.x shim (both spark-4.0 and spark-4.1 CometExprShim.scala):
       Handle Invoke(Literal(ParseUrlEvaluator), "evaluate", args), the
       rewritten form of ParseUrl in Spark 4.x, by reconstructing a ParseUrl
       node and re-dispatching through the serde framework. Propagates
       EXTENSION_INFO tags so fallback reasons are reported correctly.
     - UDF registration (jni_api.rs): Register SparkParseUrl and
       SparkTryParseUrl UDFs with the DataFusion session context.
     - Expression map (QueryPlanSerde.scala): Add urlExpressions map and
       include it in the combined expression lookup.
     - Doc generation (GenerateDocs.scala): Add "url" category so
       compatibility docs are auto-generated.
     - Compatibility docs (url.md): Template for auto-generated URL expression
       compatibility table.
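
As a rough illustration of the failOnError dispatch in the serde handler, the sketch below picks a UDF name and contrasts strict and "try" behavior. The UDF names come from this PR. The function names and the toy authority extractor are hypothetical stand-ins, not the datafusion-spark parser, and the sketch assumes that the try variant yields NULL where the strict variant raises an error.

```rust
/// Pick the native UDF name from the failOnError flag.
/// ANSI mode (failOnError=true) uses the strict parse_url UDF.
fn native_udf_name(fail_on_error: bool) -> &'static str {
    if fail_on_error {
        "parse_url"
    } else {
        "try_parse_url"
    }
}

/// Toy strict extractor (hypothetical): rejects input containing
/// whitespace, returns Ok(None) when there is no "://" section, and
/// otherwise returns everything between "://" and the next '/', '?' or '#'.
fn strict_authority(url: &str) -> Result<Option<String>, String> {
    if url.chars().any(char::is_whitespace) {
        return Err(format!("invalid URL: {:?}", url));
    }
    let rest = match url.split_once("://") {
        Some((_, rest)) => rest,
        None => return Ok(None),
    };
    let end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let authority = &rest[..end];
    Ok(if authority.is_empty() {
        None
    } else {
        Some(authority.to_string())
    })
}

/// "try" variant: the same extraction, but an error becomes None (NULL).
fn lenient_authority(url: &str) -> Option<String> {
    strict_authority(url).ok().flatten()
}
```

For a malformed input such as "http://bad host/", `strict_authority` returns an error while `lenient_authority` returns `None`; for well-formed input both agree.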
   
     How are these changes tested?
   
     Three new SQL test files covering different configurations:

     - parse_url.sql: Default config (allowIncompatible=false). Verifies the
       expression falls back with the expected incompatibility reason.
     - parse_url_enabled.sql: With allowIncompatible=true. Exercises all URL
       parts (HOST, PATH, QUERY, PROTOCOL, REF, AUTHORITY, USERINFO, FILE),
       literal and column-valued arguments, NULL handling, malformed URLs,
       column-valued part keys, and known edge cases (two tests marked ignore
       for documented divergences).
     - parse_url_ansi.sql: ANSI mode (failOnError=true). Verifies the
       parse_url (non-try) native path works for valid URLs.
   
     All tests pass on both Spark 4.0 and Spark 4.1 profiles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
