rdblue commented on code in PR #14117:
URL: https://github.com/apache/iceberg/pull/14117#discussion_r2713846566

##########
format/udf-spec.md:
##########
@@ -0,0 +1,322 @@
+---
+title: "SQL UDF Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg UDF Spec
+
+## Background and Motivation
+
+A SQL user-defined function (UDF or UDTF) is a callable routine that accepts input parameters and executes a function body.
+Depending on the function type, the result can be:
+
+- **Scalar function (UDF)** – returns a single value, which may be a primitive type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).
+- **Table function (UDTF)** – returns a table with zero or more rows of columns with a uniform schema.
+
+Many compute engines (e.g., Spark, Trino) already support UDFs, but in different and incompatible ways. Without a common
+standard, UDFs cannot be reliably shared across engines or reused in multi-engine environments.
+
+This specification introduces a standardized metadata format for UDFs in Iceberg.
+
+## Goals
+
+* Define a portable metadata format for both scalar and table SQL UDFs. The metadata is self-contained and can be moved across catalogs.
+* Support function evolution through versioning and rollback.
+* Provide consistent semantics for representing UDFs across engines.
+
+## Overview
+
+UDF metadata follows the same design principles as Iceberg table and view metadata: each function is represented by a
+**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
+properties, and engine-specific representations.
+
+* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.
+* Each metadata file includes recent definition versions, enabling rollbacks without external state.
+
+## Specification
+
+### UDF Metadata
+The UDF metadata file has the following fields:
+
+| Requirement | Field name | Type | Description |
+|-------------|-------------------|------------------------|-----------------------------------------------------------------------|
+| *required* | `function-uuid` | `string` | A UUID that identifies the function, generated once at creation. |
+| *required* | `format-version` | `int` | Metadata format version (must be `1`). |
+| *required* | `definitions` | `list<definition>` | List of function [definition](#definition) entities. |
+| *required* | `definition-log` | `list<definition-log>` | History of [definition snapshots](#definition-log). |
+| *optional* | `location` | `string` | The function's base location; used to create metadata file locations. |
+| *optional* | `properties` | `map<string,string>` | A string-to-string map of properties. |
+| *optional* | `secure` | `boolean` | Whether it is a secure function. Default: `false`. |
+| *optional* | `doc` | `string` | Documentation string. |
+
+Notes:
+1. Engines must prevent leakage of sensitive information when a function is marked as `secure` by setting it to `true`.
+2. Entries in `properties` are treated as hints, not strict rules.
+
+### Definition
+
+Each `definition` represents one function signature (e.g., `add_one(int)` vs `add_one(float)`).
+
+| Requirement | Field name | Type | Description |
+|-------------|----------------------|-------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
+| *required* | `definition-id` | `string` | An identifier derived from canonical parameter-type tuple (lowercase, no spaces; e.g., `"(int,int,string)"`). |
+| *required* | `parameters` | `list<parameter>` | Ordered list of [function parameters](#parameter). Invocation order **must** match this list. |
+| *required* | `return-type` | `string` | Declared return type (see [Parameter Type](#parameter-type)). |
+| *optional* | `nullable-return` | `boolean` | A hint to indicate whether the return value is nullable or not. Default: `true`. |
+| *required* | `versions` | `list<definition-version>` | [Versioned implementations](#definition-version) of this definition. |
+| *required* | `current-version-id` | `int` | Identifier of the current version for this definition. |
+| *optional* | `function-type` | `string` (`"udf"` or `"udtf"`, default `"udf"`) | If `"udtf"`, `return-type` must be an Iceberg type `struct` describing the output schema. |
+| *optional* | `doc` | `string` | Documentation string. |
+
+### Parameter
+| Requirement | Field | Type | Description |
+|-------------|--------|----------|--------------------------------------------------------------|
+| *required* | `type` | `string` | Parameter data type (see [Parameter Type](#parameter-type)). |
+| *required* | `name` | `string` | Parameter name. |
+| *optional* | `doc` | `string` | Parameter documentation. |
+
+Notes:
+1. Function definitions are identified by the tuple of `type`s and there can be only one definition for a given tuple.
+   The type tuple is immutable across versions.
+2. Variadic (vararg) parameters are not supported. Each definition must declare a fixed number of parameters.
+3. Each parameter input MUST be assignable to its declared Iceberg type. For complex types, the value’s
+   structure must match (correct field names, element/key/value types, and nesting). If a parameter—or any nested

Review Comment:
What does it mean for a value to be assignable? Similarly, what does it mean for the field names to match? To me, these are details that are specific to the engine and this is over-reaching a bit.

If you have a parameter like `point struct<x int, y int>`, then it doesn't really matter how the engine represents that data internally. Lots of engines may resolve and drop names and use ordinal positions instead because their internal representation (like Spark's) does not operate on field names. That's perfectly fine as long as when the SQL definition refers to `point.x` that the correct data field is used.

I also think there could be strange issues from interpreting "assignable". Is it okay to pass a short rather than an int? It depends. It's okay for an engine to upcast values in order to call a UDF that accepts an int, but that's not to say it can pass a short -- it still has to do the cast before calling the UDF code. For instance, if we had a Python representation, the code could actually check the Python type so a type mismatch must be handled by the engine.

Again, it really comes down to how the engine handles things internally and we don't want to be too prescriptive about that. What if a SQL UDF only references `point.x` and the engine optimizes the code and discards the unused `point.y` value? I think we're perfectly fine with that.

My suggestion here is that we remove this part of the note. I think it is clear what the semantics are. I can't think of anything I'm worried about going wrong here. It should be clear that values passed into UDF definitions should match the declared argument types. If you want to have something here, then that's what I would say: "values passed into UDFs must match the declared argument types".
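
The short-versus-int point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Value`, `SAFE_UPCASTS`, `coerce_argument`, `call_udf`, and the `add_one` body are all hypothetical names invented here, not part of the spec or of any engine's API, and the toy type system knows only `short` and `int`. The sketch shows the engine doing the cast before the call, so the UDF body only ever sees a value of the declared type.

```python
from dataclasses import dataclass


# Hypothetical toy model of an engine-side value; not an Iceberg or engine API.
@dataclass(frozen=True)
class Value:
    type_name: str  # e.g. "short" or "int"
    data: int


# Assumed widening rule for this sketch only: a short may be upcast to an int.
SAFE_UPCASTS = {("short", "int")}


def coerce_argument(value: Value, declared_type: str) -> Value:
    """Engine-side step: make the value match the declared parameter type."""
    if value.type_name == declared_type:
        return value
    if (value.type_name, declared_type) in SAFE_UPCASTS:
        # The cast happens here, before the UDF code is invoked.
        return Value(declared_type, value.data)
    raise TypeError(f"cannot pass {value.type_name} as {declared_type}")


def add_one(x: Value) -> Value:
    """Hypothetical UDF body declared as add_one(int); it may check its input."""
    if x.type_name != "int":
        raise TypeError(f"add_one expects int, got {x.type_name}")
    return Value("int", x.data + 1)


def call_udf(arg: Value) -> Value:
    # The engine coerces first, so the UDF never receives a short.
    return add_one(coerce_argument(arg, "int"))


print(call_udf(Value("short", 41)))
```

Calling `add_one(Value("short", 41))` directly raises `TypeError` in this sketch, which is the case the engine has to prevent; how it does so internally is left to the engine.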
