flyrain commented on code in PR #14117:
URL: https://github.com/apache/iceberg/pull/14117#discussion_r2771641415


##########
format/udf-spec.md:
##########
@@ -0,0 +1,352 @@
+---
+title: "SQL UDF Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg UDF Spec
+
+## Background and Motivation
+
+A SQL user-defined function (UDF or UDTF) is a callable routine that accepts 
input parameters and executes a function body.
+Depending on the function type, the result can be:
+
+- **Scalar function (UDF)** – returns a single value, which may be a primitive 
type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).
+- **Table function (UDTF)** – returns a table with zero or more rows of 
columns with a uniform schema.
+
+This specification introduces a standardized metadata format for UDFs in 
Iceberg.
+
+## Goals
+
+* Define a portable metadata format for both scalar and table SQL UDFs. The 
metadata is self-contained and can be moved across catalogs.
+* Support function evolution through versioning and rollback.
+* Provide consistent semantics for representing UDFs across engines.
+
+## Overview
+
+UDF metadata follows the same design principles as Iceberg table and view 
metadata: each function is represented by a
+**self-contained metadata file**. Metadata captures definitions, parameters, 
return types, documentation, security,
+properties, and engine-specific representations.
+
+* UDF metadata files are immutable. Any modification (new definition, updated 
representation, changed properties, etc.)
+  creates a new metadata file. Catalogs use an atomic swap to update the 
metadata file linked to a catalog identifier.
+* Each metadata file includes recent definition versions, enabling rollbacks 
without external state.
+
+## Specification
+
+### UDF Metadata
+The UDF metadata file has the following fields:
+
+| Requirement | Field name       | Type                   | Description                                                               |
+|-------------|------------------|------------------------|---------------------------------------------------------------------------|
+| *required*  | `function-uuid`  | `string`               | A UUID that identifies this UDF, generated once at creation.              |
+| *required*  | `format-version` | `int`                  | UDF specification format version (must be `1`).                          |
+| *required*  | `definitions`    | `list<definition>`     | List of function [definition](#definition) entities.                      |
+| *required*  | `definition-log` | `list<definition-log>` | History of [versions within the function's definitions](#definition-log). |
+| *optional*  | `location`       | `string`               | The function's base location; used to create metadata file locations.     |
+| *optional*  | `properties`     | `map<string,string>`   | A string-to-string map of properties.                                     |
+| *optional*  | `secure`         | `boolean`              | Whether it is a secure function. Default: `false`.                        |
+| *optional*  | `doc`            | `string`               | Documentation string.                                                     |
+
+Notes:
+1. When `secure` is set to `true`, engines should prevent leakage of sensitive 
information to end users. Each engine
+   may have its own security definition and mechanisms. It is the 
administrator's responsibility to ensure that secure
+   UDFs are properly configured and protected in their environment.
+2. Entries in `properties` are treated as hints, not strict rules.
+3. UDF names are not stored in metadata. It's the catalog's responsibility to 
map UDF names to metadata file locations.
+
+### Definition
+
+Each `definition` represents one function signature (e.g., `add_one(int)` vs 
`add_one(float)`). A definition is uniquely
+identified by its signature (the ordered list of parameter types). There can 
be only one definition for a given signature.
+All versions within a definition must accept the same signature as specified 
in the definition's `parameters` field and
+must produce values of the declared `return-type`.
+
+| Requirement | Field name           | Type                           | Description                                                                                       |
+|-------------|----------------------|--------------------------------|---------------------------------------------------------------------------------------------------|
+| *required*  | `definition-id`      | `string`                       | An identifier derived from canonical parameter-type tuple (see [Definition ID](#definition-id)).  |
+| *required*  | `parameters`         | `list<parameter>`              | Ordered list of [function parameters](#parameter). Invocation order **must** match this list.     |
+| *required*  | `return-type`        | `string` or `object`           | Declared return type using [Types](#types).                                                       |
+| *optional*  | `return-nullable`    | `boolean`                      | A hint to indicate whether the return value is nullable or not. Default: `true`.                  |
+| *required*  | `versions`           | `list<definition-version>`     | [Versioned implementations](#definition-version) of this definition.                              |
+| *required*  | `current-version-id` | `int`                          | Identifier of the current version for this definition.                                            |
+| *required*  | `function-type`      | `string` (`"udf"` or `"udtf"`) | If `"udtf"`, `return-type` must be a `struct` (see [Types](#types)) describing the output schema. |
+| *optional*  | `doc`                | `string`                       | Documentation string.                                                                             |
+
+### Parameter
+| Requirement | Field  | Type     | Description                                |
+|-------------|--------|----------|--------------------------------------------|
+| *required*  | `type` | `string` | Parameter data type (see [Types](#types)). |
+| *required*  | `name` | `string` | Parameter name.                            |
+| *optional*  | `doc`  | `string` | Parameter documentation.                   |
+
+Notes:
+1. Variadic (vararg) parameters are not supported. Each definition must 
declare a fixed number of parameters.
+2. Parameterized signatures are not supported (e.g., `<E> array_agg(E)`).
+3. All types, including parameters and return types, are considered nullable.
+
+#### Types
+[iceberg-type-json]: https://iceberg.apache.org/spec/#appendix-c-json-serialization
+Types are based on the [Iceberg 
Type](https://iceberg.apache.org/spec/#schemas-and-data-types).
+
+Primitive and semi-structured type strings are encoded based on [Iceberg Type 
JSON Representation][iceberg-type-json]
+(e.g., `int`, `string`, `timestamp`, `decimal(9,2)`, `variant`). Type strings 
must contain no spaces or quote characters.
+
+Nested types (`struct`, `list`, `map`) use the [Iceberg Type JSON 
Representation][iceberg-type-json] with the
+following fields required. Any other fields must be ignored.
+* `list` requires `type` and `element`, e.g., `{ "type": "list", "element": 
"string" }`
+* `map` requires `type`, `key`, and `value`, e.g., `{ "type": "map", "key": 
"string", "value": "int" }`
+* `struct` requires `type` and `fields`, where each field requires `name` and 
`type`,
+  e.g., `{ "type": "struct", "fields": [ { "name": "id", "type": "int" }, { 
"name": "name", "type": "string" } ] }`
+
+#### Definition ID
+The `definition-id` is a canonical string derived from the parameter types, 
formatted as a comma-separated list with no
+spaces. Each type uses the following string representation:
+
+* Primitives and semi-structured: the type name (e.g., `int`, `variant`)
+* List: `list<element-type>` (e.g., `list<int>`)
+* Map: `map<key-type,value-type>` (e.g., `map<string,int>`)
+* Struct: `struct<name1:type1,name2:type2,...>` with field names and types 
(e.g., `struct<id:int,name:string>`)
+
+Examples of complete definition-id signatures:
+* `int` – single int parameter
+* `int,string` – two parameters: int and string
+* `int,list<int>,struct<id:int,name:string>` – three parameters: an int, a 
list and a struct
+
+### Definition Version
+
+Each definition can evolve over time by introducing new versions.
+A `definition version` represents a specific implementation of that definition 
at a given point in time.
+
+| Requirement | Field name        | Type                                                     | Description                                                    |
+|-------------|-------------------|----------------------------------------------------------|----------------------------------------------------------------|
+| *required*  | `version-id`      | `int`                                                    | Monotonically increasing identifier of the definition version. |
+| *required*  | `representations` | `list<representation>`                                   | [UDF implementations](#representation).                        |
+| *optional*  | `deterministic`   | `boolean` (default `false`)                              | Whether the function is deterministic.                         |
+| *optional*  | `on-null-input`   | `string` (`"return-null"` or `"call"`, default `"call"`) | Defines how the UDF behaves when any input parameter is NULL.  |
+| *required*  | `timestamp-ms`    | `long` (unix epoch millis)                               | Creation timestamp of this version.                            |
+
+
+#### Null Input Handling
+`on-null-input` provides an optimization hint for query engines:
+1. If set to `return-null`, the function always returns `NULL` if any input 
argument is `NULL`. This allows engines to
+   apply predicate pushdown or skip function evaluation for rows with `NULL` 
inputs. For a function `f(x, y) = x + y`,
+   the engine can safely rewrite `WHERE f(a,b) > 0` as `WHERE a IS NOT NULL 
AND b IS NOT NULL AND f(a,b) > 0`.
+2. If set to `call`, the function may handle `NULL`s internally (e.g., 
`COALESCE`, `NVL`, `IFNULL`), so the engine must
+   execute the function even if some inputs are `NULL`.
+
+### Representation
+Each representation is an object with at least one common field, `type`, that 
is one of the following:
+* `sql`: a SQL expression that defines the function body
+
+Representations further define metadata for each type.
+
+#### SQL Representation
+
+A definition version can have multiple SQL representations of different 
dialects, but only one SQL representation per dialect.
+The SQL representation stores the function body as a SQL expression, with 
metadata such as the SQL dialect.
+
+| Requirement | Field name | Type     | Description                                          |
+|-------------|------------|----------|------------------------------------------------------|
+| *required*  | `type`     | `string` | Must be `"sql"`                                      |
+| *required*  | `dialect`  | `string` | SQL dialect identifier (e.g., `"spark"`, `"trino"`). |
+| *required*  | `sql`      | `string` | SQL expression text.                                 |

Review Comment:
   Discussed offline with @singhpk234. We chose to follow the view spec and 
intentionally avoid specifying this further. There is a fine balance between 
providing enough clarity in the specification and over-specifying behavior that 
is better left to implementations.
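
For readers following the quoted hunk, the "Definition ID" rules can be sketched in a few lines of Python. This is only an illustration, not part of the PR or of any Iceberg library: the function names `type_to_id` and `definition_id` are hypothetical, and it assumes each parameter's `type` is either a type string or a nested-type JSON object as described under "Types".

```python
def type_to_id(t):
    """Render one type (a string or a nested-type dict) in definition-id form."""
    if isinstance(t, str):
        # Primitives and semi-structured types are used as-is, e.g. "int", "variant".
        return t
    kind = t["type"]
    if kind == "list":
        return f"list<{type_to_id(t['element'])}>"
    if kind == "map":
        return f"map<{type_to_id(t['key'])},{type_to_id(t['value'])}>"
    if kind == "struct":
        fields = ",".join(
            f"{f['name']}:{type_to_id(f['type'])}" for f in t["fields"]
        )
        return f"struct<{fields}>"
    raise ValueError(f"unsupported type: {kind!r}")


def definition_id(parameters):
    """Join the parameter types, in order, with commas and no spaces."""
    return ",".join(type_to_id(p["type"]) for p in parameters)


# Hypothetical parameter list mirroring the third example in the quoted spec text.
params = [
    {"name": "a", "type": "int"},
    {"name": "b", "type": {"type": "list", "element": "int"}},
    {"name": "c", "type": {"type": "struct", "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ]}},
]
print(definition_id(params))
```

Parameter names are deliberately ignored here, since the quoted text derives the ID from the ordered parameter types only.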



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

