Hi Holden,

Yes, that’s exactly the motivation: replace the “global function” hacks.

For requirements, I’d avoid dynamic installs in Phase 1. The initial contract is: the spec can declare pythonVersion / requirements / environmentUri, and Spark validates and fails fast if the runtime isn’t satisfied. Dynamic installs could be an opt-in follow-up.
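To make that contract concrete, here is a rough sketch. Nothing below is the proposed API: the field names (pythonVersion, requirements, environmentUri) come from the SPIP, but every type and method name here is illustrative only.

import java.util.List;
import java.util.Optional;

// Illustrative stand-in for the function spec's runtime fields.
interface CodeFunctionSpec {
  Optional<String> pythonVersion();   // e.g. "3.10"
  List<String> requirements();        // pip-style specs, e.g. "pandas>=2.0"
  Optional<String> environmentUri();  // opaque pointer; not provisioned in v1
}

// Hypothetical view of the current PySpark worker environment.
interface WorkerEnv {
  String pythonVersion();
  boolean satisfies(String pipRequirement);
}

final class RuntimeCheck {
  // Fails at analysis time if the declared runtime isn't satisfied; a real
  // implementation would raise AnalysisException with a proper error class
  // instead of IllegalStateException.
  static void failFast(CodeFunctionSpec spec, WorkerEnv env) {
    spec.pythonVersion().ifPresent(v -> {
      if (!env.pythonVersion().startsWith(v)) {
        throw new IllegalStateException(
            "Function requires Python " + v
                + " but the worker runs " + env.pythonVersion());
      }
    });
    for (String req : spec.requirements()) {
      if (!env.satisfies(req)) {
        throw new IllegalStateException("Unsatisfied requirement: " + req);
      }
    }
    // environmentUri is a pass-through extension point in v1: Spark does
    // not provision environments out-of-the-box.
  }
}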
Thanks,
Huaxin

On Fri, Feb 13, 2026 at 12:18 PM Holden Karau <[email protected]> wrote:

> I like the idea of this a lot. I’ve seen a bunch of hacks at companies to
> make global functions within the company, and this seems like a much better
> way of doing it.
>
> For the requirements option, would it make sense to try and install them
> dynamically? (Fail fast seems like the way to start, though.)
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Wed, Jan 28, 2026 at 1:12 PM Szehon Ho <[email protected]> wrote:
>
>> This sounds useful, especially with Iceberg proposals like versioned SQL
>> UDFs. On the surface it sounds like we could extend the DSv2 FunctionCatalog
>> (which, as you point out, lacks dynamic create/drop function today), but I
>> may not know some details. I would like to hear the opinions of others who
>> have worked more on functions/UDFs.
>>
>> Thanks!
>> Szehon
>>
>> On Wed, Jan 7, 2026 at 9:32 PM huaxin gao <[email protected]> wrote:
>>
>>> Hi Wenchen,
>>>
>>> Great question. In the SPIP, the language runtime is carried in the
>>> function spec (for python / python-pandas) so catalogs can optionally
>>> declare constraints on the execution environment.
>>>
>>> Concretely, the spec can include optional fields like:
>>>
>>> - pythonVersion (e.g., "3.10")
>>> - requirements (pip-style specs)
>>> - environmentUri (optional pointer to a pre-built / admin-approved
>>>   environment)
>>>
>>> For the initial stage, we assume execution uses the existing PySpark
>>> worker environment (same as regular Python UDF / pandas UDF). If
>>> pythonVersion / requirements are present, Spark can validate them
>>> against the current worker env and fail fast (AnalysisException) if
>>> they’re not satisfied.
>>>
>>> environmentUri is intended as an extension point for future integrations
>>> (or vendor plugins) to select a vetted environment, but we don’t assume
>>> Spark will provision environments out-of-the-box in v1.
>>>
>>> Thanks,
>>>
>>> Huaxin
>>>
>>> On Wed, Jan 7, 2026 at 6:06 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> This is a great feature! How do we define the language runtime, e.g.,
>>>> the Python version and libraries? Do we assume the Python runtime is the
>>>> same as the PySpark worker?
>>>>
>>>> On Thu, Jan 8, 2026 at 3:12 AM huaxin gao <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I’d like to start a discussion on a draft SPIP
>>>>> <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>:
>>>>>
>>>>> *SPIP: Catalog-backed Code-Literal Functions (SQL and Python) with
>>>>> Catalog SPI and CRUD*
>>>>>
>>>>> *Problem:* Spark can’t load SQL/Python function bodies from external
>>>>> catalogs in a standard way today, so users rely on session registration
>>>>> or vendor extensions.
>>>>>
>>>>> *Proposal:*
>>>>>
>>>>> - Add CodeLiteralFunctionCatalog (Java SPI) returning CodeFunctionSpec
>>>>>   with implementations (spark-sql, python, python-pandas).
>>>>> - Resolution:
>>>>>   - SQL: parse + inline (deterministic ⇒ foldable).
>>>>>   - Python/pandas: run via the existing Python UDF / pandas UDF
>>>>>     runtime (opaque).
>>>>>   - SQL TVF: parse to a plan, substitute params, validate the schema.
>>>>> - DDL: CREATE/REPLACE/DROP FUNCTION delegates to the catalog if it
>>>>>   implements the SPI; otherwise falls back.
>>>>>
>>>>> *Precedence + defaults:*
>>>>>
>>>>> - Unqualified: temp/session > built-in/DSv2 > code-literal (current
>>>>>   catalog). Qualified names resolve only in the named catalog.
>>>>> - Defaults: feature on, SQL on, Python/pandas off; optional
>>>>>   languagePreference.
>>>>>
>>>>> Feedback is welcome!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Huaxin
>>>>
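P.S. For anyone skimming the quoted proposal, the SPI shape it describes is roughly the following. This is a sketch reusing the CodeFunctionSpec stand-in from my note above; only the names CodeLiteralFunctionCatalog and CodeFunctionSpec come from the SPIP, and the signatures are illustrative, not the final API.

import java.util.Map;

interface CodeLiteralFunctionCatalog {
  // Lookup: return the spec, which carries one or more implementations
  // keyed by language ("spark-sql", "python", "python-pandas").
  CodeFunctionSpec loadFunction(String[] namespace, String name);

  // CRUD: CREATE [OR REPLACE] / DROP FUNCTION DDL delegates here when the
  // named catalog implements this SPI; otherwise Spark falls back to the
  // existing function-resolution behavior.
  void createFunction(String[] namespace, String name,
                      Map<String, String> implementationsByLanguage,
                      boolean replace);
  void dropFunction(String[] namespace, String name);
}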
