Thanks for the proposal. UDFs have been known to be noticeably slow,
especially for languages where we run an external process and have to do
inter-process communication, so this is an interesting topic.

The starting question for this proposal would be coverage. The proposal
says we create an AST and try to convert it to a Catalyst plan. Since it
does not sound like we are generating Java code/bytecode, I assume this
only leverages built-in operators/expressions.

That said, when we say a "simple" UDF, what exactly is the scope of
"simple" here? To me, it sounds like if the UDF can be translated to a
Catalyst plan (without a UDF), then the UDF was actually something users
could have written via the DataFrame API without a UDF in the first place,
unless the translation relies on non-user-facing expressions that users
cannot reach directly. The same applies to Pandas on Spark covering Pandas
UDFs. Do we actually see such cases, e.g. users failing to express their
logic with built-in SQL expressions even though they could, and ending up
choosing a UDF instead? I think this needs more clarification, given that
it is really a user-facing contract and a key factor in evaluating whether
this project is successful.
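
To make that concrete, here is a minimal hypothetical sketch (the
DataFrame, column names, and UDF are my own illustration, not taken from
the design doc) of the kind of UDF I would call "simple": the same logic
is already expressible with built-in expressions, which is presumably what
the transpiled Catalyst plan would amount to:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).withColumn("price", F.col("id") * 1.5)

    # "Simple" UDF: pure arithmetic on a single column, no external libraries.
    add_tax = F.udf(lambda p: p * 1.1, DoubleType())
    df.withColumn("with_tax", add_tax("price"))

    # The same logic written directly with built-in expressions, i.e. what
    # users could already do today without a UDF.
    df.withColumn("with_tax", F.col("price") * 1.1)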

Once that is clarified, we may have follow-up questions/opinions depending
on the answer, something along the lines of:

1. It might be the case that we just want this proposal to aim directly at
the "future success": translating Python UDFs to Java code (codegen) to
cover arbitrary logic (as long as the logic does not involve Python
libraries, for which we would have to find alternatives).

2. We might want to make sure this proposal addresses major use cases and
not just niche ones. E.g., if it turns out that the majority of Python UDF
usage is to pull in other Python dependencies, then we would miss the
majority of cases (see the sketch after this list).
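
For example, here is another hypothetical sketch (the library, column
names, and DataFrame are my own illustration): a UDF whose whole point is
to call into a third-party Python library has no obvious built-in
expression equivalent, so it would fall outside the "simple" scope however
we define it:

    import phonenumbers  # third-party Python dependency; hypothetical example
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    @F.udf(returnType=StringType())
    def normalize_phone(raw):
        # The logic only exists in the Python ecosystem; there is no built-in
        # Catalyst expression to translate this to.
        parsed = phonenumbers.parse(raw, "US")
        return phonenumbers.format_number(
            parsed, phonenumbers.PhoneNumberFormat.E164)

    # Hypothetical usage, assuming a DataFrame `df` with a "raw_phone" column:
    # df.withColumn("phone_e164", normalize_phone("raw_phone"))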

I hope I understand the proposal correctly and am asking valid questions.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]> wrote:

> Hi Folks,
>
> It's been a few years since we last looked at transpilation, and with the
> growth of Pandas on Spark, I think it's time we revisit it. I've got a JIRA
> filed <https://issues.apache.org/jira/browse/SPARK-54783>, some rough
> proof of concept code <https://github.com/apache/spark/pull/53547> (I
> think doing the transpilation on the Python side instead of the Scala side
> makes more sense, but it was interesting to play with), and of course
> everyone's favourite: a design doc.
> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>
>  (I
> also have a collection of YouTube streams playing with the idea
> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to follow
> along on that journey).
>
> Wishing everyone happy holidays :)
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
