Hi everyone,
I’ve been looking into improving how users of our Spark platform register
and use UDFs and I’d like to discuss a few ideas for making this easier.

The motivation for this is the use case of defining a UDF for use from Spark
SQL or PySpark. We want to make it easy to write JVM UDFs and use them from
both SQL and Python. Python UDFs work great in most cases, but we occasionally
don’t want to pay the cost of shipping data to Python and processing it
there, so we want to make it easy to register UDFs that will run in the JVM.

There is already syntax to create a function from a JVM class
<https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html>
in SQL that would work, but that option requires using the Hive UDF API
instead of Spark’s simpler Scala API. It also requires argument translation
and doesn’t support codegen. Beyond the API and performance problems, it is
annoying to have to register every function individually with a CREATE
FUNCTION statement.
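
To make the contrast concrete, here is a rough sketch of the same
string-uppercasing function written against both APIs (class and function
names are made up for illustration):

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.spark.sql.SparkSession

// Hive UDF API: one class per function, resolved by reflection on evaluate()
class HiveUpper extends UDF {
  def evaluate(s: String): String = if (s == null) null else s.toUpperCase
}

// Spark's Scala API: just a function value registered with the session
def register(spark: SparkSession): Unit =
  spark.udf.register("upper_scala", (s: String) => s.toUpperCase)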

The alternative that I’d like to propose is to add a way to register a
named group of functions using the proposed catalog plugin API.

For anyone unfamiliar with the proposed catalog plugins, the basic idea is
to load and configure plugins using a simple property-based scheme. Those
plugins expose functionality through mix-in interfaces, like TableCatalog
to create/drop/load/alter tables. Another interface could be UDFCatalog
that can load UDFs.

interface UDFCatalog extends CatalogPlugin {
  UserDefinedFunction loadUDF(String name);
}

To use this, I would create a UDFCatalog class that returns my Scala
functions as UDFs. To look up functions, we would use both the catalog name
and the function name.
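
A minimal sketch of such an implementation, in Scala (the function names and
the map-based lookup are just for illustration, and I’m glossing over whatever
initialization CatalogPlugin ends up requiring):

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical catalog that exposes a fixed set of Scala functions by name
class MyUDFCatalog extends UDFCatalog {
  private val udfs: Map[String, UserDefinedFunction] = Map(
    "upper_scala" -> udf((s: String) => s.toUpperCase),
    "add_one" -> udf((i: Int) => i + 1)
  )

  override def loadUDF(name: String): UserDefinedFunction =
    udfs.getOrElse(name, throw new NoSuchElementException(s"UDF not found: $name"))
}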

This would allow my users to write Scala UDF instances, package them using
a UDFCatalog class (provided by me), and easily use them in Spark with a
few configuration options to add the catalog to their environment.
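
For example, assuming the property-based scheme from the catalog plugin
proposal, the setup might look something like this (the property name,
catalog name, and class name are illustrative only):

import org.apache.spark.sql.SparkSession

// Register the catalog under the name "udfs" via configuration
val spark = SparkSession.builder()
  .config("spark.sql.catalog.udfs", "com.example.MyUDFCatalog")
  .getOrCreate()

// Functions would then be resolved by catalog name and function name
spark.sql("SELECT udfs.upper_scala(name) FROM people")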

This would also allow me to expose UDF libraries easily in my
configuration, like Brickhouse
<https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943>,
without users needing to make sure the JAR is loaded or to register
individual functions.

Any thoughts on this high-level approach? I know that this ignores things
like creating and storing functions in a FunctionCatalog, and we’d have to
solve challenges with function naming (whether there is a db component).
Right now I’d like to think through the overall idea and not get too
focused on those details.

Thanks,

rb
-- 
Ryan Blue
Software Engineer
Netflix
