I don’t think it is realistic to support codegen for UDFs. It’s hooked deep into the internals.
On Fri, Dec 14, 2018 at 6:52 PM Matt Cheah <mch...@palantir.com> wrote:

> How would this work with:
>
> 1. Codegen – how does one generate code given a user’s UDF? Would the user be able to specify the code that is generated to represent their function? In practice that’s pretty hard to get right.
> 2. Row serialization and representation – will the UDF receive Catalyst rows with optimized internal representations, or will Spark have to convert to something more easily consumed by a UDF?
>
> Otherwise, +1 for trying to get this to work without Hive. I think even having something without codegen and optimized row formats is worthwhile, if only because it’s easier to use than Hive UDFs.
>
> -Matt Cheah
>
> *From:* Reynold Xin <r...@databricks.com>
> *Date:* Friday, December 14, 2018 at 1:49 PM
> *To:* "rb...@netflix.com" <rb...@netflix.com>
> *Cc:* Spark Dev List <dev@spark.apache.org>
> *Subject:* Re: [DISCUSS] Function plugins
>
> Having a way to register UDFs that are not using Hive APIs would be great!
>
> On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Hi everyone,
>
> I’ve been looking into improving how users of our Spark platform register and use UDFs, and I’d like to discuss a few ideas for making this easier.
>
> The motivation for this is the use case of defining a UDF from Spark SQL or PySpark. We want to make it easy to write JVM UDFs and use them from both SQL and Python. Python UDFs work great in most cases, but we occasionally don’t want to pay the cost of shipping data to Python and processing it there, so we want to make it easy to register UDFs that will run in the JVM.
>
> There is already syntax in SQL to create a function from a JVM class (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html) that would work, but this option requires using the Hive UDF API instead of Spark’s simpler Scala API. It also requires argument translation and doesn’t support codegen. Beyond the problems of the API and performance, it is annoying to have to register every function individually with a CREATE FUNCTION statement.
>
> The alternative that I’d like to propose is to add a way to register a named group of functions using the proposed catalog plugin API.
>
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is to load and configure plugins using a simple property-based scheme. Those plugins expose functionality through mix-in interfaces, like TableCatalog to create/drop/load/alter tables. Another interface could be UDFCatalog, which can load UDFs:
>
>   interface UDFCatalog extends CatalogPlugin {
>     UserDefinedFunction loadUDF(String name);
>   }
>
> To use this, I would create a UDFCatalog class that returns my Scala functions as UDFs. To look up functions, we would use both the catalog name and the function name.
>
> This would allow my users to write Scala UDF instances, package them using a UDFCatalog class (provided by me), and easily use them in Spark with a few configuration options to add the catalog in their environment.
> This would also allow me to expose UDF libraries easily in my configuration, like Brickhouse (https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943), without users needing to ensure the Jar is loaded and register individual functions.
>
> Any thoughts on this high-level approach? I know that this ignores things like creating and storing functions in a FunctionCatalog, and we’d have to solve challenges with function naming (whether there is a db component). Right now I’d like to think through the overall idea and not get too focused on those details.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
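
To make the proposal concrete, here is a minimal sketch in Java of what an implementation of the proposed UDFCatalog interface might look like. UDFCatalog and CatalogPlugin are the proposed (not yet merged) interfaces from this thread; the class name MyUDFCatalog and the function str_len are hypothetical, and CatalogPlugin’s own methods (e.g. initialization) are omitted. Only Spark’s public UserDefinedFunction API is used, with no Hive APIs involved.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.expressions.UserDefinedFunction;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.udf;

    // Hypothetical implementation of the proposed UDFCatalog interface.
    // CatalogPlugin methods (name, initialize, ...) are omitted for brevity.
    public class MyUDFCatalog implements UDFCatalog {
      private final Map<String, UserDefinedFunction> udfs = new HashMap<>();

      public MyUDFCatalog() {
        // functions.udf(...) wraps a plain JVM function in Spark's
        // UserDefinedFunction; the cast picks the UDF1 overload.
        udfs.put("str_len", udf(
            (UDF1<String, Integer>) s -> s == null ? null : s.length(),
            DataTypes.IntegerType));
      }

      @Override
      public UserDefinedFunction loadUDF(String name) {
        UserDefinedFunction f = udfs.get(name);
        if (f == null) {
          throw new IllegalArgumentException("Unknown function: " + name);
        }
        return f;
      }
    }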
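And a sketch of how a user might wire it up, assuming the property-based configuration scheme described in the catalog plugin proposal. The property name spark.sql.catalog.udfs and the catalog-qualified SQL call are assumptions, since the thread explicitly leaves function naming and resolution open:

    import org.apache.spark.sql.SparkSession;

    // Hypothetical configuration: register the catalog implementation by
    // property, per the proposed property-based plugin scheme.
    SparkSession spark = SparkSession.builder()
        .config("spark.sql.catalog.udfs", "com.example.MyUDFCatalog")
        .getOrCreate();

    // How SQL would resolve a catalog-qualified function name is one of the
    // open questions above; something like this seems plausible:
    //   SELECT udfs.str_len(title) FROM events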