I don’t think it is realistic to support codegen for UDFs. It’s hooked deep into the internals.
On Fri, Dec 14, 2018 at 6:52 PM Matt Cheah <mch...@palantir.com> wrote:

> How would this work with:
>
> 1. Codegen – how does one generate code given a user’s UDF? Would the user be able to specify the code that is generated to represent their function? In practice that’s pretty hard to get right.
> 2. Row serialization and representation – will the UDF receive Catalyst rows with optimized internal representations, or will Spark have to convert to something more easily consumed by a UDF?
>
> Otherwise, +1 for trying to get this to work without Hive. I think even having something without codegen and optimized row formats is worthwhile, if only because it’s easier to use than Hive UDFs.
>
> -Matt Cheah
>
> *From:* Reynold Xin <r...@databricks.com>
> *Date:* Friday, December 14, 2018 at 1:49 PM
> *To:* "rb...@netflix.com" <rb...@netflix.com>
> *Cc:* Spark Dev List <dev@spark.apache.org>
> *Subject:* Re: [DISCUSS] Function plugins
>
> Having a way to register UDFs that are not using Hive APIs would be great!
>
> On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Hi everyone,
>
> I’ve been looking into improving how users of our Spark platform register and use UDFs, and I’d like to discuss a few ideas for making this easier.
>
> The motivation for this is the use case of defining a UDF from Spark SQL or PySpark. We want to make it easy to write JVM UDFs and use them from both SQL and Python. Python UDFs work great in most cases, but we occasionally don’t want to pay the cost of shipping data to Python and processing it there, so we want to make it easy to register UDFs that will run in the JVM.
>
> There is already syntax in SQL to create a function from a JVM class (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html) that would work, but this option requires using the Hive UDF API instead of Spark’s simpler Scala API. It also requires argument translation and doesn’t support codegen. Beyond the problems of the API and performance, it is annoying to have to register every function individually with a CREATE FUNCTION statement.
>
> The alternative that I’d like to propose is to add a way to register a named group of functions using the proposed catalog plugin API.
>
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is to load and configure plugins using a simple property-based scheme. Those plugins expose functionality through mix-in interfaces, like TableCatalog to create/drop/load/alter tables. Another interface could be UDFCatalog, which can load UDFs:
>
>   interface UDFCatalog extends CatalogPlugin {
>     UserDefinedFunction loadUDF(String name);
>   }
>
> To use this, I would create a UDFCatalog class that returns my Scala functions as UDFs. To look up functions, we would use both the catalog name and the function name.
>
> This would allow my users to write Scala UDF instances, package them using a UDFCatalog class (provided by me), and easily use them in Spark with a few configuration options to add the catalog in their environment.
> This would also allow me to expose UDF libraries easily in my configuration, like Brickhouse (https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943), without users needing to ensure the Jar is loaded and register individual functions.
>
> Any thoughts on this high-level approach? I know that this ignores things like creating and storing functions in a FunctionCatalog, and we’d have to solve challenges with function naming (whether there is a db component). Right now I’d like to think through the overall idea and not get too focused on those details.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
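
To make the proposal concrete, here is a minimal sketch in Java of what an implementation of the proposed UDFCatalog interface might look like. UDFCatalog and CatalogPlugin are the proposed (not yet merged) interfaces from this thread; the class name MyUDFCatalog and the function str_len are hypothetical, and CatalogPlugin’s own methods (e.g. initialization) are omitted. Only Spark’s public UserDefinedFunction API is used, with no Hive APIs involved.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.expressions.UserDefinedFunction;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.udf;

    // Hypothetical implementation of the proposed UDFCatalog interface.
    // CatalogPlugin methods (name, initialize, ...) are omitted for brevity.
    public class MyUDFCatalog implements UDFCatalog {
      private final Map<String, UserDefinedFunction> udfs = new HashMap<>();

      public MyUDFCatalog() {
        // functions.udf(...) wraps a plain JVM function in Spark's
        // UserDefinedFunction; the cast picks the UDF1 overload.
        udfs.put("str_len", udf(
            (UDF1<String, Integer>) s -> s == null ? null : s.length(),
            DataTypes.IntegerType));
      }

      @Override
      public UserDefinedFunction loadUDF(String name) {
        UserDefinedFunction f = udfs.get(name);
        if (f == null) {
          throw new IllegalArgumentException("Unknown function: " + name);
        }
        return f;
      }
    }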
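And a sketch of how a user might wire it up, assuming the property-based configuration scheme described in the catalog plugin proposal. The property name spark.sql.catalog.udfs and the catalog-qualified SQL call are assumptions, since the thread explicitly leaves function naming and resolution open:

    import org.apache.spark.sql.SparkSession;

    // Hypothetical configuration: register the catalog implementation by
    // property, per the proposed property-based plugin scheme.
    SparkSession spark = SparkSession.builder()
        .config("spark.sql.catalog.udfs", "com.example.MyUDFCatalog")
        .getOrCreate();

    // How SQL would resolve a catalog-qualified function name is one of the
    // open questions above; something like this seems plausible:
    //   SELECT udfs.str_len(title) FROM events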