Re: [DISCUSS] Function plugins

2018-12-14 Thread Reynold Xin
I don’t think it is realistic to support codegen for UDFs. It’s hooked deep
into the internals.

On Fri, Dec 14, 2018 at 6:52 PM Matt Cheah  wrote:

> How would this work with:
>
>1. Codegen – how does one generate code given a user’s UDF? Would the
>user be able to specify the code that is generated that represents their
>function? In practice that’s pretty hard to get right.
>2. Row serialization and representation – Will the UDF receive
>catalyst rows with optimized internal representations, or will Spark have
>to convert to something more easily consumed by a UDF?
>
>
>
> Otherwise +1 for trying to get this to work without Hive. I think even
> having something without codegen and optimized row formats is worthwhile if
> only because it’s easier to use than Hive UDFs.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Reynold Xin 
> *Date: *Friday, December 14, 2018 at 1:49 PM
> *To: *"rb...@netflix.com" 
> *Cc: *Spark Dev List 
> *Subject: *Re: [DISCUSS] Function plugins
>
>
>
>
> Having a way to register UDFs that are not using Hive APIs would be great!
>
>
> On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue 
> wrote:
>
> Hi everyone,
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs and I’d like to discuss a few ideas for making this easier.
>
> The motivation for this is the use case of defining a UDF from SparkSQL or
> PySpark. We want to make it easy to write JVM UDFs and use them from both
> SQL and Python. Python UDFs work great in most cases, but we occasionally
> don’t want to pay the cost of shipping data to python and processing it
> there so we want to make it easy to register UDFs that will run in the JVM.
>
> There is already syntax to create a function from a JVM class
> (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
> in SQL that would work, but this option requires using the Hive UDF API
> instead of Spark’s simpler Scala API. It also requires argument translation
> and doesn’t support codegen. Beyond the problem of the API and performance,
> it is annoying to require registering every function individually with a 
> CREATE
> FUNCTION statement.
>
> The alternative that I’d like to propose is to add a way to register a
> named group of functions using the proposed catalog plugin API.
>
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is
> to load and configure plugins using a simple property-based scheme. Those
> plugins expose functionality through mix-in interfaces, like TableCatalog
> to create/drop/load/alter tables. Another interface could be UDFCatalog
> that can load UDFs.
>
> interface UDFCatalog extends CatalogPlugin {
>
>   UserDefinedFunction loadUDF(String name)
>
> }
>
> To use this, I would create a UDFCatalog class that returns my Scala
> functions as UDFs. To look up functions, we would use both the catalog name
> and the function name.
>
> This would allow my users to write Scala UDF instances, package them using
> a UDFCatalog class (provided by me), and easily use them in Spark with a
> few configuration options to add the catalog in their environment.
>
> This would also allow me to expose UDF libraries easily in my
> configuration, like brickhouse
> (https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943),
> without users needing to ensure the Jar is loaded and register individual
> functions.
>
> Any thoughts on this high-level approach? I know that this ignores things
> like creating and storing functions in a FunctionCatalog, and we’d have
> to solve challenges with function naming (whether there is a db component).
> Right now I’d like to think through the overall idea and not get too
> focused on those details.
>
> Thanks,
>
> rb
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>


Re: [DISCUSS] Function plugins

2018-12-14 Thread Matt Cheah
How would this work with:
1. Codegen – how does one generate code given a user’s UDF? Would the user be
   able to specify the code that is generated that represents their function? In
   practice that’s pretty hard to get right.
2. Row serialization and representation – Will the UDF receive catalyst rows with
   optimized internal representations, or will Spark have to convert to something
   more easily consumed by a UDF?
 

Otherwise +1 for trying to get this to work without Hive. I think even having 
something without codegen and optimized row formats is worthwhile if only 
because it’s easier to use than Hive UDFs.

 

-Matt Cheah

 

From: Reynold Xin 
Date: Friday, December 14, 2018 at 1:49 PM
To: "rb...@netflix.com" 
Cc: Spark Dev List 
Subject: Re: [DISCUSS] Function plugins

 

Having a way to register UDFs that are not using Hive APIs would be great!
On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue  wrote:

Hi everyone,
I’ve been looking into improving how users of our Spark platform register and 
use UDFs and I’d like to discuss a few ideas for making this easier.

The motivation for this is the use case of defining a UDF from SparkSQL or 
PySpark. We want to make it easy to write JVM UDFs and use them from both SQL 
and Python. Python UDFs work great in most cases, but we occasionally don’t 
want to pay the cost of shipping data to python and processing it there so we 
want to make it easy to register UDFs that will run in the JVM.

There is already syntax to create a function from a JVM class
(https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
in SQL that would work, but this option requires using
the Hive UDF API instead of Spark’s simpler Scala API. It also requires 
argument translation and doesn’t support codegen. Beyond the problem of the API 
and performance, it is annoying to require registering every function 
individually with a CREATE FUNCTION statement.

The alternative that I’d like to propose is to add a way to register a named 
group of functions using the proposed catalog plugin API.

For anyone unfamiliar with the proposed catalog plugins, the basic idea is to 
load and configure plugins using a simple property-based scheme. Those plugins 
expose functionality through mix-in interfaces, like TableCatalog to 
create/drop/load/alter tables. Another interface could be UDFCatalog that can 
load UDFs.
interface UDFCatalog extends CatalogPlugin {
  UserDefinedFunction loadUDF(String name)
}
To use this, I would create a UDFCatalog class that returns my Scala functions 
as UDFs. To look up functions, we would use both the catalog name and the 
function name.

This would allow my users to write Scala UDF instances, package them using a 
UDFCatalog class (provided by me), and easily use them in Spark with a few 
configuration options to add the catalog in their environment.

This would also allow me to expose UDF libraries easily in my configuration, 
like brickhouse (https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943), without users needing to ensure the
Jar is loaded and register individual functions.

Any thoughts on this high-level approach? I know that this ignores things like 
creating and storing functions in a FunctionCatalog, and we’d have to solve 
challenges with function naming (whether there is a db component). Right now 
I’d like to think through the overall idea and not get too focused on those 
details.

Thanks,

rb

-- 

Ryan Blue 

Software Engineer

Netflix

 





Re: [DISCUSS] Function plugins

2018-12-14 Thread Reynold Xin
Having a way to register UDFs that are not using Hive APIs would be great!

On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> 
> 
> Hi everyone,
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs and I’d like to discuss a few ideas for making this easier.
> 
> 
> 
> The motivation for this is the use case of defining a UDF from SparkSQL or
> PySpark. We want to make it easy to write JVM UDFs and use them from both
> SQL and Python. Python UDFs work great in most cases, but we occasionally
> don’t want to pay the cost of shipping data to python and processing it
> there so we want to make it easy to register UDFs that will run in the
> JVM.
> 
> 
> 
> There is already syntax to create a function from a JVM class (
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
> ) in SQL that would work, but this option requires using the Hive UDF API
> instead of Spark’s simpler Scala API. It also requires argument
> translation and doesn’t support codegen. Beyond the problem of the API and
> performance, it is annoying to require registering every function
> individually with a CREATE FUNCTION statement.
> 
> 
> 
> The alternative that I’d like to propose is to add a way to register a
> named group of functions using the proposed catalog plugin API.
> 
> 
> 
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is
> to load and configure plugins using a simple property-based scheme. Those
> plugins expose functionality through mix-in interfaces, like TableCatalog to
> create/drop/load/alter tables. Another interface could be UDFCatalog that
> can load UDFs.
> 
> interface UDFCatalog extends CatalogPlugin {
>   UserDefinedFunction loadUDF(String name)
> }
> 
> To use this, I would create a UDFCatalog class that returns my Scala
> functions as UDFs. To look up functions, we would use both the catalog
> name and the function name.
> 
> 
> 
> This would allow my users to write Scala UDF instances, package them using
> a UDFCatalog class (provided by me), and easily use them in Spark with a
> few configuration options to add the catalog in their environment.
> 
> 
> 
> This would also allow me to expose UDF libraries easily in my
> configuration, like brickhouse (
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943
> ) , without users needing to ensure the Jar is loaded and register
> individual functions.
> 
> 
> 
> Any thoughts on this high-level approach? I know that this ignores things
> like creating and storing functions in a FunctionCatalog , and we’d have to
> solve challenges with function naming (whether there is a db component).
> Right now I’d like to think through the overall idea and not get too
> focused on those details.
> 
> 
> 
> Thanks,
> 
> 
> 
> rb
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

[DISCUSS] Function plugins

2018-12-14 Thread Ryan Blue
Hi everyone,
I’ve been looking into improving how users of our Spark platform register
and use UDFs and I’d like to discuss a few ideas for making this easier.

The motivation for this is the use case of defining a UDF from SparkSQL or
PySpark. We want to make it easy to write JVM UDFs and use them from both
SQL and Python. Python UDFs work great in most cases, but we occasionally
don’t want to pay the cost of shipping data to python and processing it
there so we want to make it easy to register UDFs that will run in the JVM.

There is already syntax to create a function from a JVM class
(https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
in SQL that would work, but this option requires using the Hive UDF API
instead of Spark’s simpler Scala API. It also requires argument translation
and doesn’t support codegen. Beyond the problem of the API and performance,
it is annoying to require registering every function individually with a CREATE
FUNCTION statement.

The alternative that I’d like to propose is to add a way to register a
named group of functions using the proposed catalog plugin API.

For anyone unfamiliar with the proposed catalog plugins, the basic idea is
to load and configure plugins using a simple property-based scheme. Those
plugins expose functionality through mix-in interfaces, like TableCatalog
to create/drop/load/alter tables. Another interface could be UDFCatalog
that can load UDFs.

interface UDFCatalog extends CatalogPlugin {
  UserDefinedFunction loadUDF(String name)
}

To use this, I would create a UDFCatalog class that returns my Scala
functions as UDFs. To look up functions, we would use both the catalog name
and the function name.
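
For illustration only, here is a rough sketch of what such a catalog might look
like. The UDFCatalog and CatalogPlugin interfaces above are only proposed, and
the class and function names below are hypothetical; only
org.apache.spark.sql.functions.udf and UserDefinedFunction are existing Spark APIs.

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical sketch against the proposed interfaces; plugin initialization
// details (e.g. a CatalogPlugin.initialize method) are deliberately omitted.
class MyUDFCatalog extends UDFCatalog {
  // Plain Scala functions, wrapped as Spark UDFs once, up front.
  private val udfs: Map[String, UserDefinedFunction] = Map(
    "upper_trim" -> udf((s: String) => if (s == null) null else s.trim.toUpperCase),
    "add_one"    -> udf((i: Int) => i + 1)
  )

  override def loadUDF(name: String): UserDefinedFunction =
    udfs.getOrElse(name.toLowerCase,
      throw new NoSuchElementException(s"Unknown function: $name"))
}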

This would allow my users to write Scala UDF instances, package them using
a UDFCatalog class (provided by me), and easily use them in Spark with a
few configuration options to add the catalog in their environment.
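
As a sketch of that wiring: the spark.sql.catalog.* property and the
catalog-qualified function name below follow the property-based scheme in the
catalog plugin proposal and are illustrative, not something Spark supports today.

import org.apache.spark.sql.SparkSession

// Register the hypothetical catalog by name via configuration, then refer to
// its functions by catalog name plus function name.
val spark = SparkSession.builder()
  .appName("udf-catalog-example")
  .config("spark.sql.catalog.team_udfs", "com.example.MyUDFCatalog")
  .getOrCreate()

spark.sql("SELECT team_udfs.upper_trim(name) FROM people")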

This would also allow me to expose UDF libraries easily in my
configuration, like brickhouse
(https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943),
without users needing to ensure the Jar is loaded and register individual
functions.

Any thoughts on this high-level approach? I know that this ignores things
like creating and storing functions in a FunctionCatalog, and we’d have to
solve challenges with function naming (whether there is a db component).
Right now I’d like to think through the overall idea and not get too
focused on those details.

Thanks,

rb
-- 
Ryan Blue
Software Engineer
Netflix


Re: removing most of the config functions in SQLConf?

2018-12-14 Thread Darcy Shen




I agree with the CatalystConf idea.

On Fri, 14 Dec 2018 18:40:26 +0800, Wenchen Fan wrote:

> IIRC, the reason we did it is: `SQLConf` was in SQL core module. So we need
> to create methods in `CatalystConf`, and `SQLConf` implements `CatalystConf`.
>
> Now the problem has gone: we moved `SQLConf` to catalyst module. I think we
> can remove these methods.
>
> On Fri, Dec 14, 2018 at 3:45 PM Reynold Xin wrote:
>
> > In SQLConf, for each config option, we declare them in two places:
> >
> > First in the SQLConf object, e.g.:
> >
> > val CSV_PARSER_COLUMN_PRUNING = buildConf("spark.sql.csv.parser.columnPruning.enabled")
> >   .internal()
> >   .doc("If it is set to true, column names of the requested schema are passed to CSV parser. " +
> >     "Other column values can be ignored during parsing even if they are malformed.")
> >   .booleanConf
> >   .createWithDefault(true)
> >
> > Second in SQLConf class:
> >
> > def csvColumnPruning: Boolean = getConf(SQLConf.CSV_PARSER_COLUMN_PRUNING)
> >
> > As the person that introduced both, I'm now thinking we should remove
> > almost all of the latter, unless it is used more than 5 times. The vast
> > majority of config options are read only in one place, so the functions are
> > pretty redundant ...


Re: removing most of the config functions in SQLConf?

2018-12-14 Thread Wenchen Fan
IIRC, the reason we did it is: `SQLConf` was in SQL core module. So we need
to create methods in `CatalystConf`, and `SQLConf` implements
`CatalystConf`.

Now the problem has gone: we moved `SQLConf` to catalyst module. I think we
can remove these methods.

On Fri, Dec 14, 2018 at 3:45 PM Reynold Xin  wrote:

> In SQLConf, for each config option, we declare them in two places:
>
> First in the SQLConf object, e.g.:
>
> val CSV_PARSER_COLUMN_PRUNING = buildConf("spark.sql.csv.parser.columnPruning.enabled")
>   .internal()
>   .doc("If it is set to true, column names of the requested schema are passed to CSV parser. " +
>     "Other column values can be ignored during parsing even if they are malformed.")
>   .booleanConf
>   .createWithDefault(true)
>
>
> Second in SQLConf class:
>
> def csvColumnPruning: Boolean = getConf(SQLConf.CSV_PARSER_COLUMN_PRUNING)
>
>
>
> As the person that introduced both, I'm now thinking we should remove
> almost all of the latter, unless it is used more than 5 times. The vast
> majority of config options are read only in one place, so the functions are
> pretty redundant ...
>
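
To make the shape of the proposed change concrete, here is a sketch of a call
site reading the config entry directly instead of going through the accessor.
It assumes only the existing SQLConf.getConf(ConfigEntry) API in
org.apache.spark.sql.internal.SQLConf; removing the accessors is still just a
proposal, and columnPruningEnabled is a made-up helper for illustration.

import org.apache.spark.sql.internal.SQLConf

// Today a call site typically goes through the accessor on the SQLConf class:
//   val pruning = conf.csvColumnPruning
// With the accessor removed, the (usually single) call site reads the entry directly:
def columnPruningEnabled(conf: SQLConf): Boolean =
  conf.getConf(SQLConf.CSV_PARSER_COLUMN_PRUNING)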