I assume you use Scala to implement your UDFs.

In that case, the Scala language itself already gives you some options.


If you want more control over the logic that runs when a UDF is initialized, you can
define a Scala object and define your UDF as part of it; a Scala object behaves like
the Singleton pattern for you.


So the Scala object's constructor logic can be treated like the init/configure
contract in Hive: it runs once per JVM to initialize your Scala object. That should
meet your requirement.
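
For example, a minimal sketch you could paste into spark-shell (the object name,
method name and lookup data are all made up here):

import org.apache.spark.sql.SparkSession

// The object body runs once per JVM (on the driver and on each executor, at
// first use), so it plays roughly the role of Hive's initialize().
object LookupTable {
  println("Initializing lookup table")                    // one-time init logic
  private val table: Map[String, Int] = Map("a" -> 1, "b" -> 2)
  def lookup(key: String): Int = table.getOrElse(key, -1)
}

val spark = SparkSession.builder().getOrCreate()
spark.udf.register("lookup", (key: String) => LookupTable.lookup(key))
// usage: spark.sql("SELECT lookup('a')").show()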


The only tricky part is the context argument of the configure() method, which lets
you pass some configuration to your UDF dynamically at runtime. Since a Scala object
is fixed at compile time, you cannot pass any parameters to its constructor. But
nothing stops you from building a Scala class plus companion object that accepts
parameters at construction/init time, which can control your UDF's behavior.
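
A rough sketch of that idea (the class name, the config key spark.myapp.threshold
and the threshold logic are all invented for illustration):

import org.apache.spark.sql.SparkSession

// The class constructor takes the "configuration" that Hive would hand to
// configure(); it must be Serializable because it is captured by the UDF closure.
class Scorer(threshold: Int) extends Serializable {
  def score(x: Int): Boolean = x >= threshold
}

object Scorer {
  // Companion factory: read the dynamic configuration once, on the driver.
  def fromConf(spark: SparkSession): Scorer =
    new Scorer(spark.conf.get("spark.myapp.threshold", "10").toInt)
}

val spark = SparkSession.builder().getOrCreate()
val scorer = Scorer.fromConf(spark)                  // built with runtime configuration
spark.udf.register("score", (x: Int) => scorer.score(x))
// usage: spark.sql("SELECT score(42)").show()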


If you have a concrete example of something you cannot do with a Spark Scala UDF,
you can post it here.


Yong


________________________________
From: RD <rdsr...@gmail.com>
Sent: Friday, June 16, 2017 11:37 AM
To: Georg Heiler
Cc: user@spark.apache.org
Subject: Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

Thanks Georg. But I'm not sure how mapPartitions is relevant here.  Can you 
elaborate?



On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler
<georg.kf.hei...@gmail.com> wrote:
What about using map partitions instead?
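
Presumably the idea is to do per-partition setup inside mapPartitions, along the
lines of the sketch below (this is my reading of the suggestion; ExpensiveResource
is an invented stand-in for a heavyweight dependency):

import org.apache.spark.sql.SparkSession

// Invented stand-in for something costly to construct (client, model, parser, ...).
class ExpensiveResource(prefix: String) {
  def transform(x: Long): String = s"$prefix-$x"
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val out = spark.range(100).as[Long].mapPartitions { rows =>
  // This block runs once per partition, so the construction below plays the
  // role that initialize()/configure() play for a Hive GenericUDF.
  val resource = new ExpensiveResource("row")
  rows.map(id => resource.transform(id))
}
out.show()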

RD <rdsr...@gmail.com> wrote on Thu, 15 June 2017
at 06:52:
Hi Spark folks,

    Is there any plan to support the richer UDF API that Hive offers for Spark
UDFs? Hive supports the GenericUDF API, which has, among other methods,
initialize() and configure() (called once on the cluster), which a lot of our
users rely on. We now have a lot of UDFs in Hive that make use of these methods.
We plan to move these UDFs to Spark UDFs but are limited by not having similar
lifecycle methods.
   Are there plans to address these? Or do people usually adopt some sort of 
workaround?
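
For context, the Hive-side lifecycle in question looks roughly like this (a Scala
sketch against Hive's GenericUDF API; the UDF name and the configuration key
my.udf.threshold are made up):

import org.apache.hadoop.hive.ql.exec.MapredContext
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class TagWithThreshold extends GenericUDF {
  private var threshold: Int = 0

  // Called with the job context, giving access to the job configuration.
  override def configure(context: MapredContext): Unit = {
    threshold = context.getJobConf.getInt("my.udf.threshold", 0)
  }

  // Called once with the argument inspectors; returns the output inspector.
  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector =
    PrimitiveObjectInspectorFactory.javaStringObjectInspector

  // Called per row.
  override def evaluate(arguments: Array[DeferredObject]): AnyRef =
    s"threshold=$threshold"

  override def getDisplayString(children: Array[String]): String =
    "tag_with_threshold"
}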

   If we directly use the Hive UDFs in Spark we pay a performance penalty. I
think Spark does a conversion from InternalRow to Row and back to InternalRow for
native Spark UDFs, and for Hive UDFs it does InternalRow to Hive object and back
to InternalRow, but somehow the conversion for native UDFs is more performant.

-Best,
R.
