I assume you want to have this lifecycle in order to create big/heavy/complex objects only once (per partition); mapPartitions should fit this use case pretty well.
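To make that concrete, here is a minimal sketch (not code from this thread; HeavyModel and the toy dataset are made-up placeholders) of building the expensive object once per partition and reusing it for every row:

import org.apache.spark.sql.SparkSession

// Stand-in for an expensive-to-construct resource (model, dictionary, service client, ...).
class HeavyModel extends Serializable {
  def score(s: String): Int = s.length
}

object MapPartitionsInit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-partitions-init").getOrCreate()
    import spark.implicits._

    val ds = Seq("a", "bb", "ccc").toDS()

    // Setup runs once per partition (roughly what initialize()/configure()
    // give you in a Hive GenericUDF), then the same instance is reused for every row.
    val scored = ds.mapPartitions { rows =>
      val model = new HeavyModel()          // built once per partition
      rows.map(r => (r, model.score(r)))    // reused for each row
    }

    scored.show()
    spark.stop()
  }
}

A plain Spark UDF only gives you a per-row function with no setup hook, which is why the setup has to move into the mapPartitions closure.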
RD <rdsr...@gmail.com> wrote on Fri, 16 Jun 2017 at 17:37:

> Thanks Georg. But I'm not sure how mapPartitions is relevant here. Can
> you elaborate?
>
> On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler <georg.kf.hei...@gmail.com>
> wrote:
>
>> What about using mapPartitions instead?
>>
>> RD <rdsr...@gmail.com> wrote on Thu, 15 Jun 2017 at 06:52:
>>
>>> Hi Spark folks,
>>>
>>> Is there any plan to support the richer UDF API that Hive supports
>>> for Spark UDFs? Hive supports the GenericUDF API which has, among other
>>> things, methods like initialize() and configure() (called once on the
>>> cluster), which a lot of our users rely on. We now have a lot of UDFs in
>>> Hive that make use of these methods. We plan to move our UDFs to Spark
>>> UDFs but are limited by not having similar lifecycle methods.
>>> Are there plans to address this? Or do people usually adopt some
>>> sort of workaround?
>>>
>>> If we use the Hive UDFs directly in Spark we pay a performance
>>> penalty. I think Spark does a conversion from InternalRow to Row and back
>>> to InternalRow for native Spark UDFs, and for Hive it does InternalRow to
>>> Hive Object and back to InternalRow, but somehow the conversion for native
>>> UDFs is more performant.
>>>
>>> -Best,
>>> R.
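For reference, this is roughly the Hive GenericUDF lifecycle the quoted question refers to, as a simplified, hypothetical sketch (DictionaryLookupUDF and its dictionary are made up, and argument handling via ObjectInspectors is trimmed down):

import org.apache.hadoop.hive.ql.exec.MapredContext
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class DictionaryLookupUDF extends GenericUDF {
  @transient private var dict: Map[String, String] = _

  // Called once, before any rows are evaluated: receives the argument
  // ObjectInspectors and returns the ObjectInspector of the result.
  // Heavy setup (loading a dictionary, opening clients, ...) can live here.
  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    dict = Map("a" -> "alpha")   // placeholder for an expensive load
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  // Optional per-task hook with access to the MapReduce/Tez context.
  override def configure(context: MapredContext): Unit = {
    // e.g. read settings from context.getJobConf
  }

  // Called per row; argument extraction is simplified for brevity.
  override def evaluate(arguments: Array[DeferredObject]): AnyRef =
    dict.getOrElse(String.valueOf(arguments(0).get()), null)

  override def getDisplayString(children: Array[String]): String =
    s"dictionary_lookup(${children.mkString(", ")})"
}

initialize() and configure() run once per task rather than once per row, which is the once-per-partition setup that mapPartitions emulates on the Spark side.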