Re: Dataset and lambas

Deenar Toraskar Mon, 07 Dec 2015 12:35:33 -0800

Michael

Having VectorUnionSumUDAF implemented would be great. This is quite
generic, it does element-wise sum of arrays and maps
https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/timeseries/VectorUnionSumUDAF.java
and would be massive benefit for a lot of risk analytics.


In general most of the brickhouse UDFs are quite useful
https://github.com/klout/brickhouse. Happy to help out.

On another note what would be involved to have arrays backed by a sparse
Array (I am assuming the current implementation is dense), sort of native
support for http://spark.apache.org/docs/latest/mllib-data-types.html

Regards
Deenar



Regards
Deenar

On 7 December 2015 at 20:21, Michael Armbrust <mich...@databricks.com>
wrote:

> On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar <deenar.toras...@gmail.com
> > wrote:
>>
>> On a similar note, what is involved in getting native support for some
>> user defined functions, so that they are as efficient as native Spark SQL
>> expressions? I had one particular one - an arraySum (element wise sum) that
>> is heavily used in a lot of risk analytics.
>>
>
> To get the best performance you have to implement a catalyst expression
> with codegen.  This however is necessarily an internal (unstable) interface
> since we are constantly making breaking changes to improve performance.  So
> if its a common enough operation we should bake it into the engine.
>
> That said, the code generated encoders that we created for datasets should
> lower the cost of calling into external functions as we start using them in
> more and more places (i.e.
> https://issues.apache.org/jira/browse/SPARK-11593)
>

Re: Dataset and lambas

Reply via email to