Hi Michael,

On a similar note, what is involved in getting native support for some user-defined functions, so that they are as efficient as native Spark SQL expressions? I have one in particular, an arraySum (element-wise sum), that is heavily used in a lot of risk analytics.
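For context, this is roughly what I do today with a plain Scala UDF (df and the pnl column names below are just placeholders). As I understand it, every invocation pays the cost of converting the internal Tungsten representation into Seq[Double] and back, which is exactly the overhead I would like to avoid:

import org.apache.spark.sql.functions.{col, udf}

// Plain Scala UDF: element-wise sum of two double arrays. Each call
// deserializes the internal row values into Seq[Double] and serializes
// the result back into the internal format.
val arraySum = udf { (a: Seq[Double], b: Seq[Double]) =>
  require(a.length == b.length, "arrays must be the same length")
  a.zip(b).map { case (x, y) => x + y }
}

val withSum = df.withColumn("pnlTotal", arraySum(col("pnlA"), col("pnlB")))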
Deenar

On 5 December 2015 at 21:09, Michael Armbrust <mich...@databricks.com> wrote:

> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Hello all,
>> DataFrame internally uses a different encoding for values than what the
>> user sees. I assume the same is true for Dataset?
>
> This is true. We encode objects in the Tungsten binary format using
> code-generated serializers.
>
>> If so, does this mean that a function like Dataset.map needs to convert
>> all the values twice (once to user format and then back to internal
>> format)? Or is it perhaps possible to write Scala functions that operate
>> on internal formats and avoid this?
>
> Currently this is true, but there are plans to avoid unnecessary
> conversions (back-to-back maps / filters, etc.) and only convert when we
> need to (shuffles, sorting, hashing, SQL operations).
>
> There are also plans to allow you to directly access some of the more
> efficient internal types by using them as fields in your classes (a
> mutable UTF8String instead of the immutable java.lang.String).
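PS To check my understanding of the double conversion discussed above, here is a minimal sketch against the Spark 1.6 Dataset API; the Trade case class, the input path, and a spark-shell-style sqlContext are all assumptions on my part:

import org.apache.spark.sql.Dataset
import sqlContext.implicits._                  // supplies Encoder[Trade]

case class Trade(id: Long, notional: Double)   // hypothetical schema

val trades: Dataset[Trade] =
  sqlContext.read.parquet("/data/trades").as[Trade]

// Today each map decodes the Tungsten binary row into a Trade object,
// applies the function, then re-encodes the result: the "twice"
// conversion Koert describes. Back-to-back maps like these are what the
// planned optimisation would fuse, deferring re-encoding until a
// shuffle, sort, or SQL operation actually needs the internal format.
val doubled = trades
  .map(t => t.copy(notional = t.notional * 2))
  .map(t => t.notional)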