Hi Michael,

On a similar note, what is involved in getting native support for some user-defined functions, so that they are as efficient as native Spark SQL expressions? I have one in particular, an arraySum (element-wise sum), that is heavily used in a lot of risk analytics.
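For context, this is roughly what I do today with a plain Scala UDF (df and the pnl column names below are just placeholders). As I understand it, every invocation pays the cost of converting the internal Tungsten representation into Seq[Double] and back, which is exactly the overhead I would like to avoid:

import org.apache.spark.sql.functions.{col, udf}

// Plain Scala UDF: element-wise sum of two double arrays. Each call
// deserializes the internal row values into Seq[Double] and serializes
// the result back into the internal format.
val arraySum = udf { (a: Seq[Double], b: Seq[Double]) =>
  require(a.length == b.length, "arrays must be the same length")
  a.zip(b).map { case (x, y) => x + y }
}

val withSum = df.withColumn("pnlTotal", arraySum(col("pnlA"), col("pnlB")))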
Deenar

On 5 December 2015 at 21:09, Michael Armbrust <mich...@databricks.com> wrote:

> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Hello all,
>> DataFrame internally uses a different encoding for values than what the
>> user sees. I assume the same is true for Dataset?
>
> This is true. We encode objects in the Tungsten binary format using
> code-generated serializers.
>
>> If so, does this mean that a function like Dataset.map needs to convert
>> all the values twice (once to user format and then back to internal
>> format)? Or is it perhaps possible to write Scala functions that operate
>> on internal formats and avoid this?
>
> Currently this is true, but there are plans to avoid unnecessary
> conversions (back-to-back maps / filters, etc.) and only convert when we
> need to (shuffles, sorting, hashing, SQL operations).
>
> There are also plans to allow you to directly access some of the more
> efficient internal types by using them as fields in your classes (a
> mutable UTF8String instead of the immutable java.lang.String).
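PS To check my understanding of the double conversion discussed above, here is a minimal sketch against the Spark 1.6 Dataset API; the Trade case class, the input path, and a spark-shell-style sqlContext are all assumptions on my part:

import org.apache.spark.sql.Dataset
import sqlContext.implicits._                  // supplies Encoder[Trade]

case class Trade(id: Long, notional: Double)   // hypothetical schema

val trades: Dataset[Trade] =
  sqlContext.read.parquet("/data/trades").as[Trade]

// Today each map decodes the Tungsten binary row into a Trade object,
// applies the function, then re-encodes the result: the "twice"
// conversion Koert describes. Back-to-back maps like these are what the
// planned optimisation would fuse, deferring re-encoding until a
// shuffle, sort, or SQL operation actually needs the internal format.
val doubled = trades
  .map(t => t.copy(notional = t.notional * 2))
  .map(t => t.notional)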