On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers <ko...@tresata.com> wrote:

> hello all,
> DataFrame internally uses a different encoding for values then what the
> user sees. i assume the same is true for Dataset?
>

This is true.  We encode objects in the tungsten binary format using code
generated serializers.


> if so, does this means that a function like Dataset.map needs to convert
> all the values twice (once to user format and then back to internal
> format)? or is it perhaps possible to write scala functions that operate on
> internal formats and avoid this?
>

Currently this is true, but there are plans to avoid unnecessary
conversions (back to back maps / filters, etc) and only convert when we
need to (shuffles, sorting, hashing, SQL operations).

There are also plans to allow you to directly access some of the more
efficient internal types by using them as fields in your classes (mutable
UTF8 String instead of the immutable java.lang.String).

Reply via email to