On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers <ko...@tresata.com> wrote:
> hello all, > DataFrame internally uses a different encoding for values then what the > user sees. i assume the same is true for Dataset? > This is true. We encode objects in the tungsten binary format using code generated serializers. > if so, does this means that a function like Dataset.map needs to convert > all the values twice (once to user format and then back to internal > format)? or is it perhaps possible to write scala functions that operate on > internal formats and avoid this? > Currently this is true, but there are plans to avoid unnecessary conversions (back to back maps / filters, etc) and only convert when we need to (shuffles, sorting, hashing, SQL operations). There are also plans to allow you to directly access some of the more efficient internal types by using them as fields in your classes (mutable UTF8 String instead of the immutable java.lang.String).