great thanks

On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust <mich...@databricks.com> wrote:

> These specific JIRAs don't exist yet, but watch SPARK-9999 as we'll make
> sure everything shows up there.
>
> On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> that's good news about plans to avoid unnecessary conversions, and allow
>> access to more efficient internal types. could you point me to the jiras,
>> if they exist already? i just tried to find them but had little luck.
>> best, koert
>>
>> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> hello all,
>>>> DataFrame internally uses a different encoding for values than what the
>>>> user sees. i assume the same is true for Dataset?
>>>>
>>>
>>> This is true. We encode objects in the tungsten binary format using
>>> code generated serializers.
>>>
>>>> if so, does this mean that a function like Dataset.map needs to
>>>> convert all the values twice (once to user format and then back to internal
>>>> format)? or is it perhaps possible to write scala functions that operate on
>>>> internal formats and avoid this?
>>>>
>>>
>>> Currently this is true, but there are plans to avoid unnecessary
>>> conversions (back to back maps / filters, etc) and only convert when we
>>> need to (shuffles, sorting, hashing, SQL operations).
>>>
>>> There are also plans to allow you to directly access some of the more
>>> efficient internal types by using them as fields in your classes (mutable
>>> UTF8 String instead of the immutable java.lang.String).
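
For readers following the thread, here is a toy sketch of the double conversion being discussed, in plain Scala with no Spark dependency. `ToyEncoder`, `ToyDataset`, and the `conversions` counter are hypothetical names made up for illustration. They are not Spark's `Encoder` or `Dataset` API, and real Tungsten rows are not plain UTF-8 byte arrays; the sketch only assumes that rows are stored in an internal format and that each `map` decodes to the user type and encodes back.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for an encoder; NOT Spark's Encoder API.
trait ToyEncoder[T] {
  def toInternal(value: T): Array[Byte]
  def fromInternal(bytes: Array[Byte]): T
}

object ToyEncoder {
  // Counts every encode and decode so the conversion cost is visible.
  var conversions = 0

  // Placed in the companion object so it is found implicitly.
  implicit val stringEncoder: ToyEncoder[String] = new ToyEncoder[String] {
    def toInternal(value: String): Array[Byte] = {
      conversions += 1
      value.getBytes(UTF_8)
    }
    def fromInternal(bytes: Array[Byte]): String = {
      conversions += 1
      new String(bytes, UTF_8)
    }
  }
}

// Toy dataset that keeps rows only in the "internal" byte format.
final class ToyDataset[T](val rows: Seq[Array[Byte]])(implicit enc: ToyEncoder[T]) {
  // Each map decodes a row to the user type, applies f, and encodes the result.
  def map[U](f: T => U)(implicit out: ToyEncoder[U]): ToyDataset[U] =
    new ToyDataset[U](rows.map(r => out.toInternal(f(enc.fromInternal(r)))))

  def collect(): Seq[T] = rows.map(enc.fromInternal)
}

object ToyDataset {
  def apply[T](values: Seq[T])(implicit enc: ToyEncoder[T]): ToyDataset[T] =
    new ToyDataset[T](values.map(enc.toInternal))
}
```

A possible usage, showing that two back-to-back maps over two rows pay for a decode and an encode each time in this toy model:

```scala
val ds = ToyDataset(Seq("a", "b"))
ToyEncoder.conversions = 0                     // ignore the initial encoding
val out = ds.map(_.toUpperCase).map(_ + "!")
println(ToyEncoder.conversions)                // 8 = 2 rows * 2 maps * (decode + encode)
```

Fusing the two functions into one `map` would halve that count in the toy model, which is the kind of saving the planned "back to back maps / filters" optimization is aiming at.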