Based on this discussion I'm thinking we should deprecate the two explode functions.
On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote: > wenchen, > that definition of explode seems identical to flatMap, so you dont need it > either? > > michael, > i didn't know about the column expression version of explode, that makes > sense. i will experiment with that instead. > > On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <wenc...@databricks.com > <javascript:_e(%7B%7D,'cvml','wenc...@databricks.com');>> wrote: > >> I think we only need this version: `def explode[B : Encoder](f: A >> => TraversableOnce[B]): Dataset[B]` >> >> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be >> the best choice. >> >> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust < >> mich...@databricks.com >> <javascript:_e(%7B%7D,'cvml','mich...@databricks.com');>> wrote: >> >>> These APIs predate Datasets / encoders, so that is why they are Row >>> instead of objects. We should probably rethink that. >>> >>> Honestly, I usually end up using the column expression version of >>> explode now that it exists (i.e. explode($"arrayCol").as("Item")). It >>> would be great to understand more why you are using these instead. >>> >>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com >>> <javascript:_e(%7B%7D,'cvml','ko...@tresata.com');>> wrote: >>> >>>> we currently have 2 explode definitions in Dataset: >>>> >>>> def explode[A <: Product : TypeTag](input: Column*)(f: Row => >>>> TraversableOnce[A]): DataFrame >>>> >>>> def explode[A, B : TypeTag](inputColumn: String, outputColumn: >>>> String)(f: A => TraversableOnce[B]): DataFrame >>>> >>>> 1) the separation of the functions into their own argument lists is >>>> nice, but unfortunately scala's type inference doesn't handle this well, >>>> meaning that the generic types always have to be explicitly provided. i >>>> assume this was done to allow the "input" to be a varargs in the first >>>> method, and then kept the same in the second for reasons of symmetry. >>>> >>>> 2) i am surprised the first definition returns a DataFrame. this seems >>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no way >>>> to specify the output column names, which limits its usability for >>>> DataFrames. i frequently end up using the first definition for DataFrames >>>> anyhow because of the need to return more than 1 column (and the data has >>>> columns unknown at compile time that i need to carry along making flatMap >>>> on Dataset clumsy/unusable), but relying on the output columns being called >>>> _1 and _2 and renaming then afterwards seems like an anti-pattern. >>>> >>>> 3) using Row objects isn't very pretty. why not f: A => >>>> TraversableOnce[B] or something like that for the first definition? how >>>> about: >>>> def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: >>>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame >>>> >>>> best, >>>> koert >>>> >>> >>> >> >