Re: feedback on dataset api explode

Reynold Xin Wed, 25 May 2016 13:01:21 -0700

Based on this discussion I'm thinking we should deprecate the two explode
functions.


On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:

> wenchen,
> that definition of explode seems identical to flatMap, so you don't need it
> either?
>
> michael,
> i didn't know about the column expression version of explode, that makes
> sense. i will experiment with that instead.
>
> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <wenc...@databricks.com> wrote:
>
>> I think we only need this version:  `def explode[B : Encoder](f: A
>> => TraversableOnce[B]): Dataset[B]`
>>
>> For the untyped one, `df.select(explode($"arrayCol").as("item"))` should be
>> the best choice.
>>
>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> These APIs predate Datasets / encoders, so that is why they are Row
>>> instead of objects.  We should probably rethink that.
>>>
>>> Honestly, I usually end up using the column expression version of
>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).  It
>>> would be great to understand more why you are using these instead.
>>>
>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> we currently have 2 explode definitions in Dataset:
>>>>
>>>>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
>>>> TraversableOnce[A]): DataFrame
>>>>
>>>>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
>>>> String)(f: A => TraversableOnce[B]): DataFrame
>>>>
>>>> 1) the separation of the functions into their own argument lists is
>>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>>> meaning that the generic types always have to be explicitly provided. i
>>>> assume this was done to allow the "input" to be a varargs in the first
>>>> method, and then kept the same in the second for reasons of symmetry.
>>>>
>>>> 2) i am surprised the first definition returns a DataFrame. this seems
>>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no way
>>>> to specify the output column names, which limits its usability for
>>>> DataFrames. i frequently end up using the first definition for DataFrames
>>>> anyhow because of the need to return more than 1 column (and the data has
>>>> columns unknown at compile time that i need to carry along making flatMap
>>>> on Dataset clumsy/unusable), but relying on the output columns being called
>>>> _1 and _2 and renaming them afterwards seems like an anti-pattern.
>>>>
>>>> 3) using Row objects isn't very pretty. why not f: A =>
>>>> TraversableOnce[B] or something like that for the first definition? how
>>>> about:
>>>>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
>>>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>>
>>>> best,
>>>> koert
>>>>
>>>
>>>
>>
>
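
For readers following the thread, here is a minimal sketch of the two alternatives discussed above: the column-expression explode for untyped DataFrames, and flatMap for typed Datasets. It assumes Spark 2.x or later on the classpath. The object name, the sample data, and the column names `id`, `arrayCol` and `item` are made up for illustration and are not taken from the thread's actual workloads.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical example object; names and data are assumptions for illustration.
object ExplodeSketch {

  // Returns the exploded rows from the untyped and the typed approach.
  def run(spark: SparkSession): (Seq[(Int, String)], Seq[(Int, String)]) = {
    import spark.implicits._

    // Made-up input: an id plus an array column.
    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "arrayCol")

    // Untyped: explode as a column expression. Other columns are carried
    // along by name, and the output column name is set explicitly.
    val untyped = df.select($"id", explode($"arrayCol").as("item"))

    // Typed: flatMap on a Dataset of tuples does the same job, but every
    // column has to be part of the static element type.
    val typed = df.as[(Int, Seq[String])]
      .flatMap { case (id, items) => items.map(item => (id, item)) }

    (untyped.as[(Int, String)].collect().toSeq, typed.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-sketch")
      .getOrCreate()
    try {
      val (untyped, typed) = run(spark)
      println(untyped)
      println(typed)
    } finally {
      spark.stop()
    }
  }
}
```

The untyped variant keeps working when the DataFrame has columns unknown at compile time (select them alongside the exploded column), which is the case Koert describes as clumsy with flatMap.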
