feedback on dataset api explode

Koert Kuipers Wed, 25 May 2016 08:51:00 -0700

we currently have 2 explode definitions in Dataset:

 def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame

 def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame

1) the separation of the functions into their own argument lists is nice,
but unfortunately scala's type inference doesn't handle this well, meaning
that the generic types always have to be explicitly provided. i assume this
was done to allow the "input" to be a varargs in the first method, and then
kept the same in the second for reasons of symmetry.
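
to illustrate the inference problem in point 1 without needing spark, here is a minimal sketch. explodeLike is a hypothetical stand-in (not a spark method) that only mimics the shape of the second definition: the type parameter A appears nowhere in the first argument list, so the compiler has nothing to infer it from by the time it reaches the function.

```scala
object ExplodeInference {
  // Hypothetical stand-in, not Spark's API. As in the second explode,
  // A is only mentioned in the second argument list.
  def explodeLike[A, B](rows: Seq[Any])(f: A => Iterable[B]): Seq[B] =
    rows.flatMap(row => f(row.asInstanceOf[A]))

  val rows: Seq[Any] = Seq("a b", "c")

  // Does not compile: nothing fixes A before the lambda is type-checked.
  // explodeLike(rows)(_.split(" ").toSeq)

  // Compiles once the type arguments are written out explicitly.
  val withTypeArgs: Seq[String] =
    explodeLike[String, String](rows)(_.split(" ").toSeq)

  // Also compiles when the lambda's parameter type is annotated instead.
  val withAnnotatedParam: Seq[String] =
    explodeLike(rows)((s: String) => s.split(" ").toSeq)
}
```

either way some type has to be spelled out at the call site, which is the inconvenience described above.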

2) i am surprised the first definition returns a DataFrame. this seems to
suggest DataFrame usage (so DataFrame to DataFrame), but there is no way to
specify the output column names, which limits its usability for DataFrames.
i frequently end up using the first definition for DataFrames anyhow
because of the need to return more than 1 column (and the data has columns
unknown at compile time that i need to carry along making flatMap on
Dataset clumsy/unusable), but relying on the output columns being called _1
and _2 and renaming them afterwards seems like an anti-pattern.
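
a sketch of the pattern described in point 2, assuming spark 2.x on the classpath. the column names id, text, word and length and the object ExplodeRename are made up for illustration; the tuple result is what produces the _1 and _2 columns that then get renamed.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object ExplodeRename {
  // Illustrative only: the input data and column names are assumptions.
  def renamed(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq((1, "a b"), (2, "c")).toDF("id", "text")

    // First explode definition: the function takes a Row and returns tuples,
    // so the generated columns are named after the tuple fields (_1, _2).
    val exploded = df.explode[(String, Int)](df("text")) {
      case Row(text: String) => text.split(" ").toSeq.map(w => (w, w.length))
    }

    // The output names cannot be passed to explode, so rename afterwards.
    exploded
      .withColumnRenamed("_1", "word")
      .withColumnRenamed("_2", "length")
  }
}
```

the existing columns (id, text) are carried along, which is why this route is attractive, but the code depends on the generated names _1 and _2.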

3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B]
or something like that for the first definition? how about:
 def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame

best,
koert
