+1 to this request. I talked last week with a product group within IBM that
is struggling with the same issue. It's pretty common in data cleaning
applications for data in the early stages to have nested lists or sets
with inconsistent or incomplete schema information.

Fred

On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> I'm currently trying to create a generic transformation mechanism on a
> DataFrame to modify an arbitrary column regardless of the underlying
> schema.
>
> It's "relatively" straightforward for complex types like struct<struct<…>>
> to apply an arbitrary UDF on the column and replace the data "inside" the
> struct, however I'm struggling to make it work for complex types containing
> arrays along the way like struct<array<struct<…>>>.
>
> Michael Armbrust seemed to allude on the mailing list/forum to a way of
> using Encoders to do that, I'd be interested in any pointers, especially
> considering that it's not possible to output any Row or
> GenericRowWithSchema from a UDF (thanks to
> https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657
> it seems).
>
> To sum up, I'd like to find a way to apply a transformation to complex
> nested data types (arrays and structs) in a DataFrame, updating the value
> itself.
>
> Regards,
>
> *Olivier Girardot*
>
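
For anyone following the thread, the shape of the transformation being asked
for can be sketched outside Spark. The snippet below is only a plain-Python
model under stated assumptions: structs are represented as dicts and arrays as
lists, and transform_at is a hypothetical helper name, not a Spark API. It
shows the recursion (descend by field name through structs, map over arrays
without consuming a path segment), not how to wire it into a UDF or an Encoder.

```python
from typing import Any, Callable, Sequence


def transform_at(value: Any, path: Sequence[str], fn: Callable[[Any], Any]) -> Any:
    """Return a copy of `value` with `fn` applied at the field path `path`.

    Assumption of this sketch: structs are dicts, arrays are lists.
    """
    if value is None:
        return None  # propagate nulls unchanged
    if isinstance(value, list):
        # array<...>: apply the same remaining path to every element
        return [transform_at(item, path, fn) for item in value]
    if not path:
        return fn(value)  # reached the target value
    if isinstance(value, dict):
        head, rest = path[0], path[1:]
        if head not in value:
            return value  # tolerate incomplete schemas: leave the record alone
        updated = dict(value)  # shallow copy so the input is not mutated
        updated[head] = transform_at(value[head], rest, fn)
        return updated
    raise TypeError(f"cannot descend into {type(value).__name__} at {path[0]!r}")


# Hypothetical record shaped like struct<array<struct<...>>>
row = {"id": 1, "order": {"items": [{"sku": "a", "price": 10},
                                    {"sku": "b", "price": 20}]}}
result = transform_at(row, ["order", "items", "price"], lambda p: p * 2)
```

Because arrays are handled before the path is consumed, a path that ends on an
array applies fn to each element rather than to the list as a whole. That is a
choice made for this sketch; a real implementation might want both behaviours.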
