Is what you are looking for a withColumn that supports in-place modification of nested columns, or is it some other problem?
On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> I tried to use the RowEncoder but got stuck along the way:
> The main issue really is that, even if it's possible (however tedious) to
> pattern match generically on Row(s) and target the nested field that you
> need to modify, Rows are immutable data structures without a method like a
> case class's copy or any kind of lens to create a brand new object, so I
> ended up stuck at the step "target and extract the field to update" without
> any way to update the original Row with the new value.
>
> To sum up, I tried:
>
>    - using only the DataFrame API itself + my udf - which works for nested
>    structs as long as no arrays are along the way
>    - trying to create a udf that can apply on Row and pattern match
>    recursively the path I needed to explore/modify
>    - trying to create a UDT - but we seem to be stuck in a strange
>    middle-ground with 2.0 because some parts of the API ended up private
>    while some stayed public, making it impossible to use it now (I'd be
>    glad if I'm mistaken)
>
> All of these failed for me and I ended up converting the rows to JSON and
> updating them using JSONPath, which is... something I'd like to avoid
> 'pretty please'
>
>
>
> On Thu, Sep 15, 2016 5:20 AM, Michael Allman mich...@videoamp.com wrote:
>
>> Hi Guys,
>>
>> Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's
>> not a public API, but it is publicly accessible. I used it recently to
>> correct some bad data in a few nested columns in a dataframe. It wasn't an
>> easy job, but it made it possible. In my particular case I was not working
>> with arrays.
>>
>> Olivier, I'm interested in seeing what you come up with.
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On Sep 14, 2016, at 10:44 AM, Fred Reiss <freiss....@gmail.com> wrote:
>>
>> +1 to this request. I talked last week with a product group within IBM
>> that is struggling with the same issue. It's pretty common in data cleaning
>> applications for data in the early stages to have nested lists or sets
>> inconsistent or incomplete schema information.
>>
>> Fred
>>
>> On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> Hi everyone,
>> I'm currently trying to create a generic transformation mechanism on a
>> DataFrame to modify an arbitrary column regardless of the underlying
>> schema.
>>
>> It's "relatively" straightforward for complex types like
>> struct<struct<…>> to apply an arbitrary UDF on the column and replace the
>> data "inside" the struct, however I'm struggling to make it work for
>> complex types containing arrays along the way like struct<array<struct<…>>>.
>>
>> Michael Armbrust seemed to allude on the mailing list/forum to a way of
>> using Encoders to do that; I'd be interested in any pointers, especially
>> considering that it's not possible to output any Row or
>> GenericRowWithSchema from a UDF (thanks to
>> https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657
>> it seems).
>>
>> To sum up, I'd like to find a way to apply a transformation on complex
>> nested datatypes (arrays and structs) on a DataFrame, updating the value
>> itself.
>>
>> Regards,
>>
>> *Olivier Girardot*
>>
>>
>>
>>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>
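For readers following the thread, the step Olivier describes getting stuck on ("target and extract the field to update" with no way to write the result back) can be illustrated outside Spark. The sketch below is plain Python, not Spark code, and every name in it (`update_at`, `Record`, `Order`, `Item`) is hypothetical. It assumes structs are modelled as namedtuples and arrays as lists, and it rebuilds each immutable level on the way back up the recursion, fanning out over arrays such as struct<array<struct<…>>>. In Spark's Scala API, `Row.toSeq` and `Row.fromSeq` could plausibly play the role that `_replace` plays here, but this sketch does not address getting the rebuilt rows back into a DataFrame.

```python
from collections import namedtuple
from typing import Any, Callable, Sequence


def update_at(value: Any, path: Sequence[str], fn: Callable[[Any], Any]) -> Any:
    """Return a copy of `value` with `fn` applied at `path`; the input is never mutated."""
    if value is None:
        return None  # assumption of this sketch: nulls pass through unchanged
    if not path:
        return fn(value)  # end of the path: transform whatever sits here
    if isinstance(value, list):
        # Array on the way: the path names no index, so visit every element.
        return [update_at(item, path, fn) for item in value]
    if not hasattr(value, "_replace"):
        raise TypeError(f"cannot descend into {type(value).__name__} at {path[0]!r}")
    head, rest = path[0], path[1:]
    # Struct: rebuild this level around the updated child instead of mutating it.
    return value._replace(**{head: update_at(getattr(value, head), rest, fn)})


# Hypothetical schema: struct<order: struct<id, items: array<struct<sku, price>>>>
Item = namedtuple("Item", ["sku", "price"])
Order = namedtuple("Order", ["id", "items"])
Record = namedtuple("Record", ["order"])

record = Record(Order(1, [Item("a", 10.0), Item("b", None)]))
updated = update_at(record, ["order", "items", "price"], lambda p: p * 2)
```

The path carries field names only, so an array anywhere along it is handled by applying the remaining path to each element. Passing nulls through untouched is a choice made for this sketch; a cleaning job that needs to fill in missing values would have to call `fn` on them instead.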