Is what you are looking for a withColumn that supports in-place modification
of nested columns? Or is it some other problem?
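
As a rough illustration of what such an operation would have to do, here is a
minimal sketch in plain Python. It is not a Spark API: dicts stand in for
structs and lists for arrays, and the names update_nested and the "[]" path
marker are made up for this example. The point is only that the update is a
recursive rebuild along a path, leaving the original value untouched.

```python
from typing import Any, Callable, Sequence


def update_nested(value: Any, path: Sequence[str], fn: Callable[[Any], Any]) -> Any:
    """Return a copy of `value` with `fn` applied at `path`.

    Hypothetical helper: dicts play the role of structs, lists the role of
    arrays, and the path element "[]" means "every element of this array".
    The input is never mutated; each level on the path is rebuilt.
    """
    if not path:
        # End of the path: this is the field to transform.
        return fn(value)
    head, rest = path[0], path[1:]
    if head == "[]":
        if not isinstance(value, list):
            raise TypeError(f"expected an array, got {type(value).__name__}")
        return [update_nested(item, rest, fn) for item in value]
    if not isinstance(value, dict):
        raise TypeError(f"expected a struct, got {type(value).__name__}")
    if head not in value:
        raise KeyError(head)
    rebuilt = dict(value)  # shallow copy of this level only
    rebuilt[head] = update_nested(value[head], rest, fn)
    return rebuilt


# A struct<array<struct<...>>> shaped record, as in the original question.
record = {"id": 1, "user": {"tags": [{"name": " a "}, {"name": "b "}]}}
cleaned = update_nested(record, ["user", "tags", "[]", "name"], str.strip)
```

Siblings that are not on the path (here "id") are shared rather than copied,
which is the same trade-off a case class copy or a lens would make.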

On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> I tried to use the RowEncoder but got stuck along the way.
> The main issue is that, even though it's possible (if tedious) to
> pattern match generically on Rows and target the nested field that you need
> to modify, Rows are immutable data structures with no method like a case
> class's copy, nor any kind of lens, to create a brand new object. I ended up
> stuck at the step "target and extract the field to update", without any way
> to update the original Row with the new value.
>
> To sum up, I tried:
>
>    - using only dataframe's API itself + my udf - which works for nested
>    structs as long as no arrays are along the way
>    - trying to create a udf that can be applied to a Row and pattern match
>    recursively along the path I needed to explore/modify
>    - trying to create a UDT - but we seem to be stuck in a strange
>    middle-ground with 2.0 because some parts of the API ended up private while
>    some stayed public making it impossible to use it now (I'd be glad if I'm
>    mistaken)
>
> All of these failed for me, and I ended up converting the rows to JSON and
> updating them using JSONPath, which is... something I'd like to avoid, 'pretty
> please' :)
>
>
>
> On Thu, Sep 15, 2016 5:20 AM, Michael Allman mich...@videoamp.com wrote:
>
>> Hi Guys,
>>
>> Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's
>> not a public API, but it is publicly accessible. I used it recently to
>> correct some bad data in a few nested columns in a dataframe. It wasn't an
>> easy job, but it made it possible. In my particular case I was not working
>> with arrays.
>>
>> Olivier, I'm interested in seeing what you come up with.
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On Sep 14, 2016, at 10:44 AM, Fred Reiss <freiss....@gmail.com> wrote:
>>
>> +1 to this request. I talked last week with a product group within IBM
>> that is struggling with the same issue. It's pretty common in data cleaning
>> applications for data in the early stages to have nested lists or sets
>> with inconsistent or incomplete schema information.
>>
>> Fred
>>
>> On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> Hi everyone,
>> I'm currently trying to create a generic transformation mechanism on a
>> Dataframe to modify an arbitrary column regardless of the underlying
>> schema.
>>
>> It's "relatively" straightforward for complex types like
>> struct<struct<…>> to apply an arbitrary UDF on the column and replace the
>> data "inside" the struct. However, I'm struggling to make it work for
>> complex types containing arrays along the way, like struct<array<struct<…>>>.
>>
>> Michael Armbrust seemed to allude on the mailing list/forum to a way of
>> using Encoders to do that. I'd be interested in any pointers, especially
>> considering that it's not possible to output any Row or
>> GenericRowWithSchema from a UDF (thanks to
>> https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657
>> it seems).
>>
>> To sum up, I'd like to find a way to apply a transformation to complex
>> nested datatypes (arrays and structs) in a Dataframe, updating the value
>> itself.
>>
>> Regards,
>>
>> *Olivier Girardot*
>>
>>
>>
>>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>
