Hi, I am having the same issue. Has anyone found a solution to this?
When I convert the nested JSON to Parquet, I don't see the projection working correctly: it still reads all of the nested structure's columns, even though Parquet itself supports nested column projection. Does Spark 2 SQL provide column projection for nested fields? Does predicate push-down work for nested columns? How can we optimize things in this case? Am I using the wrong API? (A minimal repro sketch of what I'm running is below the quoted mail.) Thanks in advance.

On Sun, Jan 29, 2017 at 12:39 AM, Antoine HOM <antoine....@gmail.com> wrote:

> Hello everybody,
>
> I have been trying to use complex types (stored in Parquet) with Spark
> SQL and ended up with an issue that I can't seem to solve cleanly.
> I was hoping, through this mail, to get some insights from the
> community; maybe I'm just missing something obvious in the way I'm
> using Spark :)
>
> It seems that Spark only pushes down projections for columns at the root
> level of the records. This is a big I/O issue depending on how much you
> use complex types: in my samples I ended up reading 100GB of data when
> using only a single field out of a struct (it should most likely have
> read only 1GB).
>
> I already saw this PR, which sounds promising:
> https://github.com/apache/spark/pull/16578
>
> However, it seems that it won't be applicable if you have multiple
> array nesting levels. The main reason is that I can't seem to find how
> to reference fields deeply nested in arrays in a Column expression.
> I can do everything within lambdas, but I think the optimizer won't
> drill into them to understand that I'm only accessing a few fields.
>
> If I take the following (simplified) example:
>
> {
>   trip_legs: [{
>     from: "LHR",
>     to: "NYC",
>     taxes: [{
>       type: "gov",
>       amount: 12,
>       currency: "USD"
>     }]
>   }]
> }
>
> col("trip_legs.from") will return an array of all the from fields for
> each trip_leg object.
> col("trip_legs.taxes.type") will throw an exception.
>
> So my questions are:
> * Is there a way to reference these deeply nested fields in a Column
> expression without having to specify an array index?
> * If not, is there an API to force the projection of a given set of
> fields so that Parquet only reads this specific set of columns?
>
> In addition, regarding the handling of arrays of structs within Spark SQL:
> * Has it been discussed to add a way to "reshape" an array of
> structs without using lambdas? (Like the $map/$filter/etc. operators
> of MongoDB, for example.)
> * If not, and I'm willing to dedicate time to code these features,
> could someone familiar with the code base tell me how disruptive this
> would be? And whether this would be a welcome change or not? (Most
> likely more appropriate for the dev mailing list, though.)
>
> Regards,
> Antoine
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
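For reference, here is the minimal repro sketch I'm running (Spark 2.x, Scala). The file names, paths, and session setup are placeholders of mine, and the schema is assumed to match Antoine's example above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nested-projection-repro")
  .master("local[*]")
  .getOrCreate()

// Convert the nested JSON to Parquet once.
spark.read.json("trips.json")
  .write.mode("overwrite").parquet("trips.parquet")

val df = spark.read.parquet("trips.parquet")

// One array level deep works: this yields an array of the `from` values.
df.select("trip_legs.from").show()

// Two array levels deep throws an exception, as described in the quoted mail:
// df.select("trip_legs.taxes.type")

// The ReadSchema entry in the physical plan shows which columns the Parquet
// scan will actually read; if the whole trip_legs struct appears there, the
// nested projection was not pushed down.
df.select("trip_legs.from").explain()

In my case ReadSchema lists the entire trip_legs struct, which is how I concluded that all the nested columns are being read.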
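On the "reshape an array of structs" question: the only lambda-free workaround I have found is to explode the array, project the wanted fields, and re-assemble with collect_list. A sketch under the same assumptions as above (the row_id column is synthesized here because the sample records have no natural key); note that this only reshapes the result, it does not reduce the I/O, since the full struct is still read from Parquet:

import org.apache.spark.sql.functions._

val reshaped = df
  // Tag each row so the exploded legs can be grouped back together.
  .withColumn("row_id", monotonically_increasing_id())
  .select(col("row_id"), explode(col("trip_legs")).as("leg"))
  // Keep only the fields we care about, then rebuild the array of structs.
  .groupBy("row_id")
  .agg(collect_list(struct(col("leg.from"), col("leg.to"))).as("trip_legs"))

reshaped.show(truncate = false)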