Hi,

I am having the same issue. Has anyone found a solution to this?

When I convert nested JSON to Parquet, I don't see the projection
working correctly: it still reads all of the nested struct columns,
even though Parquet itself supports nested column projection.

Does Spark 2 SQL provide column projection for nested fields? Does
predicate push-down work for nested columns?

How can we optimize things in this case? Am I using the wrong API?
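
For reference, here is roughly how I have been checking it (a minimal
sketch from spark-shell; the path is just a placeholder, and my data has
the same trip_legs shape as the example quoted below):

import org.apache.spark.sql.functions.col

// Hypothetical path to the Parquet files produced from the JSON.
val df = spark.read.parquet("/data/trips")

// Select a single nested field and look at the FileScan node in the plan:
// ReadSchema shows which Parquet columns will actually be read, and
// PushedFilters shows which predicates were pushed down.
df.select(col("trip_legs.from")).explain(true)

// In my case ReadSchema still lists the whole trip_legs struct rather
// than just trip_legs.from.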

Thanks in advance.




On Sun, Jan 29, 2017 at 12:39 AM, Antoine HOM <antoine....@gmail.com> wrote:

> Hello everybody,
>
> I have been trying to use complex types (stored in Parquet) with Spark
> SQL and ran into an issue that I can't seem to solve cleanly.
> I was hoping, through this mail, to get some insights from the
> community; maybe I'm just missing something obvious in the way I'm
> using Spark :)
>
> It seems that Spark only pushes down projections for columns at the root
> level of the records.
> This is a big I/O issue depending on how much you use complex types:
> in my samples I ended up reading 100GB of data when using only a
> single field out of a struct (it should most likely have read only
> about 1GB).
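>
> To illustrate (simplified; trip_id is a made-up top-level column),
> pruning works as expected for a root-level column but not for a nested
> field:
>
> val df = spark.read.parquet("/data/trips")
>
> // Root-level column: the FileScan's ReadSchema only contains trip_id.
> df.select("trip_id").explain()
>
> // Nested field: ReadSchema still lists the whole trip_legs struct
> // (from, to, taxes, ...), so all of it is read from disk.
> df.select("trip_legs.from").explain()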
>
> I already saw this PR which sounds promising:
> https://github.com/apache/spark/pull/16578
>
> However, it seems that it won't be applicable if there are multiple
> levels of array nesting; the main reason is that I can't find a way to
> reference fields deeply nested inside arrays in a Column expression.
> I can do everything within lambdas, but I think the optimizer won't
> drill into them to understand that I'm only accessing a few fields.
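>
> Concretely, with the example record shown just below, the best I can do
> today is something along these lines (a simplified sketch; the UDF is
> opaque to the optimizer, so the whole trip_legs struct is still read):
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.{col, udf}
>
> // Extract only taxes.type, but the UDF still receives the full
> // trip_legs value for every row.
> val taxTypes = udf((legs: Seq[Row]) =>
>   legs.flatMap(_.getAs[Seq[Row]]("taxes").map(_.getAs[String]("type"))))
>
> df.select(taxTypes(col("trip_legs")))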
>
> If I take the following (simplified) example:
>
> {
>     trip_legs:[{
>         from: "LHR",
>         to: "NYC",
>         taxes: [{
>             type: "gov",
>             amount: 12,
>             currency: "USD"
>         }]
>     }]
> }
>
> col(trip_legs.from) will return an Array of all the from fields for
> each trip_leg object.
> col(trip_legs.taxes.type) will throw an exception.
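>
> The only Column-expression form I found for the deeper field goes
> through an explicit index, e.g. (sketch):
>
> // Works, but only addresses one hard-coded element of each array:
> df.select(col("trip_legs").getItem(0).getField("taxes").getItem(0).getField("type"))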
>
> So my questions are:
>   * Is there a way to reference these deeply nested fields in a Column
> expression without having to specify an array index?
>   * If not, is there an API to force the projection of a given set of
> fields so that Parquet only reads this specific set of columns?
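>
> (For what it's worth, the closest thing I found to forcing a projection
> is passing a hand-built, pruned schema to the reader, along these lines,
> though I'm not sure this is the intended way or how reliably it prunes:)
>
> import org.apache.spark.sql.types._
>
> // Subset of the full schema, keeping only trip_legs.taxes.type.
> val pruned = StructType(Seq(
>   StructField("trip_legs", ArrayType(StructType(Seq(
>     StructField("taxes", ArrayType(StructType(Seq(
>       StructField("type", StringType)
>     ))))
>   ))))
> ))
>
> val df = spark.read.schema(pruned).parquet("/data/trips")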
>
> In addition, regarding the handling of arrays of structs within Spark SQL:
>   * Has it been discussed to have a way to "reshape" an array of
> structs without using lambdas (like the $map/$filter/etc. operators of
> MongoDB, for example)? The sketch just after these questions shows the
> kind of round trip I have to do today.
>   * If not, and given that I'm willing to dedicate time to code these
> features, could someone familiar with the code base tell me how
> disruptive that would be, and whether it would be a welcome change?
> (Most likely more appropriate for the dev mailing list, though.)
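>
> (Right now the only non-lambda reshape I see is the explode / groupBy /
> collect_list round trip sketched below, which is verbose and introduces
> a shuffle; the trip_id key is hypothetical, just to make the sketch
> self-contained.)
>
> import org.apache.spark.sql.functions.{col, explode, collect_list, struct}
>
> df.select(col("trip_id"), explode(col("trip_legs")).as("leg"))
>   .select(col("trip_id"), col("leg.from"), col("leg.to"))
>   .groupBy("trip_id")
>   .agg(collect_list(struct(col("from"), col("to"))).as("trip_legs"))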
>
> Regards,
> Antoine
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
