Hey Lin,

This is a good question. The root cause of this issue lies in the analyzer: currently, Spark SQL can only resolve a name down to a top-level column. (Hive suffers from the same issue.) Taking the SQL query and struct you provided as an example, col_b.col_d.col_g is resolved into nested field-extraction expressions (a GetStructField and a GetArrayStructFields) over an AttributeReference named col_b. Unfortunately, today we only push AttributeReferences down to the underlying relations, so the Spark SQL Parquet data source doesn't even know which nested fields are referenced; all it knows is which top-level columns are referenced.
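
To make this concrete, here is one way to inspect the resolution (a minimal sketch; the plan in the comments is abridged and its exact rendering varies):

    // Spark 1.5.x, e.g. in spark-shell, where sqlContext is predefined.
    val df = sqlContext.sql("SELECT col_b.col_d.col_g FROM some_table")
    // The analyzed plan wraps a single top-level AttributeReference
    // (col_b) in the nested field-extraction expressions:
    println(df.queryExecution.analyzed)
    // Project [col_b#1.col_d.col_g AS col_g#5]
    //  +- Relation[col_a#0,col_b#1] ParquetRelation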

I haven't thought about this in depth, but I think there are at least two possible approaches:

1. Make the analyzer recognize finer-grained nested fields, and make expressions like GetStructField named expressions.
2. Pass expressions like GetStructField down to the underlying data sources (a rough sketch of the intended effect is below).
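
For (2), the net effect would be that the Parquet reader requests only the referenced leaf. As a rough illustration (this is not an existing API, just the pruned schema such a reader would ideally ask for, built with the public types API):

    import org.apache.spark.sql.types._

    // Hypothetical pruned schema for "select col_b.col_d.col_g":
    // only the col_b -> col_d -> col_g branch of the full schema survives.
    val prunedSchema = StructType(Seq(
      StructField("col_b", StructType(Seq(
        StructField("col_d", ArrayType(
          StructType(Seq(
            StructField("col_g", LongType, nullable = true))),
          containsNull = true), nullable = true))),
        nullable = true)))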

Cheng

On 12/31/15 10:11 AM, lin wrote:
Hi Yanbo, thanks for the quick response.

Looks like we'll need some workaround (one possibility is sketched below).
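
One possible workaround, assuming we can afford to re-materialize the data: pay the full col_b read once and keep a flattened copy, so that later queries scan only the columns they name (paths below are placeholders):

    // Spark 1.5.x: one-time flattening pass (this still reads all of col_b).
    val full = sqlContext.read.parquet("/path/to/some_table")
    // Extracting col_g through the array of structs yields array<long>.
    full.selectExpr("col_b.col_d.col_g AS col_g")
        .write.parquet("/path/to/some_table_flat")
    // Subsequent queries against the flat copy read only col_g.
    sqlContext.read.parquet("/path/to/some_table_flat").select("col_g")
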
But before that, we'd like to dig into some related discussions first. We've looked through the following URLs, but none seems helpful.

Mailing list threads:
http://search-hadoop.com/m/q3RTtLkgZl1K4oyx/v=threaded
JIRA:
Parquet pushdown for unionAll: https://issues.apache.org/jira/browse/SPARK-3462
Spark SQL reads unneccesary nested fields from Parquet: https://issues.apache.org/jira/browse/SPARK-4502
Parquet Predicate Pushdown Does Not Work with Nested Structures: https://issues.apache.org/jira/browse/SPARK-5151

Any recommended threads, JIRA, or PR, please? Thanks. :)


On Wed, Dec 30, 2015 at 6:21 PM, Yanbo Liang <yblia...@gmail.com> wrote:

    This problem has been discussed before, but I think there is no
    straightforward way to read only col_g.

    2015-12-30 17:48 GMT+08:00 lin <kurtt....@gmail.com>:

        Hi all,

        We are trying to read nested Parquet data. The SQL is "select
        col_b.col_d.col_g from some_table", and the schema for
        some_table is:

        root
         |-- col_a: long (nullable = false)
         |-- col_b: struct (nullable = true)
         |    |-- col_c: string (nullable = true)
         |    |-- col_d: array (nullable = true)
         |    |    |-- element: struct (containsNull = true)
         |    |    |    |-- col_e: integer (nullable = true)
         |    |    |    |-- col_f: string (nullable = true)
         |    |    |    |-- col_g: long (nullable = true)

        We expected only col_g to be read and parsed from the Parquet
        files; however, we actually observed the whole col_b being read
        and parsed.

        As we dug in a little, it seems that col_g is a
        GetArrayStructFields, col_d is a GetStructField, and only col_b
        is an AttributeReference, so
        PhysicalOperation.collectProjectsAndFilters() returns col_b
        instead of col_g as the projection (a quick way to see this is
        shown below).
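
        A minimal way to see this (the plan rendering is approximate):

        // Spark 1.5.x: the Parquet scan in the physical plan lists the
        // whole top-level column col_b, not the nested leaf col_g.
        val df = sqlContext.sql("select col_b.col_d.col_g from some_table")
        df.explain()
        // Project [col_b#1.col_d.col_g AS col_g#5]
        //  +- Scan ParquetRelation[...][col_b#1]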

        So we wonder: is there any way to read and parse only col_g
        instead of the whole col_b? We are using Spark 1.5.1 and
        Parquet 1.7.0.

        Thanks! :)



