Hey Lin,
This is a good question. The root cause of this issue lies in the
analyzer. Currently, Spark SQL can only resolve a name to a top-level
column. (Hive suffers from the same issue.) Taking the SQL query and
struct you provided as an example, col_b.col_d.col_g is resolved as two
nested GetStructField expressions over an AttributeReference named
col_b. Unfortunately, today we only push AttributeReferences down to
underlying relations, so the Spark SQL Parquet data source doesn't even
know which nested fields are referenced. All it knows is which top-level
columns are referenced.
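To make the resolution concrete, here is a hedged, simplified sketch in plain Python of how such an expression tree looks and why only the top-level column survives the pushdown. The class names mirror Catalyst's, but the classes and the `top_level_reference` helper are illustrative inventions, not Spark APIs:

```python
# Simplified stand-ins for Catalyst expression nodes (illustrative only).
class AttributeReference:
    def __init__(self, name):
        self.name = name

class GetStructField:
    def __init__(self, child, field):
        self.child = child
        self.field = field

class GetArrayStructFields:
    def __init__(self, child, field):
        self.child = child
        self.field = field

def top_level_reference(expr):
    """Walk down to the leaf AttributeReference -- roughly what gets
    pushed to the data source; the extraction steps above it are lost."""
    while not isinstance(expr, AttributeReference):
        expr = expr.child
    return expr.name

# col_b.col_d.col_g resolves to extraction expressions over col_b
# (col_d is an array of structs, hence GetArrayStructFields at the top):
col_g = GetArrayStructFields(
    GetStructField(AttributeReference("col_b"), "col_d"), "col_g")

print(top_level_reference(col_g))  # col_b -- so the whole struct is read
```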
I haven't thought about this in depth, but I think there are at least
two approaches:
1. Make the analyzer recognize finer-grained nested fields, and make
expressions like GetStructField named expressions.
2. Pass expressions like GetStructField down to underlying data sources.
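For the second approach, a hedged sketch of the idea: given the full schema and a referenced field path, compute a pruned schema the data source could hand to Parquet as its requested read schema. The nested-dict schema shape and the `prune_schema` helper are illustrative assumptions, not Spark or Parquet APIs:

```python
def prune_schema(schema, path):
    """Keep only the fields along `path` (a list of field names).
    Leaf types are plain strings; structs are dicts."""
    if not path:
        return schema
    head, rest = path[0], path[1:]
    return {head: prune_schema(schema[head], rest)}

# The schema from the query below, with the array element struct
# flattened for brevity:
full = {
    "col_a": "long",
    "col_b": {
        "col_c": "string",
        "col_d": {
            "col_e": "integer",
            "col_f": "string",
            "col_g": "long",
        },
    },
}

print(prune_schema(full, ["col_b", "col_d", "col_g"]))
# {'col_b': {'col_d': {'col_g': 'long'}}}
```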
Cheng
On 12/31/15 10:11 AM, lin wrote:
Hi Yanbo, thanks for the quick response.
Looks like we'll need to do some workaround.
But before that, we'd like to dig into some related discussions
first. We've looked through the following URLs, but none seems helpful.
Mailing list threads:
http://search-hadoop.com/m/q3RTtLkgZl1K4oyx/v=threaded
JIRA:
Parquet pushdown for unionAll,
https://issues.apache.org/jira/browse/SPARK-3462
Spark SQL reads unnecessary nested fields from Parquet,
https://issues.apache.org/jira/browse/SPARK-4502
Parquet Predicate Pushdown Does Not Work with Nested Structures,
https://issues.apache.org/jira/browse/SPARK-5151
Any recommended threads, JIRAs, or PRs, please? Thanks. :)
On Wed, Dec 30, 2015 at 6:21 PM, Yanbo Liang <yblia...@gmail.com> wrote:
This problem has been discussed before, but I think there is
no straightforward way to read only col_g.
2015-12-30 17:48 GMT+08:00 lin <kurtt....@gmail.com>:
Hi all,
We are trying to read from nested parquet data. SQL is "select
col_b.col_d.col_g from some_table" and the data schema for
some_table is:
root
|-- col_a: long (nullable = false)
|-- col_b: struct (nullable = true)
| |-- col_c: string (nullable = true)
| |-- col_d: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- col_e: integer (nullable = true)
| | | |-- col_f: string (nullable = true)
| | | |-- col_g: long (nullable = true)
We expect only col_g to be read and parsed from the
Parquet files; however, we actually observed the whole col_b
being read and parsed.
As we dug in a little bit, it seems that col_g is a
GetArrayStructFields, col_d is a GetStructField, and only
col_b is an AttributeReference, so
PhysicalOperation.collectProjectsAndFilters() returns col_b
instead of col_g as the projection.
So we wonder, is there any way to read and parse only col_g
instead of the whole col_b? We use Spark 1.5.1 and Parquet 1.7.0.
Thanks! :)