Hey Lin,
This is a good question. The root cause of this issue lies in the
analyzer. Currently, Spark SQL can only resolve a name to a top-level
column. (Hive suffers from the same issue.) Taking the SQL query and
struct you provided as an example, col_b.col_d.col_g is resolved as two
nested GetStructField expressions over an AttributeReference named
col_b. Unfortunately, today we only push AttributeReferences down to
underlying relations, so the Spark SQL Parquet data source doesn't even
know which nested fields are referenced. All it knows is which top-level
columns are referenced.
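To make the resolution concrete, here is a hedged, simplified sketch in plain Python of how such an expression tree looks and why only the top-level column survives the pushdown. The class names mirror Catalyst's, but the classes and the `top_level_reference` helper are illustrative inventions, not Spark APIs:

```python
# Simplified stand-ins for Catalyst expression nodes (illustrative only).
class AttributeReference:
    def __init__(self, name):
        self.name = name

class GetStructField:
    def __init__(self, child, field):
        self.child = child
        self.field = field

class GetArrayStructFields:
    def __init__(self, child, field):
        self.child = child
        self.field = field

def top_level_reference(expr):
    """Walk down to the leaf AttributeReference -- roughly what gets
    pushed to the data source; the extraction steps above it are lost."""
    while not isinstance(expr, AttributeReference):
        expr = expr.child
    return expr.name

# col_b.col_d.col_g resolves to extraction expressions over col_b
# (col_d is an array of structs, hence GetArrayStructFields at the top):
col_g = GetArrayStructFields(
    GetStructField(AttributeReference("col_b"), "col_d"), "col_g")

print(top_level_reference(col_g))  # col_b -- so the whole struct is read
```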
I haven't thought about this in depth, but I think there are at least
two approaches:
1. Make the analyzer recognize finer-grained nested fields, and make
expressions like GetStructField named expressions.
2. Pass expressions like GetStructField down to underlying data sources.
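For the second approach, a hedged sketch of the idea: given the full schema and a referenced field path, compute a pruned schema the data source could hand to Parquet as its requested read schema. The nested-dict schema shape and the `prune_schema` helper are illustrative assumptions, not Spark or Parquet APIs:

```python
def prune_schema(schema, path):
    """Keep only the fields along `path` (a list of field names).
    Leaf types are plain strings; structs are dicts."""
    if not path:
        return schema
    head, rest = path[0], path[1:]
    return {head: prune_schema(schema[head], rest)}

# The schema from the query below, with the array element struct
# flattened for brevity:
full = {
    "col_a": "long",
    "col_b": {
        "col_c": "string",
        "col_d": {
            "col_e": "integer",
            "col_f": "string",
            "col_g": "long",
        },
    },
}

print(prune_schema(full, ["col_b", "col_d", "col_g"]))
# {'col_b': {'col_d': {'col_g': 'long'}}}
```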
Cheng
On 12/31/15 10:11 AM, lin wrote:
Hi Yanbo, thanks for the quick response.
Looks like we'll need to do some workaround.
But before that, we'd like to dig into some related discussions
first. We've looked through the following URLs, but none seems helpful.
Mailing list threads:
http://search-hadoop.com/m/q3RTtLkgZl1K4oyx/v=threaded
JIRA:
Parquet pushdown for unionAll,
https://issues.apache.org/jira/browse/SPARK-3462
Spark SQL reads unnecessary nested fields from Parquet,
https://issues.apache.org/jira/browse/SPARK-4502
Parquet Predicate Pushdown Does Not Work with Nested Structures,
https://issues.apache.org/jira/browse/SPARK-5151
Any recommended threads, JIRAs, or PRs, please? Thanks. :)
On Wed, Dec 30, 2015 at 6:21 PM, Yanbo Liang <yblia...@gmail.com> wrote:
This problem has been discussed before, but I think there is
no straightforward way to read only col_g.
2015-12-30 17:48 GMT+08:00 lin <kurtt....@gmail.com>:
Hi all,
We are trying to read from nested parquet data. SQL is "select
col_b.col_d.col_g from some_table" and the data schema for
some_table is:
root
|-- col_a: long (nullable = false)
|-- col_b: struct (nullable = true)
| |-- col_c: string (nullable = true)
| |-- col_d: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- col_e: integer (nullable = true)
| | | |-- col_f: string (nullable = true)
| | | |-- col_g: long (nullable = true)
We expect only col_g to be read and parsed from the
Parquet files; however, we actually observed the whole col_b
being read and parsed.
As we dug in a little bit, it seems that col_g is a
GetArrayStructFields, col_d is a GetStructField, and only
col_b is an AttributeReference, so
PhysicalOperation.collectProjectsAndFilters() returns col_b
instead of col_g as the projection.
So we wonder, is there any way to read and parse only col_g
instead of the whole col_b? We use Spark 1.5.1 and Parquet 1.7.0.
Thanks! :)