[ https://issues.apache.org/jira/browse/SPARK-24595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24595.
----------------------------------
    Resolution: Invalid

Let's redirect the question to the dev/user mailing list before filing a JIRA here.

> What about additional support for deeply nested columns?
> --------------------------------------------------------
>
>                 Key: SPARK-24595
>                 URL: https://issues.apache.org/jira/browse/SPARK-24595
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Zejun Li
>            Priority: Major
>
> I store trajectory data in Parquet with this schema:
> {code:java}
> create table traj(
>   id string,
>   points array<struct<
>     lat: double,
>     lng: double,
>     time: bigint,
>     speed: double,
>     ... lots of attributes here
>     candidate_road: array<struct<linestring: string, score: double>>
>   >>
> ){code}
> It contains many attributes that come from sensors, plus a nested array holding information generated by a map-matching algorithm.
>
> All of the algorithms I run on this dataset are trajectory-oriented, which means they often iterate over points and use a subset of each point's attributes in their computation. With this schema I can get the points of a trajectory without doing a `group by` and `collect_list`.
>
> Because Parquet handles deeply nested data very well, I store the data in Parquet without flattening it. It works very well with Impala, because Impala has special support for nested data:
> {code:java}
> select
>   id,
>   avg_speed
> from
>   traj t,
>   (select avg(speed) avg_speed from t.points where time < '2018-06-19'){code}
> As you can see, Impala treats an array of structs as a nested table and can compute over array elements at the per-row level, and it uses Parquet's features to prune unused attributes in the point struct.
>
> I use Spark for complex algorithms that cannot be written in pure SQL, but I run into trouble with the Spark DataFrame API: Spark cannot do schema pruning or filter push-down on nested columns, and there seems to be no handy syntax for working with deeply nested data.
> * `explode` does not help in my scenario, because I need to preserve the trajectory-points hierarchy; if I use `explode`, I need an extra `group by` on `id` (see the first sketch below).
> * I can select `points.lat` directly, but that loses the structure. If I need an array of (lat, lng) pairs, I have to zip two arrays, and this does not work at deeper nesting levels, such as `points.candidate_road.score` (see the second sketch below).
> * I could use the parquet-mr package to read the files as an RDD and pass the read schema to it directly, but that loses Hive integration and Spark's vectorized reader.
>
> So I think it would be nice to have an Impala-style subquery syntax on complex data, or some support for schema projection on nested data (see the third sketch below), like:
> {code:java}
> select id, extract(points, lat, lng, extract(candidate_road, score)) from traj{code}
> which would produce this schema:
> {code:java}
> |- id string
> |- points array of struct
>    |- lat double
>    |- lng double
>    |- candidate_road array of struct
>       |- score double{code}
> so that users can work with points under the desired schema, with data pruning done by Parquet.
>
> Or is there existing syntax that already does what I need?
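For reference, the `explode` round-trip described in the first bullet can be sketched with the DataFrame API. This is a minimal sketch, not the reporter's code: it assumes a DataFrame `traj` loaded from the table above, a hypothetical per-point filter, and that `time` is epoch milliseconds.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Explode the points array, filter per point, then restore the
// trajectory-points hierarchy with an extra group-by on id.
// Assumes `traj` has the quoted schema and `time` is epoch millis.
def filterPoints(traj: DataFrame): DataFrame =
  traj
    .select(col("id"), explode(col("points")).as("point"))
    .filter(col("point.time") < lit(1529366400000L)) // 2018-06-19 00:00 UTC
    .groupBy("id")
    .agg(collect_list("point").as("points")) // point order is not preserved
{code}

Besides the extra shuffle, note that `collect_list` does not preserve the original point order, which is one more reason the round-trip is unattractive for trajectory data.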
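The zip workaround from the second bullet could look like the sketch below. `arrays_zip` only exists from Spark 2.4 (this issue targets 2.3.0), and as the bullet says, it stops at the first nesting level.

{code:scala}
import org.apache.spark.sql.functions._

// Spark 2.4+: rebuild an array of (lat, lng) structs from two extracted
// arrays. This cannot reach deeper fields such as
// points.candidate_road.score.
val latLng = traj.select(
  col("id"),
  arrays_zip(col("points.lat"), col("points.lng")).as("points")
)
{code}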
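Finally, the proposed `extract` can be approximated with higher-order functions (also Spark 2.4+, so after this issue was filed). A sketch under those assumptions; note that rebuilding structs this way projects what the caller sees but does not by itself guarantee nested column pruning in the Parquet reader.

{code:scala}
import org.apache.spark.sql.functions._

// Approximate extract(points, lat, lng, extract(candidate_road, score)):
// rebuild each point struct keeping only the wanted fields, recursing
// into candidate_road.
val projected = traj.select(
  col("id"),
  expr("""
    transform(points, p -> named_struct(
      'lat', p.lat,
      'lng', p.lng,
      'candidate_road',
        transform(p.candidate_road, r -> named_struct('score', r.score))))
  """).as("points")
)
{code}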