Seems like you are also looking for transform and reduce for arrays? https://issues.apache.org/jira/browse/SPARK-23908
https://issues.apache.org/jira/browse/SPARK-23911

On Wed, Jun 20, 2018 at 10:43 AM bobotu <m...@zejun.li> wrote:

> I store some trajectory data in Parquet with this schema:
>
>     create table traj (
>       id string,
>       points array<struct<
>         lat: double,
>         lng: double,
>         time: bigint,
>         speed: double,
>         ... lots of attributes here
>         candidate_roads: array<struct<linestring: string, score: double>>
>       >>
>     )
>
> It contains lots of attributes coming from sensors, plus a nested array
> holding information generated by a map-matching algorithm.
>
> All of the algorithms I run on this dataset are trajectory-oriented: they
> often iterate over the points and use a subset of each point's attributes
> for some computation. With this schema I can get the points of a
> trajectory without doing a `group by` and `collect_list`.
>
> Because Parquet works very well on deeply nested data, I store the data
> directly in Parquet format, without flattening. It works very well with
> Impala, because Impala has special support for nested data:
>
>     select
>       id,
>       avg_speed
>     from
>       traj t,
>       (select avg(speed) avg_speed from t.points where time < '2018-06-19')
>
> As you can see, Impala treats an array of structs as a table nested in
> each row, and can run computations over the array elements at the per-row
> level. Impala will also use Parquet's features to prune unused attributes
> in the point struct.
>
> I use Spark for some complex algorithms that cannot be written in pure
> SQL, but I ran into some trouble with the Spark DataFrame API:
>   1. Spark cannot do schema pruning and filter push-down on nested
>      columns, and there seems to be no handy syntax for working with
>      deeply nested data.
>   2. `explode` does not help in my scenario, because I need to preserve
>      the trajectory-points hierarchy. If I use `explode` here, I need an
>      extra `group by` on `id`.
>   3. I can directly select `points.lat`, but that loses the structure.
>      If I need an array of (lat, lng) pairs, I have to zip the two
>      arrays.
>      And that approach does not work at deeper nesting levels, such as
>      selecting `points.candidate_road.score`.
>   4. Maybe I could use parquet-mr to read the files as an RDD, set the
>      read schema, and push down filters, but that loses the Hive
>      integration and the table abstraction.
>
> So I think it would be nice to have some additional support for nested
> data: maybe an Impala-style subquery syntax on complex data, or something
> like a schema-projection function on nested data:
>
>     select id, extract(points, lat, lng, extract(candidate_road, score))
>     from traj
>
> which produces this schema:
>
>     |- id string
>     |- points array of struct
>        |- lat double
>        |- lng double
>        |- candidate_road array of struct
>           |- score double
>
> so the user can work with `points` under the desired schema, with data
> pruning done in Parquet.
>
> Or is there some existing syntax that can do this?
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
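For readers of the archive: the array higher-order functions referenced in the reply (`transform` in SPARK-23908, `aggregate`/reduce in SPARK-23911) let you compute over an array column inside each row, which is exactly the explode-free, per-trajectory computation described above. A pure-Python sketch of those semantics (the helper names and sample rows are illustrative, not the actual Spark API):

```python
# Illustrative model of SQL array higher-order functions applied to the
# nested `traj` layout. Rows and values are made up for the example.
from functools import reduce

rows = [
    {"id": "t1",
     "points": [
         {"lat": 31.1, "lng": 121.3, "time": 1, "speed": 10.0},
         {"lat": 31.2, "lng": 121.4, "time": 2, "speed": 20.0},
     ]},
]

def transform(arr, f):
    """Element-wise map over an array column, like transform(arr, x -> f(x))."""
    return [f(x) for x in arr]

def aggregate(arr, zero, merge):
    """Fold over an array column, like aggregate(arr, zero, (acc, x) -> ...)."""
    return reduce(merge, arr, zero)

# Per-row average speed with no explode and no group by: the computation
# stays inside each row's array, so the trajectory hierarchy is preserved.
result = [
    {"id": r["id"],
     "avg_speed": aggregate(transform(r["points"], lambda p: p["speed"]),
                            0.0, lambda acc, s: acc + s) / len(r["points"])}
    for r in rows
]
```

This mirrors the Impala subquery in the quoted message: the fold runs over `t.points` within each row, producing one `avg_speed` per trajectory.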
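The proposed `extract(...)` projection can also be sketched in pure Python to make its intended output concrete: keep only `lat` and `lng` from each point and `score` from each candidate road, while preserving the nesting. Everything here (function name, sample data) is illustrative, not an existing Spark feature:

```python
# Illustrative model of the proposed nested schema projection. The input
# row mimics the `traj` schema; values are made up.
rows = [
    {"id": "t1",
     "points": [
         {"lat": 31.1, "lng": 121.3, "time": 1, "speed": 10.0,
          "candidate_roads": [{"linestring": "LINESTRING(0 0, 1 1)",
                               "score": 0.9}]},
     ]},
]

def extract_points(points):
    """Project each point down to (lat, lng, candidate_road.score),
    keeping the array-of-structs shape at both nesting levels."""
    return [
        {"lat": p["lat"],
         "lng": p["lng"],
         "candidate_road": [{"score": c["score"]}
                            for c in p["candidate_roads"]]}
        for p in points
    ]

projected = [{"id": r["id"], "points": extract_points(r["points"])}
             for r in rows]
```

The result matches the schema drawn in the message: `id`, then `points` as an array of structs holding only `lat`, `lng`, and a pruned `candidate_road` array. In a real engine the same projection would be pushed into the Parquet reader so the dropped attributes are never decoded.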