I store trajectory data in Parquet with this schema:

  create table traj (
    id string,
    points array<struct<
      lat: double,
      lng: double,
      time: bigint,
      speed: double,
      ... lots of attributes here
      candidate_roads: array<struct<linestring: string, score: double>>
    >>
  )
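For concreteness, here is a plain-Python mock of one row shaped like the schema above (field names come from the DDL; the many sensor attributes are elided, and all values are made up):

```python
# One trajectory row, shaped like the Parquet schema above.
# Coordinates, timestamps, and LINESTRING text are made-up illustrations.
traj_row = {
    "id": "traj-001",
    "points": [
        {
            "lat": 39.9042, "lng": 116.4074, "time": 1529366400, "speed": 12.5,
            # ... many more per-point sensor attributes elided ...
            "candidate_roads": [
                {"linestring": "LINESTRING(116.40 39.90, 116.41 39.91)", "score": 0.92},
                {"linestring": "LINESTRING(116.40 39.90, 116.39 39.89)", "score": 0.41},
            ],
        },
        {
            "lat": 39.9050, "lng": 116.4080, "time": 1529366405, "speed": 13.0,
            "candidate_roads": [
                {"linestring": "LINESTRING(116.40 39.90, 116.41 39.91)", "score": 0.88},
            ],
        },
    ],
}

# Trajectory-oriented algorithms iterate over the points of one row,
# touching only a subset of each point's attributes:
avg_speed = sum(p["speed"] for p in traj_row["points"]) / len(traj_row["points"])
```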
It contains a lot of attributes coming from sensors. It also has a nested array holding information generated by a map-matching algorithm. All the algorithms I run on this dataset are trajectory-oriented: they iterate over the points of a trajectory and use a subset of each point's attributes for computation. With this schema I can get the points of a trajectory without doing a `group by` and `collect_list`. Because Parquet handles deeply nested data very well, I store the data directly in Parquet without flattening.

This works very well with Impala, because Impala has special support for nested data:

  select id, avg_speed
  from traj t, (select avg(speed) avg_speed from t.points where time < '2018-06-19')

As you can see, Impala treats an array of structs as a table nested inside each row, and can do computation on the array elements at the per-row level. Impala will also use Parquet's features to prune unused attributes of the point struct.

I use Spark for some complex algorithms that cannot be written in pure SQL, but I run into trouble with the Spark DataFrame API:

1. Spark cannot do schema pruning and filter push-down on nested columns, and there seems to be no handy syntax for working with deeply nested data.

2. `explode` does not help in my scenario, because I need to preserve the trajectory-points hierarchy. If I use `explode` here, I need an extra `group by` on `id` to rebuild it.

3. I can directly select `points.lat`, but that loses the structure. If I need an array of (lat, lng) pairs, I have to zip two arrays, and that does not work at deeper nesting levels, such as selecting `points.candidate_roads.score`.

4. Maybe I could use parquet-mr to read the files as an RDD, set the read schema, and push down filters myself, but that manner loses the Hive integration and the table abstraction.

So, I think it would be nice to have some additional support for nested data.
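To make items 2 and 3 concrete, here is a plain-Python sketch (not Spark code; names and values are illustrative) of both workarounds: selecting `points.lat` and `points.lng` yields parallel arrays that must be zipped back together, and exploding points forces a later group-by on `id` to restore the trajectory:

```python
from collections import defaultdict

# One row with its nested points (sensor attributes trimmed for brevity).
row = {"id": "t1",
       "points": [{"lat": 39.90, "lng": 116.40, "speed": 12.5},
                  {"lat": 39.91, "lng": 116.41, "speed": 13.0}]}

# Item 3: selecting points.lat / points.lng gives two parallel arrays...
lats = [p["lat"] for p in row["points"]]
lngs = [p["lng"] for p in row["points"]]
# ...and the (lat, lng) structure must be rebuilt by zipping them:
pairs = list(zip(lats, lngs))

# Item 2: explode flattens points into one record per point...
exploded = [{"id": row["id"], **p} for p in row["points"]]
# ...so an extra group-by on id is needed to restore the hierarchy:
regrouped = defaultdict(list)
for rec in exploded:
    regrouped[rec["id"]].append({k: v for k, v in rec.items() if k != "id"})
```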
Maybe an Impala-style subquery syntax on complex data, or something like a schema projection function on nested data:

  select id, extract(points, lat, lng, extract(candidate_roads, score)) from traj

which would produce the schema:

  |- id string
  |- points array of struct
     |- lat double
     |- lng double
     |- candidate_roads array of struct
        |- score double

Then users could work with points using the desired schema, with data pruning done in Parquet. Or is there some existing syntax that accomplishes this?

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org