Seems like you are also looking for transform and reduce for arrays?
https://issues.apache.org/jira/browse/SPARK-23908
https://issues.apache.org/jira/browse/SPARK-23911
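
For reference, a rough sketch of how those functions could address the nested-projection case once they land (syntax taken from the design discussion on those tickets, so it may change before release):

```sql
-- Project each point down to (lat, lng) while keeping the array structure,
-- instead of exploding and regrouping.
SELECT id,
       transform(points, p -> struct(p.lat, p.lng)) AS latlng_points
FROM traj
```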
On Wed, Jun 20, 2018 at 10:43 AM bobotu wrote:
> I store some trajectories data in parquet with this schema:
>
> create table traj (
>   id string,
>   points array<struct<
>     lat: double,
>     lng: double,
>     time: bigint,
>     speed: double,
>     ... lots of attributes here
>     candidate_roads: array<struct<linestring: string, score: double>>
>   >>
> )
>
> It contains a lot of attributes coming from sensors. It also has a nested
> array with information generated by a map-matching algorithm.
>
> All of my algorithms on this dataset are trajectory-oriented: they often
> iterate over the points and use a subset of each point's attributes for
> computation. With this schema I can get the points of a trajectory without
> doing `group by` and `collect_list`.
>
> Because Parquet handles deeply nested data very well, I store it in Parquet
> directly, without flattening.
> This works very well with Impala, because Impala has special support for
> nested data:
>
> select
>   id,
>   avg_speed
> from
>   traj t,
>   (select avg(speed) avg_speed from t.points where time < '2018-06-19') v
>
> As you can see, Impala treats an array of structs as a table nested in each
> row, and can compute over the array elements at a per-row level. Impala also
> uses Parquet's features to prune unused attributes in the point struct.
>
> I use Spark for some complex algorithms which cannot be written in pure SQL,
> but I run into trouble with the Spark DataFrame API:
> 1. Spark cannot do schema pruning and filter push-down on nested columns.
> And it seems there is no handy syntax to work with deeply nested data.
> 2. `explode` does not help in my scenario, because I need to preserve the
> trajectory-points hierarchy. If I use `explode` here, I need an extra
> `group by` on `id`.
> 3. Although I can directly select `points.lat`, that loses the structure.
> If I need an array of (lat, lng) pairs, I have to zip two arrays. And this
> does not work at deeper nesting levels, such as selecting
> `points.candidate_road.score`.
> 4. Maybe I can use parquet-mr to read the file as an RDD, setting the read
> schema and pushing down filters. But this loses the Hive integration and
> the table abstraction.
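>
> To illustrate point 2, a minimal sketch (assuming a SparkSession `spark`
> and the schema above) of the explode-and-regroup round trip I want to
> avoid, since `collect_list` costs an extra shuffle and does not guarantee
> point order:
>
> ```scala
> import org.apache.spark.sql.functions.{explode, collect_list}
> import spark.implicits._
>
> val exploded  = traj.select($"id", explode($"points").as("p"))
> val filtered  = exploded.filter($"p.time" < 1529366400L)
> // Extra shuffle just to restore the trajectory-points hierarchy.
> val regrouped = filtered.groupBy($"id").agg(collect_list($"p").as("points"))
> ```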
>
> So, I think it would be nice to have some additional support for nested
> data. Maybe an Impala-style subquery syntax on complex data, or something
> like a schema projection function for nested data:
>
> select id, extract(points, lat, lng, extract(candidate_road, score)) from
> traj
>
> which produces a schema like:
>
> |- id string
> |- points array of struct
>    |- lat double
>    |- lng double
>    |- candidate_road array of struct
>       |- score double
>
> Then users can work with the points in the desired schema, with data
> pruning done by Parquet.
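>
> As a stopgap for point 3 I can zip parallel arrays back together with a
> UDF (a sketch; `zipLatLng` is a hypothetical helper that assumes the two
> arrays always have equal length):
>
> ```scala
> import org.apache.spark.sql.functions.udf
> import spark.implicits._
>
> // Zip parallel lat/lng arrays back into an array of (lat, lng) pairs.
> val zipLatLng = udf((lat: Seq[Double], lng: Seq[Double]) => lat.zip(lng))
> val pairs = traj.select($"id",
>   zipLatLng($"points.lat", $"points.lng").as("latlng"))
> ```
>
> But this does not reach the deeper `candidate_roads` level, and it still
> gives up Parquet-side pruning of the remaining point attributes.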
>
> Or is there some existing syntax that already does this?
>
>
>