Seems like you are also looking for transform and reduce for arrays? https://issues.apache.org/jira/browse/SPARK-23908
https://issues.apache.org/jira/browse/SPARK-23911

On Wed, Jun 20, 2018 at 10:43 AM bobotu <m...@zejun.li> wrote:

> I store some trajectory data in Parquet with this schema:
>
>     create table traj (
>       id string,
>       points array<struct<
>         lat: double,
>         lng: double,
>         time: bigint,
>         speed: double,
>         ... lots of attributes here
>         candidate_roads: array<struct<linestring: string, score: double>>
>       >>
>     )
>
> It contains lots of attributes coming from sensors, plus a nested array
> holding information generated by a map-matching algorithm.
>
> All of the algorithms I run on this dataset are trajectory-oriented: they
> often iterate over the points and use a subset of each point's attributes
> for some computation. With this schema I can get the points of a
> trajectory without doing a `group by` and `collect_list`.
>
> Because Parquet works very well on deeply nested data, I store the data
> directly in Parquet format, without flattening. It works very well with
> Impala, because Impala has special support for nested data:
>
>     select
>       id,
>       avg_speed
>     from
>       traj t,
>       (select avg(speed) avg_speed from t.points where time < '2018-06-19')
>
> As you can see, Impala treats an array of structs as a table nested in
> each row, and can run computations over the array elements at the per-row
> level. Impala will also use Parquet's features to prune unused attributes
> in the point struct.
>
> I use Spark for some complex algorithms that cannot be written in pure
> SQL, but I ran into some trouble with the Spark DataFrame API:
>   1. Spark cannot do schema pruning and filter push-down on nested
>      columns, and there seems to be no handy syntax for working with
>      deeply nested data.
>   2. `explode` does not help in my scenario, because I need to preserve
>      the trajectory-points hierarchy. If I use `explode` here, I need an
>      extra `group by` on `id`.
>   3. I can directly select `points.lat`, but that loses the structure.
>      If I need an array of (lat, lng) pairs, I have to zip the two
>      arrays.
>      And that approach does not work at deeper nesting levels, such as
>      selecting `points.candidate_road.score`.
>   4. Maybe I could use parquet-mr to read the files as an RDD, set the
>      read schema, and push down filters, but that loses the Hive
>      integration and the table abstraction.
>
> So I think it would be nice to have some additional support for nested
> data: maybe an Impala-style subquery syntax on complex data, or something
> like a schema-projection function on nested data:
>
>     select id, extract(points, lat, lng, extract(candidate_road, score))
>     from traj
>
> which produces this schema:
>
>     |- id string
>     |- points array of struct
>        |- lat double
>        |- lng double
>        |- candidate_road array of struct
>           |- score double
>
> so the user can work with `points` under the desired schema, with data
> pruning done in Parquet.
>
> Or is there some existing syntax that can do this?
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
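For readers of the archive: the array higher-order functions referenced in the reply (`transform` in SPARK-23908, `aggregate`/reduce in SPARK-23911) let you compute over an array column inside each row, which is exactly the explode-free, per-trajectory computation described above. A pure-Python sketch of those semantics (the helper names and sample rows are illustrative, not the actual Spark API):

```python
# Illustrative model of SQL array higher-order functions applied to the
# nested `traj` layout. Rows and values are made up for the example.
from functools import reduce

rows = [
    {"id": "t1",
     "points": [
         {"lat": 31.1, "lng": 121.3, "time": 1, "speed": 10.0},
         {"lat": 31.2, "lng": 121.4, "time": 2, "speed": 20.0},
     ]},
]

def transform(arr, f):
    """Element-wise map over an array column, like transform(arr, x -> f(x))."""
    return [f(x) for x in arr]

def aggregate(arr, zero, merge):
    """Fold over an array column, like aggregate(arr, zero, (acc, x) -> ...)."""
    return reduce(merge, arr, zero)

# Per-row average speed with no explode and no group by: the computation
# stays inside each row's array, so the trajectory hierarchy is preserved.
result = [
    {"id": r["id"],
     "avg_speed": aggregate(transform(r["points"], lambda p: p["speed"]),
                            0.0, lambda acc, s: acc + s) / len(r["points"])}
    for r in rows
]
```

This mirrors the Impala subquery in the quoted message: the fold runs over `t.points` within each row, producing one `avg_speed` per trajectory.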
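The proposed `extract(...)` projection can also be sketched in pure Python to make its intended output concrete: keep only `lat` and `lng` from each point and `score` from each candidate road, while preserving the nesting. Everything here (function name, sample data) is illustrative, not an existing Spark feature:

```python
# Illustrative model of the proposed nested schema projection. The input
# row mimics the `traj` schema; values are made up.
rows = [
    {"id": "t1",
     "points": [
         {"lat": 31.1, "lng": 121.3, "time": 1, "speed": 10.0,
          "candidate_roads": [{"linestring": "LINESTRING(0 0, 1 1)",
                               "score": 0.9}]},
     ]},
]

def extract_points(points):
    """Project each point down to (lat, lng, candidate_road.score),
    keeping the array-of-structs shape at both nesting levels."""
    return [
        {"lat": p["lat"],
         "lng": p["lng"],
         "candidate_road": [{"score": c["score"]}
                            for c in p["candidate_roads"]]}
        for p in points
    ]

projected = [{"id": r["id"], "points": extract_points(r["points"])}
             for r in rows]
```

The result matches the schema drawn in the message: `id`, then `points` as an array of structs holding only `lat`, `lng`, and a pruned `candidate_road` array. In a real engine the same projection would be pushed into the Parquet reader so the dropped attributes are never decoded.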