I store some trajectory data in Parquet with this schema:

create table traj (
  id     string,
  points array<struct<
           lat:   double,
           lng:   double,
           time:  bigint,
           speed: double,
           -- ... lots of attributes here
           candidate_roads: array<struct<linestring: string, score: double>>
         >>
)

It contains a lot of attributes that come from sensors. It also has a nested
array which contains information generated by the map-matching algorithm.
 
All of the algorithms I run on this dataset are trajectory-oriented, which
means they often iterate over points and use a subset of each point's
attributes for computation. With this schema I can get the points of a
trajectory without doing a `group by` and `collect_list`.
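To illustrate, this is roughly how that access pattern looks with the Dataset
API (a sketch; `spark` is a SparkSession, and the case classes are trimmed to
the fields involved -- as far as I know the encoder matches fields by name and
ignores the extra attributes):

case class CandidateRoad(linestring: String, score: Double)
case class Point(lat: Double, lng: Double, time: Long, speed: Double,
                 candidate_roads: Seq[CandidateRoad])
case class Traj(id: String, points: Seq[Point])

import spark.implicits._

// Each row already carries a whole trajectory, so per-trajectory
// computation is plain iteration over `points` -- no group by or
// collect_list needed:
val avgSpeeds = spark.table("traj").as[Traj].map { t =>
  (t.id, t.points.map(_.speed).sum / t.points.size)
}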
 
Because Parquet works very well with deeply nested data, I store it directly
in Parquet format without flattening.
This works very well with Impala, because Impala has special support for
nested data:
 
select
  id,
  avg_speed
from 
  traj t,
  (select avg(speed) avg_speed from t.points where time < '2018-06-19')

As you can see, Impala treats an array of structs as a table nested in each
row, and can compute over the array elements at a per-row level. Impala also
uses Parquet's features to prune unused attributes in the point struct.
 
I use Spark for some complex algorithms which cannot be written in pure SQL,
but I ran into some trouble with the Spark DataFrame API:
    1. Spark cannot do schema pruning or filter push-down on nested columns,
and there seems to be no handy syntax for working with deeply nested data.
    2. `explode` does not help in my scenario, because I need to preserve the
trajectory-points hierarchy. If I use `explode` here, I need an extra
`group by` on `id` (see the sketch after this list).
    3. I can directly select `points.lat`, but that loses the structure. If I
need an array of (lat, lng) pairs, I have to zip two arrays, and that does
not work at deeper nesting levels, such as selecting
`points.candidate_roads.score`.
    4. Maybe I can use parquet-mr to read the files as an RDD, set the read
schema and push down filters. But this approach loses the Hive integration
and the table abstraction.
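To make points 2 and 3 concrete, a minimal sketch (assuming a SparkSession
named `spark` and the schema above; the epoch cutoff is just a placeholder):

import org.apache.spark.sql.functions.{avg, col, explode}

// Point 2: `explode` flattens the trajectory-points hierarchy, so an
// extra shuffle (group by on id) is needed just to get back to
// per-trajectory results:
val avgSpeeds = spark.table("traj")
  .select(col("id"), explode(col("points")).as("p"))
  .where(col("p.time") < 1529366400L)  // placeholder epoch cutoff
  .groupBy("id")
  .agg(avg("p.speed").as("avg_speed"))

// Point 3: this yields two parallel arrays (array<double> each),
// not an array of (lat, lng) structs:
val flat = spark.table("traj")
  .select(col("id"), col("points.lat"), col("points.lng"))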

So I think it would be nice to have some additional support for nested data:
maybe an Impala-style subquery syntax on complex data, or something like a
schema projection function on nested data, e.g.:

select
  id,
  extract(points, lat, lng, extract(candidate_roads, score))
from traj

which would produce this schema:

|- id string
|- points array of struct
    |- lat double
    |- lng double
    |- candidate_roads array of struct
        |- score double

Then users could work with the points under the desired schema, with the data
pruned in the Parquet scan.
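The closest approximation I can see is rebuilding the desired shape
element-wise with the SQL higher-order functions (a sketch, assuming a Spark
build where `transform` is available; as far as I can tell it still does not
prune columns in the Parquet scan):

// Preserves the trajectory-points hierarchy and the projected shape,
// but Parquet still reads the full point structs:
val projected = spark.table("traj").selectExpr(
  "id",
  """transform(points, p -> struct(
       p.lat as lat,
       p.lng as lng,
       transform(p.candidate_roads, c -> struct(c.score as score))
         as candidate_roads
     )) as points""")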

Or is there some existing syntax that already does this better?


