Re: What about additional support on deeply nested data?

2018-06-20 Thread Reynold Xin
Seems like you are also looking for transform and reduce for arrays?

https://issues.apache.org/jira/browse/SPARK-23908

https://issues.apache.org/jira/browse/SPARK-23911
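
Roughly the kind of thing they would enable, computed per row with no
explode/group-by round trip (a sketch only; the final syntax may differ):

// Sketch only: assumes the `traj` table from the quoted mail is registered,
// and that the functions land roughly as proposed in the JIRAs above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hof-sketch").getOrCreate()

// `aggregate` folds each row's array down to a single value;
// `transform` would instead map a lambda over the elements.
val perTrajAvg = spark.sql("""
  SELECT id,
         aggregate(points, CAST(0.0 AS DOUBLE),
                   (acc, p) -> acc + p.speed) / size(points) AS avg_speed
  FROM traj
""")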

On Wed, Jun 20, 2018 at 10:43 AM bobotu wrote:

> [quoted message trimmed; it appears in full below]


What about additional support on deeply nested data?

2018-06-20 Thread bobotu
I store some trajectory data in Parquet with this schema:

create table traj (
  id     string,
  points array<struct<
    lat:    double,
    lng:    double,
    time:   bigint,
    speed:  double,
    ...     (lots of sensor attributes here)
    candidate_roads: array<struct<linestring: string, score: double>>
  >>
)

The table contains a lot of attributes that come from sensors, and it also
has a nested array which holds information generated by the map-matching
algorithm.

All of the algorithms I run on this dataset are trajectory-oriented, which
means they often iterate over the points and use a subset of each point's
attributes in some computation. With this schema I can get the points of a
trajectory without doing `group by` and `collect_list`.
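
For illustration, a rough sketch of that access pattern (the path and the
trimmed-down case classes here are just for the example):

// Sketch: read the nested table and keep each trajectory's points together.
// "/data/traj" is a hypothetical path; the case classes list only the
// fields this example touches.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("traj-sketch").getOrCreate()
import spark.implicits._

case class Point(lat: Double, lng: Double, time: Long, speed: Double)
case class Traj(id: String, points: Seq[Point])

// One row per trajectory: iterate the points directly, no `group by`
// or `collect_list` needed.
val traj = spark.read.parquet("/data/traj").as[Traj]
val maxSpeed = traj.map(t =>
  (t.id, t.points.map(_.speed).foldLeft(0.0)((m, s) => math.max(m, s))))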
 
Because Parquet handles deeply nested data very well, I store the data in
Parquet directly, without flattening it. This works very well with Impala,
because Impala has special support for nested data:

select
  id,
  avg_speed
from 
  traj t,
  (select avg(speed) avg_speed from t.points where time < '2018-06-19')

As you can see, Impala treats an array of structs as a table nested inside
each row, and can run computations over the array elements at a per-row
level. Impala also uses Parquet's features to prune the unused attributes of
the point struct.
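
For contrast, the closest I get with the Spark DataFrame API today is an
explode plus a re-group (a sketch, reusing `traj` from the earlier sketch;
the timestamp literal is only illustrative):

import org.apache.spark.sql.functions.{avg, explode}

// Sketch of the same computation in Spark today: flatten the points,
// filter, then re-group by trajectory id -- the round trip the Impala
// syntax above avoids.
val avgSpeeds = traj
  .select($"id", explode($"points").as("p"))
  .where($"p.time" < 1529366400000L)      // illustrative epoch millis
  .groupBy($"id")
  .agg(avg($"p.speed").as("avg_speed"))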
 
I use Spark for some complex algorithms which cannot be written in pure SQL,
but I run into some trouble with the Spark DataFrame API:
1. Spark cannot do schema pruning or filter push-down on nested columns.
And there seems to be no handy syntax for working with deeply nested data.
2. `explode` does not help in my scenario, because I need to preserve the
trajectory-points hierarchy. If I used `explode` here, I would need an extra
`group by` on `id`.
3. Although I can directly select `points.lat`, the structure is lost. If I
need an array of (lat, lng) pairs, I have to zip two arrays, and this does
not work at deeper nesting levels, such as selecting
`points.candidate_road.score` (see the sketch after this list).
4. Maybe I could use parquet-mr to read the files as an RDD, setting the
read schema and pushing down filters myself. But that approach loses the
Hive integration and the table abstraction.
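
For point 3, this is the kind of zip workaround I mean (a sketch with a
plain UDF, since my Spark version has no built-in array zip):

import org.apache.spark.sql.functions.udf

// Re-pair the two flattened arrays into an array of (lat, lng) pairs.
// This already gets awkward, and has no equivalent at deeper levels
// such as points.candidate_road.score.
val zipLatLng = udf { (lats: Seq[Double], lngs: Seq[Double]) => lats.zip(lngs) }

val latLng = traj.select($"id", zipLatLng($"points.lat", $"points.lng").as("lat_lng"))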

So I think it would be nice to have some additional support for nested data:
maybe an Impala-style subquery syntax on complex data, or something like a
schema-projection function on nested data:

select id, extract(points, lat, lng, extract(candidate_road, score)) from traj

which would produce a schema like:

|- id string
|- points array of struct
   |- lat double
   |- lng double
   |- candidate_road array of struct
      |- score double

Then users could work with the points in the shape they want, with the data
pruning done inside Parquet.
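
The closest I can express today is a typed projection, though as noted above
Spark still reads the full point struct underneath (a sketch; `candidate_roads`
follows my real column name):

// Sketch: the Dataset below has exactly the desired shape, but the scan
// is not pruned -- Spark still materializes every attribute of the
// point struct before the encoder drops them.
case class Road(score: Double)
case class SlimPoint(lat: Double, lng: Double, candidate_roads: Seq[Road])
case class SlimTraj(id: String, points: Seq[SlimPoint])

val slim = spark.read.parquet("/data/traj").as[SlimTraj]  // hypothetical path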

Or is there some existing syntax that can already accomplish this?


