[jira] [Created] (SPARK-24595) What about additional support on deeply nested column?

Zejun Li (JIRA) Tue, 19 Jun 2018 08:07:20 -0700

Zejun Li created SPARK-24595:
--------------------------------

             Summary: What about additional support on deeply nested column?
                 Key: SPARK-24595
                 URL: https://issues.apache.org/jira/browse/SPARK-24595
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Zejun Li



I store trajectory data in Parquet with this schema:
{code:java}
create table traj(
  id     string,
  points array<struct<
            lat:   double,
            lng:   double,
            time:  bigint,
            speed: double,
            ... lots attributes here
            candidate_road: array<struct<linestring: string, score: double>>
           >>
){code}
It contains a lot of attributes that come from sensors. It also has a nested 
array that holds information generated by the map-matching algorithm.

 

All of the algorithms I run on this dataset are trajectory-oriented: they 
usually iterate over the points and use a subset of each point's attributes in 
some computation. With this schema I can get the points of a trajectory without 
doing `group by` and `collect_list`.

 

Because Parquet works very well with deeply nested data, I store the data 
directly in Parquet format without flattening it.

This works very well with Impala, because Impala has special support for 
nested data:

 
{code:java}
select
  id,
  avg_speed
from 
  traj t,
  (select avg(speed) avg_speed from t.points where time < '2018-06-19'){code}
As you can see, Impala treats an array of structs as a nested table and can do 
computations on the array elements at a per-row level. Impala also uses 
Parquet's features to prune unused attributes of the point struct.
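
To illustrate the per-row semantics described above, here is a minimal sketch 
in plain Python (it is not Impala or Spark code). The sample rows, the 
`avg_speed_per_traj` helper and the numeric `cutoff` are hypothetical and only 
serve the illustration.

```python
from statistics import mean

# Hypothetical sample rows shaped like the traj table (only two attributes shown).
trajs = [
    {"id": "a", "points": [{"time": 1, "speed": 10.0},
                           {"time": 2, "speed": 20.0},
                           {"time": 9, "speed": 99.0}]},
    {"id": "b", "points": [{"time": 1, "speed": 5.0}]},
]

def avg_speed_per_traj(rows, cutoff):
    """For each row, average `speed` over that row's own points with time < cutoff."""
    out = []
    for row in rows:
        speeds = [p["speed"] for p in row["points"] if p["time"] < cutoff]
        # AVG over an empty set is NULL in SQL; None stands in for it here.
        out.append((row["id"], mean(speeds) if speeds else None))
    return out

result = avg_speed_per_traj(trajs, cutoff=5)
```

Each trajectory keeps its own row, and the aggregate only sees the points that 
belong to that row, so no `group by` on `id` is involved.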

 

I use Spark for some complex algorithms that cannot be written in pure SQL. But 
I have run into some trouble with the Spark DataFrame API:

Spark cannot do schema pruning and filter push-down on nested columns. And 
there seems to be no handy syntax for working with deeply nested data.
 * `explode` does not help in my scenario, because I need to preserve the 
trajectory-points hierarchy. If I use `explode` here, I need to do an extra 
`group by` on `id`.
 * Although I can select `points.lat` directly, the result loses its structure. 
If I need an array of (lat, lng) pairs, I have to zip two arrays. And it does 
not work at deeper nesting levels, such as selecting 
`points.candidate_road.score`. 
 * Maybe I can use the parquet-mr package to read the files as an RDD and pass 
the read schema to it directly. But this approach loses Hive integration and 
Spark's vectorized reader.
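
The loss of structure mentioned in the second point can be sketched in plain 
Python (not Spark code; the sample row is hypothetical). Selecting one 
attribute at a time yields parallel arrays, and the pairs have to be rebuilt by 
position:

```python
# Hypothetical row shaped like the traj table (only three attributes shown).
row = {"id": "a", "points": [{"lat": 1.0, "lng": 2.0, "speed": 3.0},
                             {"lat": 4.0, "lng": 5.0, "speed": 6.0}]}

lats = [p["lat"] for p in row["points"]]  # analogous to selecting points.lat
lngs = [p["lng"] for p in row["points"]]  # analogous to selecting points.lng

# The struct is gone, so (lat, lng) pairs are reassembled by array position.
pairs = list(zip(lats, lngs))
```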

 

So I think it would be nice to have an Impala-style subquery syntax for complex 
data. Or could we add some support for schema projection on nested data, like:

 
{code:java}
select id, extract(points, lat, lng, extract(candidate_road, score)) from 
traj{code}
which produces this schema:

 

 
{code:java}
|- id string
|- points array of struct
    |- lat double
    |- lng double
    |- candidate_road array of struct
        |- score double{code}
The user can then work with points in the desired schema, with the unused data 
pruned in Parquet.
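
To make the intended result of the proposed `extract` concrete, here is a 
minimal sketch in plain Python. It only shows the shape of the projected data, 
not Parquet column pruning, and it is not an existing Spark function. The 
calling convention, where a nested projection is written as a 
`(name, sub_fields)` pair, is an assumption made for this illustration.

```python
def extract(structs, *fields):
    """Project each struct in `structs` onto `fields`.

    A field is either a name (the value is kept as-is) or a
    (name, sub_fields) pair, which projects a nested array of
    structs recursively.
    """
    result = []
    for s in structs:
        projected = {}
        for f in fields:
            if isinstance(f, tuple):
                name, sub_fields = f
                projected[name] = extract(s[name], *sub_fields)
            else:
                projected[f] = s[f]
        result.append(projected)
    return result

# Hypothetical points array of one trajectory.
points = [
    {"lat": 1.0, "lng": 2.0, "time": 100, "speed": 3.5,
     "candidate_road": [{"linestring": "LINESTRING(0 0, 1 1)", "score": 0.9}]},
]

# Mirrors extract(points, lat, lng, extract(candidate_road, score)).
pruned = extract(points, "lat", "lng", ("candidate_road", ("score",)))
```

The array-of-struct hierarchy is preserved at both levels; only the attributes 
that were not requested are dropped.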

 

 

Or is there some existing syntax that would do this?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
