[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]
Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:54 PM:
--------------------------------------------------------------------

Hah, I had actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here|http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html], only just realized this was you):

{code}
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // for the $"..." column syntax; pre-imported in spark-shell

// a single JSON record containing an array of structs
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))

// one output row per school; drop the original array column afterwards
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I'm slightly puzzled by the convention of having to use `explode` inside an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: "if I wanted to select just a subset of columns, I could manually add a `select` to do so myself". Obviously changing user-facing APIs is bad, and not everyone will have identical expectations, but I'm curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing columns, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about, though, would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but at this point I wonder if perhaps I've overlooked existing functionality for that as well; a sketch of what I have in mind follows. :)
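Concretely, something one can presumably emulate today by zipping the arrays with a UDF and exploding once; a rough, untested sketch (the `df` and its `names`/`years` array columns are hypothetical):

{code}
import org.apache.spark.sql.functions._

// Zip two equally-sized array columns element-wise into one array of pairs
// (hypothetical columns: names: array<string>, years: array<bigint>).
val zipped = udf { (names: Seq[String], years: Seq[Long]) => names.zip(years) }

// A single explode over the zipped array keeps the elements aligned,
// where chaining two separate explodes would yield the n*m Cartesian product.
df.select($"*", explode(zipped($"names", $"years")).as("nameYear"))
  .drop("names").drop("years")
{code}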
> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode an array of structs (as are common in JSON) into their own rows so I can start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
>
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the `schemaFor` function to infer column types -- that approach is insufficient in the case of `Row`s, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema in such cases.
>
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though, so I'll probably have use for some feedback...
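For reference, a minimal sketch of the kind of call the description above reports as erroring, assuming Spark 1.5's typed `DataFrame.explode` API and the `peopleDf` from the comment above (untested; error message approximate):

{code}
import org.apache.spark.sql.Row

// The typed explode derives the output schema from the closure's result type
// via schemaFor; Row carries no type-level schema, so this fails at runtime:
peopleDf.explode[Seq[Row], Row]("schools", "school")(schools => schools)
// => java.lang.UnsupportedOperationException:
//    Schema for type org.apache.spark.sql.Row is not supported
{code}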