[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]
Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:54 PM:
--------------------------------------------------------------------

Hah, I had actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here|http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html], only just realized this was you):

{code}
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // for the $"..." column syntax; pre-imported in spark-shell

// a single JSON record containing an array of structs
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))

// one output row per school; drop the original array column afterwards
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I'm slightly puzzled by the convention of having to use `explode` inside an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: "if I wanted to select just a subset of columns, I could manually add a `select` to do so myself". Obviously changing user-facing APIs is bad, and not everyone will have identical expectations, but I'm curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing columns, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about, though, would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but at this point I wonder if perhaps I've overlooked existing functionality for that as well; a sketch of what I have in mind follows. :)
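Concretely, something one can presumably emulate today by zipping the arrays with a UDF and exploding once; a rough, untested sketch (the `df` and its `names`/`years` array columns are hypothetical):

{code}
import org.apache.spark.sql.functions._

// Zip two equally-sized array columns element-wise into one array of pairs
// (hypothetical columns: names: array<string>, years: array<bigint>).
val zipped = udf { (names: Seq[String], years: Seq[Long]) => names.zip(years) }

// A single explode over the zipped array keeps the elements aligned,
// where chaining two separate explodes would yield the n*m Cartesian product.
df.select($"*", explode(zipped($"names", $"years")).as("nameYear"))
  .drop("names").drop("years")
{code}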
> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode an array of structs (as are common in JSON) into their own rows so I can start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
>
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the `schemaFor` function to infer column types -- that approach is insufficient in the case of `Row`s, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema in such cases.
>
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though, so I'll probably have use for some feedback...
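For reference, a minimal sketch of the kind of call the description above reports as erroring, assuming Spark 1.5's typed `DataFrame.explode` API and the `peopleDf` from the comment above (untested; error message approximate):

{code}
import org.apache.spark.sql.Row

// The typed explode derives the output schema from the closure's result type
// via schemaFor; Row carries no type-level schema, so this fails at runtime:
peopleDf.explode[Seq[Row], Row]("schools", "school")(schools => schools)
// => java.lang.UnsupportedOperationException:
//    Schema for type org.apache.spark.sql.Row is not supported
{code}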