[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080669#comment-15080669 ]

Tycho Grouwstra commented on SPARK-3785:
----------------------------------------

>> For idea B, we need to write CUDA code
>> The motivation of idea 2 is to avoid writing hardware-dependent code

I thought that, unlike CUDA, OpenCL could run on CPUs as well as GPUs? (That said, I'm under the impression CUDA is generally at the bleeding edge of innovation, so no objections here regardless.)

> Support off-loading computations to a GPU
> ------------------------------------------
>
>                 Key: SPARK-3785
>                 URL: https://issues.apache.org/jira/browse/SPARK-3785
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: MLlib
>            Reporter: Thomas Darimont
>            Priority: Minor
>
> Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding?
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')
[ https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tycho Grouwstra updated SPARK-11907:
------------------------------------
Description:

I like Spark, but one thing I find funny about it is how picky it is about circumstantial errors. For one, given the following:

{code}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
{code}

... the job fails with a `java.lang.ArithmeticException: / by zero`. The example is trivial, but my point is: if one thing goes wrong while the rest goes right, why throw out the baby with the bathwater when you could show both what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, not unlike how one might store 'bad' results using Either Left/Right constructions in Scala/Haskell (which I suppose would not currently work in DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?). NullableColumnBuilder currently explains its workings as follows:

{code}
/**
 * A stackable trait used for building byte buffer for a column containing null values. Memory
 * layout of the final byte buffer is:
 * {{{
 *    .------------------- Null count N (4 bytes)
 *    |   .--------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   |     .--------- Non-null elements
 *    V   V     V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
{code}

This might be extended by adding a further section storing Throwables (or null) for the bad values in question (alternatively: store their count/positions separately from the null ones, so null values would not need to be stored).

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching them, for that matter). Rather, I see use cases both for "do it right or bust" and for the explorative "show me what happens if I try this operation on these values" -- not unlike how languages like Ruby/Elixir distinguish unsafe methods using a bang ('!') from their safe variants that should not throw global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.

was:

I like Spark, but one thing I find funny about it is how picky it is about circumstantial errors. For one, given the following:

[code]
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
[/code]

... the job fails with a `java.lang.ArithmeticException: / by zero`. The example is trivial, but my point is: if one thing goes wrong while the rest goes right, why throw out the baby with the bathwater when you could show both what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, not unlike how one might store 'bad' results using Either Left/Right constructions in Scala/Haskell (which I suppose would not currently work in DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?). NullableColumnBuilder currently explains its workings as follows:

[code]
/**
 * A stackable trait used for building byte buffer for a column containing null values. Memory
 * layout of the final byte buffer is:
 * {{{
 *    .------------------- Null count N (4 bytes)
 *    |   .--------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   |     .--------- Non-null elements
 *    V   V     V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
[/code]

This might be extended by adding a further section storing Throwables (or null) for the bad values in question (alternatively: store their count/positions separately from the null ones, so null values would not need to be stored).

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching them, for that matter). Rather, I see use cases both for "do it right or bust" and for the explorative "show me what happens if I try this operation on these values" -- not unlike how languages like Ruby/Elixir distinguish unsafe methods using a bang ('!') from their safe variants that should not throw global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.
[jira] [Created] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')
Tycho Grouwstra created SPARK-11907:
---------------------------------------

             Summary: Allowing errors as values in DataFrames (like 'Either Left/Right')
                 Key: SPARK-11907
                 URL: https://issues.apache.org/jira/browse/SPARK-11907
             Project: Spark
          Issue Type: Wish
          Components: SQL
            Reporter: Tycho Grouwstra

I like Spark, but one thing I find funny about it is how picky it is about circumstantial errors. For one, given the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
```

... the job fails with a `java.lang.ArithmeticException: / by zero`. The example is trivial, but my point is: if one thing goes wrong while the rest goes right, why throw out the baby with the bathwater when you could show both what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, not unlike how one might store 'bad' results using Either Left/Right constructions in Scala/Haskell (which I suppose would not currently work in DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?). NullableColumnBuilder currently explains its workings as follows:

```
/**
 * A stackable trait used for building byte buffer for a column containing null values. Memory
 * layout of the final byte buffer is:
 * {{{
 *    .------------------- Null count N (4 bytes)
 *    |   .--------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   |     .--------- Non-null elements
 *    V   V     V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
```

This might be extended by adding a further section storing Throwables (or null) for the bad values in question (alternatively: store their count/positions separately from the null ones, so null values would not need to be stored).

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching them, for that matter). Rather, I see use cases both for "do it right or bust" and for the explorative "show me what happens if I try this operation on these values" -- not unlike how languages like Ruby/Elixir distinguish unsafe methods using a bang ('!') from their safe variants that should not throw global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.
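For context, a minimal workaround sketch under the existing API: wrapping the UDF body in scala.util.Try maps failing rows to nulls instead of failing the whole job -- though it discards the Throwable itself, which is precisely the information this proposal wants to keep per cell. The `safeDiv` helper below is hypothetical, reusing `sc`, `sqlContext`, and the `df` built in the example above:

{code}
import scala.util.Try
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical safeDiv: evaluate the division inside Try, so a failing
// row yields null rather than aborting the job. The caught Throwable is
// dropped here -- the very thing this issue proposes to retain.
val safeDiv = udf[java.lang.Double, Integer] { n =>
  Try(10 / n).toOption.map(i => java.lang.Double.valueOf(i.toDouble)).orNull
}

df.withColumn("div", safeDiv(col("num"))).show()
// the (0, "d") row now shows null under "div" instead of raising
// java.lang.ArithmeticException: / by zero
{code}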
[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')
[ https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tycho Grouwstra updated SPARK-11907:
------------------------------------
Description:

I like Spark, but one thing I find funny about it is how picky it is about circumstantial errors. For one, given the following:

[code]
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
[/code]

... the job fails with a `java.lang.ArithmeticException: / by zero`. The example is trivial, but my point is: if one thing goes wrong while the rest goes right, why throw out the baby with the bathwater when you could show both what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, not unlike how one might store 'bad' results using Either Left/Right constructions in Scala/Haskell (which I suppose would not currently work in DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?). NullableColumnBuilder currently explains its workings as follows:

[code]
/**
 * A stackable trait used for building byte buffer for a column containing null values. Memory
 * layout of the final byte buffer is:
 * {{{
 *    .------------------- Null count N (4 bytes)
 *    |   .--------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   |     .--------- Non-null elements
 *    V   V     V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
[/code]

This might be extended by adding a further section storing Throwables (or null) for the bad values in question (alternatively: store their count/positions separately from the null ones, so null values would not need to be stored).

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching them, for that matter). Rather, I see use cases both for "do it right or bust" and for the explorative "show me what happens if I try this operation on these values" -- not unlike how languages like Ruby/Elixir distinguish unsafe methods using a bang ('!') from their safe variants that should not throw global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.

was:

I like Spark, but one thing I find funny about it is how picky it is about circumstantial errors. For one, given the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
```

... the job fails with a `java.lang.ArithmeticException: / by zero`. The example is trivial, but my point is: if one thing goes wrong while the rest goes right, why throw out the baby with the bathwater when you could show both what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, not unlike how one might store 'bad' results using Either Left/Right constructions in Scala/Haskell (which I suppose would not currently work in DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?). NullableColumnBuilder currently explains its workings as follows:

```
/**
 * A stackable trait used for building byte buffer for a column containing null values. Memory
 * layout of the final byte buffer is:
 * {{{
 *    .------------------- Null count N (4 bytes)
 *    |   .--------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   |     .--------- Non-null elements
 *    V   V     V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
```

This might be extended by adding a further section storing Throwables (or null) for the bad values in question (alternatively: store their count/positions separately from the null ones, so null values would not need to be stored).

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching them, for that matter). Rather, I see use cases both for "do it right or bust" and for the explorative "show me what happens if I try this operation on these values" -- not unlike how languages like Ruby/Elixir distinguish unsafe methods using a bang ('!') from their safe variants that should not throw global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.
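To visualize that suggestion (purely a sketch -- no such builder exists in Spark), the NullableColumnBuilder layout quoted above might gain an error section along these lines:

{code}
/**
 * Hypothetical ErrorableColumnBuilder layout: the NullableColumnBuilder
 * scheme, extended with a section recording error positions and the
 * serialized Throwables raised for those cells.
 * {{{
 *    .---------------------------- Null count N (4 bytes)
 *    |   .------------------------ Null positions (4 x N bytes)
 *    |   |   .-------------------- Error count M (4 bytes)
 *    |   |   |   .---------------- Error positions (4 x M bytes)
 *    |   |   |   |     .---------- Serialized Throwables (M length-prefixed blobs)
 *    |   |   |   |     |      .--- Non-null, non-error elements
 *    V   V   V   V     V      V
 *   +---+---+---+---+------+---------+
 *   |   |...|   |...| ...  | ... ... |
 *   +---+---+---+---+------+---------+
 * }}}
 */
{code}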
[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:
--------------------------------------------------------------------

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

was (Author: tycho01):

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
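For what it's worth, the `explodeZipped` idea could be approximated in user code: zip the equally-sized arrays into a single array of structs with a UDF, then explode once. A sketch only -- `zipArrays` is a made-up helper, and it presumes `explode` accepts arrays of structs, the very subject of this issue:

{code}
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._

// Hypothetical input: each row holds two equally-sized arrays.
val zdf = sqlContext.createDataFrame(Seq(
  (Seq(1, 2, 3), Seq("a", "b", "c"))
)).toDF("nums", "lets")

// Made-up zipArrays helper: pair the arrays element-wise, so a single
// explode yields one row per index, rather than the Cartesian product
// that two chained explodes would produce.
val zipArrays = udf { (xs: Seq[Int], ys: Seq[String]) => xs.zip(ys) }

zdf.select($"*", explode(zipArrays($"nums", $"lets")).as("pair"))
   .select($"pair._1".as("num"), $"pair._2".as("let"))
   .show()
// +---+---+
// |num|let|
// +---+---+
// |  1|  a|
// |  2|  b|
// |  3|  c|
// +---+---+
{code}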
[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:
--------------------------------------------------------------------

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

was (Author: tycho01):

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]

Tycho Grouwstra commented on SPARK-11431:
------------------------------------------

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:53 PM:
--------------------------------------------------------------------

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

was (Author: tycho01):

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Updated] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tycho Grouwstra updated SPARK-11431:
------------------------------------
Description:

I am creating DataFrames from some [JSON data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.

I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...

was:

I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.

I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ]

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:54 PM:
--------------------------------------------------------------------

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here|http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html] -- only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

was (Author: tycho01):

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html) -- only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing column, to me a `drop` of the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about would be an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Resolved] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tycho Grouwstra resolved SPARK-11431.
-------------------------------------
    Resolution: Implemented

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Created] (SPARK-11431) Allow exploding arrays of structs in DataFrames
Tycho Grouwstra created SPARK-11431:
---------------------------------------

             Summary: Allow exploding arrays of structs in DataFrames
                 Key: SPARK-11431
                 URL: https://issues.apache.org/jira/browse/SPARK-11431
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Tycho Grouwstra

I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.

I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983488#comment-14983488 ]

Tycho Grouwstra commented on SPARK-11431:
------------------------------------------

One might wonder whether the similar-sounding issue SPARK-7734 ("make explode support struct type") is related to this. That one concerned splitting structs into multiple columns, though. That's relevant here as well, but the issue here pertains to splitting arrays over rows instead (as in the existing `explode` function).

> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) into their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341974#comment-14341974 ]

Tycho Grouwstra commented on SPARK-3785:
----------------------------------------

Hm, tried commenting a bit earlier but it seems it failed. I was wondering: it seems [ArrayFire](http://www.arrayfire.com/docs/group__arrayfire__func.htm) has already parallelized a number of mathematical/reduction functions for C(++) arrays. If Spark RDDs/DataFrames expose some array interface for columns, might it be possible to use those through JNI? Not sure there'd be tangible performance gains without using APUs, but it seemed interesting to me.

> Support off-loading computations to a GPU
> ------------------------------------------
>
>                 Key: SPARK-3785
>                 URL: https://issues.apache.org/jira/browse/SPARK-3785
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: MLlib
>            Reporter: Thomas Darimont
>            Priority: Minor
>
> Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding?
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
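To make the JNI idea concrete, here is a sketch of the kind of shim that would be involved. Every name is hypothetical ("afshim", `sumNative`, a DataFrame `df` with an integer column "num") -- standing in for a thin C wrapper built against ArrayFire, not an actual binding; on a real cluster the native library would also have to be present and loaded on every executor:

{code}
// Hypothetical JNI shim: libafshim.so would wrap an ArrayFire reduction.
object AfShim {
  System.loadLibrary("afshim")  // resolves libafshim.so via java.library.path
  @native def sumNative(xs: Array[Double]): Double
}

// Offload a reduction per partition: copy each partition's column values
// into a primitive array, hand it across JNI in a single call, then
// combine the partial results back on the JVM side.
val total = df.select("num").rdd
  .mapPartitions { rows =>
    val arr = rows.map(_.getInt(0).toDouble).toArray
    Iterator(AfShim.sumNative(arr))
  }
  .reduce(_ + _)
{code}

The per-partition batching matters here: one JNI crossing per partition amortizes the copy overhead, which is otherwise likely to swamp any GPU speedup on discrete (non-APU) hardware.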
[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341974#comment-14341974 ]

Tycho Grouwstra edited comment on SPARK-3785 at 3/1/15 6:32 AM:
----------------------------------------------------------------

I was wondering: it seems [ArrayFire|http://www.arrayfire.com/docs/group__arrayfire__func.htm] has already parallelized a number of mathematical/reduction functions for C(++) arrays. If Spark RDDs/DataFrames expose some array interface for columns, might it be possible to use those through JNI? Not sure there'd be tangible performance gains without using APUs, but it seemed interesting to me.

was (Author: tycho01):

Hm, tried commenting a bit earlier but it seems it failed. I was wondering: it seems [ArrayFire](http://www.arrayfire.com/docs/group__arrayfire__func.htm) has already parallelized a number of mathematical/reduction functions for C(++) arrays. If Spark RDDs/DataFrames expose some array interface for columns, might it be possible to use those through JNI? Not sure there'd be tangible performance gains without using APUs, but it seemed interesting to me.

> Support off-loading computations to a GPU
> ------------------------------------------
>
>                 Key: SPARK-3785
>                 URL: https://issues.apache.org/jira/browse/SPARK-3785
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: MLlib
>            Reporter: Thomas Darimont
>            Priority: Minor
>
> Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding?
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL