[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-01-03 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080669#comment-15080669
 ] 

Tycho Grouwstra commented on SPARK-3785:


>> For idea B, we need to write CUDA code

>> The motivation of idea 2 is to avoid writing hardware-dependent code

I thought that, unlike CUDA, OpenCL could be run on CPUs as well as GPUs?
(That being said, I'm under the impression CUDA is generally at the bleeding 
edge of innovation, so no objections here regardless.)


> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

2015-11-22 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra updated SPARK-11907:

Description: 
I like Spark, but one thing I find funny about it is how picky it is about 
circumstantial errors. For example, given the following:

{code}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
{code}

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong while the rest 
goes right, why throw out the baby with the bathwater when you could show both 
what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting 
values, not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

{code}
/**
 * A stackable trait used for building byte buffer for a column containing null values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *    .----------------------- Null count N (4 bytes)
 *    |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   | .----------------- Non-null elements
 *    V   V V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
{code}

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 
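For concreteness, the extended layout might look roughly like the following 
(purely a sketch of the idea; the exact widths, ordering, and encoding of the 
stored Throwables would still need figuring out):

{code}
 *    .----------------------- Null count N (4 bytes)
 *    |   .------------------- Null positions (4 x N bytes)
 *    |   |   .--------------- Error count M (4 bytes)
 *    |   |   |   .----------- Error positions (4 x M bytes)
 *    |   |   |   |   .------- Serialized Throwables (or error messages)
 *    |   |   |   |   | .----- Non-null, non-error elements
 *    V   V   V   V   V V
 *   +---+---+---+---+---+---------+
 *   |   |...|   |...|...| ... ... |
 *   +---+---+---+---+---+---------+
{code}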

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them, for that matter). Rather, I see use cases for both "do it right or bust" 
and the explorative "show me what happens if I try this operation on these 
values" -- not unlike how languages such as Ruby/Elixir distinguish unsafe 
methods using a bang ('!') from their safe variants that should not throw 
global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.
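
In the meantime, here is a minimal sketch of the kind of workaround I have in 
mind today: catch the exception inside the UDF and return a (value, error) pair, 
which Spark exposes as a struct column with nullable fields. The name `safeDiv` 
and the encoding are purely illustrative, not an existing API:

{code}
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.functions.{col, udf}

// Illustrative only: capture the failure per row instead of failing the job,
// and surface it next to the successful values.
val safeDiv = udf { (n: Int) =>
  Try(10 / n) match {
    case Success(v) => (Some(v.toDouble), Option.empty[String])   // good row: (value, null)
    case Failure(e) => (Option.empty[Double], Some(e.getMessage)) // bad row:  (null, "/ by zero")
  }
}

// Reusing `df` from the example above: the row with num = 0 now yields a
// null value plus its error message instead of aborting the whole job.
df.withColumn("div", safeDiv(col("num"))).show()
{code}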


  was:
I like Spark, but one thing I find funny about it is how picky it is about 
circumstantial errors. For example, given the following:

[code]
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
[/code]

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong while the rest 
goes right, why throw out the baby with the bathwater when you could show both 
what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting 
values, not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

[code]
/**
 * A stackable trait used for building byte buffer for a column containing null values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *    .----------------------- Null count N (4 bytes)
 *    |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   | .----------------- Non-null elements
 *    V   V V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
[/code]

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them, for that matter). Rather, I see use cases for both "do it right or bust" 
and the explorative "show me what happens if I try this operation on these 
values" -- not unlike how languages such as Ruby/Elixir distinguish unsafe 
methods using a bang ('!') from their safe variants that should not throw 
global exceptions.

[jira] [Created] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

2015-11-22 Thread Tycho Grouwstra (JIRA)
Tycho Grouwstra created SPARK-11907:
---

 Summary: Allowing errors as values in DataFrames (like 'Either 
Left/Right')
 Key: SPARK-11907
 URL: https://issues.apache.org/jira/browse/SPARK-11907
 Project: Spark
  Issue Type: Wish
  Components: SQL
Reporter: Tycho Grouwstra


I like Spark, but one thing I find funny about it is how picky it is about 
circumstantial errors. For example, given the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
```

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong while the rest 
goes right, why throw out the baby with the bathwater when you could show both 
what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting 
values, not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

```
/**
 * A stackable trait used for building byte buffer for a column containing null values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *    .----------------------- Null count N (4 bytes)
 *    |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   | .----------------- Non-null elements
 *    V   V V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
```

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them, for that matter). Rather, I see use cases for both "do it right or bust" 
and the explorative "show me what happens if I try this operation on these 
values" -- not unlike how languages such as Ruby/Elixir distinguish unsafe 
methods using a bang ('!') from their safe variants that should not throw 
global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.







[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

2015-11-22 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra updated SPARK-11907:

Description: 
I like Spark, but one thing I find funny about it is how picky it is about 
circumstantial errors. For example, given the following:

[code]
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
[/code]

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong while the rest 
goes right, why throw out the baby with the bathwater when you could show both 
what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting 
values, not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

[code]
/**
 * A stackable trait used for building byte buffer for a column containing null values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *    .----------------------- Null count N (4 bytes)
 *    |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   | .----------------- Non-null elements
 *    V   V V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
[/code]

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them, for that matter). Rather, I see use cases for both "do it right or bust" 
and the explorative "show me what happens if I try this operation on these 
values" -- not unlike how languages such as Ruby/Elixir distinguish unsafe 
methods using a bang ('!') from their safe variants that should not throw 
global exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.


  was:
I like Spark, but one thing I find funny about it is how picky it is about 
circumstantial errors. For example, given the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
```

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong while the rest 
goes right, why throw out the baby with the bathwater when you could show both 
what went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting 
values, not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

```
/**
 * A stackable trait used for building byte buffer for a column containing null values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *    .----------------------- Null count N (4 bytes)
 *    |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
 *    |   | .----------------- Non-null elements
 *    V   V V
 *   +---+-----+---------+
 *   |   | ... | ... ... |
 *   +---+-----+---------+
 * }}}
 */
```

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them, for that matter). Rather, I see use cases for both "do it right or bust" 
and the explorative "show me what happens if I try this operation on these 
values" -- not unlike how languages such as Ruby/Elixir distinguish unsafe 
methods using a bang ('!') from their safe variants that should not throw 
global exceptions.

I'm sort 

[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)
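
To illustrate the idea, a rough sketch of what I mean (hypothetical column names 
`xs`/`ys`, assumed to be equally-sized array<string> columns; `explodeZipped` 
itself does not exist, this just fakes it by exploding an index column once and 
indexing into each array):

```
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._

// zip-like explode: one output row per index, instead of the Cartesian
// product you'd get from chaining explode($"xs") and explode($"ys")
val indicesOf = udf { (xs: Seq[String]) => Seq.range(0, xs.size) }

val zipped = df
  .select($"*", explode(indicesOf($"xs")).as("i"))
  .select($"*", $"xs".getItem($"i").as("x"), $"ys".getItem($"i").as("y"))
  .drop("xs").drop("ys").drop("i")
```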



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):
```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would 

[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

val json = """{"name":"Michael", "schools":[{"sname":"stanford", 
"year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This 

[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980
 ] 

Tycho Grouwstra commented on SPARK-11431:
-

Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):
```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:53 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

val json = """{"name":"Michael", "schools":[{"sname":"stanford", 
"year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This 

[jira] [Updated] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra updated SPARK-11431:

Description: 
I am creating DataFrames from some [JSON 
data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode 
an array of structs (as are common in JSON) to their own rows so I could start 
analyzing the data using GraphX. I believe many others might have use for this 
as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to 
DataFrames by [~marmbrus], which currently errors when you call it on such 
arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
function to infer column types -- this approach is insufficient in the case of 
Rows, since their type does not contain the required info. The alternative here 
would be to instead grab the schema info from the existing schema for such 
cases.

I'm trying to implement a patch that might add this functionality, so stay 
tuned until I've figured that out. I'm new here though so I'll probably have 
use for some feedback...


  was:
I am creating DataFrames from some [JSON 
data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
explode an array of structs (as are common in JSON) to their own rows so I 
could start analyzing the data using GraphX. I believe many others might have 
use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to 
DataFrames by [~marmbrus], which currently errors when you call it on such 
arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
function to infer column types -- this approach is insufficient in the case of 
Rows, since their type does not contain the required info. The alternative here 
would be to instead grab the schema info from the existing schema for such 
cases.

I'm trying to implement a patch that might add this functionality, so stay 
tuned until I've figured that out. I'm new here though so I'll probably have 
use for some feedback...



> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:54 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here|http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html],
 only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled by the convention of having to use 
`explode` as part of an enclosing `select` statement; as an unwitting 
user, I'd expect `df.explode($"col")` to do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out. In my naivety, I'd wonder: 'if I wanted to select just a subset of 
columns, I could manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping the existing columns, a `drop` on the 
pre-`explode` column would often seem a sensible default to me as well; so, 
point taken that expectations may differ, in which case defaulting to whatever 
takes the least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This 

[jira] [Resolved] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra resolved SPARK-11431.
-
Resolution: Implemented

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Created] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-30 Thread Tycho Grouwstra (JIRA)
Tycho Grouwstra created SPARK-11431:
---

 Summary: Allow exploding arrays of structs in DataFrames
 Key: SPARK-11431
 URL: https://issues.apache.org/jira/browse/SPARK-11431
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Tycho Grouwstra


I am creating DataFrames from some [JSON 
data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
explode an array of structs (as are common in JSON) to their own rows so I 
could start analyzing the data using GraphX. I believe many others might have 
use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to 
DataFrames by [~marmbrus], which currently errors when you call it on such 
arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
function to infer column types -- this approach is insufficient in the case of 
Rows, since their type does not contain the required info. The alternative here 
would be to instead grab the schema info from the existing schema for such 
cases.

I'm trying to implement a patch that might add this functionality, so stay 
tuned until I've figured that out. I'm new here though so I'll probably have 
use for some feedback...







[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-30 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983488#comment-14983488
 ] 

Tycho Grouwstra commented on SPARK-11431:
-

One might wonder if the similar-sounding issue SPARK-7734 ("make explode 
support struct type") is related to this. That one concerned splitting structs 
into multiple columns though. That's relevant here as well, but the issue here 
pertains to splitting arrays over rows instead (as in the existing `explode` 
function).

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2015-02-28 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341974#comment-14341974
 ] 

Tycho Grouwstra commented on SPARK-3785:


Hm, I tried commenting a bit earlier but it seems it failed.

I was wondering: it seems 
[ArrayFire](http://www.arrayfire.com/docs/group__arrayfire__func.htm) has already 
parallelized a number of mathematical/reduction functions for C(++) arrays. If 
Spark RDDs/DataFrames expose some array interface for columns, might it be 
possible to use those through JNI? Not sure there'd be tangible performance 
gains without using APUs, but it seemed interesting to me.
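
To sketch what I mean by the JNI route (everything here is hypothetical: 
`arrayfire_bridge` / `nativeSum` would be a thin C wrapper one would still have 
to write around ArrayFire, not an existing binding):

{code}
object NativeOps {
  System.loadLibrary("arrayfire_bridge")   // hypothetical libarrayfire_bridge.so on java.library.path
  @native def nativeSum(values: Array[Double]): Double
}

// Hand each partition across the JNI boundary as one primitive array, so the
// native (and potentially GPU) call is amortized over many rows rather than
// paying the crossing cost per element.
val partialSums = rdd.mapPartitions { iter =>       // rdd: RDD[Double], e.g. a numeric column
  Iterator.single(NativeOps.nativeSum(iter.toArray))
}
val total = partialSums.sum()
{code}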


 Support off-loading computations to a GPU
 -

 Key: SPARK-3785
 URL: https://issues.apache.org/jira/browse/SPARK-3785
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor

 Are there any plans to adding support for off-loading computations to the 
 GPU, e.g. via an open-cl binding? 
 http://www.jocl.org/
 https://code.google.com/p/javacl/
 http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU

2015-02-28 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341974#comment-14341974
 ] 

Tycho Grouwstra edited comment on SPARK-3785 at 3/1/15 6:32 AM:


I was wondering: it seems 
[ArrayFire|http://www.arrayfire.com/docs/group__arrayfire__func.htm] has already 
parallelized a number of mathematical/reduction functions for C(++) arrays. If 
Spark RDDs/DataFrames expose some array interface for columns, might it be 
possible to use those through JNI? Not sure there'd be tangible performance 
gains without using APUs, but it seemed interesting to me.


was (Author: tycho01):
Hm, I tried commenting a bit earlier but it seems it failed.

I was wondering: it seems 
[ArrayFire](http://www.arrayfire.com/docs/group__arrayfire__func.htm) has already 
parallelized a number of mathematical/reduction functions for C(++) arrays. If 
Spark RDDs/DataFrames expose some array interface for columns, might it be 
possible to use those through JNI? Not sure there'd be tangible performance 
gains without using APUs, but it seemed interesting to me.


 Support off-loading computations to a GPU
 -

 Key: SPARK-3785
 URL: https://issues.apache.org/jira/browse/SPARK-3785
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor

 Are there any plans to adding support for off-loading computations to the 
 GPU, e.g. via an open-cl binding? 
 http://www.jocl.org/
 https://code.google.com/p/javacl/
 http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL


