[jira] [Resolved] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11117.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9305
[https://github.com/apache/spark/pull/9305]

> PhysicalRDD.outputsUnsafeRows should return true when the underlying data 
> source produces UnsafeRows
> 
>
> Key: SPARK-11117
> URL: https://issues.apache.org/jira/browse/SPARK-11117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.6.0
>
>
> {{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus 
> can't avoid {{ConvertToUnsafe}} when upper level operators only support 
> {{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.
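For illustration, here is a minimal Scala sketch of the idea behind the fix (not the actual patch from PR 9305; {{ExampleScan}} and its fields are hypothetical): a leaf physical operator can report the row format of its underlying relation by overriding {{outputsUnsafeRows}}, so the planner only inserts {{ConvertToUnsafe}} when the source really emits safe rows.

{code}
// Hypothetical sketch against the Spark 1.5/1.6 execution API, not the merged change.
// Placed in Spark's execution package so it can extend the private[sql] LeafNode.
package org.apache.spark.sql.execution

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute

private[sql] case class ExampleScan(
    output: Seq[Attribute],
    rdd: RDD[InternalRow],
    relationOutputsUnsafeRows: Boolean) extends LeafNode {

  // Propagate the data source's row format instead of the default `false`,
  // so the planner only adds ConvertToUnsafe when the source emits safe rows.
  override def outputsUnsafeRows: Boolean = relationOutputsUnsafeRows

  protected override def doExecute(): RDD[InternalRow] = rdd
}
{code}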






[jira] [Resolved] (SPARK-11345) Make HadoopFsRelation always output UnsafeRow

2015-10-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11345.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9305
[https://github.com/apache/spark/pull/9305]

> Make HadoopFsRelation always output UnsafeRow
> --
>
> Key: SPARK-11345
> URL: https://issues.apache.org/jira/browse/SPARK-11345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.6.0
>
>







[jira] [Comment Edited] (SPARK-10158) ALS should print better errors when given Long IDs

2015-10-31 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983644#comment-14983644
 ] 

Bryan Cutler edited comment on SPARK-10158 at 10/31/15 7:05 AM:


I think the best way to handle this from the PySpark side is to add something 
like the following to {{ALS._prepare}} 
([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215])
 which is called before training

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
  raise ValueError("Rating IDs must be less than max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But any operations on the data are probably not worth the hit for this issue

Edit: I meant the above as an alternative to checking values for 2^31 
explicitly, which could be done in the Ratings constructor but seems like too 
much of a hack to me


was (Author: bryanc):
The only way I can see handling this from the PySpark side is to add something 
like the following to {{ALS._prepare}} 
([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215])
 which is called before training

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
  raise ValueError("Rating IDs must be less than max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But any operations on the data are probably not worth the hit for this issue

> ALS should print better errors when given Long IDs
> --
>
> Key: SPARK-10158
> URL: https://issues.apache.org/jira/browse/SPARK-10158
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See [SPARK-10115] for the very confusing messages you get when you try to use 
> ALS with Long IDs.  We should catch and identify these errors and print 
> meaningful error messages.
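As a Scala-side illustration of the kind of check being asked for (a hedged sketch only, not the eventual fix; {{toIntId}} and {{toRatings}} are hypothetical helpers), IDs can be validated to fit in an {{Int}} before building MLlib {{Rating}}s, so users see a clear message instead of a confusing serialization error:

{code}
// Hypothetical validation sketch; MLlib's ALS itself only accepts Int IDs.
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

def toIntId(id: Long, field: String): Int = {
  require(id >= Int.MinValue && id <= Int.MaxValue,
    s"ALS only supports integer $field IDs; got $id, which does not fit in an Int")
  id.toInt
}

def toRatings(raw: RDD[(Long, Long, Double)]): RDD[Rating] =
  raw.map { case (user, product, value) =>
    Rating(toIntId(user, "user"), toIntId(product, "product"), value)
  }
{code}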






[jira] [Assigned] (SPARK-11436) we should rebind right encoder when joining 2 datasets

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11436:


Assignee: Apache Spark

> we should rebind right encoder when joining 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-11436) we should rebind right encoder when joining 2 datasets

2015-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983965#comment-14983965
 ] 

Apache Spark commented on SPARK-11436:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9391

> we should rebind right encoder when joining 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-11436) we should rebind right encoder when joining 2 datasets

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11436:


Assignee: (was: Apache Spark)

> we should rebind right encoder when joining 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):
```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would 

[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

val json = """{"name":"Michael", "schools":[{"sname":"stanford", 
"year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This 

[jira] [Assigned] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10500:


Assignee: Apache Spark

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Assignee: Apache Spark
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with 
> an R application, which fails if Spark has been installed into a directory to 
> which the current user doesn't have write permissions. (e.g., on EMR's 
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only 
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That 
> would enable sparkr.zip to be pre-created by the root user and then reused 
> each time spark-submit is run, which I believe is similar to how pyspark 
> works.
> Another option would be to make the location configurable, as it's currently 
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to 
> something like the user's home directory or a random path in /tmp would get 
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every 
> time a new R application is launched? This seems unnecessary and inefficient, 
> unless you are actively developing the SparkR libraries and expect the 
> contents of sparkr.zip to change.
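A minimal Scala sketch of the "skip if it already exists" idea (illustrative only; {{ensureSparkRZip}} and the build callback are hypothetical, not Spark's actual packaging code):

{code}
// Hypothetical guard: reuse a pre-built sparkr.zip instead of recreating it,
// so a read-only, root-owned Spark install still works for ordinary users.
import java.io.File

def ensureSparkRZip(sparkHome: String)(build: File => Unit): File = {
  val zip = new File(sparkHome, "R/lib/sparkr.zip")
  if (!zip.exists()) {
    build(zip)  // only create the archive when it is missing
  }
  zip
}
{code}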






[jira] [Created] (SPARK-11436) we should rebind right encoder when joining 2 datasets

2015-10-31 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11436:
---

 Summary: we should rebind right encoder when joining 2 datasets
 Key: SPARK-11436
 URL: https://issues.apache.org/jira/browse/SPARK-11436
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-11226) Empty line in json file should be skipped

2015-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11226.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9211
[https://github.com/apache/spark/pull/9211]

> Empty line in json file should be skipped
> -
>
> Key: SPARK-11226
> URL: https://issues.apache.org/jira/browse/SPARK-11226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently an empty line in a JSON file is parsed into a Row with all null 
> field values.
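For users on affected versions, a hedged workaround sketch (paths illustrative, assuming a spark-shell style {{sc}} and {{sqlContext}}) is to drop blank lines before handing the text to the JSON reader:

{code}
// Workaround sketch: filter blank lines so they cannot become all-null Rows.
val raw = sc.textFile("/path/to/data.json")
val df = sqlContext.read.json(raw.filter(_.trim.nonEmpty))
df.show()
{code}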






[jira] [Updated] (SPARK-11226) Empty line in json file should be skipped

2015-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11226:
--
Assignee: Jeff Zhang

> Empty line in json file should be skipped
> -
>
> Key: SPARK-11226
> URL: https://issues.apache.org/jira/browse/SPARK-11226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently an empty line in a JSON file is parsed into a Row with all null 
> field values.






[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983945#comment-14983945
 ] 

Michael Armbrust commented on SPARK-11431:
--

Have you looked at the explode that works on a column?

{code}
import org.apache.spark.sql.functions._
df.select(explode($"arrayOfStructs"))
{code}
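As a usage follow-up (a sketch reusing the {{peopleDf}} built in the comment snippets elsewhere in this thread, assuming a spark-shell session with {{sqlContext}} implicits in scope), the exploded struct's fields can then be reached with dot notation:

{code}
// Explode the array of structs, then select individual struct fields.
import org.apache.spark.sql.functions._

val exploded = peopleDf.select($"name", explode($"schools").as("school"))
exploded.select($"name", $"school.sname", $"school.year").show()
{code}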

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Assigned] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10500:


Assignee: (was: Apache Spark)

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with 
> an R application, which fails if Spark has been installed into a directory to 
> which the current user doesn't have write permissions. (e.g., on EMR's 
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only 
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That 
> would enable sparkr.zip to be pre-created by the root user and then reused 
> each time spark-submit is run, which I believe is similar to how pyspark 
> works.
> Another option would be to make the location configurable, as it's currently 
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to 
> something like the user's home directory or a random path in /tmp would get 
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every 
> time a new R application is launched? This seems unnecessary and inefficient, 
> unless you are actively developing the SparkR libraries and expect the 
> contents of sparkr.zip to change.






[jira] [Commented] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable

2015-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983925#comment-14983925
 ] 

Apache Spark commented on SPARK-10500:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/9390

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with 
> an R application, which fails if Spark has been installed into a directory to 
> which the current user doesn't have write permissions. (e.g., on EMR's 
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only 
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That 
> would enable sparkr.zip to be pre-created by the root user and then reused 
> each time spark-submit is run, which I believe is similar to how pyspark 
> works.
> Another option would be to make the location configurable, as it's currently 
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to 
> something like the user's home directory or a random path in /tmp would get 
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every 
> time a new R application is launched? This seems unnecessary and inefficient, 
> unless you are actively developing the SparkR libraries and expect the 
> contents of sparkr.zip to change.






[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980
 ] 

Tycho Grouwstra commented on SPARK-11431:
-

Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):
```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:53 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

val json = """{"name":"Michael", "schools":[{"sname":"stanford", 
"year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This 

[jira] [Updated] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra updated SPARK-11431:

Description: 
I am creating DataFrames from some [JSON 
data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode 
an array of structs (as are common in JSON) to their own rows so I could start 
analyzing the data using GraphX. I believe many others might have use for this 
as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to 
DataFrames by [~marmbrus], which currently errors when you call it on such 
arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
function to infer column types -- this approach is insufficient in the case of 
Rows, since their type does not contain the required info. The alternative here 
would be to instead grab the schema info from the existing schema for such 
cases.

I'm trying to implement a patch that might add this functionality, so stay 
tuned until I've figured that out. I'm new here though so I'll probably have 
use for some feedback...


  was:
I am creating DataFrames from some [JSON 
data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
explode an array of structs (as are common in JSON) to their own rows so I 
could start analyzing the data using GraphX. I believe many others might have 
use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to 
DataFrames by [~marmbrus], which currently errors when you call it on such 
arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
function to infer column types -- this approach is insufficient in the case of 
Rows, since their type does not contain the required info. The alternative here 
would be to instead grab the schema info from the existing schema for such 
cases.

I'm trying to implement a patch that might add this functionality, so stay 
tuned until I've figured that out. I'm new here though so I'll probably have 
use for some feedback...



> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...






[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980
 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:54 PM:


Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here|http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html],
 only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)



was (Author: tycho01):
Hah, I actually missed that, so my bad, thanks!

What I'm doing now then is along the lines of the following (example contrived 
from 
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html),
 only just realized this was you):

{code}
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, 
{"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}

I must say I've felt slightly puzzled with the convention of having to use 
`explode` as part of an embedding `select` statement though; as an unwitting 
user, I'd feel `df.explode($"col")` should do something functionally equivalent 
to the current `df.select($"*", explode($"col"))` without having to type that 
out though. In my naivety, I'd wonder 'if I wanted to also select just a subset 
of columns, I could just manually add a `select` to do so myself'.

Obviously, changing user APIs is bad, and not everyone will have identical 
expectations, but I'm just kind of curious. Was this an artifact of performance 
considerations, or a deliberate part of a larger philosophy of having the 
syntax be as explicit as possible?

Then again, aside from keeping existing column, to me the `drop` on the 
pre-`explode` column would often seem a sensible default as well, so point 
taken that expectations may differ, in which case defaulting to whatever takes 
least processing definitely seems a sane choice.


Something else I was thinking about though would be an `explodeZipped` type of 
function, to explode multiple equally-sized-array columns together, as opposed 
to chaining separate explodes to form a Cartesian. I was still sort of looking 
into that, but... at this point I'd wonder if perhaps I've overlooked existing 
functionality for that as well. :)


> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This 

[jira] [Created] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-10-31 Thread Jason White (JIRA)
Jason White created SPARK-11437:
---

 Summary: createDataFrame shouldn't .take() when provided schema
 Key: SPARK-11437
 URL: https://issues.apache.org/jira/browse/SPARK-11437
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jason White


When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
`.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be 
executed non-lazily. If necessary, I believe this verification should be done 
lazily on all rows. However, since the caller is providing a schema to follow, 
I think it's acceptable to simply fail if the schema is incorrect.

https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325






[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-10-31 Thread Jeffrey Turpin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984140#comment-14984140
 ] 

Jeffrey Turpin commented on SPARK-6373:
---

[~jlewandowski], are you willing to review my changes? I can create a pull 
request if you would like.

> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add the ability to use secure communications (SSL/TLS) for the Netty-based 
> BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c






[jira] [Assigned] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11437:


Assignee: (was: Apache Spark)

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325






[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984166#comment-14984166
 ] 

Apache Spark commented on SPARK-11437:
--

User 'JasonMWhite' has created a pull request for this issue:
https://github.com/apache/spark/pull/9392

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325






[jira] [Assigned] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11437:


Assignee: Apache Spark

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>Assignee: Apache Spark
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325






[jira] [Updated] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10978:
-
Priority: Critical  (was: Minor)

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Priority: Critical
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying data source. This is done solely as an optimization, since the 
> predicate is reapplied on the Spark side as well. This allows for 
> bloom-filter-like operations but ends up doing a redundant scan for sources 
> that can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> that reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since that column is not returned, nothing 
> passes when Spark applies a filter on "solr_query".
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
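A hedged sketch of what such a hook could look like for a source that handles equality pushdowns exactly (illustrative only; the API that actually landed for SPARK-10978 may differ):

{code}
// Illustrative only: declare which pushed filters Spark must still re-evaluate.
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

trait AccuratePushdown { self: BaseRelation =>
  // Everything except EqualTo is reported as unhandled, so Spark keeps
  // evaluating those predicates itself; EqualTo is trusted to the source.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
{code}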






[jira] [Updated] (SPARK-11024) Optimize NULL [NOT] IN (...) by folding it to Literal(null)

2015-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11024:
-
Assignee: Dilip Biswal

> Optimize NULL [NOT] IN (...) by folding it to Literal(null)
> 
>
> Key: SPARK-11024
> URL: https://issues.apache.org/jira/browse/SPARK-11024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 1.6.0
>
>
> Add a rule to the optimizer that converts NULL [NOT] IN (expr1, ..., exprN) to
> Literal(null). 
> This is a follow-up to SPARK-8654, as suggested by Wenchen Fan.
> 
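For illustration, a hedged Catalyst-style sketch of the rule (not the implementation merged in PR 9348; {{FoldNullIn}} is a made-up name): under SQL three-valued logic {{NULL IN (...)}} always evaluates to NULL, so the whole predicate can be folded to a null literal, and a surrounding NOT then stays NULL as well.

{code}
// Illustrative optimizer rule sketch; name and placement are hypothetical.
import org.apache.spark.sql.catalyst.expressions.{In, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

object FoldNullIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // NULL IN (e1, ..., eN) is NULL for any list, so fold the whole expression.
    case In(Literal(null, _), _) => Literal.create(null, BooleanType)
  }
}
{code}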






[jira] [Resolved] (SPARK-11024) Optimize NULL [NOT] IN (...) by folding it to Literal(null)

2015-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11024.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9348
[https://github.com/apache/spark/pull/9348]

> Optimize NULL [NOT] IN (...) by folding it to Literal(null)
> 
>
> Key: SPARK-11024
> URL: https://issues.apache.org/jira/browse/SPARK-11024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
>Priority: Minor
> Fix For: 1.6.0
>
>
> Add a rule to the optimizer that converts NULL [NOT] IN (expr1, ..., exprN) to
> Literal(null). 
> This is a follow-up to SPARK-8654, as suggested by Wenchen Fan.
> 






[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-10-31 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984009#comment-14984009
 ] 

Jason White commented on SPARK-11437:
-

[~marmbrus] We briefly discussed this at SparkSummitEU this week.

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325






[jira] [Resolved] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra resolved SPARK-11431.
-
Resolution: Implemented

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to 
> explode an array of structs (which are common in JSON) into their own rows so 
> I could start analyzing the data using GraphX. I believe many others might 
> have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types; that approach is insufficient for Rows, since 
> their type does not carry the required information. The alternative would be 
> to take the element type from the DataFrame's existing schema in such cases.
> I'm trying to implement a patch that adds this functionality, so stay tuned 
> until I've figured that out. I'm new here, though, so I'll probably have use 
> for some feedback...
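For context, here is a sketch of the kind of usage being asked for, using the {{explode}} function from {{org.apache.spark.sql.functions}}; the JSON file name and field names below are illustrative, not taken from the Kayak API.

{noformat}
import org.apache.spark.sql.functions.explode

// Suppose each record carries an array-of-structs column named "flights"
// (names made up for this sketch).
val df = sqlContext.read.json("destinations.json")

// Desired behaviour: one output row per array element, with the struct's
// fields then addressable as nested columns.
val flights = df.select(explode(df("flights")).as("flight"))
flights.select("flight.origin", "flight.price").show()
{noformat}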






[jira] [Created] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-10-31 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-11439:
--

 Summary: Optimization of creating sparse feature without dense one
 Key: SPARK-11439
 URL: https://issues.apache.org/jira/browse/SPARK-11439
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Kai Sasaki
Priority: Minor


Currently, the sparse features generated in {{LinearDataGenerator}} require 
creating dense vectors first. It would be more cost efficient to generate the 
sparse features without generating dense vectors.






[jira] [Updated] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-10-31 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-11439:
---
Description: Currently, the sparse features generated in {{LinearDataGenerator}} 
require creating dense vectors first. It would be more cost efficient to avoid 
generating dense features when creating sparse features.  (was: Currently, the 
sparse features generated in {{LinearDataGenerator}} require creating dense 
vectors first. It would be more cost efficient to generate the sparse features 
without generating dense vectors.)

> Optimization of creating sparse feature without dense one
> --
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, the sparse features generated in {{LinearDataGenerator}} require 
> creating dense vectors first. It would be more cost efficient to avoid 
> generating dense features when creating sparse features.
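As a rough sketch of the idea (not the actual {{LinearDataGenerator}} change), a sparse feature vector can be assembled directly from its non-zero entries instead of materializing a dense array first:

{noformat}
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: build the sparse feature straight from the (index, value)
// pairs of its non-zero entries, skipping the intermediate dense vector.
val numFeatures = 10000
val nonZeros = Seq((3, 1.0), (512, -0.5), (9876, 2.25))  // illustrative values
val sparseFeature = Vectors.sparse(numFeatures, nonZeros)
{noformat}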






[jira] [Resolved] (SPARK-11427) DataFrame's intersect method does not work, returns 1

2015-10-31 Thread Ram Kandasamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Kandasamy resolved SPARK-11427.
---
Resolution: Duplicate

> DataFrame's intersect method does not work, returns 1
> -
>
> Key: SPARK-11427
> URL: https://issues.apache.org/jira/browse/SPARK-11427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ram Kandasamy
>
> Hello,
> I was working with DataFrames and found that the intersect() method seems to 
> always return '1'. The RDD's intersection() method does work properly.
> Consider this example:
> scala> val firstFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res4: Long = 1072046
> scala> firstFile.intersect(firstFile).count
> res5: Long = 1
> scala> firstFile.rdd.intersection(firstFile.rdd).count
> res6: Long = 1072046
> I have tried various cases, and for some reason, the DataFrame's 
> intersect method always returns 1.






[jira] [Commented] (SPARK-11427) DataFrame's intersect method does not work, returns 1

2015-10-31 Thread Ram Kandasamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984195#comment-14984195
 ] 

Ram Kandasamy commented on SPARK-11427:
---

It looks like this issue has been resolved in Spark 1.5.1; I will mark this as 
a duplicate, since it was fixed by 
https://issues.apache.org/jira/browse/SPARK-10539.

> DataFrame's intersect method does not work, returns 1
> -
>
> Key: SPARK-11427
> URL: https://issues.apache.org/jira/browse/SPARK-11427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ram Kandasamy
>
> Hello,
> I was working with DataFrames and found that the intersect() method seems to 
> always return '1'. The RDD's intersection() method does work properly.
> Consider this example:
> scala> val firstFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res4: Long = 1072046
> scala> firstFile.intersect(firstFile).count
> res5: Long = 1
> scala> firstFile.rdd.intersection(firstFile.rdd).count
> res6: Long = 1072046
> I have tried various cases, and for some reason, the DataFrame's 
> intersect method always returns 1.






[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-10-31 Thread Jeffrey Turpin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984219#comment-14984219
 ] 

Jeffrey Turpin commented on SPARK-6373:
---

Any comments/feedback would be appreciated... 
https://github.com/turp1twin/spark/commit/fd2980ab8cc1fc5b4626bb7a0d1e94128ca3874d


> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add support for secure communication (SSL/TLS) in the Netty-based 
> BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> link to a WIP prototype that implements this functionality. I have 
> attempted to disrupt as little code as possible and tried to follow the 
> current code structure (for the most part) in the areas I modified. I also 
> studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html).
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c
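For reference, Spark already exposes a {{spark.ssl.*}} namespace for the channels it can encrypt today; a sketch of those settings is below (paths and passwords are placeholders). Whether and how the Netty block transfer channel should hook into this same namespace is part of what this ticket needs to settle.

{noformat}
import org.apache.spark.SparkConf

// Existing spark.ssl.* keys (values are placeholders); this ticket is about
// giving the Netty block transfer / external shuffle path the same kind of
// TLS coverage.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")
  .set("spark.ssl.keyStorePassword", "changeit")
  .set("spark.ssl.trustStore", "/path/to/truststore.jks")
  .set("spark.ssl.trustStorePassword", "changeit")
{noformat}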






[jira] [Created] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-10-31 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11438:


 Summary: Allow users to define nondeterministic UDFs
 Key: SPARK-11438
 URL: https://issues.apache.org/jira/browse/SPARK-11438
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yin Huai


Right now, all UDFs are deterministic. It would be great if we allowed users 
to define nondeterministic UDFs.
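For illustration, a minimal sketch of why the deterministic assumption matters; the {{nondeterministic}} marker in the comment below is hypothetical, not an existing API.

{noformat}
import org.apache.spark.sql.functions.udf

// The planner may freely collapse, reorder, or re-evaluate UDF calls, which
// is only safe when the UDF is a pure function. A UDF like this one is not:
val noisyBucket = udf { (id: Long) => (id % 10).toInt + scala.util.Random.nextInt(10) }

// Hypothetical shape of the proposed capability (method name is illustrative):
// val noisyBucket = udf { (id: Long) => ... }.nondeterministic
{noformat}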






[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984223#comment-14984223
 ] 

Apache Spark commented on SPARK-11438:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9393

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It would be great if we allowed users 
> to define nondeterministic UDFs.






[jira] [Assigned] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11438:


Assignee: Yin Huai  (was: Apache Spark)

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It would be great if we allowed users 
> to define nondeterministic UDFs.






[jira] [Assigned] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-11438:


Assignee: Yin Huai

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It would be great if we allowed users 
> to define nondeterministic UDFs.






[jira] [Assigned] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11438:


Assignee: Apache Spark  (was: Yin Huai)

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> Right now, all UDFs are deterministic. It would be great if we allowed users 
> to define nondeterministic UDFs.






[jira] [Resolved] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster

2015-10-31 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11265.

   Resolution: Fixed
 Assignee: Steve Loughran
Fix Version/s: 1.6.0

> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
> -
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Fix For: 1.6.0
>
>
> As reported on the dev list, running a YARN client that needs to talk to 
> Hive in a Kerberized Hadoop cluster fails.


