[jira] [Updated] (SPARK-10050) Support collecting data of MapType in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-10050: -- Assignee: Sun Rui > Support collecting data of MapType in DataFrame > --- > > Key: SPARK-10050 > URL: https://issues.apache.org/jira/browse/SPARK-10050 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Sun Rui >Assignee: Sun Rui > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7841) Spark build should not use lib_managed for dependencies
[ https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791063#comment-14791063 ] Josh Rosen commented on SPARK-7841: --- I agree that we can probably fix this, but note that we'll have to do something about how lib_managed is used in the dev/mima script. > Spark build should not use lib_managed for dependencies > --- > > Key: SPARK-7841 > URL: https://issues.apache.org/jira/browse/SPARK-7841 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 >Reporter: Iulian Dragos > Labels: easyfix, sbt > > - unnecessary duplication (I will have those libraries under ./m2, via maven > anyway) > - every time I call make-distribution I lose lib_managed (via mvn clean > install) and have to wait for all jars to download again the next time I use sbt > - Eclipse does not handle relative paths very well (source attachments from > lib_managed don't always work) > - it's not the default configuration. If we stray from defaults I think there > should be a clear advantage. > Digging through history, the only reference to `retrieveManaged := true` I > found was in f686e3d, from July 2011 ("Initial work on converting build to > SBT 0.10.1"). My guess is that this is purely an accident of porting the build > from sbt 0.7.x and trying to keep the old project layout. > If there are reasons for keeping it, please comment (I didn't get any answers > on the [dev mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html])
[jira] [Resolved] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6513. --- Resolution: Won't Fix
> Add zipWithUniqueId (and other RDD APIs) to RDDApi
> --
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I don't think it's relevant)
> Reporter: Eran Medan
> Priority: Minor
>
> It would be nice if we could treat a DataFrame just like an RDD (wherever it makes sense).
> *Worked in 1.2.1*
> {code}
> val sqlContext = new HiveContext(sc)
> import sqlContext._
> val jsonRDD = sqlContext.jsonFile(jsonFilePath)
> jsonRDD.registerTempTable("jsonTable")
> val jsonResult = sql(s"select * from jsonTable")
> val foo = jsonResult.zipWithUniqueId().map {
>   case (Row(...), uniqueId) => // do something useful
>   ...
> }
> foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0*
> {code}
> jsonResult.zipWithUniqueId() // since RDDApi doesn't implement that method
> {code}
> *Non-working workaround:*
> Although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.rdd.zipWithUniqueId()
> {code}
> it won't help, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method:
> {code}
> foo.registerTempTable("...")
> {code}
> (See related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)
> EDIT: changed from issue to enhancement request
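For readers unfamiliar with the API being requested: `zipWithUniqueId` assigns ids without launching a Spark job, giving the j-th item of partition k the id `j * n + k` (n = number of partitions). A minimal pure-Python model of that documented id scheme (this is a sketch, not Spark's implementation):

```python
def zip_with_unique_id(partitions):
    """Model of RDD.zipWithUniqueId: item j of partition k gets id j * n + k,
    where n is the number of partitions. Ids are unique but not consecutive."""
    n = len(partitions)
    return [[(item, j * n + k) for j, item in enumerate(part)]
            for k, part in enumerate(partitions)]

# Two partitions: ids interleave across partitions without a global count.
pairs = zip_with_unique_id([["a", "b"], ["c", "d", "e"]])
```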
[jira] [Updated] (SPARK-10647) Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-10647: - Issue Type: Improvement (was: New Feature) > Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir > -- > > Key: SPARK-10647 > URL: https://issues.apache.org/jira/browse/SPARK-10647 > Project: Spark > Issue Type: Improvement >Reporter: Alan Braithwaite >Priority: Minor > > This property doesn't match up with the other properties surrounding it, > namely: > spark.mesos.deploy.zookeeper.url > and > spark.mesos.deploy.recoveryMode > Since it's also a property specific to mesos, it makes sense to be under that > hierarchy as well.
[jira] [Created] (SPARK-10647) Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir
Alan Braithwaite created SPARK-10647: Summary: Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir Key: SPARK-10647 URL: https://issues.apache.org/jira/browse/SPARK-10647 Project: Spark Issue Type: New Feature Reporter: Alan Braithwaite Priority: Minor This property doesn't match up with the other properties surrounding it, namely: spark.mesos.deploy.zookeeper.url and spark.mesos.deploy.recoveryMode Since it's also a property specific to mesos, it makes sense to be under that hierarchy as well.
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Description: Pearson's chi-squared goodness of fit test for observed against the expected distribution. > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution.
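The statistic this sub-task describes is the classic goodness-of-fit measure, sum over categories of (observed - expected)^2 / expected. A minimal pure-Python sketch of the computation (illustrative only; not the proposed Spark implementation):

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic: sum((O - E)^2 / E)."""
    assert len(observed) == len(expected), "count vectors must align"
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed counts [10, 20, 30] against a uniform expectation of 20 each:
# (100 + 0 + 100) / 20 = 10.0
stat = chi_squared_statistic([10, 20, 30], [20, 20, 20])
```

In practice the statistic would then be compared against a chi-squared distribution with (k - 1) degrees of freedom to get a p-value.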
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791016#comment-14791016 ] Nicholas Chammas commented on SPARK-4216: - Thanks Josh! > Eliminate duplicate Jenkins GitHub posts from AMPLab > > > Key: SPARK-4216 > URL: https://issues.apache.org/jira/browse/SPARK-4216 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > * [Real Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873361] > * [Imposter Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA
[jira] [Created] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
Jihong MA created SPARK-10646: - Summary: Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical Key: SPARK-10646 URL: https://issues.apache.org/jira/browse/SPARK-10646 Project: Spark Issue Type: New Feature Reporter: Jihong MA
[jira] [Updated] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10645: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics for continuous vs. continuous > -- > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > > This is an umbrella JIRA covering bivariate statistics for continuous vs. > continuous columns, including covariance, Pearson's correlation, and > Spearman's correlation (for both continuous & categorical).
[jira] [Created] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
Jihong MA created SPARK-10645: - Summary: Bivariate Statistics for continuous vs. continuous Key: SPARK-10645 URL: https://issues.apache.org/jira/browse/SPARK-10645 Project: Spark Issue Type: New Feature Reporter: Jihong MA This is an umbrella JIRA covering bivariate statistics for continuous vs. continuous columns, including covariance, Pearson's correlation, and Spearman's correlation (for both continuous & categorical).
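Of the statistics listed under this umbrella, Pearson's correlation is the simplest to sketch from first principles. A pure-Python reference computation (illustrative only; not the Spark implementation, which would run distributed over columns):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear relationship yields r = 1.0.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

Spearman's correlation, also mentioned above, is the same formula applied to the ranks of the values rather than the values themselves.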
[jira] [Resolved] (SPARK-3497) Report serialized size of task binary
[ https://issues.apache.org/jira/browse/SPARK-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3497. --- Resolution: Fixed We now have an automatic warning-level log message for large closures. > Report serialized size of task binary > - > > Key: SPARK-3497 > URL: https://issues.apache.org/jira/browse/SPARK-3497 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Sandy Ryza > > This is useful for determining that task closures are larger than expected.
[jira] [Closed] (SPARK-4087) Only use broadcast for large tasks
[ https://issues.apache.org/jira/browse/SPARK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen closed SPARK-4087. - Resolution: Won't Fix > Only use broadcast for large tasks > -- > > Key: SPARK-4087 > URL: https://issues.apache.org/jira/browse/SPARK-4087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu >Priority: Critical > > After we switched to broadcasting every task, some regressions were introduced > because broadcast is not stable enough. > So we would like to use broadcast only for large tasks, which keeps the same > behaviour as 1.0 for most cases.
[jira] [Resolved] (SPARK-4738) Update the netty-3.x version in spark-assembly-*.jar
[ https://issues.apache.org/jira/browse/SPARK-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4738. --- Resolution: Incomplete Resolving as "Incomplete" since this is an old issue and it doesn't look like there's any action to take here. In newer Spark versions, you should be able to use the various user-classpath-first configurations to work around this. > Update the netty-3.x version in spark-assembly-*.jar > > > Key: SPARK-4738 > URL: https://issues.apache.org/jira/browse/SPARK-4738 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.1.0 >Reporter: Tobias Pfeiffer >Priority: Minor > > It seems as if the version of akka-remote (2.2.3-shaded-protobuf) that is > bundled in the spark-assembly-1.1.1-hadoop2.4.0.jar file pulls in an ancient > version of netty, namely io.netty:netty:3.6.6.Final (using the package > org.jboss.netty). This means that when using spark-submit, there will always > be this netty version on the classpath before any versions added by the user. > This may lead to issues with other packages that depend on newer versions and > may fail with java.lang.NoSuchMethodError etc. (finagle-http in my case). > I wonder if it is possible to manually include a newer netty version, like > netty-3.8.0.Final.
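The user-classpath-first workaround mentioned in the resolution can be sketched roughly as follows. This is a hypothetical invocation: `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst` are experimental properties in later Spark releases, and the application class and jar names below are placeholders; check the configuration docs for your version before relying on the exact names.

```shell
# Hedged sketch: prefer the user's newer netty over the copy bundled in the
# assembly jar. Property names are from later Spark releases (experimental);
# com.example.App and your-app.jar are hypothetical placeholders.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars netty-3.8.0.Final.jar \
  --class com.example.App your-app.jar
```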
[jira] [Comment Edited] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION
[ https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790945#comment-14790945 ] Josh Rosen edited comment on SPARK-4568 at 9/16/15 7:00 PM: We now do this. Specifically, we publish RC's under both names. was (Author: joshrosen): We now do this. > Publish release candidates under $VERSION-RCX instead of $VERSION > - > > Key: SPARK-4568 > URL: https://issues.apache.org/jira/browse/SPARK-4568 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical
[jira] [Resolved] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION
[ https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4568. --- Resolution: Fixed We now do this. > Publish release candidates under $VERSION-RCX instead of $VERSION > - > > Key: SPARK-4568 > URL: https://issues.apache.org/jira/browse/SPARK-4568 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical
[jira] [Resolved] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4442. --- Resolution: Won't Fix We're using test-jar dependencies instead, so this is "Won't Fix". > Move common unit test utilities into their own package / module > --- > > Key: SPARK-4442 > URL: https://issues.apache.org/jira/browse/SPARK-4442 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Josh Rosen >Priority: Minor > > We should move generally useful unit-test fixtures and utility methods into > their own test-utilities package / module to make them easier to find and use. > See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for > one example of this.
[jira] [Commented] (SPARK-4440) Enhance the job progress API to expose more information
[ https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790935#comment-14790935 ] Josh Rosen commented on SPARK-4440: --- Does anyone still want these extensions? If so, can you please come up with a concrete proposal for exactly what you'd like to add? > Enhance the job progress API to expose more information > --- > > Key: SPARK-4440 > URL: https://issues.apache.org/jira/browse/SPARK-4440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Rui Li > > The progress API introduced in SPARK-2321 provides a new way for users to > monitor job progress. However, the information exposed in the API is > relatively limited. It would be much more useful if the API were enhanced to > expose more data. > Improvements may include, but are not limited to: > 1. Stage submission and completion time. > 2. Task metrics. > The requirement was initially identified for the Hive on Spark > project (HIVE-7292); other applications should benefit as well.
[jira] [Resolved] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located behind NAT
[ https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4389. --- Resolution: Won't Fix I'm going to resolve this as "Won't Fix": - Akka 2.4 has not shipped yet, so this feature isn't supported in any released version. - Akka 2.4 requires Java 8 and Scala 2.11 or 2.12, meaning that we can't use it in Spark as long as we need to continue to support Java 7 and Scala 2.10. We'll replace Akka RPC with a custom RPC layer long before we'll be able to drop support for these platforms. > Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located > behind NAT > - > > Key: SPARK-4389 > URL: https://issues.apache.org/jira/browse/SPARK-4389 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > We should set {{akka.remote.netty.tcp.bind-hostname="0.0.0.0"}} in our Akka > configuration so that Spark drivers can be located behind NATs / work with > weird DNS setups. > This is blocked by upgrading our Akka version, since this configuration is > not present in Akka 2.3.4. There might be a different approach / workaround > that works on our current Akka version, though. > EDIT: this is blocked by Akka 2.4, since this feature is only available in > the 2.4 snapshot release.
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790924#comment-14790924 ] Josh Rosen commented on SPARK-4216: --- We now have a bot which deletes the duplicated posts after they're posted. This de-clutters the thread for folks who read it on GitHub. > Eliminate duplicate Jenkins GitHub posts from AMPLab > > > Key: SPARK-4216 > URL: https://issues.apache.org/jira/browse/SPARK-4216 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > * [Real Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873361] > * [Imposter Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Resolved] (SPARK-3949) Use IAMRole in lieu of static access key-id/secret
[ https://issues.apache.org/jira/browse/SPARK-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3949. --- Resolution: Incomplete As of SPARK-8576, spark-ec2 can launch instances with IAM instance roles. I'm going to close this issue as "Incomplete" since it's underspecified; it would be helpful to know more specifically where you think we need support for IAM roles instead of keys (i.e. while launching the cluster? for configuring access to S3?). > Use IAMRole in lieu of static access key-id/secret > -- > > Key: SPARK-3949 > URL: https://issues.apache.org/jira/browse/SPARK-3949 > Project: Spark > Issue Type: Improvement > Components: EC2 >Affects Versions: 1.1.0 >Reporter: Rangarajan Sreenivasan > > Spark currently supports AWS resource access through user-specific > key-id/secret. While this works, the AWS recommended way is to use IAM Roles > instead of specific key-id/secrets. > http://docs.aws.amazon.com/IAM/latest/UserGuide/IAMBestPractices.html#use-roles-with-ec2 > http://docs.aws.amazon.com/IAM/latest/UserGuide/IAM_Introduction.html
[jira] [Created] (SPARK-10644) Applications wait even if free executors are available
Balagopal Nair created SPARK-10644: -- Summary: Applications wait even if free executors are available Key: SPARK-10644 URL: https://issues.apache.org/jira/browse/SPARK-10644 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.5.0 Environment: RHEL 6.5 64 bit Reporter: Balagopal Nair
Number of workers: 21
Number of executors: 63
Steps to reproduce:
1. Run 4 jobs, each with max cores set to 10
2. The first 3 jobs run with 10 each (30 executors consumed so far)
3. The 4th job waits even though there are 33 idle executors
The reason is that a job will not get executors unless the total number of EXECUTORS in use < the number of WORKERS. If there are executors available, resources should be allocated to the pending job.
[jira] [Resolved] (SPARK-3603) InvalidClassException on a Linux VM - probably problem with serialization
[ https://issues.apache.org/jira/browse/SPARK-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3603. --- Resolution: Cannot Reproduce Resolving as "Cannot Reproduce", since this is an old issue that hasn't received any updates since 1.1.0. Please re-open and update if this is still a problem. > InvalidClassException on a Linux VM - probably problem with serialization > - > > Key: SPARK-3603 > URL: https://issues.apache.org/jira/browse/SPARK-3603 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 > Environment: Linux version 2.6.32-358.32.3.el6.x86_64 > (mockbu...@x86-029.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red > Hat 4.4.7-3) (GCC) ) #1 SMP Fri Jan 17 08:42:31 EST 2014 > java version "1.7.0_25" > OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) > OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode) > Spark (either 1.0.0 or 1.1.0) >Reporter: Tomasz Dudziak >Priority: Critical > Labels: scala, serialization, spark > > I have a Scala app connecting to a standalone Spark cluster. It works fine on > Windows or on a Linux VM; however, when I try to run the app and the Spark > cluster on another Linux VM (the same Linux kernel, Java and Spark - tested > for versions 1.0.0 and 1.1.0) I get the below exception. This looks kind of > similar to the big-endian (IBM POWER7) Spark serialization issue > (SPARK-2018), but... my system is definitely little-endian and I understand > the big-endian issue should already be fixed in Spark 1.1.0 anyway. I'd > appreciate your help.
> 01:34:53.251 WARN [Result resolver thread-0][TaskSetManager] Lost TID 2 > (task 1.0:2) > 01:34:53.278 WARN [Result resolver thread-0][TaskSetManager] Loss was due to > java.io.InvalidClassException > java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class > incompatible: stream classdesc serialVersionUID = -4937928798201944954, local > class serialVersionUID = -8102093212602380348 > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) > at > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) > at > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at scala.collection.immutable.$colon$colon.readObject(List.scala:362) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) 
> at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at scala.collection.immutabl
[jira] [Updated] (SPARK-10643) Support HDFS urls in spark-submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-10643: - Description: When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. {code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. Code in question: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L685-L698 was: When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. 
{code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. > Support HDFS urls in spark-submit > - > > Key: SPARK-10643 > URL: https://issues.apache.org/jira/browse/SPARK-10643 > Project: Spark > Issue Type: New Feature >Reporter: Alan Braithwaite >Priority: Minor > > When using mesos with docker and marathon, it would be nice to be able to > make spark-submit deployable on marathon and have that download a jar from > HDFS instead of having to package the jar with the docker. > {code} > $ docker run -it docker.example.com/spark:latest > /usr/local/spark/bin/spark-submit --class > com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar > Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. 
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) >
[jira] [Resolved] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3489. --- Resolution: Won't Fix Resolving as "Won't Fix" per PR discussion. > support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params > -- > > Key: SPARK-3489 > URL: https://issues.apache.org/jira/browse/SPARK-3489 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Mohit Jaggi >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
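The request above (closed as Won't Fix) was for a variadic `rdd.zip`. In plain Python the equivalent behaviour is the built-in variadic `zip`; the sketch below models it over lists in place of RDDs (the `zip_many` helper is hypothetical, not Spark API), including the same-length check that mirrors Spark's requirement that zipped RDDs share partitioning and element counts:

```python
def zip_many(*seqs):
    """Zip any number of equal-length sequences into tuples (hypothetical
    list-based analogue of the proposed variadic RDD.zip)."""
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        # mirrors Spark's constraint that zipped RDDs line up element-for-element
        raise ValueError("all inputs must have the same length")
    return list(zip(*seqs))

print(zip_many([1, 2], ["a", "b"], [True, False]))
```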
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10371: -- Description: In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, it might produce DataFrames like the following columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. It would be nice to detect this pattern and re-use intermediate values. {code} val input = sqlContext.range(10) val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 2) output.explain(true) == Parsed Logical Plan == 'Project [*,('x * 2) AS y#254] Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Analyzed Logical Plan == id: bigint, x: bigint, y: bigint Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Optimized Logical Plan == Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Physical Plan == TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] Scan PhysicalRDD[id#252L] Code Generation: true input: org.apache.spark.sql.DataFrame = [id: bigint] output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] {code} was: In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, it might produce DataFrames like the following columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. 
However, if we materialize c and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. It would be nice to detect this pattern and re-use intermediate values. > Optimize sequential projections > --- > > Key: SPARK-10371 > URL: https://issues.apache.org/jira/browse/SPARK-10371 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > > In ML pipelines, each transformer/estimator appends new columns to the input > DataFrame. For example, it might produce DataFrames like the following > columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), > and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c > and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. > It would be nice to detect this pattern and re-use intermediate values. > {code} > val input = sqlContext.range(10) > val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * > 2) > output.explain(true) > == Parsed Logical Plan == > 'Project [*,('x * 2) AS y#254] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Analyzed Logical Plan == > id: bigint, x: bigint, y: bigint > Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Optimized Logical Plan == > Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Physical Plan == > TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS > y#254L] > Scan PhysicalRDD[id#252L] > Code Generation: true > input: org.apache.spark.sql.DataFrame = [id: bigint] > output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, 
e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
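The optimized plan above shows why the issue arises: projection collapsing inlines `(id + 1)` into both output expressions, so an expensive UDF runs once per referencing column. A pure-Python illustration of that recomputation (the call counter and `udf_b` stand-in are invented for the example):

```python
calls = {"udf_b": 0}

def udf_b(a):
    calls["udf_b"] += 1  # stand-in for an expensive UDF
    return a + 1

# Collapsed projection: both x and y re-evaluate udf_b(a), just as the
# optimized plan computes (id + 1) separately for x and for y.
rows = []
for a in range(10):
    x = udf_b(a)
    y = udf_b(a) * 2  # intermediate value not reused: udf_b fires again
    rows.append((a, x, y))

print(calls["udf_b"])  # two calls per row with the collapsed plan
```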
[jira] [Created] (SPARK-10643) Support HDFS urls in spark-submit
Alan Braithwaite created SPARK-10643: Summary: Support HDFS urls in spark-submit Key: SPARK-10643 URL: https://issues.apache.org/jira/browse/SPARK-10643 Project: Spark Issue Type: New Feature Reporter: Alan Braithwaite Priority: Minor When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. {code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
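The "Skip remote jar" warning comes from client-mode submission refusing non-local URIs. A minimal sketch of such a scheme check (hypothetical helper, not Spark's actual code; the set of "local" schemes is an assumption):

```python
from urllib.parse import urlparse

LOCAL_SCHEMES = {"", "file", "local"}  # assumption: only these count as local

def is_remote_jar(uri):
    """Return True for jars spark-submit would have to download first."""
    return urlparse(uri).scheme not in LOCAL_SCHEMES

print(is_remote_jar("hdfs://hdfs/tmp/application.jar"))  # True: triggers the skip
print(is_remote_jar("/usr/local/spark/app.jar"))         # False: usable as-is
```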
[jira] [Resolved] (SPARK-2991) RDD transforms for scan and scanLeft
[ https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2991. --- Resolution: Won't Fix > RDD transforms for scan and scanLeft > - > > Key: SPARK-2991 > URL: https://issues.apache.org/jira/browse/SPARK-2991 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Erik Erlandson >Assignee: Erik Erlandson >Priority: Minor > Labels: features > > Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) > and scanLeft(z)(f) (sequential prefix scan) > Discussion of a scanLeft implementation: > http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ > Discussion of scan: > http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
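The sequential prefix scan `scanLeft(z)(f)` proposed above corresponds to `itertools.accumulate` in Python. A sketch on a plain list (not an RDD implementation; the cascade-RDD approach in the linked posts is what a distributed version would need):

```python
from itertools import accumulate, chain

def scan_left(xs, z, f):
    """Sequential prefix scan: [z, f(z, x0), f(f(z, x0), x1), ...]."""
    return list(accumulate(chain([z], xs), f))

print(scan_left([1, 2, 3], 0, lambda acc, x: acc + x))  # running sums, seed included
```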
[jira] [Resolved] (SPARK-2496) Compression streams should write its codec info to the stream
[ https://issues.apache.org/jira/browse/SPARK-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2496. --- Resolution: Incomplete Resolving as "Incomplete"; if we still want to do this then we should wait until we have a specific concrete use-case / list of things that need to be changed. > Compression streams should write its codec info to the stream > - > > Key: SPARK-2496 > URL: https://issues.apache.org/jira/browse/SPARK-2496 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Critical > > Spark sometime store compressed data outside of Spark (e.g. event logs, > blocks in tachyon), and those data are read back directly using the codec > configured by the user. When the codec differs between runs, Spark wouldn't > be able to read the codec back. > I'm not sure what the best strategy here is yet. If we write the codec > identifier for all streams, then we will be writing a lot of identifiers for > shuffle blocks. One possibility is to only write it for blocks that will be > shared across different Spark instances (i.e. managed outside of Spark), > which includes tachyon blocks and event log blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
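One shape the proposal could take: prefix externally-stored streams with a codec identifier so readers no longer depend on the writer's configuration. A hedged sketch (the header format, magic bytes, and helper names are invented for illustration, and only one codec is wired up):

```python
import zlib

MAGIC = b"CODC"  # hypothetical 4-byte marker identifying a codec-tagged stream

def write_with_codec(payload, codec_name="zlib"):
    """Prepend a codec identifier header, then the compressed payload."""
    name = codec_name.encode()
    header = MAGIC + bytes([len(name)]) + name
    return header + zlib.compress(payload)

def read_with_codec(blob):
    """Read the codec name from the header, then decompress accordingly."""
    assert blob[:4] == MAGIC, "not a codec-tagged stream"
    n = blob[4]
    codec = blob[5:5 + n].decode()
    assert codec == "zlib"  # the only codec in this sketch
    return zlib.decompress(blob[5 + n:])

print(read_with_codec(write_with_codec(b"event log data")))
```

As the issue notes, tagging every shuffle block this way would add overhead, which is why the proposal limits it to blocks shared across Spark instances.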
[jira] [Resolved] (SPARK-2302) master should discard exceeded completedDrivers
[ https://issues.apache.org/jira/browse/SPARK-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2302. --- Resolution: Fixed Assignee: Lianhui Wang Fix Version/s: 1.1.0 > master should discard exceeded completedDrivers > > > Key: SPARK-2302 > URL: https://issues.apache.org/jira/browse/SPARK-2302 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Lianhui Wang >Assignee: Lianhui Wang > Fix For: 1.1.0 > > > When completedDrivers number exceeds the threshold, the first > Max(spark.deploy.retainedDrivers, 1) will be discarded. > see PR: > https://github.com/apache/spark/pull/1114 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
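The fix bounds `completedDrivers` by a retention cap. A sketch of that retention policy in Python (the function and the trim-to-cap policy are illustrative; the exact batch size discarded follows the linked PR):

```python
def record_completed(completed, driver, retained=200):
    """Append a finished driver, discarding the oldest entries once the
    list exceeds the retention cap (cf. spark.deploy.retainedDrivers)."""
    completed.append(driver)
    if len(completed) > retained:
        del completed[0:len(completed) - retained]  # drop oldest beyond the cap
    return completed
```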
[jira] [Commented] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"
[ https://issues.apache.org/jira/browse/SPARK-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790884#comment-14790884 ] Thouis Jones commented on SPARK-10642: -- Simpler cases. Fails: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().lookup(('a', 'b')) {code} Works: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().map(lambda x: x).lookup(('a', 'b')) {code} > Crash in rdd.lookup() with "java.lang.Long cannot be cast to > java.lang.Integer" > --- > > Key: SPARK-10642 > URL: https://issues.apache.org/jira/browse/SPARK-10642 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 > Environment: OSX >Reporter: Thouis Jones > > Running this command: > {code} > sc.parallelize([(('a', 'b'), > 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b')) > {code} > gives the following error: > {noformat} > 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at > PythonRDD.scala:361 > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", > line 2199, in lookup > return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)]) > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", > line 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", > line 36, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : java.lang.ClassCastException: java.lang.Long cannot be cast to > java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530) > at scala.collection.Iterator$class.find(Iterator.scala:780) > at scala.collection.AbstractIterator.find(Iterator.scala:1157) > at scala.collection.IterableLike$class.find(IterableLike.scala:79) > at scala.collection.AbstractIterable.find(Iterable.scala:54) > at > org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839) > at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361) > at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) > at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
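The `map(lambda x: x)` workaround above works because `map` discards the partitioner, so `lookup()` falls back to scanning every partition instead of computing a single partition index (the code path that trips the Long/Integer cast). A simplified pure-Python sketch of the two lookup paths (data layout and helper hypothetical):

```python
def lookup(partitions, key, partitioner=None):
    """Mimic RDD.lookup: jump to one partition when a partitioner is known,
    otherwise scan all partitions (the map(lambda x: x) workaround path)."""
    if partitioner is not None:
        candidates = [partitions[partitioner(key)]]  # single targeted partition
    else:
        candidates = partitions                      # full scan
    return [v for part in candidates for (k, v) in part if k == key]

parts = [[(("a", "b"), "c")], []]
print(lookup(parts, ("a", "b"), partitioner=lambda k: 0))  # targeted path
print(lookup(parts, ("a", "b")))                           # full-scan path
```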
[jira] [Resolved] (SPARK-1304) Job fails with spot instances (due to IllegalStateException: Shutdown in progress)
[ https://issues.apache.org/jira/browse/SPARK-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1304. --- Resolution: Won't Fix > Job fails with spot instances (due to IllegalStateException: Shutdown in > progress) > -- > > Key: SPARK-1304 > URL: https://issues.apache.org/jira/browse/SPARK-1304 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 0.9.0 >Reporter: Alex Boisvert >Priority: Minor > > We had a job running smoothly with spot instances until one of the spot > instances got terminated ... which led to a series of "IllegalStateException: > Shutdown in progress" and the job failed afterwards. > 14/03/24 06:07:52 WARN scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException > java.lang.IllegalStateException: Shutdown in progress > at > java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66) > at java.lang.Runtime.addShutdownHook(Runtime.java:211) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1441) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) > at > org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:77) > at > org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:156) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:90) > at > 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:89) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:57) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94) > at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471) > at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) > at org.apache.spark.scheduler.Task.run(Task.scala:53) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
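The failure above is the Hadoop FileSystem cache trying to register a JVM shutdown hook while shutdown is already underway. The usual defensive pattern is to tolerate that race rather than fail the task; a Python analogue (here `RuntimeError` stands in for Java's `IllegalStateException`, and the helper is hypothetical):

```python
def add_hook(register, hook):
    """Attempt hook registration; swallow the late-registration race that
    surfaces on the JVM as IllegalStateException("Shutdown in progress")."""
    try:
        register(hook)
        return True
    except RuntimeError:
        return False  # too late to register; proceed without the hook
```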
[jira] [Assigned] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
[ https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-10640: - Assignee: Thomas Graves > Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied > -- > > Key: SPARK-10640 > URL: https://issues.apache.org/jira/browse/SPARK-10640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > I'm seeing an exception from the spark history server trying to read a > history file: > scala.MatchError: TaskCommitDenied (of class java.lang.String) > at > org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) > at > org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
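The `scala.MatchError` above comes from an exhaustive string match with no default case, so a log written by a newer Spark crashes an older history server. A defensive parser keeps a fallback branch instead; sketch in Python (the tag-to-value mapping is invented for illustration, not Spark's actual table):

```python
KNOWN_REASONS = {
    "Success": "success",
    "Resubmitted": "resubmitted",
    "TaskKilled": "killed",
    # "TaskCommitDenied" was the tag missing from the match in SPARK-10640
}

def task_end_reason_from_json(tag):
    """Map a task-end reason tag, falling back instead of raising."""
    return KNOWN_REASONS.get(tag, ("unknown", tag))  # tolerate unknown tags

print(task_end_reason_from_json("TaskCommitDenied"))
```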
[jira] [Resolved] (SPARK-869) Retrofit rest of RDD api to use proper serializer type
[ https://issues.apache.org/jira/browse/SPARK-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-869. -- Resolution: Done Going to resolve this as Done; please open a new JIRA if you find specific examples where we're using the wrong serializer. > Retrofit rest of RDD api to use proper serializer type > -- > > Key: SPARK-869 > URL: https://issues.apache.org/jira/browse/SPARK-869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.8.0 >Reporter: Dmitriy Lyubimov > > SPARK-826 and SPARK-827 resolved proper serialization support for some RDD > method parameters, but not all. > This issue is to address the rest of RDD api and operations. Most of the time > this is due to wrapping RDD parameters into a closure which can only use a > closure serializer to communicate to the backend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"
Thouis Jones created SPARK-10642: Summary: Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer" Key: SPARK-10642 URL: https://issues.apache.org/jira/browse/SPARK-10642 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Environment: OSX Reporter: Thouis Jones Running this command: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b')) {code} gives the following error: {noformat} 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361 Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", line 2199, in lookup return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)]) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", line 916, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", line 36, in deco return f(*a, **kw) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. 
: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530) at scala.collection.Iterator$class.find(Iterator.scala:780) at scala.collection.AbstractIterator.find(Iterator.scala:1157) at scala.collection.IterableLike$class.find(IterableLike.scala:79) at scala.collection.AbstractIterable.find(Iterable.scala:54) at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839) at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361) at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10504) aggregate where NULL is defined as the value expression aborts when SUM used
[ https://issues.apache.org/jira/browse/SPARK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10504. -- Resolution: Fixed Assignee: Yin Huai Fix Version/s: 1.5.0 > aggregate where NULL is defined as the value expression aborts when SUM used > > > Key: SPARK-10504 > URL: https://issues.apache.org/jira/browse/SPARK-10504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: N Campbell >Assignee: Yin Huai >Priority: Minor > Fix For: 1.5.0 > > > In ISO-SQL the context would determine an implicit type for NULL or one might > find that a vendor requires an explicit type via CAST ( NULL as INTEGER). It > appears that SPARK presumes a long type i.e. select min(NULL), max(NULL) but > a query such the following aborts. > > {{select sum ( null ) from tversion}} > {code} > Operation: execute > Errors: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 5232.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 5232.0 (TID 18531, sandbox.hortonworks.com): scala.MatchError: NullType (of > class org.apache.spark.sql.types.NullType$) > at > org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:403) > at > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:422) > at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:422) > at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51) > at > org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:119) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51) > at > org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:82) > at > org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:581) > at > 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:133) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
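In ISO SQL, `SUM` ignores NULLs and yields NULL when no non-NULL value exists, so `select sum(null)` should return NULL rather than abort. A sketch of that null-skipping semantics in Python (illustration of the expected behaviour, not Spark's fixed code path):

```python
def sql_sum(values):
    """SUM with SQL NULL semantics: skip NULLs (None), return NULL when
    there is no non-NULL input at all."""
    acc = None
    for v in values:
        if v is None:
            continue
        acc = v if acc is None else acc + v
    return acc

print(sql_sum([None, None]))  # None, matching SQL's sum(NULL)
print(sql_sum([1, None, 2]))  # 3
```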
[jira] [Updated] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10602: -- Assignee: Seth Hendrickson > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10641: -- Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-10384) > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Reporter: Jihong MA > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
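The cited algorithm maintains central moments M2..M4 incrementally in a single pass, from which skewness and excess kurtosis fall out at the end. A sketch (population statistics, no Bessel correction; degenerate inputs are rejected for simplicity):

```python
import math

def skew_kurtosis(xs):
    """Single-pass population skewness and excess kurtosis using the
    higher-order incremental moment updates from the linked article."""
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in xs:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # update order matters: m4 uses old m2/m3, m3 uses old m2
        m4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    if n < 2 or m2 == 0.0:
        raise ValueError("need at least two distinct values")
    skew = math.sqrt(n) * m3 / m2 ** 1.5
    kurt = n * m4 / (m2 * m2) - 3.0
    return skew, kurt
```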
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790859#comment-14790859 ] Joseph K. Bradley commented on SPARK-10602: --- Yeah, JIRA only allows 2 levels of subtasks (a long-time annoyance of mine!). I'd recommend linking here using "contains." I'll fix the link for now. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10589) Add defense against external site framing
[ https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10589. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8745 [https://github.com/apache/spark/pull/8745] > Add defense against external site framing > - > > Key: SPARK-10589 > URL: https://issues.apache.org/jira/browse/SPARK-10589 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 1.6.0 > > > This came up as a minor point during a security audit using a common scanning > tool: It's best if Spark UIs try to actively defend against certain types of > frame-related vulnerabilities by setting X-Frame-Options. See > https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet > Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
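The fix adds an `X-Frame-Options` response header so browsers refuse to render the UI inside a third-party frame. A minimal sketch of the header shape (the helper is hypothetical, not the patched Spark code):

```python
def with_frame_defense(headers, value="SAMEORIGIN"):
    """Return response headers extended with an anti-clickjacking header.
    SAMEORIGIN still lets the UI frame itself; DENY blocks all framing."""
    out = dict(headers)
    out["X-Frame-Options"] = value
    return out

print(with_frame_defense({"Content-Type": "text/html"}))
```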
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790843#comment-14790843 ] Cody Koeninger commented on SPARK-10320: I don't think there's much benefit to multiple dstreams with the direct api, because it's straightforward to filter or match on the topic on a per-partition basis. I'm not sure that adding entirely new dstreams after the streaming context has been started makes sense. As far as defaults go... I don't see a clearly reasonable default like messageHandler has. Maybe an example implementation of a function that maintains just a list of topic names and handles the offset lookups. The other thing is, in order to get much use out of this, the api for communicating with the kafka cluster would need to be made public, and there had been some reluctance on that point previously. [~tdas] Any thoughts on making the KafkaCluster api public? > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > to current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) > from where we left off with a minor delay. 
However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
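Cody's point above, that with the direct API a single stream can be split by topic rather than creating one DStream per topic, can be shown with a plain-Python sketch over (topic, message) pairs (the data and helper here are invented for illustration; in Spark this filtering would happen per partition):

```python
# Illustrative sketch: with the direct Kafka API each record's topic is
# known, so one stream can be split by topic with a filter instead of
# creating a separate DStream per topic.
records = [("orders", "o1"), ("clicks", "c1"), ("orders", "o2")]

def by_topic(records, topic):
    # Per-partition this would be an iterator filter; the idea is identical.
    return [msg for (t, msg) in records if t == topic]
```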
[jira] [Updated] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10485: - Affects Version/s: 1.5.0 > IF expression is not correctly resolved when one of the options have NullType > - > > Key: SPARK-10485 > URL: https://issues.apache.org/jira/browse/SPARK-10485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Antonio Jesus Navarro > > If we have this query: > {code} > SELECT IF(column > 1, 1, NULL) FROM T1 > {code} > On Spark 1.4.1 we have this: > {code} > override lazy val resolved = childrenResolved && trueValue.dataType == > falseValue.dataType > {code} > So if one of the types is NullType, the if expression is not resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790816#comment-14790816 ] Jihong MA commented on SPARK-10602: --- I went ahead and created SPARK-10641. Since this JIRA is not listed as an umbrella, I couldn't link to it directly, so I linked to SPARK-10384 instead. @Joseph, can you assign SPARK-10641 to Seth and help fix the link? Thanks! > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790804#comment-14790804 ] Sudarshan Kadambi commented on SPARK-10320: --- Sure, a function as proposed that allows for the topic, partitions and offsets to be specified in a fine-grained manner is needed to provide the full flexibility we desire (starting at an arbitrary offset within each topic partition). If separate DStreams are desired for each topic, do you intend for createDirectStream to be called multiple times (with a different subscription topic each time) both before and after the streaming context is started? Also, what kind of defaults did you have in mind? For example, I might require the ability to specify new topics after the streaming context is started but might not want the burden of being aware of the partitions within the topic or the offsets. I might simply want to default to either the start or the end of each partition that exists for that topic. > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > to current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10641: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
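The linked Wikipedia algorithm computes higher-order central moments in a single pass. A minimal Python sketch of those update formulas (an illustration of the cited algorithm, not the eventual Spark implementation; it assumes at least two distinct values so the variance is nonzero):

```python
def moment_stats(xs):
    """Single-pass (online) mean, variance, skewness and excess kurtosis
    via central moments M2..M4, following the higher-order-statistics
    update formulas from the Wikipedia article cited above."""
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in xs:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update higher moments before lower ones, since each update
        # consumes the previous values of the lower moments.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    variance = m2 / n                       # population variance
    skewness = (n ** 0.5) * m3 / m2 ** 1.5  # g1
    kurtosis = n * m4 / (m2 * m2) - 3.0     # excess kurtosis
    return mean, variance, skewness, kurtosis
```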
[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790792#comment-14790792 ] Sean Owen commented on SPARK-10636: --- It's a Scala syntax issue, as you suspected when you wondered whether you're using if..else the way you think. Stuff happens, but if that crosses your mind, I'd just float a user@ question first or try a simple test in Scala to narrow down the behavior in question. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L)))). > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L)))). 
> map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. 
> map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L)))). > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L)))). > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))). > filter(a => a._2._5 <= 50) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))). > filter(a => a._2._5 <= 50) >}. >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER)
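The root cause in the report above is precedence: in Scala, `else { ... }.filter(...)` parses as `else ({ ... }.filter(...))`, so the filter applies only to the else branch. Python's conditional expression has the same trap, which makes for a compact runnable illustration:

```python
# The trailing "+ [99]" binds to the else operand only, not to the whole
# conditional, mirroring Scala's `else { ... }.filter(...)` pitfall.
cond = True
unparenthesized = [1, 2] if cond else [3, 4] + [99]    # [99] lost when cond is True
parenthesized = ([1, 2] if cond else [3, 4]) + [99]    # applied to either branch
```

The same cure works in the Scala report: either parenthesize the whole if/else before chaining `.filter`, or push the filter into both branches as the reporter did.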
[jira] [Created] (SPARK-10641) skewness and kurtosis support
Jihong MA created SPARK-10641: - Summary: skewness and kurtosis support Key: SPARK-10641 URL: https://issues.apache.org/jira/browse/SPARK-10641 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA Implementing skewness and kurtosis support based on following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
[ https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790777#comment-14790777 ] Thomas Graves commented on SPARK-10640: --- looks like the jsonProtocol.taskEndReasonFromJson is just missing the TaskCommitDenied TaskEndReason. Seems like we should have a better way of handling this in the future too. Right now it doesn't load the history file at all, which is pretty annoying. Perhaps have a catch-all that skips it or prints unknown or something so it at least loads. > Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied > -- > > Key: SPARK-10640 > URL: https://issues.apache.org/jira/browse/SPARK-10640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves > > I'm seeing an exception from the spark history server trying to read a > history file: > scala.MatchError: TaskCommitDenied (of class java.lang.String) > at > org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) > at > org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
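Thomas's suggested mitigation, a catch-all that records an unknown reason instead of aborting the whole history file, can be sketched as follows (illustrative Python, not Spark's JsonProtocol; the reason set and names here are invented for the example):

```python
# Hypothetical subset of task-end reason tags for illustration.
KNOWN_TASK_END_REASONS = {"Success", "Resubmitted", "TaskKilled", "TaskCommitDenied"}

def task_end_reason_from_tag(tag):
    """Parse a task-end reason tag, degrading gracefully on unknown tags.

    A strict match (like the MatchError in the stack trace above) makes one
    unrecognized event fail the entire log replay; a catch-all keeps the
    rest of the history file loadable.
    """
    if tag in KNOWN_TASK_END_REASONS:
        return tag
    return "Unknown(%s)" % tag
```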
[jira] [Created] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
Thomas Graves created SPARK-10640: - Summary: Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied Key: SPARK-10640 URL: https://issues.apache.org/jira/browse/SPARK-10640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Thomas Graves I'm seeing an exception from the spark history server trying to read a history file: scala.MatchError: TaskCommitDenied (of class java.lang.String) at org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) at org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context
[ https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mamdouh Alramadan updated SPARK-10638: -- Description: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: 
Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` {code} And in my spark-defaults.conf I included {code:title=spark-defaults.conf|borderStyle=solid} `spark.streaming.stopGracefullyOnShutdowntrue` {code} was: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. 
My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: `sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } ` The logs are for this process are: ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobG
[jira] [Created] (SPARK-10639) Need to convert UDAF's result from scala to sql type
Yin Huai created SPARK-10639: Summary: Need to convert UDAF's result from scala to sql type Key: SPARK-10639 URL: https://issues.apache.org/jira/browse/SPARK-10639 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Blocker We are missing a conversion at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala#L427. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context
[ https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mamdouh Alramadan updated SPARK-10638: -- Description: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: 
Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` {code} And in my spark-defaults.conf I included {code:title=spark-defaults.conf|borderStyle=solid} spark.streaming.stopGracefullyOnShutdowntrue {code} was: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. 
My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming jo
[jira] [Created] (SPARK-10638) spark streaming stop gracefully keeps the spark context
Mamdouh Alramadan created SPARK-10638: - Summary: spark streaming stop gracefully keeps the spark context Key: SPARK-10638 URL: https://issues.apache.org/jira/browse/SPARK-10638 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Mamdouh Alramadan With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: `sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } ` The logs are for this process are: ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 
INFO JobGenerator: Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` And in my spark-defaults.conf I included `spark.streaming.stopGracefullyOnShutdowntrue` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
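The reporter's shutdown-hook pattern (stop the streaming context on SIGTERM, then let the process exit once its threads finish) can be mirrored in an illustrative Python analog; the context class below is a stand-in for the example, not Spark's API:

```python
import signal

class StandInStreamingContext:
    """Stand-in for a streaming context; only records that stop() ran."""
    def __init__(self):
        self.stopped = False
    def stop(self, stop_spark_context=True, stop_gracefully=True):
        self.stopped = True

ssc = StandInStreamingContext()

def on_sigterm(signum, frame):
    # Mirrors the sys.ShutdownHookThread in the report: request a graceful
    # stop; the process only exits once all non-daemon threads finish,
    # which is exactly where the reported hang appears.
    ssc.stop(stop_spark_context=True, stop_gracefully=True)

signal.signal(signal.SIGTERM, on_sigterm)
```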
[jira] [Updated] (SPARK-9715) Store numFeatures in all ML PredictionModel types
[ https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9715: - Shepherd: Joseph K. Bradley Assignee: Seth Hendrickson > Store numFeatures in all ML PredictionModel types > - > > Key: SPARK-9715 > URL: https://issues.apache.org/jira/browse/SPARK-9715 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > > The PredictionModel abstraction should store numFeatures. Currently, only > RandomForest* types do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
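As a rough illustration of what the proposal would provide (the names below are illustrative, not Spark's actual ML API): a prediction model records the feature count at construction time, so callers can introspect it and validate inputs.

```scala
// Hypothetical sketch of the abstraction: every prediction model exposes
// the number of features it was trained on.
trait PredictionModelLike {
  def numFeatures: Int
  def predict(features: Array[Double]): Double
}

// A linear model knows its feature count from its weight vector.
class LinearModel(weights: Array[Double], intercept: Double) extends PredictionModelLike {
  override val numFeatures: Int = weights.length

  override def predict(features: Array[Double]): Double = {
    require(features.length == numFeatures, s"expected $numFeatures features, got ${features.length}")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }
}
```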
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790713#comment-14790713 ] Seth Hendrickson commented on SPARK-10602: -- Right now I have working versions of single pass algos for skewness and kurtosis. I'm still writing tests and need to work on performance profiling, but I just wanted to make sure we don't duplicate any work. I'll post back with more details soon. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
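For context, a single-pass approach to skewness and kurtosis typically maintains running central moments and derives the statistics at the end. The sketch below uses the standard one-pass moment-update formulas; it illustrates the technique only and is not Spark's implementation (a real UDAF would also need a cross-partition merge step, omitted here).

```scala
// Single-pass (streaming) skewness and kurtosis via running central moments.
class MomentAggregator {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0 // sum of squared deviations
  private var m3 = 0.0 // sum of cubed deviations
  private var m4 = 0.0 // sum of fourth-power deviations

  def update(x: Double): Unit = {
    val n1 = n.toDouble
    n += 1
    val nd = n.toDouble
    val delta = x - mean
    val deltaN = delta / nd
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    // Update higher moments before m2, since they use the old m2/m3.
    m4 += term1 * deltaN2 * (nd * nd - 3 * nd + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (nd - 2) - 3 * deltaN * m2
    m2 += term1
  }

  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n.toDouble * m4 / (m2 * m2) - 3.0 // excess kurtosis
}
```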
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790700#comment-14790700 ] Xiangrui Meng commented on SPARK-6810: -- The ML part (GLM) is about the same: all computation is on the Scala side. I would wait until 1.6 to benchmark GLM, because we are going to implement the same algorithm as R in 1.6. > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker closed SPARK-10636. -- > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). 
>persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. 
> map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))). > filter(a => a._2._5 <= 50) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))). > filter(a => a._2._5 <= 50) >}. >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 [] > // | MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56]
[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790681#comment-14790681 ] Glenn Strycker commented on SPARK-10636: I didn't "forget", I believed that "RDD = if {} else {} . something" would automatically take care of the associative property, and that anything after the final else {} would apply to both blocks. I didn't realize that braces behave similarly to parentheses and that I needed extras -- makes sense. I have now added these to my code. This wasn't a question for "user@ first", since I really did believe there was a bug. Jira is the place for submitting bug reports, even when the resolution is user error. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). 
> map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; 
TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4
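The parsing behavior described above can be reproduced without Spark: in Scala, a trailing method call after `if ... else { block }` binds to the last block, not to the whole conditional, so in the first form the filter only ever runs on the else branch.

```scala
// Minimal reproduction with plain Lists (no Spark).
val xs = List(1, 2, 3, 60)
val ys = List(4, 5)

// Without parentheses, .filter attaches to the else block only.
// Here xs.nonEmpty is true, so the filter never runs and 60 survives.
val unparenthesized = if (xs.nonEmpty) { xs } else { ys }.filter(_ <= 50)

// Parenthesizing the whole conditional applies the filter to either branch.
val parenthesized = (if (xs.nonEmpty) { xs } else { ys }).filter(_ <= 50)
```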
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790654#comment-14790654 ] Shivaram Venkataraman commented on SPARK-6810: -- So the DataFrame API doesn't need many performance benchmarks, as we mostly wrap our calls to Java / Scala. However, we are adding new ML API components, and [~mengxr] will be able to provide more guidance for this. cc [~sunrui] > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. I tried to save a dataframe with instances of the Vector class from mlib and got the error. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1,Vectors.dense(1,1,1)), (2,Vectors.dense(2,2,2.toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfu
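The reproduction snippet in the description above appears to have lost its closing parentheses to JIRA formatting. A plausible reconstruction is shown below; it is a sketch only, since it needs a live SparkContext/SQLContext and a concrete save path:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SaveMode

// Assumes an existing SparkContext `sc` and SQLContext `sqlContext`.
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  (1, Vectors.dense(1, 1, 1)),
  (2, Vectors.dense(2, 2, 2)))).toDF()
df.write.format("json").mode(SaveMode.Overwrite).save(path) // `path` as in the report
```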
[jira] [Updated] (SPARK-10637) DataFrames: saving with nested User Data Types
[ https://issues.apache.org/jira/browse/SPARK-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10637: - Description: Cannot save data frames using nested UserDefinedType I wrote a simple example to show the error. It causes the following error java.lang.IllegalArgumentException: Nested type should be repeated: required group array { required int32 num; } {code:java} import org.apache.spark.sql.SaveMode import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.GenericMutableRow import org.apache.spark.sql.types._ @SQLUserDefinedType(udt = classOf[AUDT]) case class A(list:Seq[B]) class AUDT extends UserDefinedType[A] { override def sqlType: DataType = StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true))) override def userClass: Class[A] = classOf[A] override def serialize(obj: Any): Any = obj match { case A(list) => val row = new GenericMutableRow(1) row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) row } override def deserialize(datum: Any): A = { datum match { case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) } } } object AUDT extends AUDT @SQLUserDefinedType(udt = classOf[BUDT]) case class B(num:Int) class BUDT extends UserDefinedType[B] { override def sqlType: DataType = StructType(Seq(StructField("num", IntegerType, nullable = false))) override def userClass: Class[B] = classOf[B] override def serialize(obj: Any): Any = obj match { case B(num) => val row = new GenericMutableRow(1) row.setInt(0, num) row } override def deserialize(datum: Any): B = { datum match { case row: InternalRow => new B(row.getInt(0)) } } } object BUDT extends BUDT object TestNested { def main(args:Array[String]) = { val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4 val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark")) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(1 to 2 zip col).toDF() df.show() df.write.mode(SaveMode.Overwrite).save(...) } } {code} The error log is shown below: {noformat} 15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external IP address! 15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB) 15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1 15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73 15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) with 1 output partitions 15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at TestNested.scala:73) 15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List() 15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has no missing parents 15/09/16 16:44:39 INFO MemoryStore: 
ensureFreeSpace(59832) called with curMem=0, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 58.4 KB, free 973.6 MB) 15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with curMem=59832, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.3 KB, free 973.5 MB) 15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50986 (size: 20.3 KB, free: 973.6 MB) 15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861 15/09/16 16:44:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.sc
[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9296: --- Assignee: Apache Spark > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790663#comment-14790663 ] Apache Spark commented on SPARK-9296: - User 'JihongMA' has created a pull request for this issue: https://github.com/apache/spark/pull/8778 > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9296: --- Assignee: (was: Apache Spark) > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-10634. - > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
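As the "Not A Problem" resolution suggests, this is a quoting issue rather than a parser bug: a double quote inside a double-quoted SQL literal terminates the literal. Delimiting the literal with single quotes avoids the conflict. The sketch below reuses the column and table names from the reported query; note that a value containing single quotes would still need escaping (doubling them, in standard SQL).

```scala
val value = "this is a \"test\""

// Single-quoted SQL literal: embedded double quotes need no escaping.
val query = s"SELECT clistc215647292 FROM TABLE_1 WHERE clistc215647292 = '$value'"
// query == SELECT clistc215647292 FROM TABLE_1 WHERE clistc215647292 = 'this is a "test"'
```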
[jira] [Resolved] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10634. --- Resolution: Not A Problem > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more
[jira] [Reopened] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-10634: ---
[jira] [Resolved] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10636. --- Resolution: Not A Problem In the first case, your {{.filter}} statement plainly applies only to the else block. That's the behavior you see. Did you forget to wrap the if-else in parens? I'd ask this as a question on user@ first, since I think you are posing this as a question. JIRA is not the right place to start with those.
[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prachi Burathoki closed SPARK-10634. Resolution: Done
[jira] [Created] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
Glenn Strycker created SPARK-10636: -- Summary: RDD filter does not work after if..then..else RDD blocks Key: SPARK-10636 URL: https://issues.apache.org/jira/browse/SPARK-10636 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Glenn Strycker

I have an RDD declaration of the following form:
{code}
val myRDD = if (some condition) { tempRDD1.some operations } else { tempRDD2.some operations}.filter(a => a._2._5 <= 50)
{code}
When I output the contents of myRDD, I found entries that clearly had a._2._5 > 50... the filter didn't work!
If I move the filter inside of the if..then blocks, it suddenly does work:
{code}
val myRDD = if (some condition) { tempRDD1.some operations.filter(a => a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
{code}
I ran toDebugString after both of these code examples, and "filter" does appear in the DAG for the second example, but DOES NOT appear in the first DAG. Why not?
Am I misusing the if..then..else syntax for Spark/Scala?
Here is my actual code... ignore the crazy naming conventions and what it's doing...
{code}
// this does NOT work
val myRDD = if (tempRDD2.count() > 0) {
  tempRDD1.
    map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
    leftOuterJoin(tempRDD2).
    map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, a._2._2.getOrElse(1L.
    leftOuterJoin(tempRDD2).
    map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, a._2._2.getOrElse(1L.
    map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
} else {
  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
}.
  filter(a => a._2._5 <= 50).
  partitionBy(partitioner).
  setName("myRDD").
  persist(StorageLevel.MEMORY_AND_DISK_SER)

myRDD.checkpoint()
println(myRDD.toDebugString)

// (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
//  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
//  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
//  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
//  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
//  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
//  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
//  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
//  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at reduceByKey at myProgram.scala:1689 []
//  |  |  |      CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
//  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at myProgram.scala:383 []
//  |  |      CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//  |  |  CheckpointRDD[17] at count at myProgram.scala:394 []
//  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at myProgram.scala:383 []
//  |      CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//  |  CheckpointRDD[17] at count at myProgram.scala:394 []

// this DOES work!
val myRDD = if (tempRDD2.count() > 0) {
  tempRDD1.
    map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
    leftOuterJoin(tempRDD2).
    map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, a._2._2.getOrElse(1L.
    leftOuterJoin(tempRDD2).
    map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, a._2._2.getOrElse(1L.
    map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))).
    filter(a => a._2._5 <= 50)
} else {
  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))).
    filter(a => a._2._5 <= 50)
}.
  partitionBy(partitioner).
  setName("myRDD").
  persist(StorageLevel.MEMORY_AND_DISK_SER)

myRDD.checkpoint()
println(myRDD.toDebugString)

// (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 []
//  |  MapPartitionsRDD[58] at map at myProgram.scala:2120 []
//  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
//  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
//  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
//  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
//  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
//  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  Ma
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790634#comment-14790634 ] Prachi Burathoki commented on SPARK-10634: -- I tried escaping " with both \ and "", but got the same error. I've found a workaround, though: in the where clause I changed it to listc = 'this is a"test"' and that works. Thanks
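The workaround above can be sketched quickly. This uses Python's built-in sqlite3 as a stand-in SQL engine (an assumption: Spark SQL 1.3's parser is not exercised here, but single-quote string-literal rules are common across SQL dialects), and also shows parameter binding, which sidesteps manual quoting altogether:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (listc TEXT)")

# The target value contains double quotes: this is a"test"
# Inside a single-quoted SQL literal, the embedded " needs no escaping.
conn.execute("INSERT INTO t VALUES ('this is a\"test\"')")
rows = conn.execute(
    "SELECT listc FROM t WHERE listc = 'this is a\"test\"'").fetchall()
print(rows)  # [('this is a"test"',)]

# Parameter binding avoids hand-quoting entirely -- the safer habit
# whenever the string comes from user input.
rows2 = conn.execute(
    "SELECT listc FROM t WHERE listc = ?", ('this is a"test"',)).fetchall()
assert rows2 == rows
```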
[jira] [Created] (SPARK-10637) DataFrames: saving with nested User Data Types
Joao created SPARK-10637: Summary: DataFrames: saving with nested User Data Types Key: SPARK-10637 URL: https://issues.apache.org/jira/browse/SPARK-10637 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Joao

Cannot save data frames using nested UserDefinedType. I wrote a simple example to show the error. It causes the following error:
java.lang.IllegalArgumentException: Nested type should be repeated: required group array { required int32 num; }
{code:java}
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list: Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType =
    StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
    case A(list) =>
      val row = new GenericMutableRow(1)
      row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
      row
  }
  override def deserialize(datum: Any): A = {
    datum match {
      case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
    }
  }
}
object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(num: Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType =
    StructType(Seq(StructField("num", IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
    case B(num) =>
      val row = new GenericMutableRow(1)
      row.setInt(0, num)
      row
  }
  override def deserialize(datum: Any): B = {
    datum match {
      case row: InternalRow => new B(row.getInt(0))
    }
  }
}
object BUDT extends BUDT

object TestNested {
  def main(args: Array[String]) = {
    val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 2 zip col).toDF()
    df.show()
    df.write.mode(SaveMode.Overwrite).save(...)
  }
}
{code}
The error log is shown below:
{noformat}
15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external IP address!
15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB)
15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1
15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73
15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) with 1 output partitions
15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at TestNested.scala:73)
15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List()
15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has no missing parents
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(59832) called with curMem=0, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 58.4 KB, free 973.6 MB)
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with curMem=59832, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.3 KB, free 973.5 MB)
15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50986 (size: 20.3 KB, free: 973.6 MB)
15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at
{noformat}
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769057#comment-14769057 ] Sean Owen commented on SPARK-10634: --- Don't you need to use "" to quote double quotes? I actually am not sure what Spark supports but that's not uncommon in SQL. I don't think you're escaping the quotes here. > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
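Standard SQL escapes a quote character inside a string literal by doubling it, which appears to be the convention Sean is suggesting; whether Spark 1.3's parser honours it for double-quoted literals is exactly what is unclear in this thread. A minimal sketch of the doubling convention, using Python's sqlite3 with single-quoted literals (the helper `sql_quote` is hypothetical, not a Spark or sqlite API):

```python
# Sketch of the SQL quote-doubling convention, shown with sqlite3's
# standard single-quoted string literals. Whether Spark's 1.3 parser
# accepts the same convention for " is the open question in this thread.
import sqlite3

def sql_quote(value):
    """Hypothetical helper: build a single-quoted SQL string literal,
    doubling any embedded quote characters."""
    return "'" + value.replace("'", "''") + "'"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")
conn.execute("INSERT INTO t VALUES (" + sql_quote("this is a 'test'") + ")")
row = conn.execute(
    "SELECT c FROM t WHERE c = " + sql_quote("this is a 'test'")
).fetchone()
print(row[0])  # the embedded quotes round-trip intact
```

In practice, parameter binding (`?` placeholders) is usually a safer fix than hand-escaping literals.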
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments). > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments. > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments). There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments). There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments). > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10635) pyspark - running on a different host
Ben Duffield created SPARK-10635: Summary: pyspark - running on a different host Key: SPARK-10635 URL: https://issues.apache.org/jira/browse/SPARK-10635 Project: Spark Issue Type: Improvement Reporter: Ben Duffield At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
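A hedged sketch of what a host-aware loader could look like: this is not the real pyspark function (whose signature is `_load_from_socket(port, serializer)` and which assumes the local host), just an illustration with the driver host made an explicit argument and a toy line-based deserializer standing in for pyspark's framed serializers:

```python
# Sketch only: the 1.3-era monkeypatch with the driver host made a
# parameter instead of being assumed local. `deserialize` is a toy
# newline-delimited stand-in for serializer.load_stream.
import socket

def _load_from_socket(host, port, deserialize=None):
    if deserialize is None:
        deserialize = lambda f: (line.rstrip(b"\n") for line in f)
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))  # host is now explicit, not assumed local
        rf = sock.makefile("rb", 65536)
        for item in deserialize(rf):
            yield item
    finally:
        sock.close()
```

`pyspark.rdd._load_from_socket` could then be rebound to a `functools.partial` pinning the remote driver's host, mirroring the monkeypatch described above.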
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769011#comment-14769011 ] Prachi Burathoki commented on SPARK-10634: -- I tried by escaping with \, but still same error Caused by: java.lang.RuntimeException: [1.130] failure: ``)'' expected but identifier test found SELECT clistc1426336010, corc2125646118, candc2031403851, SYSIBM_ROW_NUMBER FROM TABLE_1 WHERE ((clistc1426336010 = "this is a \"test\"")) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) at com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) at com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) at com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRef > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. 
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768938#comment-14768938 ] Sean Owen commented on SPARK-10634: --- Shouldn't the " be escaped in some way? or else I'm not sure how the parser would know the end of the literal from a quote inside the literal. > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
Prachi Burathoki created SPARK-10634: Summary: The spark sql fails if the where clause contains a string with " in it. Key: SPARK-10634 URL: https://issues.apache.org/jira/browse/SPARK-10634 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Prachi Burathoki When running a sql query in which the where clause contains a string with " in it, the sql parser throws an error. Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but identifier test found SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) at com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) at com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) at com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10614) SystemClock uses non-monotonic time in its wait logic
[ https://issues.apache.org/jira/browse/SPARK-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768900#comment-14768900 ] Steve Loughran commented on SPARK-10614: Having done a little more detailed research on the current state of this clock, I'm now having doubts about this. On x86, it's generally assumed that {{System.nanoTime()}} uses the {{TSC}} counter to get the timestamp, which is fast and only goes forwards (albeit at a rate which depends on CPU power states). But it turns out that on manycore CPUs, because that could lead to different answers on different cores, the OS may use alternative mechanisms to return a counter, which may be neither monotonic nor fast. # [Inside the Hotspot VM: Clocks, Timers and Scheduling Events - Part I - Windows|https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks] # [JDK-6440250 : On Windows System.nanoTime() may be 25x slower than System.currentTimeMillis()|http://bugs.java.com/view_bug.do?bug_id=6440250] # [JDK-6458294 : nanoTime affected by system clock change on Linux (RH9) or in general lacks monotonicity|http://bugs.java.com/view_bug.do?bug_id=6458294] # [Redhat on timestamps in Linux|https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html] # [Timekeeping in VMware Virtual Machines|http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf] These docs imply that nanoTime may be fast but unreliable on multiple-socket systems (the latest manycore parts share one counter), and may downgrade to something slower than calls to {{System.currentTimeMillis()}}, or even to something that isn't guaranteed to be monotonic. It's not clear that on deployments of physical many-core systems moving to nanoTime actually offers much. I don't know about EC2 or other cloud infrastructures though. 
Maybe it's just best to WONTFIX this, so as not to raise unrealistic expectations about nanoTime working. > SystemClock uses non-monotonic time in its wait logic > - > > Key: SPARK-10614 > URL: https://issues.apache.org/jira/browse/SPARK-10614 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Steve Loughran >Priority: Minor > > The consolidated (SPARK-4682) clock uses {{System.currentTimeMillis()}} for > measuring time, which means its {{waitTillTime()}} routine is brittle against > systems (VMs in particular) whose time can go backwards as well as forward. > For the {{ExecutorAllocationManager}} this appears to be a regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
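Whatever counter backs {{System.nanoTime()}}, the wait-logic concern itself is easy to illustrate. A sketch in Python rather than Spark's Scala SystemClock (so an analogy, not the Spark code): deadlines computed against a monotonic clock cannot be moved by NTP steps or VM clock corrections, which is the brittleness SPARK-10614 describes for currentTimeMillis-based waits.

```python
# Analogy in Python, not the Spark implementation: a deadline computed
# against a monotonic clock cannot be pushed backwards or forwards by
# NTP steps or VM clock corrections the way a wall-clock deadline can.
import time

def wait_till(deadline_monotonic, poll=0.01):
    """Sleep until a monotonic deadline; robust to wall-clock jumps."""
    while True:
        remaining = deadline_monotonic - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(min(poll, remaining))

start = time.monotonic()
wait_till(start + 0.05)
elapsed = time.monotonic() - start
print(f"waited {elapsed:.3f}s")  # at least the requested 0.05s
```

The same loop written against `time.time()` could spin forever (clock stepped back) or return early (clock stepped forward), which is the regression described in the issue.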
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768853#comment-14768853 ] Yashwanth Kumar commented on SPARK-6810: Hi Shivaram Venkataraman, I would like to try this. > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics
[ https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768850#comment-14768850 ] Yashwanth Kumar commented on SPARK-9492: Can I have this task? > LogisticRegression in R should provide model statistics > --- > > Key: SPARK-9492 > URL: https://issues.apache.org/jira/browse/SPARK-9492 > Project: Spark > Issue Type: Sub-task > Components: ML, R >Reporter: Eric Liang > > Like ml LinearRegression, LogisticRegression should provide a training > summary including feature names and their coefficients. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10267) Add @Since annotation to ml.util
[ https://issues.apache.org/jira/browse/SPARK-10267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10267. --- Resolution: Not A Problem Assignee: Ehsan Mohyedin Kermani > Add @Since annotation to ml.util > > > Key: SPARK-10267 > URL: https://issues.apache.org/jira/browse/SPARK-10267 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Ehsan Mohyedin Kermani >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768840#comment-14768840 ] Xiangrui Meng commented on SPARK-10262: --- [~tijoparacka] Are you still working on this? > Add @Since annotation to ml.attribute > - > > Key: SPARK-10262 > URL: https://issues.apache.org/jira/browse/SPARK-10262 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10631: -- Assignee: Vinod KC > Add missing API doc in pyspark.mllib.linalg.Vector > -- > > Key: SPARK-10631 > URL: https://issues.apache.org/jira/browse/SPARK-10631 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Vinod KC >Priority: Minor > > There are some missing API docs in pyspark.mllib.linalg.Vector (including > DenseVector and SparseVector). We should add them based on their Scala > counterparts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10276. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8677 [https://github.com/apache/spark/pull/8677] > Add @since annotation to pyspark.mllib.recommendation > - > > Key: SPARK-10276 > URL: https://issues.apache.org/jira/browse/SPARK-10276 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exists already.
[ https://issues.apache.org/jira/browse/SPARK-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10633: -- Priority: Major (was: Blocker) [~lunendl] Please have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- for example you should not set Blocker, as this can't really be considered one at this point. > Persisting Spark stream to MySQL - Spark tries to create the table for every > stream even if it exists already. > - > > Key: SPARK-10633 > URL: https://issues.apache.org/jira/browse/SPARK-10633 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 1.4.0, 1.5.0 > Environment: Ubuntu 14.04 > IntelliJ IDEA 14.1.4 > sbt > mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) >Reporter: Lunen > > Persisting Spark Kafka stream to MySQL > Spark 1.4 + tries to create a table automatically every time the stream gets > sent to a specified table. > Please note, Spark 1.3.1 works. > Code sample: > val url = "jdbc:mysql://host:port/db?user=user&password=password" > val crp = RowSetProvider.newFactory() > val crsSql: CachedRowSet = crp.createCachedRowSet() > val crsTrg: CachedRowSet = crp.createCachedRowSet() > crsSql.beforeFirst() > crsTrg.beforeFirst() > //Read Stream from Kafka > //Produce SQL INSERT STRING > > streamT.foreachRDD { rdd => > if (rdd.toLocalIterator.nonEmpty) { > sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") > while (crsSql.next) { > sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", > new Properties) > println("Persisted Data: " + "SQL INSERT STRING") > } > crsSql.beforeFirst() > } > stmt.close() > conn.close() > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exists already.
Lunen created SPARK-10633: - Summary: Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exists already. Key: SPARK-10633 URL: https://issues.apache.org/jira/browse/SPARK-10633 Project: Spark Issue Type: Bug Components: SQL, Streaming Affects Versions: 1.5.0, 1.4.0 Environment: Ubuntu 14.04 IntelliJ IDEA 14.1.4 sbt mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) Reporter: Lunen Priority: Blocker Persisting Spark Kafka stream to MySQL Spark 1.4+ tries to create a table automatically every time the stream gets sent to a specified table. Please note, Spark 1.3.1 works. Code sample: val url = "jdbc:mysql://host:port/db?user=user&password=password" val crp = RowSetProvider.newFactory() val crsSql: CachedRowSet = crp.createCachedRowSet() val crsTrg: CachedRowSet = crp.createCachedRowSet() crsSql.beforeFirst() crsTrg.beforeFirst() //Read Stream from Kafka //Produce SQL INSERT STRING streamT.foreachRDD { rdd => if (rdd.toLocalIterator.nonEmpty) { sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") while (crsSql.next) { sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", new Properties) println("Persisted Data: " + "SQL INSERT STRING") } crsSql.beforeFirst() } stmt.close() conn.close() }
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. I tried to save a DataFrame with instances of the Vector class from MLlib and got the error. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfu
[jira] [Resolved] (SPARK-10511) Source releases should not include maven jars
[ https://issues.apache.org/jira/browse/SPARK-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10511. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Resolved by https://github.com/apache/spark/pull/8774 > Source releases should not include maven jars > - > > Key: SPARK-10511 > URL: https://issues.apache.org/jira/browse/SPARK-10511 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Patrick Wendell >Assignee: Luciano Resende >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > I noticed our source jars seemed really big for 1.5.0. At least one > contributing factor is that, likely due to some change in the release script, > the maven jars are being bundled in with the source code in our build > directory. This runs afoul of the ASF policy on binaries in source releases - > we should fix it in 1.5.1. > The issue (I think) is that we might invoke maven to compute the version > between when we checkout Spark from github and when we package the source > file. I think it could be fixed by simply clearing out the build/ directory > after that statement runs.
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. At first I thought it was a problem with my UDT class, then tried the Vector class from MLlib and the error was the same. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources
[jira] [Created] (SPARK-10632) Cannot save DataFrame with User Defined Types
Joao created SPARK-10632: Summary: Cannot save DataFrame with User Defined Types Key: SPARK-10632 URL: https://issues.apache.org/jira/browse/SPARK-10632 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Joao Cannot save DataFrames that contain user-defined types. At first I thought it was a problem with my UDT class, then tried the Vector class from MLlib and the error was the same. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$o
[jira] [Updated] (SPARK-10515) When killing executor, the pending replacement executors will be lost
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-10515: -- Summary: When killing executor, the pending replacement executors will be lost (was: When kill executor, there is no need to seed RequestExecutors to AM) > When killing executor, the pending replacement executors will be lost > - > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > >
[jira] [Updated] (SPARK-10508) incorrect evaluation of searched case expression
[ https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10508: -- Assignee: Josh Rosen > incorrect evaluation of searched case expression > > > Key: SPARK-10508 > URL: https://issues.apache.org/jira/browse/SPARK-10508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Assignee: Josh Rosen > Fix For: 1.4.1, 1.5.0 > > > The following case expression never evaluates to 'test1' when cdec is -1 or > 10, as it does in Hive 0.13. Instead it returns 'other' for all rows. > {code} > select rnum, cdec, case when cdec in ( -1,10,0.1 ) then 'test1' else 'other' > end from tdec > create table if not exists TDEC ( RNUM int , CDEC decimal(7, 2 )) > TERMINATED BY '\n' > STORED AS orc ; > 0|\N > 1|-1.00 > 2|0.00 > 3|1.00 > 4|0.10 > 5|10.00 > {code}
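One plausible workaround, not verified against the eventual fix for SPARK-10508, is to avoid the mixed-type IN list by explicitly casting every literal to the column's decimal type, so the comparison no longer depends on implicit type coercion:

```scala
// Hypothetical rewrite of the reporter's query: cast the IN-list literals
// to DECIMAL(7,2) (the declared type of cdec) before comparing.
val workaroundSql =
  """select rnum, cdec,
    |  case when cdec in (cast(-1 as decimal(7,2)),
    |                     cast(10 as decimal(7,2)),
    |                     cast(0.1 as decimal(7,2)))
    |       then 'test1' else 'other' end
    |from tdec""".stripMargin
// In a Spark shell: sqlContext.sql(workaroundSql).show()
```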
[jira] [Updated] (SPARK-10475) improve column pruning for Project on Sort
[ https://issues.apache.org/jira/browse/SPARK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10475: -- Assignee: Wenchen Fan > improve column pruning for Project on Sort > --- > > Key: SPARK-10475 > URL: https://issues.apache.org/jira/browse/SPARK-10475 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 1.6.0 > >
[jira] [Updated] (SPARK-9032) scala.MatchError in DataFrameReader.json(String path)
[ https://issues.apache.org/jira/browse/SPARK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9032: - Assignee: Josh Rosen > scala.MatchError in DataFrameReader.json(String path) > - > > Key: SPARK-9032 > URL: https://issues.apache.org/jira/browse/SPARK-9032 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 15.04 >Reporter: Philipp Poetter >Assignee: Josh Rosen > Fix For: 1.4.1 > > > Executing read().json() of SQLContext e.g. DataFrameReader raises a > MatchError with a stacktrace as follows while trying to read JSON data: > {code} > 15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, > took 6.981330 s > Exception in thread "main" scala.MatchError: StringType (of class > org.apache.spark.sql.types.StringType$) > at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139) > at > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.json.JSONRelation.schema$lzycompute(JSONRelation.scala:137) > at org.apache.spark.sql.json.JSONRelation.schema(JSONRelation.scala:137) > at > org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:213) > at com.hp.sparkdemo.Example.main(Example.java:23) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 15/07/14 11:25:26 INFO SparkContext: Invoking stop() from shutdown hook > 15/07/14 11:25:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040 > 15/07/14 11:25:26 INFO DAGScheduler: Stopping DAGScheduler > 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Shutting down all > executors > 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Asking each executor to > shut down > 15/07/14 11:25:26 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! 
> {code} > Offending code snippet (around line 23): > {code} >JavaSparkContext sctx = new JavaSparkContext(sparkConf); > SQLContext ctx = new SQLContext(sctx); > DataFrame frame = ctx.read().json(facebookJSON); > frame.printSchema(); > {code} > The exception is reproducible using the following JSON: > {code} > { >"data": [ > { > "id": "X999_Y999", > "from": { > "name": "Tom Brady", "id": "X12" > }, > "message": "Looking forward to 2010!", > "actions": [ > { >"name": "Comment", >"link": "http://www.facebook.com/X999/posts/Y999" > }, > { >"name": "Like", >"link": "http://www.facebook.com/X999/posts/Y999" > } > ], > "type": "status", > "created_time": "2010-08-02T21:27:44+", > "updated_time": "2010-08-02T21:27:44+" > }, > { > "id": "X998_Y998", > "from": { > "name": "Peyton Manning", "id": "X18" > }, > "message": "Where's my contract?", > "actions": [ > { >"name": "Comment", >"link": "http://www.facebook.com/X998/posts/Y998" > }, > { >"name": "Like", >"link": "http://www.facebook.com/X998/posts/Y998" > } > ], > "type": "status", > "created_time": "2010-08-02T21:27:44+", > "updated_time": "2010-08-02T21:27:44+" > } >] > } > {code}
[jira] [Updated] (SPARK-9343) DROP TABLE ignores IF EXISTS clause
[ https://issues.apache.org/jira/browse/SPARK-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9343: - Assignee: Michael Armbrust > DROP TABLE ignores IF EXISTS clause > --- > > Key: SPARK-9343 > URL: https://issues.apache.org/jira/browse/SPARK-9343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov >Assignee: Michael Armbrust > Labels: sql > Fix For: 1.5.0 > > > If a table is missing, {{DROP TABLE IF EXISTS _tableName_}} generates an > exception: > {code} > 15/07/25 15:17:32 ERROR Hive: NoSuchObjectException(message:default.test > table not found) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy27.get_table(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy28.getTable(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) > at > 
org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3846) > > {code} > A standalone test with full spark-shell output is available at: > [https://gist.github.com/ssimeonov/eeb388d13f802689d772]
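Until the underlying fix, one way to avoid the logged NoSuchObjectException is to guard the drop with an explicit existence check instead of relying on IF EXISTS. This is a hedged sketch under the assumption that the table is visible via SQLContext.tableNames(); the table name is a placeholder, and the Spark calls are shown as comments since they need a live context:

```scala
// Hedged workaround sketch: only issue the DROP when the table is actually
// present, so the metastore lookup never throws. "test" is a placeholder.
val tableName = "test"
val dropSql = s"DROP TABLE $tableName"
// In a spark-shell with a HiveContext:
//   if (sqlContext.tableNames().contains(tableName)) {
//     sqlContext.sql(dropSql)
//   }
```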
[jira] [Updated] (SPARK-9033) scala.MatchError: interface java.util.Map (of class java.lang.Class) with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9033: - Assignee: Josh Rosen > scala.MatchError: interface java.util.Map (of class java.lang.Class) with > Spark SQL > --- > > Key: SPARK-9033 > URL: https://issues.apache.org/jira/browse/SPARK-9033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.2, 1.3.1 >Reporter: Pavel >Assignee: Josh Rosen > Fix For: 1.4.0 > > > I've a java.util.Map field in a POJO class and I'm trying to > use it to createDataFrame (1.3.1) / applySchema(1.2.2) with the SQLContext > and getting following error in both 1.2.2 & 1.3.1 versions of the Spark SQL: > *sample code: > {code} > SQLContext sqlCtx = new SQLContext(sc.sc()); > JavaRDD rdd = sc.textFile("/path").map(line-> Event.fromString(line)); > //text line is splitted and assigned to respective field of the event class > here > DataFrame schemaRDD = sqlCtx.createDataFrame(rdd, Event.class); <-- error > thrown here > schemaRDD.registerTempTable("events"); > {code} > Event class is a Serializable containing a field of type > java.util.Map. This issue occurs also with Spark streaming > when used with SQL. > {code} > JavaDStream receiverStream = jssc.receiverStream(new > StreamingReceiver()); > JavaDStream windowDStream = receiverStream.window(WINDOW_LENGTH, > SLIDE_INTERVAL); > jssc.checkpoint("event-streaming"); > windowDStream.foreachRDD(evRDD -> { >if(evRDD.count() == 0) return null; > DataFrame schemaRDD = sqlCtx.createDataFrame(evRDD, Event.class); > schemaRDD.registerTempTable("events"); > ... 
> } > {code} > *error: > {code} > scala.MatchError: interface java.util.Map (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > {code} > **also this occurs for fields of custom POJO classes: > {code} > scala.MatchError: class com.test.MyClass (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 
~[scala-library-2.10.5.jar:na] > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > {code} > **also occurs for Calendar type: > {code} > scala.MatchError: class java.util.Calendar (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSc
[jira] [Updated] (SPARK-10437) Support aggregation expressions in Order By
[ https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10437: -- Assignee: Liang-Chi Hsieh > Support aggregation expressions in Order By > --- > > Key: SPARK-10437 > URL: https://issues.apache.org/jira/browse/SPARK-10437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Harish Butani >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > Followup on SPARK-6583 > The following still fails. > {code} > val df = sqlContext.read.json("examples/src/main/resources/people.json") > df.registerTempTable("t") > val df2 = sqlContext.sql("select age, count(*) from t group by age order by > count(*)") > df2.show() > {code} > {code:title=StackTrace} > Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No > function to evaluate expression. type: Count, tree: COUNT(1) > at > org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41) > at > org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219) > {code} > In 1.4 the issue seemed to be BindReferences.bindReference didn't handle this > case. > Haven't looked at 1.5 code, but don't see a change to bindReference in this > patch.
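A common rewrite that sidesteps ordering by a raw aggregate expression (offered here as an assumption, not something confirmed by the ticket) is to alias the aggregate in the SELECT list and sort on the alias, so ORDER BY references a resolved output column instead of re-evaluating COUNT(1) during ordering:

```scala
// Hypothetical workaround sketch for the failing query above: alias
// count(*) as cnt and order by the alias rather than the expression.
val rewrittenSql =
  "select age, count(*) as cnt from t group by age order by cnt"
// In a Spark shell: sqlContext.sql(rewrittenSql).show()
```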