[jira] [Updated] (SPARK-10050) Support collecting data of MapType in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-10050: -- Assignee: Sun Rui > Support collecting data of MapType in DataFrame > --- > > Key: SPARK-10050 > URL: https://issues.apache.org/jira/browse/SPARK-10050 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Sun Rui >Assignee: Sun Rui > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7841) Spark build should not use lib_managed for dependencies
[ https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791063#comment-14791063 ] Josh Rosen commented on SPARK-7841: --- I agree that we can probably fix this, but note that we'll have to do something about how lib_managed is used in the dev/mima script. > Spark build should not use lib_managed for dependencies > --- > > Key: SPARK-7841 > URL: https://issues.apache.org/jira/browse/SPARK-7841 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 >Reporter: Iulian Dragos > Labels: easyfix, sbt > > - unnecessary duplication (I will have those libraries under ./m2, via maven > anyway) > - every time I call make-distribution I lose lib_managed (via mvn clean > install) and have to wait for all jars to download again the next time I use sbt > - Eclipse does not handle relative paths very well (source attachments from > lib_managed don't always work) > - it's not the default configuration. If we stray from defaults I think there > should be a clear advantage. > Digging through history, the only reference to `retrieveManaged := true` I > found was in f686e3d, from July 2011 ("Initial work on converting build to > SBT 0.10.1"). My guess is that this is purely an accident of porting the build > from sbt 0.7.x and trying to keep the old project layout. > If there are reasons for keeping it, please comment (I didn't get any answers > on the [dev mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html])
[jira] [Resolved] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6513. --- Resolution: Won't Fix
> Add zipWithUniqueId (and other RDD APIs) to RDDApi
> --
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I don't think it's relevant)
> Reporter: Eran Medan
> Priority: Minor
>
> It would be nice if we could treat a DataFrame just like an RDD (wherever it makes sense).
> *Worked in 1.2.1*
> {code}
> val sqlContext = new HiveContext(sc)
> import sqlContext._
> val jsonRDD = sqlContext.jsonFile(jsonFilePath)
> jsonRDD.registerTempTable("jsonTable")
> val jsonResult = sql(s"select * from jsonTable")
> val foo = jsonResult.zipWithUniqueId().map {
>   case (Row(...), uniqueId) => // do something useful
>   ...
> }
> foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0*
> {code}
> jsonResult.zipWithUniqueId() // since RDDApi doesn't implement that method
> {code}
> *Non-working workaround:*
> Although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.rdd.zipWithUniqueId()
> {code}
> it won't help, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method:
> {code}
> foo.registerTempTable("...")
> {code}
> (See related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)
> EDIT: changed from issue to enhancement request
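For readers unfamiliar with the API being requested: `zipWithUniqueId` assigns ids without launching a Spark job, giving the j-th item of partition k the id `j * n + k` (n = number of partitions). A minimal pure-Python model of that documented id scheme (this is a sketch, not Spark's implementation):

```python
def zip_with_unique_id(partitions):
    """Model of RDD.zipWithUniqueId: item j of partition k gets id j * n + k,
    where n is the number of partitions. Ids are unique but not consecutive."""
    n = len(partitions)
    return [[(item, j * n + k) for j, item in enumerate(part)]
            for k, part in enumerate(partitions)]

# Two partitions: ids interleave across partitions without a global count.
pairs = zip_with_unique_id([["a", "b"], ["c", "d", "e"]])
```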
[jira] [Updated] (SPARK-10647) Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-10647: - Issue Type: Improvement (was: New Feature) > Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir > -- > > Key: SPARK-10647 > URL: https://issues.apache.org/jira/browse/SPARK-10647 > Project: Spark > Issue Type: Improvement >Reporter: Alan Braithwaite >Priority: Minor > > This property doesn't match up with the other properties surrounding it, > namely: > spark.mesos.deploy.zookeeper.url > and > spark.mesos.deploy.recoveryMode > Since it's also a property specific to mesos, it makes sense to be under that > hierarchy as well.
[jira] [Created] (SPARK-10647) Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir
Alan Braithwaite created SPARK-10647: Summary: Rename property spark.deploy.zookeeper.dir to spark.mesos.deploy.zookeeper.dir Key: SPARK-10647 URL: https://issues.apache.org/jira/browse/SPARK-10647 Project: Spark Issue Type: New Feature Reporter: Alan Braithwaite Priority: Minor This property doesn't match up with the other properties surrounding it, namely: spark.mesos.deploy.zookeeper.url and spark.mesos.deploy.recoveryMode Since it's also a property specific to mesos, it makes sense to be under that hierarchy as well.
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Description: Pearson's chi-squared goodness of fit test for observed against the expected distribution. > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution.
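The statistic this sub-task describes is the classic goodness-of-fit measure, sum over categories of (observed - expected)^2 / expected. A minimal pure-Python sketch of the computation (illustrative only; not the proposed Spark implementation):

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic: sum((O - E)^2 / E)."""
    assert len(observed) == len(expected), "count vectors must align"
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed counts [10, 20, 30] against a uniform expectation of 20 each:
# (100 + 0 + 100) / 20 = 10.0
stat = chi_squared_statistic([10, 20, 30], [20, 20, 20])
```

In practice the statistic would then be compared against a chi-squared distribution with (k - 1) degrees of freedom to get a p-value.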
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791016#comment-14791016 ] Nicholas Chammas commented on SPARK-4216: - Thanks Josh! > Eliminate duplicate Jenkins GitHub posts from AMPLab > > > Key: SPARK-4216 > URL: https://issues.apache.org/jira/browse/SPARK-4216 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > * [Real Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873361] > * [Imposter Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA
[jira] [Created] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
Jihong MA created SPARK-10646: - Summary: Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical Key: SPARK-10646 URL: https://issues.apache.org/jira/browse/SPARK-10646 Project: Spark Issue Type: New Feature Reporter: Jihong MA
[jira] [Updated] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10645: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics for continuous vs. continuous > -- > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > > This is an umbrella JIRA covering bivariate statistics for continuous vs. > continuous columns, including covariance, Pearson's correlation, and > Spearman's correlation (for both continuous & categorical).
[jira] [Created] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
Jihong MA created SPARK-10645: - Summary: Bivariate Statistics for continuous vs. continuous Key: SPARK-10645 URL: https://issues.apache.org/jira/browse/SPARK-10645 Project: Spark Issue Type: New Feature Reporter: Jihong MA This is an umbrella JIRA covering bivariate statistics for continuous vs. continuous columns, including covariance, Pearson's correlation, and Spearman's correlation (for both continuous & categorical).
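Of the statistics listed under this umbrella, Pearson's correlation is the simplest to sketch from first principles. A pure-Python reference computation (illustrative only; not the Spark implementation, which would run distributed over columns):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear relationship yields r = 1.0.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

Spearman's correlation, also mentioned above, is the same formula applied to the ranks of the values rather than the values themselves.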
[jira] [Resolved] (SPARK-3497) Report serialized size of task binary
[ https://issues.apache.org/jira/browse/SPARK-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3497. --- Resolution: Fixed We now have an automatic warning-level log message for large closures. > Report serialized size of task binary > - > > Key: SPARK-3497 > URL: https://issues.apache.org/jira/browse/SPARK-3497 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Sandy Ryza > > This is useful for determining that task closures are larger than expected.
[jira] [Closed] (SPARK-4087) Only use broadcast for large tasks
[ https://issues.apache.org/jira/browse/SPARK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen closed SPARK-4087. - Resolution: Won't Fix > Only use broadcast for large tasks > -- > > Key: SPARK-4087 > URL: https://issues.apache.org/jira/browse/SPARK-4087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu >Priority: Critical > > After we switched to broadcasting every task, some regressions were introduced > because broadcast is not stable enough. > So we would like to use broadcast only for large tasks, which keeps the same > behaviour as 1.0 for most cases.
[jira] [Resolved] (SPARK-4738) Update the netty-3.x version in spark-assembly-*.jar
[ https://issues.apache.org/jira/browse/SPARK-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4738. --- Resolution: Incomplete Resolving as "Incomplete" since this is an old issue and it doesn't look like there's any action to take here. In newer Spark versions, you should be able to use the various user-classpath-first configurations to work around this. > Update the netty-3.x version in spark-assembly-*.jar > > > Key: SPARK-4738 > URL: https://issues.apache.org/jira/browse/SPARK-4738 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.1.0 >Reporter: Tobias Pfeiffer >Priority: Minor > > It seems as if the version of akka-remote (2.2.3-shaded-protobuf) that is > bundled in the spark-assembly-1.1.1-hadoop2.4.0.jar file pulls in an ancient > version of netty, namely io.netty:netty:3.6.6.Final (using the package > org.jboss.netty). This means that when using spark-submit, there will always > be this netty version on the classpath before any versions added by the user. > This may lead to issues with other packages that depend on newer versions and > may fail with java.lang.NoSuchMethodError etc. (finagle-http in my case). > I wonder if it is possible to manually include a newer netty version, like > netty-3.8.0.Final.
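The user-classpath-first workaround mentioned in the resolution can be sketched roughly as follows. This is a hypothetical invocation: `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst` are experimental properties in later Spark releases, and the application class and jar names below are placeholders; check the configuration docs for your version before relying on the exact names.

```shell
# Hedged sketch: prefer the user's newer netty over the copy bundled in the
# assembly jar. Property names are from later Spark releases (experimental);
# com.example.App and your-app.jar are hypothetical placeholders.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars netty-3.8.0.Final.jar \
  --class com.example.App your-app.jar
```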
[jira] [Comment Edited] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION
[ https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790945#comment-14790945 ] Josh Rosen edited comment on SPARK-4568 at 9/16/15 7:00 PM: We now do this. Specifically, we publish RC's under both names. was (Author: joshrosen): We now do this. > Publish release candidates under $VERSION-RCX instead of $VERSION > - > > Key: SPARK-4568 > URL: https://issues.apache.org/jira/browse/SPARK-4568 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical
[jira] [Resolved] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION
[ https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4568. --- Resolution: Fixed We now do this. > Publish release candidates under $VERSION-RCX instead of $VERSION > - > > Key: SPARK-4568 > URL: https://issues.apache.org/jira/browse/SPARK-4568 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical
[jira] [Resolved] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4442. --- Resolution: Won't Fix We're using test-jar dependencies instead, so this is "Won't Fix". > Move common unit test utilities into their own package / module > --- > > Key: SPARK-4442 > URL: https://issues.apache.org/jira/browse/SPARK-4442 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Josh Rosen >Priority: Minor > > We should move generally useful unit-test fixtures and utility methods into > their own test-utilities package / module to make them easier to find and use. > See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for > one example of this.
[jira] [Commented] (SPARK-4440) Enhance the job progress API to expose more information
[ https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790935#comment-14790935 ] Josh Rosen commented on SPARK-4440: --- Does anyone still want these extensions? If so, can you please come up with a concrete proposal for exactly what you'd like to add? > Enhance the job progress API to expose more information > --- > > Key: SPARK-4440 > URL: https://issues.apache.org/jira/browse/SPARK-4440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Rui Li > > The progress API introduced in SPARK-2321 provides a new way for users to > monitor job progress. However, the information exposed in the API is > relatively limited. It would be much more useful if the API were enhanced to > expose more data. > Improvements may include, but are not limited to: > 1. Stage submission and completion time. > 2. Task metrics. > The requirement was initially identified for the Hive on Spark > project (HIVE-7292); other applications should benefit as well.
[jira] [Resolved] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located behind NAT
[ https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4389. --- Resolution: Won't Fix I'm going to resolve this as "Won't Fix": - Akka 2.4 has not shipped yet, so this feature isn't supported in any released version. - Akka 2.4 requires Java 8 and Scala 2.11 or 2.12, meaning that we can't use it in Spark as long as we need to continue to support Java 7 and Scala 2.10. We'll replace Akka RPC with a custom RPC layer long before we'll be able to drop support for these platforms. > Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located > behind NAT > - > > Key: SPARK-4389 > URL: https://issues.apache.org/jira/browse/SPARK-4389 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > We should set {{akka.remote.netty.tcp.bind-hostname="0.0.0.0"}} in our Akka > configuration so that Spark drivers can be located behind NATs / work with > weird DNS setups. > This is blocked by upgrading our Akka version, since this configuration is > not present in Akka 2.3.4. There might be a different approach / workaround > that works on our current Akka version, though. > EDIT: this is blocked by Akka 2.4, since this feature is only available in > the 2.4 snapshot release.
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790924#comment-14790924 ] Josh Rosen commented on SPARK-4216: --- We now have a bot which deletes the duplicated posts after they're posted. This de-clutters the thread for folks who read it on GitHub. > Eliminate duplicate Jenkins GitHub posts from AMPLab > > > Key: SPARK-4216 > URL: https://issues.apache.org/jira/browse/SPARK-4216 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > * [Real Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873361] > * [Imposter Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Resolved] (SPARK-3949) Use IAMRole in lieu of static access key-id/secret
[ https://issues.apache.org/jira/browse/SPARK-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3949. --- Resolution: Incomplete As of SPARK-8576, spark-ec2 can launch instances with IAM instance roles. I'm going to close this issue as "Incomplete" since it's underspecified; it would be helpful to know more specifically where you think we need support for IAM roles instead of keys (i.e. while launching the cluster? for configuring access to S3?). > Use IAMRole in lieu of static access key-id/secret > -- > > Key: SPARK-3949 > URL: https://issues.apache.org/jira/browse/SPARK-3949 > Project: Spark > Issue Type: Improvement > Components: EC2 >Affects Versions: 1.1.0 >Reporter: Rangarajan Sreenivasan > > Spark currently supports AWS resource access through user-specific > key-id/secret. While this works, the AWS recommended way is to use IAM Roles > instead of specific key-id/secrets. > http://docs.aws.amazon.com/IAM/latest/UserGuide/IAMBestPractices.html#use-roles-with-ec2 > http://docs.aws.amazon.com/IAM/latest/UserGuide/IAM_Introduction.html
[jira] [Created] (SPARK-10644) Applications wait even if free executors are available
Balagopal Nair created SPARK-10644: -- Summary: Applications wait even if free executors are available Key: SPARK-10644 URL: https://issues.apache.org/jira/browse/SPARK-10644 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.5.0 Environment: RHEL 6.5 64 bit Reporter: Balagopal Nair
Number of workers: 21
Number of executors: 63
Steps to reproduce:
1. Run 4 jobs, each with max cores set to 10
2. The first 3 jobs run with 10 each (30 executors consumed so far)
3. The 4th job waits even though there are 33 idle executors
The reason is that a job will not get executors unless the total number of EXECUTORS in use < the number of WORKERS. If there are executors available, resources should be allocated to the pending job.
[jira] [Resolved] (SPARK-3603) InvalidClassException on a Linux VM - probably problem with serialization
[ https://issues.apache.org/jira/browse/SPARK-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3603. --- Resolution: Cannot Reproduce Resolving as "Cannot Reproduce", since this is an old issue that hasn't received any updates since 1.1.0. Please re-open and update if this is still a problem. > InvalidClassException on a Linux VM - probably problem with serialization > - > > Key: SPARK-3603 > URL: https://issues.apache.org/jira/browse/SPARK-3603 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 > Environment: Linux version 2.6.32-358.32.3.el6.x86_64 > (mockbu...@x86-029.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red > Hat 4.4.7-3) (GCC) ) #1 SMP Fri Jan 17 08:42:31 EST 2014 > java version "1.7.0_25" > OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) > OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode) > Spark (either 1.0.0 or 1.1.0) >Reporter: Tomasz Dudziak >Priority: Critical > Labels: scala, serialization, spark > > I have a Scala app connecting to a standalone Spark cluster. It works fine on > Windows or on a Linux VM; however, when I try to run the app and the Spark > cluster on another Linux VM (the same Linux kernel, Java and Spark - tested > for versions 1.0.0 and 1.1.0) I get the below exception. This looks kind of > similar to the big-endian (IBM POWER7) Spark serialization issue > (SPARK-2018), but... my system is definitely little-endian and I understand > the big-endian issue should already be fixed in Spark 1.1.0 anyway. I'd > appreciate your help.
> 01:34:53.251 WARN [Result resolver thread-0][TaskSetManager] Lost TID 2 > (task 1.0:2) > 01:34:53.278 WARN [Result resolver thread-0][TaskSetManager] Loss was due to > java.io.InvalidClassException > java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class > incompatible: stream classdesc serialVersionUID = -4937928798201944954, local > class serialVersionUID = -8102093212602380348 > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) > at > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) > at > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at scala.collection.immutable.$colon$colon.readObject(List.scala:362) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) 
> at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at scala.collection.immutabl
[jira] [Updated] (SPARK-10643) Support HDFS urls in spark-submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-10643: - Description: When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. {code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. Code in question: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L685-L698 was: When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. 
{code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. > Support HDFS urls in spark-submit > - > > Key: SPARK-10643 > URL: https://issues.apache.org/jira/browse/SPARK-10643 > Project: Spark > Issue Type: New Feature >Reporter: Alan Braithwaite >Priority: Minor > > When using mesos with docker and marathon, it would be nice to be able to > make spark-submit deployable on marathon and have that download a jar from > HDFS instead of having to package the jar with the docker. > {code} > $ docker run -it docker.example.com/spark:latest > /usr/local/spark/bin/spark-submit --class > com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar > Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. 
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) >
[jira] [Resolved] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3489. --- Resolution: Won't Fix Resolving as "Won't Fix" per PR discussion. > support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params > -- > > Key: SPARK-3489 > URL: https://issues.apache.org/jira/browse/SPARK-3489 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Mohit Jaggi >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
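The request above (closed as Won't Fix) was for a variadic `rdd.zip`. In plain Python the equivalent behaviour is the built-in variadic `zip`; the sketch below models it over lists in place of RDDs (the `zip_many` helper is hypothetical, not Spark API), including the same-length check that mirrors Spark's requirement that zipped RDDs share partitioning and element counts:

```python
def zip_many(*seqs):
    """Zip any number of equal-length sequences into tuples (hypothetical
    list-based analogue of the proposed variadic RDD.zip)."""
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        # mirrors Spark's constraint that zipped RDDs line up element-for-element
        raise ValueError("all inputs must have the same length")
    return list(zip(*seqs))

print(zip_many([1, 2], ["a", "b"], [True, False]))
```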
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10371: -- Description: In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, it might produce DataFrames like the following columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. It would be nice to detect this pattern and re-use intermediate values. {code} val input = sqlContext.range(10) val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 2) output.explain(true) == Parsed Logical Plan == 'Project [*,('x * 2) AS y#254] Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Analyzed Logical Plan == id: bigint, x: bigint, y: bigint Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Optimized Logical Plan == Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 == Physical Plan == TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] Scan PhysicalRDD[id#252L] Code Generation: true input: org.apache.spark.sql.DataFrame = [id: bigint] output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] {code} was: In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, it might produce DataFrames like the following columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. 
However, if we materialize c and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. It would be nice to detect this pattern and re-use intermediate values. > Optimize sequential projections > --- > > Key: SPARK-10371 > URL: https://issues.apache.org/jira/browse/SPARK-10371 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > > In ML pipelines, each transformer/estimator appends new columns to the input > DataFrame. For example, it might produce DataFrames like the following > columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), > and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c > and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. > It would be nice to detect this pattern and re-use intermediate values. > {code} > val input = sqlContext.range(10) > val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * > 2) > output.explain(true) > == Parsed Logical Plan == > 'Project [*,('x * 2) AS y#254] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Analyzed Logical Plan == > id: bigint, x: bigint, y: bigint > Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Optimized Logical Plan == > Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Physical Plan == > TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS > y#254L] > Scan PhysicalRDD[id#252L] > Code Generation: true > input: org.apache.spark.sql.DataFrame = [id: bigint] > output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, 
e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
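The optimized plan above shows why the issue arises: projection collapsing inlines `(id + 1)` into both output expressions, so an expensive UDF runs once per referencing column. A pure-Python illustration of that recomputation (the call counter and `udf_b` stand-in are invented for the example):

```python
calls = {"udf_b": 0}

def udf_b(a):
    calls["udf_b"] += 1  # stand-in for an expensive UDF
    return a + 1

# Collapsed projection: both x and y re-evaluate udf_b(a), just as the
# optimized plan computes (id + 1) separately for x and for y.
rows = []
for a in range(10):
    x = udf_b(a)
    y = udf_b(a) * 2  # intermediate value not reused: udf_b fires again
    rows.append((a, x, y))

print(calls["udf_b"])  # two calls per row with the collapsed plan
```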
[jira] [Created] (SPARK-10643) Support HDFS urls in spark-submit
Alan Braithwaite created SPARK-10643: Summary: Support HDFS urls in spark-submit Key: SPARK-10643 URL: https://issues.apache.org/jira/browse/SPARK-10643 Project: Spark Issue Type: New Feature Reporter: Alan Braithwaite Priority: Minor When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker. {code} $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
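The "Skip remote jar" warning comes from client-mode submission refusing non-local URIs. A minimal sketch of such a scheme check (hypothetical helper, not Spark's actual code; the set of "local" schemes is an assumption):

```python
from urllib.parse import urlparse

LOCAL_SCHEMES = {"", "file", "local"}  # assumption: only these count as local

def is_remote_jar(uri):
    """Return True for jars spark-submit would have to download first."""
    return urlparse(uri).scheme not in LOCAL_SCHEMES

print(is_remote_jar("hdfs://hdfs/tmp/application.jar"))  # True: triggers the skip
print(is_remote_jar("/usr/local/spark/app.jar"))         # False: usable as-is
```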
[jira] [Resolved] (SPARK-2991) RDD transforms for scan and scanLeft
[ https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2991. --- Resolution: Won't Fix > RDD transforms for scan and scanLeft > - > > Key: SPARK-2991 > URL: https://issues.apache.org/jira/browse/SPARK-2991 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Erik Erlandson >Assignee: Erik Erlandson >Priority: Minor > Labels: features > > Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) > and scanLeft(z)(f) (sequential prefix scan) > Discussion of a scanLeft implementation: > http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ > Discussion of scan: > http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
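The sequential prefix scan `scanLeft(z)(f)` proposed above corresponds to `itertools.accumulate` in Python. A sketch on a plain list (not an RDD implementation; the cascade-RDD approach in the linked posts is what a distributed version would need):

```python
from itertools import accumulate, chain

def scan_left(xs, z, f):
    """Sequential prefix scan: [z, f(z, x0), f(f(z, x0), x1), ...]."""
    return list(accumulate(chain([z], xs), f))

print(scan_left([1, 2, 3], 0, lambda acc, x: acc + x))  # running sums, seed included
```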
[jira] [Resolved] (SPARK-2496) Compression streams should write its codec info to the stream
[ https://issues.apache.org/jira/browse/SPARK-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2496. --- Resolution: Incomplete Resolving as "Incomplete"; if we still want to do this then we should wait until we have a specific concrete use-case / list of things that need to be changed. > Compression streams should write its codec info to the stream > - > > Key: SPARK-2496 > URL: https://issues.apache.org/jira/browse/SPARK-2496 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Critical > > Spark sometime store compressed data outside of Spark (e.g. event logs, > blocks in tachyon), and those data are read back directly using the codec > configured by the user. When the codec differs between runs, Spark wouldn't > be able to read the codec back. > I'm not sure what the best strategy here is yet. If we write the codec > identifier for all streams, then we will be writing a lot of identifiers for > shuffle blocks. One possibility is to only write it for blocks that will be > shared across different Spark instances (i.e. managed outside of Spark), > which includes tachyon blocks and event log blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
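One shape the proposal could take: prefix externally-stored streams with a codec identifier so readers no longer depend on the writer's configuration. A hedged sketch (the header format, magic bytes, and helper names are invented for illustration, and only one codec is wired up):

```python
import zlib

MAGIC = b"CODC"  # hypothetical 4-byte marker identifying a codec-tagged stream

def write_with_codec(payload, codec_name="zlib"):
    """Prepend a codec identifier header, then the compressed payload."""
    name = codec_name.encode()
    header = MAGIC + bytes([len(name)]) + name
    return header + zlib.compress(payload)

def read_with_codec(blob):
    """Read the codec name from the header, then decompress accordingly."""
    assert blob[:4] == MAGIC, "not a codec-tagged stream"
    n = blob[4]
    codec = blob[5:5 + n].decode()
    assert codec == "zlib"  # the only codec in this sketch
    return zlib.decompress(blob[5 + n:])

print(read_with_codec(write_with_codec(b"event log data")))
```

As the issue notes, tagging every shuffle block this way would add overhead, which is why the proposal limits it to blocks shared across Spark instances.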
[jira] [Resolved] (SPARK-2302) master should discard exceeded completedDrivers
[ https://issues.apache.org/jira/browse/SPARK-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2302. --- Resolution: Fixed Assignee: Lianhui Wang Fix Version/s: 1.1.0 > master should discard exceeded completedDrivers > > > Key: SPARK-2302 > URL: https://issues.apache.org/jira/browse/SPARK-2302 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Lianhui Wang >Assignee: Lianhui Wang > Fix For: 1.1.0 > > > When completedDrivers number exceeds the threshold, the first > Max(spark.deploy.retainedDrivers, 1) will be discarded. > see PR: > https://github.com/apache/spark/pull/1114 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
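The fix bounds `completedDrivers` by a retention cap. A sketch of that retention policy in Python (the function and the trim-to-cap policy are illustrative; the exact batch size discarded follows the linked PR):

```python
def record_completed(completed, driver, retained=200):
    """Append a finished driver, discarding the oldest entries once the
    list exceeds the retention cap (cf. spark.deploy.retainedDrivers)."""
    completed.append(driver)
    if len(completed) > retained:
        del completed[0:len(completed) - retained]  # drop oldest beyond the cap
    return completed
```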
[jira] [Commented] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"
[ https://issues.apache.org/jira/browse/SPARK-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790884#comment-14790884 ] Thouis Jones commented on SPARK-10642: -- Simpler cases. Fails: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().lookup(('a', 'b')) {code} Works: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().map(lambda x: x).lookup(('a', 'b')) {code} > Crash in rdd.lookup() with "java.lang.Long cannot be cast to > java.lang.Integer" > --- > > Key: SPARK-10642 > URL: https://issues.apache.org/jira/browse/SPARK-10642 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 > Environment: OSX >Reporter: Thouis Jones > > Running this command: > {code} > sc.parallelize([(('a', 'b'), > 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b')) > {code} > gives the following error: > {noformat} > 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at > PythonRDD.scala:361 > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", > line 2199, in lookup > return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)]) > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", > line 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", > line 36, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : java.lang.ClassCastException: java.lang.Long cannot be cast to > java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530) > at scala.collection.Iterator$class.find(Iterator.scala:780) > at scala.collection.AbstractIterator.find(Iterator.scala:1157) > at scala.collection.IterableLike$class.find(IterableLike.scala:79) > at scala.collection.AbstractIterable.find(Iterable.scala:54) > at > org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839) > at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361) > at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) > at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
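The `map(lambda x: x)` workaround above works because `map` discards the partitioner, so `lookup()` falls back to scanning every partition instead of computing a single partition index (the code path that trips the Long/Integer cast). A simplified pure-Python sketch of the two lookup paths (data layout and helper hypothetical):

```python
def lookup(partitions, key, partitioner=None):
    """Mimic RDD.lookup: jump to one partition when a partitioner is known,
    otherwise scan all partitions (the map(lambda x: x) workaround path)."""
    if partitioner is not None:
        candidates = [partitions[partitioner(key)]]  # single targeted partition
    else:
        candidates = partitions                      # full scan
    return [v for part in candidates for (k, v) in part if k == key]

parts = [[(("a", "b"), "c")], []]
print(lookup(parts, ("a", "b"), partitioner=lambda k: 0))  # targeted path
print(lookup(parts, ("a", "b")))                           # full-scan path
```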
[jira] [Resolved] (SPARK-1304) Job fails with spot instances (due to IllegalStateException: Shutdown in progress)
[ https://issues.apache.org/jira/browse/SPARK-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1304. --- Resolution: Won't Fix > Job fails with spot instances (due to IllegalStateException: Shutdown in > progress) > -- > > Key: SPARK-1304 > URL: https://issues.apache.org/jira/browse/SPARK-1304 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 0.9.0 >Reporter: Alex Boisvert >Priority: Minor > > We had a job running smoothly with spot instances until one of the spot > instances got terminated ... which led to a series of "IllegalStateException: > Shutdown in progress" and the job failed afterwards. > 14/03/24 06:07:52 WARN scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException > java.lang.IllegalStateException: Shutdown in progress > at > java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66) > at java.lang.Runtime.addShutdownHook(Runtime.java:211) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1441) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) > at > org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:77) > at > org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:156) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:90) > at > 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:89) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:57) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94) > at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471) > at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) > at org.apache.spark.scheduler.Task.run(Task.scala:53) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
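The failure above is the Hadoop FileSystem cache trying to register a JVM shutdown hook while shutdown is already underway. The usual defensive pattern is to tolerate that race rather than fail the task; a Python analogue (here `RuntimeError` stands in for Java's `IllegalStateException`, and the helper is hypothetical):

```python
def add_hook(register, hook):
    """Attempt hook registration; swallow the late-registration race that
    surfaces on the JVM as IllegalStateException("Shutdown in progress")."""
    try:
        register(hook)
        return True
    except RuntimeError:
        return False  # too late to register; proceed without the hook
```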
[jira] [Assigned] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
[ https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-10640: - Assignee: Thomas Graves > Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied > -- > > Key: SPARK-10640 > URL: https://issues.apache.org/jira/browse/SPARK-10640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > I'm seeing an exception from the spark history server trying to read a > history file: > scala.MatchError: TaskCommitDenied (of class java.lang.String) > at > org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) > at > org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
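The `scala.MatchError` above comes from an exhaustive string match with no default case, so a log written by a newer Spark crashes an older history server. A defensive parser keeps a fallback branch instead; sketch in Python (the tag-to-value mapping is invented for illustration, not Spark's actual table):

```python
KNOWN_REASONS = {
    "Success": "success",
    "Resubmitted": "resubmitted",
    "TaskKilled": "killed",
    # "TaskCommitDenied" was the tag missing from the match in SPARK-10640
}

def task_end_reason_from_json(tag):
    """Map a task-end reason tag, falling back instead of raising."""
    return KNOWN_REASONS.get(tag, ("unknown", tag))  # tolerate unknown tags

print(task_end_reason_from_json("TaskCommitDenied"))
```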
[jira] [Resolved] (SPARK-869) Retrofit rest of RDD api to use proper serializer type
[ https://issues.apache.org/jira/browse/SPARK-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-869. -- Resolution: Done Going to resolve this as Done; please open a new JIRA if you find specific examples where we're using the wrong serializer. > Retrofit rest of RDD api to use proper serializer type > -- > > Key: SPARK-869 > URL: https://issues.apache.org/jira/browse/SPARK-869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.8.0 >Reporter: Dmitriy Lyubimov > > SPARK-826 and SPARK-827 resolved proper serialization support for some RDD > method parameters, but not all. > This issue is to address the rest of RDD api and operations. Most of the time > this is due to wrapping RDD parameters into a closure which can only use a > closure serializer to communicate to the backend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"
Thouis Jones created SPARK-10642: Summary: Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer" Key: SPARK-10642 URL: https://issues.apache.org/jira/browse/SPARK-10642 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Environment: OSX Reporter: Thouis Jones Running this command: {code} sc.parallelize([(('a', 'b'), 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b')) {code} gives the following error: {noformat} 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361 Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", line 2199, in lookup return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)]) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", line 916, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", line 36, in deco return f(*a, **kw) File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. 
: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530) at scala.collection.Iterator$class.find(Iterator.scala:780) at scala.collection.AbstractIterator.find(Iterator.scala:1157) at scala.collection.IterableLike$class.find(IterableLike.scala:79) at scala.collection.AbstractIterable.find(Iterable.scala:54) at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839) at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361) at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10504) aggregate where NULL is defined as the value expression aborts when SUM used
[ https://issues.apache.org/jira/browse/SPARK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10504. -- Resolution: Fixed Assignee: Yin Huai Fix Version/s: 1.5.0 > aggregate where NULL is defined as the value expression aborts when SUM used > > > Key: SPARK-10504 > URL: https://issues.apache.org/jira/browse/SPARK-10504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: N Campbell >Assignee: Yin Huai >Priority: Minor > Fix For: 1.5.0 > > > In ISO-SQL the context would determine an implicit type for NULL or one might > find that a vendor requires an explicit type via CAST ( NULL as INTEGER). It > appears that SPARK presumes a long type i.e. select min(NULL), max(NULL) but > a query such the following aborts. > > {{select sum ( null ) from tversion}} > {code} > Operation: execute > Errors: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 5232.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 5232.0 (TID 18531, sandbox.hortonworks.com): scala.MatchError: NullType (of > class org.apache.spark.sql.types.NullType$) > at > org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:403) > at > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:422) > at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:422) > at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51) > at > org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:119) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51) > at > org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:82) > at > org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:581) > at > 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:133) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
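In ISO SQL, `SUM` ignores NULLs and yields NULL when no non-NULL value exists, so `select sum(null)` should return NULL rather than abort. A sketch of that null-skipping semantics in Python (illustration of the expected behaviour, not Spark's fixed code path):

```python
def sql_sum(values):
    """SUM with SQL NULL semantics: skip NULLs (None), return NULL when
    there is no non-NULL input at all."""
    acc = None
    for v in values:
        if v is None:
            continue
        acc = v if acc is None else acc + v
    return acc

print(sql_sum([None, None]))  # None, matching SQL's sum(NULL)
print(sql_sum([1, None, 2]))  # 3
```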
[jira] [Updated] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10602: -- Assignee: Seth Hendrickson > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10641: -- Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-10384) > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Reporter: Jihong MA > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
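The cited algorithm maintains central moments M2..M4 incrementally in a single pass, from which skewness and excess kurtosis fall out at the end. A sketch (population statistics, no Bessel correction; degenerate inputs are rejected for simplicity):

```python
import math

def skew_kurtosis(xs):
    """Single-pass population skewness and excess kurtosis using the
    higher-order incremental moment updates from the linked article."""
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in xs:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # update order matters: m4 uses old m2/m3, m3 uses old m2
        m4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    if n < 2 or m2 == 0.0:
        raise ValueError("need at least two distinct values")
    skew = math.sqrt(n) * m3 / m2 ** 1.5
    kurt = n * m4 / (m2 * m2) - 3.0
    return skew, kurt
```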
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790859#comment-14790859 ] Joseph K. Bradley commented on SPARK-10602: --- Yeah, JIRA only allows 2 levels of subtasks (a long-time annoyance of mine!). I'd recommend linking here using "contains." I'll fix the link for now. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10589) Add defense against external site framing
[ https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10589. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8745 [https://github.com/apache/spark/pull/8745] > Add defense against external site framing > - > > Key: SPARK-10589 > URL: https://issues.apache.org/jira/browse/SPARK-10589 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 1.6.0 > > > This came up as a minor point during a security audit using a common scanning > tool: It's best if Spark UIs try to actively defend against certain types of > frame-related vulnerabilities by setting X-Frame-Options. See > https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet > Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
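The fix adds an `X-Frame-Options` response header so browsers refuse to render the UI inside a third-party frame. A minimal sketch of the header shape (the helper is hypothetical, not the patched Spark code):

```python
def with_frame_defense(headers, value="SAMEORIGIN"):
    """Return response headers extended with an anti-clickjacking header.
    SAMEORIGIN still lets the UI frame itself; DENY blocks all framing."""
    out = dict(headers)
    out["X-Frame-Options"] = value
    return out

print(with_frame_defense({"Content-Type": "text/html"}))
```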
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790843#comment-14790843 ] Cody Koeninger commented on SPARK-10320: I don't think there's much benefit to multiple dstreams with the direct api, because it's straightforward to filter or match on the topic on a per-partition basis. I'm not sure that adding entirely new dstreams after the streaming context has been started makes sense. As far as defaults go... I don't see a clearly reasonable default like messageHandler has. Maybe an example implementation of a function that maintains just a list of topic names and handles the offset lookups. The other thing is, in order to get much use out of this, the api for communicating with the kafka cluster would need to be made public, and there had been some reluctance on that point previously. [~tdas] Any thoughts on making the KafkaCluster api public? > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > to current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) > from where we left off with a minor delay. 
However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
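Cody's point above, that with the direct API a single stream can be split by topic rather than creating one DStream per topic, can be shown with a plain-Python sketch over (topic, message) pairs (the data and helper here are invented for illustration; in Spark this filtering would happen per partition):

```python
# Illustrative sketch: with the direct Kafka API each record's topic is
# known, so one stream can be split by topic with a filter instead of
# creating a separate DStream per topic.
records = [("orders", "o1"), ("clicks", "c1"), ("orders", "o2")]

def by_topic(records, topic):
    # Per-partition this would be an iterator filter; the idea is identical.
    return [msg for (t, msg) in records if t == topic]
```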
[jira] [Updated] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10485: - Affects Version/s: 1.5.0 > IF expression is not correctly resolved when one of the options have NullType > - > > Key: SPARK-10485 > URL: https://issues.apache.org/jira/browse/SPARK-10485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Antonio Jesus Navarro > > If we have this query: > {code} > SELECT IF(column > 1, 1, NULL) FROM T1 > {code} > On Spark 1.4.1 we have this: > {code} > override lazy val resolved = childrenResolved && trueValue.dataType == > falseValue.dataType > {code} > So if one of the types is NullType, the if expression is not resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790816#comment-14790816 ] Jihong MA commented on SPARK-10602: --- I went ahead and created SPARK-10641. Since this JIRA is not listed as an umbrella, I couldn't link to it directly, so I linked to SPARK-10384 instead. @Joseph, can you assign SPARK-10641 to Seth and help fix the link? Thanks! > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790804#comment-14790804 ] Sudarshan Kadambi commented on SPARK-10320: --- Sure, a function as proposed that allows for the topic, partitions and offsets to be specified in a fine-grained manner is needed to provide the full flexibility we desire (starting at an arbitrary offset within each topic partition). If separate DStreams are desired for each topic, do you intend for createDirectStream to be called multiple times (with a different subscription topic each time) both before and after the streaming context is started? Also, what kind of defaults did you have in mind? For example, I might require the ability to specify new topics after the streaming context is started but might not want the burden of being aware of the partitions within the topic or the offsets. I might simply want to default to either the start or the end of each partition that exists for that topic. > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > to current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10641: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
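The linked Wikipedia algorithm computes higher-order central moments in a single pass. A minimal Python sketch of those update formulas (an illustration of the cited algorithm, not the eventual Spark implementation; it assumes at least two distinct values so the variance is nonzero):

```python
def moment_stats(xs):
    """Single-pass (online) mean, variance, skewness and excess kurtosis
    via central moments M2..M4, following the higher-order-statistics
    update formulas from the Wikipedia article cited above."""
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in xs:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update higher moments before lower ones, since each update
        # consumes the previous values of the lower moments.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    variance = m2 / n                       # population variance
    skewness = (n ** 0.5) * m3 / m2 ** 1.5  # g1
    kurtosis = n * m4 / (m2 * m2) - 3.0     # excess kurtosis
    return mean, variance, skewness, kurtosis
```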
[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790792#comment-14790792 ] Sean Owen commented on SPARK-10636: --- It's a Scala syntax issue, as you suspected when you wondered whether you're using if..else the way you think. Stuff happens, but if that crosses your mind, I'd just float a user@ question first or try a simple test in Scala to narrow down the behavior in question. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L)))). > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L)))). 
> map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. 
> map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L)))). > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L)))). > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))). > filter(a => a._2._5 <= 50) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))). > filter(a => a._2._5 <= 50) >}. >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER)
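The root cause in the report above is precedence: in Scala, `else { ... }.filter(...)` parses as `else ({ ... }.filter(...))`, so the filter applies only to the else branch. Python's conditional expression has the same trap, which makes for a compact runnable illustration:

```python
# The trailing "+ [99]" binds to the else operand only, not to the whole
# conditional, mirroring Scala's `else { ... }.filter(...)` pitfall.
cond = True
unparenthesized = [1, 2] if cond else [3, 4] + [99]    # [99] lost when cond is True
parenthesized = ([1, 2] if cond else [3, 4]) + [99]    # applied to either branch
```

The same cure works in the Scala report: either parenthesize the whole if/else before chaining `.filter`, or push the filter into both branches as the reporter did.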
[jira] [Created] (SPARK-10641) skewness and kurtosis support
Jihong MA created SPARK-10641: - Summary: skewness and kurtosis support Key: SPARK-10641 URL: https://issues.apache.org/jira/browse/SPARK-10641 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA Implementing skewness and kurtosis support based on following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
[ https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790777#comment-14790777 ] Thomas Graves commented on SPARK-10640: --- looks like the jsonProtocol.taskEndReasonFromJson is just missing the TaskCommitDenied TaskEndReason. Seems like we should have a better way of handling this in the future too. Right now it doesn't load the history file at all, which is pretty annoying. Perhaps have a catch-all that skips it or prints unknown or something so it at least loads. > Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied > -- > > Key: SPARK-10640 > URL: https://issues.apache.org/jira/browse/SPARK-10640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves > > I'm seeing an exception from the spark history server trying to read a > history file: > scala.MatchError: TaskCommitDenied (of class java.lang.String) > at > org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) > at > org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
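Thomas's suggested mitigation, a catch-all that records an unknown reason instead of aborting the whole history file, can be sketched as follows (illustrative Python, not Spark's JsonProtocol; the reason set and names here are invented for the example):

```python
# Hypothetical subset of task-end reason tags for illustration.
KNOWN_TASK_END_REASONS = {"Success", "Resubmitted", "TaskKilled", "TaskCommitDenied"}

def task_end_reason_from_tag(tag):
    """Parse a task-end reason tag, degrading gracefully on unknown tags.

    A strict match (like the MatchError in the stack trace above) makes one
    unrecognized event fail the entire log replay; a catch-all keeps the
    rest of the history file loadable.
    """
    if tag in KNOWN_TASK_END_REASONS:
        return tag
    return "Unknown(%s)" % tag
```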
[jira] [Created] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
Thomas Graves created SPARK-10640: - Summary: Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied Key: SPARK-10640 URL: https://issues.apache.org/jira/browse/SPARK-10640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Thomas Graves I'm seeing an exception from the spark history server trying to read a history file: scala.MatchError: TaskCommitDenied (of class java.lang.String) at org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) at org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context
[ https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mamdouh Alramadan updated SPARK-10638: -- Description: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: 
Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` {code} And in my spark-defaults.conf I included {code:title=spark-defaults.conf|borderStyle=solid} `spark.streaming.stopGracefullyOnShutdowntrue` {code} was: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. 
My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: `sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } ` The logs are for this process are: ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobG
[jira] [Created] (SPARK-10639) Need to convert UDAF's result from scala to sql type
Yin Huai created SPARK-10639: Summary: Need to convert UDAF's result from scala to sql type Key: SPARK-10639 URL: https://issues.apache.org/jira/browse/SPARK-10639 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Blocker We are missing a conversion at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala#L427. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context
[ https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mamdouh Alramadan updated SPARK-10638: -- Description: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: 
Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` {code} And in my spark-defaults.conf I included {code:title=spark-defaults.conf|borderStyle=solid} spark.streaming.stopGracefullyOnShutdowntrue {code} was: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. 
My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming jo
[jira] [Created] (SPARK-10638) spark streaming stop gracefully keeps the spark context
Mamdouh Alramadan created SPARK-10638: - Summary: spark streaming stop gracefully keeps the spark context Key: SPARK-10638 URL: https://issues.apache.org/jira/browse/SPARK-10638 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Mamdouh Alramadan With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: `sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } ` The logs are for this process are: ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 
INFO JobGenerator: Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` And in my spark-defaults.conf I included `spark.streaming.stopGracefullyOnShutdowntrue` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
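The reporter's shutdown-hook pattern (stop the streaming context on SIGTERM, then let the process exit once its threads finish) can be mirrored in an illustrative Python analog; the context class below is a stand-in for the example, not Spark's API:

```python
import signal

class StandInStreamingContext:
    """Stand-in for a streaming context; only records that stop() ran."""
    def __init__(self):
        self.stopped = False
    def stop(self, stop_spark_context=True, stop_gracefully=True):
        self.stopped = True

ssc = StandInStreamingContext()

def on_sigterm(signum, frame):
    # Mirrors the sys.ShutdownHookThread in the report: request a graceful
    # stop; the process only exits once all non-daemon threads finish,
    # which is exactly where the reported hang appears.
    ssc.stop(stop_spark_context=True, stop_gracefully=True)

signal.signal(signal.SIGTERM, on_sigterm)
```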
[jira] [Updated] (SPARK-9715) Store numFeatures in all ML PredictionModel types
[ https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9715: - Shepherd: Joseph K. Bradley Assignee: Seth Hendrickson > Store numFeatures in all ML PredictionModel types > - > > Key: SPARK-9715 > URL: https://issues.apache.org/jira/browse/SPARK-9715 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > > The PredictionModel abstraction should store numFeatures. Currently, only > RandomForest* types do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
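As a rough illustration of what the proposal would provide (the names below are illustrative, not Spark's actual ML API): a prediction model records the feature count at construction time, so callers can introspect it and validate inputs.

```scala
// Hypothetical sketch of the abstraction: every prediction model exposes
// the number of features it was trained on.
trait PredictionModelLike {
  def numFeatures: Int
  def predict(features: Array[Double]): Double
}

// A linear model knows its feature count from its weight vector.
class LinearModel(weights: Array[Double], intercept: Double) extends PredictionModelLike {
  override val numFeatures: Int = weights.length

  override def predict(features: Array[Double]): Double = {
    require(features.length == numFeatures, s"expected $numFeatures features, got ${features.length}")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }
}
```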
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790713#comment-14790713 ] Seth Hendrickson commented on SPARK-10602: -- Right now I have working versions of single pass algos for skewness and kurtosis. I'm still writing tests and need to work on performance profiling, but I just wanted to make sure we don't duplicate any work. I'll post back with more details soon. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
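For context, a single-pass approach to skewness and kurtosis typically maintains running central moments and derives the statistics at the end. The sketch below uses the standard one-pass moment-update formulas; it illustrates the technique only and is not Spark's implementation (a real UDAF would also need a cross-partition merge step, omitted here).

```scala
// Single-pass (streaming) skewness and kurtosis via running central moments.
class MomentAggregator {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0 // sum of squared deviations
  private var m3 = 0.0 // sum of cubed deviations
  private var m4 = 0.0 // sum of fourth-power deviations

  def update(x: Double): Unit = {
    val n1 = n.toDouble
    n += 1
    val nd = n.toDouble
    val delta = x - mean
    val deltaN = delta / nd
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    // Update higher moments before m2, since they use the old m2/m3.
    m4 += term1 * deltaN2 * (nd * nd - 3 * nd + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (nd - 2) - 3 * deltaN * m2
    m2 += term1
  }

  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n.toDouble * m4 / (m2 * m2) - 3.0 // excess kurtosis
}
```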
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790700#comment-14790700 ] Xiangrui Meng commented on SPARK-6810: -- The ML part (GLM) is about the same: all computation is on the Scala side. I would wait until 1.6 to benchmark GLM, because we are going to implement the same algorithm as R in 1.6. > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker closed SPARK-10636. -- > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). 
>persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. 
> map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))). > filter(a => a._2._5 <= 50) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))). > filter(a => a._2._5 <= 50) >}. >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 [] > // | MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56]
[jira] [Commented] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790681#comment-14790681 ] Glenn Strycker commented on SPARK-10636: I didn't "forget", I believed that "RDD = if {} else {} . something" would automatically take care of the associative property, and that anything after the final else {} would apply to both blocks. I didn't realize that braces behave similarly to parentheses and that I needed extras -- makes sense. I have now added these to my code. This wasn't a question for "user@ first", since I really did believe there was a bug. Jira is the place for submitting bug reports, even when the resolution is user error. > RDD filter does not work after if..then..else RDD blocks > > > Key: SPARK-10636 > URL: https://issues.apache.org/jira/browse/SPARK-10636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I have an RDD declaration of the following form: > {code} > val myRDD = if (some condition) { tempRDD1.some operations } else { > tempRDD2.some operations}.filter(a => a._2._5 <= 50) > {code} > When I output the contents of myRDD, I found entries that clearly had a._2._5 > > 50... the filter didn't work! > If I move the filter inside of the if..then blocks, it suddenly does work: > {code} > val myRDD = if (some condition) { tempRDD1.some operations.filter(a => > a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) } > {code} > I ran toDebugString after both of these code examples, and "filter" does > appear in the DAG for the second example, but DOES NOT appear in the first > DAG. Why not? > Am I misusing the if..then..else syntax for Spark/Scala? > Here is my actual code... ignore the crazy naming conventions and what it's > doing... > {code} > // this does NOT work > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). 
> map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))) >} else { > tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))) >}. >filter(a => a._2._5 <= 50). >partitionBy(partitioner). >setName("myRDD"). >persist(StorageLevel.MEMORY_AND_DISK_SER) > myRDD.checkpoint() > println(myRDD.toDebugString) > // (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 [] > // | MapPartitionsRDD[57] at map at myProgram.scala:2119 [] > // | MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 [] > // | MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 [] > // | CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 [] > // +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 [] > // | | MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 [] > // | | MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 [] > // | | CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 [] > // | +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 [] > // | | | clusterGraphWithComponentsRDD MapPartitionsRDD[28] at > reduceByKey at myProgram.scala:1689 [] > // | | | CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | | CheckpointRDD[29] at count at myProgram.scala:1701 [] > // | +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | | CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 > B; DiskSize: 0.0 B > // | | CheckpointRDD[17] at count at myProgram.scala:394 [] > // +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at > myProgram.scala:383 [] > // | CachedPartitions: 4; MemorySize: 824.0 B; 
TachyonSize: 0.0 B; > DiskSize: 0.0 B > // | CheckpointRDD[17] at count at myProgram.scala:394 [] > // this DOES work! > val myRDD = if (tempRDD2.count() > 0) { >tempRDD1. > map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))). > leftOuterJoin(tempRDD2). > map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, > a._2._2.getOrElse(1L. > leftOuterJoin(tempRDD2). > map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, > a._2._2.getOrElse(1L. > map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), > (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if > (a._1._1 <= a._1._2) { a._2._4
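The parsing behavior described above can be reproduced without Spark: in Scala, a trailing method call after `if ... else { block }` binds to the last block, not to the whole conditional, so in the first form the filter only ever runs on the else branch.

```scala
// Minimal reproduction with plain Lists (no Spark).
val xs = List(1, 2, 3, 60)
val ys = List(4, 5)

// Without parentheses, .filter attaches to the else block only.
// Here xs.nonEmpty is true, so the filter never runs and 60 survives.
val unparenthesized = if (xs.nonEmpty) { xs } else { ys }.filter(_ <= 50)

// Parenthesizing the whole conditional applies the filter to either branch.
val parenthesized = (if (xs.nonEmpty) { xs } else { ys }).filter(_ <= 50)
```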
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790654#comment-14790654 ] Shivaram Venkataraman commented on SPARK-6810: -- So the DataFrame API doesn't need many performance benchmarks, as we mostly wrap our calls to Java / Scala. However, we are adding new ML API components, and [~mengxr] will be able to provide more guidance for this. cc [~sunrui] > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. I tried to save a dataframe with instances of the Vector class from mlib and got the error. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1,Vectors.dense(1,1,1)), (2,Vectors.dense(2,2,2.toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfu
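The reproduction snippet in the description above appears to have lost its closing parentheses to JIRA formatting. A plausible reconstruction is shown below; it is a sketch only, since it needs a live SparkContext/SQLContext and a concrete save path:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SaveMode

// Assumes an existing SparkContext `sc` and SQLContext `sqlContext`.
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  (1, Vectors.dense(1, 1, 1)),
  (2, Vectors.dense(2, 2, 2)))).toDF()
df.write.format("json").mode(SaveMode.Overwrite).save(path) // `path` as in the report
```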
[jira] [Updated] (SPARK-10637) DataFrames: saving with nested User Data Types
[ https://issues.apache.org/jira/browse/SPARK-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10637: - Description: Cannot save data frames using nested UserDefinedType I wrote a simple example to show the error. It causes the following error java.lang.IllegalArgumentException: Nested type should be repeated: required group array { required int32 num; } {code:java} import org.apache.spark.sql.SaveMode import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.GenericMutableRow import org.apache.spark.sql.types._ @SQLUserDefinedType(udt = classOf[AUDT]) case class A(list:Seq[B]) class AUDT extends UserDefinedType[A] { override def sqlType: DataType = StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true))) override def userClass: Class[A] = classOf[A] override def serialize(obj: Any): Any = obj match { case A(list) => val row = new GenericMutableRow(1) row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) row } override def deserialize(datum: Any): A = { datum match { case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) } } } object AUDT extends AUDT @SQLUserDefinedType(udt = classOf[BUDT]) case class B(num:Int) class BUDT extends UserDefinedType[B] { override def sqlType: DataType = StructType(Seq(StructField("num", IntegerType, nullable = false))) override def userClass: Class[B] = classOf[B] override def serialize(obj: Any): Any = obj match { case B(num) => val row = new GenericMutableRow(1) row.setInt(0, num) row } override def deserialize(datum: Any): B = { datum match { case row: InternalRow => new B(row.getInt(0)) } } } object BUDT extends BUDT object TestNested { def main(args:Array[String]) = { val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4 val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark")) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(1 to 2 zip col).toDF() df.show() df.write.mode(SaveMode.Overwrite).save(...) } } {code} The error log is shown below: {noformat} 15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external IP address! 15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter 15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB) 15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1 15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73 15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) with 1 output partitions 15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at TestNested.scala:73) 15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List() 15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has no missing parents 15/09/16 16:44:39 INFO MemoryStore: 
ensureFreeSpace(59832) called with curMem=0, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 58.4 KB, free 973.6 MB) 15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with curMem=59832, maxMem=1020914565 15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.3 KB, free 973.5 MB) 15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50986 (size: 20.3 KB, free: 973.6 MB) 15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861 15/09/16 16:44:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.sc
[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9296: --- Assignee: Apache Spark > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790663#comment-14790663 ] Apache Spark commented on SPARK-9296: - User 'JihongMA' has created a pull request for this issue: https://github.com/apache/spark/pull/8778 > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9296: --- Assignee: (was: Apache Spark) > variance, var_pop, and var_samp aggregate functions > --- > > Key: SPARK-9296 > URL: https://issues.apache.org/jira/browse/SPARK-9296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-10634. - > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
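As the "Not A Problem" resolution suggests, this is a quoting issue rather than a parser bug: a double quote inside a double-quoted SQL literal terminates the literal. Delimiting the literal with single quotes avoids the conflict. The sketch below reuses the column and table names from the reported query; note that a value containing single quotes would still need escaping (doubling them, in standard SQL).

```scala
val value = "this is a \"test\""

// Single-quoted SQL literal: embedded double quotes need no escaping.
val query = s"SELECT clistc215647292 FROM TABLE_1 WHERE clistc215647292 = '$value'"
// query == SELECT clistc215647292 FROM TABLE_1 WHERE clistc215647292 = 'this is a "test"'
```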
[jira] [Resolved] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10634. --- Resolution: Not A Problem > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 
31 more
[jira] [Reopened] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-10634: ---
[jira] [Resolved] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10636. --- Resolution: Not A Problem In the first case, your {{.filter}} statement plainly applies only to the else block. That's the behavior you see. Did you forget to wrap the if-else in parens? I'd ask this as a question on user@ first, since I think you are posing this as a question. JIRA is not the right place to start with those.
[jira] [Closed] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prachi Burathoki closed SPARK-10634. Resolution: Done
[jira] [Created] (SPARK-10636) RDD filter does not work after if..then..else RDD blocks
Glenn Strycker created SPARK-10636: -- Summary: RDD filter does not work after if..then..else RDD blocks Key: SPARK-10636 URL: https://issues.apache.org/jira/browse/SPARK-10636 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Glenn Strycker

I have an RDD declaration of the following form:
{code}
val myRDD = if (some condition) { tempRDD1.some operations } else { tempRDD2.some operations}.filter(a => a._2._5 <= 50)
{code}
When I output the contents of myRDD, I found entries that clearly had a._2._5 > 50... the filter didn't work!
If I move the filter inside of the if..then blocks, it suddenly does work:
{code}
val myRDD = if (some condition) { tempRDD1.some operations.filter(a => a._2._5 <= 50) } else { tempRDD2.some operations.filter(a => a._2._5 <= 50) }
{code}
I ran toDebugString after both of these code examples, and "filter" does appear in the DAG for the second example, but DOES NOT appear in the first DAG. Why not?
Am I misusing the if..then..else syntax for Spark/Scala?
Here is my actual code... ignore the crazy naming conventions and what it's doing...
{code}
// this does NOT work
val myRDD = if (tempRDD2.count() > 0) {
  tempRDD1.
    map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
    leftOuterJoin(tempRDD2).
    map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, a._2._2.getOrElse(1L.
    leftOuterJoin(tempRDD2).
    map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, a._2._2.getOrElse(1L.
    map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4)))
} else {
  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L)))
}.
  filter(a => a._2._5 <= 50).
  partitionBy(partitioner).
  setName("myRDD").
  persist(StorageLevel.MEMORY_AND_DISK_SER)

myRDD.checkpoint()
println(myRDD.toDebugString)

// (4) MapPartitionsRDD[58] at map at myProgram.scala:2120 []
//  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
//  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
//  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
//  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
//  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
//  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  MapPartitionsRDD[51] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  CoGroupedRDD[50] at leftOuterJoin at myProgram.scala:2116 []
//  |  +-(4) MapPartitionsRDD[49] at map at myProgram.scala:2115 []
//  |  |  |  clusterGraphWithComponentsRDD MapPartitionsRDD[28] at reduceByKey at myProgram.scala:1689 []
//  |  |  |      CachedPartitions: 4; MemorySize: 1176.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//  |  |  |  CheckpointRDD[29] at count at myProgram.scala:1701 []
//  |  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at myProgram.scala:383 []
//  |  |      CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//  |  |  CheckpointRDD[17] at count at myProgram.scala:394 []
//  +-(4) clusterStatsRDD MapPartitionsRDD[16] at distinct at myProgram.scala:383 []
//  |      CachedPartitions: 4; MemorySize: 824.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//  |  CheckpointRDD[17] at count at myProgram.scala:394 []

// this DOES work!
val myRDD = if (tempRDD2.count() > 0) {
  tempRDD1.
    map(a => (a._1._1, (a._1._2, a._2._1, a._2._2))).
    leftOuterJoin(tempRDD2).
    map(a => (a._2._1._1, (a._1, a._2._1._2, a._2._1._3, a._2._2.getOrElse(1L.
    leftOuterJoin(tempRDD2).
    map(a => ((a._2._1._1, a._1), (a._2._1._2, a._2._1._3, a._2._1._4, a._2._2.getOrElse(1L.
    map(a =>((List(a._1._1, a._1._2).min, List(a._1._1, a._1._2).max), (a._2._1, a._2._2, if (a._1._1 <= a._1._2) { a._2._3 } else { a._2._4 }, if (a._1._1 <= a._1._2) { a._2._4 } else { a._2._3 }, a._2._3 + a._2._4))).
    filter(a => a._2._5 <= 50)
} else {
  tempRDD1.map(a => (a._1, (a._2._1, a._2._2, 1L, 1L, 2L))).
    filter(a => a._2._5 <= 50)
}.
  partitionBy(partitioner).
  setName("myRDD").
  persist(StorageLevel.MEMORY_AND_DISK_SER)

myRDD.checkpoint()
println(myRDD.toDebugString)

// (4) MapPartitionsRDD[59] at filter at myProgram.scala:2121 []
//  |  MapPartitionsRDD[58] at map at myProgram.scala:2120 []
//  |  MapPartitionsRDD[57] at map at myProgram.scala:2119 []
//  |  MapPartitionsRDD[56] at leftOuterJoin at myProgram.scala:2118 []
//  |  MapPartitionsRDD[55] at leftOuterJoin at myProgram.scala:2118 []
//  |  CoGroupedRDD[54] at leftOuterJoin at myProgram.scala:2118 []
//  +-(4) MapPartitionsRDD[53] at map at myProgram.scala:2117 []
//  |  |  MapPartitionsRDD[52] at leftOuterJoin at myProgram.scala:2116 []
//  |  |  Ma
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790634#comment-14790634 ] Prachi Burathoki commented on SPARK-10634: -- I tried escaping " with both \ and "", but got the same error. I've found a workaround, though: in the where clause I changed it to listc = 'this is a"test"' and that works. Thanks
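The workaround above can be sketched quickly. This uses Python's built-in sqlite3 as a stand-in SQL engine (an assumption: Spark SQL 1.3's parser is not exercised here, but single-quote string-literal rules are common across SQL dialects), and also shows parameter binding, which sidesteps manual quoting altogether:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (listc TEXT)")

# The target value contains double quotes: this is a"test"
# Inside a single-quoted SQL literal, the embedded " needs no escaping.
conn.execute("INSERT INTO t VALUES ('this is a\"test\"')")
rows = conn.execute(
    "SELECT listc FROM t WHERE listc = 'this is a\"test\"'").fetchall()
print(rows)  # [('this is a"test"',)]

# Parameter binding avoids hand-quoting entirely -- the safer habit
# whenever the string comes from user input.
rows2 = conn.execute(
    "SELECT listc FROM t WHERE listc = ?", ('this is a"test"',)).fetchall()
assert rows2 == rows
```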
[jira] [Created] (SPARK-10637) DataFrames: saving with nested User Data Types
Joao created SPARK-10637: Summary: DataFrames: saving with nested User Data Types Key: SPARK-10637 URL: https://issues.apache.org/jira/browse/SPARK-10637 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Joao

Cannot save data frames using nested UserDefinedType. I wrote a simple example to show the error. It causes the following error:
java.lang.IllegalArgumentException: Nested type should be repeated: required group array { required int32 num; }
{code:java}
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list: Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType =
    StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
    case A(list) =>
      val row = new GenericMutableRow(1)
      row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
      row
  }
  override def deserialize(datum: Any): A = {
    datum match {
      case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
    }
  }
}
object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(num: Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType =
    StructType(Seq(StructField("num", IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
    case B(num) =>
      val row = new GenericMutableRow(1)
      row.setInt(0, num)
      row
  }
  override def deserialize(datum: Any): B = {
    datum match {
      case row: InternalRow => new B(row.getInt(0))
    }
  }
}
object BUDT extends BUDT

object TestNested {
  def main(args: Array[String]) = {
    val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 2 zip col).toDF()
    df.show()
    df.write.mode(SaveMode.Overwrite).save(...)
  }
}
{code}
The error log is shown below:
{noformat}
15/09/16 16:44:36 WARN : Your hostname, X resolves to a loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we couldn't find any external IP address!
15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB)
15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1
15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73
15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) with 1 output partitions
15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at TestNested.scala:73)
15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List()
15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which has no missing parents
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(59832) called with curMem=0, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 58.4 KB, free 973.6 MB)
15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(20794) called with curMem=59832, maxMem=1020914565
15/09/16 16:44:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.3 KB, free 973.5 MB)
15/09/16 16:44:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50986 (size: 20.3 KB, free: 973.6 MB)
15/09/16 16:44:39 INFO SparkContext: Created broadcast 1 from broadcast at
{noformat}
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769057#comment-14769057 ] Sean Owen commented on SPARK-10634: --- Don't you need to use "" to quote double quotes? I actually am not sure what Spark supports but that's not uncommon in SQL. I don't think you're escaping the quotes here. > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
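Standard SQL escapes a quote character inside a string literal by doubling it, which appears to be the convention Sean is suggesting; whether Spark 1.3's parser honours it for double-quoted literals is exactly what is unclear in this thread. A minimal sketch of the doubling convention, using Python's sqlite3 with single-quoted literals (the helper `sql_quote` is hypothetical, not a Spark or sqlite API):

```python
# Sketch of the SQL quote-doubling convention, shown with sqlite3's
# standard single-quoted string literals. Whether Spark's 1.3 parser
# accepts the same convention for " is the open question in this thread.
import sqlite3

def sql_quote(value):
    """Hypothetical helper: build a single-quoted SQL string literal,
    doubling any embedded quote characters."""
    return "'" + value.replace("'", "''") + "'"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")
conn.execute("INSERT INTO t VALUES (" + sql_quote("this is a 'test'") + ")")
row = conn.execute(
    "SELECT c FROM t WHERE c = " + sql_quote("this is a 'test'")
).fetchone()
print(row[0])  # the embedded quotes round-trip intact
```

In practice, parameter binding (`?` placeholders) is usually a safer fix than hand-escaping literals.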
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments). > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments. > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Duffield updated SPARK-10635: - Description: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments). There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} was: At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments). There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. 
Before 1.3.1 we were able to monkeypatch around this: {code} def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket {code} > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments). > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10635) pyspark - running on a different host
Ben Duffield created SPARK-10635: Summary: pyspark - running on a different host Key: SPARK-10635 URL: https://issues.apache.org/jira/browse/SPARK-10635 Project: Spark Issue Type: Improvement Reporter: Ben Duffield At various points we assume we only ever talk to a driver on the same host. e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments. There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS. Before 1.3.1 we were able to monkeypatch around this: def _load_from_socket(port, serializer): sock = socket.socket() sock.settimeout(3) try: sock.connect((host, port)) rf = sock.makefile("rb", 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() pyspark.rdd._load_from_socket = _load_from_socket -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
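A hedged sketch of what a host-aware loader could look like: this is not the real pyspark function (whose signature is `_load_from_socket(port, serializer)` and which assumes the local host), just an illustration with the driver host made an explicit argument and a toy line-based deserializer standing in for pyspark's framed serializers:

```python
# Sketch only: the 1.3-era monkeypatch with the driver host made a
# parameter instead of being assumed local. `deserialize` is a toy
# newline-delimited stand-in for serializer.load_stream.
import socket

def _load_from_socket(host, port, deserialize=None):
    if deserialize is None:
        deserialize = lambda f: (line.rstrip(b"\n") for line in f)
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))  # host is now explicit, not assumed local
        rf = sock.makefile("rb", 65536)
        for item in deserialize(rf):
            yield item
    finally:
        sock.close()
```

`pyspark.rdd._load_from_socket` could then be rebound to a `functools.partial` pinning the remote driver's host, mirroring the monkeypatch described above.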
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769011#comment-14769011 ] Prachi Burathoki commented on SPARK-10634: -- I tried by escaping with \, but still same error Caused by: java.lang.RuntimeException: [1.130] failure: ``)'' expected but identifier test found SELECT clistc1426336010, corc2125646118, candc2031403851, SYSIBM_ROW_NUMBER FROM TABLE_1 WHERE ((clistc1426336010 = "this is a \"test\"")) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) at com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) at com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) at com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRef > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. 
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql
[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
[ https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768938#comment-14768938 ] Sean Owen commented on SPARK-10634: --- Shouldn't the " be escaped in some way? or else I'm not sure how the parser would know the end of the literal from a quote inside the literal. > The spark sql fails if the where clause contains a string with " in it. > --- > > Key: SPARK-10634 > URL: https://issues.apache.org/jira/browse/SPARK-10634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Prachi Burathoki > > When running a sql query in which the where clause contains a string with " > in it, the sql parser throws an error. > Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but > identifier test found > SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER > FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) > > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at > org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) > at > com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) > at > com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) > at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) > at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) > at > com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) > at > 
com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) > ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.
Prachi Burathoki created SPARK-10634: Summary: The spark sql fails if the where clause contains a string with " in it. Key: SPARK-10634 URL: https://issues.apache.org/jira/browse/SPARK-10634 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Prachi Burathoki When running a sql query in which the where clause contains a string with " in it, the sql parser throws an error. Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but identifier test found SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test"")) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933) at com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106) at com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93) at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153) at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840) at com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752) at com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011) ... 31 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10614) SystemClock uses non-monotonic time in its wait logic
[ https://issues.apache.org/jira/browse/SPARK-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768900#comment-14768900 ] Steve Loughran commented on SPARK-10614: Having done a little more detailed research on the current state of this clock, I'm now having doubts about this. On x86, it's generally assumed that {{System.nanoTime()}} uses the {{TSC}} counter to get the timestamp, which is fast and only goes forwards (albeit at a rate which depends on CPU power states). But it turns out that on manycore CPUs, because that could lead to different answers on different cores, the OS may use alternative mechanisms to return a counter, which may be neither monotonic nor fast. # [Inside the Hotspot VM: Clocks, Timers and Scheduling Events - Part I - Windows|https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks] # [JDK-6440250 : On Windows System.nanoTime() may be 25x slower than System.currentTimeMillis()|http://bugs.java.com/view_bug.do?bug_id=6440250] # [JDK-6458294 : nanoTime affected by system clock change on Linux (RH9) or in general lacks monotonicity|http://bugs.java.com/view_bug.do?bug_id=6458294] # [Redhat on timestamps in Linux|https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html] # [Timekeeping in VMware Virtual Machines|http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf] These docs imply that nanoTime may be fast but unreliable on multiple-socket systems (the latest manycore parts share one counter), and may downgrade to something slower than calls to {{System.currentTimeMillis()}}, or even to something that isn't guaranteed to be monotonic. It's not clear that on deployments of physical many-core systems moving to nanoTime actually offers much. I don't know about EC2 or other cloud infrastructures though. 
Maybe it's just best to WONTFIX this, so as not to raise unrealistic expectations about nanoTime working. > SystemClock uses non-monotonic time in its wait logic > - > > Key: SPARK-10614 > URL: https://issues.apache.org/jira/browse/SPARK-10614 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Steve Loughran >Priority: Minor > > The consolidated (SPARK-4682) clock uses {{System.currentTimeMillis()}} for > measuring time, which means its {{waitTillTime()}} routine is brittle against > systems (VMs in particular) whose time can go backwards as well as forward. > For the {{ExecutorAllocationManager}} this appears to be a regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
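Whatever counter backs {{System.nanoTime()}}, the wait-logic concern itself is easy to illustrate. A sketch in Python rather than Spark's Scala SystemClock (so an analogy, not the Spark code): deadlines computed against a monotonic clock cannot be moved by NTP steps or VM clock corrections, which is the brittleness SPARK-10614 describes for currentTimeMillis-based waits.

```python
# Analogy in Python, not the Spark implementation: a deadline computed
# against a monotonic clock cannot be pushed backwards or forwards by
# NTP steps or VM clock corrections the way a wall-clock deadline can.
import time

def wait_till(deadline_monotonic, poll=0.01):
    """Sleep until a monotonic deadline; robust to wall-clock jumps."""
    while True:
        remaining = deadline_monotonic - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(min(poll, remaining))

start = time.monotonic()
wait_till(start + 0.05)
elapsed = time.monotonic() - start
print(f"waited {elapsed:.3f}s")  # at least the requested 0.05s
```

The same loop written against `time.time()` could spin forever (clock stepped back) or return early (clock stepped forward), which is the regression described in the issue.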
[jira] [Commented] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768853#comment-14768853 ] Yashwanth Kumar commented on SPARK-6810: Hi Shivaram Venkataraman, I would like to try this. > Performance benchmarks for SparkR > - > > Key: SPARK-6810 > URL: https://issues.apache.org/jira/browse/SPARK-6810 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > We should port some performance benchmarks from spark-perf to SparkR for > tracking performance regressions / improvements. > https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list > of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics
[ https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768850#comment-14768850 ] Yashwanth Kumar commented on SPARK-9492: Can I have this task? > LogisticRegression in R should provide model statistics > --- > > Key: SPARK-9492 > URL: https://issues.apache.org/jira/browse/SPARK-9492 > Project: Spark > Issue Type: Sub-task > Components: ML, R >Reporter: Eric Liang > > Like ml LinearRegression, LogisticRegression should provide a training > summary including feature names and their coefficients. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10267) Add @Since annotation to ml.util
[ https://issues.apache.org/jira/browse/SPARK-10267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10267. --- Resolution: Not A Problem Assignee: Ehsan Mohyedin Kermani > Add @Since annotation to ml.util > > > Key: SPARK-10267 > URL: https://issues.apache.org/jira/browse/SPARK-10267 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Ehsan Mohyedin Kermani >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768840#comment-14768840 ] Xiangrui Meng commented on SPARK-10262: --- [~tijoparacka] Are you still working on this? > Add @Since annotation to ml.attribute > - > > Key: SPARK-10262 > URL: https://issues.apache.org/jira/browse/SPARK-10262 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10631: -- Assignee: Vinod KC > Add missing API doc in pyspark.mllib.linalg.Vector > -- > > Key: SPARK-10631 > URL: https://issues.apache.org/jira/browse/SPARK-10631 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Vinod KC >Priority: Minor > > There are some missing API docs in pyspark.mllib.linalg.Vector (including > DenseVector and SparseVector). We should add them based on their Scala > counterparts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10276. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8677 [https://github.com/apache/spark/pull/8677] > Add @since annotation to pyspark.mllib.recommendation > - > > Key: SPARK-10276 > URL: https://issues.apache.org/jira/browse/SPARK-10276 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exists already.
[ https://issues.apache.org/jira/browse/SPARK-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10633: -- Priority: Major (was: Blocker) [~lunendl] Please have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- for example you should not set Blocker, as this can't really be considered one at this point. > Persisting Spark stream to MySQL - Spark tries to create the table for every > stream even if it exists already. > - > > Key: SPARK-10633 > URL: https://issues.apache.org/jira/browse/SPARK-10633 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 1.4.0, 1.5.0 > Environment: Ubuntu 14.04 > IntelliJ IDEA 14.1.4 > sbt > mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) >Reporter: Lunen > > Persisting Spark Kafka stream to MySQL > Spark 1.4 + tries to create a table automatically every time the stream gets > sent to a specified table. > Please note, Spark 1.3.1 works. > Code sample: > val url = "jdbc:mysql://host:port/db?user=user&password=password" > val crp = RowSetProvider.newFactory() > val crsSql: CachedRowSet = crp.createCachedRowSet() > val crsTrg: CachedRowSet = crp.createCachedRowSet() > crsSql.beforeFirst() > crsTrg.beforeFirst() > //Read Stream from Kafka > //Produce SQL INSERT STRING > > streamT.foreachRDD { rdd => > if (rdd.toLocalIterator.nonEmpty) { > sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") > while (crsSql.next) { > sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", > new Properties) > println("Persisted Data: " + "SQL INSERT STRING") > } > crsSql.beforeFirst() > } > stmt.close() > conn.close() > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exists already.
Lunen created SPARK-10633: - Summary: Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exists already. Key: SPARK-10633 URL: https://issues.apache.org/jira/browse/SPARK-10633 Project: Spark Issue Type: Bug Components: SQL, Streaming Affects Versions: 1.5.0, 1.4.0 Environment: Ubuntu 14.04 IntelliJ IDEA 14.1.4 sbt mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) Reporter: Lunen Priority: Blocker Persisting Spark Kafka stream to MySQL Spark 1.4+ tries to create a table automatically every time the stream gets sent to a specified table. Please note, Spark 1.3.1 works. Code sample: val url = "jdbc:mysql://host:port/db?user=user&password=password" val crp = RowSetProvider.newFactory() val crsSql: CachedRowSet = crp.createCachedRowSet() val crsTrg: CachedRowSet = crp.createCachedRowSet() crsSql.beforeFirst() crsTrg.beforeFirst() //Read Stream from Kafka //Produce SQL INSERT STRING streamT.foreachRDD { rdd => if (rdd.toLocalIterator.nonEmpty) { sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") while (crsSql.next) { sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", new Properties) println("Persisted Data: " + "SQL INSERT STRING") } crsSql.beforeFirst() } stmt.close() conn.close() }
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. I tried to save a DataFrame with instances of the Vector class from MLlib and got the error. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfu
[jira] [Resolved] (SPARK-10511) Source releases should not include maven jars
[ https://issues.apache.org/jira/browse/SPARK-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10511. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Resolved by https://github.com/apache/spark/pull/8774 > Source releases should not include maven jars > - > > Key: SPARK-10511 > URL: https://issues.apache.org/jira/browse/SPARK-10511 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Patrick Wendell >Assignee: Luciano Resende >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > I noticed our source jars seemed really big for 1.5.0. At least one > contributing factor is that, likely due to some change in the release script, > the maven jars are being bundled in with the source code in our build > directory. This runs afoul of the ASF policy on binaries in source releases - > we should fix it in 1.5.1. > The issue (I think) is that we might invoke maven to compute the version > between when we checkout Spark from github and when we package the source > file. I think it could be fixed by simply clearing out the build/ directory > after that statement runs.
[jira] [Updated] (SPARK-10632) Cannot save DataFrame with User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joao updated SPARK-10632: - Description: Cannot save DataFrames that contain user-defined types. At first I thought it was a problem with my UDT class, then tried the Vector class from MLlib and the error was the same. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources
[jira] [Created] (SPARK-10632) Cannot save DataFrame with User Defined Types
Joao created SPARK-10632: Summary: Cannot save DataFrame with User Defined Types Key: SPARK-10632 URL: https://issues.apache.org/jira/browse/SPARK-10632 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Joao Cannot save DataFrames that contain user-defined types. At first I thought it was a problem with my UDT class, then tried the Vector class from MLlib and the error was the same. The code below should reproduce the error. {noformat} val df = sc.parallelize(Seq((1, Vectors.dense(1, 1, 1)), (2, Vectors.dense(2, 2, 2)))).toDF() df.write.format("json").mode(SaveMode.Overwrite).save(path) {noformat} The error log is below {noformat} 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task. scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185) at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' closed. Now beginning upload 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6' upload complete 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt attempt_201509160958__m_00_0 aborted. 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179) at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$o
[jira] [Updated] (SPARK-10515) When killing executor, the pending replacement executors will be lost
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-10515: -- Summary: When killing executor, the pending replacement executors will be lost (was: When kill executor, there is no need to seed RequestExecutors to AM) > When killing executor, the pending replacement executors will be lost > - > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > >
[jira] [Updated] (SPARK-10508) incorrect evaluation of searched case expression
[ https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10508: -- Assignee: Josh Rosen > incorrect evaluation of searched case expression > > > Key: SPARK-10508 > URL: https://issues.apache.org/jira/browse/SPARK-10508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Assignee: Josh Rosen > Fix For: 1.4.1, 1.5.0 > > > The following case expression never evaluates to 'test1' when cdec is -1 or > 10, as it does in Hive 0.13. Instead it returns 'other' for all rows. > {code} > select rnum, cdec, case when cdec in ( -1,10,0.1 ) then 'test1' else 'other' > end from tdec > create table if not exists TDEC ( RNUM int , CDEC decimal(7, 2 )) > TERMINATED BY '\n' > STORED AS orc ; > 0|\N > 1|-1.00 > 2|0.00 > 3|1.00 > 4|0.10 > 5|10.00 > {code}
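One plausible workaround, not verified against the eventual fix for SPARK-10508, is to avoid the mixed-type IN list by explicitly casting every literal to the column's decimal type, so the comparison no longer depends on implicit type coercion:

```scala
// Hypothetical rewrite of the reporter's query: cast the IN-list literals
// to DECIMAL(7,2) (the declared type of cdec) before comparing.
val workaroundSql =
  """select rnum, cdec,
    |  case when cdec in (cast(-1 as decimal(7,2)),
    |                     cast(10 as decimal(7,2)),
    |                     cast(0.1 as decimal(7,2)))
    |       then 'test1' else 'other' end
    |from tdec""".stripMargin
// In a Spark shell: sqlContext.sql(workaroundSql).show()
```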
[jira] [Updated] (SPARK-10475) improve column pruning for Project on Sort
[ https://issues.apache.org/jira/browse/SPARK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10475: -- Assignee: Wenchen Fan > improve column pruning for Project on Sort > --- > > Key: SPARK-10475 > URL: https://issues.apache.org/jira/browse/SPARK-10475 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 1.6.0 > >
[jira] [Updated] (SPARK-9032) scala.MatchError in DataFrameReader.json(String path)
[ https://issues.apache.org/jira/browse/SPARK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9032: - Assignee: Josh Rosen > scala.MatchError in DataFrameReader.json(String path) > - > > Key: SPARK-9032 > URL: https://issues.apache.org/jira/browse/SPARK-9032 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 15.04 >Reporter: Philipp Poetter >Assignee: Josh Rosen > Fix For: 1.4.1 > > > Executing read().json() of SQLContext e.g. DataFrameReader raises a > MatchError with a stacktrace as follows while trying to read JSON data: > {code} > 15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, > took 6.981330 s > Exception in thread "main" scala.MatchError: StringType (of class > org.apache.spark.sql.types.StringType$) > at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139) > at > org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:138) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.json.JSONRelation.schema$lzycompute(JSONRelation.scala:137) > at org.apache.spark.sql.json.JSONRelation.schema(JSONRelation.scala:137) > at > org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:213) > at com.hp.sparkdemo.Example.main(Example.java:23) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 15/07/14 11:25:26 INFO SparkContext: Invoking stop() from shutdown hook > 15/07/14 11:25:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040 > 15/07/14 11:25:26 INFO DAGScheduler: Stopping DAGScheduler > 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Shutting down all > executors > 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Asking each executor to > shut down > 15/07/14 11:25:26 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! 
> {code} > Offending code snippet (around line 23): > {code} >JavaSparkContext sctx = new JavaSparkContext(sparkConf); > SQLContext ctx = new SQLContext(sctx); > DataFrame frame = ctx.read().json(facebookJSON); > frame.printSchema(); > {code} > The exception is reproducible using the following JSON: > {code} > { >"data": [ > { > "id": "X999_Y999", > "from": { > "name": "Tom Brady", "id": "X12" > }, > "message": "Looking forward to 2010!", > "actions": [ > { >"name": "Comment", >"link": "http://www.facebook.com/X999/posts/Y999" > }, > { >"name": "Like", >"link": "http://www.facebook.com/X999/posts/Y999" > } > ], > "type": "status", > "created_time": "2010-08-02T21:27:44+", > "updated_time": "2010-08-02T21:27:44+" > }, > { > "id": "X998_Y998", > "from": { > "name": "Peyton Manning", "id": "X18" > }, > "message": "Where's my contract?", > "actions": [ > { >"name": "Comment", >"link": "http://www.facebook.com/X998/posts/Y998" > }, > { >"name": "Like", >"link": "http://www.facebook.com/X998/posts/Y998" > } > ], > "type": "status", > "created_time": "2010-08-02T21:27:44+", > "updated_time": "2010-08-02T21:27:44+" > } >] > } > {code}
[jira] [Updated] (SPARK-9343) DROP TABLE ignores IF EXISTS clause
[ https://issues.apache.org/jira/browse/SPARK-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9343: - Assignee: Michael Armbrust > DROP TABLE ignores IF EXISTS clause > --- > > Key: SPARK-9343 > URL: https://issues.apache.org/jira/browse/SPARK-9343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov >Assignee: Michael Armbrust > Labels: sql > Fix For: 1.5.0 > > > If a table is missing, {{DROP TABLE IF EXISTS _tableName_}} generates an > exception: > {code} > 15/07/25 15:17:32 ERROR Hive: NoSuchObjectException(message:default.test > table not found) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy27.get_table(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy28.getTable(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) > at > 
org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3846) > > {code} > A standalone test with full spark-shell output is available at: > [https://gist.github.com/ssimeonov/eeb388d13f802689d772]
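Until the underlying fix, one way to avoid the logged NoSuchObjectException is to guard the drop with an explicit existence check instead of relying on IF EXISTS. This is a hedged sketch under the assumption that the table is visible via SQLContext.tableNames(); the table name is a placeholder, and the Spark calls are shown as comments since they need a live context:

```scala
// Hedged workaround sketch: only issue the DROP when the table is actually
// present, so the metastore lookup never throws. "test" is a placeholder.
val tableName = "test"
val dropSql = s"DROP TABLE $tableName"
// In a spark-shell with a HiveContext:
//   if (sqlContext.tableNames().contains(tableName)) {
//     sqlContext.sql(dropSql)
//   }
```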
[jira] [Updated] (SPARK-9033) scala.MatchError: interface java.util.Map (of class java.lang.Class) with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9033: - Assignee: Josh Rosen > scala.MatchError: interface java.util.Map (of class java.lang.Class) with > Spark SQL > --- > > Key: SPARK-9033 > URL: https://issues.apache.org/jira/browse/SPARK-9033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.2, 1.3.1 >Reporter: Pavel >Assignee: Josh Rosen > Fix For: 1.4.0 > > > I've a java.util.Map field in a POJO class and I'm trying to > use it to createDataFrame (1.3.1) / applySchema(1.2.2) with the SQLContext > and getting following error in both 1.2.2 & 1.3.1 versions of the Spark SQL: > *sample code: > {code} > SQLContext sqlCtx = new SQLContext(sc.sc()); > JavaRDD rdd = sc.textFile("/path").map(line-> Event.fromString(line)); > //text line is splitted and assigned to respective field of the event class > here > DataFrame schemaRDD = sqlCtx.createDataFrame(rdd, Event.class); <-- error > thrown here > schemaRDD.registerTempTable("events"); > {code} > Event class is a Serializable containing a field of type > java.util.Map. This issue occurs also with Spark streaming > when used with SQL. > {code} > JavaDStream receiverStream = jssc.receiverStream(new > StreamingReceiver()); > JavaDStream windowDStream = receiverStream.window(WINDOW_LENGTH, > SLIDE_INTERVAL); > jssc.checkpoint("event-streaming"); > windowDStream.foreachRDD(evRDD -> { >if(evRDD.count() == 0) return null; > DataFrame schemaRDD = sqlCtx.createDataFrame(evRDD, Event.class); > schemaRDD.registerTempTable("events"); > ... 
> } > {code} > *error: > {code} > scala.MatchError: interface java.util.Map (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > {code} > **also this occurs for fields of custom POJO classes: > {code} > scala.MatchError: class com.test.MyClass (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 
~[scala-library-2.10.5.jar:na] > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > ~[scala-library-2.10.5.jar:na] > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > ~[scala-library-2.10.5.jar:na] > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) > ~[spark-sql_2.10-1.3.1.jar:1.3.1] > {code} > **also occurs for Calendar type: > {code} > scala.MatchError: class java.util.Calendar (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSc
[jira] [Updated] (SPARK-10437) Support aggregation expressions in Order By
[ https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10437: -- Assignee: Liang-Chi Hsieh > Support aggregation expressions in Order By > --- > > Key: SPARK-10437 > URL: https://issues.apache.org/jira/browse/SPARK-10437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Harish Butani >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > Followup on SPARK-6583 > The following still fails. > {code} > val df = sqlContext.read.json("examples/src/main/resources/people.json") > df.registerTempTable("t") > val df2 = sqlContext.sql("select age, count(*) from t group by age order by > count(*)") > df2.show() > {code} > {code:title=StackTrace} > Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No > function to evaluate expression. type: Count, tree: COUNT(1) > at > org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41) > at > org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219) > {code} > In 1.4 the issue seemed to be BindReferences.bindReference didn't handle this > case. > Haven't looked at 1.5 code, but don't see a change to bindReference in this > patch.
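A common rewrite that sidesteps ordering by a raw aggregate expression (offered here as an assumption, not something confirmed by the ticket) is to alias the aggregate in the SELECT list and sort on the alias, so ORDER BY references a resolved output column instead of re-evaluating COUNT(1) during ordering:

```scala
// Hypothetical workaround sketch for the failing query above: alias
// count(*) as cnt and order by the alias rather than the expression.
val rewrittenSql =
  "select age, count(*) as cnt from t group by age order by cnt"
// In a Spark shell: sqlContext.sql(rewrittenSql).show()
```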