[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560180#comment-14560180 ] Konstantin Shaposhnikov commented on SPARK-7042: It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though the rest of the akka libraries are available for 2.3.4. I wonder if the akka version can just be updated to the latest 2.3.11? Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9, I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like the akka-actor_2.11 2.3.4-spark used by Spark has been built with Scala compiler 2.11.0, which ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo - update the version of akka used by Spark (master and the 1.3 branch) I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
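The mismatch in the log above comes down to two builds of one class disagreeing on serialVersionUID. A minimal Scala sketch of the mechanism, using an invented Handshake class in place of akka.actor.Identify: with a compiler that honors the annotation, the printed UID is 1, while a compiler affected by SI-8549 derives a UID from the class structure, so peers built by different compilers cannot deserialize each other's messages.
{noformat}
@SerialVersionUID(1L)
case class Handshake(version: Int)

object SerialUidDemo {
  def main(args: Array[String]): Unit = {
    // ObjectStreamClass reports the UID that Java serialization will actually use.
    // Under scalac 2.11.0 (SI-8549) the annotation was ignored, so this printed a
    // structure-derived value instead of 1 -- the mismatch seen in the log above.
    val uid = java.io.ObjectStreamClass.lookup(classOf[Handshake]).getSerialVersionUID
    println(s"serialVersionUID = $uid")
  }
}
{noformat}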
[jira] [Updated] (SPARK-7104) Support model save/load in Python's Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7104: - Target Version/s: 1.5.0 (was: 1.4.0) Support model save/load in Python's Word2Vec Key: SPARK-7104 URL: https://issues.apache.org/jira/browse/SPARK-7104 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7605: - Target Version/s: 1.5.0 Assignee: Manoj Kumar Python API for ElementwiseProduct - Key: SPARK-7605 URL: https://issues.apache.org/jira/browse/SPARK-7605 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Manoj Kumar Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6263) Python MLlib API missing items: Utils
[ https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6263: - Target Version/s: 1.5.0 (was: 1.4.0) Python MLlib API missing items: Utils - Key: SPARK-6263 URL: https://issues.apache.org/jira/browse/SPARK-6263 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between the Python and Scala documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. MLUtils * appendBias * kFold * loadVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
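For reference, the Scala side of the missing items, sketched under the assumption of the 1.x signatures MLUtils.appendBias(Vector) and MLUtils.kFold(rdd, numFolds, seed); these are the calls a PySpark port would need to mirror.
{noformat}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object MLUtilsGapDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mlutils-gap").setMaster("local[*]"))
    // appendBias: appends a trailing 1.0 entry, turning [1, 2] into [1, 2, 1].
    val withBias = MLUtils.appendBias(Vectors.dense(1.0, 2.0))
    println(withBias)
    // kFold: splits an RDD into (training, validation) pairs for cross-validation.
    val data = sc.parallelize(1 to 100)
    val folds = MLUtils.kFold(data, 3, 42) // rdd, numFolds, seed
    folds.foreach { case (train, validation) =>
      println(s"train=${train.count()} validation=${validation.count()}")
    }
    sc.stop()
  }
}
{noformat}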
[jira] [Commented] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560212#comment-14560212 ] Joseph K. Bradley commented on SPARK-7605: -- Updated for 1.5. I don't think we'll be able to merge more features into 1.4. But we can see about merging this soon after 1.4 QA is done. Python API for ElementwiseProduct - Key: SPARK-7605 URL: https://issues.apache.org/jira/browse/SPARK-7605 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Manoj Kumar Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6263) Python MLlib API missing items: Utils
[ https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560214#comment-14560214 ] Joseph K. Bradley commented on SPARK-6263: -- Updated for 1.5. I don't think we'll be able to merge more features into 1.4. But we can see about merging this soon after 1.4 QA is done. Python MLlib API missing items: Utils - Key: SPARK-6263 URL: https://issues.apache.org/jira/browse/SPARK-6263 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between the Python and Scala documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. MLUtils * appendBias * kFold * loadVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs
[ https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6099: - Target Version/s: 1.5.0 (was: 1.4.0) Stabilize mllib ClassificationModel, RegressionModel APIs - Key: SPARK-6099 URL: https://issues.apache.org/jira/browse/SPARK-6099 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The abstractions spark.mllib.classification.ClassificationModel and spark.mllib.regression.RegressionModel have been Experimental for a while. This is a problem since some of the implementing classes are not Experimental (e.g., LogisticRegressionModel). We should finalize the API and make it non-Experimental ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs
[ https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6099: - Target Version/s: 1.4.0 (was: 1.5.0) Stabilize mllib ClassificationModel, RegressionModel APIs - Key: SPARK-6099 URL: https://issues.apache.org/jira/browse/SPARK-6099 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The abstractions spark.mllib.classification.ClassificationModel and spark.mllib.regression.RegressionModel have been Experimental for a while. This is a problem since some of the implementing classes are not Experimental (e.g., LogisticRegressionModel). We should finalize the API and make it non-Experimental ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.
[ https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7883. -- Resolution: Fixed Fix Version/s: 1.4.0 1.0.3 1.6.0 1.2.3 1.1.2 1.3.2 Issue resolved by pull request 6422 [https://github.com/apache/spark/pull/6422] Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation. --- Key: SPARK-7883 URL: https://issues.apache.org/jira/browse/SPARK-7883 Project: Spark Issue Type: Bug Components: Documentation, MLlib Affects Versions: 1.3.1, 1.4.0 Reporter: Mike Dusenberry Priority: Trivial Fix For: 1.3.2, 1.1.2, 1.2.3, 1.6.0, 1.0.3, 1.4.0 The trainImplicit Scala example near the end of the MLlib Collaborative Filtering documentation refers to an ALS.trainImplicit function signature that does not exist. Rather than add an extra function, let's just fix the example. Currently, the example refers to a function that would have the following signature: def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: Double) : MatrixFactorizationModel Instead, let's change the example to refer to this function, which does exist (notice the addition of the lambda parameter): def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double) : MatrixFactorizationModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
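As a quick check of the corrected example, a runnable sketch against the five-argument trainImplicit signature quoted above (the toy Rating data is invented for illustration):
{noformat}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}

object ImplicitAlsDocExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("implicit-als").setMaster("local[*]"))
    // Toy implicit-feedback data: Rating(user, product, confidence weight).
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 5.0), Rating(1, 20, 1.0),
      Rating(2, 10, 2.0), Rating(2, 30, 4.0)))
    // Positional arguments match the existing signature:
    // ratings, rank, iterations, lambda, alpha.
    val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)
    println(model.predict(1, 30))
    sc.stop()
  }
}
{noformat}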
[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.
[ https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7883: - Fix Version/s: (was: 1.6.0) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation. --- Key: SPARK-7883 URL: https://issues.apache.org/jira/browse/SPARK-7883 Project: Spark Issue Type: Bug Components: Documentation, MLlib Affects Versions: 1.3.1, 1.4.0 Reporter: Mike Dusenberry Priority: Trivial Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0 The trainImplicit Scala example near the end of the MLlib Collaborative Filtering documentation refers to an ALS.trainImplicit function signature that does not exist. Rather than add an extra function, let's just fix the example. Currently, the example refers to a function that would have the following signature: def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: Double) : MatrixFactorizationModel Instead, let's change the example to refer to this function, which does exist (notice the addition of the lambda parameter): def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double) : MatrixFactorizationModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7865) Hadoop Filesystem for eventlog closed before sparkContext stopped
[ https://issues.apache.org/jira/browse/SPARK-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560266#comment-14560266 ] Zhang, Liye commented on SPARK-7865: Thanks for [~vanzin]'s reply; I was just wondering why the Hadoop filesystem is closed before the Spark JVM stops. I think the reason can be found in the description of PR [#5560|https://github.com/apache/spark/pull/5560]. [~srowen], sorry for opening one more duplicate JIRA, I'll take care of it next time. Hadoop Filesystem for eventlog closed before sparkContext stopped - Key: SPARK-7865 URL: https://issues.apache.org/jira/browse/SPARK-7865 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Zhang, Liye After [SPARK-3090|https://issues.apache.org/jira/browse/SPARK-3090] (patch [#5696|https://github.com/apache/spark/pull/5696]), SparkContext will be stopped automatically if the user forgets to do so. But when the shutdown hook is called, the event log gives the following exception while flushing content: {noformat} 15/05/26 17:40:38 INFO spark.SparkContext: Invoking stop() from shutdown hook 15/05/26 17:40:38 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323) at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3877) at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97) ... 
16 more {noformat} And exception for stopping: {noformat} 15/05/26 17:40:39 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down 15/05/26 17:40:39 ERROR util.Utils: Uncaught exception in thread Spark Shutdown Hook java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:788) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:209) at org.apache.spark.SparkContext$$anonfun$stop$5.apply(SparkContext.scala:1515) at org.apache.spark.SparkContext$$anonfun$stop$5.apply(SparkContext.scala:1515) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1515) at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:527) at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2211) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2181) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2181) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2181) at
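The trace shows a shared, cached Hadoop FileSystem being closed by one shutdown hook while the event-log listener is still flushing through it. As an aside, one commonly cited mitigation for "Filesystem closed" errors (an assumption here, not the fix adopted in the PR discussion above) is to disable the HDFS client cache so the listener does not share a client with whatever closes first; a sketch using Spark's spark.hadoop.* passthrough:
{noformat}
import org.apache.spark.{SparkConf, SparkContext}

object EventLogShutdownSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("eventlog-shutdown-demo")
      .set("spark.eventLog.enabled", "true")
      // spark.hadoop.* keys are forwarded into the Hadoop Configuration;
      // fs.hdfs.impl.disable.cache gives each caller its own DFS client, so
      // another component closing "its" FileSystem no longer closes ours.
      .set("spark.hadoop.fs.hdfs.impl.disable.cache", "true")
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).count()
    sc.stop() // stopping explicitly also avoids racing the shutdown hook
  }
}
{noformat}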
[jira] [Commented] (SPARK-7555) User guide update for ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560171#comment-14560171 ] Joseph K. Bradley commented on SPARK-7555: -- Thanks! User guide update for ElasticNet Key: SPARK-7555 URL: https://issues.apache.org/jira/browse/SPARK-7555 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: DB Tsai Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7576) User guide update for spark.ml ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560170#comment-14560170 ] Joseph K. Bradley commented on SPARK-7576: -- Thanks! User guide update for spark.ml ElementwiseProduct - Key: SPARK-7576 URL: https://issues.apache.org/jira/browse/SPARK-7576 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Octavian Geagla Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7856: - Issue Type: Improvement (was: Bug) Scalable PCA implementation for tall and fat matrices - Key: SPARK-7856 URL: https://issues.apache.org/jira/browse/SPARK-7856 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tarek Elgamal Currently the PCA implementation has a limitation of fitting d^2 covariance/Gramian matrix entries in memory (d is the number of columns/dimensions of the matrix). We often need only the largest k principal components. To make PCA really scalable, I suggest an implementation where the memory usage is proportional to the number of principal components k rather than the full dimensionality d. I suggest adopting the solution described in this paper published in SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). The paper offers an implementation of Probabilistic PCA (PPCA) which has lower memory and time complexity and could potentially scale to tall and fat matrices, rather than only the tall and skinny matrices supported by the current PCA implementation. Probabilistic PCA could potentially be added to the set of algorithms supported by MLlib; it does not necessarily replace the old PCA implementation. A PPCA implementation is included in Matlab's Statistics and Machine Learning Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
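For context, the generative model behind PPCA (Tipping and Bishop) that the cited paper builds on; its EM updates only materialize d-by-k and k-by-k intermediates, which is where the memory savings over the d^2 Gramian come from.
{noformat}
% Probabilistic PCA: k latent dimensions, d observed dimensions.
% W is the d x k loading matrix, mu the mean, sigma^2 the noise variance.
\begin{aligned}
z &\sim \mathcal{N}(0,\, I_k) \\
x \mid z &\sim \mathcal{N}(Wz + \mu,\ \sigma^2 I_d)
\end{aligned}
% EM for (W, sigma^2) needs only d x k and k x k intermediates,
% i.e. O(dk) memory, versus the O(d^2) Gramian of the current PCA.
{noformat}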
[jira] [Issue Comment Deleted] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-7042: --- Comment: was deleted (was: It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though the rest of the akka libraries are available for 2.3.4. I wonder if the akka version can just be updated to the latest 2.3.11?) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9, I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like the akka-actor_2.11 2.3.4-spark used by Spark has been built with Scala compiler 2.11.0, which ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo - update the version of akka used by Spark (master and the 1.3 branch) I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560209#comment-14560209 ] Joseph K. Bradley commented on SPARK-7604: -- Updated for 1.5. I don't think we'll be able to merge more features into 1.4. But we can see about merging this soon after 1.4 QA is done. Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7884) Allow Spark shuffle APIs to be more customizable
[ https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7884: --- Assignee: Apache Spark Allow Spark shuffle APIs to be more customizable Key: SPARK-7884 URL: https://issues.apache.org/jira/browse/SPARK-7884 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matt Massie Assignee: Apache Spark The current Spark shuffle has some hard-coded assumptions about how shuffle managers will read and write data. The FileShuffleBlockResolver.forMapTask method creates disk writers by calling BlockManager.getDiskWriter. This forces all shuffle managers to store data using DiskBlockObjectWriter, which reads and writes data in a record-oriented format (preventing column-oriented record writing). The BlockStoreShuffleFetcher.fetch method relies on ShuffleBlockFetcherIterator, which assumes shuffle data is written using BlockManager.getDiskWriter and doesn't allow for customization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7884) Allow Spark shuffle APIs to be more customizable
[ https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7884: --- Assignee: (was: Apache Spark) Allow Spark shuffle APIs to be more customizable Key: SPARK-7884 URL: https://issues.apache.org/jira/browse/SPARK-7884 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matt Massie The current Spark shuffle has some hard-coded assumptions about how shuffle managers will read and write data. The FileShuffleBlockResolver.forMapTask method creates disk writers by calling BlockManager.getDiskWriter. This forces all shuffle managers to store data using DiskBlockObjectWriter, which reads and writes data in a record-oriented format (preventing column-oriented record writing). The BlockStoreShuffleFetcher.fetch method relies on ShuffleBlockFetcherIterator, which assumes shuffle data is written using BlockManager.getDiskWriter and doesn't allow for customization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7884) Allow Spark shuffle APIs to be more customizable
Matt Massie created SPARK-7884: -- Summary: Allow Spark shuffle APIs to be more customizable Key: SPARK-7884 URL: https://issues.apache.org/jira/browse/SPARK-7884 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matt Massie The current Spark shuffle has some hard-coded assumptions about how shuffle managers will read and write data. The FileShuffleBlockResolver.forMapTask method creates disk writers by calling BlockManager.getDiskWriter. This forces all shuffle managers to store data using DiskBlockObjectWriter, which reads and writes data in a record-oriented format (preventing column-oriented record writing). The BlockStoreShuffleFetcher.fetch method relies on ShuffleBlockFetcherIterator, which assumes shuffle data is written using BlockManager.getDiskWriter and doesn't allow for customization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7884) Allow Spark shuffle APIs to be more customizable
[ https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560241#comment-14560241 ] Apache Spark commented on SPARK-7884: - User 'massie' has created a pull request for this issue: https://github.com/apache/spark/pull/6423 Allow Spark shuffle APIs to be more customizable Key: SPARK-7884 URL: https://issues.apache.org/jira/browse/SPARK-7884 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matt Massie The current Spark shuffle has some hard-coded assumptions about how shuffle managers will read and write data. The FileShuffleBlockResolver.forMapTask method creates disk writers by calling BlockManager.getDiskWriter. This forces all shuffle managers to store data using DiskBlockObjectWriter, which reads and writes data in a record-oriented format (preventing column-oriented record writing). The BlockStoreShuffleFetcher.fetch method relies on ShuffleBlockFetcherIterator, which assumes shuffle data is written using BlockManager.getDiskWriter and doesn't allow for customization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
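A purely hypothetical Scala sketch of the customization seam the ticket asks for (ShuffleRecordWriter and ColumnarShuffleWriter are invented names, not Spark APIs): a shuffle manager supplies its own writer instead of being handed a record-oriented DiskBlockObjectWriter.
{noformat}
// Hypothetical interface: the seam a shuffle manager would implement itself.
trait ShuffleRecordWriter[K, V] {
  def write(key: K, value: V): Unit
  def commitAndClose(): Unit
}

// A column-oriented implementation could buffer keys and values separately
// and flush them in a columnar layout -- impossible today because the disk
// writer handed to shuffle managers is strictly record-oriented.
class ColumnarShuffleWriter[K, V] extends ShuffleRecordWriter[K, V] {
  private val keys = scala.collection.mutable.ArrayBuffer.empty[K]
  private val values = scala.collection.mutable.ArrayBuffer.empty[V]
  def write(key: K, value: V): Unit = { keys += key; values += value }
  def commitAndClose(): Unit = {
    // encode `keys` and `values` as separate column chunks here
  }
}
{noformat}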
[jira] [Commented] (SPARK-3674) Add support for launching YARN clusters in spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560282#comment-14560282 ] Apache Spark commented on SPARK-3674: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/6424 Add support for launching YARN clusters in spark-ec2 Key: SPARK-3674 URL: https://issues.apache.org/jira/browse/SPARK-3674 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Fix For: 1.4.0 Right now spark-ec2 only supports launching Spark Standalone clusters. While this is sufficient for basic usage, it is hard to test features or do performance benchmarking on YARN. It would be good to add support for installing and configuring an Apache YARN cluster at a fixed version -- say the latest stable version, 2.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7824) Collapsing operator reordering and constant folding into a single batch to push down the single side.
[ https://issues.apache.org/jira/browse/SPARK-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7824: -- Summary: Collapsing operator reordering and constant folding into a single batch to push down the single side. (was: Extracting and/or condition optimizer from BooleanSimplification optimizer and put it before PushPredicateThroughJoin optimizer to push down the single side.) Collapsing operator reordering and constant folding into a single batch to push down the single side. - Key: SPARK-7824 URL: https://issues.apache.org/jira/browse/SPARK-7824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Zhongshuai Pei SQL: {noformat} select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e) {noformat} Plan before modify {noformat} == Optimized Logical Plan == Project [a#293,b#294,c#295,d#296,e#297] Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297)))) MetastoreRelation default, tablea, None MetastoreRelation default, tableb, None {noformat} Plan after modify {noformat} == Optimized Logical Plan == Project [a#293,b#294,c#295,d#296,e#297] Join Inner, Some(((b#294 = d#296) || (b#294 = e#297))) Filter (a#293 > 3) MetastoreRelation default, tablea, None MetastoreRelation default, tableb, None {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
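The rewrite in the plans above is just the distributive law applied to the join condition, factoring the shared single-side conjunct out so it can be pushed below the join:
{noformat}
% Factoring the shared conjunct out of a disjunction:
(p \land q) \lor (p \land r) \;\equiv\; p \land (q \lor r)
% With p = (a > 3), q = (b = d), r = (b = e), the condition
% (a > 3 AND b = d) OR (a > 3 AND b = e) becomes
% (a > 3) AND (b = d OR b = e), and the single-sided predicate
% a > 3 can be pushed down as a Filter on tableA before the join.
{noformat}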
[jira] [Updated] (SPARK-7852) Use continuation when GLMs are run with multiple regParams
[ https://issues.apache.org/jira/browse/SPARK-7852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7852: - Description: Per the discussion from https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have support for specifying the initial weights, we can use this to speed up our training. keywords: continuation, warm start, homotopy (related) was:Per the discussion from https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have support for specifying the initial weights, we can use this to speed up our training. Use continuation when GLMs are run with multiple regParams -- Key: SPARK-7852 URL: https://issues.apache.org/jira/browse/SPARK-7852 Project: Spark Issue Type: Bug Components: ML Reporter: holdenk Priority: Minor Per the discussion from https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have support for specifying the initial weights, we can use this to speed up our training. keywords: continuation, warm start, homotopy (related) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7880) Silent failure if assembly jar is corrupted
[ https://issues.apache.org/jira/browse/SPARK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560167#comment-14560167 ] Andrew Or commented on SPARK-7880: -- I was testing out RC2 for 1.4 and somehow ended up with a corrupted one. I thought it was a Java 6 / Java 7 incompatibility issue, but it turns out something's just wrong with the way I downloaded it (?). Either way we should not hide the error message. Silent failure if assembly jar is corrupted --- Key: SPARK-7880 URL: https://issues.apache.org/jira/browse/SPARK-7880 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Reporter: Andrew Or If you try to run `bin/spark-submit` with a corrupted jar, you get no output and your application does not run. We should have an informative message that indicates the failure to open the jar instead of silently swallowing it. This is caused by this line: https://github.com/apache/spark/blob/61664732b25b35f94be35a42cde651cbfd0e02b7/bin/spark-class#L75 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7852) Use continuation when GLMs are run with multiple regParams
[ https://issues.apache.org/jira/browse/SPARK-7852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7852: - Summary: Use continuation when GLMs are run with multiple regParams (was: Add support for re-using weights when training with multiple lambdas) Use continuation when GLMs are run with multiple regParams -- Key: SPARK-7852 URL: https://issues.apache.org/jira/browse/SPARK-7852 Project: Spark Issue Type: Bug Components: ML Reporter: holdenk Priority: Minor Per the discussion from https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have support for specifying the initial weights, we can use this to speed up our training. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6915) VectorIndexer improvements
[ https://issues.apache.org/jira/browse/SPARK-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6915: - Target Version/s: (was: 1.4.0) VectorIndexer improvements -- Key: SPARK-6915 URL: https://issues.apache.org/jira/browse/SPARK-6915 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Priority: Minor This covers several improvements to VectorIndexer. They could be handled separately or in 1 PR. *Preserving metadata* Currently, it preserves non-ML metadata. This is different from StringIndexer. We should change it so it does not maintain non-ML metadata. Currently, it does not preserve ML-specific input metadata in the output column. If a feature is already marked as categorical or continuous, we should preserve that metadata (rather than recomputing it). We should also check that the input data is valid for that metadata. *Allow unknown categories* Add option for allowing unknown categories, probably via a parameter like allowUnknownCategories. If true, then handle unknown categories during transform by assigning them to an extra category index. *Index particular features* Add option for limiting indexing to particular features. This could be specified by an option, or we could handle it via the Preserve metadata task above, where users would denote features as continuous in order to have VectorIndexer ignore them. *Performance optimizations* See the TODO items within VectorIndexer.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
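For orientation, current usage of the transformer that these improvements would build on; a minimal sketch assuming a DataFrame df with a vector column named "features" (the allowUnknownCategories parameter proposed above would be a new param layered on top of this):
{noformat}
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.sql.DataFrame

// `df` is an assumed DataFrame with a vector column named "features".
def indexFeatures(df: DataFrame): DataFrame = {
  val indexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(10) // <= 10 distinct values => treated as categorical
  indexer.fit(df).transform(df)
}
{noformat}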
[jira] [Updated] (SPARK-6634) Allow replacing columns in Transformers
[ https://issues.apache.org/jira/browse/SPARK-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6634: - Target Version/s: 1.5.0 (was: 1.4.0) Allow replacing columns in Transformers --- Key: SPARK-6634 URL: https://issues.apache.org/jira/browse/SPARK-6634 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Currently, Transformers do not allow input and output columns to share the same name. (In fact, this is not allowed but also not even checked.) Short-term proposal: Disallow input and output columns with the same name, and add a check in transformSchema. Long-term proposal: Allow input output columns with the same name, and where the behavior is that the output columns replace input columns with the same name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
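A small sketch of what the short-term proposal's check could look like; this is an invented helper, not existing Spark code:
{noformat}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of the short-term proposal: fail fast in
// transformSchema when the output column collides with an existing column.
def checkOutputColumn(schema: StructType, inputCol: String, outputCol: String): Unit = {
  require(outputCol != inputCol,
    s"Output column $outputCol must differ from input column $inputCol")
  require(!schema.fieldNames.contains(outputCol),
    s"Output column $outputCol already exists in the schema")
}
{noformat}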
[jira] [Resolved] (SPARK-6295) spark.ml.Evaluator should have evaluate method not taking ParamMap
[ https://issues.apache.org/jira/browse/SPARK-6295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6295. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Xiangrui Meng (was: Joseph K. Bradley) I'm pretty sure [~mengxr] fixed this in some PR, but I'm having trouble finding which one. But it's fixed for 1.4.0. spark.ml.Evaluator should have evaluate method not taking ParamMap -- Key: SPARK-6295 URL: https://issues.apache.org/jira/browse/SPARK-6295 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Priority: Minor Fix For: 1.4.0 spark.ml.Evaluator requires that the user pass a ParamMap, but it is not always necessary. It should have a default implementation with no ParamMap (similar to fit() and transform() in Estimator and Transformer). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
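The shape of the requested API, sketched rather than quoted from the merged change: a no-ParamMap overload that delegates with empty params, mirroring fit() and transform() on Estimator and Transformer.
{noformat}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// Sketch of the API shape only, not the merged Spark code.
abstract class EvaluatorSketch {
  def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
  // Default overload so callers with no extra params stay concise.
  def evaluate(dataset: DataFrame): Double = evaluate(dataset, ParamMap.empty)
}
{noformat}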
[jira] [Updated] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7604: - Target Version/s: 1.5.0 Assignee: Yanbo Liang Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7461) Remove spark.ml Model, and have all Transformers have parent
[ https://issues.apache.org/jira/browse/SPARK-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7461: - Target Version/s: 1.5.0 (was: 1.4.0) Remove spark.ml Model, and have all Transformers have parent Key: SPARK-7461 URL: https://issues.apache.org/jira/browse/SPARK-7461 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley A recent PR [https://github.com/apache/spark/pull/5980] brought up an issue with the Model abstraction: There are transformers which could be Transformers (created by a user) or Models (created by an Estimator). This is the first instance, but there will be more such transformers in the future. Some possible fixes are: * Create 2 separate classes, 1 extending Transformer and 1 extending Model. These would be essentially the same, and they could share code (or have 1 wrap the other). This would bloat the API. * Just use Model, with a possibly null parent class. There is precedent (meta-algorithms like RandomForest producing weak hypothesis Models with no parent). * Change Transformer to have a parent which may be null. ** *-- Unless there is strong disagreement, I think we should go with this last option.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6261) Python MLlib API missing items: Feature
[ https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6261: - Target Version/s: (was: 1.4.0) Python MLlib API missing items: Feature --- Key: SPARK-6261 URL: https://issues.apache.org/jira/browse/SPARK-6261 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. StandardScalerModel * All functionality except predict() is missing. IDFModel * idf Word2Vec * setMinCount Word2VecModel * getVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7577) User guide update for Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-7577: - Comment: was deleted (was: Oh, yes. Sorry, I forgot it. Thanks for the reminder.) User guide update for Bucketizer Key: SPARK-7577 URL: https://issues.apache.org/jira/browse/SPARK-7577 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Xusen Yin Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560166#comment-14560166 ] Konstantin Shaposhnikov commented on SPARK-7042: That is not true: http://search.maven.org/#browse%7C-1552622333 (http://search.maven.org/#artifactdetails%7Ccom.typesafe.akka%7Cakka-actor_2.11%7C2.3.4%7Cjar) What exactly was broken in the Scala 2.11 build? Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9, I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like the akka-actor_2.11 2.3.4-spark used by Spark has been built with Scala compiler 2.11.0, which ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo - update the version of akka used by Spark (master and the 1.3 branch) I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7461) Remove spark.ml Model, and have all Transformers have parent
[ https://issues.apache.org/jira/browse/SPARK-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560205#comment-14560205 ] Joseph K. Bradley commented on SPARK-7461: -- Speaking with [~mengxr], we're going to delay this decision. It may no longer be a good idea, since there is discussion of ML models including more model-specific functionality, such as transient references to the training data and results [SPARK-7674]. Remove spark.ml Model, and have all Transformers have parent Key: SPARK-7461 URL: https://issues.apache.org/jira/browse/SPARK-7461 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley A recent PR [https://github.com/apache/spark/pull/5980] brought up an issue with the Model abstraction: There are transformers which could be Transformers (created by a user) or Models (created by an Estimator). This is the first instance, but there will be more such transformers in the future. Some possible fixes are: * Create 2 separate classes, 1 extending Transformer and 1 extending Model. These would be essentially the same, and they could share code (or have 1 wrap the other). This would bloat the API. * Just use Model, with a possibly null parent class. There is precedent (meta-algorithms like RandomForest producing weak hypothesis Models with no parent). * Change Transformer to have a parent which may be null. ** *-- Unless there is strong disagreement, I think we should go with this last option.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560227#comment-14560227 ] Sandy Ryza commented on SPARK-7699: --- I think tying this to the AM-RM heartbeat would just make things more confusing, especially now that the heartbeat interval is variable. Whether we've had a fair chance to allocate resources also depends on internal YARN configurations, like the NM-RM heartbeat interval or whether continuous scheduling is enabled. I don't think there's any easy notion of fair chance that doesn't rely on a timeout. Another option would be to avoid adjusting targetNumExecutors down before the first job is submitted. Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
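The report reduces to three settings; a minimal reproduction sketch (the enabled and shuffle-service flags are assumptions needed to turn dynamic allocation on, not part of the original report):
{noformat}
import org.apache.spark.{SparkConf, SparkContext}

object DynAllocRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dyn-alloc-repro")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.initialExecutors", "3")
      .set("spark.dynamicAllocation.maxExecutors", "4")
    val sc = new SparkContext(conf)
    // With no job submitted yet, the reported behavior is that the target
    // executor count already decays from the initial 3 toward the minimum of 2.
    Thread.sleep(60000)
    sc.stop()
  }
}
{noformat}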
[jira] [Assigned] (SPARK-7867) Support revoke role ...
[ https://issues.apache.org/jira/browse/SPARK-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7867: --- Assignee: (was: Apache Spark) Support revoke role ... - Key: SPARK-7867 URL: https://issues.apache.org/jira/browse/SPARK-7867 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Zhongshuai Pei Priority: Minor sql like {noformat} revoke role role_a from user user1; {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7867) Support revoke role ...
[ https://issues.apache.org/jira/browse/SPARK-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7867: --- Assignee: Apache Spark Support revoke role ... - Key: SPARK-7867 URL: https://issues.apache.org/jira/browse/SPARK-7867 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Zhongshuai Pei Assignee: Apache Spark Priority: Minor sql like {noformat} revoke role role_a from user user1; {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.
[ https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7883: - Assignee: Mike Dusenberry Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation. --- Key: SPARK-7883 URL: https://issues.apache.org/jira/browse/SPARK-7883 Project: Spark Issue Type: Bug Components: Documentation, MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0 Reporter: Mike Dusenberry Assignee: Mike Dusenberry Priority: Trivial Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0 The trainImplicit Scala example near the end of the MLlib Collaborative Filtering documentation refers to an ALS.trainImplicit function signature that does not exist. Rather than add an extra function, let's just fix the example. Currently, the example refers to a function that would have the following signature: def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: Double) : MatrixFactorizationModel Instead, let's change the example to refer to this function, which does exist (notice the addition of the lambda parameter): def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double) : MatrixFactorizationModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.
[ https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7883: - Target Version/s: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0 Affects Version/s: 1.0.2 1.1.1 1.2.2 Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation. --- Key: SPARK-7883 URL: https://issues.apache.org/jira/browse/SPARK-7883 Project: Spark Issue Type: Bug Components: Documentation, MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0 Reporter: Mike Dusenberry Assignee: Mike Dusenberry Priority: Trivial Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0 The trainImplicit Scala example near the end of the MLlib Collaborative Filtering documentation refers to an ALS.trainImplicit function signature that does not exist. Rather than add an extra function, let's just fix the example. Currently, the example refers to a function that would have the following signature: def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: Double) : MatrixFactorizationModel Instead, let's change the example to refer to this function, which does exist (notice the addition of the lambda parameter): def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double) : MatrixFactorizationModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)
[ https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7637. - Resolution: Fixed Fix Version/s: 1.5.0 StructType.merge slow with large denormalised tables O(N^2) -- Key: SPARK-7637 URL: https://issues.apache.org/jira/browse/SPARK-7637 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Rowan Chattaway Priority: Minor Fix For: 1.5.0 Original Estimate: 24h Remaining Estimate: 24h StructType.merge does a linear scan through the left schema and, for each element, scans the right schema. This results in an O(N^2) algorithm. I have found this to be very slow when dealing with large denormalised parquet files. I would like to make a small change to this function to map the fields of both the left and right schemas, resulting in O(N). This gives a sizable increase in performance for large denormalised schemas: a 1x1 column merge took 2891ms originally and 32ms with the mapped-field approach. This merge can be called many times depending upon the number of files that you need to merge the schemas for, compounding the cost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
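The O(N) idea is to index one schema by field name before merging. A sketch of that shape (not the merged patch; it skips the recursion into nested structs and the nullability reconciliation that the real merge performs):
{noformat}
import org.apache.spark.sql.types.{StructField, StructType}

// Index the right-hand schema by field name once, instead of re-scanning
// it for every left field -- one O(N) pass instead of O(N^2).
def mergeLinear(left: StructType, right: StructType): StructType = {
  val rightByName: Map[String, StructField] = right.fields.map(f => f.name -> f).toMap
  val merged = left.fields.map { lf =>
    rightByName.get(lf.name) match {
      case Some(rf) if rf.dataType == lf.dataType => lf
      case Some(rf) =>
        sys.error(s"conflicting types for ${lf.name}: ${lf.dataType} vs ${rf.dataType}")
      case None => lf
    }
  }
  // Append right-only fields, preserving their original order.
  val leftNames = left.fieldNames.toSet
  StructType(merged ++ right.fields.filterNot(f => leftNames.contains(f.name)))
}
{noformat}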
[jira] [Commented] (SPARK-7577) User guide update for Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560278#comment-14560278 ] Xusen Yin commented on SPARK-7577: -- Oh, yes. Sorry, I forgot it. Thanks for the reminder. User guide update for Bucketizer Key: SPARK-7577 URL: https://issues.apache.org/jira/browse/SPARK-7577 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Xusen Yin Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560283#comment-14560283 ] Konstantin Shaposhnikov commented on SPARK-7042: It looks like the Spark-specific akka-zeromq version (2.3.4-spark) has been modified to work with Scala 2.11. In fact the standard build of akka-zeromq_2.11 (that is available for versions 2.3.7+) depends on the ZeroMQ Scala bindings created by the Spark project (org.spark-project.zeromq:zeromq-scala-binding_2.11:0.0.7-spark). Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
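As a side note, the SI-8549 symptom is easy to reproduce outside of akka. A minimal sketch (the Handshake class is hypothetical; the real offender was akka.actor.Identify):
{code}
import java.io.ObjectStreamClass

@SerialVersionUID(1L)
class Handshake extends Serializable

object SuidCheck extends App {
  // Compiled with Scala 2.11.0 (affected by SI-8549) this prints a
  // JVM-computed hash because the annotation is silently ignored; compiled
  // with 2.11.6 it prints 1. Two such builds cannot deserialize each
  // other's messages, which is exactly the error reported above.
  println(ObjectStreamClass.lookup(classOf[Handshake]).getSerialVersionUID)
}
{code}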
[jira] [Created] (SPARK-7885) add config to control map aggregation in spark sql
jeanlyn created SPARK-7885: -- Summary: add config to control map aggregation in spark sql Key: SPARK-7885 URL: https://issues.apache.org/jira/browse/SPARK-7885 Project: Spark Issue Type: Improvement Affects Versions: 1.3.1, 1.2.2, 1.2.0 Reporter: jeanlyn For now, *execution.HashAggregation* adds map-side aggregation in order to decrease the shuffle data. However, we found GC problems when using this optimization, and eventually the executor crashes. For example, {noformat} select sale_ord_id as order_id, coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount, coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) + coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount, coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount, 0.0 as telecom_point_offer_amount, coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount, coalesce(sum(jq_pay_amount),0.0) + coalesce(sum(pop_shop_jq_pay_amount),0.0) + coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount, coalesce(sum(dq_pay_amount),0.0) + coalesce(sum(pop_shop_dq_pay_amount),0.0) + coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount, coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount, coalesce(sum(mobile_red_packet_pay_amount),0.0) as mobile_red_packet_pay_amount, coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount, coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount, coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount, coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount, coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount from ord_at_det_di where ds = '2015-05-20' group by sale_ord_id {noformat} The SQL scans two text files of 360MB each; we use 6 executors, each with 8GB memory and 2 CPUs. We can add a config to control map-side aggregation to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
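A sketch of what the requested switch could look like from the planner's side. The config key spark.sql.aggregate.enableMapSideCombine is hypothetical, named here purely for illustration; no such option existed at the time of this report:
{code}
import org.apache.spark.sql.SQLContext

// Hypothetical config key, for illustration only.
def mapSideAggEnabled(sqlContext: SQLContext): Boolean =
  sqlContext.getConf("spark.sql.aggregate.enableMapSideCombine", "true").toBoolean
{code}
When this returns false, the planner would skip the partial (map-side) aggregation step and aggregate only after the shuffle, trading extra shuffle volume for smaller per-task hash maps and less GC pressure.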
[jira] [Closed] (SPARK-7866) print the format string in dataframe explain
[ https://issues.apache.org/jira/browse/SPARK-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang closed SPARK-7866. --- Resolution: Won't Fix Wrong print in IntelliJ IDEA; this is really OK, not a problem. print the format string in dataframe explain Key: SPARK-7866 URL: https://issues.apache.org/jira/browse/SPARK-7866 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Priority: Trivial QueryExecution.toString gives a formatted and clear string, so we print it in the DataFrame.explain method -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7885) add config to control map aggregation in spark sql
[ https://issues.apache.org/jira/browse/SPARK-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7885: --- Assignee: (was: Apache Spark) add config to control map aggregation in spark sql -- Key: SPARK-7885 URL: https://issues.apache.org/jira/browse/SPARK-7885 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0, 1.2.2, 1.3.1 Reporter: jeanlyn For now, *execution.HashAggregation* adds map-side aggregation in order to decrease the shuffle data. However, we found GC problems when using this optimization, and eventually the executor crashes. For example, {noformat} select sale_ord_id as order_id, coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount, coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) + coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount, coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount, 0.0 as telecom_point_offer_amount, coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount, coalesce(sum(jq_pay_amount),0.0) + coalesce(sum(pop_shop_jq_pay_amount),0.0) + coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount, coalesce(sum(dq_pay_amount),0.0) + coalesce(sum(pop_shop_dq_pay_amount),0.0) + coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount, coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount, coalesce(sum(mobile_red_packet_pay_amount),0.0) as mobile_red_packet_pay_amount, coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount, coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount, coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount, coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount, coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount from ord_at_det_di where ds = '2015-05-20' group by sale_ord_id {noformat} The SQL scans two text files of 360MB each; we use 6 executors, each with 8GB memory and 2 CPUs. We can add a config to control map-side aggregation to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7885) add config to control map aggregation in spark sql
[ https://issues.apache.org/jira/browse/SPARK-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7885: --- Assignee: Apache Spark add config to control map aggregation in spark sql -- Key: SPARK-7885 URL: https://issues.apache.org/jira/browse/SPARK-7885 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0, 1.2.2, 1.3.1 Reporter: jeanlyn Assignee: Apache Spark For now, *execution.HashAggregation* adds map-side aggregation in order to decrease the shuffle data. However, we found GC problems when using this optimization, and eventually the executor crashes. For example, {noformat} select sale_ord_id as order_id, coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount, coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) + coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount, coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount, 0.0 as telecom_point_offer_amount, coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount, coalesce(sum(jq_pay_amount),0.0) + coalesce(sum(pop_shop_jq_pay_amount),0.0) + coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount, coalesce(sum(dq_pay_amount),0.0) + coalesce(sum(pop_shop_dq_pay_amount),0.0) + coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount, coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount, coalesce(sum(mobile_red_packet_pay_amount),0.0) as mobile_red_packet_pay_amount, coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount, coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount, coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount, coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount, coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount from ord_at_det_di where ds = '2015-05-20' group by sale_ord_id {noformat} The SQL scans two text files of 360MB each; we use 6 executors, each with 8GB memory and 2 CPUs. We can add a config to control map-side aggregation to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7885) add config to control map aggregation in spark sql
[ https://issues.apache.org/jira/browse/SPARK-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560344#comment-14560344 ] Apache Spark commented on SPARK-7885: - User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/6426 add config to control map aggregation in spark sql -- Key: SPARK-7885 URL: https://issues.apache.org/jira/browse/SPARK-7885 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0, 1.2.2, 1.3.1 Reporter: jeanlyn For now, *execution.HashAggregation* adds map-side aggregation in order to decrease the shuffle data. However, we found GC problems when using this optimization, and eventually the executor crashes. For example, {noformat} select sale_ord_id as order_id, coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount, coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) + coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount, coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount, coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount, 0.0 as telecom_point_offer_amount, coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount, coalesce(sum(jq_pay_amount),0.0) + coalesce(sum(pop_shop_jq_pay_amount),0.0) + coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount, coalesce(sum(dq_pay_amount),0.0) + coalesce(sum(pop_shop_dq_pay_amount),0.0) + coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount, coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount, coalesce(sum(mobile_red_packet_pay_amount),0.0) as mobile_red_packet_pay_amount, coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount, coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount, coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount, coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount, coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount from ord_at_det_di where ds = '2015-05-20' group by sale_ord_id {noformat} The SQL scans two text files of 360MB each; we use 6 executors, each with 8GB memory and 2 CPUs. We can add a config to control map-side aggregation to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7734) make explode support struct type
[ https://issues.apache.org/jira/browse/SPARK-7734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-7734. -- Resolution: Not A Problem make explode support struct type Key: SPARK-7734 URL: https://issues.apache.org/jira/browse/SPARK-7734 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7742) Figure out what to do with insertInto w.r.t. DataFrameWriter API
[ https://issues.apache.org/jira/browse/SPARK-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-7742. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Yin Huai We decided to add insertInto to the write API. Figure out what to do with insertInto w.r.t. DataFrameWriter API Key: SPARK-7742 URL: https://issues.apache.org/jira/browse/SPARK-7742 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Yin Huai Priority: Critical Fix For: 1.4.0 See https://github.com/apache/spark/pull/6216 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
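For reference, a minimal usage sketch of the decision above, against the DataFrameWriter API as it shipped in 1.4 (the helper name is made up):
{code}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Appending to an existing table now goes through the writer API.
def appendTo(df: DataFrame, table: String): Unit =
  df.write.mode(SaveMode.Append).insertInto(table)
{code}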
[jira] [Updated] (SPARK-7853) ClassNotFoundException for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-7853: -- Description: Reproduce steps: {code} bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; {code} Throws Exception like: {noformat} 15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'] org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139) at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310) at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457) at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:131) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {noformat} was: Reproduce steps: {code} 
bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; {code} Throws Exception like: {panel} 15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'] org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139) at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310) at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300) at
[jira] [Commented] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560434#comment-14560434 ] Reynold Xin commented on SPARK-4867: That's a good idea. I created SPARK-7886 for that. Can you submit a pull request for it? It'd also be good to look into what expressions cannot be constructed this way. Ideally all functions should just go through the function registry without being hardcoded into the parsers. UDF clean up Key: SPARK-4867 URL: https://issues.apache.org/jira/browse/SPARK-4867 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker Right now our support and internal implementation of many functions have a few issues. Specifically: - UDFs don't know their input types and thus don't do type coercion. - We hardcode a bunch of built-in functions into the parser. This is bad because in SQL it creates new reserved words for things that aren't actually keywords. Also it means that for each function we need to add support to both SQLContext and HiveContext separately. For this JIRA I propose we do the following: - Change the interfaces for registerFunction and ScalaUdf to include types for the input arguments as well as the output type. - Add a rule to analysis that does type coercion for UDFs. - Add a parse rule for functions to SQLParser. - Rewrite all the UDFs that are currently hacked into the various parsers using this new functionality. Depending on how big this refactoring becomes we could split parts 1-2 from part 3 above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
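As a sketch of what type-aware registration looks like in the SQLContext API (the function name here is illustrative):
{code}
import org.apache.spark.sql.SQLContext

// The function's Scala signature carries the input and output types, so
// the analyzer knows strLen expects a String and can insert casts (type
// coercion) for its arguments instead of failing at runtime.
def registerStrLen(sqlContext: SQLContext): Unit =
  sqlContext.udf.register("strLen", (s: String) => s.length)
{code}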
[jira] [Created] (SPARK-7886) Add built-in expressions to FunctionRegistry
Reynold Xin created SPARK-7886: -- Summary: Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Blocker Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
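A toy sketch of the idea, not Spark's actual FunctionRegistry API: the parser resolves any function name through a registry of expression builders, so adding a function never requires a new grammar rule or reserved word:
{code}
import scala.collection.mutable

// E stands in for the expression type; Spark would use catalyst's
// Expression here.
class SimpleFunctionRegistry[E] {
  private val builders = mutable.Map.empty[String, Seq[E] => E]

  def register(name: String)(builder: Seq[E] => E): Unit =
    builders(name.toLowerCase) = builder

  def lookup(name: String, args: Seq[E]): E =
    builders.getOrElse(name.toLowerCase, sys.error(s"undefined function: $name"))(args)
}
{code}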
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560444#comment-14560444 ] Reynold Xin commented on SPARK-7550: [~chenghao] can you work on this one? Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7858) DataSourceStrategy.createPhysicalRDD should use output schema when performing row conversions, not relation schema
[ https://issues.apache.org/jira/browse/SPARK-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7858. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6400 [https://github.com/apache/spark/pull/6400] DataSourceStrategy.createPhysicalRDD should use output schema when performing row conversions, not relation schema -- Key: SPARK-7858 URL: https://issues.apache.org/jira/browse/SPARK-7858 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Fix For: 1.4.0 In {{DataSourceStrategy.createPhysicalRDD}}, we use the relation schema as the target schema for converting incoming rows into Catalyst rows. However, we should be using the output schema instead, since our scan might return a subset of the relation's columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
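To make the distinction concrete, a small sketch with an invented schema: a column-pruned scan emits rows containing only the requested columns, so converting those rows against the full relation schema would misalign fields.
{code}
import org.apache.spark.sql.types._

object SchemaExample extends App {
  // Invented relation schema, for illustration.
  val relationSchema = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType),
    StructField("age", IntegerType)))

  // The scan only produces (name, age); this pruned schema, not
  // relationSchema, must drive the row conversion.
  val requested = Set("name", "age")
  val outputSchema = StructType(relationSchema.fields.filter(f => requested(f.name)))
  println(outputSchema.simpleString)
}
{code}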
[jira] [Resolved] (SPARK-7868) Ignores _temporary directories while listing files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7868. - Issue resolved by pull request 6411 [https://github.com/apache/spark/pull/6411] Ignores _temporary directories while listing files in HadoopFsRelation Key: SPARK-7868 URL: https://issues.apache.org/jira/browse/SPARK-7868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.4.0 In some cases, failed tasks/jobs may leave uncommitted partial/corrupted data in the {{_temporary}} directory. These files should not be counted as input files of a {{HadoopFsRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
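A minimal sketch of the fix's idea, assuming a recursive listing helper (this is not the actual HadoopFsRelation code): committer scratch directories are skipped while collecting input files.
{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def listDataFiles(fs: FileSystem, dir: Path): Seq[FileStatus] =
  fs.listStatus(dir).toSeq.flatMap { status =>
    if (status.isDirectory) {
      // _temporary holds uncommitted output of in-flight or failed jobs.
      if (status.getPath.getName == "_temporary") Nil
      else listDataFiles(fs, status.getPath)
    } else {
      Seq(status)
    }
  }
{code}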
[jira] [Resolved] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6012. - Resolution: Not A Problem I think we do not have this issue after 1.3. I am going to resolve it as Not A Problem. Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator -- Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDD asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment ("logically apply a Limit on top of a Sort"). In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at
org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at
[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6012: Target Version/s: (was: 1.4.0) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator -- Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDD asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment ("logically apply a Limit on top of a Sort"). In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at
org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175)
[jira] [Commented] (SPARK-7853) ClassNotFoundException for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560429#comment-14560429 ] Cheng Lian commented on SPARK-7853: --- OT: [~chenghao] Just edited the JIRA description. When pasting an exception stack trace, {{noformat}} can be preferable to {{panel}} since it uses a monospace font :) ClassNotFoundException for SparkSQL --- Key: SPARK-7853 URL: https://issues.apache.org/jira/browse/SPARK-7853 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Hao Priority: Blocker Reproduce steps: {code} bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; {code} Throws Exception like: {noformat} 15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'] org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139) at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310) at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457) at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:131) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7886: --- Target Version/s: 1.5.0 Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Blocker Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560455#comment-14560455 ] Reynold Xin commented on SPARK-7550: cc [~yhuai]. I think the proposed design is to write the schema out according to Hive's format, for data types that Hive supports. For UDTs, just write them out like what we do right now. Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
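A hedged sketch of that direction, with a deliberately partial mapping invented here for illustration (a real implementation must cover the full Catalyst type system and set the serde properties as well):
{code}
import org.apache.spark.sql.types._

def toHiveTypeString(dt: DataType): Option[String] = dt match {
  case IntegerType => Some("int")
  case LongType => Some("bigint")
  case DoubleType => Some("double")
  case StringType => Some("string")
  case ArrayType(et, _) => toHiveTypeString(et).map(e => s"array<$e>")
  case _ => None // e.g. UDTs: keep the current opaque handling
}
{code}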
[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6923: --- Target Version/s: 1.5.0 (was: 1.4.0) Spark SQL CLI does not read Data Source schema correctly Key: SPARK-6923 URL: https://issues.apache.org/jira/browse/SPARK-6923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: pin_zhang Priority: Critical {code:java} HiveContext hctx = new HiveContext(sc); List<String> sample = new ArrayList<String>(); sample.add("{\"id\": \"id_1\", \"age\":1}"); RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd(); DataFrame df = hctx.jsonRDD(sampleRDD); String table = "test"; df.saveAsTable(table, "json", SaveMode.Overwrite); Table t = hctx.catalog().client().getTable(table); System.out.println(t.getCols()); {code} -- With the code above saving a DataFrame to a Hive table, getting the table cols returns one column named 'col': [FieldSchema(name:col, type:array<string>, comment:from deserializer)]. The expected return is the field schema id, age. This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern), but the result set metadata for the query select * from test does contain the fields id, age. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7887) Remove EvaluatedType from SQL Expression
[ https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7887: --- Assignee: Apache Spark (was: Reynold Xin) Remove EvaluatedType from SQL Expression Key: SPARK-7887 URL: https://issues.apache.org/jira/browse/SPARK-7887 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark It's not a very useful type to use. We can just remove it to simplify expressions slightly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7887) Remove EvaluatedType from SQL Expression
[ https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560462#comment-14560462 ] Apache Spark commented on SPARK-7887: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6427 Remove EvaluatedType from SQL Expression Key: SPARK-7887 URL: https://issues.apache.org/jira/browse/SPARK-7887 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It's not a very useful type to use. We can just remove it to simplify expressions slightly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7887) Remove EvaluatedType from SQL Expression
[ https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7887: --- Assignee: Reynold Xin (was: Apache Spark) Remove EvaluatedType from SQL Expression Key: SPARK-7887 URL: https://issues.apache.org/jira/browse/SPARK-7887 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It's not a very useful type to use. We can just remove it to simplify expressions slightly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560450#comment-14560450 ] Cheng Hao commented on SPARK-7550: -- Similar issue with SPARK-6923 Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560450#comment-14560450 ] Cheng Hao edited comment on SPARK-7550 at 5/27/15 5:48 AM: --- Similar issue with SPARK-6923 ? was (Author: chenghao): Similar issue with SPARK-6923 Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7887) Remove EvaluatedType from SQL Expression
Reynold Xin created SPARK-7887: -- Summary: Remove EvaluatedType from SQL Expression Key: SPARK-7887 URL: https://issues.apache.org/jira/browse/SPARK-7887 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It's not a very useful type to use. We can just remove it to simplify expressions slightly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558748#comment-14558748 ] Sandy Ryza commented on SPARK-7699: --- We can't wait only on the initial allocation being made because YARN might not be able to fully satisfy it in any finite amount of time. Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7042. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6341 [https://github.com/apache/spark/pull/6341] Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7042: - Assignee: Konstantin Shaposhnikov Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558752#comment-14558752 ] Sean Owen commented on SPARK-7699: -- Hm, waiting for x seconds also seems suboptimal since, in the case where the executors aren't needed, you're just delaying releasing them. Does it make sense to wait 1 heartbeat? x heartbeats? to start changing the allocation? Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7110) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication
[ https://issues.apache.org/jira/browse/SPARK-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7110. -- Resolution: Duplicate when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication -- Key: SPARK-7110 URL: https://issues.apache.org/jira/browse/SPARK-7110 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: gu-chi Assignee: Sean Owen Under yarn-client mode, this issue occurs randomly. The authentication method is set to kerberos; when using saveAsNewAPIHadoopFile in PairRDDFunctions to save data to HDFS, the exception comes as: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558758#comment-14558758 ] Santiago M. Mola commented on SPARK-7727: - [~chenghao] I think that is a good idea. Analyzer could be converted into a trait, moving current Analyzer to DefaultAnalyzer. It is probably a good idea to use a separate JIRA and pull request for that though. Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
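A self-contained illustration of the path-dependent-type problem described above (toy classes, not Spark's):
{code}
// With an inner class, each executor subtype carries its own Batch type,
// so batch lists from different executors do not compose.
abstract class InnerStyleExecutor {
  case class Batch(name: String)
  protected val batches: Seq[Batch]
}

// Hoisting Batch into a companion object gives every executor the same
// Batch type, which is what making these classes "outer" achieves.
object OuterStyleExecutor { case class Batch(name: String) }
abstract class OuterStyleExecutor {
  protected val batches: Seq[OuterStyleExecutor.Batch]
}
{code}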
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558757#comment-14558757 ] Sandy Ryza commented on SPARK-7699: --- I think delaying releasing them is exactly the point of the property. If we don't want to do that, what's it there for? Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558764#comment-14558764 ] Sean Owen commented on SPARK-7699: -- initialExecutors? I think it's there for the ramp-up case, really. If load will start soon, but your minimum is 1 because load is variable, it's best not to have to ramp up through 1, 2, 4, 8 executors when you need 100. The problem is evaluating load before any load has had a chance to schedule. Ramping down at all is bad if load is actually applied right away. I'd rather not add another lever here, but is it principled to wait for some multiple of the RM heartbeat, so that the allocation isn't changed until the RM has had a fair chance to allocate resources? Sure, bets are off if there is a delay in scheduling, but what can you do? Nothing breaks here; it's just suboptimal then. Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
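For reference, the setup under discussion can be reproduced programmatically; a minimal sketch of the reporter's configuration (assuming YARN with the external shuffle service, which dynamic allocation requires):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the reporter's setup, set programmatically rather than in
// spark-defaults.conf; assumes YARN plus the external shuffle service.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-repro")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "3")
  .set("spark.dynamicAllocation.maxExecutors", "4")
val sc = new SparkContext(conf)
// Reported behavior: before any job is submitted, the executor count has
// already dropped from the initial 3 to the minimum 2.
{code}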
[jira] [Issue Comment Deleted] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-7727: Comment: was deleted (was: [~evacchi] I'm sorry I opened this duplicate for: https://issues.apache.org/jira/browse/SPARK-7823 Not sure which one to mark as duplicate since both have pull requests.) Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7823) [SQL] Batch, FixedPoint, Strategy should not be inner classes of class RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola resolved SPARK-7823. - Resolution: Duplicate This is a duplicate of https://issues.apache.org/jira/browse/SPARK-7727 [SQL] Batch, FixedPoint, Strategy should not be inner classes of class RuleExecutor --- Key: SPARK-7823 URL: https://issues.apache.org/jira/browse/SPARK-7823 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Edoardo Vacchi Priority: Minor Batch, FixedPoint, Strategy, and Once are defined within the class RuleExecutor[TreeType]. This makes it unnecessarily complicated to reuse batches of rules within custom optimizers. E.g.: {code:java} object DefaultOptimizer extends Optimizer { override val batches = /* batches defined here */ } object MyCustomOptimizer extends Optimizer { override val batches = Batch("my custom batch", ...) :: DefaultOptimizer.batches } {code} MyCustomOptimizer won't compile, because DefaultOptimizer.batches has type Seq[DefaultOptimizer.this.Batch]. Solution: Batch, FixedPoint, etc. should be moved *outside* the RuleExecutor[T] class body, either in a companion object or right in the `rules` package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796 ] Edoardo Vacchi edited comment on SPARK-7727 at 5/26/15 7:44 AM: [~smolav] (about the duplicate) that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best, trait vs. object. Object is currently fine, since batches can be reused through `val batches = DefaultOptimizer.batches`. If we go with traits, though (which I am in favor of), I would also turn SparkPlanner into a trait, for symmetry was (Author: evacchi): [~smolav] (about the duplicate) that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3846) KryoException when doing joins in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558808#comment-14558808 ] Santiago M. Mola commented on SPARK-3846: - [~huangjs] Would you mind adding a test case here (an example of the data and the exact code used to produce the exception)? KryoException when doing joins in SparkSQL --- Key: SPARK-3846 URL: https://issues.apache.org/jira/browse/SPARK-3846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Jianshi Huang The error is reproducible when I join two tables manually. The error message is as follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
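Absent the requested test case, a reproduction of roughly this shape is presumably what is being asked for; all data and table names below are hypothetical, the API is the Spark 1.3-era SQLContext, and the snippet is not verified to trigger the exception:
{code}
// Hypothetical shape of the requested test case; assumes a SparkContext `sc`
// (as in spark-shell) and is NOT confirmed to reproduce the bug.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sqlContext.setConf("spark.sql.codegen", "true")   // codegen on, per the retitled summary

val left  = sc.parallelize(1 to 1000).map(i => (i, s"l$i")).toDF("id", "lv")
val right = sc.parallelize(1 to 1000).map(i => (i, s"r$i")).toDF("id", "rv")
left.registerTempTable("l")
right.registerTempTable("r")

// A join plus aggregation, matching the HashJoin/GeneratedAggregate frames
// in the stack trace above.
sqlContext.sql("SELECT l.id, COUNT(*) FROM l JOIN r ON l.id = r.id GROUP BY l.id").collect()
{code}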
[jira] [Comment Edited] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796 ] Edoardo Vacchi edited comment on SPARK-7727 at 5/26/15 7:41 AM: [~smolav] (about the duplicate) that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best was (Author: evacchi): [~smolav] that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7862) Query would hang when the using script has error output in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhichao-li updated SPARK-7862: -- Description: Steps to reproduce: val data = (1 to 10).map { i => (i, i, i) } data.toDF("d1", "d2", "d3").registerTempTable("script_trans") sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM script_trans") Query would hang when the using script has error output in SparkSQL --- Key: SPARK-7862 URL: https://issues.apache.org/jira/browse/SPARK-7862 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li Steps to reproduce: val data = (1 to 10).map { i => (i, i, i) } data.toDF("d1", "d2", "d3").registerTempTable("script_trans") sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM script_trans") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
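The hang has the shape of a classic pipe-buffer deadlock: if nothing drains the child script's stderr, the child blocks once the OS pipe buffer fills. Below is a generic JVM-level sketch of the usual remedy, draining stderr on a separate thread; this is illustrative code, not the patch in the linked PR:
{code}
// Generic illustration of draining a child process's stderr so it cannot
// block on a full pipe buffer; not the actual fix proposed in the PR.
import java.io.{BufferedReader, InputStreamReader}

val proc = new ProcessBuilder("sh", "-c", "cat 1>&2").start()

// Drain stderr on its own thread; otherwise a script that writes enough
// to stderr fills the pipe and the whole pipeline stalls.
new Thread(new Runnable {
  override def run(): Unit = {
    val err = new BufferedReader(new InputStreamReader(proc.getErrorStream))
    var line = err.readLine()
    while (line != null) {
      System.err.println(s"[child stderr] $line")
      line = err.readLine()
    }
  }
}).start()

proc.getOutputStream.write("some input\n".getBytes("UTF-8"))
proc.getOutputStream.close()
proc.waitFor()
{code}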
[jira] [Comment Edited] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796 ] Edoardo Vacchi edited comment on SPARK-7727 at 5/26/15 7:44 AM: [~smolav] (about the duplicate) that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best, trait vs. object. Object is currently fine, since batches can be reused through `val batches = DefaultOptimizer.batches`. If we go with traits, though (which I am in favor of), I would also turn SparkPlanner into a trait, for symmetry (see also SPARK-6981) was (Author: evacchi): [~smolav] (about the duplicate) that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best, trait vs. object. Object is currently fine, since batches can be reused through `val batches = DefaultOptimizer.batches`. If we go with traits, though (which I am in favor of), I would also turn SparkPlanner into a trait, for symmetry Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3846) KryoException when doing joins in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-3846: Priority: Blocker (was: Major) KryoException when doing joins in SparkSQL --- Key: SPARK-3846 URL: https://issues.apache.org/jira/browse/SPARK-3846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Jianshi Huang Priority: Blocker The error is reproducible when I join two tables manually. The error message is like follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3846) [SQL] Serialization exception (Kryo) on joins when enabling codegen
[ https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-3846: Summary: [SQL] Serialization exception (Kryo) on joins when enabling codegen (was: [SQL] Serialization exception (Kryo and Java) on joins when enabling codegen ) [SQL] Serialization exception (Kryo) on joins when enabling codegen Key: SPARK-3846 URL: https://issues.apache.org/jira/browse/SPARK-3846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Jianshi Huang Priority: Blocker The error is reproducible when I join two tables manually. The error message is like follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3846) [SQL] Serialization exception (Kryo) on joins when enabling codegen
[ https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-3846. -- Resolution: Duplicate [SQL] Serialization exception (Kryo) on joins when enabling codegen Key: SPARK-3846 URL: https://issues.apache.org/jira/browse/SPARK-3846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Jianshi Huang Priority: Blocker The error is reproducible when I join two tables manually. The error message is like follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7864) Clicking a job's DAG graph on Web UI kills the job as the link is broken
Carson Wang created SPARK-7864: -- Summary: Clicking a job's DAG graph on Web UI kills the job as the link is broken Key: SPARK-7864 URL: https://issues.apache.org/jira/browse/SPARK-7864 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Carson Wang When clicking a job's DAG graph on the Web UI, the user is expected to be redirected to the corresponding stage page. The link is obtained from the stage table by selecting the first link, but there are two links in each row: the first is the killLink, the second the nameLink. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558777#comment-14558777 ] Santiago M. Mola commented on SPARK-7727: - [~evacchi] I'm sorry I opened this duplicate for: https://issues.apache.org/jira/browse/SPARK-7823 Not sure which one to mark as duplicate since both have pull requests. Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7862) Query would hang when the using script has error output in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7862: --- Assignee: Apache Spark Query would hang when the using script has error output in SparkSQL --- Key: SPARK-7862 URL: https://issues.apache.org/jira/browse/SPARK-7862 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7562) Improve error reporting for expression data type mismatch
[ https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558836#comment-14558836 ] Wenchen Fan commented on SPARK-7562: After thinking about it, I think we can't just use the ExpectsInputTypes interface. There are some cases where we don't know the exact required input types; for Add, for example, we only need the left and right expressions to have the same data type, and that type to be numeric. I have sent a PR to add a `TypeConstraint` interface, which defines when an Expression has correct input data types and what error message should be generated on a mismatch. Improve error reporting for expression data type mismatch - Key: SPARK-7562 URL: https://issues.apache.org/jira/browse/SPARK-7562 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. cc [~marmbrus] what we discussed offline today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
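The PR itself is not quoted here, but the kind of hook being described might look like the following; all names and signatures are hypothetical, sketched only to make the idea concrete:
{code}
// Hypothetical sketch, not the actual PR's API: each expression states when
// its children's types are acceptable and what to report when they are not.
sealed trait DataTypeLike
case object IntType extends DataTypeLike
case object StringType extends DataTypeLike

trait ExprLike { def dataType: DataTypeLike; def children: Seq[ExprLike] }

trait TypeConstraint { self: ExprLike =>
  // None = types are fine; Some(msg) = analysis error to surface in checkAnalysis
  def checkInputDataTypes(): Option[String]
}

// An Add-style expression: no fixed expected types, only "both sides match and
// are numeric" -- the case the comment says ExpectsInputTypes cannot express.
// (IntType stands in for "numeric" in this toy type system.)
case class AddLike(left: ExprLike, right: ExprLike) extends ExprLike with TypeConstraint {
  val children = Seq(left, right)
  val dataType = left.dataType
  def checkInputDataTypes(): Option[String] =
    if (left.dataType == right.dataType && left.dataType == IntType) None
    else Some(s"cannot add ${left.dataType} and ${right.dataType}")
}
{code}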
[jira] [Assigned] (SPARK-7562) Improve error reporting for expression data type mismatch
[ https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7562: --- Assignee: (was: Apache Spark) Improve error reporting for expression data type mismatch - Key: SPARK-7562 URL: https://issues.apache.org/jira/browse/SPARK-7562 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. cc [~marmbrus] what we discussed offline today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7863) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7863: --- Assignee: (was: Apache Spark) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe --- Key: SPARK-7863 URL: https://issues.apache.org/jira/browse/SPARK-7863 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7863) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558841#comment-14558841 ] Apache Spark commented on SPARK-7863: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6406 SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe --- Key: SPARK-7863 URL: https://issues.apache.org/jira/browse/SPARK-7863 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
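For background, SimpleDateFormat keeps mutable parse state internally, so a single shared instance is unsafe under concurrent requests. A common remedy is one instance per thread; a generic sketch follows (the format string is illustrative, and this is not necessarily the approach taken in the PR):
{code}
// Generic illustration of the thread-safety hazard and a common remedy;
// not necessarily the fix adopted in the linked PR.
import java.text.SimpleDateFormat

// Unsafe: SimpleDateFormat carries mutable parse state, so concurrent
// parse() calls on this shared instance can corrupt each other.
val shared = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")

// Safe alternative: one instance per thread.
val perThread = new ThreadLocal[SimpleDateFormat] {
  override def initialValue(): SimpleDateFormat =
    new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")
}

def parseSafely(s: String) = perThread.get().parse(s)
{code}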
[jira] [Assigned] (SPARK-7862) Query would hang when the using script has error output in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7862: --- Assignee: (was: Apache Spark) Query would hang when the using script has error output in SparkSQL --- Key: SPARK-7862 URL: https://issues.apache.org/jira/browse/SPARK-7862 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7862) Query would hang when the using script has error output in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558797#comment-14558797 ] Apache Spark commented on SPARK-7862: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6404 Query would hang when the using script has error output in SparkSQL --- Key: SPARK-7862 URL: https://issues.apache.org/jira/browse/SPARK-7862 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7727) Avoid inner classes in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796 ] Edoardo Vacchi commented on SPARK-7727: --- [~smolav] that's fine, since I opened my PR later (I didn't see the other). My PR wraps the case classes in a companion object, though. Don't know which solution would be best Avoid inner classes in RuleExecutor --- Key: SPARK-7727 URL: https://issues.apache.org/jira/browse/SPARK-7727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Labels: easyfix, starter In RuleExecutor, the following classes and objects are defined as inner classes or objects: Strategy, Once, FixedPoint, Batch. This does not seem to accomplish anything in this case, but makes extensibility harder. For example, if I want to define a new Optimizer that uses all batches from the DefaultOptimizer plus some more, I would do something like: {code} new Optimizer { override protected val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches } {code} But this will give a typing error because batches in DefaultOptimizer are of type DefaultOptimizer#Batch while myBatches are this#Batch. Workarounds include either copying the list of batches from DefaultOptimizer or using a method like this: {code} private def transformBatchType(b: DefaultOptimizer.Batch): Batch = { val strategy = b.strategy.maxIterations match { case 1 => Once case n => FixedPoint(n) } Batch(b.name, strategy, b.rules) } {code} However, making these classes outer would solve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception
[ https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558823#comment-14558823 ] Santiago M. Mola commented on SPARK-5707: - This is probably a duplicate of SPARK-3846. Enabling spark.sql.codegen throws ClassNotFound exception - Key: SPARK-5707 URL: https://issues.apache.org/jira/browse/SPARK-5707 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Environment: yarn-client mode, spark.sql.codegen=true Reporter: Yi Yao Assignee: Ram Sriharsha Priority: Blocker Exception thrown: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 (TID 3066, cdh52-node2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1 Serialization trace: hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
[jira] [Created] (SPARK-7863) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe
Shixiong Zhu created SPARK-7863: --- Summary: SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe Key: SPARK-7863 URL: https://issues.apache.org/jira/browse/SPARK-7863 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7864) Clicking a job's DAG graph on Web UI kills the job as the link is broken
[ https://issues.apache.org/jira/browse/SPARK-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7864: --- Assignee: Apache Spark Clicking a job's DAG graph on Web UI kills the job as the link is broken Key: SPARK-7864 URL: https://issues.apache.org/jira/browse/SPARK-7864 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Apache Spark When clicking a job's DAG graph on the Web UI, the user is expected to be redirected to the corresponding stage page. The link is obtained from the stage table by selecting the first link, but there are two links in each row: the first is the killLink, the second the nameLink. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org