[jira] [Commented] (SPARK-3950) Completed time is blank for some successful tasks
[ https://issues.apache.org/jira/browse/SPARK-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960220#comment-14960220 ] Jean-Baptiste Onofré commented on SPARK-3950: - I can't reproduce the issue. On 1.6.0-SNAPSHOT, thanks to getFormattedTimeQuantiles(), task durations are expressed in ms when required, like GC time. I think this issue can be closed. > Completed time is blank for some successful tasks > - > > Key: SPARK-3950 > URL: https://issues.apache.org/jira/browse/SPARK-3950 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1 >Reporter: Aaron Davidson > > In the Spark web UI, some tasks appear to have a blank Duration column. It's > possible that these ran for <.5 seconds, but if so, we should use > milliseconds like we do for GC time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
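For context on the behavior described in the comment above, here is a minimal illustrative sketch of millisecond-aware duration formatting; it is not Spark's actual getFormattedTimeQuantiles() or UIUtils code, just the idea of falling back to ms for sub-second values.
{code}
// Illustrative only: render sub-second durations in milliseconds instead of leaving them blank.
object DurationFormat {
  def format(ms: Long): String = {
    if (ms < 1000) s"$ms ms"                               // e.g. 420  -> "420 ms"
    else if (ms < 60 * 1000) "%.1f s".format(ms / 1000.0)  // e.g. 4200 -> "4.2 s"
    else "%.1f min".format(ms / 60000.0)
  }
}
{code}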
[jira] [Commented] (SPARK-10754) table and column name are case sensitive when json Dataframe was registered as tempTable using JavaSparkContext.
[ https://issues.apache.org/jira/browse/SPARK-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960205#comment-14960205 ] Babulal commented on SPARK-10754: - Thank you Huaxin Gao for the reply. I checked with the "spark.sql.caseSensitive=false" option and it works fine. Can we either make it default to false or document it (as you suggested)? I guess it comes from SQLConf.scala: val DIALECT = "spark.sql.dialect" val CASE_SENSITIVE = "spark.sql.caseSensitive" /** * caseSensitive analysis true by default */ def caseSensitiveAnalysis: Boolean = getConf(SQLConf.CASE_SENSITIVE, "true").toBoolean > table and column name are case sensitive when json Dataframe was registered > as tempTable using JavaSparkContext. > - > > Key: SPARK-10754 > URL: https://issues.apache.org/jira/browse/SPARK-10754 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.1 > Environment: Linux, Hadoop Version 1.3 >Reporter: Babulal > > Create a dataframe using the json data source > SparkConf conf=new > SparkConf().setMaster("spark://xyz:7077").setAppName("Spark Table"); > JavaSparkContext javacontext=new JavaSparkContext(conf); > SQLContext sqlContext=new SQLContext(javacontext); > > DataFrame df = > sqlContext.jsonFile("/user/root/examples/src/main/resources/people.json"); > > df.registerTempTable("sparktable"); > > Run the query > > sqlContext.sql("select * from sparktable").show()// this will PASS > > > sqlContext.sql("select * from sparkTable").show()/// This will FAIL > > java.lang.RuntimeException: Table Not Found: sparkTable > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:233) > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
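A minimal sketch of the workaround discussed in the comment above, assuming a Spark 1.x shell where sqlContext is already in scope:
{code}
// Turn off case-sensitive analysis so "sparktable" and "sparkTable" resolve to the same temp table.
sqlContext.setConf("spark.sql.caseSensitive", "false")

sqlContext.sql("select * from sparktable").show()
sqlContext.sql("select * from sparkTable").show()  // now succeeds as well
{code}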
[jira] [Assigned] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11120: Assignee: (was: Apache Spark) > maxNumExecutorFailures defaults to 3 under dynamic allocation > - > > Key: SPARK-11120 > URL: https://issues.apache.org/jira/browse/SPARK-11120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > With dynamic allocation, the {{spark.executor.instances}} config is 0, > meaning [this > line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68] > ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has > resulted in large dynamicAllocation jobs with hundreds of executors dying due > to one bad node serially failing executors that are allocated on it. > I think that using {{spark.dynamicAllocation.maxExecutors}} would make most > sense in this case; I frequently run shells that vary between 1 and 1000 > executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would > still leave me with a value that is lower than makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11120: Assignee: Apache Spark > maxNumExecutorFailures defaults to 3 under dynamic allocation > - > > Key: SPARK-11120 > URL: https://issues.apache.org/jira/browse/SPARK-11120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Assignee: Apache Spark >Priority: Minor > > With dynamic allocation, the {{spark.executor.instances}} config is 0, > meaning [this > line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68] > ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has > resulted in large dynamicAllocation jobs with hundreds of executors dying due > to one bad node serially failing executors that are allocated on it. > I think that using {{spark.dynamicAllocation.maxExecutors}} would make most > sense in this case; I frequently run shells that vary between 1 and 1000 > executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would > still leave me with a value that is lower than makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960174#comment-14960174 ] Apache Spark commented on SPARK-11120: -- User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/9147 > maxNumExecutorFailures defaults to 3 under dynamic allocation > - > > Key: SPARK-11120 > URL: https://issues.apache.org/jira/browse/SPARK-11120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > With dynamic allocation, the {{spark.executor.instances}} config is 0, > meaning [this > line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68] > ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has > resulted in large dynamicAllocation jobs with hundreds of executors dying due > to one bad node serially failing executors that are allocated on it. > I think that using {{spark.dynamicAllocation.maxExecutors}} would make most > sense in this case; I frequently run shells that vary between 1 and 1000 > executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would > still leave me with a value that is lower than makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960169#comment-14960169 ] Ryan Williams commented on SPARK-11120: --- Without dynamic allocation, you are allowed [twice the number of executors] failures, which seems reasonable. With dynamic allocation, {{spark.executor.instances}} doesn't get set, and so you are allowed {{math.max(0 * 2, 3)}} failures, no matter how many executors your job has as its min, initial, and max settings. > maxNumExecutorFailures defaults to 3 under dynamic allocation > - > > Key: SPARK-11120 > URL: https://issues.apache.org/jira/browse/SPARK-11120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > With dynamic allocation, the {{spark.executor.instances}} config is 0, > meaning [this > line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68] > ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has > resulted in large dynamicAllocation jobs with hundreds of executors dying due > to one bad node serially failing executors that are allocated on it. > I think that using {{spark.dynamicAllocation.maxExecutors}} would make most > sense in this case; I frequently run shells that vary between 1 and 1000 > executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would > still leave me with a value that is lower than makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
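A hedged sketch of the proposal in this ticket, not the actual ApplicationMaster code: derive the failure cap from spark.dynamicAllocation.maxExecutors when dynamic allocation is enabled, keeping the same "at least 3, otherwise twice the executor count" shape described above.
{code}
import org.apache.spark.SparkConf

// Illustrative computation only; the config names come from the discussion above.
def maxNumExecutorFailures(conf: SparkConf): Int = {
  val effectiveExecutors =
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      conf.getInt("spark.dynamicAllocation.maxExecutors", 0)
    } else {
      conf.getInt("spark.executor.instances", 0)
    }
  math.max(effectiveExecutors * 2, 3)
}
{code}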
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: Move predictNodeIndex to LearningNode
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960164#comment-14960164 ] Luvsandondov Lkhamsuren commented on SPARK-9963: Please let me know if it needs an additional fix. Thanks > ML RandomForest cleanup: Move predictNodeIndex to LearningNode > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > (updated form original description) > Move ml.tree.impl.RandomForest.predictNodeIndex to LearningNode. > We need to keep it as a separate method from Node.predictImpl because (a) it > needs to operate on binned features and (b) it needs to return the node ID, > not the node (because it can return the ID for nodes which do not yet exist). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-11136: -- Description: The current implementation of Estimator does not support warm-start fitting, i.e. estimator.fit(data, params, partialModel). But first we need to add warm-start for all ML estimators. This is an umbrella JIRA to add support for the warm-start estimator. Treat model as a special parameter, passing it through ParamMap. e.g. val partialModel: Param[Option[M]] = new Param(...). In the case of model existing, we use it to warm-start, else we start the training process from the beginning. was: The current implementation of Estimator does not support warm-start fitting, i.e. estimator.fit(data, params, partialModel). But first we need to add warm-start for all ML estimators. This is an umbrella JIRA to add support for the warm-start estimator. Possible solutions: 1. Add warm-start fitting interface like def fit(dataset: DataFrame, initModel: M, paramMap: ParamMap): M 2. Treat model as a special parameter, passing it through ParamMap. e.g. val partialModel: Param[Option[M]] = new Param(...). In the case of model existing, we use it to warm-start, else we start the training process from the beginning. > Warm-start support for ML estimator > --- > > Key: SPARK-11136 > URL: https://issues.apache.org/jira/browse/SPARK-11136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > The current implementation of Estimator does not support warm-start fitting, > i.e. estimator.fit(data, params, partialModel). But first we need to add > warm-start for all ML estimators. This is an umbrella JIRA to add support for > the warm-start estimator. > Treat model as a special parameter, passing it through ParamMap. e.g. val > partialModel: Param[Option[M]] = new Param(...). In the case of model > existing, we use it to warm-start, else we start the training process from > the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
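A hedged sketch of the Param-based approach in the updated description; HasPartialModel and the names used here are placeholders for illustration, not existing Spark ML traits.
{code}
import org.apache.spark.ml.param.{Param, Params}

// Illustrative mixin: an optional initial model that an Estimator could use to warm-start fitting.
trait HasPartialModel[M] extends Params {
  final val partialModel: Param[Option[M]] =
    new Param(this, "partialModel", "optional model used to warm-start training")
  setDefault(partialModel, None)

  final def getPartialModel: Option[M] = $(partialModel)
  def setPartialModel(value: M): this.type = set(partialModel, Some(value))
}
{code}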
[jira] [Commented] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker
[ https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960119#comment-14960119 ] Klaus Ma commented on SPARK-11143: -- I also got feedback on StackOverflow at http://stackoverflow.com/questions/33160859/how-to-enable-spark-mesos-docker-executor; but I think it's still worth enhancing Spark to work with a plain ubuntu image, because the suggested solution of a special image only sets the work directory :(. > SparkMesosDispatcher can not launch driver in docker > > > Key: SPARK-11143 > URL: https://issues.apache.org/jira/browse/SPARK-11143 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1 > Environment: Ubuntu 14.04 >Reporter: Klaus Ma > > I'm working on integration between Mesos & Spark. For now, I can start > SlaveMesosDispatcher in a docker container, and I'd like to also run the Spark executor in > a Mesos docker container. I did the following configuration for it, but I got an error; > any suggestion? > Configuration: > Spark: conf/spark-defaults.conf > {code} > spark.mesos.executor.docker.image ubuntu > spark.mesos.executor.docker.volumes > /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark > spark.mesos.executor.home /root/spark > #spark.executorEnv.SPARK_HOME /root/spark > spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib > {code} > NOTE: Spark is installed in /home/test/workshop/spark, and all > dependencies are installed. > After submitting SparkPi to the dispatcher, the driver job starts but fails. > The error message is: > {code} > I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 > I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave > b7e24114-7585-40bc-879b-6a1188cb65b6-S1 > WARNING: Your kernel does not support swap limit capabilities, memory limited > without swap. > /bin/sh: 1: ./bin/spark-submit: not found > {code} > Does anyone know how to map/set the Spark home in docker for this case? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
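For reference, the relevant settings from the description expressed programmatically. This is a hedged sketch using the example values from the ticket; the key assumption (following the StackOverflow discussion) is that spark.mesos.executor.home must resolve to a directory that actually contains a Spark install inside the container, so ./bin/spark-submit can be found.
{code}
import org.apache.spark.SparkConf

// Image name and paths are the example values from this ticket, not recommendations.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "ubuntu")
  .set("spark.mesos.executor.docker.volumes", "/home/test/workshop/spark:/root/spark")
  .set("spark.mesos.executor.home", "/root/spark") // must point at a Spark install inside the container
{code}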
[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960116#comment-14960116 ] Xusen Yin commented on SPARK-11136: --- Sure. And I will add more subtasks on this JIRA to indicate other possible warm-start estimators. > Warm-start support for ML estimator > --- > > Key: SPARK-11136 > URL: https://issues.apache.org/jira/browse/SPARK-11136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > The current implementation of Estimator does not support warm-start fitting, > i.e. estimator.fit(data, params, partialModel). But first we need to add > warm-start for all ML estimators. This is an umbrella JIRA to add support for > the warm-start estimator. > Possible solutions: > 1. Add warm-start fitting interface like def fit(dataset: DataFrame, > initModel: M, paramMap: ParamMap): M > 2. Treat model as a special parameter, passing it through ParamMap. e.g. val > partialModel: Param[Option[M]] = new Param(...). In the case of model > existing, we use it to warm-start, else we start the training process from > the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker
[ https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated SPARK-11143: - Description: I'm working on integration between Mesos & Spark. For now, I can start SlaveMesosDispatcher in a docker; and I like to also run Spark executor in Mesos docker. I do the following configuration for it, but I got an error; any suggestion? Configuration: Spark: conf/spark-defaults.conf {code} spark.mesos.executor.docker.imageubuntu spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark spark.mesos.executor.home/root/spark #spark.executorEnv.SPARK_HOME /root/spark spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib {code} NOTE: The spark are installed in /home/test/workshop/spark, and all dependencies are installed. After submit SparkPi to the dispatcher, the driver job is started but failed. The error messes is: {code} I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap. /bin/sh: 1: ./bin/spark-submit: not found {code} Does any know how to map/set spark home in docker for this case? was: I'm working on integration between Mesos & Spark. For now, I can start SlaveMesosDispatcher in a docker; and I like to also run Spark executor in Mesos docker. I do the following configuration for it, but I got an error; any suggestion? Configuration: Spark: conf/spark-defaults.conf spark.mesos.executor.docker.imageubuntu spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark spark.mesos.executor.home/root/spark #spark.executorEnv.SPARK_HOME /root/spark spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib NOTE: The spark are installed in /home/test/workshop/spark, and all dependencies are installed. After submit SparkPi to the dispatcher, the driver job is started but failed. The error messes is: I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap. /bin/sh: 1: ./bin/spark-submit: not found Does any know how to map/set spark home in docker for this case? > SparkMesosDispatcher can not launch driver in docker > > > Key: SPARK-11143 > URL: https://issues.apache.org/jira/browse/SPARK-11143 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1 > Environment: Ubuntu 14.04 >Reporter: Klaus Ma > > I'm working on integration between Mesos & Spark. For now, I can start > SlaveMesosDispatcher in a docker; and I like to also run Spark executor in > Mesos docker. I do the following configuration for it, but I got an error; > any suggestion? > Configuration: > Spark: conf/spark-defaults.conf > {code} > spark.mesos.executor.docker.imageubuntu > spark.mesos.executor.docker.volumes > /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark > spark.mesos.executor.home/root/spark > #spark.executorEnv.SPARK_HOME /root/spark > spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib > {code} > NOTE: The spark are installed in /home/test/workshop/spark, and all > dependencies are installed. 
> After submit SparkPi to the dispatcher, the driver job is started but failed. > The error messes is: > {code} > I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 > I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave > b7e24114-7585-40bc-879b-6a1188cb65b6-S1 > WARNING: Your kernel does not support swap limit capabilities, memory limited > without swap. > /bin/sh: 1: ./bin/spark-submit: not found > {code} > Does any know how to map/set spark home in docker for this case? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker
Klaus Ma created SPARK-11143: Summary: SparkMesosDispatcher can not launch driver in docker Key: SPARK-11143 URL: https://issues.apache.org/jira/browse/SPARK-11143 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.5.1 Environment: Ubuntu 14.04 Reporter: Klaus Ma I'm working on integration between Mesos & Spark. For now, I can start SlaveMesosDispatcher in a docker container, and I'd like to also run the Spark executor in a Mesos docker container. I did the following configuration for it, but I got an error; any suggestion? Configuration: Spark: conf/spark-defaults.conf spark.mesos.executor.docker.image ubuntu spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark spark.mesos.executor.home /root/spark #spark.executorEnv.SPARK_HOME /root/spark spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib NOTE: Spark is installed in /home/test/workshop/spark, and all dependencies are installed. After submitting SparkPi to the dispatcher, the driver job starts but fails. The error message is: I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap. /bin/sh: 1: ./bin/spark-submit: not found Does anyone know how to map/set the Spark home in docker for this case? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version
[ https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11127: -- Assignee: Tathagata Das > Upgrade Kinesis Client Library to the latest stable version > --- > > Key: SPARK-11127 > URL: https://issues.apache.org/jira/browse/SPARK-11127 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Tathagata Das > > We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with > Kinesis Producer Library (KPL) and support auto de-aggregation. It would be > great to upgrade KCL to the latest stable version. > Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with > dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See > https://github.com/awslabs/amazon-kinesis-client#release-notes. > [~tdas] [~brkyvz] Please recommend a version for upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
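A hedged sbt-style sketch of what the dependency bump would look like; the exact version to pick is the open question in this ticket, and 1.6.1 is simply the latest release noted in the description.
{code}
// build fragment (illustrative only; Spark itself manages this in its Maven build)
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.6.1"
{code}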
[jira] [Commented] (SPARK-9695) Add random seed Param to ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960080#comment-14960080 ] Joseph K. Bradley commented on SPARK-9695: -- {quote}I think we should store the whole pipeline and each stage's seed to reproduce the same results.{quote} --> This will be possible for PipelineModel (with a fixed set of stages), but can we do it for Pipeline (with a mutable set of stages)? We might have to have a weaker set of guarantees for Pipeline than PipelineModel. That'd be great if you can send a patch---thanks! > Add random seed Param to ML Pipeline > > > Key: SPARK-9695 > URL: https://issues.apache.org/jira/browse/SPARK-9695 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Note this will require some discussion about whether to make HasSeed the main > API for whether an algorithm takes a seed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
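A hedged sketch of the propagation rule being discussed (a stage's own seed wins over the pipeline seed); SeededStage is a stand-in for an HasSeed-style mixin, written standalone so the example is self-contained and not tied to Spark's internal shared-params traits.
{code}
// Illustrative only, not Pipeline's implementation.
trait SeededStage {
  private var explicitSeed: Option[Long] = None
  def setSeed(value: Long): this.type = { explicitSeed = Some(value); this }
  def seedIsSet: Boolean = explicitSeed.isDefined
  def effectiveSeed(pipelineDefault: Long): Long = explicitSeed.getOrElse(pipelineDefault)
}

// Push the pipeline-level seed down only to stages whose seed the user did not set explicitly.
def propagateSeed(stages: Seq[AnyRef], pipelineSeed: Long): Unit =
  stages.foreach {
    case s: SeededStage if !s.seedIsSet => s.setSeed(pipelineSeed)
    case _ => // stage already has its own seed, or takes none
  }
{code}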
[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960073#comment-14960073 ] Joseph K. Bradley commented on SPARK-11136: --- We should definitely have it be a Param. I just comment on the KMeans JIRA about that. Thanks for pointing out that issue. Would you mind updating this JIRA's description to specify that as the chosen option? > Warm-start support for ML estimator > --- > > Key: SPARK-11136 > URL: https://issues.apache.org/jira/browse/SPARK-11136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > The current implementation of Estimator does not support warm-start fitting, > i.e. estimator.fit(data, params, partialModel). But first we need to add > warm-start for all ML estimators. This is an umbrella JIRA to add support for > the warm-start estimator. > Possible solutions: > 1. Add warm-start fitting interface like def fit(dataset: DataFrame, > initModel: M, paramMap: ParamMap): M > 2. Treat model as a special parameter, passing it through ParamMap. e.g. val > partialModel: Param[Option[M]] = new Param(...). In the case of model > existing, we use it to warm-start, else we start the training process from > the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960072#comment-14960072 ] Joseph K. Bradley commented on SPARK-10780: --- [~jayants] I agree with [~yinxusen]: The initialModel should be a Param and follow the example of other Params. Could you please update your PR accordingly? Thanks! > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960063#comment-14960063 ] Xusen Yin commented on SPARK-11136: --- I have already linked all related issues. [~josephkb] Which kind of methods of supporting warm-start do you prefer? Or other feasible suggestions? In [~jayants]'s code of KMeans warm-start we can see the 3rd implementation. > Warm-start support for ML estimator > --- > > Key: SPARK-11136 > URL: https://issues.apache.org/jira/browse/SPARK-11136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > The current implementation of Estimator does not support warm-start fitting, > i.e. estimator.fit(data, params, partialModel). But first we need to add > warm-start for all ML estimators. This is an umbrella JIRA to add support for > the warm-start estimator. > Possible solutions: > 1. Add warm-start fitting interface like def fit(dataset: DataFrame, > initModel: M, paramMap: ParamMap): M > 2. Treat model as a special parameter, passing it through ParamMap. e.g. val > partialModel: Param[Option[M]] = new Param(...). In the case of model > existing, we use it to warm-start, else we start the training process from > the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960052#comment-14960052 ] Jia Li commented on SPARK-5472: --- [~tmyklebu] Does your PR handle BINARY type? Thanks, > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
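A short usage sketch of the DataFrame JDBC path this ticket introduced, written against the Spark 1.4+ read/write API and assuming a sqlContext is in scope; the URL, table names, and credentials are placeholders.
{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")
props.setProperty("password", "secret")

// Read a database table as a DataFrame, work with it like any other DataFrame, then write back.
val people = sqlContext.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", "people", props)
people.filter("age > 21").write.jdbc("jdbc:postgresql://dbhost:5432/mydb", "adults", props)
{code}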
[jira] [Updated] (SPARK-11142) org.datanucleus is already registered
[ https://issues.apache.org/jira/browse/SPARK-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] raicho updated SPARK-11142: --- Priority: Minor (was: Major) > org.datanucleus is already registered > - > > Key: SPARK-11142 > URL: https://issues.apache.org/jira/browse/SPARK-11142 > Project: Spark > Issue Type: Question > Components: Spark Shell >Affects Versions: 1.5.1 > Environment: Windows7 Home Basic >Reporter: raicho >Priority: Minor > Fix For: 1.5.1 > > > I first setup Spark this Wednesday on my computer. When I executed > spark-shell.cmd, warns shows on the screen like "org.datanucleus is already > registered. Ensure you don't have multiple JAR versions of the same plugin in > the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is > already registered and you are trying to register an identical plugin located > at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar" " and > "org.datanucleus.api.jdo is already registered. Ensure you don't have > multiple JAR versions of the same plugin in the classpath. The URL > "file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you > are trying to register an identical plugin located at URL > "file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" " > The two URLs shown in fact mean the same path. I tried to find the classpath > in the configuration files but failed. No other codes outside has been > executed on spark yet. > What happened and how to deal with the warn? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960041#comment-14960041 ] Xusen Yin commented on SPARK-10780: --- This belongs to SPARK-11136. But we need to pay more attention to a unified implementation, since other estimators will add warm-start support. > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for the > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9695) Add random seed Param to ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960034#comment-14960034 ] Yanbo Liang commented on SPARK-9695: I agree that if users set a pipeline stage's seed, it should take priority over the pipeline's seed. As for pipeline save and load, I think we should store both the whole pipeline's seed and each stage's seed to reproduce the same results; this should be considered in the pipeline and stage save/load related tasks. I think the assumption that the random number generator should not change behavior across Spark versions is reasonable. I will try to submit an initial patch for this issue and look forward to your comments. > Add random seed Param to ML Pipeline > > > Key: SPARK-9695 > URL: https://issues.apache.org/jira/browse/SPARK-9695 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Note this will require some discussion about whether to make HasSeed the main > API for whether an algorithm takes a seed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11142) org.datanucleus is already registered
raicho created SPARK-11142: -- Summary: org.datanucleus is already registered Key: SPARK-11142 URL: https://issues.apache.org/jira/browse/SPARK-11142 Project: Spark Issue Type: Question Components: Spark Shell Affects Versions: 1.5.1 Environment: Windows7 Home Basic Reporter: raicho Fix For: 1.5.1 I first set up Spark this Wednesday on my computer. When I executed spark-shell.cmd, warnings showed on the screen like "org.datanucleus is already registered. Ensure you don't have multiple JAR versions of the same plugin in the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is already registered and you are trying to register an identical plugin located at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar" " and "org.datanucleus.api.jdo is already registered. Ensure you don't have multiple JAR versions of the same plugin in the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you are trying to register an identical plugin located at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" " The two URLs shown in fact refer to the same path. I tried to find the classpath in the configuration files but failed. No other code has been executed on Spark yet. What happened, and how should I deal with the warning? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978 ] Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:43 AM: -- [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one, but it can not completely shared code depends on different transformers. So can I firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used? was (Author: yanboliang): [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one, but it can not completely shared code depends on different transformers. I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used. > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. 
Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. > Estimators under {{ml.feature}} should also support this. > This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
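A hedged sketch of the inputCols/outputCols params proposed in the comment above; the trait below is purely illustrative and is not claiming to be an existing shared-params trait in Spark.
{code}
import org.apache.spark.ml.param.{Params, StringArrayParam}

// Illustrative: multi-column input/output params that could sit alongside the single-column ones.
trait HasMultiColumns extends Params {
  final val inputCols: StringArrayParam =
    new StringArrayParam(this, "inputCols", "input column names")
  final val outputCols: StringArrayParam =
    new StringArrayParam(this, "outputCols", "output column names")

  final def getInputCols: Array[String] = $(inputCols)
  final def getOutputCols: Array[String] = $(outputCols)

  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
}
{code}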
[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978 ] Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:26 AM: -- [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one, but it can not completely shared code depends on different transformers. I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used. was (Author: yanboliang): [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one. I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used. > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. 
> Estimators under {{ml.feature}} should also support this. > This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978 ] Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:23 AM: -- [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one. I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used. was (Author: yanboliang): [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. I will firstly try to start with OneHotEncoder which is mostly common used. > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. > Estimators under {{ml.feature}} should also support this. > This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978 ] Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:23 AM: -- [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one. I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used. was (Author: yanboliang): [~josephkb] I don't think RFormula is the best way to resolve this issue because it still use the pipeline chained transformers one by one to encode multiple columns which is low performance. I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to reimplement all transformers to support a multi-value implementation because of some feature transformers not needed. Brief design doc: * How input and output columns will be specified /** @group setParam */ def setInputCols(value: Array[String]): this.type = set(inputCols, value) /** @group setParam */ def setOutputCols(value: Array[String]): this.type = set(outputCols, value) * Schema validation Make transformSchema adaptive to multiple input and output columns. * Code sharing to reduce duplication For backwards compatibility, we must not modify current Params, we add a new one for multiple inputs (and check for conflicting settings when running). Reimplement transformers to support multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. I think we should maximum reuse the transform function of a single-value to implement the multi-value one. I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder which is mostly common used. > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. > Estimators under {{ml.feature}} should also support this. 
> This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978 ] Yanbo Liang commented on SPARK-8418: [~josephkb] I don't think RFormula is the best way to resolve this issue because it still uses pipeline-chained transformers one by one to encode multiple columns, which performs poorly. I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to reimplement all transformers to support a multi-value implementation, because some feature transformers don't need it. I will first try to start with OneHotEncoder, which is the most commonly used. > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. > Estimators under {{ml.feature}} should also support this. > This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot
[ https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959960#comment-14959960 ] Donam Kim commented on SPARK-10066: --- I have same problem with Spark 1.5.1 on HDP 2.3.1 > Can't create HiveContext with spark-shell or spark-sql on snapshot > -- > > Key: SPARK-10066 > URL: https://issues.apache.org/jira/browse/SPARK-10066 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.0 > Environment: Centos 6.6 >Reporter: Robert Beauchemin >Priority: Minor > > Built the 1.5.0-preview-20150812 with the following: > ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive > -Phive-thriftserver -Psparkr -DskipTests > Starting spark-shell or spark-sql returns the following error: > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rwx-- > at > org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612) > [elided] > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508) > > It's trying to create a new HiveContext. Running pySpark or sparkR works and > creates a HiveContext successfully. SqlContext can be created successfully > with any shell. > I've tried changing permissions on that HDFS directory (even as far as making > it world-writable) without success. Tried changing SPARK_USER and also > running spark-shell as different users without success. > This works on same machine on 1.4.1 and on earlier pre-release versions of > Spark 1.5.0 (same make-distribution parms) sucessfully. Just trying the > snapshot... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
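A hedged workaround sketch, which is an assumption based on the error text rather than a fix from this ticket: widen the permissions of the HDFS scratch dir that SessionState checks at HiveContext startup.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Grant /tmp/hive the writable permissions the HiveContext startup check expects.
val fs = FileSystem.get(new Configuration())
fs.setPermission(new Path("/tmp/hive"), new FsPermission(Integer.parseInt("733", 8).toShort))
{code}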
[jira] [Assigned] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11109: Assignee: (was: Apache Spark) > move FsHistoryProvider off import > org.apache.hadoop.fs.permission.AccessControlException > > > Key: SPARK-11109 > URL: https://issues.apache.org/jira/browse/SPARK-11109 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Steve Loughran >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > {{FsHistoryProvider}} imports and uses > {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been > superceded by its subclass > {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to > that subclass would remove a deprecation warning and ensure that were the > Hadoop team to remove that old method (as HADOOP-11356 has currently done to > trunk), everything still compiles and links -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959958#comment-14959958 ] Apache Spark commented on SPARK-11109: -- User 'gweidner' has created a pull request for this issue: https://github.com/apache/spark/pull/9144 > move FsHistoryProvider off import > org.apache.hadoop.fs.permission.AccessControlException > > > Key: SPARK-11109 > URL: https://issues.apache.org/jira/browse/SPARK-11109 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Steve Loughran >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > {{FsHistoryProvider}} imports and uses > {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been > superseded by its subclass > {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to > that subclass would remove a deprecation warning and ensure that were the > Hadoop team to remove that old method (as HADOOP-11356 has currently done to > trunk), everything still compiles and links -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11109: Assignee: Apache Spark > move FsHistoryProvider off import > org.apache.hadoop.fs.permission.AccessControlException > > > Key: SPARK-11109 > URL: https://issues.apache.org/jira/browse/SPARK-11109 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > {{FsHistoryProvider}} imports and uses > {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been > superseded by its subclass > {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to > that subclass would remove a deprecation warning and ensure that were the > Hadoop team to remove that old method (as HADOOP-11356 has currently done to > trunk), everything still compiles and links -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
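The change described above is essentially an import and catch-clause swap; a minimal, illustrative sketch (not FsHistoryProvider's actual code) might look like this:
{code}
// Use the newer, non-deprecated subclass named in the issue description.
import org.apache.hadoop.security.AccessControlException

def replayIfReadable(readEventLog: () => Unit): Unit = {
  try {
    readEventLog()
  } catch {
    case _: AccessControlException =>
      // Skip event logs the history server is not permitted to read.
  }
}
{code}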
[jira] [Comment Edited] (SPARK-10513) Springleaf Marketing Response
[ https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959949#comment-14959949 ] Yanbo Liang edited comment on SPARK-10513 at 10/16/15 12:45 AM: [~josephkb] For 4: If a column of StringType contains the value "" (not null), StringIndexer will transform it correctly, but OneHotEncoder will throw an exception because "" cannot be assigned as a feature name. I think we should discuss whether it is legal for a categorical feature to contain the value ""; otherwise we should filter out these values or replace "" with another user-specified value. was (Author: yanboliang): [~josephkb] For 4: If a column of StringType has "" value (not null), StringIndexer will transform it right, but OneHotEncoder will throw exception caused of "" can not as a feature name. I think we should discuss that whether it is legal that one category feature contains "" value, otherwise we should filter these kinds of values or replaced "" with other user specified values? > Springleaf Marketing Response > - > > Key: SPARK-10513 > URL: https://issues.apache.org/jira/browse/SPARK-10513 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Apply ML pipeline API to Springleaf Marketing Response > (https://www.kaggle.com/c/springleaf-marketing-response) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10513) Springleaf Marketing Response
[ https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959949#comment-14959949 ] Yanbo Liang commented on SPARK-10513: - [~josephkb] For 4: If a column of StringType has the value "" (not null), StringIndexer will transform it correctly, but OneHotEncoder will throw an exception because "" cannot be used as a feature name. I think we should discuss whether it is legal for a categorical feature to contain the value ""; otherwise we should filter out these values or replace "" with another user-specified value. > Springleaf Marketing Response > - > > Key: SPARK-10513 > URL: https://issues.apache.org/jira/browse/SPARK-10513 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Apply ML pipeline API to Springleaf Marketing Response > (https://www.kaggle.com/c/springleaf-marketing-response) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
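A hedged Scala sketch of the two workarounds mentioned in the comments above, applied before StringIndexer and OneHotEncoder ever see the column; the DataFrame {{df}} and column name {{v}} are assumed for illustration:
{code}
import org.apache.spark.sql.functions.{col, when}

// Option 1: drop rows whose category value is the empty string.
val filtered = df.filter("v != ''")

// Option 2: replace "" with an explicit, user-specified placeholder category.
val replaced = df.withColumn("v", when(col("v") === "", "__EMPTY__").otherwise(col("v")))

// Either result can then be fed to StringIndexer and OneHotEncoder as usual.
{code}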
[jira] [Resolved] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11135. -- Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9140 [https://github.com/apache/spark/pull/9140] > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.6.0, 1.5.2 > > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
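The rule described above can be summarized as: the required ordering is satisfied only when it is a prefix of the existing ordering, never the other way around. A minimal sketch of that check (simplified to plain equality; not the actual Exchange planner code):
{code}
// Returns true when no extra sort is needed: every required sort expression matches the
// corresponding leading expression of the existing ordering (i.e. required is a prefix).
def orderingSatisfied[T](existing: Seq[T], required: Seq[T]): Boolean =
  required.length <= existing.length &&
    existing.zip(required).forall { case (e, r) => e == r }

// orderingSatisfied(Seq("a.asc", "b.asc"), Seq("a.asc"))   // true: no re-sort needed
// orderingSatisfied(Seq("a.asc"), Seq("a.asc", "b.asc"))   // false: an extra sort is required
{code}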
[jira] [Assigned] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
[ https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11141: Assignee: (was: Apache Spark) > Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes > -- > > Key: SPARK-11141 > URL: https://issues.apache.org/jira/browse/SPARK-11141 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Burak Yavuz > > When using S3 as a directory for WALs, the writes take too long. The driver > gets very easily bottlenecked when multiple receivers send AddBlock events to > the ReceiverTracker. This PR adds batching of events in the > ReceivedBlockTracker so that receivers don't get blocked by the driver for > too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
[ https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959939#comment-14959939 ] Apache Spark commented on SPARK-11141: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/9143 > Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes > -- > > Key: SPARK-11141 > URL: https://issues.apache.org/jira/browse/SPARK-11141 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Burak Yavuz > > When using S3 as a directory for WALs, the writes take too long. The driver > gets very easily bottlenecked when multiple receivers send AddBlock events to > the ReceiverTracker. This PR adds batching of events in the > ReceivedBlockTracker so that receivers don't get blocked by the driver for > too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
[ https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11141: Assignee: Apache Spark > Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes > -- > > Key: SPARK-11141 > URL: https://issues.apache.org/jira/browse/SPARK-11141 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Burak Yavuz >Assignee: Apache Spark > > When using S3 as a directory for WALs, the writes take too long. The driver > gets very easily bottlenecked when multiple receivers send AddBlock events to > the ReceiverTracker. This PR adds batching of events in the > ReceivedBlockTracker so that receivers don't get blocked by the driver for > too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
Burak Yavuz created SPARK-11141: --- Summary: Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes Key: SPARK-11141 URL: https://issues.apache.org/jira/browse/SPARK-11141 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Burak Yavuz When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don't get blocked by the driver for too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
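To illustrate the batching idea (purely a sketch; the class and method names below are invented, not Spark's ReceivedBlockTracker API): many queued events are drained and written as a single batch, so receivers pay for one slow S3 write instead of one write per AddBlock event.
{code}
import java.util.concurrent.LinkedBlockingQueue
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

class BatchingWalWriter(writeBatch: Seq[Array[Byte]] => Unit, maxBatch: Int = 128) {
  private val queue = new LinkedBlockingQueue[Array[Byte]]()

  // Receivers enqueue their records and return immediately instead of blocking on the WAL.
  def add(record: Array[Byte]): Unit = queue.put(record)

  // Run by a single writer thread: block for the first record, drain whatever else is
  // queued (up to maxBatch records), and write the whole batch with one call.
  def flushOnce(): Unit = {
    val buf = new JArrayList[Array[Byte]]()
    buf.add(queue.take())
    queue.drainTo(buf, maxBatch - 1)
    writeBatch(buf.asScala)
  }
}
{code}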
[jira] [Assigned] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11102: Assignee: (was: Apache Spark) > Uninformative exception when specifing non-exist input for JSON data source > --- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at 
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11102: Assignee: Apache Spark > Uninformative exception when specifing non-exist input for JSON data source > --- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at 
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959897#comment-14959897 ] Apache Spark commented on SPARK-11102: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/9142 > Uninformative exception when specifing non-exist input for JSON data source > --- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
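A sketch of the kind of up-front check this improvement asks for (not the actual patch in the pull request above): verify the path exists before the JSON relation runs schema inference, and fail with a message that names the missing path instead of the generic "No input paths specified in job".
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper; in a real Spark job the Hadoop Configuration would come from
// sparkContext.hadoopConfiguration rather than being created here.
def requireInputPathExists(pathStr: String, conf: Configuration = new Configuration()): Unit = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(conf)
  if (!fs.exists(path)) {
    throw new IllegalArgumentException(s"Input path does not exist: $pathStr")
  }
}

// requireInputPathExists("/user/root/examples/src/main/resources/people.json")
// sqlContext.read.json("/user/root/examples/src/main/resources/people.json")
{code}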
[jira] [Assigned] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one
[ https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10560: Assignee: Apache Spark > Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one > > > Key: SPARK-10560 > URL: https://issues.apache.org/jira/browse/SPARK-10560 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > The StreamingLogisticRegressionWithSGD Python API lacks some parameters > compared with the Scala one; here we make them equal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one
[ https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10560: Assignee: (was: Apache Spark) > Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one > > > Key: SPARK-10560 > URL: https://issues.apache.org/jira/browse/SPARK-10560 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > The StreamingLogisticRegressionWithSGD Python API lacks some parameters > compared with the Scala one; here we make them equal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one
[ https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959892#comment-14959892 ] Apache Spark commented on SPARK-10560: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/9141 > Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one > > > Key: SPARK-10560 > URL: https://issues.apache.org/jira/browse/SPARK-10560 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > The StreamingLogisticRegressionWithSGD Python API lacks some parameters > compared with the Scala one; here we make them equal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics
[ https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959876#comment-14959876 ] Nick Pritchard commented on SPARK-11126: Is there any workaround to avoid this memory leak? > A memory leak in SQLListener._stageIdToStageMetrics > --- > > Key: SPARK-11126 > URL: https://issues.apache.org/jira/browse/SPARK-11126 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu > > SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes > stage infos belonging to SQL executions. > Reported by Terry Hoo in > https://www.mail-archive.com/user@spark.apache.org/msg38810.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
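A purely illustrative sketch of the leak pattern described in this issue (generic code, not Spark's SQLListener): metrics are recorded for every stage, but cleanup only ever removes the stages that belong to a SQL execution, so the rest accumulate for the lifetime of the application.
{code}
import scala.collection.mutable

class LeakyListenerSketch {
  // Grows on every stage submission, whether or not the stage belongs to a SQL execution.
  private val stageIdToMetrics = mutable.HashMap.empty[Int, String]

  def onStageSubmitted(stageId: Int, metrics: String): Unit =
    stageIdToMetrics(stageId) = metrics

  // Cleanup is keyed off SQL executions, so stages of non-SQL jobs are never removed.
  def onSqlExecutionEnd(sqlStageIds: Seq[Int]): Unit =
    sqlStageIds.foreach(stageIdToMetrics.remove)
}
{code}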
[jira] [Updated] (SPARK-2629) Improve performance of DStream.updateStateByKey
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2629: - Target Version/s: 1.6.0 > Improve performance of DStream.updateStateByKey > --- > > Key: SPARK-2629 > URL: https://issues.apache.org/jira/browse/SPARK-2629 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2629) Improve performance of DStream.updateStateByKey
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2629: - Affects Version/s: 0.9.2 1.0.2 1.2.2 1.3.1 1.4.1 1.5.1 > Improve performance of DStream.updateStateByKey > --- > > Key: SPARK-2629 > URL: https://issues.apache.org/jira/browse/SPARK-2629 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11140) Replace file server in driver with RPC-based alternative
Marcelo Vanzin created SPARK-11140: -- Summary: Replace file server in driver with RPC-based alternative Key: SPARK-11140 URL: https://issues.apache.org/jira/browse/SPARK-11140 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin As part of making configuring encryption easy in Spark, it would be better to use the existing RPC channel between driver and executors to transfer files and jars added to the application. This would remove the need to start the HTTP server currently used for that purpose, which needs to be configured to use SSL if encryption is wanted. SSL is kinda hard to configure correctly in a multi-user, distributed environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
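As a rough illustration of the architecture being proposed (all names below are hypothetical, not Spark's actual RPC messages): file and jar transfers become request/response messages on the existing driver-executor RPC channel, so securing that one channel covers them too.
{code}
// Hypothetical message shapes for an RPC-based file server; invented for illustration only.
sealed trait DriverFileMessage
case class OpenStream(path: String) extends DriverFileMessage                 // executor -> driver
case class StreamChunk(streamId: Long, data: Array[Byte], isLast: Boolean)
  extends DriverFileMessage                                                   // driver -> executor

// Because these ride the existing RPC channel, there is no separate HTTP endpoint on the
// driver to secure with SSL; encryption is configured once for the RPC layer.
{code}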
[jira] [Created] (SPARK-11139) Make SparkContext.stop() exception-safe
Felix Cheung created SPARK-11139: Summary: Make SparkContext.stop() exception-safe Key: SPARK-11139 URL: https://issues.apache.org/jira/browse/SPARK-11139 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.1 Reporter: Felix Cheung Priority: Minor In SparkContext.stop(), when an exception is thrown the rest of the stop/cleanup action is aborted. Work has been done in SPARK-4194 to allow for cleanup to partial initialization. Similarly issue in StreamingContext SPARK-11137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11138) Flaky pyspark test: test_add_py_file
Marcelo Vanzin created SPARK-11138: -- Summary: Flaky pyspark test: test_add_py_file Key: SPARK-11138 URL: https://issues.apache.org/jira/browse/SPARK-11138 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.6.0 Reporter: Marcelo Vanzin This test fails pretty often when running PR tests. For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console {noformat} == ERROR: test_add_py_file (__main__.AddFileTests) -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 396, in test_add_py_file res = self.sc.parallelize(range(2)).map(func).first() File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1315, in first rs = self.take(1) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1297, in take res = self.context.runJob(self, takeUpToNumLeft, p) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py", line 923, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ self.target_id, self.name) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 7, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main process() File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1293, in takeUpToNumLeft yield next(iterator) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 388, in func from userlibrary import UserClass ImportError: cannot import name UserClass at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1414) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:793) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1639) at org.
[jira] [Created] (SPARK-11137) Make StreamingContext.stop() exception-safe
Felix Cheung created SPARK-11137: Summary: Make StreamingContext.stop() exception-safe Key: SPARK-11137 URL: https://issues.apache.org/jira/browse/SPARK-11137 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.1 Reporter: Felix Cheung Priority: Minor In StreamingContext.stop(), when an exception is thrown the rest of the stop/cleanup action is aborted. Discussed in https://github.com/apache/spark/pull/9116, srowen commented Hm, this is getting unwieldy. There are several nested try blocks here. The same argument goes for many of these methods -- if one fails should they not continue trying? A more tidy solution would be to execute a series of () -> Unit code blocks that perform some cleanup and make sure that they each fire in succession, regardless of the others. The final one to remove the shutdown hook could occur outside synchronization. I realize we're expanding the scope of the change here, but is it maybe worthwhile to go all the way here? Really, something similar could be done for SparkContext and there's an existing JIRA for it somewhere. At least, I'd prefer to either narrowly fix the deadlock here, or fix all of the finally-related issue separately and all at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
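A minimal sketch of the approach outlined in the quoted review comment: run each cleanup step as its own block and log-and-continue on failure, so one failing step cannot abort the rest. The helper names here are illustrative, not Spark's API.
{code}
import scala.util.control.NonFatal

// Run a cleanup block, logging (here: printing) non-fatal errors instead of rethrowing.
def tryLogNonFatalError(block: => Unit): Unit =
  try block catch { case NonFatal(e) => println(s"Ignoring error during stop(): $e") }

// Each step fires regardless of whether earlier steps threw.
def stopAll(steps: Seq[() => Unit]): Unit =
  steps.foreach(step => tryLogNonFatalError(step()))

// stopAll(Seq(() => stopReceivers(), () => stopJobGenerator(), () => removeShutdownHook()))
{code}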
[jira] [Updated] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11102: --- Summary: Uninformative exception when specifing non-exist input for JSON data source (was: Unreadable exception when specifing non-exist input for JSON data source) > Uninformative exception when specifing non-exist input for JSON data source > --- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11128) strange NPE when writing in non-existing S3 bucket
[ https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11128: -- Component/s: Input/Output > strange NPE when writing in non-existing S3 bucket > -- > > Key: SPARK-11128 > URL: https://issues.apache.org/jira/browse/SPARK-11128 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.1 >Reporter: mathieu despriee >Priority: Minor > > For the record, as it's relatively minor, and related to s3n (not tested with > s3a). > By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, > with a simple df.write.parquet(s3path). > We got a NPE (see stack trace below), which is very misleading. > java.lang.NullPointerException > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11124) JsonParser/Generator should be closed for resource recycle
[ https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11124: -- Component/s: Spark Core > JsonParser/Generator should be closed for resource recycle > -- > > Key: SPARK-11124 > URL: https://issues.apache.org/jira/browse/SPARK-11124 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Navis >Priority: Trivial > > Some json parsers are not closed. parser in JacksonParser#parseJson, for > example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
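A small sketch of the fix this issue asks for: make sure the Jackson parser is always closed, for example with a loan-pattern helper (the helper itself is illustrative, not Spark's code).
{code}
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

// Lend the parser to the caller and close it even if the body throws.
def withJsonParser[T](factory: JsonFactory, json: String)(body: JsonParser => T): T = {
  val parser = factory.createParser(json)
  try body(parser) finally parser.close()
}

// val factory = new JsonFactory()
// withJsonParser(factory, """{"a": 1}""") { p => while (p.nextToken() != null) {} }
{code}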
[jira] [Resolved] (SPARK-11123) Improve HistoryServer with multithreading to replay logs
[ https://issues.apache.org/jira/browse/SPARK-11123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11123. --- Resolution: Duplicate [~xietingwen] please search JIRAs before opening a new one. > Improve HistoryServer with multithreading to replay logs > > > Key: SPARK-11123 > URL: https://issues.apache.org/jira/browse/SPARK-11123 > Project: Spark > Issue Type: Improvement >Reporter: Xie Tingwen > > Now, with Spark 1.4, when I restart the HistoryServer, it took over 30 hours to > replay over 40,000 log files. What's more, once it has started, it may take > half an hour to replay a single log and block other logs from being replayed. How about > rewriting it with multiple threads to accelerate log replay? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11068) Add callback to query execution
[ https://issues.apache.org/jira/browse/SPARK-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11068: -- Assignee: Wenchen Fan > Add callback to query execution > --- > > Key: SPARK-11068 > URL: https://issues.apache.org/jira/browse/SPARK-11068 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11076) Decimal Support for Ceil/Floor
[ https://issues.apache.org/jira/browse/SPARK-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11076: -- Assignee: Cheng Hao > Decimal Support for Ceil/Floor > -- > > Key: SPARK-11076 > URL: https://issues.apache.org/jira/browse/SPARK-11076 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao > Fix For: 1.6.0 > > > Currently, Ceil & Floor doesn't support decimal, but Hive does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
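For illustration, the kind of query this improvement is meant to accept (assuming a SQLContext named {{sqlContext}}, as in spark-shell; shown as a sketch rather than a test from the patch):
{code}
val df = sqlContext.sql(
  "SELECT ceil(CAST(3.14 AS DECIMAL(10, 2))) AS c, floor(CAST(-3.14 AS DECIMAL(10, 2))) AS f")
df.show()   // expected once decimal support is in place: c = 4, f = -4
{code}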
[jira] [Updated] (SPARK-11032) Failure to resolve having correctly
[ https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11032: -- Assignee: Wenchen Fan > Failure to resolve having correctly > --- > > Key: SPARK-11032 > URL: https://issues.apache.org/jira/browse/SPARK-11032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Michael Armbrust >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.6.0 > > > This is a regression from Spark 1.4 > {code} > Seq(("michael", 30)).toDF("name", "age").registerTempTable("people") > sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 > HAVING(COUNT(1) > 0)").explain(true) > == Parsed Logical Plan == > 'Filter cast(('COUNT(1) > 0) as boolean) > 'Project [unresolvedalias('MIN('t0.age))] > 'Subquery t0 >'Project [unresolvedalias(*)] > 'Filter ('age > 0) > 'UnresolvedRelation [PEOPLE], None > == Analyzed Logical Plan == > _c0: int > Filter cast((count(1) > cast(0 as bigint)) as boolean) > Aggregate [min(age#6) AS _c0#9] > Subquery t0 >Project [name#5,age#6] > Filter (age#6 > 0) > Subquery people > Project [_1#3 AS name#5,_2#4 AS age#6] >LocalRelation [_1#3,_2#4], [[michael,30]] > == Optimized Logical Plan == > Filter (count(1) > 0) > Aggregate [min(age#6) AS _c0#9] > Project [_2#4 AS age#6] >Filter (_2#4 > 0) > LocalRelation [_1#3,_2#4], [[michael,30]] > == Physical Plan == > Filter (count(1) > 0) > TungstenAggregate(key=[], > functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9]) > TungstenExchange SinglePartition >TungstenAggregate(key=[], > functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12]) > TungstenProject [_2#4 AS age#6] > Filter (_2#4 > 0) > LocalTableScan [_1#3,_2#4], [[michael,30]] > Code Generation: true > {code} > {code} > Caused by: java.lang.UnsupportedOperationException: Cannot evaluate > expression: count(1) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188) > at > org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe
[ https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5391: - Assignee: Davies Liu > SparkSQL fails to create tables with custom JSON SerDe > -- > > Key: SPARK-5391 > URL: https://issues.apache.org/jira/browse/SPARK-5391 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: David Ross >Assignee: Davies Liu > Fix For: 1.6.0 > > > - Using Spark built from trunk on this commit: > https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd > - Build for Hive13 > - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde > First download jar locally: > {code} > $ curl > http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar > > /tmp/json-serde-1.3-jar-with-dependencies.jar > {code} > Then add it in SparkSQL session: > {code} > add jar /tmp/json-serde-1.3-jar-with-dependencies.jar > {code} > Finally create table: > {code} > create table test_json (c1 boolean) ROW FORMAT SERDE > 'org.openx.data.jsonserde.JsonSerDe'; > {code} > Logs for add jar: > {code} > 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar' > 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this > point. hive.execution.engine=mr. > 15/01/23 23:48:34 INFO SessionState: Added > /tmp/json-serde-1.3-jar-with-dependencies.jar to class path > 15/01/23 23:48:34 INFO SessionState: Added resource: > /tmp/json-serde-1.3-jar-with-dependencies.jar > 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR > /tmp/json-serde-1.3-jar-with-dependencies.jar at > http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with > timestamp 1422056914776 > 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result > Schema: List() > 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result > Schema: List() > {code} > Logs (with error) for create table: > {code} > 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'create table test_json (c1 boolean) ROW FORMAT SERDE > 'org.openx.data.jsonserde.JsonSerDe'' > 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table > test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' > 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed > 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this > point. hive.execution.engine=mr. 
> 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating > a lock manager > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table > test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' > 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941103 end=1422056941104 duration=1 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis > 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json > position=13 > 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941104 end=1422056941240 duration=136 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: > Schema(fieldSchemas:null, properties:null) > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941071 end=1422056941252 duration=181 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json > (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941067 end=1422056941258 duration=191 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 WARN security.ShellBasedUnixGroupsMapping: got exception > trying to get groups for user anonymous > org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user > at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) >
[jira] [Updated] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10829: -- Assignee: Cheng Hao > Scan DataSource with predicate expression combine partition key and > attributes doesn't work > --- > > Key: SPARK-10829 > URL: https://issues.apache.org/jira/browse/SPARK-10829 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Critical > Fix For: 1.6.0 > > > To reproduce that with the code: > {code} > withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { > withTempPath { dir => > val path = s"${dir.getCanonicalPath}/part=1" > (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path) > // If the "part = 1" filter gets pushed down, this query will throw > an exception since > // "part" is not a valid column in the actual Parquet file > checkAnswer( > sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > > 1)"), > (2 to 3).map(i => Row(i, i.toString, 1))) > } > } > {code} > We expect the result as: > {code} > 2, 1 > 3, 1 > {code} > But we got: > {code} > 1, 1 > 2, 1 > 3, 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11135: Assignee: Josh Rosen (was: Apache Spark) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959735#comment-14959735 ] Apache Spark commented on SPARK-11135: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/9140 > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11135: Assignee: Apache Spark (was: Josh Rosen) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11136) Warm-start support for ML estimator
Xusen Yin created SPARK-11136: - Summary: Warm-start support for ML estimator Key: SPARK-11136 URL: https://issues.apache.org/jira/browse/SPARK-11136 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin Priority: Minor The current implementation of Estimator does not support warm-start fitting, i.e. estimator.fit(data, params, partialModel). To get there, we first need to add warm-start support to all ML estimators; this is an umbrella JIRA to track that work. Possible solutions: 1. Add a warm-start fitting interface like def fit(dataset: DataFrame, initModel: M, paramMap: ParamMap): M 2. Treat the model as a special parameter, passing it through ParamMap, e.g. val partialModel: Param[Option[M]] = new Param(...). If the model exists, we use it to warm-start; otherwise we start the training process from the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
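A rough sketch of what the two options proposed in SPARK-11136 could look like, using heavily simplified stand-ins for DataFrame, ParamMap, Estimator and Model; these are not the real spark.ml types, and the "training" step is a placeholder.
{code}
object WarmStartSketch {
  type DataFrame = Seq[Map[String, Any]]   // placeholder, not Spark's DataFrame
  type ParamMap  = Map[String, Any]        // placeholder, not spark.ml's ParamMap
  trait Model

  // Option 1: an explicit warm-start overload on the estimator.
  trait WarmStartEstimator[M <: Model] {
    def fit(dataset: DataFrame, paramMap: ParamMap): M
    def fit(dataset: DataFrame, initModel: M, paramMap: ParamMap): M
  }

  // Option 2: the initial model travels as just another parameter.
  final case class LinearModel(weights: Array[Double]) extends Model
  class LinearEstimator extends WarmStartEstimator[LinearModel] {
    def fit(dataset: DataFrame, paramMap: ParamMap): LinearModel =
      paramMap.get("initModel") match {
        case Some(m: LinearModel) => fit(dataset, m, paramMap)            // warm start
        case _ => fit(dataset, LinearModel(Array.fill(2)(0.0)), paramMap) // cold start
      }
    def fit(dataset: DataFrame, initModel: LinearModel, paramMap: ParamMap): LinearModel =
      LinearModel(initModel.weights.map(_ + 0.1))  // stand-in for a real training iteration
  }

  def main(args: Array[String]): Unit = {
    val est = new LinearEstimator
    val partial = est.fit(Nil, Map.empty)
    println(est.fit(Nil, Map("initModel" -> partial)).weights.toSeq) // continues from the partial model
  }
}
{code}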
[jira] [Resolved] (SPARK-10412) In SQL tab, show execution memory per physical operator
[ https://issues.apache.org/jira/browse/SPARK-10412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10412. --- Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 1.6.0 > In SQL tab, show execution memory per physical operator > --- > > Key: SPARK-10412 > URL: https://issues.apache.org/jira/browse/SPARK-10412 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Wenchen Fan > Fix For: 1.6.0 > > > We already display it per task / stage. It's really useful to also display it > per operator so the user can know which one caused all the memory to be > allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10515) When killing executor, the pending replacement executors will be lost
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10515. --- Resolution: Fixed Assignee: KaiXinXIaoLei Fix Version/s: 1.6.0 1.5.2 Target Version/s: 1.5.2, 1.6.0 > When killing executor, the pending replacement executors will be lost > - > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei >Assignee: KaiXinXIaoLei > Fix For: 1.5.2, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11071. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > Fix For: 1.6.0 > > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11135: --- Summary: Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering (was: Exchange sort-planning logic may incorrect avoid sorts) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is subset of required ordering > -- > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11135: --- Summary: Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering (was: Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts
Josh Rosen created SPARK-11135: -- Summary: Exchange sort-planning logic may incorrect avoid sorts Key: SPARK-11135 URL: https://issues.apache.org/jira/browse/SPARK-11135 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11135: --- Description: In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. This bug was introduced in https://github.com/apache/spark/pull/7458, so it affects 1.5.0+. was: In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. > Exchange sort-planning logic may incorrect avoid sorts > -- > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11071: -- Summary: Flaky test: o.a.s.launcher.LauncherServerSuite (was: LauncherServerSuite::testTimeout is flaky) > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11071: -- Labels: flaky-test (was: ) > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11071: -- Component/s: (was: Spark Core) Tests > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite
[ https://issues.apache.org/jira/browse/SPARK-11134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11134: -- Labels: flaky-test (was: ) > Flaky test: o.a.s.launcher.LauncherBackendSuite > --- > > Key: SPARK-11134 > URL: https://issues.apache.org/jira/browse/SPARK-11134 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: The code passed to eventually never returned > normally. Attempted 110 times over 10.042591494 seconds. Last failure > message: The reference was null. > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) > at > org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57) > at > org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39) > at > org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) > at > org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite
Andrew Or created SPARK-11134: - Summary: Flaky test: o.a.s.launcher.LauncherBackendSuite Key: SPARK-11134 URL: https://issues.apache.org/jira/browse/SPARK-11134 Project: Spark Issue Type: Bug Components: Tests Reporter: Andrew Or Priority: Critical {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 10.042591494 seconds. Last failure message: The reference was null. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57) at org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39) at org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) at org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) {code} https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11133. Resolution: Duplicate > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11133 > URL: https://issues.apache.org/jira/browse/SPARK-11133 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: Expected exception caused by connection timeout. > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite
Andrew Or created SPARK-11133: - Summary: Flaky test: o.a.s.launcher.LauncherServerSuite Key: SPARK-11133 URL: https://issues.apache.org/jira/browse/SPARK-11133 Project: Spark Issue Type: Bug Components: Tests Reporter: Andrew Or Priority: Critical {code} sbt.ForkMain$ForkError: Expected exception caused by connection timeout. at org.junit.Assert.fail(Assert.java:88) at org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) {code} https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11133: -- Labels: flaky-test (was: ) > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11133 > URL: https://issues.apache.org/jira/browse/SPARK-11133 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: Expected exception caused by connection timeout. > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959604#comment-14959604 ] Herman van Hovell commented on SPARK-9241: -- It should grow linearly (or am I missing something). For example, if we have 3 grouping sets (as in the example), we would duplicate and project the data 3x. It is still bad, but similar to the approach in [~yhuai]'s example (saving a join). We could have a problem with the {{GROUPING__ID}} bitmask field, since only 32/64 fields can be in a grouping set. > Supporting multiple DISTINCT columns > > > Key: SPARK-9241 > URL: https://issues.apache.org/jira/browse/SPARK-9241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now the new aggregation code path only support a single distinct column > (you can use it in multiple aggregate functions in the query). We need to > support multiple distinct columns by generating a different plan for handling > multiple distinct columns (without change aggregate functions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
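To illustrate the linear growth mentioned above: rewriting several DISTINCT aggregates with an Expand-style duplication produces one tagged copy of the input per distinct group, so 3 groups mean 3x the data, not an exponential blow-up. A plain-Scala sketch with illustrative names (not Spark's actual planner code):
{code}
object MultiDistinctSketch {
  final case class Row(a: Int, b: Int)

  // Computes COUNT(DISTINCT a) and COUNT(DISTINCT b) in two aggregation passes.
  def countDistincts(rows: Seq[Row]): (Long, Long) = {
    // Expand: one copy of every input row per distinct-aggregate group (gid),
    // nulling out the columns that do not belong to that group.
    val expanded: Seq[(Int, Option[Int], Option[Int])] = rows.flatMap { r =>
      Seq((1, Some(r.a), None), (2, None, Some(r.b)))  // 2 groups -> 2x data: linear growth
    }
    // First aggregate: deduplicate the tagged (gid, a, b) keys.
    val distinctKeys = expanded.distinct
    // Second aggregate: count per gid, which plays the role of a GROUPING__ID-style tag.
    (distinctKeys.count(_._1 == 1).toLong, distinctKeys.count(_._1 == 2).toLong)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1, 10), Row(1, 20), Row(2, 20))
    println(countDistincts(rows))  // (2,2): COUNT(DISTINCT a) = 2, COUNT(DISTINCT b) = 2
  }
}
{code}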
[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599 ] Zhan Zhang edited comment on SPARK-11087 at 10/15/15 8:58 PM: -- [~patcharee] I try to duplicate your table as much as possible, but still didn't hit the problem. Note that the query has to include some valid record in the partition. Otherwise, the partition pruning will trim all predicate before hitting the orc scan. Please refer to the below for the details. case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int) val records = (1 to 100).map { i => record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt) } sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D") sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D") val test = sqlContext.read.format("orc").load("4D") test.registerTempTable("4D") sqlContext.setConf("spark.sql.orc.filterPushdown", "true") sqlContext.setConf("spark.sql.orc.filterPushdown", "true") sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 320) leaf-1 = (EQUALS y 117) expr = (and leaf-0 leaf-1) 2507 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 321) leaf-1 = (EQUALS y 118) expr = (and leaf-0 leaf-1) was (Author: zzhan): [~patcharee] I try to duplicate your table as much as possible, but still didn't hit the problem. Please refer to the below for the details. 
case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int) val records = (1 to 100).map { i => record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt) } sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D") sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D") val test = sqlContext.read.format("orc").load("4D") 2503 test.registerTempTable("4D") 2504 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2505 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 320) leaf-1 = (EQUALS y 117) expr = (and leaf-0 leaf-1) 2507 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 321) leaf-1 = (EQUALS y 118) expr = (and leaf-0 leaf-1) > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599 ] Zhan Zhang commented on SPARK-11087: [~patcharee] I try to duplicate your table as much as possible, but still didn't hit the problem. Please refer to the below for the details. case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int) val records = (1 to 100).map { i => record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt) } sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D") sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D") val test = sqlContext.read.format("orc").load("4D") 2503 test.registerTempTable("4D") 2504 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2505 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 320) leaf-1 = (EQUALS y 117) expr = (and leaf-0 leaf-1) 2507 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 321) leaf-1 = (EQUALS y 118) expr = (and leaf-0 leaf-1) > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959592#comment-14959592 ] Xusen Yin commented on SPARK-5874: -- Sure I'll do it. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959589#comment-14959589 ] Joseph K. Bradley commented on SPARK-5874: -- Sure, that sounds good. Can you also please search for existing tickets and link them to the umbrella? > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571 ] Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:39 PM: -- Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} Also, i am not running in speculative mode. .set("spark.speculation", "false") was (Author: tispratik): Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issue
[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571 ] Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:40 PM: -- Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} Also, i am not running in speculative mode. {code} .set("spark.speculation", "false") {code} was (Author: tispratik): Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} Also, i am not running in speculative mode. .set("spark.speculation", "false") > FileNotFoundException on _temporary directory > ---
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571 ] Pratik Khadloya commented on SPARK-2984: Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. 
> at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.
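The numbered speculation scenario quoted above is a race between two attempts of the same task committing into {{_temporary}}. One standard guard is to let only a single attempt of each task win the right to commit; a plain-Scala sketch of that idea follows (an illustration only, not Spark's actual OutputCommitCoordinator).
{code}
import scala.collection.concurrent.TrieMap

object CommitGuardSketch {
  // taskId -> attemptId that won the right to commit
  private val winners = TrieMap.empty[Int, Int]

  // Returns true only for the first attempt of a task that asks to commit.
  def canCommit(taskId: Int, attemptId: Int): Boolean =
    winners.putIfAbsent(taskId, attemptId) match {
      case None         => true                 // first asker wins and may move files out of _temporary
      case Some(winner) => winner == attemptId  // later attempts must abort, not move or delete files
    }

  def main(args: Array[String]): Unit = {
    println(canCommit(taskId = 7, attemptId = 0)) // true: the original attempt commits
    println(canCommit(taskId = 7, attemptId = 1)) // false: the speculative copy must not touch _temporary
  }
}
{code}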
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959564#comment-14959564 ] Xusen Yin commented on SPARK-5874: -- I'd love to add support to individual models first. But since there are many estimators in the ML package now, I think we'd better add an umbrella JIRA to control the process. Can I create new JIRA subtasks under this JIRA? > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959534#comment-14959534 ] Apache Spark commented on SPARK-6488: - User 'dusenberrymw' has created a pull request for this issue: https://github.com/apache/spark/pull/9139 > Support addition/multiplication in PySpark's BlockMatrix > > > Key: SPARK-6488 > URL: https://issues.apache.org/jira/browse/SPARK-6488 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Mike Dusenberry > > This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We > should reuse the Scala implementation instead of having a separate > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6488: --- Assignee: Mike Dusenberry (was: Apache Spark) > Support addition/multiplication in PySpark's BlockMatrix > > > Key: SPARK-6488 > URL: https://issues.apache.org/jira/browse/SPARK-6488 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Mike Dusenberry > > This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We > should reuse the Scala implementation instead of having a separate > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6488: --- Assignee: Apache Spark (was: Mike Dusenberry) > Support addition/multiplication in PySpark's BlockMatrix > > > Key: SPARK-6488 > URL: https://issues.apache.org/jira/browse/SPARK-6488 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Apache Spark > > This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We > should reuse the Scala implementation instead of having a separate > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5657) Add PySpark Avro Output Format example
[ https://issues.apache.org/jira/browse/SPARK-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5657. --- Resolution: Won't Fix > Add PySpark Avro Output Format example > -- > > Key: SPARK-5657 > URL: https://issues.apache.org/jira/browse/SPARK-5657 > Project: Spark > Issue Type: Improvement > Components: Examples, PySpark >Affects Versions: 1.2.0 >Reporter: Stanislav Los > > There is an Avro Input Format example that shows how to read Avro data in > PySpark, but nothing shows how to write from PySpark to Avro. The main > challenge, a Converter needs an Avro schema to build a record, but current > Spark API doesn't provide a way to supply extra parameters to custom > converters. Provided workaround is possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11039) Document all UI "retained*" configurations
[ https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11039: --- Assignee: Nick Pritchard > Document all UI "retained*" configurations > -- > > Key: SPARK-11039 > URL: https://issues.apache.org/jira/browse/SPARK-11039 > Project: Spark > Issue Type: Documentation > Components: Documentation, Web UI >Affects Versions: 1.5.1 >Reporter: Nick Pritchard >Assignee: Nick Pritchard >Priority: Trivial > Fix For: 1.5.2, 1.6.0 > > > Most are documented except these: > - spark.sql.ui.retainedExecutions > - spark.streaming.ui.retainedBatches > They are really helpful for managing the memory usage of the driver > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11039) Document all UI "retained*" configurations
[ https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11039. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9052 [https://github.com/apache/spark/pull/9052] > Document all UI "retained*" configurations > -- > > Key: SPARK-11039 > URL: https://issues.apache.org/jira/browse/SPARK-11039 > Project: Spark > Issue Type: Documentation > Components: Documentation, Web UI >Affects Versions: 1.5.1 >Reporter: Nick Pritchard >Priority: Trivial > Fix For: 1.6.0, 1.5.2 > > > Most are documented except these: > - spark.sql.ui.retainedExecutions > - spark.streaming.ui.retainedBatches > They are really helpful for managing the memory usage of the driver > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
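As a quick illustration of the settings this ticket documents, the snippet below lowers the UI retention limits when constructing the SparkConf. It is a sketch only: the numeric values are arbitrary, and the two spark.ui.* keys are the already-documented counterparts, included for context.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object UiRetentionSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; tune them to the driver's memory budget.
    val conf = new SparkConf()
      .setAppName("ui-retention-sketch")
      .set("spark.ui.retainedJobs", "200")              // already documented
      .set("spark.ui.retainedStages", "200")            // already documented
      .set("spark.sql.ui.retainedExecutions", "100")    // SQL tab, covered by this ticket
      .set("spark.streaming.ui.retainedBatches", "100") // Streaming tab, covered by this ticket
    val sc = new SparkContext(conf)
    // ... run the application ...
    sc.stop()
  }
}
{code}
The same keys can also be set in spark-defaults.conf or passed with --conf on spark-submit.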
[jira] [Commented] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType
[ https://issues.apache.org/jira/browse/SPARK-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959451#comment-14959451 ] Michael Armbrust commented on SPARK-8658: - There is no query that exposes the problem, as it's an internal quirk. The {{equals}} method should check all of the specified fields for equality. Today it is missing some. > AttributeReference equals method only compare name, exprId and dataType > --- > > Key: SPARK-8658 > URL: https://issues.apache.org/jira/browse/SPARK-8658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Antonio Jesus Navarro > > The AttributeReference "equals" method only treats objects as different if they differ in > name, expression id, or dataType. With this behavior, when I do a "transformExpressionsDown" and > try to transform qualifiers inside "AttributeReferences", the objects are not replaced because the > transformer considers them equal. > I propose adding these fields to the "equals" method: > name, dataType, nullable, metadata, exprId, qualifiers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
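To make the proposal concrete, here is a minimal sketch of an equals/hashCode pair that compares every field listed in the ticket. The class below is a simplified stand-in, not Catalyst's actual AttributeReference (which has different field types and machinery); it only illustrates the shape of the comparison.
{code}
// Simplified stand-in for Catalyst's AttributeReference; field names follow the ticket.
class AttributeRef(
    val name: String,
    val dataType: String, // Catalyst uses DataType; a String keeps the sketch self-contained
    val nullable: Boolean,
    val metadata: Map[String, String],
    val exprId: Long,
    val qualifiers: Seq[String]) {

  override def equals(other: Any): Boolean = other match {
    case that: AttributeRef =>
      name == that.name &&
      dataType == that.dataType &&
      nullable == that.nullable &&
      metadata == that.metadata &&
      exprId == that.exprId &&
      qualifiers == that.qualifiers
    case _ => false
  }

  // Keep hashCode consistent with equals by folding over the same fields.
  override def hashCode(): Int =
    Seq(name, dataType, nullable, metadata, exprId, qualifiers)
      .map(_.hashCode()).foldLeft(17)((acc, h) => 31 * acc + h)
}
{code}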
[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959440#comment-14959440 ] Reynold Xin commented on SPARK-9241: Do we have any idea about the performance characteristics of this rewrite? IIUC, a grouping set's complexity grows exponentially with the number of items in the set? > Supporting multiple DISTINCT columns > > > Key: SPARK-9241 > URL: https://issues.apache.org/jira/browse/SPARK-9241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now the new aggregation code path only supports a single distinct column > (you can use it in multiple aggregate functions in the query). We need to > support multiple distinct columns by generating a different plan for handling > multiple distinct columns (without changing the aggregate functions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
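For reference, this is the query shape the ticket targets: two aggregate functions, each DISTINCT over a different column. The table and column names are hypothetical, and whether such a query runs on a given Spark version depends on which aggregation code path plans it.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MultiDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-distinct-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical data; only the query shape matters here.
    sqlContext.createDataFrame(Seq(
      ("eng", 1, 10), ("eng", 2, 10), ("sales", 3, 20)
    )).toDF("dept", "employee_id", "project_id").registerTempTable("assignments")

    // Two aggregate functions, each DISTINCT over a different column.
    sqlContext.sql(
      """SELECT dept,
        |       COUNT(DISTINCT employee_id) AS employees,
        |       COUNT(DISTINCT project_id)  AS projects
        |  FROM assignments
        | GROUP BY dept""".stripMargin).show()

    sc.stop()
  }
}
{code}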
[jira] [Comment Edited] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959298#comment-14959298 ] Karl D. Gierach edited comment on SPARK-5739 at 10/15/15 7:06 PM: -- Is there any way to increase this block limit? I'm hitting the same issue during a UnionRDD operation. Also, this issue's state above is "resolved", but I'm not sure what the resolution is. Maybe a state of "closed" with a reference to the duplicate ticket would make it more clear. was (Author: kgierach): Is there any way to increase this block limit? I'm hitting the same issue during a UnionRDD operation. Also, this issue's state above is "resolved", but I'm not sure what the resolution is. > Size exceeds Integer.MAX_VALUE in File Map > -- > > Key: SPARK-5739 > URL: https://issues.apache.org/jira/browse/SPARK-5739 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.1 > Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has > 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel tasks on a > node. >Reporter: DjvuLee >Priority: Minor > > I ran the k-means algorithm using randomly generated data, but this problem occurred > after some iterations. I tried several times, and the problem is reproducible. > Because the data is randomly generated, I wonder whether this is a bug. Or, if random > data can lead to a scenario where the size is bigger than > Integer.MAX_VALUE, can we check the size before using the file map? > 015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN > org.apache.spark.util.SizeEstimator - Failed to check whether > UseCompressedOops is set; assuming yes > [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds > Integer.MAX_VALUE > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86) > at > org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140) > at > org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747) > at > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598) > at > org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79) > at > org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270) > at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) > at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338) > at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348) > at KMeansDataGenerator$.main(kmeans.scala:105) > at KMeansDataGenerator.main(kmeans.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) > at java.lang.reflect.Method.invoke(Method.java:619) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
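The 2 GB ceiling comes from the Integer.MAX_VALUE limit on a single memory-mapped block, so it cannot simply be raised. A commonly suggested mitigation, which is not part of this ticket's resolution and only helps when the oversized block is a cached or shuffled partition, is to increase parallelism so individual blocks stay well under 2 GB. A minimal sketch, with a hypothetical input path and an arbitrary partition count:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object KMeansRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-repartition-sketch"))

    // Hypothetical input: one whitespace-separated vector per line.
    val points = sc.textFile("/data/kmeans/points.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .repartition(400) // arbitrary count; the goal is many partitions, each far below 2 GB
      .cache()

    val model = KMeans.train(points, k = 10, maxIterations = 20)
    println(s"Cost: ${model.computeCost(points)}")
    sc.stop()
  }
}
{code}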
[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy
[ https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11131: Assignee: (was: Apache Spark) > Worker registration protocol is racy > > > Key: SPARK-11131 > URL: https://issues.apache.org/jira/browse/SPARK-11131 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I ran into this while making changes to the new RPC framework. Because the > Worker registration protocol is based on sending unrelated messages between > Master and Worker, it's possible for another message (e.g. one caused by an > app trying to allocate workers) to arrive at the Worker before it knows the > Master has registered it. This triggers the following code: > {code} > case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => > if (masterUrl != activeMasterUrl) { > logWarning("Invalid Master (" + masterUrl + ") attempted to launch > executor.") > {code} > This may or may not be made worse by SPARK-11098. > A simple workaround is to use an {{ask}} instead of a {{send}} for these > messages. That should at least narrow the race. > Note this is more of a problem in {{local-cluster}} mode, used a lot by unit > tests, where Master and Worker instances come up as part of the app > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy
[ https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11131: Assignee: Apache Spark > Worker registration protocol is racy > > > Key: SPARK-11131 > URL: https://issues.apache.org/jira/browse/SPARK-11131 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > I ran into this while making changes to the new RPC framework. Because the > Worker registration protocol is based on sending unrelated messages between > Master and Worker, it's possible for another message (e.g. one caused by an > app trying to allocate workers) to arrive at the Worker before it knows the > Master has registered it. This triggers the following code: > {code} > case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => > if (masterUrl != activeMasterUrl) { > logWarning("Invalid Master (" + masterUrl + ") attempted to launch > executor.") > {code} > This may or may not be made worse by SPARK-11098. > A simple workaround is to use an {{ask}} instead of a {{send}} for these > messages. That should at least narrow the race. > Note this is more of a problem in {{local-cluster}} mode, used a lot by unit > tests, where Master and Worker instances come up as part of the app > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11131) Worker registration protocol is racy
[ https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959378#comment-14959378 ] Apache Spark commented on SPARK-11131: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/9138 > Worker registration protocol is racy > > > Key: SPARK-11131 > URL: https://issues.apache.org/jira/browse/SPARK-11131 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I ran into this while making changes to the new RPC framework. Because the > Worker registration protocol is based on sending unrelated messages between > Master and Worker, it's possible for another message (e.g. one caused by an > app trying to allocate workers) to arrive at the Worker before it knows the > Master has registered it. This triggers the following code: > {code} > case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => > if (masterUrl != activeMasterUrl) { > logWarning("Invalid Master (" + masterUrl + ") attempted to launch > executor.") > {code} > This may or may not be made worse by SPARK-11098. > A simple workaround is to use an {{ask}} instead of a {{send}} for these > messages. That should at least narrow the race. > Note this is more of a problem in {{local-cluster}} mode, used a lot by unit > tests, where Master and Worker instances come up as part of the app > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
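To illustrate the suggested workaround, the sketch below contrasts fire-and-forget send with request-reply ask. The trait and message classes are hypothetical stand-ins, not Spark's actual RpcEndpointRef API or Master/Worker messages; the point is only that an ask lets the Worker confirm registration before it starts acting on other Master traffic.
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object RegistrationSketch {
  // Hypothetical stand-in for an RPC reference; Spark's real RpcEndpointRef differs.
  trait RpcRef {
    def send(msg: Any): Unit        // fire-and-forget: other traffic may overtake the reply
    def ask[T](msg: Any): Future[T] // request-reply: the caller can wait for an acknowledgement
  }

  case class RegisterWorker(id: String)          // hypothetical message shapes
  case class RegisteredWorker(masterUrl: String)

  def register(master: RpcRef, workerId: String): String = {
    // With send(), a LaunchExecutor-style message could reach the Worker before it
    // knows registration succeeded. With ask(), registration is confirmed first,
    // which narrows (but does not fully remove) the race described in the ticket.
    val reply = master.ask[RegisteredWorker](RegisterWorker(workerId))
    Await.result(reply, 30.seconds).masterUrl
  }
}
{code}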