[jira] [Comment Edited] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
[ https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147032#comment-15147032 ] Xiao Li edited comment on SPARK-13320 at 2/15/16 7:57 AM:

{code}
checkAnswer(sql(
  """
    | SELECT min(struct(record.*)) FROM
    |   (select a as a, struct(a,b) as record from testData2) tmp
    | GROUP BY a
  """.stripMargin),
  Row(Row(1, 1)) :: Row(Row(2, 1)) :: Row(Row(3, 1)) :: Nil)
{code}

Above is a query I found in the {{SQLQuerySuite}}. Before submitting a PR, I am wondering whether the following query is valid:

{code}
structDf.groupBy($"a").agg(min(struct($"record.*")))
{code}

So far, it does not work. It outputs the error message: {{cannot resolve 'a' given input columns: [a, b];}}

was (Author: smilegator): the same comment, except that "the {{SQLQuerySuite}}" previously read "test case".

> Confusing error message for Dataset API when using sum("*")
> -----------------------------------------------------------
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
> columns _1, _2;
> {code}
[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
[ https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147032#comment-15147032 ] Xiao Li commented on SPARK-13320:

{code}
checkAnswer(sql(
  """
    | SELECT min(struct(record.*)) FROM
    |   (select a as a, struct(a,b) as record from testData2) tmp
    | GROUP BY a
  """.stripMargin),
  Row(Row(1, 1)) :: Row(Row(2, 1)) :: Row(Row(3, 1)) :: Nil)
{code}

Above is a query I found in a test case. Before submitting a PR, I am wondering whether the following query is valid:

{code}
structDf.groupBy($"a").agg(min(struct($"record.*")))
{code}

So far, it does not work. It outputs the error message: {{cannot resolve 'a' given input columns: [a, b];}}

> Confusing error message for Dataset API when using sum("*")
> -----------------------------------------------------------
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
> columns _1, _2;
> {code}
[jira] [Commented] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"
[ https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147022#comment-15147022 ] Saisai Shao commented on SPARK-13220:

[~andrewor14] mind me taking a crack at this?

> Deprecate "yarn-client" and "yarn-cluster"
> ------------------------------------------
>
> Key: SPARK-13220
> URL: https://issues.apache.org/jira/browse/SPARK-13220
> Project: Spark
> Issue Type: Sub-task
> Components: YARN
> Reporter: Andrew Or
>
> We currently allow `--master yarn-client`. Instead, the user should do
> `--master yarn --deploy-mode client` to be more explicit. This is more
> consistent with other cluster managers and obviates the need to do special
> parsing of the master string.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
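For reference, the explicit form the ticket proposes splits the fused master string into two flags. A sketch (the application class and jar names here are placeholders):

```shell
# Deprecated shorthand: cluster manager and deploy mode fused into one string.
spark-submit --master yarn-client --class com.example.App app.jar

# Preferred explicit form: same behavior, consistent with other cluster managers.
spark-submit --master yarn --deploy-mode client --class com.example.App app.jar
```

The same applies to `yarn-cluster`, which becomes `--master yarn --deploy-mode cluster`.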
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147007#comment-15147007 ] Petar Zecevic commented on SPARK-13313:

No, I don't think it has anything to do with that. The largest SCC's vertices are not connected in any way, and they shouldn't be in the same group.

> Strongly connected components doesn't find all strongly connected components
> ----------------------------------------------------------------------------
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.6.0
> Reporter: Petar Zecevic
>
> The strongly connected components algorithm doesn't find all strongly
> connected components. I was using the Wikispeedia dataset
> (http://snap.stanford.edu/data/wikispeedia.html), and the algorithm found 519
> SCCs, one of which had 4051 vertices that in reality don't have any edges
> between them.
> I think the problem could be on line 89 of the
> StronglyConnectedComponents.scala file, where EdgeDirection.In should be
> changed to EdgeDirection.Out. I believe the second Pregel call should use the
> Out edge direction, the same as the first call, because the direction is
> reversed in the provided sendMsg function (the message is sent to the source
> vertex, not the destination vertex).
> If line 89 is changed, the algorithm starts finding many more SCCs, but
> eventually a stack overflow exception occurs. I believe graph objects that are
> changed across iterations should not be cached but checkpointed.
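As a point of comparison, the expected output on a small graph can be checked against a plain-Scala Kosaraju sketch, independent of GraphX (`sccs` is a name invented here for illustration, not the GraphX API):

```scala
// Kosaraju's algorithm: DFS finish order on the forward graph, then DFS on the
// reversed graph. Every vertex pair inside a component is mutually reachable,
// so a 4051-vertex "component" with no internal edges could not come out of it.
def sccs(vertices: Set[Int], edges: Set[(Int, Int)]): Set[Set[Int]] = {
  val fwd = edges.groupMap(_._1)(_._2).withDefaultValue(Set.empty[Int])
  val rev = edges.groupMap(_._2)(_._1).withDefaultValue(Set.empty[Int])

  // Pass 1: record vertices in finish order (head of list = latest finish).
  var visited = Set.empty[Int]
  var order = List.empty[Int]
  def dfs1(v: Int): Unit = if (!visited(v)) {
    visited += v
    fwd(v).foreach(dfs1)
    order ::= v
  }
  vertices.foreach(dfs1)

  // Pass 2: DFS on the reversed graph, peeling off one component per root.
  var assigned = Set.empty[Int]
  var comps = Set.empty[Set[Int]]
  for (v <- order if !assigned(v)) {
    var comp = Set.empty[Int]
    def dfs2(u: Int): Unit = if (!assigned(u)) {
      assigned += u
      comp += u
      rev(u).foreach(dfs2)
    }
    dfs2(v)
    comps += comp
  }
  comps
}
```

On the cycle 1→2→3→1 feeding into the cycle 4↔5, plus an isolated vertex 6, this yields the components {1,2,3}, {4,5}, and {6}.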
[jira] [Commented] (SPARK-11334) numRunningTasks can't be less than 0, or it will affect executor allocation
[ https://issues.apache.org/jira/browse/SPARK-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146983#comment-15146983 ] Apache Spark commented on SPARK-11334:

User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/11205

> numRunningTasks can't be less than 0, or it will affect executor allocation
> ---------------------------------------------------------------------------
>
> Key: SPARK-11334
> URL: https://issues.apache.org/jira/browse/SPARK-11334
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.0
> Reporter: meiyoula
> Assignee: meiyoula
>
> With the *Dynamic Allocation* feature, once a task fails more than
> *maxFailure* times, all dependent jobs, stages, and tasks are killed or
> aborted. In this process, the *SparkListenerTaskEnd* event can arrive after
> *SparkListenerStageCompleted* and *SparkListenerJobEnd*, as in the event log
> below:
> {code}
> {"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":20,"Stage Attempt ID":0,"Stage Name":"run at AccessController.java:-2","Number of Tasks":200}
> {"Event":"SparkListenerJobEnd","Job ID":9,"Completion Time":1444914699829}
> {"Event":"SparkListenerTaskEnd","Stage ID":20,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"TaskKilled"},"Task Info":{"Task ID":1955,"Index":88,"Attempt":2,"Launch Time":1444914699763,"Executor ID":"5","Host":"linux-223","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1444914699864,"Failed":true,"Accumulables":[]}}
> {code}
> Because of that, *numRunningTasks* in the *ExecutorAllocationManager* class
> can drop below 0, which affects executor allocation.
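The fix amounts to making the bookkeeping tolerant of that event reordering. A minimal plain-Scala sketch (`TaskTracker` is a made-up simplification for illustration, not ExecutorAllocationManager's real API):

```scala
// Guard against out-of-order lifecycle events: a TaskEnd that arrives after its
// stage has already been cleaned up must not push the counter below zero,
// because a negative numRunningTasks skews the executor-allocation math.
class TaskTracker {
  private var numRunningTasks = 0
  def onTaskStart(): Unit = numRunningTasks += 1
  def onStageCompleted(): Unit = numRunningTasks = 0 // stage cleanup resets state
  def onTaskEnd(): Unit =
    numRunningTasks = math.max(0, numRunningTasks - 1) // clamp at zero
  def running: Int = numRunningTasks
}
```

Replaying the event log above (stage completed, then a late TaskKilled TaskEnd) leaves the counter at 0 instead of -1.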
[jira] [Assigned] (SPARK-13321) Support nested UNION in parser
[ https://issues.apache.org/jira/browse/SPARK-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13321:

Assignee: (was: Apache Spark)

> Support nested UNION in parser
[jira] [Assigned] (SPARK-13321) Support nested UNION in parser
[ https://issues.apache.org/jira/browse/SPARK-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13321:

Assignee: Apache Spark

> Support nested UNION in parser
[jira] [Commented] (SPARK-13321) Support nested UNION in parser
[ https://issues.apache.org/jira/browse/SPARK-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146946#comment-15146946 ] Apache Spark commented on SPARK-13321:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11204

> Support nested UNION in parser
[jira] [Created] (SPARK-13321) Support nested UNION in parser
Liang-Chi Hsieh created SPARK-13321:

Summary: Support nested UNION in parser
Key: SPARK-13321
URL: https://issues.apache.org/jira/browse/SPARK-13321
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Liang-Chi Hsieh

The following SQL cannot be parsed with the current parser:

{code}
SELECT `u_1`.`id` FROM (((SELECT `t0`.`id` FROM `default`.`t0`) UNION ALL
(SELECT `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT `t0`.`id` FROM
`default`.`t0`)) AS u_1
{code}

We should fix it.
[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
[ https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146838#comment-15146838 ] Xiao Li commented on SPARK-13320:

Sure, will do it. Thanks!

> Confusing error message for Dataset API when using sum("*")
> -----------------------------------------------------------
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
> columns _1, _2;
> {code}
> The error is with sum("*"), not the resolution of group by "_1".
[jira] [Resolved] (SPARK-12503) Pushdown a Limit on top of a Union
[ https://issues.apache.org/jira/browse/SPARK-12503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12503.

Resolution: Fixed
Fix Version/s: 2.0.0

> Pushdown a Limit on top of a Union
> ----------------------------------
>
> Key: SPARK-12503
> URL: https://issues.apache.org/jira/browse/SPARK-12503
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer, SQL
> Affects Versions: 1.5.0, 1.6.0
> Reporter: Xiao Li
> Assignee: Josh Rosen
> Fix For: 2.0.0
>
> "Rule that applies to a Limit on top of a Union. The original Limit won't go
> away after applying this rule, but additional Limit nodes will be created on
> top of each child of Union, so that these children produce less rows and
> Limit can be further optimized for children Relations."
> -- from https://issues.apache.org/jira/browse/CALCITE-832
> Also, the same topic in Hive: https://issues.apache.org/jira/browse/HIVE-11775
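The shape of the rewrite can be sketched on a toy plan algebra (plain Scala; these case classes are invented for illustration and are not Catalyst's):

```scala
// Limit(n, Union(children)) keeps the outer Limit but also places Limit(n, _)
// over each child of the Union, so every child produces at most n rows and the
// per-child Limits can be optimized further against their own relations.
sealed trait Plan
case class Scan(name: String) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
case class Union(children: Seq[Plan]) extends Plan

def pushLimitThroughUnion(plan: Plan): Plan = plan match {
  case Limit(n, Union(children)) =>
    Limit(n, Union(children.map(Limit(n, _))))
  case other => other
}
```

For example, `Limit(5, Union(Seq(Scan("a"), Scan("b"))))` rewrites to a plan where both scans are capped at 5 rows before the union.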
[jira] [Commented] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146824#comment-15146824 ] Jeff Zhang commented on SPARK-11102: [~sowen] Which ticket has resolved this issue ? SPARK-10709 didn't resolve it I think. > Uninformative exception when specifing non-exist input for JSON data source > --- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at 
$iwC$$iwC.<init>(<console>:34) > at $iwC.<init>(<console>:36) > {code}
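One way to make the failure readable is to validate the input path eagerly, before handing it to schema inference, instead of letting Hadoop's generic {{No input paths specified in job}} surface from deep inside the stack. A plain-Scala sketch (the helper name is invented for illustration):

```scala
import java.nio.file.{Files, Paths}

// Fail fast with the offending path in the message, rather than a generic
// IOException raised from FileInputFormat during JSON schema inference.
def requireExistingPath(path: String): String = {
  if (!Files.exists(Paths.get(path)))
    throw new IllegalArgumentException(s"Input path does not exist: $path")
  path
}
```

A caller would wrap the path argument, e.g. `read(requireExistingPath(userPath))`, so the user sees which path was wrong.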
[jira] [Reopened] (SPARK-12503) Pushdown a Limit on top of a Union
[ https://issues.apache.org/jira/browse/SPARK-12503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-12503:

Assignee: Josh Rosen

> Pushdown a Limit on top of a Union
[jira] [Resolved] (SPARK-13314) Malformed WholeStageCodegen tree string
[ https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13314.

Resolution: Not A Problem

I didn't realize that the extra vertical lines are used to indicate scopes of codegen'd stages.

> Malformed WholeStageCodegen tree string
> ---------------------------------------
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  :     +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>    :  +- Project [id#4L AS a#7L]
>    :     +- Range 0, 1, 8, 2, [id#4L]
> {noformat}
[jira] [Updated] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12720:

Assignee: Xiao Li

> SQL generation support for cube, rollup, and grouping set
> ---------------------------------------------------------
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
> Assignee: Xiao Li
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage.
> Please refer to SPARK-11012 for more details.
[jira] [Created] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
Reynold Xin created SPARK-13320: --- Summary: Confusing error message for Dataset API when using sum("*") Key: SPARK-13320 URL: https://issues.apache.org/jira/browse/SPARK-13320 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin {code} pagecounts4PartitionsDS .map(line => (line._1, line._3)) .toDF() .groupBy($"_1") .agg(sum("*") as "sumOccurances") {code} {code} org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57) at 
org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213) {code} The error is in sum("*"), not in the resolution of the group-by column "_1"; the message misleadingly blames "_1". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
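For anyone reproducing this, the confusing behavior and the usual workarounds can be sketched against the 1.6-era DataFrame API. This is a hypothetical reproduction, not code from the ticket: it assumes an existing `SQLContext` named `sqlContext`, and the column names are illustrative.

```scala
// Assumes an existing SQLContext `sqlContext` (e.g. in spark-shell).
import sqlContext.implicits._
import org.apache.spark.sql.functions.{sum, count}

val df = Seq(("a", 1L), ("a", 2L), ("b", 3L)).toDF("_1", "_2")

// Fails -- the AnalysisException blames the group-by column "_1",
// even though sum("*") is the expression that cannot be resolved:
// df.groupBy($"_1").agg(sum("*") as "sumOccurances")

// Workarounds: aggregate a concrete column, or use count("*"),
// the one aggregate where "*" has a defined meaning:
df.groupBy($"_1").agg(sum($"_2") as "sumOccurances").show()
df.groupBy($"_1").agg(count("*") as "rows").show()
```

A clearer error here would name `sum("*")` as the unresolvable expression rather than re-reporting the group-by column.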
[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
[ https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146815#comment-15146815 ] Reynold Xin commented on SPARK-13320: - cc [~smilegator] not sure if you have time. If you do, mind looking into this? cc [~marmbrus] and [~cloud_fan] > Confusing error message for Dataset API when using sum("*")
[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
[ https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146814#comment-15146814 ] Jeff Zhang commented on SPARK-12846: Added the background context mail thread to the description; [~felixcheung] will work on it. > Follow up SPARK-12707, Update documentation and other related code > -- > > Key: SPARK-12846 > URL: https://issues.apache.org/jira/browse/SPARK-12846 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Jeff Zhang > > Add the background context mail thread > http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html
[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
[ https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-12846: --- Description: Add the background context mail thread http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html (was: Add the background context mail therad http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html) > Follow up SPARK-12707, Update documentation and other related code
[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
[ https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-12846: --- Description: Add the background context mail therad http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html > Follow up SPARK-12707, Update documentation and other related code
[jira] [Commented] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146800#comment-15146800 ] Apache Spark commented on SPARK-13318: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/11203 > Model export/import for spark.ml: ElementwiseProduct > > > Key: SPARK-13318 > URL: https://issues.apache.org/jira/browse/SPARK-13318 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > Add save/load to ElementwiseProduct
[jira] [Assigned] (SPARK-13036) PySpark ml.feature support export/import
[ https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13036: Assignee: Apache Spark > PySpark ml.feature support export/import > > > Key: SPARK-13036 > URL: https://issues.apache.org/jira/browse/SPARK-13036 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/feature.py. Please refer the implementation > at SPARK-13032.
[jira] [Assigned] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13318: Assignee: (was: Apache Spark) > Model export/import for spark.ml: ElementwiseProduct
[jira] [Assigned] (SPARK-13319) Pyspark VectorSlicer should have setDefault
[ https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13319: Assignee: Apache Spark > Pyspark VectorSlicer should have setDefault > --- > > Key: SPARK-13319 > URL: https://issues.apache.org/jira/browse/SPARK-13319 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Assignee: Apache Spark >Priority: Minor > > Pyspark VectorSlicer should have setDefault, otherwise it will cause error > when calling getNames or getIndices.
[jira] [Commented] (SPARK-13319) Pyspark VectorSlicer should have setDefault
[ https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146801#comment-15146801 ] Apache Spark commented on SPARK-13319: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/11203 > Pyspark VectorSlicer should have setDefault
[jira] [Assigned] (SPARK-13319) Pyspark VectorSlicer should have setDefault
[ https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13319: Assignee: (was: Apache Spark) > Pyspark VectorSlicer should have setDefault
[jira] [Commented] (SPARK-13036) PySpark ml.feature support export/import
[ https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146799#comment-15146799 ] Apache Spark commented on SPARK-13036: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/11203 > PySpark ml.feature support export/import
[jira] [Assigned] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13318: Assignee: Apache Spark > Model export/import for spark.ml: ElementwiseProduct
[jira] [Assigned] (SPARK-13036) PySpark ml.feature support export/import
[ https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13036: Assignee: (was: Apache Spark) > PySpark ml.feature support export/import
[jira] [Created] (SPARK-13319) Pyspark VectorSlicer should have setDefault
Xusen Yin created SPARK-13319: - Summary: Pyspark VectorSlicer should have setDefault Key: SPARK-13319 URL: https://issues.apache.org/jira/browse/SPARK-13319 Project: Spark Issue Type: Bug Components: PySpark Reporter: Xusen Yin Priority: Minor Pyspark VectorSlicer should have setDefault, otherwise it will cause error when calling getNames or getIndices.
[jira] [Created] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
Xusen Yin created SPARK-13318: - Summary: Model export/import for spark.ml: ElementwiseProduct Key: SPARK-13318 URL: https://issues.apache.org/jira/browse/SPARK-13318 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin Priority: Minor Add save/load to ElementwiseProduct
[jira] [Resolved] (SPARK-13185) Improve the performance of DateTimeUtils.StringToDate by reusing Calendar objects
[ https://issues.apache.org/jira/browse/SPARK-13185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13185. - Resolution: Fixed Assignee: Carson Wang Fix Version/s: 2.0.0 > Improve the performance of DateTimeUtils.StringToDate by reusing Calendar > objects > - > > Key: SPARK-13185 > URL: https://issues.apache.org/jira/browse/SPARK-13185 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Minor > Fix For: 2.0.0 > > > It is expensive to create java Calendar objects in each method of > DateTimeUtils. We can reuse the objects to improve the performance. In one of > my Sql queries which calls StringToDate many times, the duration of the stage > improved from 1.6 minutes to 1.2 minutes.
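The optimization described in SPARK-13185 is essentially object reuse: instead of calling `Calendar.getInstance` on every parse, keep one `Calendar` per thread and reset it between uses. A minimal, self-contained sketch of the idea follows; this is not the actual Spark patch, and the object and method names are illustrative.

```scala
import java.util.{Calendar, TimeZone}

object DateParseHelper {
  // One Calendar per thread, created lazily and then reused on every call,
  // avoiding the per-call allocation cost that SPARK-13185 measured.
  private val threadLocalCal: ThreadLocal[Calendar] = new ThreadLocal[Calendar] {
    override def initialValue(): Calendar =
      Calendar.getInstance(TimeZone.getTimeZone("UTC"))
  }

  // Convert a (year, month, day) triple to days since the Unix epoch,
  // reusing the thread-local Calendar instead of allocating a new one.
  def toEpochDays(year: Int, month: Int, day: Int): Int = {
    val cal = threadLocalCal.get()
    cal.clear()                   // reset state left over from the previous call
    cal.set(year, month - 1, day) // Calendar months are 0-based
    (cal.getTimeInMillis / (24L * 3600 * 1000)).toInt
  }
}

println(DateParseHelper.toEpochDays(1970, 1, 2)) // prints 1
```

The `cal.clear()` call is the important detail: a reused `Calendar` retains fields from the previous parse, so it must be reset before each use, and `ThreadLocal` keeps the reuse safe under concurrent tasks.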
[jira] [Comment Edited] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146765#comment-15146765 ] Xiao Li edited comment on SPARK-13307 at 2/14/16 10:49 PM: --- In the following PR: https://github.com/apache/spark/pull/9645, shuffle hash join was removed from Spark SQL. Try to see if broadcast join works in this test case. You also can use BroadcastHint to force the broadcast join. Let me CC [~rxin] [~yhuai] [~marmbrus] was (Author: smilegator): In the following PR: https://github.com/apache/spark/pull/9645, shuffle hash join is removed from Spark SQL. Try to see if broadcast join works in this test case. You also can use hint to force the broadcast join. Let me CC [~rxin] [~yhuai] [~marmbrus] > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1 > - > > Key: SPARK-13307 > URL: https://issues.apache.org/jira/browse/SPARK-13307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > > Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average > about 9% faster. There are a few degraded, and one that is definitely not > within error margin is query 66. > Query 66 in 1.4.1: 699 seconds > Query 66 in 1.6.0: 918 seconds > 30% worse. > Collected the physical plans from both versions - drastic difference maybe > partially from using Tungsten in 1.6, but anything else at play here? > Please see plans here: > https://ibm.box.com/spark-sql-q66-debug-160plan > https://ibm.box.com/spark-sql-q66-debug-141plan
[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146765#comment-15146765 ] Xiao Li commented on SPARK-13307: - In the following PR: https://github.com/apache/spark/pull/9645, shuffle hash join is removed from Spark SQL. Try to see if broadcast join works in this test case. You also can use hint to force the broadcast join. Let me CC [~rxin] [~yhuai] [~marmbrus] > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146755#comment-15146755 ] Xiao Li commented on SPARK-13307: - Please tune "spark.sql.autoBroadcastJoinThreshold" to enable the broadcast join. Thanks! > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146751#comment-15146751 ] Xiao Li commented on SPARK-13307: - 1.6.1 is using SortMergeJoin, but 1.4.1 is using ShuffleHashJoin. I believe this is the major cause of the performance difference. > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
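For readers following the suggestions in this thread: with shuffle hash join removed, a broadcast join can be requested either globally, by raising the size threshold, or per-join with the broadcast hint. A sketch against the 1.6-era API follows; it assumes an existing `SQLContext` named `sqlContext`, and the table and column names are illustrative, not from the TPCDS schema in the ticket.

```scala
// Assumes existing DataFrames `factTable` and `dimTable`; names are illustrative.
import org.apache.spark.sql.functions.broadcast

// Option 1: raise the auto-broadcast threshold (in bytes). Any table whose
// estimated size is below it is broadcast automatically; -1 disables
// broadcast joins entirely.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (100 * 1024 * 1024).toString) // 100 MB

// Option 2: hint one side of a specific join, regardless of the threshold.
val joined = factTable.join(broadcast(dimTable), Seq("item_sk"))
```

The hint approach is usually preferable for a single problematic query, since raising the global threshold affects every join in the session.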
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146746#comment-15146746 ] Sean Owen commented on SPARK-13313: --- Dumb question, but is this the difference between directed and undirected graphs? Like, is GraphX reading these as directed edges only? > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed.
[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146740#comment-15146740 ] JESSE CHEN commented on SPARK-13307: Uploaded newly collected plans (logical, analyzed, optimized and physical). Links are the same: https://ibm.box.com/spark-sql-q66-debug-160plan https://ibm.box.com/spark-sql-q66-debug-141plan Please let me know any additional info you need to collect. Thanks. > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146731#comment-15146731 ] Petar Zecevic commented on SPARK-13313: --- Yes, you need articles.tsv and links.tsv from this archive: http://snap.stanford.edu/data/wikispeedia/wikispeedia_paths-and-graph.tar.gz Then parse the data, assign IDs to article names and create the graph:
{code}
val articles = sc.textFile("articles.tsv", 6).
  filter(line => line.trim() != "" && !line.startsWith("#")).
  zipWithIndex().cache()
val links = sc.textFile("links.tsv", 6).
  filter(line => line.trim() != "" && !line.startsWith("#"))
val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).
  join(articles).map(x => x._2).join(articles).map(x => x._2)
val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0)
{code}
Then get strongly connected components:
{code}
val wikiSCC = wikigraph.stronglyConnectedComponents(100)
{code}
The wikiSCC graph contains 519 SCCs, but there should be many more. The largest SCC in wikiSCC has 4051 vertices, which is obviously wrong. The change on line 89, which I mentioned, seems to solve this problem, but then other issues arise (stack overflow etc.) and I don't have time to investigate further. I hope someone will look into this. > Strongly connected components doesn't find all strongly connected components
[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.
[ https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146728#comment-15146728 ] Ankit Jindal commented on SPARK-12969: -- Hi, I have tried your code with Java 1.8.0_66 and Spark 1.6 in local mode and it is working as expected. Can you provide the command you are using to run this? Regards, Ankit > Exception while casting a spark supported date formatted "string" to "date" > data type. > --- > > Key: SPARK-12969 > URL: https://issues.apache.org/jira/browse/SPARK-12969 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Spark Java >Reporter: Jais Sebastian > > Getting an exception while converting a string column (the column has the Spark-supported date format yyyy-MM-dd) to the date data type. Below is the code snippet:
{code}
List<String> jsonData = Arrays.asList("{\"d\":\"2015-02-01\",\"n\":1}");
JavaRDD<String> dataRDD = this.getSparkContext().parallelize(jsonData);
DataFrame data = this.getSqlContext().read().json(dataRDD);
DataFrame newData = data.select(data.col("d").cast("date"));
newData.show();
{code}
> The above code will give the error > failed to compile: org.codehaus.commons.compiler.CompileException: File > generated.java, Line 95, Column 28: Expression "scala.Option < Long > > longOpt16" is not an lvalue > This happens only if we execute the program in client mode; it works if we > execute through spark-submit. Here is the sample project: > https://github.com/uhonnavarkar/spark_test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
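For reference, the conversion the snippet attempts is well-defined for this input: "2015-02-01" matches the supported yyyy-MM-dd pattern, so the failure is in code generation, not in the data. A plain-Python illustration of the intended semantics (not Spark code; `json` and `datetime` are standard library):

```python
# Plain-Python sketch of the cast the Java snippet attempts: the JSON field
# "d" holds a yyyy-MM-dd string, which parses cleanly as an ISO date.
import json
from datetime import date

row = json.loads('{"d":"2015-02-01","n":1}')
d = date.fromisoformat(row["d"])  # analogous to cast("date")
print(d)  # 2015-02-01
```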
[jira] [Assigned] (SPARK-13294) Don't build assembly in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-13294: -- Assignee: Josh Rosen > Don't build assembly in dev/run-tests > - > > Key: SPARK-13294 > URL: https://issues.apache.org/jira/browse/SPARK-13294 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen > > As of SPARK-9284 we should no longer need to build the full Spark assembly > JAR in order to run tests. Therefore, we should remove the assembly step from > {{dev/run-tests}} in order to reduce build + test time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13294) Don't build assembly in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13294: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-11157 > Don't build assembly in dev/run-tests > - > > Key: SPARK-13294 > URL: https://issues.apache.org/jira/browse/SPARK-13294 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Josh Rosen > > As of SPARK-9284 we should no longer need to build the full Spark assembly > JAR in order to run tests. Therefore, we should remove the assembly step from > {{dev/run-tests}} in order to reduce build + test time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12772) Better error message for syntax error in the SQL parser
[ https://issues.apache.org/jira/browse/SPARK-12772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146710#comment-15146710 ] Herman van Hovell commented on SPARK-12772: --- I'll have a look. > Better error message for syntax error in the SQL parser > --- > > Key: SPARK-12772 > URL: https://issues.apache.org/jira/browse/SPARK-12772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Reynold Xin > > {code} > scala> sql("select case if(true, 'one', 'two')").explain(true) > org.apache.spark.sql.AnalysisException: org.antlr.runtime.EarlyExitException > line 1:34 required (...)+ loop did not match anything at input '' in > case expression > ; line 1 pos 34 > at > org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:140) > at > org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:129) > at > org.apache.spark.sql.catalyst.parser.ParseDriver$.parse(ParseDriver.scala:77) > at > org.apache.spark.sql.catalyst.CatalystQl.createPlan(CatalystQl.scala:53) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) > {code} > Is there a way to say something better other than "required (...)+ loop did > not match anything at input"? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146704#comment-15146704 ] Christopher Bourez commented on SPARK-13317: Because installing the notebooks Zeppelin or IScala on the cluster does not make a lot of sense. > SPARK_LOCAL_IP does not bind on Slaves > -- > > Key: SPARK-13317 > URL: https://issues.apache.org/jira/browse/SPARK-13317 > Project: Spark > Issue Type: Bug > Environment: Linux EC2, different VPC >Reporter: Christopher Bourez > > SPARK_LOCAL_IP does not bind to the provided IP on slaves. > When launching a job or a spark-shell from a second network, the returned IP > for the slave is still the first IP of the slave. > So the job fails with the message : > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > It is not a question of resources but the driver which cannot connect to the > slave given the wrong IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146698#comment-15146698 ] Apache Spark commented on SPARK-10759: -- User 'JeremyNixon' has created a pull request for this issue: https://github.com/apache/spark/pull/11202 > Missing Python code example in ML Programming guide > --- > > Key: SPARK-10759 > URL: https://issues.apache.org/jira/browse/SPARK-10759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Raela Wang >Assignee: Lauren Moos >Priority: Minor > Labels: starter > > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10759: Assignee: Apache Spark (was: Lauren Moos) > Missing Python code example in ML Programming guide > --- > > Key: SPARK-10759 > URL: https://issues.apache.org/jira/browse/SPARK-10759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Raela Wang >Assignee: Apache Spark >Priority: Minor > Labels: starter > > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146693#comment-15146693 ] DOAN DuyHai commented on SPARK-13317: - To complement this JIRA, I would say that the issue is: *how to configure Spark to use a public IP address for slaves on machines with multiple network interfaces*? > SPARK_LOCAL_IP does not bind on Slaves > -- > > Key: SPARK-13317 > URL: https://issues.apache.org/jira/browse/SPARK-13317 > Project: Spark > Issue Type: Bug > Environment: Linux EC2, different VPC >Reporter: Christopher Bourez > > SPARK_LOCAL_IP does not bind to the provided IP on slaves. > When launching a job or a spark-shell from a second network, the returned IP > for the slave is still the first IP of the slave. > So the job fails with the message : > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > It is not a question of resources but the driver which cannot connect to the > slave given the wrong IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
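For readers of this thread, the environment variables involved can be sketched as follows in {{conf/spark-env.sh}} on a slave. This is a sketch of the configuration knobs, not a confirmed fix (the reporter states elsewhere in this thread that setting SPARK_LOCAL_IP alone did not help); the addresses are taken from the examples in this thread.

```shell
# conf/spark-env.sh on a slave -- a sketch only, addresses from this thread.
# SPARK_LOCAL_IP selects the interface Spark binds to on this machine;
# SPARK_PUBLIC_DNS sets the hostname the worker advertises to remote drivers.
export SPARK_LOCAL_IP=172.31.4.179
export SPARK_PUBLIC_DNS=ec2-54-194-99-236.eu-west-1.compute.amazonaws.com
```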
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:02 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 
INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} These are private IPs that my MacBook cannot access, and when launching a job, an error follows: {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried to connect to the slaves, to set SPARK_LOCAL_IP in the slaves' spark-env.sh, and to stop and restart all slaves from the master; the Spark master still returns the private IPs of the slaves when I execute a job in client mode (spark-shell or Zeppelin on my MacBook). I think we should be able to work from different networks. Only the UI interfaces seem to be bound to the correct IP.
was (Author: christopher5106): I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:01 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 
INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} which are private IP that my macbook cannot access and when launching a job, an error follow : {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried to connect to the slaves, to set SPARK_LOCAL_IP in the slaves' spark-env.sh, stop and restart all slaves from the master, spark master still returns the private IP of the slaves. was (Author: christopher5106): I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO 
SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerM
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:00 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 
INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} which are private IP that my macbook cannot access and when launching a job, an error follow : {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried to connect to the slave, to set SPARK_LOCAL_IP in the slave's spark-env.sh, stop and restart all slaves from the master, spark master still returns the private IP. was (Author: christopher5106): I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO 
SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint:
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM: - I launch a cluster ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO 
SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} which are private IP that my macbook cannot access and when launching a job, an error follow : {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tryied to connect to the slave, to set SPARK_LOCAL_IP in the slave's spark-env.sh, stop and restart all slaves from the master, spark master still returns the private IP. Thanks, was (Author: christopher5106): I launch a cluster ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted 
executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 w
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc. If I launch a job in client mode from another network, for example from a Zeppelin notebook on my MacBook, with a configuration equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs: {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} These are private IPs that my MacBook cannot access, and when launching a job the following error appears: {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's spark-env.sh, and stopping and restarting all slaves from the master; the Spark master still returns the private IP. Thanks.
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez commented on SPARK-13317: I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc. If I launch a job in client mode from another network, for example from a Zeppelin notebook on my MacBook, with a configuration equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs: {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} These are private IPs that my MacBook cannot access, and when launching a job the following error appears: {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's spark-env.sh, and stopping and restarting all slaves from the master; the Spark master still returns the private IP. Thanks. > SPARK_LOCAL_IP does not bind on Slaves > -- > > Key: SPARK-13317 > URL: https://issues.apache.org/jira/browse/SPARK-13317 > Project: Spark > Issue Type: Bug > Environment: Linux EC2, different VPC >Reporter: Christopher Bourez > > SPARK_LOCAL_IP does not bind to the provided IP on slaves. > When launching a job or a spark-shell from a second network, the returned IP > for the slave is still the first IP of the slave. > So the job fails with the message: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > It is not a question of resources; the driver simply cannot connect to the > slave because it is given the wrong IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
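One direction for the per-machine configuration Sean Owen asks about below (illustrative values only, not verified against this cluster): in standalone mode, SPARK_LOCAL_IP controls the address a daemon binds to on that machine, while SPARK_PUBLIC_DNS controls the hostname the worker advertises to drivers, which is what a driver outside the VPC would need to reach. A hypothetical conf/spark-env.sh on one slave:

```
# conf/spark-env.sh on each slave -- illustrative values for this cluster's
# first worker; adjust per machine.
# Address this worker's daemons bind to (its private VPC address):
export SPARK_LOCAL_IP=172.31.4.179
# Hostname advertised to drivers (must be reachable from outside the VPC):
export SPARK_PUBLIC_DNS=ec2-54-194-99-236.eu-west-1.compute.amazonaws.com
```

Whether this resolves the reporter's symptom is untested here; it sketches the bind-address vs. advertised-address distinction at the heart of the report.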
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146646#comment-15146646 ] Sean Owen commented on SPARK-13317: --- Can you clarify -- are you setting SPARK_LOCAL_IP correctly on each machine? I'm not clear what is set where and what is used where. > SPARK_LOCAL_IP does not bind on Slaves > -- > > Key: SPARK-13317 > URL: https://issues.apache.org/jira/browse/SPARK-13317 > Project: Spark > Issue Type: Bug > Environment: Linux EC2, different VPC >Reporter: Christopher Bourez > > SPARK_LOCAL_IP does not bind to the provided IP on slaves. > When launching a job or a spark-shell from a second network, the returned IP > for the slave is still the first IP of the slave. > So the job fails with the message: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > It is not a question of resources; the driver simply cannot connect to the > slave because it is given the wrong IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
Christopher Bourez created SPARK-13317: -- Summary: SPARK_LOCAL_IP does not bind on Slaves Key: SPARK-13317 URL: https://issues.apache.org/jira/browse/SPARK-13317 Project: Spark Issue Type: Bug Environment: Linux EC2, different VPC Reporter: Christopher Bourez SPARK_LOCAL_IP does not bind to the provided IP on slaves. When launching a job or a spark-shell from a second network, the returned IP for the slave is still the first IP of the slave. So the job fails with the message: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources It is not a question of resources; the driver simply cannot connect to the slave because it is given the wrong IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146140#comment-15146140 ] Xiao Li edited comment on SPARK-13307 at 2/14/16 4:13 PM: -- Could you provide logical plans, as suggested above? The attached only contains the physical plans. Thanks! was (Author: smilegator): Could you provided logical plans, as suggested above? The attached only contains the physical plans. Thanks! > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1 > - > > Key: SPARK-13307 > URL: https://issues.apache.org/jira/browse/SPARK-13307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > > Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average > about 9% faster. There are a few degraded, and one that is definitely not > within error margin is query 66. > Query 66 in 1.4.1: 699 seconds > Query 66 in 1.6.0: 918 seconds > 30% worse. > Collected the physical plans from both versions - drastic difference maybe > partially from using Tungsten in 1.6, but anything else at play here? > Please see plans here: > https://ibm.box.com/spark-sql-q66-debug-160plan > https://ibm.box.com/spark-sql-q66-debug-141plan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards
[ https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13316: -- Affects Version/s: (was: 2.0.0) Priority: Minor (was: Major) OK to update docs and/or make a better error message if you can. > "SparkException: DStream has not been initialized" when restoring > StreamingContext from checkpoint and the dstream is created afterwards > > > Key: SPARK-13316 > URL: https://issues.apache.org/jira/browse/SPARK-13316 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Jacek Laskowski >Priority: Minor > > I faced the issue today but [it was already reported on > SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago, and the > reason is that a dstream is registered after a StreamingContext has been > recreated from checkpoint. > It _appears_ that no dstreams may be registered after a StreamingContext > has been recreated from checkpoint. It is *not* obvious at first. > The code: > {code} > def createStreamingContext(): StreamingContext = { > val ssc = new StreamingContext(sparkConf, Duration(1000)) > ssc.checkpoint(checkpointDir) > ssc > } > val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext) > val socketStream = ssc.socketTextStream(...) > socketStream.checkpoint(Seconds(1)) > socketStream.foreachRDD(...) > {code} > It should be described in docs at the very least and/or checked in the code > when the streaming computation starts. 
> The exception is as follows: > {code} > org.apache.spark.SparkException: > org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been > initialized > at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311) > at > org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228) > at > 
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97) > at > org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at ... run in separate thread using org.apache.spark.util.ThreadUtils ... () > at > org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579) > ... 43 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards
Jacek Laskowski created SPARK-13316: --- Summary: "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards Key: SPARK-13316 URL: https://issues.apache.org/jira/browse/SPARK-13316 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 2.0.0 Reporter: Jacek Laskowski I faced the issue today but [it was already reported on SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago, and the reason is that a dstream is registered after a StreamingContext has been recreated from checkpoint. It _appears_ that no dstreams may be registered after a StreamingContext has been recreated from checkpoint. It is *not* obvious at first. The code: {code} def createStreamingContext(): StreamingContext = { val ssc = new StreamingContext(sparkConf, Duration(1000)) ssc.checkpoint(checkpointDir) ssc } val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext) val socketStream = ssc.socketTextStream(...) socketStream.checkpoint(Seconds(1)) socketStream.foreachRDD(...) {code} It should be described in docs at the very least and/or checked in the code when the streaming computation starts. 
The exception is as follows: {code} org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97) at 
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83) at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589) at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) at ... run in separate thread using org.apache.spark.util.ThreadUtils ... () at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579) ... 43 elided {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
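The failure mode in this issue can be captured by a toy model (plain Python, not the Spark API; all names here are illustrative): on restart, getOrCreate skips the factory and restores only the graph that was checkpointed, so any stream registered afterwards is a fresh, uninitialized object.

```python
# Toy model of checkpoint recovery showing why DStreams registered *after*
# getOrCreate has restored a checkpoint are never initialized.
_checkpoints = {}

class ToyContext:
    def __init__(self, restored=False):
        self.restored = restored
        self.graph = []                      # registered "dstreams"

    def socket_stream(self):
        self.graph.append({"from_checkpoint": False})

    def start(self):
        if self.restored:
            for d in self.graph:
                if not d["from_checkpoint"]:
                    raise RuntimeError("DStream has not been initialized")

def get_or_create(path, factory):
    if path in _checkpoints:                 # restart: factory is NOT called
        ctx = ToyContext(restored=True)
        ctx.graph = [{"from_checkpoint": True} for _ in _checkpoints[path]]
        return ctx
    ctx = factory()                          # first run
    _checkpoints[path] = list(ctx.graph)     # graph snapshot is checkpointed
    return ctx

# Anti-pattern from the report: the stream is created after getOrCreate,
# so it is missing from the checkpoint and uninitialized after a restart.
get_or_create("/cp", ToyContext)             # first run checkpoints an empty graph
bad = get_or_create("/cp", ToyContext)       # "restart" from the checkpoint
bad.socket_stream()
try:
    bad.start()
except RuntimeError as e:
    print(e)                                 # DStream has not been initialized

# Fix: build the *entire* graph inside the factory passed to getOrCreate.
def factory():
    ctx = ToyContext()
    ctx.socket_stream()
    return ctx

get_or_create("/cp2", factory)               # first run
good = get_or_create("/cp2", factory)        # restart restores the stream
good.start()                                 # no error
```

This mirrors the documented fix: move every stream definition (and its output operations) into the function handed to StreamingContext.getOrCreate.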
[jira] [Updated] (SPARK-13309) Incorrect type inference for CSV data.
[ https://issues.apache.org/jira/browse/SPARK-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13309: -- Target Version/s: (was: 1.6.0) Priority: Minor (was: Major) Fix Version/s: (was: 1.6.0) > Incorrect type inference for CSV data. > -- > > Key: SPARK-13309 > URL: https://issues.apache.org/jira/browse/SPARK-13309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Rahul Tanwani >Priority: Minor > > Type inference for CSV data does not work as expected when the data is > sparse. > For instance: Consider the following datasets and the inferred schema: > {code} > A,B,C,D > 1,,, > ,1,, > ,,1, > ,,,1 > {code} > {code} > root > |-- A: integer (nullable = true) > |-- B: integer (nullable = true) > |-- C: string (nullable = true) > |-- D: string (nullable = true) > {code} > Here all the fields should have been inferred as Integer types, but clearly > the inferred schema is different. > Another dataset: > {code} > A,B,C,D > 1,,1, > {code} > and the inferred schema: > {code} > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > |-- C: string (nullable = true) > |-- D: string (nullable = true) > {code} > Here, fields A & C should be inferred as Integer types. > Same issue has been discussed on spark-csv package. Please take a look at > https://github.com/databricks/spark-csv/issues/216 for reference. > The issue was fixed with > https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d. > I will try to submit PR with the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
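The fix described for sparse CSV inference can be sketched column-wise with a small widening lattice (a hedged illustration, not the spark-csv code): empty cells contribute a null type instead of string, and each column's type is the maximum over its non-null cells.

```python
# Per-column type inference with a widening order; empty cells are nulls
# and never force a column to string.
NULL, INT, DOUBLE, STRING = 0, 1, 2, 3  # widening order

def cell_type(cell):
    if cell == "":
        return NULL                     # missing value, no type evidence
    try:
        int(cell)
        return INT
    except ValueError:
        pass
    try:
        float(cell)
        return DOUBLE
    except ValueError:
        return STRING

def infer_schema(rows):
    types = [NULL] * len(rows[0])
    for row in rows:
        for i, cell in enumerate(row):
            types[i] = max(types[i], cell_type(cell))  # widen per column
    return types

# The sparse dataset from the report: every column should infer as integer.
rows = [["1", "", "", ""],
        ["", "1", "", ""],
        ["", "", "1", ""],
        ["", "", "", "1"]]
print(infer_schema(rows))  # [1, 1, 1, 1] -> INT for all four columns
```

With the buggy behavior, an empty cell would widen the column straight to STRING, which is exactly the incorrect schema shown in the report.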
[jira] [Updated] (SPARK-13314) Malformed WholeStageCodegen tree string
[ https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13314: -- Component/s: SQL > Malformed WholeStageCodegen tree string > --- > > Key: SPARK-13314 > URL: https://issues.apache.org/jira/browse/SPARK-13314 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan > tree, but the output can be malformed when the plan contains binary operators: > {code} > val a = sqlContext range 5 > val b = sqlContext range 2 > a select ('id as 'a) unionAll (b select ('id as 'a)) explain true > {code} > {noformat} > ... > == Physical Plan == > Union > :- WholeStageCodegen > : : +- Project [id#3L AS a#6L] > : : +- Range 0, 1, 8, 5, [id#3L] > +- WholeStageCodegen >: +- Project [id#4L AS a#7L] >: +- Range 0, 1, 8, 2, [id#4L] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
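The prefix bookkeeping that keeps such tree strings aligned can be shown in a few lines (a plain-Python toy, not Spark's TreeNode): each level appends either ":  " (more siblings follow) or three spaces (last child) to the prefix it received, so an inlined subtree stays in column.

```python
# Toy plan-tree printer demonstrating correct prefix threading.
def tree_string(node, prefix="", is_last=True, top=True):
    name, children = node
    head = name if top else prefix + ("+- " if is_last else ":- ") + name
    # Children extend the prefix handed to *this* node, never start fresh.
    child_prefix = "" if top else prefix + ("   " if is_last else ":  ")
    lines = [head]
    for i, child in enumerate(children):
        lines += tree_string(child, child_prefix, i == len(children) - 1, top=False)
    return lines

plan = ("Union", [
    ("WholeStageCodegen", [("Project [id#3L AS a#6L]", [("Range 0, 1, 8, 5, [id#3L]", [])])]),
    ("WholeStageCodegen", [("Project [id#4L AS a#7L]", [("Range 0, 1, 8, 2, [id#4L]", [])])]),
])
print("\n".join(tree_string(plan)))
```

The malformed output in the report is what happens when an overriding node prints its inner tree without threading through the prefix accumulated by its ancestors.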
[jira] [Updated] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12869: -- Flags: (was: Patch) Target Version/s: (was: 1.6.1) Priority: Minor (was: Major) Fix Version/s: (was: 1.6.1) [~Fokko] don't set fix/target version > Optimize conversion from BlockMatrix to IndexedRowMatrix > > > Key: SPARK-12869 > URL: https://issues.apache.org/jira/browse/SPARK-12869 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: Fokko Driesprong >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > In the current implementation of the BlockMatrix, the conversion to the > IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This > is somewhat ok when the matrix is very sparse, but for dense matrices this is > very inefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
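The inefficiency described above is the per-cell detour: going through a CoordinateMatrix explodes every entry of a dense block into an (i, j, value) triple. A direct conversion copies block rows into the right row/column offsets instead. A plain-Python sketch (local data structures, not the distributed MLlib types):

```python
# Convert a block-partitioned matrix straight to indexed rows, skipping the
# per-cell coordinate representation.
def blocks_to_indexed_rows(blocks, rows_per_block, cols_per_block, n_rows, n_cols):
    rows = {i: [0.0] * n_cols for i in range(n_rows)}
    for (bi, bj), block in blocks.items():
        for r, block_row in enumerate(block):        # whole row slices at a time
            target = rows[bi * rows_per_block + r]
            for c, v in enumerate(block_row):
                target[bj * cols_per_block + c] = v
    return rows

# A 2x3 matrix stored as one 2x2 block and one 2x1 block.
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (0, 1): [[5.0], [6.0]],
}
print(blocks_to_indexed_rows(blocks, 2, 2, 2, 3))
# {0: [1.0, 2.0, 5.0], 1: [3.0, 4.0, 6.0]}
```

For dense blocks this does one write per entry with no intermediate triples, which is the gist of the proposed optimization.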
[jira] [Updated] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated SPARK-12869: - Flags: Patch Affects Version/s: 1.6.0 Target Version/s: 1.6.1 Fix Version/s: 1.6.1 > Optimize conversion from BlockMatrix to IndexedRowMatrix > > > Key: SPARK-12869 > URL: https://issues.apache.org/jira/browse/SPARK-12869 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: Fokko Driesprong > Fix For: 1.6.1 > > Original Estimate: 48h > Remaining Estimate: 48h > > In the current implementation of the BlockMatrix, the conversion to the > IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This > is somewhat ok when the matrix is very sparse, but for dense matrices this is > very inefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13315) multiple columns filtering
[ https://issues.apache.org/jira/browse/SPARK-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Vatani closed SPARK-13315. -- I found the solution: NewDf=Df.filter((Df.Col1==A) | (Df.Col2==B)) > multiple columns filtering > -- > > Key: SPARK-13315 > URL: https://issues.apache.org/jira/browse/SPARK-13315 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Hossein Vatani >Priority: Minor > > Hi, > I tried to filter on two columns like below: > NewDf=Df.filter(Df.Col1==A | Df.Col2==B) > but I got the error below: > Py4JError: An error occurred while calling o230.or. Trace: > py4j.Py4JException: Method or([class java.lang.String]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > As far as I can tell, only filter(condition) is available; there is no > filter(conditions) for multiple conditions. > P.S. OS: CentOS 7
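The parentheses are what fix it: in Python, the bitwise `|` operator binds more tightly than `==`, so the unparenthesized form parses as a chained comparison rather than a disjunction of two column conditions, which is plausibly how PySpark ends up invoking `Column`'s `or` with a bare string and raising the Py4JException above. The parse difference can be seen with the standard-library `ast` module alone, no Spark needed:

```python
import ast

# Unparenthesized: `|` binds tighter than `==`, so this is parsed as the
# chained comparison  Df.Col1 == (A | Df.Col2) == B  -- a single Compare node.
bad = ast.parse("Df.Col1 == A | Df.Col2 == B", mode="eval").body

# Parenthesized: a BitOr of two separate comparisons, which is the shape
# PySpark's Column operators are designed to combine.
good = ast.parse("(Df.Col1 == A) | (Df.Col2 == B)", mode="eval").body

print(type(bad).__name__)   # Compare
print(type(good).__name__)  # BinOp
```

This is why PySpark's documentation recommends parenthesizing every comparison when combining conditions with `|` and `&`.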
[jira] [Resolved] (SPARK-13315) multiple columns filtering
[ https://issues.apache.org/jira/browse/SPARK-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13315. --- Resolution: Invalid Fix Version/s: (was: 1.6.0) > multiple columns filtering > -- > > Key: SPARK-13315 > URL: https://issues.apache.org/jira/browse/SPARK-13315 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Hossein Vatani >Priority: Minor > > Hi, > I tried to filter on two columns like below: > NewDf=Df.filter(Df.Col1==A | Df.Col2==B) > but I got the error below: > Py4JError: An error occurred while calling o230.or. Trace: > py4j.Py4JException: Method or([class java.lang.String]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > As far as I can tell, only filter(condition) is available; there is no > filter(conditions) for multiple conditions. > P.S. OS: CentOS 7
[jira] [Updated] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12363: -- Labels: backport-needed (was: ) Is it realistic to expect another 1.3 or 1.4 release? I am not even sure 1.5.3 will be formally released. > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Yanbo Liang >Assignee: Liang-Chi Hsieh >Priority: Minor > Labels: backport-needed > Fix For: 1.5.3, 1.6.1, 2.0.0 > > > We plan to deprecate `runs` in KMeans; PowerIterationClustering leverages > KMeans to train its model. > I removed `setRuns` used in PowerIterationClustering, but one of the test > cases failed.
[jira] [Created] (SPARK-13315) multiple columns filtering
Hossein Vatani created SPARK-13315: -- Summary: multiple columns filtering Key: SPARK-13315 URL: https://issues.apache.org/jira/browse/SPARK-13315 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.6.0 Reporter: Hossein Vatani Priority: Minor Fix For: 1.6.0 Hi, I tried to filter on two columns like below: NewDf=Df.filter(Df.Col1==A | Df.Col2==B) but I got the error below: Py4JError: An error occurred while calling o230.or. Trace: py4j.Py4JException: Method or([class java.lang.String]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) As far as I can tell, only filter(condition) is available; there is no filter(conditions) for multiple conditions. P.S. OS: CentOS 7
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146534#comment-15146534 ] Sean Owen commented on SPARK-13313: --- Can you be more specific? E.g., specific examples from the data, and a pull request? > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Petar Zecevic > > The strongly connected components algorithm doesn't find all strongly connected > components. I was using the Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html), and the algorithm found 519 > SCCs, one of which had 4051 vertices that in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala, > where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use the Out edge direction, the same as the first > call, because the direction is reversed in the provided sendMsg function > (the message is sent to the source vertex, not the destination vertex). > If that is changed (line 89), the algorithm starts finding many more SCCs, > but eventually a stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed.
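The direction sensitivity the reporter describes is easiest to see in the classic sequential analogue of the two-pass approach, Kosaraju's algorithm: the second traversal must walk edges in the opposite direction from the first, or vertices with no mutual paths get merged into one component. A minimal pure-Python sketch (illustrative only, not the GraphX/Pregel implementation):

```python
from collections import defaultdict

def kosaraju_scc(edges):
    """Return {vertex: component_root} for a directed edge list."""
    graph, rev = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)  # the second pass MUST use reversed edges
        nodes.update((u, v))

    order, seen = [], set()
    def finish_order(u):          # pass 1: forward DFS, record finish times
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                finish_order(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            finish_order(u)

    comp = {}
    def assign(u, root):          # pass 2: DFS on the REVERSED graph
        comp[u] = root
        for v in rev[u]:
            if v not in comp:
                assign(v, root)
    for u in reversed(order):
        if u not in comp:
            assign(u, u)
    return comp

# Cycle {1,2,3} feeding into cycle {4,5}: two SCCs, no path back from 4 to 3.
comp = kosaraju_scc([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)])
```

If `assign` walked `graph` instead of `rev`, vertices 1..5 would all collapse into one component, which mirrors the symptom reported here: a 4051-vertex "SCC" with no real mutual connectivity.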
[jira] [Commented] (SPARK-13314) Malformed WholeStageCodegen tree string
[ https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146527#comment-15146527 ] Apache Spark commented on SPARK-13314: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/11200 > Malformed WholeStageCodegen tree string > --- > > Key: SPARK-13314 > URL: https://issues.apache.org/jira/browse/SPARK-13314 > Project: Spark > Issue Type: Bug >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan > tree, but the output can be malformed when the plan contains binary operators: > {code} > val a = sqlContext range 5 > val b = sqlContext range 2 > a select ('id as 'a) unionAll (b select ('id as 'a)) explain true > {code} > {noformat} > ... > == Physical Plan == > Union > :- WholeStageCodegen > : : +- Project [id#3L AS a#6L] > : : +- Range 0, 1, 8, 5, [id#3L] > +- WholeStageCodegen >: +- Project [id#4L AS a#7L] >: +- Range 0, 1, 8, 2, [id#4L] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13314) Malformed WholeStageCodegen tree string
[ https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-13314: -- Assignee: Cheng Lian > Malformed WholeStageCodegen tree string > --- > > Key: SPARK-13314 > URL: https://issues.apache.org/jira/browse/SPARK-13314 > Project: Spark > Issue Type: Bug >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan > tree, but the output can be malformed when the plan contains binary operators: > {code} > val a = sqlContext range 5 > val b = sqlContext range 2 > a select ('id as 'a) unionAll (b select ('id as 'a)) explain true > {code} > {noformat} > ... > == Physical Plan == > Union > :- WholeStageCodegen > : : +- Project [id#3L AS a#6L] > : : +- Range 0, 1, 8, 5, [id#3L] > +- WholeStageCodegen >: +- Project [id#4L AS a#7L] >: +- Range 0, 1, 8, 2, [id#4L] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13314) Malformed WholeStageCodegen tree string
[ https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13314: Assignee: Apache Spark (was: Cheng Lian) > Malformed WholeStageCodegen tree string > --- > > Key: SPARK-13314 > URL: https://issues.apache.org/jira/browse/SPARK-13314 > Project: Spark > Issue Type: Bug >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Minor > > {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan > tree, but the output can be malformed when the plan contains binary operators: > {code} > val a = sqlContext range 5 > val b = sqlContext range 2 > a select ('id as 'a) unionAll (b select ('id as 'a)) explain true > {code} > {noformat} > ... > == Physical Plan == > Union > :- WholeStageCodegen > : : +- Project [id#3L AS a#6L] > : : +- Range 0, 1, 8, 5, [id#3L] > +- WholeStageCodegen >: +- Project [id#4L AS a#7L] >: +- Range 0, 1, 8, 2, [id#4L] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13314) Malformed WholeStageCodegen tree string
[ https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13314: Assignee: Cheng Lian (was: Apache Spark) > Malformed WholeStageCodegen tree string > --- > > Key: SPARK-13314 > URL: https://issues.apache.org/jira/browse/SPARK-13314 > Project: Spark > Issue Type: Bug >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan > tree, but the output can be malformed when the plan contains binary operators: > {code} > val a = sqlContext range 5 > val b = sqlContext range 2 > a select ('id as 'a) unionAll (b select ('id as 'a)) explain true > {code} > {noformat} > ... > == Physical Plan == > Union > :- WholeStageCodegen > : : +- Project [id#3L AS a#6L] > : : +- Range 0, 1, 8, 5, [id#3L] > +- WholeStageCodegen >: +- Project [id#4L AS a#7L] >: +- Range 0, 1, 8, 2, [id#4L] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13314) Malformed WholeStageCodegen tree string
Cheng Lian created SPARK-13314: -- Summary: Malformed WholeStageCodegen tree string Key: SPARK-13314 URL: https://issues.apache.org/jira/browse/SPARK-13314 Project: Spark Issue Type: Bug Reporter: Cheng Lian Priority: Minor {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan tree, but the output can be malformed when the plan contains binary operators: {code} val a = sqlContext range 5 val b = sqlContext range 2 a select ('id as 'a) unionAll (b select ('id as 'a)) explain true {code} {noformat} ... == Physical Plan == Union :- WholeStageCodegen : : +- Project [id#3L AS a#6L] : : +- Range 0, 1, 8, 5, [id#3L] +- WholeStageCodegen : +- Project [id#4L AS a#7L] : +- Range 0, 1, 8, 2, [id#4L] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
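The fix amounts to threading the parent's continuation prefix into the nested subtree. As a sketch of the correct bookkeeping (plain Python, not Spark's actual `generateTreeString`): each child contributes a branch marker (`:- ` or `+- `) for its own line, plus a continuation prefix (`:  ` or three spaces) that every descendant line must inherit:

```python
def render(node, out=None, prefix="", child_prefix=""):
    """node = (name, [children]); returns lines in Spark's ':- / +-' style."""
    name, children = node
    if out is None:
        out = []
    out.append(prefix + name)
    for i, child in enumerate(children):
        last = i == len(children) - 1
        render(child, out,
               child_prefix + ("+- " if last else ":- "),   # this child's line
               child_prefix + ("   " if last else ":  "))   # its descendants
    return out

# The Union plan from the issue description:
plan = ("Union", [
    ("WholeStageCodegen", [
        ("Project [id#3L AS a#6L]", [("Range 0, 1, 8, 5, [id#3L]", [])])]),
    ("WholeStageCodegen", [
        ("Project [id#4L AS a#7L]", [("Range 0, 1, 8, 2, [id#4L]", [])])]),
])
out = render(plan)
print("\n".join(out))
```

The malformed output in the report looks like the inner tree was rendered with an extra (or stale) prefix segment; the invariant to preserve is that a node's descendants see only its accumulated `child_prefix`, never its own branch marker.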
[jira] [Updated] (SPARK-13118) Support for classes defined in package objects
[ https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13118: -- Target Version/s: 2.0.0 (was: 1.6.1, 2.0.0) > Support for classes defined in package objects > -- > > Key: SPARK-13118 > URL: https://issues.apache.org/jira/browse/SPARK-13118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > When you define a class inside of a package object, the name ends up being > something like {{org.mycompany.project.package$MyClass}}. However, when > reflect on this we try and load {{org.mycompany.project.MyClass}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13128) API for building arrays / lists encoders
[ https://issues.apache.org/jira/browse/SPARK-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13128: -- Target Version/s: 2.0.0 (was: 1.6.1, 2.0.0) > API for building arrays / lists encoders > > > Key: SPARK-13128 > URL: https://issues.apache.org/jira/browse/SPARK-13128 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust > > Example usage: > {code} > Encoder.array(Encoder.INT) > Encoder.list(Encoder.INT) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12609) Make R to JVM timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12609: -- Target Version/s: (was: 1.6.1, 2.0.0) > Make R to JVM timeout configurable > --- > > Key: SPARK-12609 > URL: https://issues.apache.org/jira/browse/SPARK-12609 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman > > The timeout from R to the JVM is hardcoded at 6000 seconds in > https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22 > This results in Spark jobs that take more than 100 minutes to always fail. We > should make this timeout configurable through SparkConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13062) Overwriting same file with new schema destroys original file.
[ https://issues.apache.org/jira/browse/SPARK-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13062. --- Resolution: Won't Fix ... though if someone has a reliable way to fail fast in most or all possible cases of this form, that would be a way forward > Overwriting same file with new schema destroys original file. > - > > Key: SPARK-13062 > URL: https://issues.apache.org/jira/browse/SPARK-13062 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Vincent Warmerdam > > I am using Hadoop with Spark 1.5.2. Using pyspark, let's create two > dataframes. > {code} > ddf1 = sqlCtx.createDataFrame(pd.DataFrame({'time':[1,2,3], > 'thing':['a','b','b']})) > ddf2 = sqlCtx.createDataFrame(pd.DataFrame({'time':[4,5,6,7], > 'thing':['a','b','a','b'], > 'name':['pi', 'ca', 'chu', '!']})) > ddf1.printSchema() > ddf2.printSchema() > ddf1.write.parquet('/tmp/ddf1', mode = 'overwrite') > ddf2.write.parquet('/tmp/ddf2', mode = 'overwrite') > sqlCtx.read.load('/tmp/ddf1', schema=ddf2.schema).show() > sqlCtx.read.load('/tmp/ddf2', schema=ddf1.schema).show() > {code} > Spark does a nice thing here, you can use different schemas consistently. > {code} > root > |-- thing: string (nullable = true) > |-- time: long (nullable = true) > root > |-- name: string (nullable = true) > |-- thing: string (nullable = true) > |-- time: long (nullable = true) > ++-++ > |name|thing|time| > ++-++ > |null|a| 1| > |null|b| 3| > |null|b| 2| > ++-++ > +-++ > |thing|time| > +-++ > |b| 7| > |b| 5| > |a| 4| > |a| 6| > +-++ > {code} > But here comes something naughty. Imagine that I want to update `ddf1` with > the new schema and save this on the HDFS filesystem. > I'll first write it to a new filename. > {code} > sqlCtx.read.load('/tmp/ddf1', schema=ddf1.schema)\ > .write.parquet('/tmp/ddf1_again', mode = 'overwrite') > {code} > Nothing seems to go wrong. 
> {code} > > sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema).show() > ++-++ > |name|thing|time| > ++-++ > |null|a| 1| > |null|b| 2| > |null|b| 3| > ++-++ > {code} > But what happens when I rewrite the file with a new schema. Note that the > main difference is that I am attempting to rewrite the file. I am now using > the same file name, not a different one. > {code} > sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema)\ > .write.parquet('/tmp/ddf1_again', mode = 'overwrite') > {code} > I get this big error. > {code} > Py4JJavaError: An error occurred while calling o97.parquet. > : org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun
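The underlying hazard generalizes: reads are lazy, and overwrite mode deletes the target path before the deferred read plan ever executes, so reading and rewriting the same path destroys the data the plan still needs. The usual workaround is to materialize to a temporary location first and swap it into place afterwards, sketched here in plain Python file terms (`safe_overwrite` is a hypothetical helper, not a PySpark API):

```python
import os
import tempfile

def safe_overwrite(path, transform):
    # Materialize the transformed contents somewhere else FIRST; overwriting
    # in place would delete `path` before its contents were fully consumed.
    with open(path) as f:
        new_data = transform(f.read())
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        f.write(new_data)
    os.replace(tmp, path)  # atomic swap on POSIX filesystems

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "data.txt")
    with open(p, "w") as f:
        f.write("a,b\n1,2\n")
    safe_overwrite(p, str.upper)
    with open(p) as f:
        result = f.read()
```

In Spark terms the same idea is: write to a new path, verify, then rename or repoint readers, which is also why a fail-fast check (as suggested above) would be the safer default.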
[jira] [Updated] (SPARK-13278) Launcher fails to start with JDK 9 EA
[ https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13278: -- Assignee: Claes Redestad > Launcher fails to start with JDK 9 EA > - > > Key: SPARK-13278 > URL: https://issues.apache.org/jira/browse/SPARK-13278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Claes Redestad >Assignee: Claes Redestad >Priority: Minor > Fix For: 2.0.0 > > > CommandBuilderUtils.addPermGenSizeOpt need to handle the JDK 9 version string > format, which can look like the expected 9, but also like 9-ea and 9+100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13278) Launcher fails to start with JDK 9 EA
[ https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13278: -- Priority: Minor (was: Major) > Launcher fails to start with JDK 9 EA > - > > Key: SPARK-13278 > URL: https://issues.apache.org/jira/browse/SPARK-13278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Claes Redestad >Priority: Minor > Fix For: 2.0.0 > > > CommandBuilderUtils.addPermGenSizeOpt need to handle the JDK 9 version string > format, which can look like the expected 9, but also like 9-ea and 9+100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13278) Launcher fails to start with JDK 9 EA
[ https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13278. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11160 [https://github.com/apache/spark/pull/11160] > Launcher fails to start with JDK 9 EA > - > > Key: SPARK-13278 > URL: https://issues.apache.org/jira/browse/SPARK-13278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Claes Redestad > Fix For: 2.0.0 > > > CommandBuilderUtils.addPermGenSizeOpt need to handle the JDK 9 version string > format, which can look like the expected 9, but also like 9-ea and 9+100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13300) Spark examples page gives errors : Liquid error: pygments
[ https://issues.apache.org/jira/browse/SPARK-13300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13300. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 11180 [https://github.com/apache/spark/pull/11180] > Spark examples page gives errors : Liquid error: pygments > -- > > Key: SPARK-13300 > URL: https://issues.apache.org/jira/browse/SPARK-13300 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.0 >Reporter: stefan >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0, 1.6.1 > > > On ubuntu 15.10 updated, firefox renders this page: > http://spark.apache.org/examples.html > with this error: > Liquid error: pygments > Under every tab (Python, Scala, Java) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4039) KMeans support sparse cluster centers
[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146465#comment-15146465 ] yuhao yang commented on SPARK-4039: --- https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala I have an implementation there that supports sparse k-means centers. The calculation pattern can be switched via an extra parameter, so users can choose which pattern to use. As expected, it can save a lot of memory, depending on the average sparsity of the cluster centers, but it also consumes much more time. For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory consumption by 40x at about 7x the runtime. You are welcome to use it if you really need large-dimension k-means. > KMeans support sparse cluster centers > - > > Key: SPARK-4039 > URL: https://issues.apache.org/jira/browse/SPARK-4039 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Antoine Amend > Labels: clustering > > When the number of features is not known, it might be quite helpful to create > sparse vectors using HashingTF.transform. KMeans transforms center vectors > to dense vectors > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307), > therefore leading to OutOfMemory (even with small k). > Any way to keep vectors sparse?
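The memory argument is easy to sanity-check with back-of-envelope arithmetic (illustrative only; actual JVM object overheads differ): a dense double vector costs about 8 bytes per dimension, while a typical sparse vector costs roughly 12 bytes (8-byte value plus 4-byte int index) per stored nonzero. Using the feature dimension mentioned in this thread (10M) and a k of 500 (as in the related SPARK-12861 report):

```python
def dense_center_bytes(k, dim):
    # k dense centers of `dim` doubles, 8 bytes each
    return k * dim * 8

def sparse_center_bytes(k, dim, density):
    # ~12 bytes per stored entry: 8-byte double value + 4-byte int index
    nnz = int(dim * density)
    return k * nnz * 12

k, dim = 500, 10_000_000
print(dense_center_bytes(k, dim) / 1e9, "GB dense")          # 40.0 GB dense
print(sparse_center_bytes(k, dim, 0.01) / 1e9, "GB sparse")  # 0.6 GB at 1% density
```

Note the caveat implicit in the comment above: centers are *averages* of assigned points, so they are usually much denser than the input vectors, which is why the savings depend on "the average sparsity of the cluster centers" rather than of the data.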
[jira] [Comment Edited] (SPARK-12861) Changes to support KMeans with large feature space
[ https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146464#comment-15146464 ] yuhao yang edited comment on SPARK-12861 at 2/14/16 9:42 AM: - https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala I got an implementation there that supports sparse k-means centers. The calculation pattern can be switched via an extra parameter and users can choose which pattern to use. As expected, it can save a lot of memory according to the average sparsity of the cluster centers, but will consume much more time also. For feature dimension of 10M and nonzero rate 1e-6, it can reduce memory consumption by 40 times yet used 700% time. Welcome to use if you really need to support large dimension k-means. was (Author: yuhaoyan): https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala I got an implementation there that supports sparse k-means centers. The calculation pattern can be switched via an extra parameter and users can choose which pattern to use. As expected, it can save a lot of memory according to the average sparsity of the cluster centers, but will consume much more time also. For feature dimension of 10M and nonzero rate is 1e-6, it can reduce memory consumption by 40 times yet used 700% time. Welcome to use if you really need to support large dimension k-means. > Changes to support KMeans with large feature space > -- > > Key: SPARK-12861 > URL: https://issues.apache.org/jira/browse/SPARK-12861 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Roy Levin > Labels: patch > > The problem: > - > In Spark's KMeans code the center vectors are always represented as dense > vectors. As a result, when each such center has a large domain space the > algorithm quickly runs out of memory. 
In my example I have a feature space of > around 5 and k ~= 500. This sums up to around 200MB RAM for the center > vectors alone while in fact the center vectors are very sparse and require a > lot less RAM. > Since I am running on a system with relatively low resources I keep > getting OutOfMemory errors. In my setting it is OK to trade off runtime for > using less RAM. This is what I set out to do in my solution while allowing > users the flexibility to choose. > One solution could be to reduce the dimensions of the feature space but > this is not always the best approach. For example, when the object space is > comprised of users and the feature space of items. In such an example we may > want to run kmeans over a feature space which is a function of how many times > user i clicked item j. If we reduce the dimensions of the items we will not > be able to map the centers vectors back to the items. Moreover in a streaming > context detecting the changes WRT previous runs gets more difficult. > My solution: > > Allow the kmeans algorithm to accept a VectorFactory which decides when > vectors used inside the algorithm should be sparse and when they should be > dense. For backward compatibility the default behavior is to always make them > dense (like the situation is now). But now potentially the user can provide a > SmartVectorFactory (or some proprietary VectorFactory) which can decide to > make vectors sparse. 
> For this I made the following changes: > (1) Added a method called reassign to SparseVectors allowing to change > the indices and values > (2) Allow axpy to accept SparseVectors > (3) create a trait called VectorFactory and two implementations for it > that are used within KMeans code > To get the above described solution do the following: > git clone https://github.com/levin-royl/spark.git -b > SupportLargeFeatureDomains > Note > -- > There are some similar issues opened in JIRA in the past, e.g.: > https://issues.apache.org/jira/browse/SPARK-4039 > https://issues.apache.org/jira/browse/SPARK-1212 > https://github.com/mesos/spark/pull/736 > But the difference is that in the problem I describe reducing the dimensions > of the problem (i.e., the feature space) to allow using dense vectors is not > suitable. Also, the solution I implemented supports this while allowing full > flexibility to the user --- i.e., using the default dense vector > implementation or selecting an alternative (only when the default it is not > desired). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For addit
[jira] [Commented] (SPARK-12861) Changes to support KMeans with large feature space
[ https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146464#comment-15146464 ] yuhao yang commented on SPARK-12861: https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala I got an implementation there that supports sparse k-means centers. The calculation pattern can be switched via an extra parameter and users can choose which pattern to use. As expected, it can save a lot of memory according to the average sparsity of the cluster centers, but will consume much more time also. For feature dimension of 10M and nonzero rate is 1e-6, it can reduce memory consumption by 40 times yet used 700% time. Welcome to use if you really need to support large dimension k-means. > Changes to support KMeans with large feature space > -- > > Key: SPARK-12861 > URL: https://issues.apache.org/jira/browse/SPARK-12861 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Roy Levin > Labels: patch > > The problem: > - > In Spark's KMeans code the center vectors are always represented as dense > vectors. As a result, when each such center has a large domain space the > algorithm quickly runs out of memory. In my example I have a feature space of > around 5 and k ~= 500. This sums up to around 200MB RAM for the center > vectors alone while in fact the center vectors are very sparse and require a > lot less RAM. > Since I am running on a system with relatively low resources I keep > getting OutOfMemory errors. In my setting it is OK to trade off runtime for > using less RAM. This is what I set out to do in my solution while allowing > users the flexibility to choose. > One solution could be to reduce the dimensions of the feature space but > this is not always the best approach. For example, when the object space is > comprised of users and the feature space of items. 
In such an example we may > want to run kmeans over a feature space which is a function of how many times > user i clicked item j. If we reduce the dimensions of the items we will not > be able to map the center vectors back to the items. Moreover, in a streaming > context, detecting changes with respect to previous runs becomes more difficult. > My solution: > > Allow the kmeans algorithm to accept a VectorFactory which decides when > vectors used inside the algorithm should be sparse and when they should be > dense. For backward compatibility the default behavior is to always make them > dense (as the situation is now). But now the user can potentially provide a > SmartVectorFactory (or some proprietary VectorFactory) which can decide to > make vectors sparse. > For this I made the following changes: > (1) Added a method called reassign to SparseVectors, allowing the > indices and values to be changed > (2) Allowed axpy to accept SparseVectors > (3) Created a trait called VectorFactory and two implementations for it > that are used within the KMeans code > To get the above-described solution, do the following: > git clone https://github.com/levin-royl/spark.git -b > SupportLargeFeatureDomains > Note > -- > There are some similar issues opened in JIRA in the past, e.g.: > https://issues.apache.org/jira/browse/SPARK-4039 > https://issues.apache.org/jira/browse/SPARK-1212 > https://github.com/mesos/spark/pull/736 > But the difference is that, in the problem I describe, reducing the dimensions > of the problem (i.e., the feature space) to allow using dense vectors is not > suitable. Also, the solution I implemented supports this while allowing full > flexibility to the user --- i.e., using the default dense vector > implementation or selecting an alternative (only when the default is not > desired). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
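The VectorFactory idea described in the issue can be sketched as a standalone Scala toy. Everything here is illustrative, not the actual patch: the names VectorFactory and SmartVectorFactory come from the description, but the method signatures and the one-third density threshold are my assumptions.

```scala
object VectorFactorySketch {
  // Minimal stand-ins for MLlib's dense/sparse vectors.
  sealed trait Vec { def size: Int; def apply(i: Int): Double }
  final case class DenseVec(values: Array[Double]) extends Vec {
    def size: Int = values.length
    def apply(i: Int): Double = values(i)
  }
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec {
    def apply(i: Int): Double = {
      val j = java.util.Arrays.binarySearch(indices, i)
      if (j >= 0) values(j) else 0.0
    }
  }

  // The factory decides the representation of vectors built inside the algorithm.
  trait VectorFactory {
    def build(size: Int, nonzeros: Seq[(Int, Double)]): Vec
  }

  // Backward-compatible default: always dense, like the current KMeans behavior.
  object DenseVectorFactory extends VectorFactory {
    def build(size: Int, nonzeros: Seq[(Int, Double)]): Vec = {
      val arr = new Array[Double](size)
      nonzeros.foreach { case (i, v) => arr(i) = v }
      DenseVec(arr)
    }
  }

  // "Smart" factory: go sparse when fewer than a third of the entries are nonzero.
  object SmartVectorFactory extends VectorFactory {
    def build(size: Int, nonzeros: Seq[(Int, Double)]): Vec =
      if (nonzeros.length * 3 < size) {
        val sorted = nonzeros.sortBy(_._1)
        SparseVec(size, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
      } else DenseVectorFactory.build(size, nonzeros)
  }
}
```

The point of routing all vector construction through one trait is that callers (the KMeans inner loop) stay unchanged while the memory/time trade-off is selected in one place.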
[jira] [Created] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
Petar Zecevic created SPARK-13313: - Summary: Strongly connected components doesn't find all strongly connected components Key: SPARK-13313 URL: https://issues.apache.org/jira/browse/SPARK-13313 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.6.0 Reporter: Petar Zecevic The strongly connected components algorithm doesn't find all strongly connected components. I was using the Wikispeedia dataset (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 SCCs, one of which had 4051 vertices that in reality have no edges between them. I think the problem could be on line 89 of the StronglyConnectedComponents.scala file, where EdgeDirection.In should be changed to EdgeDirection.Out. I believe the second Pregel call should use the Out edge direction, the same as the first call, because the direction is reversed in the provided sendMsg function (the message is sent to the source vertex and not the destination vertex). If that is changed (line 89), the algorithm starts finding many more SCCs, but eventually a stack overflow exception occurs. I believe graph objects that are changed through iterations should not be cached but checkpointed.
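To make the reported symptom concrete, here is a minimal, Spark-free Kosaraju SCC implementation (the names and structure are mine, not GraphX's). On a purely one-way chain there is no mutual reachability, so every vertex must be its own SCC; grouping such vertices into one large component, as reported above, violates the definition.

```scala
object SccSketch {
  // Kosaraju's algorithm: DFS on the forward graph to get a finish order,
  // then DFS on the reversed graph in that order; each tree is one SCC.
  def scc(n: Int, edges: Seq[(Int, Int)]): Seq[Set[Int]] = {
    val fwd = Array.fill(n)(List.empty[Int])
    val rev = Array.fill(n)(List.empty[Int])
    edges.foreach { case (u, v) => fwd(u) = v :: fwd(u); rev(v) = u :: rev(v) }

    // Pass 1: record vertices in reverse finish order.
    val seen = Array.fill(n)(false)
    var order = List.empty[Int]
    def dfs1(u: Int): Unit = {
      seen(u) = true
      fwd(u).foreach(v => if (!seen(v)) dfs1(v))
      order = u :: order
    }
    (0 until n).foreach(u => if (!seen(u)) dfs1(u))

    // Pass 2: components are DFS trees in the *reversed* graph.
    val comp = Array.fill(n)(-1)
    var c = 0
    def dfs2(u: Int): Unit = {
      comp(u) = c
      rev(u).foreach(v => if (comp(v) < 0) dfs2(v))
    }
    order.foreach { u => if (comp(u) < 0) { dfs2(u); c += 1 } }
    (0 until c).map(i => (0 until n).filter(comp(_) == i).toSet)
  }
}
```

Note this recursive sketch would itself overflow the stack on deep graphs, which is the same failure mode the reporter hit in GraphX; an iterative DFS (or, in Spark, checkpointing instead of caching the per-iteration graph) avoids unbounded lineage/stack growth.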
[jira] [Updated] (SPARK-13309) Incorrect type inference for CSV data.
[ https://issues.apache.org/jira/browse/SPARK-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13309: Description: Type inference for CSV data does not work as expected when the data is sparse. For instance: Consider the following datasets and the inferred schema: {code} A,B,C,D 1,,, ,1,, ,,1, ,,,1 {code} {code} root |-- A: integer (nullable = true) |-- B: integer (nullable = true) |-- C: string (nullable = true) |-- D: string (nullable = true) {code} Here all the fields should have been inferred as Integer types, but clearly the inferred schema is different. Another dataset: {code} A,B,C,D 1,,1, {code} and the inferred schema: {code} root |-- A: string (nullable = true) |-- B: string (nullable = true) |-- C: string (nullable = true) |-- D: string (nullable = true) {code} Here, fields A & C should be inferred as Integer types. The same issue has been discussed in the spark-csv package. Please take a look at https://github.com/databricks/spark-csv/issues/216 for reference. The issue was fixed with https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d. I will try to submit a PR with the patch soon. was: Type inference for CSV data does not work as expected when the data is sparse. For instance: Consider the following datasets and the inferred schema: A,B,C,D 1,,, ,1,, ,,1, ,,,1 root |-- A: integer (nullable = true) |-- B: integer (nullable = true) |-- C: string (nullable = true) |-- D: string (nullable = true) Here all the fields should have been inferred as Integer types, but clearly the inferred schema is different. Another dataset: A,B,C,D 1,,1, and the inferred schema: root |-- A: string (nullable = true) |-- B: string (nullable = true) |-- C: string (nullable = true) |-- D: string (nullable = true) Here, fields A & C should be inferred as Integer types. Same issue has been discussed on spark-csv package.
Please take a look at https://github.com/databricks/spark-csv/issues/216 for reference. The issue was fixed with https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d. I will try to submit PR with the patch soon. > Incorrect type inference for CSV data. > -- > > Key: SPARK-13309 > URL: https://issues.apache.org/jira/browse/SPARK-13309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Rahul Tanwani > Fix For: 1.6.0 > > > Type inference for CSV data does not work as expected when the data is > sparse. > For instance: Consider the following datasets and the inferred schema: > {code} > A,B,C,D > 1,,, > ,1,, > ,,1, > ,,,1 > {code} > {code} > root > |-- A: integer (nullable = true) > |-- B: integer (nullable = true) > |-- C: string (nullable = true) > |-- D: string (nullable = true) > {code} > Here all the fields should have been inferred as Integer types, but clearly > the inferred schema is different. > Another dataset: > {code} > A,B,C,D > 1,,1, > {code} > and the inferred schema: > {code} > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > |-- C: string (nullable = true) > |-- D: string (nullable = true) > {code} > Here, fields A & C should be inferred as Integer types. > The same issue has been discussed in the spark-csv package. Please take a look at > https://github.com/databricks/spark-csv/issues/216 for reference. > The issue was fixed with > https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d. > I will try to submit a PR with the patch soon.
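A minimal sketch of the kind of fix the spark-csv commit describes: infer each column's type by folding over rows and treating empty fields as nulls that never influence the result, rather than letting them force StringType. The type names echo Spark SQL's, but this toy is my own illustration, not the actual spark-csv patch.

```scala
object CsvInferSketch {
  sealed trait DType
  case object NullType extends DType    // nothing seen yet (empty cells only)
  case object IntegerType extends DType
  case object StringType extends DType

  // Toy per-cell rule: empty means "no evidence", digits mean integer.
  private def typeOf(field: String): DType =
    if (field.isEmpty) NullType
    else if (field.nonEmpty && field.forall(_.isDigit)) IntegerType
    else StringType

  // Widen two candidate types: NullType (an empty cell) never changes the result,
  // which is exactly what the buggy inference failed to do.
  private def merge(a: DType, b: DType): DType = (a, b) match {
    case (NullType, t)    => t
    case (t, NullType)    => t
    case (x, y) if x == y => x
    case _                => StringType
  }

  def inferSchema(rows: Seq[Seq[String]]): Seq[DType] =
    rows.transpose.map(col => col.map(typeOf).foldLeft(NullType: DType)(merge))
}
```

Run on the sparse dataset from the description (one nonempty integer per row), every column widens from NullType to IntegerType instead of collapsing to StringType.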