[jira] [Comment Edited] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147032#comment-15147032
 ] 

Xiao Li edited comment on SPARK-13320 at 2/15/16 7:57 AM:
--

{code}
checkAnswer(sql(
  """
| SELECT min(struct(record.*)) FROM
|   (select a as a, struct(a,b) as record from testData2) tmp
| GROUP BY a
  """.stripMargin),
  Row(Row(1, 1)) :: Row(Row(2, 1)) :: Row(Row(3, 1)) :: Nil)
{code}
Above is a query I found in the {{SQLQuerySuite}}. 

Before submitting a PR, I am wondering whether the following query is valid:
{code}
structDf.groupBy($"a").agg(min(struct($"record.*")))
{code}

So far it does not work; it fails with the error message: {{cannot resolve 'a' 
given input columns: [a, b];}}
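
For comparison, a minimal sketch of the SQL route (assuming, as in the suite query above, that {{structDf}} has an integer column {{a}} and a struct column {{record}}); registering a temporary table lets the SQL analyzer expand {{record.*}}:
{code}
// Sketch only: the same aggregation as the Dataset call above, expressed via SQL.
structDf.registerTempTable("structDf")
sqlContext.sql("SELECT min(struct(record.*)) FROM structDf GROUP BY a")
{code}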


was (Author: smilegator):
{code}
checkAnswer(sql(
  """
| SELECT min(struct(record.*)) FROM
|   (select a as a, struct(a,b) as record from testData2) tmp
| GROUP BY a
  """.stripMargin),
  Row(Row(1, 1)) :: Row(Row(2, 1)) :: Row(Row(3, 1)) :: Nil)
{code}
Above is a query I found in test case. 

Before submitting a PR, I am wondering if the following query is valid? 
{code}
structDf.groupBy($"a").agg(min(struct($"record.*")))
{code}

So far, it does not work. It outputs an error message: {{cannot resolve 'a' 
given input columns: [a, b];}}

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 

[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147032#comment-15147032
 ] 

Xiao Li commented on SPARK-13320:
-

{code}
checkAnswer(sql(
  """
| SELECT min(struct(record.*)) FROM
|   (select a as a, struct(a,b) as record from testData2) tmp
| GROUP BY a
  """.stripMargin),
  Row(Row(1, 1)) :: Row(Row(2, 1)) :: Row(Row(3, 1)) :: Nil)
{code}
Above is a query I found in a test case. 

Before submitting a PR, I am wondering whether the following query is valid:
{code}
structDf.groupBy($"a").agg(min(struct($"record.*")))
{code}

So far it does not work; it fails with the error message: {{cannot resolve 'a' 
given input columns: [a, b];}}

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>  

[jira] [Commented] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"

2016-02-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147022#comment-15147022
 ] 

Saisai Shao commented on SPARK-13220:
-

[~andrewor14], mind if I take a crack at this?

> Deprecate "yarn-client" and "yarn-cluster"
> --
>
> Key: SPARK-13220
> URL: https://issues.apache.org/jira/browse/SPARK-13220
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Andrew Or
>
> We currently allow `\-\-master yarn-client`. Instead, the user should do 
> `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more 
> consistent with other cluster managers and obviates the need to do special 
> parsing of the master string.
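
As an illustration, a minimal sketch of the explicit form when launching programmatically (hypothetical jar and class names; equivalent to passing {{--master yarn --deploy-mode client}} to spark-submit):
{code}
import org.apache.spark.launcher.SparkLauncher

// Explicit cluster manager and deploy mode instead of the combined "yarn-client" string.
val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")   // hypothetical application jar
  .setMainClass("com.example.MyApp")       // hypothetical main class
  .setMaster("yarn")                       // cluster manager only
  .setDeployMode("client")                 // deploy mode spelled out
  .launch()
{code}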






[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-02-14 Thread Petar Zecevic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147007#comment-15147007
 ] 

Petar Zecevic commented on SPARK-13313:
---

No, I don't think it has anything to do with that. The vertices of that largest 
SCC are not connected in any way, so they shouldn't be in the same group.
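
For reference, a minimal reproduction sketch (assuming the Wikispeedia links are exported to a plain edge-list file with numeric vertex IDs at a hypothetical path):
{code}
import org.apache.spark.graphx.GraphLoader

// Load the directed link graph and run the GraphX SCC algorithm.
val graph = GraphLoader.edgeListFile(sc, "/path/to/wikispeedia_edges.txt")
val scc = graph.stronglyConnectedComponents(numIter = 20)

// Component sizes; the report is that one component of 4051 vertices shows up
// even though those vertices are not mutually reachable.
val componentSizes = scc.vertices.map { case (_, componentId) => componentId }.countByValue()
println(componentSizes.size + " components, largest = " + componentSizes.values.max)
{code}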

> Strongly connected components doesn't find all strongly connected components
> 
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Petar Zecevic
>
> The strongly connected components algorithm doesn't find all strongly connected 
> components. I was using the Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html), and the algorithm found 519 
> SCCs, one of which had 4051 vertices that in reality have no edges 
> between them. 
> I think the problem could be on line 89 of the StronglyConnectedComponents.scala 
> file, where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
> the second Pregel call should use the Out edge direction, the same as the first 
> call, because the direction is reversed in the provided sendMsg function 
> (the message is sent to the source vertex, not the destination vertex).
> If that is changed (line 89), the algorithm starts finding many more SCCs, 
> but eventually a stack overflow exception occurs. I believe graph objects that 
> change across iterations should not be cached, but checkpointed.






[jira] [Commented] (SPARK-11334) numRunningTasks can't be less than 0, or it will affect executor allocation

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146983#comment-15146983
 ] 

Apache Spark commented on SPARK-11334:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11205

> numRunningTasks can't be less than 0, or it will affect executor allocation
> ---
>
> Key: SPARK-11334
> URL: https://issues.apache.org/jira/browse/SPARK-11334
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: meiyoula
>Assignee: meiyoula
>
> With the *Dynamic Allocation* feature, when a task fails more than *maxFailure* times, all 
> the dependent jobs, stages, and tasks will be killed or aborted. In this process, the 
> *SparkListenerTaskEnd* event can arrive after *SparkListenerStageCompleted* 
> and *SparkListenerJobEnd*, like the event log below:
> {code}
> {"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":20,"Stage 
> Attempt ID":0,"Stage Name":"run at AccessController.java:-2","Number of 
> Tasks":200}
> {"Event":"SparkListenerJobEnd","Job ID":9,"Completion Time":1444914699829}
> {"Event":"SparkListenerTaskEnd","Stage ID":20,"Stage Attempt ID":0,"Task 
> Type":"ResultTask","Task End Reason":{"Reason":"TaskKilled"},"Task 
> Info":{"Task ID":1955,"Index":88,"Attempt":2,"Launch 
> Time":1444914699763,"Executor 
> ID":"5","Host":"linux-223","Locality":"PROCESS_LOCAL","Speculative":false,"Getting
>  Result Time":0,"Finish Time":1444914699864,"Failed":true,"Accumulables":[]}}
> {code}
> Because of that, *numRunningTasks* in the *ExecutorAllocationManager* class can 
> become less than 0, which affects executor allocation.
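
An illustrative sketch (not the actual ExecutorAllocationManager code) of the kind of guard this implies, using the public listener API:
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

// Counts running tasks from listener events; a late SparkListenerTaskEnd (arriving
// after the stage/job end, as in the event log above) would otherwise drive the
// count negative.
class RunningTaskCounter extends SparkListener {
  private var numRunningTasks = 0

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    numRunningTasks += 1
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    // Clamp at zero so a straggling TaskEnd cannot push the counter below 0
    // and skew executor-allocation decisions.
    numRunningTasks = math.max(0, numRunningTasks - 1)
  }
}
{code}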






[jira] [Assigned] (SPARK-13321) Support nested UNION in parser

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13321:


Assignee: (was: Apache Spark)

> Support nested UNION in parser
> --
>
> Key: SPARK-13321
> URL: https://issues.apache.org/jira/browse/SPARK-13321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> The following SQL cannot be parsed by the current parser:
> {code}
> SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL 
> (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM 
> `default`.`t0`)) AS u_1
> {code}
> We should fix it.






[jira] [Assigned] (SPARK-13321) Support nested UNION in parser

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13321:


Assignee: Apache Spark

> Support nested UNION in parser
> --
>
> Key: SPARK-13321
> URL: https://issues.apache.org/jira/browse/SPARK-13321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> The following SQL cannot be parsed by the current parser:
> {code}
> SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL 
> (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM 
> `default`.`t0`)) AS u_1
> {code}
> We should fix it.






[jira] [Commented] (SPARK-13321) Support nested UNION in parser

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146946#comment-15146946
 ] 

Apache Spark commented on SPARK-13321:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11204

> Support nested UNION in parser
> --
>
> Key: SPARK-13321
> URL: https://issues.apache.org/jira/browse/SPARK-13321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> The following SQL cannot be parsed by the current parser:
> {code}
> SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL 
> (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM 
> `default`.`t0`)) AS u_1
> {code}
> We should fix it.






[jira] [Created] (SPARK-13321) Support nested UNION in parser

2016-02-14 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13321:
---

 Summary: Support nested UNION in parser
 Key: SPARK-13321
 URL: https://issues.apache.org/jira/browse/SPARK-13321
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


The following SQL cannot be parsed by the current parser:

{code}
SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL 
(SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM 
`default`.`t0`)) AS u_1
{code}

We should fix it.
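
Until the parser handles the nested parenthesized unions, an equivalent plan can be built through the DataFrame API; a sketch, assuming the table `default`.`t0` exists:
{code}
// Plain single-table selects parse fine; only the nested parenthesized UNION fails.
val t0 = sqlContext.sql("SELECT `t0`.`id` FROM `default`.`t0`")
// Same three-way UNION ALL as the SQL above, expressed without nested parentheses.
val u1 = t0.unionAll(t0).unionAll(t0)
{code}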






[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146838#comment-15146838
 ] 

Xiao Li commented on SPARK-13320:
-

Sure, will do it. Thanks!

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by "_1".





[jira] [Resolved] (SPARK-12503) Pushdown a Limit on top of a Union

2016-02-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12503.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Pushdown a Limit on top of a Union
> --
>
> Key: SPARK-12503
> URL: https://issues.apache.org/jira/browse/SPARK-12503
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Xiao Li
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> "Rule that applies to a Limit on top of a Union. The original Limit won't go 
> away after applying this rule, but additional Limit nodes will be created on 
> top of each child of Union, so that these children produce less rows and 
> Limit can be further optimized for children Relations."
> -- from https://issues.apache.org/jira/browse/CALCITE-832
> Also, the same topic in Hive: https://issues.apache.org/jira/browse/HIVE-11775
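
A minimal sketch of the query shape the rule targets (hypothetical DataFrames {{a}} and {{b}}):
{code}
// Original: a single limit on top of the union.
val combined = a.unionAll(b).limit(10)

// After the rewrite the plan is roughly equivalent to limiting each child first,
// while the outer limit is kept:
val pushed = a.limit(10).unionAll(b.limit(10)).limit(10)
{code}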






[jira] [Commented] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source

2016-02-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146824#comment-15146824
 ] 

Jeff Zhang commented on SPARK-11102:


[~sowen] Which ticket resolved this issue? I don't think SPARK-10709 resolved it.

> Uninformative exception when specifing non-exist input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception will be thrown, and it is not readable:
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}
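
A minimal reproduction sketch (hypothetical non-existent path; the exception surfaces during schema inference at read time):
{code}
// Reading JSON from a path that does not exist fails with the generic
// "java.io.IOException: No input paths specified in job" shown above.
val df = sqlContext.read.json("/no/such/path")
{code}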






[jira] [Reopened] (SPARK-12503) Pushdown a Limit on top of a Union

2016-02-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-12503:
-
  Assignee: Josh Rosen

> Pushdown a Limit on top of a Union
> --
>
> Key: SPARK-12503
> URL: https://issues.apache.org/jira/browse/SPARK-12503
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Xiao Li
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> "Rule that applies to a Limit on top of a Union. The original Limit won't go 
> away after applying this rule, but additional Limit nodes will be created on 
> top of each child of Union, so that these children produce less rows and 
> Limit can be further optimized for children Relations."
> -- from https://issues.apache.org/jira/browse/CALCITE-832
> Also, the same topic in Hive: https://issues.apache.org/jira/browse/HIVE-11775






[jira] [Resolved] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-13314.

Resolution: Not A Problem

I didn't realize that the extra vertical lines are used to indicate scopes of 
codegen'd stages.

> Malformed WholeStageCodegen tree string
> ---
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  : +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>:  +- Project [id#4L AS a#7L]
>: +- Range 0, 1, 8, 2, [id#4L]
> {noformat}






[jira] [Updated] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12720:
---
Assignee: Xiao Li

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Created] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13320:
---

 Summary: Confusing error message for Dataset API when using 
sum("*")
 Key: SPARK-13320
 URL: https://issues.apache.org/jira/browse/SPARK-13320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin


{code}
pagecounts4PartitionsDS
  .map(line => (line._1, line._3))
  .toDF()
  .groupBy($"_1")
  .agg(sum("*") as "sumOccurances")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns 
_1, _2;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
{code}

The error is with sum("*"), not the resolution of group by "_1".







[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146815#comment-15146815
 ] 

Reynold Xin commented on SPARK-13320:
-

cc [~smilegator] not sure if you have time. If you do, mind looking into this?

cc [~marmbrus] and [~cloud_fan]


> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by 

[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-02-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146814#comment-15146814
 ] 

Jeff Zhang commented on SPARK-12846:


Added the context mail thread to the description; [~felixcheung] will work on it.

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jeff Zhang
>
> Add the background context mail therad 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html






[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-02-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12846:
---
Description: Add the background context mail thread 
http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html
  (was: Add the background context mail therad 
http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html)

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jeff Zhang
>
> Add the background context mail thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html






[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-02-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12846:
---
Description: Add the background context mail therad 
http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jeff Zhang
>
> Add the background context mail therad 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html






[jira] [Commented] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146800#comment-15146800
 ] 

Apache Spark commented on SPARK-13318:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11203

> Model export/import for spark.ml: ElementwiseProduct
> 
>
> Key: SPARK-13318
> URL: https://issues.apache.org/jira/browse/SPARK-13318
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> Add save/load to ElementwiseProduct






[jira] [Assigned] (SPARK-13036) PySpark ml.feature support export/import

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13036:


Assignee: Apache Spark

> PySpark ml.feature support export/import
> 
>
> Key: SPARK-13036
> URL: https://issues.apache.org/jira/browse/SPARK-13036
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/feature.py. Please refer to the implementation 
> in SPARK-13032.






[jira] [Assigned] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13318:


Assignee: (was: Apache Spark)

> Model export/import for spark.ml: ElementwiseProduct
> 
>
> Key: SPARK-13318
> URL: https://issues.apache.org/jira/browse/SPARK-13318
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> Add save/load to ElementwiseProduct






[jira] [Assigned] (SPARK-13319) Pyspark VectorSlicer should have setDefault

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13319:


Assignee: Apache Spark

> Pyspark VectorSlicer should have setDefault
> ---
>
> Key: SPARK-13319
> URL: https://issues.apache.org/jira/browse/SPARK-13319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Assignee: Apache Spark
>Priority: Minor
>
> Pyspark VectorSlicer should have setDefault; otherwise, calling getNames or 
> getIndices will cause an error.






[jira] [Commented] (SPARK-13319) Pyspark VectorSlicer should have setDefault

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146801#comment-15146801
 ] 

Apache Spark commented on SPARK-13319:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11203

> Pyspark VectorSlicer should have setDefault
> ---
>
> Key: SPARK-13319
> URL: https://issues.apache.org/jira/browse/SPARK-13319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Pyspark VectorSlicer should have setDefault; otherwise, calling getNames or 
> getIndices will cause an error.






[jira] [Assigned] (SPARK-13319) Pyspark VectorSlicer should have setDefault

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13319:


Assignee: (was: Apache Spark)

> Pyspark VectorSlicer should have setDefault
> ---
>
> Key: SPARK-13319
> URL: https://issues.apache.org/jira/browse/SPARK-13319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Pyspark VectorSlicer should have setDefault; otherwise, calling getNames or 
> getIndices will cause an error.






[jira] [Commented] (SPARK-13036) PySpark ml.feature support export/import

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146799#comment-15146799
 ] 

Apache Spark commented on SPARK-13036:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11203

> PySpark ml.feature support export/import
> 
>
> Key: SPARK-13036
> URL: https://issues.apache.org/jira/browse/SPARK-13036
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/feature.py. Please refer to the implementation 
> in SPARK-13032.






[jira] [Assigned] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13318:


Assignee: Apache Spark

> Model export/import for spark.ml: ElementwiseProduct
> 
>
> Key: SPARK-13318
> URL: https://issues.apache.org/jira/browse/SPARK-13318
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Assignee: Apache Spark
>Priority: Minor
>
> Add save/load to ElementwiseProduct






[jira] [Assigned] (SPARK-13036) PySpark ml.feature support export/import

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13036:


Assignee: (was: Apache Spark)

> PySpark ml.feature support export/import
> 
>
> Key: SPARK-13036
> URL: https://issues.apache.org/jira/browse/SPARK-13036
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/feature.py. Please refer to the implementation 
> in SPARK-13032.






[jira] [Created] (SPARK-13319) Pyspark VectorSlicer should have setDefault

2016-02-14 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-13319:
-

 Summary: Pyspark VectorSlicer should have setDefault
 Key: SPARK-13319
 URL: https://issues.apache.org/jira/browse/SPARK-13319
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Xusen Yin
Priority: Minor


Pyspark VectorSlicer should have setDefault; otherwise, calling getNames or 
getIndices will cause an error.






[jira] [Created] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-02-14 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-13318:
-

 Summary: Model export/import for spark.ml: ElementwiseProduct
 Key: SPARK-13318
 URL: https://issues.apache.org/jira/browse/SPARK-13318
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin
Priority: Minor


Add save/load to ElementwiseProduct






[jira] [Resolved] (SPARK-13185) Improve the performance of DateTimeUtils.StringToDate by reusing Calendar objects

2016-02-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13185.
-
   Resolution: Fixed
 Assignee: Carson Wang
Fix Version/s: 2.0.0

> Improve the performance of DateTimeUtils.StringToDate by reusing Calendar 
> objects
> -
>
> Key: SPARK-13185
> URL: https://issues.apache.org/jira/browse/SPARK-13185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Minor
> Fix For: 2.0.0
>
>
> It is expensive to create Java Calendar objects in each method of 
> DateTimeUtils. We can reuse the objects to improve performance. In one of 
> my SQL queries, which calls StringToDate many times, the duration of the stage 
> improved from 1.6 minutes to 1.2 minutes.
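
A minimal sketch of the reuse pattern described (not the actual DateTimeUtils change; the object and helper names are hypothetical):
{code}
import java.util.{Calendar, TimeZone}

object CalendarReuseSketch {
  // One Calendar per thread, created once and reset before each use, instead of
  // calling Calendar.getInstance() inside every conversion method.
  private val threadLocalCalendar = new ThreadLocal[Calendar] {
    override def initialValue(): Calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
  }

  def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
    val cal = threadLocalCalendar.get()
    cal.clear()                    // wipe state left over from the previous call
    cal.set(year, month - 1, day)  // Calendar months are 0-based
    (cal.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt
  }
}
{code}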






[jira] [Comment Edited] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146765#comment-15146765
 ] 

Xiao Li edited comment on SPARK-13307 at 2/14/16 10:49 PM:
---

In the following PR, https://github.com/apache/spark/pull/9645, shuffle hash 
join was removed from Spark SQL. Try to see whether broadcast join works in this 
test case. You can also use BroadcastHint to force a broadcast join.
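
A minimal sketch of forcing it from the DataFrame API (hypothetical DataFrames {{large}} and {{small}} joined on a column {{key}}):
{code}
import org.apache.spark.sql.functions.broadcast

// Wrapping one side in broadcast(...) plants a BroadcastHint in the logical plan,
// so the planner chooses a broadcast join regardless of the size threshold.
val joined = large.join(broadcast(small), "key")
{code}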

Let me CC [~rxin] [~yhuai] [~marmbrus]


was (Author: smilegator):
In the following PR: https://github.com/apache/spark/pull/9645, shuffle hash 
join is removed from Spark SQL. Try to see if broadcast join works in this test 
case. You also can use hint to force the broadcast join. 

Let me CC [~rxin] [~yhuai] [~marmbrus]

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> The majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, on 
> average about 9% faster. A few degraded, and one that is definitely not 
> within the error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> I collected the physical plans from both versions; the drastic difference may 
> come partially from using Tungsten in 1.6, but is anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146765#comment-15146765
 ] 

Xiao Li commented on SPARK-13307:
-

In the following PR: https://github.com/apache/spark/pull/9645, shuffle hash 
join is removed from Spark SQL. Please check whether a broadcast join works in 
this test case. You can also use a hint to force the broadcast join. 

Let me CC [~rxin] [~yhuai] [~marmbrus]

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average 
> about 9% faster. There are a few degraded, and one that is definitely not 
> within error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> Collected the physical plans from both versions - drastic difference maybe 
> partially from using Tungsten in 1.6, but anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146755#comment-15146755
 ] 

Xiao Li commented on SPARK-13307:
-

Please tune {{spark.sql.autoBroadcastJoinThreshold}} to enable the broadcast 
join. Thanks!
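
For example (a sketch; the threshold value is arbitrary), the setting can be raised at runtime so that smaller tables get broadcast:
{code}
// Broadcast any table smaller than ~100 MB (value is illustrative).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
{code}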

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average 
> about 9% faster. There are a few degraded, and one that is definitely not 
> within error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> Collected the physical plans from both versions - drastic difference maybe 
> partially from using Tungsten in 1.6, but anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146751#comment-15146751
 ] 

Xiao Li commented on SPARK-13307:
-

1.6.1 is using SortMergeJoin, but 1.4.1 is using ShuffleHashJoin. I believe 
this is the major cause of the performance difference. 
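
One quick way to confirm which physical join a given query picks up (a sketch with hypothetical table and column names):
{code}
// Look for SortMergeJoin vs. BroadcastHashJoin / ShuffledHashJoin in the output.
val q = sqlContext.table("t1").join(sqlContext.table("t2"), "key")
q.explain()
{code}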

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average 
> about 9% faster. There are a few degraded, and one that is definitely not 
> within error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> Collected the physical plans from both versions - drastic difference maybe 
> partially from using Tungsten in 1.6, but anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146746#comment-15146746
 ] 

Sean Owen commented on SPARK-13313:
---

Dumb question, but is this the difference between directed and undirected 
graphs? That is, is GraphX reading these as directed edges only?
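
For what it's worth, a tiny sketch (a made-up three-node graph, not the Wikispeedia data) of how the directed-vs-undirected distinction shows up with Graph.fromEdgeTuples:
{code}
import org.apache.spark.graphx.Graph

// Each (src, dst) tuple becomes one directed edge; to model an undirected
// graph you have to add the reversed tuples yourself.
val pairs = sc.parallelize(Seq((1L, 2L), (2L, 3L)))
val directed = Graph.fromEdgeTuples(pairs, defaultValue = 0)
val undirected = Graph.fromEdgeTuples(pairs.union(pairs.map(_.swap)), defaultValue = 0)

// Directed chain 1 -> 2 -> 3: every vertex is its own SCC.
println(directed.stronglyConnectedComponents(5).vertices.collect().toSeq)
// With both directions added, all three vertices fall into one SCC.
println(undirected.stronglyConnectedComponents(5).vertices.collect().toSeq)
{code}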

> Strongly connected components doesn't find all strongly connected components
> 
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Petar Zecevic
>
> Strongly connected components algorithm doesn't find all strongly connected 
> components. I was using Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 
> SCCs and one of them had 4051 vertices, which in reality don't have any edges 
> between them. 
> I think the problem could be on line 89 of StronglyConnectedComponents.scala 
> file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
> the second Pregel call should use Out edge direction, the same as the first 
> call because the direction is reversed in the provided sendMsg function 
> (message is sent to source vertex and not destination vertex).
> If that is changed (line 89), the algorithm starts finding much more SCCs, 
> but eventually stack overflow exception occurs. I believe graph objects that 
> are changed through iterations should not be cached, but checkpointed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-02-14 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146740#comment-15146740
 ] 

JESSE CHEN commented on SPARK-13307:


Uploaded newly collected plans (logical, analyzed, optimized and physical). 

Links are the same:

https://ibm.box.com/spark-sql-q66-debug-160plan
https://ibm.box.com/spark-sql-q66-debug-141plan

Please let me know about any additional info you need me to collect.
Thanks.


> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average 
> about 9% faster. There are a few degraded, and one that is definitely not 
> within error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> Collected the physical plans from both versions - drastic difference maybe 
> partially from using Tungsten in 1.6, but anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-02-14 Thread Petar Zecevic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146731#comment-15146731
 ] 

Petar Zecevic commented on SPARK-13313:
---

Yes, you need articles.tsv and links.tsv from this archive: 
http://snap.stanford.edu/data/wikispeedia/wikispeedia_paths-and-graph.tar.gz

Then parse the data, assign IDs to article names, and create the graph:
{code}
val articles = sc.textFile("articles.tsv", 6).
  filter(line => line.trim() != "" && !line.startsWith("#")).
  zipWithIndex().cache()
val links = sc.textFile("links.tsv", 6).
  filter(line => line.trim() != "" && !line.startsWith("#"))
val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).
  join(articles).map(x => x._2).
  join(articles).map(x => x._2)
val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0)
{code}

Then get the strongly connected components:
{code}
val wikiSCC = wikigraph.stronglyConnectedComponents(100)
{code}

The wikiSCC graph contains 519 SCCs, but there should be many more. The largest SCC 
in wikiSCC has 4051 vertices, and that is obviously wrong.

The change on line 89, which I mentioned, seems to solve this problem, but then 
other issues arise (stack overflow etc.) and I don't have time to investigate 
further. I hope someone will look into this.



> Strongly connected components doesn't find all strongly connected components
> 
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Petar Zecevic
>
> Strongly connected components algorithm doesn't find all strongly connected 
> components. I was using Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 
> SCCs and one of them had 4051 vertices, which in reality don't have any edges 
> between them. 
> I think the problem could be on line 89 of StronglyConnectedComponents.scala 
> file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
> the second Pregel call should use Out edge direction, the same as the first 
> call because the direction is reversed in the provided sendMsg function 
> (message is sent to source vertex and not destination vertex).
> If that is changed (line 89), the algorithm starts finding much more SCCs, 
> but eventually stack overflow exception occurs. I believe graph objects that 
> are changed through iterations should not be cached, but checkpointed.
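
As a generic illustration of that last point (not code from GraphX or from a patch), checkpointing every few iterations truncates the lineage that otherwise keeps growing across iterations and can eventually overflow the stack; the iteration body and checkpoint directory below are stand-ins:
{code}
sc.setCheckpointDir("/tmp/spark-checkpoints")    // path is illustrative

var ranks = sc.parallelize(1L to 1000L).map(v => (v, 1.0))
for (i <- 1 to 100) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15)       // stand-in for one iteration's update
  ranks.cache()
  if (i % 10 == 0) {
    ranks.checkpoint()                           // truncates the lineage
    ranks.count()                                // forces materialization of the checkpoint
  }
}
{code}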



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-14 Thread Ankit Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146728#comment-15146728
 ] 

Ankit Jindal commented on SPARK-12969:
--

Hi,
I have tried your code with Java 1.8.0_66 and Spark 1.6 in local mode, and it 
works as expected.

Can you provide the command you are using to run this?

Regards,
Ankit
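
For comparison, a roughly equivalent spark-shell repro of the reported snippet (a sketch, assuming Spark 1.6):
{code}
val jsonData = Seq("""{"d":"2015-02-01","n":1}""")
val df = sqlContext.read.json(sc.parallelize(jsonData))
// Cast the string column (yyyy-MM-dd) to the date type and show it.
df.select(df("d").cast("date")).show()
{code}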

> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List<String> jsonData = Arrays.asList("{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD<String> dataRDD = this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue
> This happens only if we execute the program in client mode , it works if we 
> execute through spark submit. Here is the sample project : 
> https://github.com/uhonnavarkar/spark_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13294) Don't build assembly in dev/run-tests

2016-02-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-13294:
--

Assignee: Josh Rosen

> Don't build assembly in dev/run-tests
> -
>
> Key: SPARK-13294
> URL: https://issues.apache.org/jira/browse/SPARK-13294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As of SPARK-9284 we should no longer need to build the full Spark assembly 
> JAR in order to run tests. Therefore, we should remove the assembly step from 
> {{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13294) Don't build assembly in dev/run-tests

2016-02-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13294:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11157

> Don't build assembly in dev/run-tests
> -
>
> Key: SPARK-13294
> URL: https://issues.apache.org/jira/browse/SPARK-13294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>
> As of SPARK-9284 we should no longer need to build the full Spark assembly 
> JAR in order to run tests. Therefore, we should remove the assembly step from 
> {{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12772) Better error message for syntax error in the SQL parser

2016-02-14 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146710#comment-15146710
 ] 

Herman van Hovell commented on SPARK-12772:
---

I'll have a look.

> Better error message for syntax error in the SQL parser
> ---
>
> Key: SPARK-12772
> URL: https://issues.apache.org/jira/browse/SPARK-12772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>
> {code}
> scala> sql("select case if(true, 'one', 'two')").explain(true)
> org.apache.spark.sql.AnalysisException: org.antlr.runtime.EarlyExitException
> line 1:34 required (...)+ loop did not match anything at input '' in 
> case expression
> ; line 1 pos 34
>   at 
> org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:140)
>   at 
> org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:129)
>   at 
> org.apache.spark.sql.catalyst.parser.ParseDriver$.parse(ParseDriver.scala:77)
>   at 
> org.apache.spark.sql.catalyst.CatalystQl.createPlan(CatalystQl.scala:53)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> {code}
> Is there a way to say something better other than "required (...)+ loop did 
> not match anything at input"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146704#comment-15146704
 ] 

Christopher Bourez commented on SPARK-13317:


Because installing notebooks such as Zeppelin or IScala on the cluster does not 
make a lot of sense.

> SPARK_LOCAL_IP does not bind on Slaves
> --
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources but the driver which cannot connect to the 
> slave given the wrong IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146698#comment-15146698
 ] 

Apache Spark commented on SPARK-10759:
--

User 'JeremyNixon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11202

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Lauren Moos
>Priority: Minor
>  Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10759) Missing Python code example in ML Programming guide

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10759:


Assignee: Apache Spark  (was: Lauren Moos)

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146693#comment-15146693
 ] 

DOAN DuyHai commented on SPARK-13317:
-

To complement this JIRA, I would say that the issue is:

*how to configure Spark to use the public IP address for slaves on machines with 
multiple network interfaces*?

> SPARK_LOCAL_IP does not bind on Slaves
> --
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources but the driver which cannot connect to the 
> slave given the wrong IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:02 PM:
-

I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
This gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com 
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com, etc.
If I launch a job in client mode from another network, for example from a 
Zeppelin notebook on my MacBook, whose configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see the following in the logs:

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
{code}
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
{code}
I tried connecting to the slaves, setting SPARK_LOCAL_IP in the slaves' 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IPs of the slaves when I execute a job in 
client mode (spark-shell or Zeppelin on my MacBook).
I think we should be able to work from different networks. Only the web UIs 
seem to be bound to the correct IP.


was (Author: christopher5106):
I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/

[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:01 PM:
-

I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
{code}
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
{code}
I tried connecting to the slaves, setting SPARK_LOCAL_IP in the slaves' 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IPs of the slaves.


was (Author: christopher5106):
I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerM

[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:00 PM:
-

I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
{code}
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
{code}
I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IP.


was (Author: christopher5106):
I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: 

[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM:
-

I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
{code}
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
{code}
I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IP.
Thanks,


was (Author: christopher5106):
I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 w

[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM:
-

I launch a cluster 
{code}
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
{code}
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
{code}
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
{code}
I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IP.
Thanks,


was (Author: christopher5106):
I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block m

[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM:
-

I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IP.
Thanks,


was (Author: christopher5106):
I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

`
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, B

[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:58 PM:
-

I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

`
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
`

These are private IPs that my MacBook cannot access, and when I launch a job the 
following error appears: 
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's 
spark-env.sh, and stopping and restarting all slaves from the master, but the 
Spark master still returns the private IP.
Thanks,


was (Author: christopher5106):
I launch a cluster 
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 
--copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 
launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc
If I launch a job in client mode from another network, for example in a 
Zeppelin notebook on my macbook, which configuration is equivalent to 
spark-shell 
--master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs : 

```
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockMana

[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689
 ] 

Christopher Bourez commented on SPARK-13317:


I launch a cluster with
 ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com 
and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com, etc.
If I launch a job in client mode from another network, for example from a 
Zeppelin notebook on my macbook, whose configuration is equivalent to 
spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see the following in the logs:

```
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 
(172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 
(172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 
(172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: 
app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 
(172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 
MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 
64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
```

which are private IPs that my macbook cannot access, and when launching a job, the 
following error appears:
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
I tried connecting to the slave, setting SPARK_LOCAL_IP in the slave's 
spark-env.sh, and stopping and restarting all slaves from the master, but the Spark 
master still returns the private IP.
Thanks,

> SPARK_LOCAL_IP does not bind on Slaves
> --
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources but the driver which cannot connect to the 
> slave given the wrong IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146646#comment-15146646
 ] 

Sean Owen commented on SPARK-13317:
---

Can you clarify -- are you setting SPARK_LOCAL_IP correctly on each machine? 
I'm not clear what is set where and what is used where.

> SPARK_LOCAL_IP does not bind on Slaves
> --
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources but the driver which cannot connect to the 
> slave given the wrong IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves

2016-02-14 Thread Christopher Bourez (JIRA)
Christopher Bourez created SPARK-13317:
--

 Summary: SPARK_LOCAL_IP does not bind on Slaves
 Key: SPARK-13317
 URL: https://issues.apache.org/jira/browse/SPARK-13317
 Project: Spark
  Issue Type: Bug
 Environment: Linux EC2, different VPC 
Reporter: Christopher Bourez


SPARK_LOCAL_IP does not bind to the provided IP on slaves.
When launching a job or a spark-shell from a second network, the returned IP 
for the slave is still the first IP of the slave. 
So the job fails with the message : 

Initial job has not accepted any resources; check your cluster UI to ensure 
that workers are registered and have sufficient resources

It is not a question of resources but of the driver, which cannot connect to the 
slave because it is given the wrong IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146140#comment-15146140
 ] 

Xiao Li edited comment on SPARK-13307 at 2/14/16 4:13 PM:
--

Could you provide logical plans, as suggested above? The attached only contains 
the physical plans. Thanks!


was (Author: smilegator):
Could you provided logical plans, as suggested above? The attached only 
contains the physical plans. Thanks!

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average 
> about 9% faster. There are a few degraded, and one that is definitely not 
> within error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> Collected the physical plans from both versions - drastic difference maybe 
> partially from using Tungsten in 1.6, but anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13316:
--
Affects Version/s: (was: 2.0.0)
 Priority: Minor  (was: Major)

OK to update the docs and/or make a better error message if you can.

> "SparkException: DStream has not been initialized" when restoring 
> StreamingContext from checkpoint and the dstream is created afterwards
> 
>
> Key: SPARK-13316
> URL: https://issues.apache.org/jira/browse/SPARK-13316
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Jacek Laskowski
>Priority: Minor
>
> I faced the issue today but [it was already reported on 
> SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the 
> reason is that a dstream is registered after a StreamingContext has been 
> recreated from checkpoint.
> It _appears_ that...no dstreams must be registered after a StreamingContext 
> has been recreated from checkpoint. It is *not* obvious at first.
> The code:
> {code}
> def createStreamingContext(): StreamingContext = {
> val ssc = new StreamingContext(sparkConf, Duration(1000))
> ssc.checkpoint(checkpointDir)
> ssc
> }
> val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)
> val socketStream = ssc.socketTextStream(...)
> socketStream.checkpoint(Seconds(1))
> socketStream.foreachRDD(...)
> {code}
> It should be described in docs at the very least and/or checked in the code 
> when the streaming computation starts.
> The exception is as follows:
> {code}
> org.apache.spark.SparkException: 
> org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been 
> initialized
>   at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579)
>   ... 43 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards

2016-02-14 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-13316:
---

 Summary: "SparkException: DStream has not been initialized" when 
restoring StreamingContext from checkpoint and the dstream is created afterwards
 Key: SPARK-13316
 URL: https://issues.apache.org/jira/browse/SPARK-13316
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Jacek Laskowski


I faced the issue today, but [it was already reported on 
SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago. The 
reason is that a dstream is registered after a StreamingContext has been 
recreated from a checkpoint.

It _appears_ that no dstreams may be registered after a StreamingContext has 
been recreated from a checkpoint. This is *not* obvious at first.

The code:

{code}
def createStreamingContext(): StreamingContext = {
val ssc = new StreamingContext(sparkConf, Duration(1000))
ssc.checkpoint(checkpointDir)
ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)

val socketStream = ssc.socketTextStream(...)
socketStream.checkpoint(Seconds(1))
socketStream.foreachRDD(...)
{code}
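
For comparison, a minimal sketch of the pattern that avoids the error: every dstream is created inside the factory function passed to {{getOrCreate}}, so the streams exist again when the context is recovered from the checkpoint. The socket host/port, the checkpoint path and the {{foreachRDD}} action below are assumptions for illustration only.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("checkpoint-example")
val checkpointDir = "/tmp/checkpoint"  // hypothetical checkpoint directory

def createStreamingContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Duration(1000))
  ssc.checkpoint(checkpointDir)
  // Register all dstreams here, so they are restored when the context is recreated
  val socketStream = ssc.socketTextStream("localhost", 9999)
  socketStream.checkpoint(Seconds(1))
  socketStream.foreachRDD(rdd => println(rdd.count()))
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
ssc.start()
ssc.awaitTermination()
{code}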

It should be described in docs at the very least and/or checked in the code 
when the streaming computation starts.

The exception is as follows:

{code}
org.apache.spark.SparkException: 
org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been 
initialized
  at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311)
  at 
org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329)
  at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
  at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
  at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at 
org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
  at 
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
  at 
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
  at 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589)
  at 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
  at 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
  at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
  at 
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585)
  at 
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579)
  ... 43 elided
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13309) Incorrect type inference for CSV data.

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13309:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 1.6.0)

> Incorrect type inference for CSV data.
> --
>
> Key: SPARK-13309
> URL: https://issues.apache.org/jira/browse/SPARK-13309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Rahul Tanwani
>Priority: Minor
>
> Type inference for CSV data does not work as expected when the data is 
> sparse. 
> For instance: Consider the following datasets and the inferred schema:
> {code}
> A,B,C,D
> 1,,,
> ,1,,
> ,,1,
> ,,,1
> {code}
> {code}
> root
> |-- A: integer (nullable = true)
> |-- B: integer (nullable = true)
> |-- C: string (nullable = true)
> |-- D: string (nullable = true)
> {code}
> Here all the fields should have been inferred as Integer types, but clearly 
> the inferred schema is different.
> Another dataset:
> {code}
> A,B,C,D
> 1,,1,
> {code}
> and the inferred schema:
> {code}
> root
> |-- A: string (nullable = true)
> |-- B: string (nullable = true)
> |-- C: string (nullable = true)
> |-- D: string (nullable = true)
> {code}
> Here, fields A & C should be inferred as Integer types. 
> Same issue has been discussed on spark-csv package. Please take a look at 
> https://github.com/databricks/spark-csv/issues/216 for reference. 
> The issue was fixed with 
> https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d.
>  I will try to submit PR with the patch soon.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13314:
--
Component/s: SQL

> Malformed WholeStageCodegen tree string
> ---
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  : +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>:  +- Project [id#4L AS a#7L]
>: +- Range 0, 1, 8, 2, [id#4L]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12869:
--
   Flags:   (was: Patch)
Target Version/s:   (was: 1.6.1)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 1.6.1)

[~Fokko] don't set fix/target version

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Fokko Driesprong
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.
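
For context, a minimal sketch of the conversion path under discussion, assuming a small hypothetical sparse matrix built in the spark-shell (where {{sc}} is the SparkContext):

{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Hypothetical 3x3 sparse matrix with three non-zero entries
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 2.0), MatrixEntry(2, 1, 3.0)))
val blockMat = new CoordinateMatrix(entries).toBlockMatrix()

// The conversion discussed above; it currently goes back through a CoordinateMatrix
val indexedRows = blockMat.toIndexedRowMatrix()
indexedRows.rows.collect().foreach(println)
{code}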



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-02-14 Thread Fokko Driesprong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-12869:
-
Flags: Patch
Affects Version/s: 1.6.0
 Target Version/s: 1.6.1
Fix Version/s: 1.6.1

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Fokko Driesprong
> Fix For: 1.6.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13315) multiple columns filtering

2016-02-14 Thread Hossein Vatani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Vatani closed SPARK-13315.
--

I found the solution:
NewDf=Df.filter((Df.Col1==A) | (Df.Col2==B))
(In Python, {{|}} binds more tightly than {{==}}, so each comparison needs its own parentheses.)

> multiple columns filtering
> --
>
> Key: SPARK-13315
> URL: https://issues.apache.org/jira/browse/SPARK-13315
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Hossein Vatani
>Priority: Minor
>
> Hi
> i tried to filter tow col like below:
> NewDf=Df.filter(Df.Col1==A | Df.Col2==B)
> but i got below
> Py4JError: An error occurred while calling o230.or. Trace:
> py4j.Py4JException: Method or([class java.lang.String]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> as I found, there is any capability to filter(conditions) and only 
> filter(condition) available. 
> P.S. OS:CentOS7,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13315) multiple columns filtering

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13315.
---
   Resolution: Invalid
Fix Version/s: (was: 1.6.0)

> multiple columns filtering
> --
>
> Key: SPARK-13315
> URL: https://issues.apache.org/jira/browse/SPARK-13315
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Hossein Vatani
>Priority: Minor
>
> Hi
> i tried to filter tow col like below:
> NewDf=Df.filter(Df.Col1==A | Df.Col2==B)
> but i got below
> Py4JError: An error occurred while calling o230.or. Trace:
> py4j.Py4JException: Method or([class java.lang.String]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> as I found, there is any capability to filter(conditions) and only 
> filter(condition) available. 
> P.S. OS:CentOS7,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12363:
--
Labels: backport-needed  (was: )

Is it realistic to expect another 1.3 or 1.4 release? I am not even sure 1.5.3 
will be formally released

> PowerIterationClustering test case failed if we deprecated KMeans.setRuns
> -
>
> Key: SPARK-12363
> URL: https://issues.apache.org/jira/browse/SPARK-12363
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Yanbo Liang
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
>
> We plan to deprecated `runs` of KMeans, PowerIterationClustering will 
> leverage KMeans to train model.
> I removed `setRuns` used in PowerIterationClustering, but one of the test 
> cases failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13315) multiple columns filtering

2016-02-14 Thread Hossein Vatani (JIRA)
Hossein Vatani created SPARK-13315:
--

 Summary: multiple columns filtering
 Key: SPARK-13315
 URL: https://issues.apache.org/jira/browse/SPARK-13315
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Hossein Vatani
Priority: Minor
 Fix For: 1.6.0


Hi,
I tried to filter on two columns like below:
NewDf=Df.filter(Df.Col1==A | Df.Col2==B)
but I got the error below:
Py4JError: An error occurred while calling o230.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

As far as I can tell, there is no filter(conditions) capability; only 
filter(condition) is available.
P.S. OS: CentOS 7




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146534#comment-15146534
 ] 

Sean Owen commented on SPARK-13313:
---

Can you be more specific? like specific examples from the data and a pull 
request?

> Strongly connected components doesn't find all strongly connected components
> 
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Petar Zecevic
>
> Strongly connected components algorithm doesn't find all strongly connected 
> components. I was using Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 
> SCCs and one of them had 4051 vertices, which in reality don't have any edges 
> between them. 
> I think the problem could be on line 89 of StronglyConnectedComponents.scala 
> file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
> the second Pregel call should use Out edge direction, the same as the first 
> call because the direction is reversed in the provided sendMsg function 
> (message is sent to source vertex and not destination vertex).
> If that is changed (line 89), the algorithm starts finding much more SCCs, 
> but eventually stack overflow exception occurs. I believe graph objects that 
> are changed through iterations should not be cached, but checkpointed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146527#comment-15146527
 ] 

Apache Spark commented on SPARK-13314:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11200

> Malformed WholeStageCodegen tree string
> ---
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  : +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>:  +- Project [id#4L AS a#7L]
>: +- Range 0, 1, 8, 2, [id#4L]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-13314:
--

Assignee: Cheng Lian

> Malformed WholeStageCodegen tree string
> ---
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  : +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>:  +- Project [id#4L AS a#7L]
>: +- Range 0, 1, 8, 2, [id#4L]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13314:


Assignee: Apache Spark  (was: Cheng Lian)

> Malformed WholeStageCodegen tree string
> ---
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  : +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>:  +- Project [id#4L AS a#7L]
>: +- Range 0, 1, 8, 2, [id#4L]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13314:


Assignee: Cheng Lian  (was: Apache Spark)

> Malformed WholeStageCodegen tree string
> ---
>
> Key: SPARK-13314
> URL: https://issues.apache.org/jira/browse/SPARK-13314
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> {{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
> tree, but the output can be malformed when the plan contains binary operators:
> {code}
> val a = sqlContext range 5
> val b = sqlContext range 2
> a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
> {code}
> {noformat}
> ...
> == Physical Plan ==
> Union
> :- WholeStageCodegen
> :  :  +- Project [id#3L AS a#6L]
> :  : +- Range 0, 1, 8, 5, [id#3L]
> +- WholeStageCodegen
>:  +- Project [id#4L AS a#7L]
>: +- Range 0, 1, 8, 2, [id#4L]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13314) Malformed WholeStageCodegen tree string

2016-02-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-13314:
--

 Summary: Malformed WholeStageCodegen tree string
 Key: SPARK-13314
 URL: https://issues.apache.org/jira/browse/SPARK-13314
 Project: Spark
  Issue Type: Bug
Reporter: Cheng Lian
Priority: Minor


{{WholeStageCodegen}} overrides {{generateTreeString}} to show the inner plan 
tree, but the output can be malformed when the plan contains binary operators:
{code}
val a = sqlContext range 5
val b = sqlContext range 2
a select ('id as 'a) unionAll (b select ('id as 'a)) explain true
{code}
{noformat}
...
== Physical Plan ==
Union
:- WholeStageCodegen
:  :  +- Project [id#3L AS a#6L]
:  : +- Range 0, 1, 8, 5, [id#3L]
+- WholeStageCodegen
   :  +- Project [id#4L AS a#7L]
   : +- Range 0, 1, 8, 2, [id#4L]
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13118) Support for classes defined in package objects

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13118:
--
Target Version/s: 2.0.0  (was: 1.6.1, 2.0.0)

> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}.  However, when 
> reflect on this we try and load {{org.mycompany.project.MyClass}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13128) API for building arrays / lists encoders

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13128:
--
Target Version/s: 2.0.0  (was: 1.6.1, 2.0.0)

> API for building arrays / lists encoders
> 
>
> Key: SPARK-13128
> URL: https://issues.apache.org/jira/browse/SPARK-13128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>
> Example usage:
> {code}
> Encoder.array(Encoder.INT)
> Encoder.list(Encoder.INT)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12609) Make R to JVM timeout configurable

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12609:
--
Target Version/s:   (was: 1.6.1, 2.0.0)

> Make R to JVM timeout configurable 
> ---
>
> Key: SPARK-12609
> URL: https://issues.apache.org/jira/browse/SPARK-12609
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> The timeout from R to the JVM is hardcoded at 6000 seconds in 
> https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22
> This results in Spark jobs that take more than 100 minutes to always fail. We 
> should make this timeout configurable through SparkConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13062) Overwriting same file with new schema destroys original file.

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13062.
---
Resolution: Won't Fix

... though if someone has a reliable way to fail fast in most or all possible 
cases of this form, that would be a way forward

> Overwriting same file with new schema destroys original file.
> -
>
> Key: SPARK-13062
> URL: https://issues.apache.org/jira/browse/SPARK-13062
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Vincent Warmerdam
>
> I am using Hadoop with Spark 1.5.2. Using pyspark, let's create two 
> dataframes. 
> {code}
> ddf1 = sqlCtx.createDataFrame(pd.DataFrame({'time':[1,2,3], 
> 'thing':['a','b','b']}))
> ddf2 = sqlCtx.createDataFrame(pd.DataFrame({'time':[4,5,6,7], 
> 'thing':['a','b','a','b'], 
> 'name':['pi', 'ca', 'chu', '!']}))
> ddf1.printSchema()
> ddf2.printSchema()
> ddf1.write.parquet('/tmp/ddf1', mode = 'overwrite')
> ddf2.write.parquet('/tmp/ddf2', mode = 'overwrite')
> sqlCtx.read.load('/tmp/ddf1', schema=ddf2.schema).show()
> sqlCtx.read.load('/tmp/ddf2', schema=ddf1.schema).show()
> {code}
> Spark does a nice thing here, you can use different schemas consistently. 
> {code}
> root
>  |-- thing: string (nullable = true)
>  |-- time: long (nullable = true)
> root
>  |-- name: string (nullable = true)
>  |-- thing: string (nullable = true)
>  |-- time: long (nullable = true)
> ++-++
> |name|thing|time|
> ++-++
> |null|a|   1|
> |null|b|   3|
> |null|b|   2|
> ++-++
> +-++
> |thing|time|
> +-++
> |b|   7|
> |b|   5|
> |a|   4|
> |a|   6|
> +-++
> {code}
> But here comes something naughty. Imagine that I want to update `ddf1` with 
> the new schema and save this on the HDFS filesystem. 
> I'll first write it to a new filename. 
> {code}
> sqlCtx.read.load('/tmp/ddf1', schema=ddf1.schema)\
> .write.parquet('/tmp/ddf1_again', mode = 'overwrite')
> {code}
> Nothing seems to go wrong. 
> {code}
> > sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema).show()
> ++-++
> |name|thing|time|
> ++-++
> |null|a|   1|
> |null|b|   2|
> |null|b|   3|
> ++-++
> {code}
> But what happens when I rewrite the file with a new schema. Note that the 
> main difference is that I am attempting to rewrite the file. I am now using 
> the same file name, not a different one.
> {code}
> sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema)\
> .write.parquet('/tmp/ddf1_again', mode = 'overwrite')
> {code}
> I get this big error. 
> {code}
> Py4JJavaError: An error occurred while calling o97.parquet.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun

[jira] [Updated] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13278:
--
Assignee: Claes Redestad

> Launcher fails to start with JDK 9 EA
> -
>
> Key: SPARK-13278
> URL: https://issues.apache.org/jira/browse/SPARK-13278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Claes Redestad
>Assignee: Claes Redestad
>Priority: Minor
> Fix For: 2.0.0
>
>
> CommandBuilderUtils.addPermGenSizeOpt need to handle the JDK 9 version string 
> format, which can look like the expected 9, but also like 9-ea and 9+100
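
For illustration, a hedged sketch (not Spark's actual implementation) of extracting the major version from such strings:

{code}
// Tolerates "9", "9-ea", "9+100" as well as legacy "1.8.0_66"-style strings
def majorJavaVersion(version: String): Int = {
  val v = if (version.startsWith("1.")) version.substring(2) else version
  v.takeWhile(_.isDigit).toInt
}

Seq("9", "9-ea", "9+100", "1.8.0_66").map(majorJavaVersion)  // List(9, 9, 9, 8)
{code}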



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13278:
--
Priority: Minor  (was: Major)

> Launcher fails to start with JDK 9 EA
> -
>
> Key: SPARK-13278
> URL: https://issues.apache.org/jira/browse/SPARK-13278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Claes Redestad
>Priority: Minor
> Fix For: 2.0.0
>
>
> CommandBuilderUtils.addPermGenSizeOpt need to handle the JDK 9 version string 
> format, which can look like the expected 9, but also like 9-ea and 9+100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13278.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11160
[https://github.com/apache/spark/pull/11160]

> Launcher fails to start with JDK 9 EA
> -
>
> Key: SPARK-13278
> URL: https://issues.apache.org/jira/browse/SPARK-13278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Claes Redestad
> Fix For: 2.0.0
>
>
> CommandBuilderUtils.addPermGenSizeOpt need to handle the JDK 9 version string 
> format, which can look like the expected 9, but also like 9-ea and 9+100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13300) Spark examples page gives errors : Liquid error: pygments

2016-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13300.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 11180
[https://github.com/apache/spark/pull/11180]

> Spark examples page gives errors : Liquid error: pygments 
> --
>
> Key: SPARK-13300
> URL: https://issues.apache.org/jira/browse/SPARK-13300
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: stefan
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> On ubuntu 15.10 updated, firefox renders this page:
> http://spark.apache.org/examples.html
> with this error:
> Liquid error: pygments 
> Under every tab (Python, Scala, Java)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4039) KMeans support sparse cluster centers

2016-02-14 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146465#comment-15146465
 ] 

yuhao yang commented on SPARK-4039:
---

https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I have an implementation there that supports sparse k-means centers. The 
calculation pattern can be switched via an extra parameter, so users can choose 
which pattern to use. As expected, it can save a lot of memory, depending on the 
average sparsity of the cluster centers, but it also takes much more time.

For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory 
consumption by about 40x but took roughly 7x the time. Feel free to use it if you 
really need to support large-dimension k-means.

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>  Labels: clustering
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms centers vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?
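
As an illustration, a minimal sketch of the scenario described above, assuming short text documents hashed into a large feature space in the spark-shell (where {{sc}} is the SparkContext); the dense center vectors that KMeans keeps internally are where the memory blow-up occurs:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF

// Hypothetical documents tokenized into terms
val docs = sc.parallelize(Seq("a b c", "b c d", "d e f")).map(_.split(" ").toSeq)

// HashingTF produces sparse vectors in a very large feature space
val tf = new HashingTF(numFeatures = 1 << 20)
val vectors = tf.transform(docs).cache()

// KMeans converts its center vectors to dense vectors internally,
// which is where the OutOfMemory reported above can occur
val model = KMeans.train(vectors, k = 2, maxIterations = 5)
{code}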



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12861) Changes to support KMeans with large feature space

2016-02-14 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146464#comment-15146464
 ] 

yuhao yang edited comment on SPARK-12861 at 2/14/16 9:42 AM:
-

https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I have an implementation there that supports sparse k-means centers. The 
calculation pattern can be switched via an extra parameter, so users can choose 
which pattern to use. As expected, it can save a lot of memory, depending on the 
average sparsity of the cluster centers, but it also takes much more time.

For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory 
consumption by about 40x but took roughly 7x the time. Feel free to use it if you 
really need to support large-dimension k-means.


was (Author: yuhaoyan):
https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I got an implementation there that supports sparse k-means centers. The 
calculation pattern can be switched via an extra parameter and users can choose 
which pattern to use. As expected, it can save a lot of memory according to the 
average sparsity of the cluster centers, but will consume much more time also.

For feature dimension of 10M and nonzero rate is 1e-6, it can reduce memory 
consumption by 40 times yet used 700% time. Welcome to use if you really need 
to support large dimension k-means. 

> Changes to support KMeans with large feature space
> --
>
> Key: SPARK-12861
> URL: https://issues.apache.org/jira/browse/SPARK-12861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Roy Levin
>  Labels: patch
>
> The problem:
> -
> In Spark's KMeans code the center vectors are always represented as dense 
> vectors. As a result, when each such center has a large domain space the 
> algorithm quickly runs out of memory. In my example I have a feature space of 
> around 5 and k ~= 500. This sums up to around 200MB RAM for the center 
> vectors alone while in fact the center vectors are very sparse and require a 
> lot less RAM.
> Since I am running on a system with relatively low resources I keep 
> getting OutOfMemory errors. In my setting it is OK to trade off runtime for 
> using less RAM. This is what I set out to do in my solution while allowing 
> users the flexibility to choose.
> One solution could be to reduce the dimensions of the feature space but 
> this is not always the best approach. For example, when the object space is 
> comprised of users and the feature space of items. In such an example we may 
> want to run kmeans over a feature space which is a function of how many times 
> user i clicked item j. If we reduce the dimensions of the items we will not 
> be able to map the centers vectors back to the items. Moreover in a streaming 
> context detecting the changes WRT previous runs gets more difficult.
> My solution:
> 
> Allow the kmeans algorithm to accept a VectorFactory which decides when 
> vectors used inside the algorithm should be sparse and when they should be 
> dense. For backward compatibility the default behavior is to always make them 
> dense (like the situation is now). But now potentially the user can provide a 
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to 
> make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVectors allowing to change 
> the indices and values
> (2) Allow axpy to accept SparseVectors
> (3) create a trait called VectorFactory and two implementations for it 
> that are used within KMeans code
> To get the above described solution do the following:
> git clone https://github.com/levin-royl/spark.git -b 
> SupportLargeFeatureDomains
> Note
> --
> There are some similar issues opened in JIRA in the past, e.g.:
> https://issues.apache.org/jira/browse/SPARK-4039
> https://issues.apache.org/jira/browse/SPARK-1212
> https://github.com/mesos/spark/pull/736
> But the difference is that in the problem I describe reducing the dimensions 
> of the problem (i.e., the feature space) to allow using dense vectors is not 
> suitable. Also, the solution I implemented supports this while allowing full 
> flexibility to the user --- i.e., using the default dense vector 
> implementation or selecting an alternative (only when the default it is not 
> desired). 
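
For illustration only, a hedged sketch of what a {{VectorFactory}} abstraction like the one proposed above could look like; the names and signatures here are assumptions, not the actual patch:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Decides how vectors built inside the algorithm are represented
trait VectorFactory {
  def build(size: Int, indices: Array[Int], values: Array[Double]): Vector
}

// Default behavior: always densify (matches the current KMeans implementation)
object DenseVectorFactory extends VectorFactory {
  def build(size: Int, indices: Array[Int], values: Array[Double]): Vector =
    Vectors.sparse(size, indices, values).toDense
}

// Alternative: keep vectors sparse to trade runtime for memory
object SparseVectorFactory extends VectorFactory {
  def build(size: Int, indices: Array[Int], values: Array[Double]): Vector =
    Vectors.sparse(size, indices, values)
}
{code}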



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For addit

[jira] [Commented] (SPARK-12861) Changes to support KMeans with large feature space

2016-02-14 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146464#comment-15146464
 ] 

yuhao yang commented on SPARK-12861:


https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I have an implementation there that supports sparse k-means centers. The 
calculation pattern can be switched via an extra parameter, so users can choose 
which pattern to use. As expected, it can save a lot of memory, depending on the 
average sparsity of the cluster centers, but it also takes much more time.

For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory 
consumption by about 40x but took roughly 7x the time. Feel free to use it if you 
really need to support large-dimension k-means.

> Changes to support KMeans with large feature space
> --
>
> Key: SPARK-12861
> URL: https://issues.apache.org/jira/browse/SPARK-12861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Roy Levin
>  Labels: patch
>
> The problem:
> -
> In Spark's KMeans code the center vectors are always represented as dense 
> vectors. As a result, when each such center has a large domain space the 
> algorithm quickly runs out of memory. In my example I have a feature space of 
> around 5 and k ~= 500. This sums up to around 200MB RAM for the center 
> vectors alone while in fact the center vectors are very sparse and require a 
> lot less RAM.
> Since I am running on a system with relatively low resources I keep 
> getting OutOfMemory errors. In my setting it is OK to trade off runtime for 
> using less RAM. This is what I set out to do in my solution while allowing 
> users the flexibility to choose.
> One solution could be to reduce the dimensions of the feature space but 
> this is not always the best approach. For example, when the object space is 
> comprised of users and the feature space of items. In such an example we may 
> want to run kmeans over a feature space which is a function of how many times 
> user i clicked item j. If we reduce the dimensions of the items we will not 
> be able to map the centers vectors back to the items. Moreover in a streaming 
> context detecting the changes WRT previous runs gets more difficult.
> My solution:
> 
> Allow the kmeans algorithm to accept a VectorFactory which decides when 
> vectors used inside the algorithm should be sparse and when they should be 
> dense. For backward compatibility the default behavior is to always make them 
> dense (like the situation is now). But now potentially the user can provide a 
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to 
> make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVectors allowing to change 
> the indices and values
> (2) Allow axpy to accept SparseVectors
> (3) create a trait called VectorFactory and two implementations for it 
> that are used within KMeans code
> To get the above described solution do the following:
> git clone https://github.com/levin-royl/spark.git -b 
> SupportLargeFeatureDomains
> Note
> --
> There are some similar issues opened in JIRA in the past, e.g.:
> https://issues.apache.org/jira/browse/SPARK-4039
> https://issues.apache.org/jira/browse/SPARK-1212
> https://github.com/mesos/spark/pull/736
> But the difference is that in the problem I describe, reducing the 
> dimensionality of the problem (i.e., the feature space) to allow using dense 
> vectors is not suitable. Also, the solution I implemented supports this case 
> while giving full flexibility to the user --- i.e., using the default dense 
> vector implementation or selecting an alternative (only when the default is 
> not desired). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-02-14 Thread Petar Zecevic (JIRA)
Petar Zecevic created SPARK-13313:
-

 Summary: Strongly connected components doesn't find all strongly 
connected components
 Key: SPARK-13313
 URL: https://issues.apache.org/jira/browse/SPARK-13313
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.6.0
Reporter: Petar Zecevic


The strongly connected components algorithm doesn't find all strongly connected 
components. I was using the Wikispeedia dataset 
(http://snap.stanford.edu/data/wikispeedia.html): the algorithm found 519 
SCCs, one of which had 4051 vertices that in reality have no edges 
between them. 
I think the problem could be on line 89 of StronglyConnectedComponents.scala, 
where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
the second Pregel call should use the Out edge direction, the same as the first 
call, because the direction is reversed in the provided sendMsg function 
(the message is sent to the source vertex, not the destination vertex).
If that is changed (line 89), the algorithm starts finding many more SCCs, but 
eventually a stack overflow exception occurs. I believe graph objects that are 
changed across iterations should not be cached, but checkpointed.
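A sketch of how one might reproduce the component-size check in the spark-shell, 
assuming the Wikispeedia links have already been converted to a plain 
"srcId dstId" integer edge list (the file path below is illustrative):
{code}
import org.apache.spark.graphx.GraphLoader

// Load the (hypothetical) pre-converted edge list as a GraphX graph.
val graph = GraphLoader.edgeListFile(sc, "wikispeedia-edges.txt")

// Run the built-in SCC algorithm for a fixed number of iterations.
val scc = graph.stronglyConnectedComponents(20)

// Print the largest components; an implausibly large one (e.g. 4051 vertices
// with no internal edges) points to the issue described above.
scc.vertices
  .map { case (_, componentId) => (componentId, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)
  .foreach(println)
{code}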




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13309) Incorrect type inference for CSV data.

2016-02-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13309:

Description: 
Type inference for CSV data does not work as expected when the data is sparse. 
For instance, consider the following dataset and its inferred schema:

{code}
A,B,C,D
1,,,
,1,,
,,1,
,,,1
{code}


{code}
root
|-- A: integer (nullable = true)
|-- B: integer (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)
{code}

Here all the fields should have been inferred as Integer types, but clearly the 
inferred schema is different.

Another dataset:

{code}
A,B,C,D
1,,1,
{code}

and the inferred schema:

{code}
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)
{code}

Here, fields A & C should be inferred as Integer types. 

The same issue has been discussed in the spark-csv package; please take a look at 
https://github.com/databricks/spark-csv/issues/216 for reference. 

The issue was fixed there with 
https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d.
I will try to submit a PR with the patch soon.  
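For reference, a minimal way to reproduce the inference behavior with the 
spark-csv package on Spark 1.6 might look like the following (the input path is 
illustrative and should point at the four-row dataset shown above):
{code}
// Reproduction sketch: read the sparse CSV with header and schema inference on.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/sparse.csv")

df.printSchema()
// Expected: all four columns inferred as integer.
// Observed per this report: columns C and D fall back to string.
{code}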

  was:
Type inference for CSV data does not work as expected when the data is sparse. 
For instance: Consider the following datasets and the inferred schema:

A,B,C,D
1,,,
,1,,
,,1,
,,,1

root
|-- A: integer (nullable = true)
|-- B: integer (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)

Here all the fields should have been inferred as Integer types, but clearly the 
inferred schema is different.

Another dataset:

A,B,C,D
1,,1,

and the inferred schema:

root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)

Here, fields A & C should be inferred as Integer types. 

Same issue has been discussed on spark-csv package. Please take a look at 
https://github.com/databricks/spark-csv/issues/216 for reference. 

The issue was fixed with 
https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d.
 I will try to submit PR with the patch soon.  


> Incorrect type inference for CSV data.
> --
>
> Key: SPARK-13309
> URL: https://issues.apache.org/jira/browse/SPARK-13309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Rahul Tanwani
> Fix For: 1.6.0
>
>
> Type inference for CSV data does not work as expected when the data is 
> sparse. 
> For instance, consider the following dataset and its inferred schema:
> {code}
> A,B,C,D
> 1,,,
> ,1,,
> ,,1,
> ,,,1
> {code}
> {code}
> root
> |-- A: integer (nullable = true)
> |-- B: integer (nullable = true)
> |-- C: string (nullable = true)
> |-- D: string (nullable = true)
> {code}
> Here all the fields should have been inferred as Integer types, but clearly 
> the inferred schema is different.
> Another dataset:
> {code}
> A,B,C,D
> 1,,1,
> {code}
> and the inferred schema:
> {code}
> root
> |-- A: string (nullable = true)
> |-- B: string (nullable = true)
> |-- C: string (nullable = true)
> |-- D: string (nullable = true)
> {code}
> Here, fields A & C should be inferred as Integer types. 
> The same issue has been discussed in the spark-csv package; please take a look 
> at https://github.com/databricks/spark-csv/issues/216 for reference. 
> The issue was fixed there with 
> https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d.
> I will try to submit a PR with the patch soon.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org